Information loss destroys value, increases waste, and reduces the effectiveness of collaboration. Stop throwing away context and you’ll uncover insights in what you already possess.
This is one of a series of topics on DataOps. Read my earlier posts to get a better understanding of how to apply lean manufacturing to organizations where data is the product that you create and sell to your clients. That’s my working definition of DataOps.
Sections to follow discuss how we lose or destroy context and information about our data as we misplace, mishandle, and outright mangle our raw materials, work in process, and finished goods in data pipelines and data deliverables.
CSV — My Best Friend is My Worst Enemy
When we transfer data from one system to another, we frequently use a delimited text format. The shorthand for this is CSV, which originally meant comma-separated values but now just means text with delimiters.
Nothing destroys data more effectively than CSV does.
It removes context, significance, and meaning. It removes data types. It frequently destroys the character set itself by forcing text into ASCII, which mangles the original language with no path to recover it. It adds complexity without adding value. It forces rework and provides no way to validate the accuracy of your data transfer.
Why do we still use it? Quite simply, because we have no common language to communicate and to understand context and typing and value and accuracy. So we completely remove all of that, to our mutual harm.
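Both kinds of loss are easy to demonstrate. Here is a minimal Python sketch (the column names and values are invented for illustration): write typed records to CSV, read them back, and every value is a string; force the text to ASCII and the original characters are gone for good.

```python
import csv
import io

# A typed record: an identifier with leading zeros, an integer, and non-ASCII text.
rows = [{"id": "007", "count": 42, "city": "Zürich"}]

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=["id", "count", "city"])
writer.writeheader()
writer.writerows(rows)

# Round-trip through CSV: every value comes back as a plain string.
restored = list(csv.DictReader(io.StringIO(buf.getvalue())))
print(type(restored[0]["count"]))   # <class 'str'> -- the integer type is gone

# Forcing the text to ASCII destroys it with no path back.
ascii_city = "Zürich".encode("ascii", errors="replace").decode("ascii")
print(ascii_city)                   # Z?rich
```

The format itself carries no record of what was lost, which is exactly the problem.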
If you must use CSV, make two immediate improvements:
- Use Unicode (UTF-8 or better) instead of ASCII to preserve character sets.
- Require a schema definition that provides data types and human-readable, business-friendly descriptions of data elements.
And if you’re willing to look beyond CSV, start working with JSON, Avro, and Protocol Buffers. Schema-aware data formats are widely used by engineers, but are still rarely used for data transfer between systems. That’s a shame.
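Even plain JSON can take a step in that direction by carrying an explicit schema alongside the data. A minimal sketch — the schema layout, field names, and descriptions here are my own invention, not any standard:

```python
import json

# A hand-rolled schema: a type plus a business-friendly description per field.
schema = {
    "customer_id": {
        "type": "string",
        "description": "Identifier from the billing system; leading zeros are significant.",
    },
    "order_count": {
        "type": "integer",
        "description": "Completed orders in the reporting month.",
    },
}
records = [{"customer_id": "00042", "order_count": 7}]

# Ship schema and data together so the receiver keeps types and context.
payload = json.dumps({"schema": schema, "records": records})

received = json.loads(payload)
print(received["schema"]["customer_id"]["description"])
print(type(received["records"][0]["order_count"]))  # int survives the transfer
```

Real schema-aware formats like Avro and Protocol Buffers go further: they enforce the schema at read and write time instead of relying on convention.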
Microsoft Excel — The Addictive but Deadly Drug of Data Users
Microsoft Excel is everyone’s favorite tool. When you force all data into spreadsheet cells, you entirely give up what little control you had over data representation and data integrity.
Every Excel user falls into the same traps. It looks like a number, so remove those leading zeroes — destroying identifier columns the world over. Let me sort this data to see what’s going on — wait, why did Excel sort some columns but not the others? I loaded the data from the CSV, where the commas looked good to me, so why do some rows have a different number of columns than others?
Although it is possible to use data carefully and correctly in Excel, no one does. I’m a true super-user of data in Excel, yet I make all of the obvious mistakes and many non-obvious ones, ALL THE TIME. Excel simply looks good, and feels good, and … oh … we’re all addicts, aren’t we?
To use data effectively in Excel:
- Use tab delimiters in your CSVs, not commas or pipe characters.
- Look at every single column in the import-from-CSV dialog and change the column type to Text when you see leading zeroes.
- Always put data tables alone in a tab. One data table per tab.
- Use row 1 for column headers and never leave a column un-named.
- Never allow blank rows anywhere in a data tab.
- Use separate tabs for pivots or graphs or annotations or instructions.
- Use statistics packages (R, Python, SAS, MATLAB) to compute statistics. I guarantee that you will get statistics wrong if you compute them in Excel. Yes, it says “AVERAGE” when it should say “MEAN”, but it still won’t give you what you expect when your data, like all data, has missing values.
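The missing-values trap in that last point is easy to reproduce with nothing but Python’s standard library (the numbers are invented):

```python
import math
from statistics import mean

values = [10.0, 20.0, math.nan, 30.0]  # one missing measurement recorded as NaN

# The naive mean silently propagates the NaN instead of warning you.
print(mean(values))                     # nan

# You must decide, explicitly, what "missing" means before averaging.
present = [v for v in values if not math.isnan(v)]
print(mean(present))                    # 20.0
```

Excel’s AVERAGE quietly skips blank cells, so Excel and your statistics package can disagree without warning; a statistics package at least forces the decision into the open.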
Relational Database — Sixty Years and They Still Destroy Context
Relational algebra applied to data, some masterful guidance from E. F. Codd and others, and we found a tool that worked. But it’s long past time to call out the shortcomings of relational database theory and practice.
Most relational databases are built without a data model. E. F. Codd would roll over in his grave. If you don’t know what your entities and relationships are, your database will fail to scale, and it will add and multiply data errors without restraint.
But let’s talk about well-constructed databases that we build from a carefully studied business context. After we fully understand the business domain, after we determine what to represent in the database and what to omit, after we calibrate relationships and data types and scalability concerns, we create database tables and start to use them.
And right there is the problem. The transition from design to use.
The database tables have data types, but they do not explain why. When later users add tables with new relationships, nothing in the database schema tells you that one identifier column is generated within the database itself while another comes from an external system that you cannot see and whose representation you are not free to change. You cannot see the context that led to the data types.
The same is true of relationships, representation, behaviors, transactions, ranges of values, expected distributions of values, and so much more.
I cannot tell you to stop using relational databases. Well, I could, but you would ignore me. So here is what I will tell you, instead:
- Create a data dictionary for your domain. Define your data elements.
- Never add a column to a database before you add it to the data dictionary.
- Create a data catalog with entries for every database table, for every API endpoint, for every schema of every delimited text file, for every Kafka topic.
- In your data catalog entries, explain why this data exists, where it comes from, who to talk to if you wish to learn more, what transactions or behaviors it explicitly includes and which transactions or behaviors it explicitly excludes. Preserve the context.
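One way to make the “dictionary first” rule enforceable rather than aspirational is to check new columns against the dictionary in code. A minimal sketch — the dictionary entries, field layout, and guard function are all illustrative inventions:

```python
# A minimal data dictionary: every element carries a type and a "why".
DATA_DICTIONARY = {
    "customer_id": {
        "type": "string",
        "source": "billing system (external; representation is not ours to change)",
        "description": "Customer identifier; leading zeros are significant.",
        "owner": "billing-team@example.com",
    },
}

def add_column(table_columns: list, name: str) -> None:
    """Refuse to add a column that has no data dictionary entry."""
    if name not in DATA_DICTIONARY:
        raise KeyError(f"define '{name}' in the data dictionary before adding it")
    table_columns.append(name)

columns = []
add_column(columns, "customer_id")        # fine: the element is defined
try:
    add_column(columns, "cust_id_v2")     # rejected: no entry, no context
except KeyError as err:
    print(err)
```

The same guard works for catalog entries: no new table, topic, or endpoint without an entry that preserves the context.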
Machine Learning — Refusing to Learn the Domain
Others will write of good AI/ML versus bad AI/ML — on bias in training sets and on confidence factors and on degradation over time. I will focus on the loss of information that AI/ML creates. AI/ML is very creative in how it destroys data.
I’ve lost track of the number of times that AI/ML models have been used to classify data where the organization already knows the answer. Choose a better question — one that has not been answered, or one where scalability must increase because we cannot continue to classify the old way. Don’t replace trusted data with untrusted data.
Use what you already know. Talk to your domain experts. Include the obvious domain knowledge in the features. Don’t let AI/ML override what you already know. Use ML to create new features. Again, don’t replace trusted data with untrusted data.
When you migrate models into production use, provide continuous measurements of effectiveness. If you are always classifying new data, you cannot measure false positives, can you? Work with your data operations team. Define success and failure with statistical distributions that can be automated by others. Automate the sampling of unbiased training data so that retraining is reproducible. Solve for stickiness, so that the analytical trends your clients depend upon stay stable across retraining.
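“Define success and failure with statistical distributions” can be as simple as comparing the model’s output distribution in production against the distribution it produced at validation time. A sketch using the population stability index, a common drift measure; the class shares are invented, and the 0.2 cutoff is a commonly cited rule of thumb, not a universal constant:

```python
import math

def psi(expected: list, observed: list) -> float:
    """Population stability index between two class-share distributions."""
    return sum((o - e) * math.log(o / e) for e, o in zip(expected, observed))

# Class shares of the model's predictions at validation time vs. this week.
validation_shares = [0.70, 0.20, 0.10]
production_shares = [0.55, 0.25, 0.20]

drift = psi(validation_shares, production_shares)
# Rule of thumb: PSI above 0.2 means the model needs attention.
print(f"PSI = {drift:.3f}", "-> retrain?" if drift > 0.2 else "-> stable")
```

Because the check is a single number with a threshold, the operations team can automate it and alert on it without understanding the model internals.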
To avoid destroying data with AI/ML, it comes down to learning the domain:
- Talk to your domain experts. Don’t replace trusted data.
- Talk to those who automate the data operations. Measure results.
- Talk to those who understand what clients need. Trends matter.
Guarding Against Data Destruction
It’s a funny old world. Our most heavily used tools are the ones that produce the greatest loss of information: CSVs, Excel, databases, and AI/ML models.
It certainly helps to learn best practices in each of the above tools. But the preservation of domain context is a harder problem than it seems. Data quality test automation helps. Data dictionaries and data catalogs help. Talking to domain experts is a big help.
However, the larger problem is reproducibility. And that is a topic for a future blog post.