
Fighting an Epidemic of Poor Data Quality

Kevin Kautz
4 min read · Aug 3, 2019

Like many organizations, the ones that have employed me struggle with poor data quality. Data sources are obscured. Meaningful information is lost in transit through data pipelines. Technology choices that introduce eventual consistency lead to continuous inconsistency. And the necessary efforts to track meaningful change over time produce ambiguity in data that was supposed to provide a single source of truth.

As I study the problem, I conclude that the underlying cause of poor data quality is not to be found in the sources, pipelines, technology choices, or the relentless pace of change. It’s in how an organization addresses them. More specifically, the fundamental root cause of poor data quality is a failure to build and to sustain trust relationships.

Let me give you an example from another domain.

In the Democratic Republic of the Congo, the World Health Organization (WHO) declared an international health emergency. Attempts to combat the latest Ebola outbreak have failed due to an “inability to build community trust”. One in four residents of affected areas believes that the virus was fabricated. The WHO’s important conclusion is that “Medical expertise is not sufficient to end epidemics”. Progress in Congo is now measured by how well the WHO engages local residents, by how fears are identified, respected and assuaged, and by how underlying instability and poverty are addressed. Tracking the spread of the disease is necessary but not sufficient to halt and reverse it.
[Christian Science Monitor, 5 August 2019]

Fighting poor data quality across an organization starts with measuring data quality. Gaining consensus on how to measure it is as difficult as it is necessary, but measurement alone is not enough: data quality will not improve until the root causes of poor data quality are identified and addressed.
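To make “measuring data quality” concrete, here is a minimal sketch in Python of two of the simplest measures, completeness and validity, computed over a batch of records. The field names and the validation rule are illustrative assumptions, not a prescription:

```python
# Hypothetical records; in practice these would come from your serving layer.
records = [
    {"customer_id": "C001", "email": "a@example.com", "updated_at": "2019-08-01T12:00:00"},
    {"customer_id": "C002", "email": None,            "updated_at": "2019-07-15T09:30:00"},
    {"customer_id": None,   "email": "c@example",     "updated_at": None},
]

def completeness(records, field):
    """Fraction of records with a non-null value for `field`."""
    return sum(1 for r in records if r.get(field) is not None) / len(records)

def validity(records, field, predicate):
    """Fraction of non-null values for `field` that satisfy a validation rule."""
    values = [r[field] for r in records if r.get(field) is not None]
    return sum(1 for v in values if predicate(v)) / len(values) if values else 0.0

for f in ("customer_id", "email", "updated_at"):
    print(f"{f}: completeness = {completeness(records, f):.2f}")

# A crude validity rule; real rules should be agreed with data consumers.
is_plausible_email = lambda v: "@" in v and "." in v.split("@")[-1]
print(f"email: validity = {validity(records, 'email', is_plausible_email):.2f}")
```

Numbers like these only become useful once they are tracked over time, which is what makes root-cause analysis possible.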

Let’s return to the four problem areas that I mentioned above.

Data sources can provide bad data. They can provide data that is good for a while and then starts to go bad. Or you may never have fully understood how that data was collected and how trustworthy it is. Data itself is nothing more than observations of behaviors that are no longer visible, or the patterns and predictions that we build from those observations. But if data is a collection of observations, how trustworthy was the one who made them? Did they record only what they thought mattered? What did they fail to mention because it did not seem important at the time? Was there a gap in time or distance between the behavior itself and the one who captured the observation? Is there a way to audit the veracity of the original data source?

In short, do you have a strong enough relationship with those who provide you the data to ask these questions? Can you trust the people who provide it?
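Those questions become answerable only if provenance is captured alongside the data. Here is a minimal sketch of one way to do that; the Observation structure and its fields are hypothetical, not a standard:

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Any

@dataclass
class Observation:
    """An observed value plus enough provenance to audit it later."""
    value: Any
    source: str             # who or what system supplied the observation
    collected_at: datetime  # when the behavior itself was observed
    recorded_at: datetime   # when we captured it; the gap between the two matters
    method: str             # how it was collected: "manual_entry", "sensor", ...

obs = Observation(
    value={"order_total": 42.50},
    source="pos-terminal-17",
    collected_at=datetime(2019, 8, 1, 14, 3, tzinfo=timezone.utc),
    recorded_at=datetime(2019, 8, 1, 14, 9, tzinfo=timezone.utc),
    method="sensor",
)

# The time-or-distance gap asked about above is now measurable and auditable.
gap = obs.recorded_at - obs.collected_at
print(f"observation lag: {gap.total_seconds():.0f}s via {obs.method} from {obs.source}")
```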

Data pipelines introduce changes to data in order to correct errors, to standardize data representations, to filter out noise that carries no meaning, and to enhance data with what we know from other sources or what we can infer from patterns and predictions. Every one of those actions carries a risk of making things worse. If logic that originally removed 1 in 10,000 records as statistical outliers has lately started to remove 1,000 of every 10,000, will you be alerted? Will humans get involved quickly enough to save the automation from destroying the value of the data? Do you have measurements integrated into the pipeline that validate that each step’s impact stays within the bounds of its original plan?

Can you trust that your data pipeline is not harming the data? Do you have continuous statistical evidence to watch trends over time? Do you trust that your pipeline will automatically alert you when you need to intervene?
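As a sketch of what such an integrated measurement might look like, here is a filter step that tracks its own removal rate and raises an alert when that rate drifts far beyond its historical baseline. The baseline, the alert factor, and the z-score method are all assumptions for illustration:

```python
import statistics

EXPECTED_REMOVAL_RATE = 1 / 10_000  # assumed historical baseline for this step
ALERT_FACTOR = 10                   # assumed: page a human at 10x the baseline

def remove_outliers(values, z_threshold=3.0):
    """Drop values more than z_threshold standard deviations from the mean,
    and measure the step's own impact so drift cannot pass silently."""
    mean, stdev = statistics.fmean(values), statistics.stdev(values)
    kept = [v for v in values if stdev == 0 or abs(v - mean) / stdev <= z_threshold]
    removal_rate = (len(values) - len(kept)) / len(values)
    if removal_rate > EXPECTED_REMOVAL_RATE * ALERT_FACTOR:
        # In a real pipeline this would page a human, not just print.
        print(f"ALERT: removal rate {removal_rate:.4%} exceeds "
              f"{ALERT_FACTOR}x the expected {EXPECTED_REMOVAL_RATE:.4%}")
    return kept

healthy = [10.0] * 9_998 + [10.1, 500.0]  # one genuine outlier in ~10,000
remove_outliers(healthy)                  # removes 1 record, no alert

drifted = [10.0] * 9_900 + [11.0] * 100   # a shifted subpopulation, not noise
remove_outliers(drifted)                  # removes 100 records, alert fires
```

The second run is the interesting one: the data changed shape, the old logic became destructive, and only the step’s self-measurement makes that visible.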

Eventual consistency will be familiar to you if you use a data serving layer that is not ACID (atomic, consistent, isolated, durable). But snapshots introduce a similar problem even in ACID systems: data pipelines frequently capture snapshots of data to enable faster restarts. Data lakes use snapshots to support reversion to prior known-good states. And almost every data serving layer supports data versioning or historical copies to support analytics of change over time.

Here’s what you may not realize. Any time there is a copy of data, a historical snapshot, or a saved copy at some stage of a pipeline, you introduce inconsistency across time. You’re frequently doing this intentionally, so that published data does not change until you’re sure the new set is correct. But that necessarily means the data does not agree from one point in time to another. Data across time is not consistent. Sometimes that’s valuable, but are you aware of the time it takes to revert after a failed pipeline transformation? Do you measure the loss in data freshness from waiting to publish good data? Do your data consumers trust that your data is fresh enough to use? Do they value freshness more than accuracy?
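Freshness, at least, is cheap to measure if each published snapshot carries a watermark for the latest source event it contains. A minimal sketch, with an illustrative service-level agreement:

```python
from datetime import datetime, timedelta, timezone

# Assumed: each published snapshot records the latest source event it contains.
snapshots = [
    {"published_at": datetime(2019, 8, 3, 6, 0, tzinfo=timezone.utc),
     "source_watermark": datetime(2019, 8, 3, 4, 30, tzinfo=timezone.utc)},
    {"published_at": datetime(2019, 8, 3, 12, 0, tzinfo=timezone.utc),
     "source_watermark": datetime(2019, 8, 3, 11, 45, tzinfo=timezone.utc)},
]

FRESHNESS_SLA = timedelta(hours=1)  # illustrative; agree the real value with consumers

for snap in snapshots:
    # Staleness at publish time: how far the published data lagged its source.
    lag = snap["published_at"] - snap["source_watermark"]
    status = "OK" if lag <= FRESHNESS_SLA else "STALE"
    print(f"published {snap['published_at']:%H:%M}, lag {lag}, {status}")
```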

Inconsistent answers and stale data are the most common reasons data consumers give when they say they do not trust the data they receive. Do you ask your consumers whether they trust your data? Do you have a way to detect when different data consumption paths produce inconsistent results? Do you communicate with data consumers in a way that leads them to trust your data? If every one of their data inquiries leads you to say “well, it depends”, are you confusing them to the point where they cannot trust you?
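One way to know, rather than hope, is to routinely reconcile the same question asked through different consumption paths. A minimal sketch, with stand-in query functions and an assumed tolerance in place of your real serving layers:

```python
# Stand-ins for two real consumption paths, e.g. a warehouse query and a
# pre-aggregated dashboard extract. The figures are illustrative.
def revenue_from_warehouse(day: str) -> float:
    return 104_230.50

def revenue_from_dashboard_extract(day: str) -> float:
    return 103_980.25

TOLERANCE = 0.001  # assumed relative difference we are willing to tolerate

def reconcile(day: str) -> None:
    a = revenue_from_warehouse(day)
    b = revenue_from_dashboard_extract(day)
    diff = abs(a - b) / max(abs(a), abs(b))
    if diff > TOLERANCE:
        # Surface the disagreement before a data consumer finds it for you.
        print(f"{day}: paths disagree by {diff:.2%} ({a} vs {b})")
    else:
        print(f"{day}: consistent within tolerance")

reconcile("2019-08-02")  # here: the paths disagree by ~0.24%, worth explaining
```

Finding the disagreement yourself, and explaining it before consumers stumble over it, is exactly the trust-building behavior this article argues for.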

If you want to improve data quality, it is not enough to measure data quality. You also need to improve the ways in which you audit and validate data sources, the impacts of data pipeline transformations, the freshness of your data, and the clarity with which your data answers the questions that your data consumers have.

When you improve the trust relationships, you improve the data quality.


Written by Kevin Kautz

Professional focus on data engineering, data architecture and data governance wherever data is valued.
