Data warehouses and data lakes in the wrong hands will hoard data, hide data, and hold it back. Data governance that limits data use reduces the value of data to your organization. Data governance that enables innovation, encourages experimentation, and guides decision-making increases its value.
It comes down to this. Data in use has value; data at rest does not.
Incorrect Risks Will Bias Results
One approach to data governance states that the only data risk is poor data quality that leads to decisions made with incorrect data. This narrow view of data risk leads to policies that restrict access, reduce variation in data values, remove outliers, and regulate outputs so that they serve only carefully constrained use cases.
A restrictive approach to data governance unfortunately creates new risks to the business that outweigh the original. Over-cleansing hides reality, prevents fresh analysis, and delays recognition of fast-developing new information. Dimensional modeling reduces the set of possible questions that data can answer. You quickly end up in a situation where you can only hunt for lost keys under the street lamp where the light is good instead of in the darkness where you lost them.
When you incorrectly identify the data risks, the steps you take to mitigate will introduce bias to the data that drives your decisions. Don’t do that.
Business Risks Live Outside of Your Data
Instead, identify the risks within your business domain. The data that you create or acquire needs to be the right data for the decisions your business has to make. Rather than asking if the data is good, you should start by asking whether you have the right data at all, whether you have it soon enough, whether you have enough data to challenge your own assumptions, and whether you have enough independence in data sources to unmask hidden bias.
Here are examples of risks to data from the business perspective.
- Data collection intentionally omits information that you later realize that you need.
- Critically important data has a single source, leading to unintended bias and a single point of failure.
- Data analysts are limited to dimensional data or aggregated data, which prevents them from asking questions against raw or disaggregated data.
- Data is not available for new business needs until data from new sources has been cleansed, filtered, joined to master data, and aggregated to answer new questions.
The greatest risks to your business are restrictions on the use of data. If your data governance policies do not support innovation, they are adding risk to the business instead of reducing risk.
Choosing the right risks is half of the battle. If you recognize that innovation adds more value than removing bad data does, you’re positioning yourself for success.
So let’s repeat our mantra. Data in use has value; data at rest does not.
Do not measure your success by how much data you’ve accumulated in data warehouses, data marts, and data lakes. Success is the reuse of that data in business processes, in decision-making, in research & development, in trend analysis, in rapid response to changing business needs, and in spawning new questions. Measure who is using the data and what they are using it for, and track the outcomes that the data supported. Measure the value derived from the use of the data.
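As a rough illustration of measuring use rather than volume, here is a minimal Python sketch. It assumes you can export an access log as (user, dataset, purpose) entries and that you keep a list of registered datasets. The dataset names, users, and the `usage_summary` function are all hypothetical, invented for this example.

```python
from collections import Counter, defaultdict

def usage_summary(access_log, registered_datasets):
    """Summarize who uses each dataset and why, and list datasets nobody uses.

    access_log: iterable of (user, dataset, purpose) tuples (assumed format).
    registered_datasets: set of every dataset name you publish (assumed).
    """
    readers = defaultdict(set)       # dataset -> distinct users
    purposes = defaultdict(Counter)  # dataset -> count of accesses per purpose
    for user, dataset, purpose in access_log:
        readers[dataset].add(user)
        purposes[dataset][purpose] += 1

    summary = {
        dataset: {
            "distinct_users": len(readers[dataset]),
            "purposes": dict(purposes[dataset]),
        }
        for dataset in readers
    }
    # Data at rest: registered datasets with no recorded access at all.
    unused = sorted(set(registered_datasets) - set(readers))
    return summary, unused

# Hypothetical sample data for illustration only.
log = [
    ("ana", "orders_raw", "trend analysis"),
    ("ben", "orders_raw", "trend analysis"),
    ("ana", "orders_raw", "fraud review"),
    ("ana", "customers", "support"),
]
registered = {"orders_raw", "customers", "sensor_archive"}

summary, unused = usage_summary(log, registered)
```

The `unused` list is the part worth watching: it names the data that is sitting at rest. Tracking outcomes still requires a human to record which decisions the data supported; a log alone cannot tell you that.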
Modern data governance still needs to be aware of regulations, to track compliance, and to identify authorized use. With the right tools, these are information security processes that we can safely implement while expanding use of data for innovation.
Your mileage may vary, but here is a checklist that works in my experience:
- Own your metadata. That is, know your data sources and manage the data schemas as they change over time.
- Gather data from more than one source. The more critical the data is to your business, the more important it is to base your decisions on more than one independent source of information.
- Use a data federation layer. That is, provide a single presentation layer for data access, either through APIs or through versatile query tools built for data lakes, such as Dremio, so that every internal data source is registered and published internally with its current schema.
- Establish roles & rules for data access. Implement the rules via your data federation layer, and enforce them. Your information security team can block access to data that does not go through the data federation layer. When implemented well, this gives greater visibility, makes it easier for engineers and analysts to find data, and provides logging that makes data use audit-friendly.
- Capture all of the data even if it is wrong, and make cleansing a measurable and repeatable step that happens later.
- When cleansing, don’t cleanse away the bad data. Instead, mark it as bad and keep both bad and good for later analysis. Include the reason why it was marked as bad. Publish the uncleansed-but-annotated data to the authorized internal users who can be trusted with it.
- Filter data to exclude bad data. Make this a measurable and repeatable step so you can see the trend of how much data has been excluded and whether that is changing over time. When you change your mind on what is good or bad, you want to be able to recreate good data using improved rules. Note that this means that cleansed and filtered data is published separately from the uncleansed data. Name them wisely and clearly.
- Publish filtered data that has been joined to master data, along with aggregated data, to a wider audience. You can still support narrowly defined data warehouses where the business needs require that.
- Expose raw data to trusted analysts alongside the classification, categorization, dimensional joins, and scoring results that you currently publish as data enhancements. Analysts need to continuously review whether those enhancements are still working as intended, and decide when it is time to discard them, expand them, or create new ones. Enhancing data introduces bias, and you need to measure and evaluate whether it serves you well.
- Measure and publish metrics on data quality and on data use. If your client relationships are mature enough to handle it, make your internal quality & use metrics visible to your clients. It builds trust.
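The cleansing and filtering items in the checklist can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields (`customer_id`, `amount`), the two rules, and the function names are hypothetical, and real rules would come from your own business domain. The point is that cleansing annotates records with reasons instead of deleting them, and filtering is a separate, repeatable step that reports how much it excluded.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

@dataclass
class Annotated:
    """A raw record plus the reasons (if any) it was marked as bad."""
    record: dict
    reasons: List[str] = field(default_factory=list)

    @property
    def is_bad(self) -> bool:
        return bool(self.reasons)

# A rule returns a reason string when the record fails, otherwise None.
Rule = Callable[[dict], Optional[str]]

def missing_customer(rec: dict) -> Optional[str]:
    # Hypothetical rule: a record needs a non-empty customer_id.
    return None if rec.get("customer_id") else "missing customer_id"

def negative_amount(rec: dict) -> Optional[str]:
    # Hypothetical rule: amounts below zero are suspicious.
    amount = rec.get("amount")
    if isinstance(amount, (int, float)) and amount < 0:
        return "negative amount"
    return None

def annotate(records: List[dict], rules: List[Rule]) -> List[Annotated]:
    """Cleansing step: mark bad records with reasons, never drop anything."""
    annotated = []
    for rec in records:
        reasons = []
        for rule in rules:
            reason = rule(rec)
            if reason is not None:
                reasons.append(reason)
        annotated.append(Annotated(rec, reasons))
    return annotated

def filter_good(annotated: List[Annotated]):
    """Filtering step: return the good records and the share excluded."""
    good = [a.record for a in annotated if not a.is_bad]
    total = len(annotated)
    excluded_share = (total - len(good)) / total if total else 0.0
    return good, excluded_share

# Hypothetical sample data for illustration only.
raw = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "", "amount": 5.0},
    {"customer_id": "c3", "amount": -2.0},
    {"customer_id": None, "amount": -1.0},
]
rules = [missing_customer, negative_amount]

annotated = annotate(raw, rules)            # publish to trusted internal users
good, excluded_share = filter_good(annotated)  # publish to the wider audience
```

Because the raw records survive annotation, changing your mind about what counts as bad only means editing `rules` and rerunning both steps. Recording `excluded_share` on every run gives you the trend in how much data is being excluded over time.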