In the deep dark past (that is, a few decades ago), manufacturers transitioned to lean enterprises. Shortly afterward, software organizations championed various flavors of lean software development. Today, digital companies recognize their deep debt to lean principles. It is time to learn how those principles apply to data operations.
I write about DataOps as a culture and a collection of practices for organizations that use data as the raw materials and produce data as the deliverable that clients pay for. Data analytics, for example, can be understood as a manufacturing process. Lean principles apply directly to data operations.
DataOps is not Engineering and Design
Lean principles and Agile methodologies are widely used in software design and development. DevOps extends this to automation in testing, builds, and deployment.
Digital companies change the financial calculus to avoid using capital to fund data centers. When software design, development and deployment make up a major portion of an organization’s headcount, it makes sense to use as little capital as possible in these areas. It makes good financial sense to allocate engineering & design costs to the products that you sell to clients.
These are good business practices. They are not DataOps.
DataOps is the application of lean principles to data operations, not to engineering and design.
DataOps is not Infrastructure as a Service
Infrastructure as a Service (IaaS) supports DevOps in the design & engineering organization, but goes beyond this into the data storage, computing resources, and the hosted application services that facilitate a microservices architecture.
Broadly speaking, all of the cloud platform costs can be treated as the cost of goods sold. This is also a great business practice for digital companies. But IaaS and cloud platforms are not DataOps.
DataOps is the application of lean principles to data operations, not to infrastructure and software services.
DataOps is not a Data Lake
When an organization invests in data science teams, and especially when innovative data analytics and statistical modeling are the incubators of new product offerings, it is a common fallacy for these R&D perspectives to undervalue several things: the data sources, the data pipelines, the additional effort to productionize models, and the care & feeding of the continuous operations that create value for clients through products built from enhanced data and the tools to access it.
When organizations with strong data science teams fall into the trap of underfunding data operations, the result is a number of common points of failure. The most quoted of these is captured in the adage that data scientists spend up to 80% of their time getting access to data or preparing it before modeling begins.
Be honest. You’re aware of that one, aren’t you? You’ve said it yourself a few times. You probably don’t still believe it, simply because it has become common for data science teams to also employ data engineers. Data engineers do those things now. They spend their time gathering or creating trustworthy metadata about data sources, collecting data into storage layers that are readily accessible, and trying to make data ingestion and standardization reproducible and reliable.
Let me be clear. Data ingestion and standardization, and the creation of data lakes that data science teams can use, are not DataOps. They’re getting closer, perhaps, and they’re starting to overlap with DataOps. But having a data lake does not mean that you’ve understood or implemented DataOps practices.
Data lakes are not DataOps. DataOps is the application of lean principles to data operations.
DataOps uses Lean Principles in Data Operations
I wish that I could stop my explanation with that sentence. It would be nice if everyone who works in a data analytics organization understood the difference between design & engineering, infrastructure & cloud hosting, and data operations. That has not been my experience. Data operations is not well understood within the context of digital companies and data analytics.
The scope of data operations includes sourcing, ingestion, standardization, enhancement, model execution (in production, not R&D), delivery, quality measurement & reporting, process controls, throughput, performance, monitoring, and security. All of those are best understood as manufacturing processes.
To understand DataOps, you must recognize that you have a manufacturing production line for data. To be effective at DataOps, you need to use lean manufacturing principles in your data operations.
Applying Manufacturing Processes to Data
The following is not an exhaustive list. I will give you a manufacturing perspective on some of the data operations activities. In each of the examples below, I try to phrase them so that you can see how lean manufacturing principles apply directly to data operations.
This article is not a primer on lean manufacturing principles. You will see evidence below of topics such as quality metrics, continuous improvement, using measurements to drive changes, and the autonomous ownership of the manufacturing production line by the personnel who run it.
- In sourcing, operations personnel will identify and on-board new data sources. They will define quality metrics for incoming data and work with suppliers to meet the minimum quality specifications in order to accept their data as raw materials for the production line.
- In ingestion, operations personnel will define metadata for each new data source, and capture unusual behaviors that make one supplier different from another. They will track and report on continuous quality metrics that produce alerts and recommended actions when data quality exceptions occur during ingestion. And they will take the actions necessary to remediate data quality exceptions.
- In standardization, operations personnel will define quality measures, trend and report on the difference between ingested data and standardized data, and alert downstream application users when data quality objectives cannot be met.
- In enhancement, operations personnel will define and monitor the metrics that prove the fitness for use of enhanced data.
- In model execution, operations personnel will capture data profiling characteristics before and after model execution to evaluate and respond to success or failure in model performance. They will provide human feedback as well as automated datasets that data science teams can use to retrain models when model performance falls below acceptable thresholds.
- In delivery, operations personnel will validate that data reaches client applications (or client systems) within the contractual guidelines for freshness, completeness, and accuracy. They will also validate and ensure that data is delivered only to clients who are authorized to see it.
- In production processes, operations personnel will design and monitor the effectiveness of the directed acyclic graphs (DAGs). They will fine-tune performance, intervene to correct errors, stop the production line when necessary, and continuously improve the DAGs that run the production line.
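As a rough illustration of the last item, the production line can be pictured as a small DAG whose stages run in dependency order and halt when a quality gate fails. The following is only a sketch using Python's standard-library graphlib module. The stage names come from the scope list earlier in this article; `StopTheLine`, `PRODUCTION_LINE`, `run_line`, and the gate callables are hypothetical names invented for the example and do not belong to any particular orchestration tool.

```python
from graphlib import TopologicalSorter


class StopTheLine(Exception):
    """Raised when a stage fails its quality gate (hypothetical name)."""


# Each stage maps to the set of stages it depends on (illustrative only).
PRODUCTION_LINE = {
    "ingestion": {"sourcing"},
    "standardization": {"ingestion"},
    "enhancement": {"standardization"},
    "model_execution": {"enhancement"},
    "delivery": {"model_execution"},
}


def run_line(dag, gates):
    """Walk the stages in dependency order and halt at the first failed gate.

    `gates` maps a stage name to a zero-argument callable that returns True
    when the stage's output meets its quality specification. Stages without
    a gate are assumed to pass.
    """
    completed = []
    for stage in TopologicalSorter(dag).static_order():
        gate = gates.get(stage, lambda: True)
        if not gate():
            raise StopTheLine(
                f"quality gate failed at {stage!r}; completed so far: {completed}"
            )
        completed.append(stage)
    return completed
```

Raising an exception rather than logging and continuing is a deliberate choice in this sketch: it mirrors the lean idea that anyone running the line can stop it before defective output moves downstream.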
And all along the way in the above activities, operations personnel will monitor and report on quality, throughput, performance (of storage and compute resources), API behaviors, microservices platform health, and compliance with contracts. They will also prevent and respond to security breaches.
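To make the quality monitoring concrete, here is a minimal sketch of one continuous quality metric: the completeness of an ingested batch, with an alert when it falls below a minimum specification. Everything in it is an assumption made for illustration, including the `QualityReport` and `check_batch` names, the definition of completeness as the share of records with every required field populated, and the 0.95 default threshold. Real specifications would come from the agreements with each supplier.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    source: str
    completeness: float  # share of records with every required field present
    alert: bool          # True when the batch falls below the minimum spec


def check_batch(source, records, required_fields, min_completeness=0.95):
    """Measure completeness of one ingested batch and flag it if below spec."""
    if not records:
        # An empty batch is treated as a quality exception in this sketch.
        return QualityReport(source, 0.0, True)
    complete = sum(
        all(rec.get(field) not in (None, "") for field in required_fields)
        for rec in records
    )
    completeness = complete / len(records)
    return QualityReport(source, completeness, completeness < min_completeness)


# Hypothetical batch from a hypothetical supplier: one of two records
# is missing its price, so completeness is 0.5 and the alert is raised.
batch = [
    {"id": 1, "price": 10.5},
    {"id": 2, "price": None},
]
report = check_batch("supplier_a", batch, ["id", "price"])
```

Trending `report.completeness` per supplier over time is what turns a one-off check into the kind of measurement that can drive continuous improvement.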
DataOps is the application of lean principles to data operations.