
Data Development Life Cycle

XP for Data Assets

Kevin Kautz
11 min read · Mar 10, 2021


Which metaphor is best suited to elevate data to first-class citizenship in modern organizations? I suggest that we borrow terminology from software development life cycles (SDLCs), and more specifically from XP (Extreme Programming).

My audience is organizations that develop software. They understand SDLCs and, with help, can see the value in XP. But choosing the software metaphor when talking about data is an attempt at broader applicability than metaphors such as data lakes and data warehouses.

So let’s approach data as the representation of human behavior through the lens of human creativity. That’s also what software is, by the way, although we might have forgotten this.

If you’re familiar with the data life cycle (creation, storage, use, archival, destruction), that’s not what I mean here. This discussion is broader.

Software Development Life Cycle (SDLC)

Before we apply the Development Life Cycle metaphor to data, what are the steps when the focus is software?

First, there is a business domain in which we determine how software adds value to clients and stakeholders. We review market needs, create hypotheses, and evaluate them via prototypes with limited functionality. This leads to an architectural approach, detailed design, development and testing. In modern software teams, we add continuous cloud deployment of systems and infrastructure. As the software moves into production, we work with operations and support teams to satisfy end-user needs, and we work with infrastructure teams to maintain security and to monitor closely so we can scale appropriately as usage grows. We also measure the value in client adoption and in financial viability.

Let’s turn that paragraph into a list of steps. Although I describe a product-driven, user-centered SDLC that aligns with XP, neither waterfall nor other agile methodologies would change these high-level steps.

Software Discovery & Framing

  1. Describe the domain
  2. Identify problems that matter to the market
  3. Propose hypotheses that solve problems
  4. Evaluate hypotheses with prototypes

Software Engineering & Design

  1. Plan the architectural context and boundary conditions
  2. Design the software behavior
  3. Develop and test software and the infrastructure to run it
  4. Deploy continuous improvements to software and infrastructure

Software Operations

  1. Equip operations and support teams
  2. Secure application access
  3. Monitor usage and scale for growth
  4. Measure client adoption and financial viability

Data Development Post-Production

There is a critical distinction to make between data development and software development. With software, although changes are introduced regularly, they are developed & tested pre-production. With data assets, the equivalent behaviors happen post-production. That is, data development happens after software and schemas are deployed to production.

Data is like water. Continuous, uninterrupted flow is essential. Sources vary, quality varies, and volumes can increase or decrease. But data development and testing are continuous after deployment, while software development and testing iterate before deployment. To illustrate this distinction, think about the water you drink, and how data is similar.

Data quality requires continuous testing: the fact that you tested yesterday’s water before drinking it today does not make tomorrow’s water safe to consume.

Data Development Life Cycle

Data Discovery & Framing

The first steps are outward-looking. Data changes faster than software does, because data represents the external actors and actions. Data consists of observations that occur within a continually changing business context.

With software, these are typically considered pre-production steps. There’s a good reason for that. You cannot reach a first viable version of a software product until you complete these. However, data assets change faster than software does. This means that iterations of data assets include continuous change to domain, significance, hypotheses, and focus.

Data domain

To describe how data represents your business domain is to build an ontology. Engage your subject matter experts. Talk to clients. Compare and contrast your market position (or intended position) against competitors. Look at business trends, not technology trends. Don’t focus on what software can solve for. Focus on the human behaviors that create the opportunity, and on the supply & demand that creates the market for this business.

The ontology will change the least frequently.

Data that matters to the market

To identify how you will represent the human behaviors in the business domain, focus on your data subjects. That term immediately brings to mind people, and it’s likely that people will be included in your ontology. But data subjects can also be transactions, organizations, physical objects, etc. What are the central subjects of your business domain? What are the activities or behaviors that the subjects are involved in? What are the characteristics of subjects or behaviors that everyone in this domain agrees are important?

Significance changes faster than the ontology does.

Data hypotheses

To propose a representation of subjects, activities and characteristics in the form of data, ask what dimensions apply. For instance, time and location almost always matter. Sometimes sequence matters more than time. What we frequently call relationships can be as simple as structure around the characteristics. But sometimes there are patterns of behavior between subjects or across activities that need to be represented so that you know enough to record an observation.

Because both the ontology and the significance will change, you will need a feedback loop. Revise or create new hypotheses as you learn more. Sometimes, old hypotheses must be abandoned.
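To make that concrete, here is a minimal sketch of what a data hypothesis might look like as a record of an observation, with the dimensions made explicit. The field names (subject_id, activity, occurred_at, and so on) are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Optional

# A minimal sketch of a data hypothesis: one observation of a subject's behavior,
# with the dimensions (time, location, sequence) made explicit so they can later
# be tested against real client questions. All field names are illustrative.

@dataclass
class Observation:
    subject_id: str                 # the data subject: a person, transaction, object...
    activity: str                   # the behavior being observed
    occurred_at: datetime           # time dimension: when the behavior happened
    location: Optional[str] = None  # location dimension, when it matters
    sequence: Optional[int] = None  # sequence dimension, when order matters more than time
    characteristics: dict = field(default_factory=dict)   # agreed-upon attributes
    related_subjects: list = field(default_factory=list)  # relationships to other subjects

# Example: one observation recorded under this hypothesis
obs = Observation(
    subject_id="customer-42",
    activity="placed_order",
    occurred_at=datetime(2021, 3, 10, 14, 30),
    location="US-NY",
    sequence=3,
    characteristics={"order_value": 129.95, "channel": "web"},
    related_subjects=["order-1001"],
)
print(obs)
```

The point of writing it down this explicitly is that every dimension becomes something you can revise when the feedback loop tells you the hypothesis was wrong.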

Data prototypes

To evaluate your proposed data schema, gather the questions that your clients will ask. Can your proposed data set answer those questions? If not, will you need additional sources of data? You want the smallest representation of data that will answer the questions that matter. You do not want the widest possible set of information. Data that is not going to be used is worse than useless, because it will corrupt your ability to measure what matters. Data in use has value; data at rest has none. Start narrow and focused. Know the questions that your data will answer.
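One lightweight way to run that evaluation is to write down each client question with the fields it needs and check them against the proposed schema. This is only a sketch; the questions and field names below are invented.

```python
# A minimal sketch of evaluating a proposed schema against client questions.
# Both the schema fields and the questions are hypothetical examples.

proposed_schema = {"subject_id", "activity", "occurred_at", "location", "order_value"}

client_questions = {
    "Which regions grew fastest last quarter?": {"location", "occurred_at", "order_value"},
    "How often does a customer reorder?": {"subject_id", "activity", "occurred_at"},
    "Which suppliers cause late deliveries?": {"supplier_id", "promised_at", "delivered_at"},
}

for question, required_fields in client_questions.items():
    missing = required_fields - proposed_schema
    status = "answerable" if not missing else f"needs additional data: {sorted(missing)}"
    print(f"{question} -> {status}")
```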


Production Changes to Data Schemas

My audience for this article will still be thinking about software, especially as we use terminology from SDLC to talk about data. I need to emphasize again that although a change in data schemas means that your data assets will look different, the business almost always needs to see trends and behaviors that were visible on both the old and new schema. You cannot throw away data from a prior schema when you start using a new one.

Each of the changes above to ontology, significance, and representation brings challenges to preserving historical data, or to combining historical data with new data. Modern tech stacks for data must handle self-service data ingestion, automatic recognition of and adaptation to changing schemas, and tagging or associations that sustain analysis across changing schemas that represent the same real-world subjects and behaviors.
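As a hedged sketch of what “automatic adaptation to changing schemas” can mean in practice, the snippet below maps records from an old and a new schema to one canonical shape and tags each with its schema version, so analysis can span the change. The schema versions and field names are assumptions for illustration.

```python
# A hedged sketch of adapting records from an old and a new schema to one
# canonical shape, so trend analysis can span the schema change.
# Schema versions and field names are illustrative only.

FIELD_ALIASES = {
    "cust_id": "subject_id",      # old schema name -> canonical name
    "customer_id": "subject_id",  # new schema name -> canonical name
    "ts": "occurred_at",
    "event_time": "occurred_at",
}

def to_canonical(record: dict, schema_version: str) -> dict:
    """Rename known fields to canonical names and keep the schema version as a tag."""
    canonical = {FIELD_ALIASES.get(k, k): v for k, v in record.items()}
    canonical["_schema_version"] = schema_version
    return canonical

old_record = {"cust_id": "42", "ts": "2020-06-01T09:00:00", "amount": 10.0}
new_record = {"customer_id": "42", "event_time": "2021-03-10T14:30:00", "amount": 12.5}

print(to_canonical(old_record, "v1"))
print(to_canonical(new_record, "v2"))
```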

Data Engineering & Design

Data architecture is being subsumed into software architecture. On one hand, this is appropriate because data has value only as it moves through software-implemented data pipelines. The software behaviors constrain the fitness & use of the data. Data architecture does not stand on its own.

On the other hand, it is also true that data architecture is fundamentally distinct from software architecture to the extent that it deals with data significance and schema change instead of storage and processing.

Data architecture for storage and movement and security

To plan for the data capture, data collection, data storage, data flows, data quality, and fitness for use that will happen during data design & development, you need to know your boundary conditions. How large will your data sets become? How much historical data is needed to answer the most impactful client questions? What are the government regulations on the use and storage of this data? What are the commercial contract conditions that will constrain your use of the data that you get from others? What are the privacy concerns and security expectations of your clients? How quickly will your data schemas change as you learn more?

On the systems side, your data storage and data processing tech stack should change slowly, especially if you support questions on historical trends across schemas that are changing.

Data life cycle

To design how frequently to capture data, how to standardize it, and how to distinguish good data from bad, focus on the questions that you want to answer from the data. You need narratives that start from a persona whose problems take the form of questions, and that show how that person can initiate an inquiry, propose a way to answer it, and then use data to get the answer. In this regard, data design overlaps substantially with software design.

However, with data questions, it is not enough to design your data once. The feedback loop may force you back to the point where you acquire the data. Data schemas that cannot change get less valuable over time. You need systems, tools and procedures that accept continuous schema evolution.

Data capture and data enhancement

To develop data is to capture, standardize, transform, and align it so that classifications are well-distributed for statistical analysis and so that dimensions support trend analysis and ad hoc queries.
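For illustration, standardization and alignment can be sketched as a pipeline of small functions that normalize raw values and bucket them into classifiers. The specific rules below are hypothetical, not a prescription.

```python
# A minimal sketch of standardizing and enhancing captured records.
# Each step is a small, testable function; the specific rules are illustrative.

def normalize_country(record: dict) -> dict:
    record["country"] = record.get("country", "").strip().upper() or "UNKNOWN"
    return record

def classify_order_size(record: dict) -> dict:
    # Bucket a raw amount into a classifier that supports trend analysis.
    amount = record.get("order_value", 0)
    record["order_size"] = "large" if amount >= 100 else "small"
    return record

PIPELINE = [normalize_country, classify_order_size]

def enhance(record: dict) -> dict:
    for step in PIPELINE:
        record = step(record)
    return record

print(enhance({"subject_id": "customer-42", "country": " us ", "order_value": 129.95}))
```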

Because clients continue to recognize new patterns of data subjects and their behaviors, the augmentation of new data sources, new dimensions, new classifiers, and new join conditions is also continuous. You need a feedback loop so that analysts and other consumers can contribute to improvements in data content, schemas, classifiers and associations.

Data deployment

To deploy data is to publish it to internal & external consumers, to client applications, and to analytics communities. Depending on your business context, deployment might be streaming and ephemeral, or it may be episodic and rely heavily on trending over time.

If your domain has significant regulations or if each client has content that must be protected from other clients, then your deployment may involve metadata tagging to support audit & compliance, to handle privacy regulations, or to add multi-tenancy constraints that ensure protection of client-owned data assets.
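Here is a minimal sketch of that kind of tagging, assuming each record carries a tenant identifier and a regulatory classification; the publish step then filters on those tags. The tag names are invented for this example.

```python
# A minimal sketch of multi-tenancy and compliance tags applied at deployment.
# Tag names (tenant_id, data_class) are illustrative, not a standard.

records = [
    {"tenant_id": "client-a", "data_class": "pii", "payload": {"name": "Ada"}},
    {"tenant_id": "client-b", "data_class": "public", "payload": {"sku": "X-1"}},
]

def publish_for_tenant(records: list, tenant_id: str, allowed_classes: set) -> list:
    """Return only the records this tenant owns and is allowed to receive."""
    return [
        r for r in records
        if r["tenant_id"] == tenant_id and r["data_class"] in allowed_classes
    ]

print(publish_for_tenant(records, "client-a", {"pii", "public"}))
```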


Data Operations

There is less confusion about data operations in most companies that I have worked with. Data discovery & framing are frequently unknown, and data design & development are often underserved, but data operations at least appears in most organizational structures.

Unfortunately, most organizations write software with the assumption that data schemas don’t change. When new data sources or modified schemas are recognized, the next assumption is that this requires changes to software. If this is still true of your organization, then you will not be able to keep up with the pace of change. Data is changing much faster than it used to, and your software tech stacks need to equip your data operations team to handle new sources and changing data schemas without the need to change the software.

Data quality alerts and remediation

To equip the data operations team with the essential tools to do its job, provide data content definitions of what is expected, what range of values is acceptable, and what volume of data from each source is normal. Instrument the data ingestion and transformation with immediate alerts when the data does not match expectations.
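A minimal sketch of that instrumentation, assuming expectations are declared as data rather than buried in code: state the normal volume and acceptable ranges per source, then raise an alert the moment a batch violates them. The thresholds below are placeholders.

```python
# A hedged sketch of data quality expectations and immediate alerting.
# Expected ranges and volumes are illustrative placeholders.

EXPECTATIONS = {
    "orders_source": {
        "min_rows": 1_000,                             # normal daily volume lower bound
        "value_ranges": {"order_value": (0, 50_000)},  # acceptable value range per column
    }
}

def check_batch(source: str, rows: list) -> list:
    """Return a list of alert messages for a batch of ingested rows."""
    alerts = []
    exp = EXPECTATIONS[source]
    if len(rows) < exp["min_rows"]:
        alerts.append(f"{source}: volume {len(rows)} below expected {exp['min_rows']}")
    for column, (low, high) in exp["value_ranges"].items():
        bad = [r for r in rows if not (low <= r.get(column, low) <= high)]
        if bad:
            alerts.append(f"{source}: {len(bad)} rows outside {column} range [{low}, {high}]")
    return alerts

sample = [{"order_value": 120.0}] * 900 + [{"order_value": -5.0}]
for alert in check_batch("orders_source", sample):
    print("ALERT:", alert)
```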

Your software tech stack needs to provide tools for the data operations team to remediate data problems and either approve its use or revert & repair portions of the data content so that it can be reprocessed.

Data changes are normal. The data operations team needs to be able to modify data schemas, review & approve such changes, and keep the data flowing to meet client SLAs for on-time delivery of value.

Data access and security

With regulations becoming more common, most organizations recognize the need to tag data with metadata, sometimes in the schema, sometimes in the data content itself, so that securable classes of data are clearly marked. Data privacy laws force this. Common data classes that have unique security needs are personally identifiable information, credit data, and health records.
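One simple way to make those securable classes explicit, sketched here with hypothetical column names, is to record a classification per column and redact anything the consumer is not cleared to see.

```python
# A minimal sketch of column-level classification and masking.
# The classification map and masking rule are illustrative assumptions.

COLUMN_CLASSIFICATION = {
    "name": "pii",
    "email": "pii",
    "credit_limit": "credit",
    "order_value": "public",
}

def mask_for(record: dict, cleared_classes: set) -> dict:
    """Redact any column whose classification the consumer is not cleared to see."""
    return {
        column: (value if COLUMN_CLASSIFICATION.get(column, "public") in cleared_classes
                 else "***REDACTED***")
        for column, value in record.items()
    }

row = {"name": "Ada Lovelace", "email": "ada@example.com", "order_value": 129.95}
print(mask_for(row, cleared_classes={"public"}))
print(mask_for(row, cleared_classes={"public", "pii"}))
```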

There is an opposing principle to security, which is access. You cannot adapt quickly to changing client needs and to new competitors in your market unless your own employees can easily reach your data, analyze it, and make decisions in real-time for how to adapt.

The best approach is to increase your data security capabilities at the same time that you expand access to the data. More employees need access to more of the data assets, and the need keeps growing. Data federation layers can do both.

Data performance

To monitor your data sources, what you collect from them, and the additional data assets that you build from those, you need to collect statistics.

You will need more than record counts. You will need snapshots to compare how data content changes over time. You will need data profiling. For classifiers, take snapshots of the distribution of values and see how the distributions change. Consumption metrics can measure how quickly and how often data assets are used, and how quickly the queries that retrieve them run.

Add each of these to data quality alerts to provide analysts with actionable information on the need to scale for performance in data storage architecture, data pipeline speed, and data access response time from your applications.
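As a sketch of the profiling piece, a snapshot can be as simple as the value distribution of a classifier at a point in time; comparing two snapshots shows whether the distribution has shifted. The classifier values and the drift threshold below are illustrative.

```python
# A minimal sketch of distribution snapshots for a classifier and a crude drift check.
# The classifier values and the 10-percentage-point threshold are illustrative.

from collections import Counter

def distribution(values: list) -> dict:
    """Share of each classifier value in a snapshot."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}

def drifted(previous: dict, current: dict, threshold: float = 0.10) -> bool:
    """True if any classifier value's share moved by more than the threshold."""
    keys = set(previous) | set(current)
    return any(abs(previous.get(k, 0) - current.get(k, 0)) > threshold for k in keys)

yesterday = distribution(["small"] * 70 + ["large"] * 30)
today = distribution(["small"] * 50 + ["large"] * 50)
print("drift detected:", drifted(yesterday, today))
```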

Measure the value of the data to clients

To measure the value of software or of data, the most valuable feedback comes from speaking directly to clients. The value proposition for product offerings is built on direct feedback.

However, in order to run a business, we have to be able to measure what we’re doing on a daily or continuous basis. Direct client feedback is not fast enough for this. The most frequently used substitute is to develop a key performance indicator (KPI) that measures our own data throughput or data quality or response time, etc. Our internal KPIs, in an ideal world, would be ones that we can also share with clients to demonstrate that we take our contractual SLAs seriously, even if our KPIs are not mentioned directly in client contracts.
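As a small sketch of such an internal KPI, assume the SLA is on-time daily delivery; the KPI is then the share of deliveries that met their deadline over a window. The data and the target below are placeholders.

```python
# A hedged sketch of an on-time delivery KPI measured against a contractual SLA.
# The deliveries and the 99% target are illustrative placeholders.

deliveries = [
    {"date": "2021-03-08", "on_time": True},
    {"date": "2021-03-09", "on_time": True},
    {"date": "2021-03-10", "on_time": False},
]

on_time_rate = sum(d["on_time"] for d in deliveries) / len(deliveries)
SLA_TARGET = 0.99

print(f"On-time delivery KPI: {on_time_rate:.1%} (target {SLA_TARGET:.0%})")
```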


Data Maturity

Does your organization treat data as a first class citizen? Do you spend more on software development and less on data development? How is that working out for you?

It may be time for your organization to measure the return on investment on data that you ingest, create, and process. The ROI on data changes as the data flows, as stakeholders trust and use data, and as they abandon it when it fails them. Would you know if you continued to build data every day that no one uses because they don’t trust it?

Do you yourself trust your data? If you have a mature data development life cycle, with feedback loops to handle changes in domain, in significance, in design, and with competent security, monitoring, and measurement of value in operations, you will.

