Let’s take a deeper look at the industry trend toward virtual data lakes.
Why do we need a data lake?
A data lake provides a consolidated view and single access point for analytical workloads such as dashboards & reporting for operations, and exploratory analysis & data science research. It expands access to data while restricting access to transactional systems to enforce security and to protect the compute resources of operational systems.
It protects & secures operations and it enables analytics.
What’s the problem with a data lake?
Copying the data from operational systems to populate a data lake adds cost and introduces delays. Forcing data into a single storage technology narrows the available tools for research, and optimizes the paths to answer individual questions without guiding teams to a shared understanding of the business and its key measures.
It costs more & takes longer and it reduces consensus.
Before we explain how a virtual data lake addresses the problems that a data lake introduces, let’s define the key terminology.
- Where data represents both the raw materials and the finished goods, the operational systems are the ones that ingest data and push it through the data pipelines that cleanse, standardize, transform and enhance the data before it is delivered to clients.
- To measure and to improve a company’s operations, each data warehouse uses a fixed schema with aggregated data cubes to enable decision-makers to answer predetermined questions via dashboards and reports. Each cube focuses narrowly on a single system from which it gathers its content.
- An organization’s data lake gathers data content from multiple operational systems. Its purpose is to enable analytical workloads for exploratory analysis and data science research.
- We make a distinction between a physical data lake and a virtual data lake. A physical data lake holds a separate copy of data content and uses a single data storage layer. A virtual data lake uses software and metadata to reduce or eliminate copies, and accepts many storage layers and formats.
When both the operational systems and the data lake are in the cloud, we can refresh the data lake more quickly. This allows us to copy from operational systems only once, and we can build the data warehouse cubes from the data lake.
Moving operational systems into the cloud is an essential first step toward virtual data lakes. When only the data lake is in the cloud, it is both slow and costly to copy data from operational systems. We cannot do it as frequently as we would like. Long runtimes also steal compute resources from operational systems.
Decouple compute from storage
High speed cloud networks allow us to separate compute resources from cloud data storage. In this way, we can scale up and down on compute resources independently from data resources. This is an essential building block on our path to virtualization.
Traditional relational databases keep compute and storage tightly coupled, because separating them would put transactional integrity (ACID compliance) at risk. Where transactional integrity is essential to the operational systems, we are forced to accept that we must copy the data content from relational databases to a decoupled cloud-optimized data storage layer for the data lake.
In practice, very few operational systems require transactional integrity. In particular, data pipelines that ingest, cleanse, standardize and enhance data content can be safely implemented with full recoverability without using relational databases or ACID-compliant transactions.
Decouple access & security from storage
When data content resides in cloud-optimized data storage layers, the access rules and security enforcement that we built into our databases no longer apply within the data storage. When we decouple data storage from compute, we also decouple access & security from data storage.
We must reintroduce data access rules and information security through the tools that consume data from the data lake. One way to do this is to add a new component in data platform architecture for cloud data lakes: the federated data access layer. This is described in greater detail below.
Let’s review the problems that the cloud data lake introduces.
- High cost & added delays to copy from non-cloud-optimized systems
- Loss of transaction integrity that relational databases provide
- Loss of access rules and security that we build into our databases
- Departmental divergence due to data warehouses that answer narrow questions with data from individual operational systems
How does a virtual data lake address these problems?
Get the data out of the database
The best practice is unambiguous. Put your data into a cloud-optimized data storage layer. This means S3 for AWS, ADLS for Azure, and GCS for GCP. These work with high-speed cloud networks, include redundancy and recoverability, and, if you plan your layout carefully, support partitioning in a straightforward way. The result is highly scalable parallelized data access.
Introduce federated data access
When we add a federated data access layer, we can expand access to data at the same time that we improve our security footprint. Secure your operational systems from non-operational users. Route all data access for analytics and reporting through this federated data access layer so that it reaches the data lake and data warehouses in one controlled way. Add roles & rules to impose a consistent data access policy, and track query & reporting behavior as well as analytical research so that you know who is using which data.
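The roles, rules, and usage tracking described above can be sketched in a few lines of Python. This is a minimal illustration, not a real product's API: the class name FederatedAccessLayer, the shape of the policy dictionary, and the sample datasets are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class FederatedAccessLayer:
    """Hypothetical access layer: checks role rules, then logs every request."""

    # role -> names of the datasets that role may read (assumed policy shape)
    policy: dict[str, set[str]]
    # dataset name -> rows; stands in for the data lake and data warehouses
    datasets: dict[str, list[dict]]
    # (user, role, dataset, allowed) entries, one per request
    audit_log: list[tuple[str, str, str, bool]] = field(default_factory=list)

    def query(self, user: str, role: str, dataset: str) -> list[dict]:
        allowed = dataset in self.policy.get(role, set())
        # Record who asked for which data, whether or not access was granted.
        self.audit_log.append((user, role, dataset, allowed))
        if not allowed:
            raise PermissionError(f"role {role!r} may not read {dataset!r}")
        return list(self.datasets[dataset])


access = FederatedAccessLayer(
    policy={"analyst": {"sales"}, "data_scientist": {"sales", "clickstream"}},
    datasets={
        "sales": [{"region": "east", "amount": 120}],
        "clickstream": [{"page": "/home", "visits": 42}],
    },
)

rows = access.query("ana", "analyst", "sales")  # permitted by the analyst role
try:
    access.query("ana", "analyst", "clickstream")  # not in the analyst role
    denied = False
except PermissionError:
    denied = True
```

Because every request passes through one query method, the policy check and the audit entry cannot be skipped by an individual tool. A production layer would also need authentication and row- or column-level rules, which this sketch leaves out.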
Reduce or eliminate copies
In the simplest definition, data virtualization means that the data exists only once in a data storage layer. The same data can be used for dashboards & reports and also for analytical exploration & data science research.
For data warehouse cubes in data visualization tools such as Tableau, Power BI, Qlik, and Apache Superset, use the pass-through design pattern. Use your data federation layer to cache dimensions or cube content so that you no longer need to build & refresh cubes within the visualization servers.
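As a rough sketch of the pass-through pattern, the Python below caches query results inside a federation layer so that repeated dashboard requests do not reach the data lake again. The class CachingFederationLayer, the fake_lake function, and the sample query are hypothetical stand-ins, and a real federation product will manage its cache differently.

```python
from typing import Callable


class CachingFederationLayer:
    """Hypothetical federation layer that caches dimension or cube queries."""

    def __init__(self, run_on_lake: Callable[[str], list[tuple]]):
        self._run_on_lake = run_on_lake
        self._cache: dict[str, list[tuple]] = {}
        self.lake_calls = 0  # how many queries actually reached the data lake

    def query(self, sql: str) -> list[tuple]:
        # Visualization tools pass queries through; only cache misses hit the lake.
        if sql not in self._cache:
            self.lake_calls += 1
            self._cache[sql] = self._run_on_lake(sql)
        return self._cache[sql]

    def invalidate(self) -> None:
        """Clear the cache, for example after the data lake is refreshed."""
        self._cache.clear()


def fake_lake(sql: str) -> list[tuple]:
    # Stand-in for a query engine that reads from the data lake.
    return [("east",), ("west",)]


federation = CachingFederationLayer(fake_lake)
first = federation.query("SELECT DISTINCT region FROM sales")
second = federation.query("SELECT DISTINCT region FROM sales")  # served from cache
```

The visualization server holds no cube of its own in this design. It sends the same query each time, and the federation layer decides whether to answer from its cache or from the data lake.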
Replace data transactions with data transformations
Data analytics developers are familiar with a different paradigm of data transformation than those who work with relational databases. Statistical packages such as SPSS and SAS, and languages such as Scala and R, have always used an approach that builds an entirely new dataset with each aggregation or transformation.
When you apply this approach to data pipelines, you can completely remove the need for transactional integrity and ACID compliance. Instead, you use reproducibility and recoverability to achieve similar results.
For those few (or many!) who prefer to think of data through the language of SQL, you can understand this approach as follows. Build your data pipeline as a sequence of CTAS (“create table as …”) statements, where each new table contains all of the data content from the prior table, with changes introduced via filters, joins, aggregations, and combinations of the above.
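A minimal sketch of such a CTAS pipeline follows, using Python's built-in sqlite3 module as a stand-in for whatever SQL engine runs over your data lake. The table names, columns, and sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, " East", 100.0), (2, "west", 50.0), (3, "east ", 25.0), (4, "west", None)],
)

# Each step builds a new table from the prior one; no table is updated in place.
PIPELINE = [
    # Cleanse: drop rows with a missing amount.
    ("clean_orders",
     "SELECT order_id, region, amount FROM raw_orders WHERE amount IS NOT NULL"),
    # Standardize: normalize the spelling of region names.
    ("std_orders",
     "SELECT order_id, lower(trim(region)) AS region, amount FROM clean_orders"),
    # Aggregate: total amount per region.
    ("region_totals",
     "SELECT region, sum(amount) AS total FROM std_orders GROUP BY region"),
]


def run_pipeline(connection, steps):
    for table, select_sql in steps:
        # Dropping and rebuilding makes every step safe to rerun after a failure.
        connection.execute(f"DROP TABLE IF EXISTS {table}")
        connection.execute(f"CREATE TABLE {table} AS {select_sql}")


run_pipeline(conn, PIPELINE)
run_pipeline(conn, PIPELINE)  # a rerun reproduces the same tables

totals = conn.execute(
    "SELECT region, total FROM region_totals ORDER BY region"
).fetchall()
# totals == [("east", 125.0), ("west", 50.0)]
```

The source table is never modified, so a failed or suspect step can be fixed and rerun from the prior table. That is the sense in which reproducibility and recoverability stand in for transactions.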
Use parallelized in-memory analytics
With partitioned data in cloud-optimized data storage, you can make use of newer analytics engines that populate in-memory dataframes in Python and R so that you can achieve dramatically faster execution of data science models and exploratory research.
If you use Spark in a map-reduce cluster, read & write directly from & to the cloud data storage (S3, ADLS, GCS). If you use Python or R, use the Apache Arrow Flight libraries to interact with a cluster of Arrow dataframe worker nodes. One of the federated data access vendors embeds Apache Arrow and Arrow Flight support natively (see https://www.dremio.com/eliminating-data-exports-for-data-science-with-apache-arrow-flight).
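To show why partitioning enables parallel reads, here is a small sketch that uses only the Python standard library. It is not Spark or Arrow Flight: a temporary local directory stands in for the cloud storage bucket, CSV files stand in for a columnar format, and the region=... directory layout and function names are assumptions made for the example.

```python
import csv
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def write_partitions(root: Path) -> None:
    """Write a tiny dataset partitioned by region (hypothetical layout)."""
    data = {"east": [100.0, 25.0], "west": [50.0]}
    for region, amounts in data.items():
        part_dir = root / f"region={region}"
        part_dir.mkdir(parents=True)
        with open(part_dir / "part-0.csv", "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["amount"])
            writer.writerows([a] for a in amounts)


def read_partition(part_dir: Path) -> dict[str, list]:
    """Load one partition into in-memory columns; region comes from the path."""
    region = part_dir.name.split("=", 1)[1]
    with open(part_dir / "part-0.csv", newline="") as f:
        amounts = [float(row["amount"]) for row in csv.DictReader(f)]
    return {"region": [region] * len(amounts), "amount": amounts}


def read_dataset(root: Path) -> dict[str, list]:
    parts = sorted(p for p in root.iterdir() if p.is_dir())
    # Partitions are independent of each other, so they can be read concurrently.
    with ThreadPoolExecutor() as pool:
        frames = list(pool.map(read_partition, parts))
    # Concatenate the per-partition columns into one in-memory table.
    return {
        "region": [r for frame in frames for r in frame["region"]],
        "amount": [a for frame in frames for a in frame["amount"]],
    }


with tempfile.TemporaryDirectory() as tmp:
    write_partitions(Path(tmp))
    table = read_dataset(Path(tmp))
```

Each partition is read by its own worker and the results are combined in memory. Engines such as Spark apply the same idea across many machines instead of the threads of a single process.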
Virtual data lake best practices
- Migrate operational systems to the cloud.
- Move your data into cloud-optimized data storage.
- Redesign your data pipelines without transactions.
- Use separate compute resources for operational systems, for data visualization of data warehouses, and for analytical exploration & data science research.
- Use pass-through data warehouse design in data visualization tools.
- Deploy a federated data access layer that includes access rules and information security enforcement.
- Use analytical tools that rely on parallelized in-memory dataframes.
With these best practices, you reduce or eliminate copies of data, increase the freshness of data for decision-making, provide broader access to data across multiple operational systems, and increase the speed of analytical exploration and data science research.