Chances are that your company is looking to get value out of its data sooner rather than later. That means getting it all the way from the point of entry to your new analytics dashboard driven by real-time data analytics and machine learning (ML). The distance between raw data and actionable information is large, and there are no shortcuts. To reduce the time from data to insight, every step along the way has to get faster.
The steps required to turn data into insights originated in traditional Business Intelligence (BI) practice. You would set up a central database (most likely a SQL database) to act as the Data Warehouse. To get data into the warehouse, you would create processes known as ETL jobs. ETL is an acronym for Extract, Transform, Load. The Data Warehouse itself was very well defined, and data coming into the system had to be transformed, cleansed, and mastered so it could be properly stored. Finally, once all the data was loaded, the system was processed, producing a cube that could power reports and dashboards.
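The ETL pattern can be sketched in a few lines of Python. This is a minimal, hypothetical illustration; the field names, sample rows, and in-memory "warehouse" are made up for the example and stand in for real source systems and warehouse tables:

```python
# Minimal ETL sketch: extract raw rows, transform/cleanse them,
# then load the conformed result into a warehouse table.
# All names and data here are hypothetical.

def extract():
    # In practice this would query a source system or read a file.
    return [
        {"customer": " Bob Smith ", "amount": "125.50"},
        {"customer": "Jane Doe", "amount": "80.00"},
    ]

def transform(rows):
    # Cleanse and conform the data to the warehouse schema:
    # trim stray whitespace, parse amounts into numbers.
    return [
        {"customer": row["customer"].strip(), "amount": float(row["amount"])}
        for row in rows
    ]

def load(rows, warehouse):
    # Append the conformed rows to the warehouse table.
    warehouse.extend(rows)

warehouse_table = []
load(transform(extract()), warehouse_table)
print(warehouse_table[0])  # {'customer': 'Bob Smith', 'amount': 125.5}
```

The key point is the order: data is transformed and cleansed *before* it lands in the warehouse, which is exactly why the warehouse stayed consistent and why the load could break when data changed shape.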
This process was slow, and often took hours to complete. It was fairly common for the data to be updated once a day, or once every few days. Companies would often start the process at the close of business, and hope it was completed by the following morning. If anything went wrong, the company went at least a day without updated information.
Enter the Data Lake
Over the past decade, data types and volumes have exploded, and access to affordable big data technologies through cloud vendors has made it easier for companies to collect and analyze their data. The rigidity of the data warehouse made it very difficult to manage all this change. Every detail had to be physically prepared, and the slightest modification (such as the addition of a new field) would stop the entire loading process dead in its tracks. This meant every new data set, and every modification to an existing one, had to go through a change management process.
To keep up with demand, and to provide a manageable storage location for all of this data, the Data Lake was created. Data lakes get their name from being a location that data can be “poured” into quickly and easily, regardless of its type. As the data is “poured” in, it sits next to data of other types in one big pool. If a data warehouse is rigid, the data lake is fluid. Companies could now capture and store data quickly and easily.
1. Data Lakes are not Data Warehouses
One of the mistakes I see made on a regular basis is companies forgetting they
still need to achieve the same consistency and predictability they had with the
data warehouse. The data lake has not solved this for them; all it has done is
allow them to move their data into a central location for further analysis.
It has been extracted and loaded, but it has not been transformed.
2. Schema on read is not the answer
Remember that change management process I mentioned when new or modified data sets needed to be brought into a data warehouse? It was painful, and it presented a huge barrier to bringing new data online. Since data can just be “poured” into the data lake without paying any attention to what it looks like (i.e., its schema), we need to understand it before we can read it. The idea of “schema on read” surfaced: when you try to read the data, the data lake will tell you what it looks like automatically. Sounds great, doesn’t it? The problem you will quickly run into is that the schema changes depending on the data itself, and you have no way to enforce it. The problem of the data warehouse load process coming to a halt has simply resurfaced in your new data lake.
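Schema drift under schema on read is easy to demonstrate. In this hypothetical sketch, two records land in the lake at different times; the second adds a field and changes a type, and naive schema inference happily reports two different schemas with nothing to enforce a contract:

```python
import json

# Two hypothetical records "poured" into the lake at different times.
# The second payload adds a field ("region") and changes a type
# ("amount" goes from string to number).
record_a = json.loads('{"id": 1, "amount": "19.99"}')
record_b = json.loads('{"id": 2, "amount": 19.99, "region": "EMEA"}')

def infer_schema(record):
    # Naive schema-on-read: the "schema" is whatever this record contains.
    return {key: type(value).__name__ for key, value in record.items()}

print(infer_schema(record_a))  # {'id': 'int', 'amount': 'str'}
print(infer_schema(record_b))  # {'id': 'int', 'amount': 'float', 'region': 'str'}

# The inferred schemas disagree, and nothing stopped the drift:
assert infer_schema(record_a) != infer_schema(record_b)
```

A warehouse load would have rejected the second record at the door; the lake accepts both, and every downstream reader now has to cope with both shapes.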
3. Data still needs to be mastered
Digital transformation has been a primary driver of the data volume explosion over the past decade. Companies are moving away from centrally managed enterprise resource planning (ERP) systems toward best-of-breed solutions for each functional department. For example, the sales department is using Salesforce, the accounting department is using Great Plains, and your project managers are using Jira. Data about your customers exists in all of these systems, and any one customer has data in all of them. When you run that accounts receivable by sales region report, how do you know you aren’t counting customer Bob Smith more than once? Unless you have mastered that data, you don’t. Perhaps your Salesforce has two Bob Smith contacts in it, and your accounts receivable system has one. When you collect the data from those systems and “pour” it into your data lake, the lake doesn’t know that all of this information relates to a single Bob Smith.
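Mastering is the step that resolves those three records to one Bob Smith. The sketch below matches on a normalized name, which is deliberately simplistic; real master data management uses much richer matching rules and survivorship logic, and all of the system and field names here are hypothetical:

```python
# Sketch of mastering customer records collected from two
# hypothetical source systems. Matching on a normalized name is a
# toy rule, shown only to illustrate the idea of entity resolution.

crm_contacts = [
    {"source": "salesforce", "name": "Bob Smith"},
    {"source": "salesforce", "name": "bob smith "},  # duplicate contact
]
ar_accounts = [
    {"source": "great_plains", "name": "Smith, Bob"},
]

def match_key(name):
    # Normalize "Last, First" ordering, casing, and whitespace
    # into a single comparable key.
    parts = [p.strip() for p in name.split(",")]
    if len(parts) == 2:
        name = f"{parts[1]} {parts[0]}"
    return " ".join(name.lower().split())

# Group every source record under one master key.
masters = {}
for record in crm_contacts + ar_accounts:
    masters.setdefault(match_key(record["name"]), []).append(record)

print(len(masters))               # 1 master entity
print(len(masters["bob smith"]))  # 3 source records resolve to it
```

Without some version of this step, the lake stores three unrelated rows, and the accounts receivable report counts Bob Smith three times.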
4. Your business data still has quality issues
All sets of data have quality issues. Databases are living and breathing,
changing all the time and creating new issues. Data lakes have made it easier for
companies to ignore data quality problems, or even pretend they don’t exist. Data
quality problems used to surface visibly when they broke the data warehouse loading
process and your executives had to wait another day before their report would be ready.
Because data lakes can take in data that has quality problems without setting off alarm bells, it is easy to pretend those problems aren’t there.
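One remedy is to put a lightweight quality gate in front of the lake so that bad rows are flagged as they land rather than silently stored. The sketch below is a hypothetical example; the field names and validation rules are invented for illustration:

```python
# Sketch of a lightweight quality gate run as data lands in the lake,
# so problems surface instead of being silently ingested.
# Field names and rules are hypothetical.

def validate(row):
    errors = []
    if not row.get("customer_id"):
        errors.append("missing customer_id")
    amount = row.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        errors.append("invalid amount")
    return errors

incoming = [
    {"customer_id": "C-001", "amount": 42.0},
    {"customer_id": "", "amount": -5},  # a quality problem arriving quietly
]

# Keep the bad rows (with their reasons) for review instead of
# letting them disappear into the pool.
rejected = [(row, validate(row)) for row in incoming if validate(row)]
print(len(rejected))  # 1 bad row flagged
```

This restores a softer version of the alarm bell the warehouse load used to provide, without halting ingestion for everyone else.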
The hard processes still need to be done
The steps needed to turn data into insight still need to be completed and cannot be ignored. While the traditional data warehouse is no longer a staple of digital transformation, the lessons learned still apply. The good news is that technology is advancing to help us perform these steps faster and more reliably than ever before.