A Data Science Dilemma: The Problem Of Inconsistent Data Sets

It's no news that most companies struggle to get ROI from data science. Despite their efforts, most data science programs fail to develop technology that significantly impacts the business bottom line. But what specifically makes data science so difficult? What exactly is the problem?

Drawing from experience as a data scientist and as a data science leader at many different companies, both small and large, this article aims to expose a specific set of barriers to a successful data science program. For context, this article specifically concerns the management of structured data to enable data science. (Unstructured data is an entirely different ball game.)

The biggest problem, by far, is bad data. However, this is a very nuanced and complex issue. Most companies' platform devs will tell you that their data is good. However, if you talk to the data scientists, they will tell you a different story. Some data scientists will tell a story of lack of access, no consolidated data lake, or lack of consistency in the data over time. Other data scientists will simply be frustrated because they, themselves, are not sure why they cannot get results from the data. Still another set of data scientists will tell you they are producing good results when, in fact, they are not, and are completely oblivious to the reality. But the end result is the same: the data is bad, and there are no good results coming out of data science.

So where is the disconnect here? The place to start is data engineering. Data engineering is the building and management of data pipelines, data sets, and data tools. One way to think of data engineering is as the “backend to the backend”. However, don’t be fooled: it's very different from the work of a backend website developer. In fact, it’s a completely different mindset.

Below is a list of the data engineering requirements for a data ecosystem that most impact the ability of the data science team to succeed. That is, these are the items that most often impede the progress of a data science team:

A. All the data needs to be in one location. This is the concept of a “data lake” and, more specifically, the “data warehouse”. A data warehouse means all the data sets are not only in one place, they are also in one format. But which data format is used also matters. The data format needs to support these features:

  1. The data format must support a full schema. A full schema requires column names and a specific data type per column. (This means the data in the data warehouse cannot be in CSV or JSON format. That does not mean the original format cannot be CSV or JSON, just that it needs to be engineered into a format that supports a full schema for it to be usable by data scientists; a minimal conversion sketch follows this list.)
  2. Ideally, the format supports automatic compression
  3. Ideally, the format supports columnar access so wide data sets can be efficiently read
  4. Ideally, the format supports parallel reads from parallel processing frameworks like Apache Spark
  5. Excellent formats to use are ORC, Avro, and Parquet
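
To make the schema point concrete, here is a minimal sketch of converting a raw CSV extract into Parquet with a full, explicitly declared schema using PySpark. The column names, data types, and storage paths are hypothetical; the point is that the schema is declared rather than inferred, and the output is a compressed, columnar, Spark-friendly format.

```python
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("csv_to_parquet").getOrCreate()

# Hypothetical full schema: every column has a name and a specific data type.
schema = StructType([
    StructField("event_id", StringType(), False),
    StructField("customer_id", StringType(), True),
    StructField("product_id", StringType(), True),
    StructField("price", DoubleType(), True),
    StructField("event_time", TimestampType(), True),
])

# Read the original CSV with the declared schema instead of inferring types.
raw = (spark.read
       .option("header", "true")
       .schema(schema)
       .csv("s3://landing-zone/product_views/*.csv"))   # hypothetical landing path

# Write to Parquet: compressed, columnar, and readable in parallel by Spark.
(raw.withColumn("event_date", F.to_date("event_time"))
    .write
    .mode("append")
    .partitionBy("event_date")
    .parquet("s3://warehouse/product_views/"))          # hypothetical warehouse path
```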

What does not work is many tables spread out across SQL databases and other data store technologies. A data scientist needs easy access to all the data in one place; otherwise they spend all their time gaining access, learning how to pull data from each source, and dealing with asynchronous changes on each of those data stores…which is untenable for the data scientist.

B. The data sets need to be joinable where possible. Joinable means that data sets share common ID columns that allow them to be joined together. This requires data sets to be modeled out such that they can be used together.
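
As a hedged illustration, the sketch below joins two hypothetical warehouse tables on a shared customer_id column. The table names and paths are assumptions; the point is that because the data sets were modeled with a common ID, combining them is a one-line join rather than a bespoke matching exercise.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("joinable_datasets").getOrCreate()

# Hypothetical warehouse tables that were modeled with a shared customer_id key.
orders = spark.read.parquet("s3://warehouse/orders/")
customers = spark.read.parquet("s3://warehouse/customers/")

# Because the common ID column exists in both data sets, joining is trivial.
enriched = orders.join(customers, on="customer_id", how="left")

enriched.write.mode("overwrite").parquet("s3://warehouse/orders_enriched/")
```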

C. The data needs to be updated by an automatic pipeline. A data pipeline adds to the data set as new data becomes available. In an IoT or website click-stream situation, this means streaming the data into the data warehouse. Other data may arrive in batches, but it should be processed automatically and end up in the data warehouse. The reasons for an automatic pipeline are that the data science algorithms/models need to be built on recent data to be relevant, and the models must be applied to new data within the data science environment so they can be monitored and evaluated. (A minimal scheduling sketch follows below.)

Data that is a one-time dump or manually updated simply never gets updated and becomes stale before the model can ever make it to production. One-time dumps never work.
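
As one hedged example of automation (the article does not prescribe a specific scheduler), here is a minimal Apache Airflow sketch. The DAG, task, and script names are hypothetical; the essential property is that the ingest job runs on a schedule and appends to the warehouse without anyone dumping files by hand.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical daily pipeline: a Spark job ingests each day's batch into the warehouse.
with DAG(
    dag_id="daily_product_views_ingest",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",      # Airflow 2.4+; older versions use schedule_interval
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest_batch",
        # {{ ds }} is Airflow's execution date; the job script path is hypothetical.
        bash_command="spark-submit /jobs/ingest_product_views.py {{ ds }}",
    )
```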

D. The data needs to be consistent over time. Yeah, this is the big one.

The values of the data need to be consistent over time. This is often THE major problem impeding data scientists. This requirement has broad implications for how the platform ingests the data. This is also the source of the biggest misunderstanding of what “good data” is between platform devs and the data scientist trying to build models from historical data.

I remember speaking to a platform dev, and he told me that all the data on the platform was good. I then asked him if he had looked at the daily average value of any data point over the last year. He replied no, he had not. So how can he say the data on the platform is good if he has not looked for artificial fluctuations in data values over the last year? Simple: he is not concerned with data more than a few weeks old. Most platform devs are only thinking about what data is displayed on the website. Therefore, to the platform dev, data fluctuations and outages are transient. Once the outage or bad code is fixed, they just move forward. Herein lies the gap in understanding.
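
That check is cheap to run if the data is in a proper warehouse. Here is a minimal sketch, with hypothetical table and column names, that computes the daily average of one value over the past year; level shifts, gaps, or sudden spikes in that series are exactly the artificial fluctuations a platform dev never looks for.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("consistency_check").getOrCreate()

# Hypothetical warehouse table: one row per product view, with an event_date column.
views = spark.read.parquet("s3://warehouse/product_views/")

daily = (views
         .where(F.col("event_date") >= F.date_sub(F.current_date(), 365))
         .groupBy("event_date")
         .agg(F.avg("price").alias("avg_price"),
              F.count("*").alias("row_count"))
         .orderBy("event_date"))

# Eyeball (or plot) the series: a step change or a stretch of missing days means
# the column is not consistent over time, no matter how "good" it looks today.
daily.show(365, truncate=False)
```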

What “consistent data over time” means is that a data scientist can pull data for the past year, two years, or maybe even ten years, depending on the business need, and the values in each column consistently reflect what they are measuring. That is, there are no changes in how the value is calculated and no significant outages of the value. This is a huge ask, but also a requirement for successful modeling and, ultimately, ROI on data science. It's that important.

A poor platform dev will say this is simply not possible. That response is often the difference between an experienced data engineer running the platform and a software/backend dev trying to be a data engineer. It can also be the difference between well-engineered data pipelines and a hodge-podge of code. That is, a poorly designed platform cannot support consistent data sets for data science.

Here are the two main requirements to build a data ecosystem that can support consistent data sets for data science:

A. Streaming data is buffered in a big data tool like an Apache Kafka cluster

  1. Apache Kafka can buffer 7 days' worth of data and can play it back in the event of a downstream processing disruption. This is the high-availability part of the data ingestion system that protects against downstream outages for up to 7 days. Without this, the data will have significant gaps. (See the topic-configuration sketch after this list.)
  2. Kafka supports a pub/sub model that allows real-time data to be consumed for multiple purposes and at variable rates of consumption. This is where the web backend can access and process data in real time for display on the website. (Notice that this is where the website's real-time processing is forked from the persisted data used for data science.)
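
For reference, the 7-day buffer is just a topic retention setting. Below is a minimal sketch using the confluent-kafka admin client; the broker address, topic name, partition count, and replication factor are assumptions.

```python
from confluent_kafka.admin import AdminClient, NewTopic

admin = AdminClient({"bootstrap.servers": "kafka:9092"})   # hypothetical broker address

# Retain 7 days of events so downstream consumers can replay after an outage.
clickstream = NewTopic(
    "clickstream-events",                                  # hypothetical topic name
    num_partitions=12,
    replication_factor=3,
    config={"retention.ms": str(7 * 24 * 60 * 60 * 1000)},
)

for topic, future in admin.create_topics([clickstream]).items():
    future.result()   # raises if the topic could not be created
    print(f"created topic {topic}")
```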

B. All the data that is received is persisted as is. Whether it's streaming data or batch data, it must be persisted in its original content BEFORE any mutations are applied. In some cases, it's OK to deserialize the data before persistence, but, in general, the content of the data should not be altered. The main reason for this step is to allow for bulk reprocessing of the original data when changes are required (a streaming-persistence sketch follows the list below):

  1. Streaming data is consumed from Kafka and stored in the data warehouse
  2. Batch data is persisted as is in its original format in the data lake and then processed into the correct format and stored in the data warehouse
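
One possible implementation of step 1, sketched here with Spark Structured Streaming: the raw Kafka payload is written to storage untouched, partitioned by arrival date, so it can be reprocessed later. The broker address, topic, and paths are hypothetical.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("persist_raw_clickstream").getOrCreate()

# Read the stream from Kafka; the value bytes are kept exactly as received.
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka:9092")   # hypothetical broker address
       .option("subscribe", "clickstream-events")          # hypothetical topic
       .option("startingOffsets", "earliest")
       .load())

# Persist the original payload plus Kafka metadata, before any business logic runs.
persisted = (raw
             .select("key", "value", "topic", "partition", "offset", "timestamp")
             .withColumn("ingest_date", F.to_date("timestamp")))

query = (persisted.writeStream
         .format("parquet")
         .option("path", "s3://warehouse/raw/clickstream/")               # hypothetical raw zone
         .option("checkpointLocation", "s3://warehouse/_chk/clickstream/")
         .partitionBy("ingest_date")
         .outputMode("append")
         .start())

query.awaitTermination()
```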

Requirement A means that streaming data can be replayed up to 7 days back so that all data can be persisted in its original content. This ensures that all data that is received can be persisted to storage, barring a storage outage lasting more than 7 days. Since persisting the data from Kafka is simply pulling from the Kafka topic and writing to a data store, 7 days is plenty of time to resolve any issue.

With all the received data persisted in its original content, all downstream processing can be reapplied at any time. This means if there is a change in how a data value is processed, all affected data sets can be regenerated retroactively.

For example, a company ingests click-stream data from an e-commerce website. The average price of each product viewed by a customer is compiled to show whether this shopper prefers high-priced items or low-priced items. However, someone realizes that the type of currency has not been taken into account. Thus, they add code to translate all prices into dollars before averaging them. The change is deployed.

The effect on the data set is that, prior to the point of deployment of the fix, the average viewed product price is calculated differently than after the deployment. For data science, this means the data before and after the deployment cannot be used in the same model because they are fundamentally different.

What is required is to regenerate the data set going back in time with the new calculation. If the original data has been persisted, this is possible, thus allowing data science to have a data set going back in time long enough to support a model. In this case, the “product views” data set needs to be reprocessed retroactively to repopulate the average viewed product price, as well as any other downstream data set that uses the average viewed product price.
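
A hedged sketch of that backfill is below. The raw views and exchange-rate tables, their columns, and the output path are all hypothetical; what matters is that the corrected logic is applied to the entire preserved history and the derived table is rebuilt in one pass.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("backfill_avg_viewed_price").getOrCreate()

# Hypothetical raw product-view records, persisted in their original form.
views = spark.read.parquet("s3://warehouse/raw/product_views/")
# Hypothetical daily exchange rates to USD.
rates = spark.read.parquet("s3://warehouse/exchange_rates/")

# Apply the corrected calculation (currency conversion) to the whole history.
in_usd = (views
          .join(rates, on=["currency", "event_date"], how="left")
          .withColumn("price_usd", F.col("price") * F.col("rate_to_usd")))

avg_viewed_price = (in_usd
                    .groupBy("customer_id")
                    .agg(F.avg("price_usd").alias("avg_viewed_price_usd")))

# Overwrite the derived table so rows before and after the fix are computed the same way.
avg_viewed_price.write.mode("overwrite").parquet("s3://warehouse/avg_viewed_price/")
```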

But, here is the thing: changes to calculations happen all the time. Many times it’s a fix for something else that inadvertently changes or corrupts the data value. These changes can go unnoticed and accumulate over time. For example, another developer makes what they think is an improvement to the efficiency of how the exchange rates are collected but accidentally causes the final conversion to dollars to be applied incorrectly. This goes on for a few weeks until it's noticed and fixed. However, once fixed, the last few weeks' worth of downstream data needs to be regenerated with the corrected calculation.

This is a huge requirement for data engineering teams and some platform engineers will simply say they cannot do it. And that is a huge gap in the understanding of what it means to properly support data science data needs. It's also a lack of understanding of the scope of the data engineering work required to support data science. The lack of planning for this type of support on a data platform often dictates whether the data science program at a company will be successful or not. And make no mistake, being able to retroactively process and maintain consistent data sets for data science requires a significant investment in the data ecosystem…aka it requires significant time and money.

When a company has an established data processing platform that does not meet the requirements defined above, it simply will not be able to support a data science effort that requires historical data. Unfortunately, many companies are in this position and do not believe they need to overhaul their data pipelines to support data science. These are the situations where the platform devs tend to argue that this is not a reasonable requirement. The result is a conflict between the platform team and the data science team, in which the data science team usually loses since they are also usually the minority and lack a significant voice in the company.

Not being able to support the data science team's data needs is often a quiet failure, where data scientists get frustrated and simply go elsewhere. When executives ask questions about why they cannot make progress on data science initiatives, they are left with explanations that don’t make sense, since the people with the answers have left. All the executives hear is the voice of the platform dev telling them that all the data on the platform is good and that data science just failed. But, make no mistake, there are real reasons for the failures.

Randolf Reiss

AI/ML/Data Science

It is interesting to note that none of the data engineering platforms I know of appear to have any mechanism for regenerating a data set with updated data. This means supporting consistent data sets ends up being a manual and time-consuming operation for data engineering. Sure would be great to have a tool where you could click on a DAG of your job dependencies and say re-run all recursive jobs.
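
As a rough illustration of what such a tool might do internally, the sketch below walks a hypothetical job-dependency DAG breadth-first and returns the changed job plus every downstream job, in an order they could be re-run.

```python
from collections import deque

# Hypothetical dependency DAG: each job maps to the jobs that consume its output.
downstream = {
    "raw_product_views": ["avg_viewed_price"],
    "avg_viewed_price": ["shopper_segments", "price_sensitivity_model"],
    "shopper_segments": [],
    "price_sensitivity_model": [],
}

def jobs_to_rerun(changed_job):
    """Return the changed job plus all recursive downstream jobs, in rerun order."""
    order, seen, queue = [], {changed_job}, deque([changed_job])
    while queue:
        job = queue.popleft()
        order.append(job)
        for child in downstream.get(job, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return order

print(jobs_to_rerun("avg_viewed_price"))
# ['avg_viewed_price', 'shopper_segments', 'price_sensitivity_model']
```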
