The data divide - data success factors vs friction
For some time we have witnessed the so-called Data Divide - a wide and growing difference between companies in their capability to collect, manage, and ultimately get value from data. This trend is a continuation of the Big Data project failures that Gartner reported back in 2017, stating that 85% of big data projects fail and 87% never reach production.
We will soon see similar reports for artificial intelligence (AI) projects as well - the AI divide is the next point where the technical leaders jump ahead of the incumbents. This follows naturally, since AI features require lots of data that is collected, processed, and managed in a controlled manner. While many companies create prototypes or build single AI features with great effort, only a small fraction of companies, the technically leading ones, succeed in consistently releasing AI functionality, if we define product success as both having sufficient quality to match leading competitors and yielding a clear return on investment.
What do the companies that succeed do differently? Why is it so difficult for traditional companies to achieve the same results as the technical leaders? What are the differences between the enterprises and the "born digital" companies that are so much more successful with data and AI?
We have spent the last decades moving between traditional enterprise companies and successful "grownup startups," and have observed a set of differences, which we attempt to explain below. Their success turns out to have very little to do with technology and much more to do with ways of working and culture. Let's look into the differences by examining the success factors and the friction points that prevent organisations from creating value with data and AI.
Data success factors
After working with both successful and unsuccessful data efforts, we have concluded that the following aspects are the most important for success, in decreasing order of importance:
A: Fast iterations and feedback loops for developing, debugging, and trying out new data flows.
B: Fast iterations for integrations with entities outside the data platform - data sources, data destination systems, and authorisation structures.
C: Minimal operational overhead for data flows.
D: A consistent set of processes and ways of working with data that supports A-C.
E: A data platform technology stack that supports A-D.
Fast, feedback-driven iterations at low cost
Starting from the top, fast feedback cycles are essential for any type of software development, but even more important for developing data-driven features due to the inherent unpredictability of working with data. Many ideas turn out to be ineffective or infeasible, so it is crucial to be able to run experiments on production data, fail fast, and move on to the next. So what is fast?
I worked at Spotify 10 years ago, in a team that set out to democratise the use of data in the company. We defined the ambition that all teams with developers should be able to launch a new pipeline within a day and fix pipeline problems within an hour. We got a bit further than that, achieving latencies of 2-4 hours and 30 minutes, respectively. Those numbers are for ordinary pipelines for analytics or data-fed features, e.g. computing top lists or search indexes.
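To make "ordinary pipeline" concrete, here is a minimal sketch of such a job in plain Python: a daily top list computed from one day of play events. The file layout, field names, and JSON-lines format are illustrative assumptions, not any particular company's implementation.

```python
"""Sketch of an 'ordinary' daily batch pipeline: compute a top list from one
day of play events. Paths, field names, and the JSON-lines format are
illustrative assumptions."""
import json
import sys
from collections import Counter
from pathlib import Path

def run(date: str, input_root: str = "data/raw/play_events",
        output_root: str = "data/derived/top_tracks", top_n: int = 100) -> None:
    # Read one daily partition of raw events (one JSON object per line).
    in_dir = Path(input_root) / date
    counts = Counter()
    for part in sorted(in_dir.glob("*.jsonl")):
        with part.open() as f:
            for line in f:
                event = json.loads(line)
                counts[event["track_id"]] += 1

    # Emit the derived dataset as a new daily partition; rerunning the job
    # for the same date overwrites it, which keeps the pipeline idempotent.
    out_dir = Path(output_root) / date
    out_dir.mkdir(parents=True, exist_ok=True)
    with (out_dir / "part-0.jsonl").open("w") as f:
        for track_id, plays in counts.most_common(top_n):
            f.write(json.dumps({"track_id": track_id, "plays": plays}) + "\n")

if __name__ == "__main__":
    run(sys.argv[1])  # e.g. python top_tracks.py 2024-05-01
```

The point is not the code itself but the turnaround: when the surrounding platform is light, a job of this shape can be written, deployed, and producing its first daily dataset within hours.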
Machine learning (ML) development involves longer feedback loops. The backbone of ML is data pipelines, and many pipelines are needed for each ML feature. For ML prototypes, a few data pipelines may be adequate, but for production-quality ML features, dozens of data pipelines are often required for offline testing, model performance measurement, data drift monitoring, model drift monitoring, etc. Each ML development iteration cycle consists of multiple pipeline iteration cycles, and the speed of pipeline iterations is therefore crucial.
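As an illustration of one of those supporting pipelines, here is a sketch of a data drift check that compares current feature values against a reference sample using the population stability index (PSI). The bin count, the 0.2 alert threshold, and the synthetic data are assumptions for the example, not a prescribed setup.

```python
"""Sketch of a data drift monitoring step: compare today's feature values
against a reference sample using the population stability index (PSI)."""
import numpy as np

def psi(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    """Population stability index between a reference and a current sample."""
    # Bin edges come from the reference distribution's quantiles; current
    # values outside the reference range are clipped into the end bins.
    edges = np.quantile(reference, np.linspace(0.0, 1.0, bins + 1))
    ref_counts, _ = np.histogram(np.clip(reference, edges[0], edges[-1]), edges)
    cur_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), edges)
    # A small floor avoids division by zero and log(0) for empty bins.
    ref_frac = np.clip(ref_counts / len(reference), 1e-6, None)
    cur_frac = np.clip(cur_counts / len(current), 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

if __name__ == "__main__":
    rng = np.random.default_rng(42)
    reference = rng.normal(0.0, 1.0, 10_000)  # e.g. last month's feature values
    current = rng.normal(0.6, 1.0, 1_000)     # today's values, shifted
    score = psi(reference, current)
    # A common rule of thumb: PSI > 0.2 is drift worth alerting on.
    print(f"PSI = {score:.3f}", "DRIFT" if score > 0.2 else "ok")
```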
This presentation by Josh Baer from 2022 describes Spotify's ML iteration loops and measurements of cycle latency. We'll use numbers from his presentation as an example of a high ambition when it comes to data innovation iteration speed.
In the presentation, he mentions that most teams take 2-6 weeks to get feedback from evaluating an ML prototype, and that this latency is long enough to be concerning.
He is also concerned that it took 30% of teams more than a quarter to build an ML production feature.
He describes how an ML feature innovation iteration consists of multiple smaller iterations. These typically consist of iterations on individual data pipelines or groups of pipelines, e.g. for feature engineering and/or offline testing.
The ambition for the platform mentioned in the presentation is to bring the total ML feature cycle time down to five weeks, with the fastest iteration cycles down to a day, which is consistent with our ambition of launching a new pipeline within a day.
These numbers are 10-100x faster than the iteration cycle times we typically see in enterprise companies, where it usually takes 6-24 months to go from idea to production. The success stories presented at conferences are in the lower range of that span, for example the 8-month iteration time achieved by Nokia, described at the NDSML Summit 2019. Companies that take 24 months to develop flows do not usually present those achievements at conferences, but there is at least one such example presented by Dutch Rabobank a few years ago (the presentation is no longer available online…), and we have seen others in the wild. This gap in data and ML capability between the tech leaders and traditional companies is called the "data divide" or, nowadays, the "AI divide."
The iterations described by Josh in the presentation primarily correspond to item A in our list above - the ability to quickly change or deploy new data pipelines. They sometimes correspond to item B - the ability to iterate on integrations. Item B is needed e.g. when it is necessary to use a new data source, when emitting data to a new system, or when a new system or team needs access to datasets or databases, either for integration or for debugging purposes. These types of changes are not as frequent as data pipeline changes, but they happen on a regular basis, and it is crucial that they are just as quick. With today's cloud technology, it is possible to achieve a latency of hours for these iterations.
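One way to make such integration changes quick is to treat them as declarative configuration that is reviewed and applied like any other code change, rather than handled through tickets. The sketch below is a hypothetical illustration of that idea; the Grant structure and the print statements stand in for whatever access mechanism the underlying platform (cloud IAM, database grants, ...) actually provides.

```python
"""Sketch of treating dataset access as declarative configuration, so that an
integration change is a reviewed code change rather than a manual ticket."""
from dataclasses import dataclass

@dataclass(frozen=True)
class Grant:
    dataset: str
    principal: str   # e.g. a team's service account
    role: str        # e.g. "reader" or "writer"

def plan(desired: set, current: set) -> tuple:
    """Return (grants to add, grants to revoke) to converge on the desired state."""
    return desired - current, current - desired

# The desired state lives in version control next to the pipeline code.
desired = {
    Grant("sales_reports", "team-analytics@example.iam", "reader"),
    Grant("user_events", "team-recsys@example.iam", "reader"),
}
current = {
    Grant("sales_reports", "team-analytics@example.iam", "reader"),
}

to_add, to_revoke = plan(desired, current)
for g in to_add:
    print(f"grant  {g.role} on {g.dataset} to {g.principal}")    # platform API call goes here
for g in to_revoke:
    print(f"revoke {g.role} on {g.dataset} from {g.principal}")  # platform API call goes here
```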
If a company manages to create lots of data pipelines, it will spend a lot of time maintaining and operating them unless operations are cheap, which is why item C becomes important. We can get some insight into the operational data engineering efficiency of different companies by comparing the volume of their data processing. Data processing is most often done in batches, producing refined, valuable data artifacts from raw data, e.g. by transforming user action events into reports, tables that feed dashboards, or recommendation indexes. Each such flow produces the data artifacts as datasets, emitted on a regular basis, e.g. one sales report per day. Intermediate datasets shared between flows, e.g. a daily curated list of all users, are also produced along the way.
We can use the number of datasets produced per day as a proxy measurement of data processing throughput and value extraction from data. Each produced dataset must have some kind of business value, or the flow that produces it would be shut down. If we look at different types of companies, we see surprisingly large differences, spanning many orders of magnitude. Companies are rarely public with these numbers, but in a typical enterprise data warehouse environment, automated ETL pipelines produce on the order of dozens or hundreds of datasets per day. From numbers presented by Spotify when describing their cloud migration (Josh again), we can deduce that in 2018 they were producing hundreds of thousands of datasets per day. In 2016, Google revealed that they were creating 1.6 billion datasets per day. While each dataset in such environments is probably not as valuable as a sales report produced by a data warehouse, it is clear from the numbers that these companies are able to get more value from data than the average company. It is also clear that the operational processes must be different. Processes that support the typical data warehouse volume of about 1 dataset / day / developer will not be economically feasible to scale to the Spotify/Google volumes of 1K-100K datasets / day / developer.
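Measuring this proxy is straightforward if each pipeline run records the datasets it produced. A minimal sketch, assuming run metadata is available as simple records (the record format here is invented for illustration):

```python
"""Sketch of measuring data processing throughput as datasets produced per
day, assuming each pipeline run logs the dataset partitions it emitted."""
from collections import defaultdict

run_log = [
    {"date": "2024-05-01", "pipeline": "sales_report",
     "outputs": ["reports/sales/2024-05-01"]},
    {"date": "2024-05-01", "pipeline": "user_features",
     "outputs": ["features/users/2024-05-01", "features/users_curated/2024-05-01"]},
    {"date": "2024-05-02", "pipeline": "sales_report",
     "outputs": ["reports/sales/2024-05-02"]},
]

# Count distinct datasets emitted per day across all pipelines.
datasets_per_day = defaultdict(set)
for run in run_log:
    datasets_per_day[run["date"]].update(run["outputs"])

for date, outputs in sorted(datasets_per_day.items()):
    print(date, len(outputs), "datasets")
```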
Ways of working
We have now described where the tech leaders are for items A, B, and C, and therefore which ambition we must have in order to compete with disruptive companies like Amazon in terms of data & ML innovation - iteration cycles of minutes or hours rather than days or weeks, and an operational efficiency that allows developers to handle data pipelines producing thousands of datasets per day. I have spent the last 10 years trying to figure out how to replicate the data and AI success of born-digital companies in other contexts. In order to get there, we have designed processes and ways of working with data (item D) that are optimised for quick iterations and minimal operations. We have then designed a technology stack (item E) optimised for these processes.
Of the items A-E, the technology stack is the least important. It matters in the sense that technology can implicitly push teams towards better or worse ways of working, but it is possible to build good data processes on different stacks. Many data efforts also limp under the weight of heavy technology. The technology we use has evolved over time, and by discarding as much heavy technology as possible and investing in lightweight technology that strengthens our processes, we have found a minimal technology stack that naturally supports the desired way of working.
Since it is very simple and built primarily on standard cloud services, it requires minimal operational overhead. It does not have as many features as the stacks built at digitally mature Scandinavian companies, such as King, Klarna, and Spotify, but those have taken hundreds of person-years of trial and error to develop. We built the first incarnation of this stack at Bonnier News in two weeks, and set up an instance for our most recent client in two days.
The stack contains the functionality necessary to support quick iterations and effective data flow operations, but it is also very simple, with few, thin layers of abstraction, which is important for agility and for avoiding operational overhead. We have managed, with a modest investment, to achieve both fast iterations and operational overhead on par with the tech leaders. After a year with Scling's first client (in retail), a team of three developers ran 70 pipelines from 40 sources, producing 3,700 datasets / day. So it is possible to catch up and match the data leaders in data processing efficiency, despite their having spent over a decade improving data productivity.
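To relate those numbers to the per-developer figures used earlier, a back-of-the-envelope calculation (the warehouse baseline of roughly one dataset per day per developer is the order-of-magnitude figure mentioned above):

```python
# Back-of-the-envelope comparison using the numbers quoted above.
datasets_per_day = 3_700
developers = 3
warehouse_baseline = 1  # datasets/day/developer, order of magnitude

per_developer = datasets_per_day / developers
print(f"{per_developer:.0f} datasets/day/developer, "
      f"~{per_developer / warehouse_baseline:,.0f}x the warehouse baseline")
```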
Data friction factors
Given that we know how to solve A-E, two major challenges remain. First, given a capability to process data, it is not necessarily easy to translate it into improved business for a company. Data-mature companies manage to grow innovation by forming teams with a mix of data engineers, subject matter experts, and product managers, complemented with data scientists if necessary, and then enabling those teams to innovate bottom-up. It is a slow, organic growth process, however. We don't have a good answer on how to accelerate that process, apart from accelerating the processing capabilities and working closely with product and subject matter experts. No one else seems to have a better answer.
The second major challenge is to prevent factors from creeping in that slow down the iteration cycles that need to be fast, or that add operational overhead. To a large extent, modern companies move faster than traditional ones by doing fewer things where it matters - i.e. avoiding non-essential activities in the critical iteration loops. That does not mean they accept more risk - it means that they have found ways to manage risk that do not exclude fast iterations and low operational overhead. We know from the Accelerate book that the tradeoff between delivery speed and risk is a myth - modern DevOps methods result in both more frequent deployments and more reliable systems.
Unless there is a strong culture of identifying and eliminating iteration cycle friction factors, they tend to increase over time. They do so more rapidly in traditional enterprises due to a number of cultural and structural properties. The friction factors that we have identified from working at both ends of the capability spectrum are:
Each of the friction factors continuously contributes time and overhead to iteration cycles and operations. They are not binary, but traditional companies tend to lean more towards the left for each item. It is rarely the case that an individual item adds enough overhead to prevent effective data innovation; rather, they cause death by a thousand cuts. For the last item, the connection is perhaps not obvious, but having a project mindset sets expectations. If an undertaking is seen as a one-time effort, waiting a few more days or weeks for some dependency might not seem crucial, and eight months is not an unreasonable project duration; but if the initial effort is seen as one out of many iterations, each dependency is important and eight months is an eternity.
The friction factors are highly prevalent in mature enterprise companies, so much so that they are part of the norm and expected. We have seen artificial friction dominate work efforts in all traditional, large companies we have worked with. During the winter of 2022-2023, we interviewed a handful of candidates for a data engineering position at a mature enterprise client. Two of the questions we asked were: "How long is the latency from a data processing idea to running the code in production in your current work environment? If a manager wanted to bring that latency down to one day, what would you change?" The candidates had experience from various large companies in retail, energy, finance, etc. Their responses for latency were 6-12 weeks, and from their suggestions it was clear that they were not aware that it could be much faster. One candidate responded that if management came with unrealistic expectations, he would push back. The data divide is so wide that everyday life on one side of the divide is perceived as completely unrealistic on the other.
Summary
There are logical reasons for this divide in culture and capabilities. Different types of companies are in different phases of the Explore/Expand/Extract curve, which affects their willingness to accept risk. Many mature companies are in the Extract phase, where small, safe steps are appropriate, whereas the grownup startups have gone through a journey from Explore to Expand. Building up a capability to innovate with data at a mature company requires creating a local environment that goes into Explore+Expand mode, for which the company is neither culturally nor structurally positioned.
In order to eliminate friction factors, a strong culture of questioning habits and eliminating waste is necessary. It is easier to achieve such a culture in organisations that grow fast and therefore change frequently and bring in new views that question the status quo. In the next blog posts in the series, we will leave the problems behind and look more closely at what those who succeed with data and AI have done, and also present a safe recipe for following their path to success.