Data management as code
- "Your business model is interesting. Tell me, how do you handle data management efficiently?"
Scling's business model is built on exploiting the data divide - the gap in data capability between the crowd of companies and a small clique of big tech companies and startups. Our primary challenge is to explain the business value potential of crossing the divide, which has so far proven difficult. In previous articles, I have illustrated the width of the divide and the delusion that prevents us from seeing it. In this article, we will look at data management, a field where there is a distinct difference in practices.
The question above came from the manager of an analytics team. I used to perceive data management as a primarily technical challenge. I had to revise that perception on my first consultancy assignment, which took me to an enterprise environment. Next to my desk was a team working with master data management (MDM) of the customer database. Master data management is the activity of building a coherent and consolidated view of all entities of a particular type, in this case customers. Five employees were combining and cleaning user records, using a commercial tool from Informatica. The tool supported automation through simple rules for recurring patterns, but the work that I observed was primarily manual. Unlike the curious analytics manager, the enterprise MDM team reacted with denial when I hinted that it was possible to raise the level of automation; the company had already made its strategic tooling investment.
Developers in another team were using a different Informatica product to build data flows integrating databases with Hadoop. I was curious whether their automation could be improved and asked them to show me their workflow. They showed me a graphical integration builder, where each developer had an individual workbench. I asked an Informatica sales engineer whether it was possible to share workbenches within a team, e.g. by committing them to git.
- "We have a proprietary version control system for workbenches."
The width of the divide started to dawn on me.
I once worked in a Spotify team that was responsible for master data management of our user information - the same task as the enterprise team's. We were also a team of five, but we treated the task as an engineering challenge. We had a data pipeline called MasterUser, which combined dumps from the different sources of user information - the primary user database, a NoSQL database with complementary information, web logs for anonymous users, systems for partnership integrations, etc. The picture below illustrates the dataflows. The logic was similar in complexity to the enterprise's customer MDM, but covered more records - about a hundred million, compared to a few million. There were recurring patterns for cleaning and combining records, which we expressed as data processing logic. Like any other software, the pipeline was subject to software engineering practices such as code review, automated testing, and continuous integration.
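To make "expressed as data processing logic" concrete, here is a minimal sketch of the kind of merging and cleaning rule such a pipeline encodes. The source names, fields, and precedence order are illustrative assumptions for this article, not Spotify's actual schema or code.

```python
# Minimal sketch of MDM-style record merging expressed as code.
# Source names, fields, and precedence are illustrative assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class UserRecord:
    user_id: str
    email: Optional[str] = None
    country: Optional[str] = None
    source: str = "unknown"

# When sources disagree, the primary user database wins over the
# complementary NoSQL store, which wins over web logs.
SOURCE_PRIORITY = {"user_db": 0, "nosql_store": 1, "web_logs": 2}

def clean(record: UserRecord) -> UserRecord:
    """Recurring cleaning patterns, written once instead of applied by hand."""
    email = record.email.strip().lower() if record.email else None
    country = record.country.strip().upper() if record.country else None
    return UserRecord(record.user_id, email, country, record.source)

def merge(records: list[UserRecord]) -> UserRecord:
    """Combine all records for one user into a master record, field by field."""
    ordered = sorted((clean(r) for r in records),
                     key=lambda r: SOURCE_PRIORITY.get(r.source, 99))
    master = UserRecord(user_id=ordered[0].user_id, source="master")
    for r in ordered:
        master.email = master.email or r.email
        master.country = master.country or r.country
    return master
```

In a real pipeline this logic runs as a distributed job over the full set of dumps, and because it is plain code it can be unit tested, reviewed, and deployed through continuous integration like everything else.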
In addition to MasterUser, we were also responsible for delivering nine other core datasets that were used by many teams - the tracks that had been played, the artist metadata, etc. Our primary mission, however, was not core data pipeline operations, but to develop a new generation of our data platform. During the transition, we also had to keep the old one running. We once measured the time spent on maintenance versus development of new features, and concluded that about 25% of engineering time was spent to "keep the lights on." For five engineers, that is roughly 1.25 full-time engineers of maintenance, split between operating the data platform and the ten core dataset pipelines, which means that MasterUser and the other core pipelines each required less than 10% of a full-time engineer. Compared to the five full-time employees doing customer MDM at the enterprise, that is a 50x difference in efficiency.
Master data management is one aspect of data management. There is also data governance (preventing bad things from happening), data provenance (determining what has happened), data cataloguing, etc. In enterprise environments, these activities are also to a large degree manual processes and human interaction, supported by a growing number of commercial tools. Most data management tools integrate with data storage systems, probe existing data, and do their thing on the harvested data - apply governance, calculate lineage, build a catalogue, etc.
With an everything-as-code engineering culture, all systems, the data inside them, the integrations between them, the data processing flows that create new data artifacts, as well as the permissions to access data are all defined as code. There is a class or other code definition for every entity, and we can use tools for code documentation, code navigation, and code review to manage our resources. We can leverage existing software quality assurance and deployment processes to manage operational risk and ensure compliance. This effectively removes the need for non-code or low-code data management tools. We avoid complexity and operational burden, and also allow for a higher degree of automation. If business logic is instead expressed in data tools focused on graphical interfaces, such as those from Informatica, Qlikview, Tableau, etc., the interface becomes an automation termination point; it provides a pleasant experience for staff who use it for manual work, but prevents efficient automation on top of the tool.
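As an illustration of what "defined as code" can look like, here is a hypothetical sketch of a dataset entity and its access permissions declared in code. The class and field names are my own for this example, not taken from any particular tool; the point is that a change of ownership or permissions becomes an ordinary code change that goes through review, testing, and deployment.

```python
# Hypothetical sketch: datasets and access permissions declared as code,
# so changes are reviewed, tested, and deployed like any other code change.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Dataset:
    name: str
    owner_team: str
    schema: dict                      # column name -> type
    readers: frozenset = field(default_factory=frozenset)  # teams with read access

MASTER_USER = Dataset(
    name="master_user",
    owner_team="data-infrastructure",
    schema={"user_id": "string", "email": "string", "country": "string"},
    readers=frozenset({"analytics", "recommendations"}),
)

def can_read(team: str, dataset: Dataset) -> bool:
    """Access control becomes a function over declared entities; a data
    catalogue is simply a listing of all Dataset declarations."""
    return team == dataset.owner_team or team in dataset.readers
```

Lineage and governance checks can then be derived by walking these declarations, rather than by probing data systems after the fact.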
Expressing all entities as code addresses the static aspects of data management. There is still a need to manage dynamic aspects, such as different forms of data quality. As you have probably guessed by now, we address this with code as well - code that monitors data availability, code that measures data record quality during data processing, dedicated data jobs that measure data consistency between datasets, and code that visualises data quality in dashboards. Data quality is a first-class citizen, and quality assurance code is tested and handled according to good software engineering practices, like any other code.
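Here is a minimal sketch of what such quality checks can look like as ordinary code. The fields, dataset names, and thresholds are illustrative assumptions; the point is that the checks are testable functions that can fail a pipeline run or feed a dashboard.

```python
# Illustrative sketch of data quality checks as code; fields and thresholds
# are assumptions for the example, not values from a real pipeline.
def record_quality(records: list[dict]) -> float:
    """Share of records with a usable email, measured during processing."""
    valid = sum(1 for r in records if r.get("email"))
    return valid / max(len(records), 1)

def consistent(count_a: int, count_b: int, tolerance: float = 0.01) -> bool:
    """Cross-dataset consistency: record counts should agree within a tolerance."""
    return abs(count_a - count_b) <= tolerance * max(count_a, count_b)

def check_master_user(master_records: list[dict], source_count: int) -> None:
    """A dedicated quality job: fails the run (and alerts the owning team)
    when quality or consistency drops below agreed thresholds."""
    if record_quality(master_records) < 0.95:
        raise ValueError("too many master_user records lack an email")
    if not consistent(len(master_records), source_count):
        raise ValueError("master_user and source record counts diverge")
```

Because these checks are plain functions, they are themselves covered by unit tests and deployed through the same pipeline as the processing logic they guard.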
I have not bothered to ask Spotify for permission to publish this article, since it all happened eight years ago. We were very transparent about our work at the time, and shared it in various forums in bits and pieces, including the MasterUser picture above. But our practices have not spread much. I have not encountered any other company doing MDM with the same level of automation and efficiency. It is common, however, to encounter inadequate MDM that shines through in products, negatively affecting user experience. This article on an IKEA shopping session provides an example.
Companies that I meet are generally unaware that the tech elite is a decade ahead in data efficiency. They are not aware that every commercial data tool they buy takes them further away from the most efficient processes. As long as the unawareness persists, the capability gap will widen, leaving Scandinavian companies vulnerable to disruption. It is our mission to change that, one company at a time.