Don’t Democratize your Data: Defend your Domain Boundaries
It seems that one of the grounding principles of the data community these days is the concept of data democracy. In essence, this is the idea that unleashing the power of data across an organization requires making the data far more accessible. It assumes that the core problem holding back innovation is lack of data access. Sharing is great. We learn that in kindergarten. Shouldn’t it apply to business software systems as well? Well, I am against this idea for several reasons. To start, democratization of data is a terrible name for this concept. Democracy is about everyone having a say in how their group functions. It has nothing to do with everyone having access to everything. We still lock our front doors in democratic countries. It’s simply a bad name for an idea that isn’t so great to begin with. So let me describe what’s wrong with this concept.
How can someone not like democratizing data? Anybody up for data authoritarianism? I don’t see many hands. Ok, well, let me explain a bit more. I don’t mean that data and information should be hoarded by totalitarian business groups. I don’t mean that people shouldn’t share and collaborate. If people just used data democratization as a term for collaboration, then I’d be all for it. Who wouldn’t be? But the idea is really about making data available in well-organized, well-described formats so that people can visit a platform and help themselves to it. While this is a good idea for some data, it is a bad idea for most of it. In fact, it’s often a terrible idea that will lead to chaos and inflexibility.
There are, of course, other questions about who should have access to data: privacy and security concerns. These are important topics in their own right that add supporting criticisms of the free-data concept, but they are not what I want to focus on. Even if we were all angels and had no worries about security, data democratization would still be a bad idea. The reason is that it leads to the opposite of what we want. It actually reduces productivity by increasing coupling between systems, making them inflexible and unmaintainable. Excess coupling between systems is the biggest thing wrong with industrial software and the biggest reason productivity falls off with scale.
The key concept is encapsulation. This idea is a mainstay of software development and is often associated with object-oriented programming, though it is far more general and important. In fact, it’s really the key idea of software engineering, and even of engineering in general. In a nutshell, it is this: people can collaborate to design large systems if they can break the larger problem into loosely coupled components, such that the builders of each component can deliver value while remaining almost completely ignorant of the internals of all the other components. For example, Ford Motor Company could still deliver viable cars if just about everyone outside of the engine group was under the impression that a motor was just a box containing a large hamster that drinks gasoline and burps exhaust fumes. Knowing the details of what an engine actually is, and especially having to know those details, is actually a net negative for those designing everything else. It couples them to the engine more than they need to be, making it harder for them to do their job.
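The hamster-in-a-box idea translates directly into code. Here is a minimal sketch, with entirely made-up names (`Engine`, `Drivetrain`, the simplified torque curve): the rest of the car depends only on a narrow interface, while the engine’s internal measurements stay hidden behind it.

```python
# Illustrative sketch of encapsulation. The rest of the car sees only
# the torque interface, never the engine's internal workings.
class Engine:
    """Turns fuel into torque. Internals are private by convention."""

    def __init__(self):
        self._rpm = 0          # internal detail: hidden from other components
        self._fuel_rate = 0.0  # internal detail

    def torque(self, throttle: float) -> float:
        """The only thing other components need: throttle in, torque out."""
        # Internal behavior can change freely without breaking callers.
        self._rpm = int(800 + 6000 * throttle)
        self._fuel_rate = 0.1 + throttle
        return 400.0 * throttle  # deliberately simplified torque curve


class Drivetrain:
    """Depends only on the torque interface, not on RPM or fuel internals."""

    def __init__(self, engine: Engine):
        self.engine = engine

    def wheel_force(self, throttle: float, gear_ratio: float) -> float:
        return self.engine.torque(throttle) * gear_ratio
```

The engine group can rewrite everything behind `torque()` at will, and the drivetrain group never notices. That freedom is the whole point.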
Sticking with this automobile analogy, consider the tachometer, or RPM (revolutions-per-minute) gauge, in a car that tells you the rotational velocity of the engine. What is this thing useful for? Basically, these days, it is not useful at all. It is something of a vestige of times when anyone who drove a car needed to know a great deal about how cars work in order to operate them. There is no real purpose for it anymore other than that some people think it looks cool. And, I admit, it does.
The RPM gauge is not hugely problematic. It does add some cost for little value, but that’s not the real danger of such things. The danger is that other component developers may decide to make use of that data for other purposes. Let’s say the sound system group wants to use it to increase the volume of the stereo when the RPM is high, so that the music isn’t drowned out by the engine. How clever! While that could possibly make sense and add value, it comes with a rather high price. Now there is strong coupling between the engine and the stereo, of all things. Now the stereo group must learn something about hamsters, or rather gasoline engines.
Let’s say the engine group is constantly innovating to improve the engine. They alter the size and density of the engine block. They fiddle with valves and other things. They make it accelerate and decelerate faster. This is a big win. But in doing so, they make a change where the tachometer briefly disengages from the engine, causing the reading to drop suddenly and then spike immediately after. Tachometer watchers will barely notice. But this has the unfortunate effect of causing the stereo sound to cut out and then blast at full volume. This ends up causing people to freak out and, maybe, occasionally, crash. Yikes!
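The failure mode above can be sketched in a few lines. This is a hypothetical anti-pattern, not real automotive code: the stereo couples its volume to an exposed engine internal (the RPM reading), so a harmless engine redesign produces exactly the silence-then-blast behavior described.

```python
# Illustrative anti-pattern: the stereo couples to an engine internal.
class Engine:
    def accelerate(self) -> list:
        # After the redesign, the tachometer briefly disengages:
        # it reads 0, then spikes. Harmless to humans, not to couplers.
        return [0, 7000]  # RPM readings during acceleration


class Stereo:
    """Ties volume to RPM -- so it silently depends on engine internals."""

    def volume_for(self, rpm: int) -> int:
        return min(10, rpm // 700)  # louder at high RPM, capped at 10


engine = Engine()
stereo = Stereo()
# The redesign makes the volume drop to zero, then blast to maximum:
volumes = [stereo.volume_for(rpm) for rpm in engine.accelerate()]
```

Nothing in the stereo’s code is wrong in isolation; the bug lives in the coupling itself, which no single group owns or tests.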
This is a made-up example, but it hopefully conveys the main point. The rest of the systems in a car should not need to know that an engine is something that spins or even has an RPM measure. Ignorance is bliss in that it allows them to design systems that are mostly decoupled. They should just see it as something that turns gasoline into torque. Some groups don’t even need to know that it exists at all. This separation of concerns along domain boundaries is crucial to any successful engineering.
This enforced ignorance yields huge gains in reducing the complexity of building a large integrated system. Each group just needs to worry about its own job and the services it must provide to the systems it is coupled to. An engine’s job is to ingest fuel and create torque. That’s it. The more freedom it has to do that job well, the better the car is going to be. Sharing data about how it does this with all the other systems doesn’t spur innovation. It creates a design nightmare and increases the cost and risk of building the car. Good design requires defending these domain boundaries.
One should see how this analogy applies to software. Casually putting all the data out on a platform for self-service is essentially the same as posting all the engine’s internal measurements. It’s going to achieve the opposite of what you want, which is better-performing overall systems and more productive working environments. Now of course some data needs to be shared, just as our engine takes in fuel and outputs torque. But the goal should be to minimize this communicated data and to maximize the freedom and independence of components. That means hiding as much internal data as possible. Data platforms have a place, but they should be hubs for managing these minimal couplings, not data free-for-alls.
Encapsulation is by no means a new concept in software engineering. But it seems to be a topic not well understood by many business people. And that’s a problem, because those non-technical people often come up with the highest-level designs of business solutions. It is at these levels of decision making that many of the coupling decisions get made. Unfortunately, the technical people who understand the most about how to do that well are sidelined from those conversations and are eventually burdened with a difficult-to-manage high-level design. Those who don’t appreciate the dangers of tight coupling are often the ones most excited by the ideas behind data democratization. And that’s what sends warning signals up my spine when I hear business leaders championing these ideas and technical people not objecting.
So, the alternative is really just putting greater emphasis on encapsulation and trying to limit the amount of data that needs to be shared in order to perform necessary collaboration. There is a small subset of data that should be widely shared. This data should be completely unambiguous, unlikely to change in meaning, unlikely to ever go away, and not sensitive. For example, an online retailer might create an open database of the prices of their products at any time or place (or possibly channel). They may even decide to make it completely public, as it is mostly public anyway. It is just product ID, timestamp, and price change. This could, very realistically, result in innovations that create value.
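A record meeting those criteria might look like the sketch below. The field names and the example values are illustrative assumptions, not any real retailer’s schema; the point is that the record is small, unambiguous, and stable.

```python
# A minimal sketch of the stable, shareable record described in the text:
# product ID, timestamp, and price change. All names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)  # immutable: published records should never mutate
class PriceChange:
    product_id: str
    changed_at: datetime   # always UTC, to keep the meaning unambiguous
    new_price: Decimal     # Decimal avoids float rounding on money
    channel: str = "web"   # place or channel where the price applies


record = PriceChange(
    product_id="SKU-1042",
    changed_at=datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc),
    new_price=Decimal("19.99"),
)
```

Freezing the dataclass and pinning the timestamp to UTC are small design choices, but they are exactly the kind of ambiguity-removal that makes data safe to share widely.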
But even this simple and obvious case for data democratization has its cost in terms of coupling. What about currency? Are prices going to be decided in different places in one currency, such as USD, and then displayed and recorded in local currencies based on some real-time conversion? There is complexity there. That conversion mechanism needs to be made accessible as well for people to use the data correctly. What if the retailer wanted to experiment with dynamic pricing? What about discount codes, coupons, and the like? Now the existence of that table, and whatever is coupled to it, is an impediment. How do you handle this? The customer might have paid a different price than the one recorded.
All of these difficulties subtract from the value of having this openly shared data. Sharing widely like this might still be a good idea, but one has to weigh these negative aspects and understand that one is trading increased information availability for added complexity and reduced flexibility.
Much of the problem comes from this concept of creating an open bulletin board for data, available to anyone who has access. This is a change from the usual idea of bilateral communication where the provider of information knows the consumer and knows what they want to do with the data. In this simpler bilateral scheme, the two parties not only maintain the data connection, but they also ensure that they share an understanding of what the data really means because they know what each other has to accomplish. And what data really means is inseparable from how it will be used.
We would always like to try to document what data really means so that consumers simply need to read the documentation. This frees the data provider from having to maintain those bilateral relationships. This idea of well designed data dictionaries is a key aspect of data democratization. However, in practice, any short description of data is going to be imperfect. Even with solid documentation, data providers will find data consumers misusing the data if they care to inquire. If they do not even know who is using data and for what purpose, we end up with problematic couplings that will lead to rigidity and error. These bilateral relationships of shared understanding should not be abandoned lightly for the unrealistic dream of widespread, self-service, hands-off data provisioning.
The careful reader might notice a seeming contradiction. Didn’t we argue that different components shouldn’t have to know much about the other components? Now we are arguing that understanding how the coupled components will use the data is important to avoid miscommunication. But this isn’t a contradiction. The shared understanding between coupled components should be about what each component needs to do, not how it does it. Data elements related to how a component does something are what we should not share. The data we do share is the piece that both parties need to agree on if there is to be any coupling at all.
For example, the engine needs 87 octane gasoline fuel-vapor with pressure between 40-60 PSI and temperature between 40-70 degrees Fahrenheit. That understanding is the contract between the fuel injector and the engine. They need to ensure that they both understand these terms exactly. The fuel injector group should know that the engine group is going to combust the fuel vapor to drive pistons and convert that to torque on a drive shaft. That’s enough understanding. The engine group, on the other hand, does not need to know where the fuel comes from, where it is stored, or how it got heated into that temperature range. Likewise, the drive shaft group needs to know what torque to expect from the engine. It doesn’t need to know exactly how it was generated.
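Such a contract can be written down as an explicit, checkable interface rather than tribal knowledge. The sketch below encodes exactly the terms from the text (87 octane, 40-60 PSI, 40-70 °F); the type and function names are made up for illustration.

```python
# The fuel contract from the text, made explicit and checkable.
# Both groups agree on exactly these terms -- and nothing more.
from dataclasses import dataclass


@dataclass
class FuelVapor:
    octane: int
    pressure_psi: float
    temperature_f: float


def meets_engine_contract(fuel: FuelVapor) -> bool:
    """True if the delivered fuel vapor satisfies the agreed contract."""
    return (
        fuel.octane == 87
        and 40 <= fuel.pressure_psi <= 60
        and 40 <= fuel.temperature_f <= 70
    )
```

Note what the contract does not contain: nothing about tanks, pumps, or heaters. Either group can redesign its internals freely as long as this small, shared surface stays satisfied.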
There is still a case to be made about documenting, to some degree, what data exists and more importantly what each component or group does. And it can make sense to allow more access to some people, such as the data scientists, purely for the purpose of exploration and experimentation. This could lead to new ideas about how to combine data or functionality across domain boundaries. However, successful ideas that arise from these experiments should then inform decisions between the leaders of these domains concerning how to collaborate in order to achieve these gains while still defending their domain boundaries in order to keep the coupling minimal. If the gains to be had are worth the complexity of this new coupling, they may choose to proceed and collaborate.
It is the ability to do this kind of collaboration well that truly allows for innovation at industry scale. The final form of this interaction, meaning what work will be done where and which data is transferred, will look very different from the data-sharing pattern used in the data scientist’s proof of concept. And it won’t generally make sense to build a data sharing platform just for these kinds of experiments. Data can be accessed in more conventional, manual ways for this type of throwaway work.
So, in summary, I suggest we need to rein in the data-democratization fanfare. Instead, people should learn more about the crucial concept of encapsulation and how it drives optimal design, which is what really allows us to control complexity and share effectively. Sharing data without focusing on encapsulation will solve one problem only to create a worse one. Once excess coupling is in place, it tends to spread and is very difficult to untangle or reverse. One might find they can build useful systems quickly, but then realize that they cannot modify the subsystems without simultaneously working on many others. This leads to a quick burst of productivity followed by stagnation and an inability to evolve. More attention should be given to design concepts focused on controlling complexity and maximizing flexibility. When we master that art, we can not only build useful systems, but we can keep evolving them quickly to keep up with the changing business world.