Don’t Democratize your Data: Defend your Domain Boundaries
It seems that one of the grounding principles of the data community these days is the concept of data democracy. In essence, this is the idea that unleashing the power of data across an organization requires making the data far more accessible. It assumes that the core problem holding back innovation is lack of data access. Sharing is great. We learn that in kindergarten. Shouldn’t it apply to business software systems as well? Well, I am against this idea for several reasons. To start, democratization of data is a terrible name for this concept. Democracy is about everyone having a say in how their group functions. It has nothing to do with everyone having access to everything. We still lock our front doors in democratic countries. It’s simply a bad name for an idea that isn’t so great to begin with. So let me describe what’s wrong with this concept.
How can someone not like democratizing data? Anybody up for data authoritarianism? I don’t see many hands. Ok, well, let me explain a bit more. I don’t mean that data and information should be hoarded by totalitarian business groups. I don’t mean that people shouldn’t share and collaborate. If people just used data democratization as a term for collaboration, then I’d be all for it. Who wouldn’t be? But the idea is really about making data available in well-organized, well-described formats so that people can visit a platform and help themselves to it. While this is a good idea for some data, it is a bad idea for most of it. In fact, it’s often a terrible idea that will lead to chaos and inflexibility.
There are, of course, other questions about who should have access to data: privacy and security concerns. These are important topics in their own right that add supporting criticisms of the free-data concept, but they are not what I want to focus on. Even if we were all angels and had no worries about security, data democratization would still be a bad idea. The reason is that it leads to the opposite of what we want. It actually reduces productivity by increasing coupling between systems, making them inflexible and unmaintainable. Excess coupling between systems is the biggest thing wrong with industrial software and the biggest reason productivity falls off with scale.
The key concept is encapsulation. This idea is a mainstay of software development and is often associated with object-oriented programming, though it is far more general and important. In fact, it’s really the key idea of software engineering, and even of engineering in general. In a nutshell, it is this: people can collaborate to design large systems if they can break the larger problem into loosely coupled components, such that the builders of each component can deliver value while remaining almost completely ignorant of the internals of all the other components. For example, Ford Motor Company could still deliver viable cars if just about everyone outside of the engine group was under the impression that a motor was just a box containing a large hamster that drinks gasoline and burps exhaust fumes. Knowing the details of what an engine actually is, and especially having to know those details, is actually a net negative for those designing everything else. It couples them to the engine more than they need to be, making it harder for them to do their job.
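The hamster-in-a-box idea translates directly into code. Here is a minimal sketch, with entirely made-up names (`Engine`, `Drivetrain`, the simplified torque curve): the rest of the car depends only on a narrow interface, while the engine’s internal measurements stay hidden behind it.

```python
# Illustrative sketch of encapsulation. The rest of the car sees only
# the torque interface, never the engine's internal workings.
class Engine:
    """Turns fuel into torque. Internals are private by convention."""

    def __init__(self):
        self._rpm = 0          # internal detail: hidden from other components
        self._fuel_rate = 0.0  # internal detail

    def torque(self, throttle: float) -> float:
        """The only thing other components need: throttle in, torque out."""
        # Internal behavior can change freely without breaking callers.
        self._rpm = int(800 + 6000 * throttle)
        self._fuel_rate = 0.1 + throttle
        return 400.0 * throttle  # deliberately simplified torque curve


class Drivetrain:
    """Depends only on the torque interface, not on RPM or fuel internals."""

    def __init__(self, engine: Engine):
        self.engine = engine

    def wheel_force(self, throttle: float, gear_ratio: float) -> float:
        return self.engine.torque(throttle) * gear_ratio
```

The engine group can rewrite everything behind `torque()` at will, and the drivetrain group never notices. That freedom is the whole point.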
Sticking with this automobile analogy, consider the tachometer, or RPM (revolutions-per-minute) gauge, in a car that tells you the rotational velocity of the engine. What is this thing useful for? Basically, these days, it is not useful at all. It is something of a vestige of times when anyone who drove a car needed to know a great deal about how cars work in order to operate them. There is no real purpose for it anymore other than that some people think it looks cool. And, I admit, it does.
The RPM gauge is not hugely problematic. It does add some cost for little value, but that’s not the real danger of such things. The danger is that other component developers may decide to make use of that data for other purposes. Let’s say the sound system group wants to use it to increase the volume of the stereo when the RPM is high, so that the music isn’t drowned out by the engine. How clever! While that could possibly make sense and add value, it comes with a rather high price. Now there is strong coupling between the engine and the stereo, of all things. Now the stereo group must learn something about hamsters, or rather gasoline engines.
Let’s say the engine group is constantly innovating to improve the engine. They alter the size and density of the engine block. They fiddle with valves and other things. They make it accelerate and decelerate faster. This is a big win. But in doing so, they make a change where the tachometer briefly disengages from the engine, causing the reading to drop suddenly and then spike immediately after. Tachometer watchers will barely notice. But this has the unfortunate effect of causing the stereo sound to cut out and then blast at full volume. This ends up causing people to freak out and, maybe, occasionally, crash. Yikes!
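The failure mode above can be sketched in a few lines. This is a hypothetical anti-pattern, not real automotive code: the stereo couples its volume to an exposed engine internal (the RPM reading), so a harmless engine redesign produces exactly the silence-then-blast behavior described.

```python
# Illustrative anti-pattern: the stereo couples to an engine internal.
class Engine:
    def accelerate(self) -> list:
        # After the redesign, the tachometer briefly disengages:
        # it reads 0, then spikes. Harmless to humans, not to couplers.
        return [0, 7000]  # RPM readings during acceleration


class Stereo:
    """Ties volume to RPM -- so it silently depends on engine internals."""

    def volume_for(self, rpm: int) -> int:
        return min(10, rpm // 700)  # louder at high RPM, capped at 10


engine = Engine()
stereo = Stereo()
# The redesign makes the volume drop to zero, then blast to maximum:
volumes = [stereo.volume_for(rpm) for rpm in engine.accelerate()]
```

Nothing in the stereo’s code is wrong in isolation; the bug lives in the coupling itself, which no single group owns or tests.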
This is a made-up example, but it hopefully conveys the main point. The rest of the systems in a car should not need to know that an engine is something that spins or even has an RPM measure. Ignorance is bliss in that it allows them to design systems that are mostly decoupled. They should just see it as something that turns gasoline into torque. Some groups don’t even need to know that it exists at all. This separation of concerns along domain boundaries is crucial to any successful engineering.
This enforced ignorance yields huge gains in reducing the complexity of building a large integrated system. Each group just needs to worry about its own job and the services it must provide to the systems it is coupled to. An engine’s job is to ingest fuel and create torque. That’s it. The more freedom it has to do that job well, the better the car is going to be. Sharing data about how it does this with all the other systems doesn’t spur innovation. It creates a design nightmare and increases the cost and risk of building the car. Good design requires defending these domain boundaries.
One should see how this analogy applies to software. Casually putting all the data out on a platform for self-service is essentially the same as posting all the engine’s internal measurements. It’s going to achieve the opposite of what you want, which is better-performing overall systems and more productive working environments. Now of course some data needs to be shared, just as our engine takes in fuel and outputs torque. But the goal should be to minimize this communicated data and to maximize the freedom and independence of components. That means hiding as much internal data as possible. Data platforms have a place, but they should be hubs for managing these minimal couplings, not data free-for-alls.
Encapsulation is by no means a new concept in software engineering. But it seems to be a topic not well understood by many business people. And that’s a problem, because those non-technical people often come up with the highest-level designs of business solutions. It is at these levels of decision making that many of the coupling decisions get made. Unfortunately, the technical people who understand the most about how to do that well are sidelined from those conversations and are eventually burdened with a difficult-to-manage high-level design. Those who don’t appreciate the dangers of tight coupling are often the ones most excited by the ideas behind data democratization. And that’s what sends warning signals up my spine when I hear business leaders championing these ideas and technical people not objecting.
So, the alternative is really just putting greater emphasis on encapsulation and trying to limit the amount of data that needs to be shared in order to perform necessary collaboration. There is a small subset of data that should be widely shared. This data should be completely unambiguous, unlikely to change in meaning, unlikely to ever go away, and not sensitive. For example, an online retailer might create an open database of the prices of their products at any time or place (or possibly channel). They may even decide to make it completely public, as it is mostly public anyway. It is just product ID, timestamp, and price change. This could, very realistically, result in innovations that create value.
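A record meeting those criteria might look like the sketch below. The field names and the example values are illustrative assumptions, not any real retailer’s schema; the point is that the record is small, unambiguous, and stable.

```python
# A minimal sketch of the stable, shareable record described in the text:
# product ID, timestamp, and price change. All names are illustrative.
from dataclasses import dataclass
from datetime import datetime, timezone
from decimal import Decimal


@dataclass(frozen=True)  # immutable: published records should never mutate
class PriceChange:
    product_id: str
    changed_at: datetime   # always UTC, to keep the meaning unambiguous
    new_price: Decimal     # Decimal avoids float rounding on money
    channel: str = "web"   # place or channel where the price applies


record = PriceChange(
    product_id="SKU-1042",
    changed_at=datetime(2024, 1, 5, 12, 0, tzinfo=timezone.utc),
    new_price=Decimal("19.99"),
)
```

Freezing the dataclass and pinning the timestamp to UTC are small design choices, but they are exactly the kind of ambiguity-removal that makes data safe to share widely.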
But even this simple and obvious case for data democratization has its cost in terms of coupling. What about currency? Are prices going to be decided in different places in one currency, such as USD, and then displayed and recorded in local currencies based on some real-time conversion? There is complexity there. That conversion mechanism needs to be made accessible as well for people to use the data correctly. What if the retailer wanted to experiment with dynamic pricing? What about discount codes, coupons, and the like? Now the existence of that table, and whatever is coupled to it, is an impediment. How do you handle this? The customer might have paid a different price than the one recorded.
All of these difficulties subtract from the value of having this openly shared data. Sharing widely like this might still be a good idea, but one has to weigh these negative aspects and understand that one is trading increased information availability for added complexity and reduced flexibility.
Much of the problem comes from this concept of creating an open bulletin board for data, available to anyone who has access. This is a change from the usual idea of bilateral communication where the provider of information knows the consumer and knows what they want to do with the data. In this simpler bilateral scheme, the two parties not only maintain the data connection, but they also ensure that they share an understanding of what the data really means because they know what each other has to accomplish. And what data really means is inseparable from how it will be used.
We would always like to try to document what data really means so that consumers simply need to read the documentation. This frees the data provider from having to maintain those bilateral relationships. This idea of well designed data dictionaries is a key aspect of data democratization. However, in practice, any short description of data is going to be imperfect. Even with solid documentation, data providers will find data consumers misusing the data if they care to inquire. If they do not even know who is using data and for what purpose, we end up with problematic couplings that will lead to rigidity and error. These bilateral relationships of shared understanding should not be abandoned lightly for the unrealistic dream of widespread, self-service, hands-off data provisioning.
The careful reader might notice a seeming contradiction. Didn’t we argue that different components shouldn’t have to know much about the other components? Now we are arguing that understanding how the coupled components will use the data is important to avoid miscommunication. But this isn’t a contradiction. The shared understanding between coupled components should be about what each component needs to do, not how it does it. Data elements related to how a component does something are what we should not share. The data we do share is the piece that both parties need to agree on if there is to be any coupling at all.
For example, the engine needs 87 octane gasoline fuel-vapor with pressure between 40-60 PSI and temperature between 40-70 degrees Fahrenheit. That understanding is the contract between the fuel injector and the engine. They need to ensure that they both understand these terms exactly. The fuel injector group should know that the engine group is going to combust the fuel vapor to drive pistons and convert that to torque on a drive shaft. That’s enough understanding. The engine group, on the other hand, does not need to know where the fuel comes from, where it is stored, or how it got heated into that temperature range. Likewise, the drive shaft group needs to know what torque to expect from the engine. It doesn’t need to know exactly how it was generated.
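Such a contract can be written down as an explicit, checkable interface rather than tribal knowledge. The sketch below encodes exactly the terms from the text (87 octane, 40-60 PSI, 40-70 °F); the type and function names are made up for illustration.

```python
# The fuel contract from the text, made explicit and checkable.
# Both groups agree on exactly these terms -- and nothing more.
from dataclasses import dataclass


@dataclass
class FuelVapor:
    octane: int
    pressure_psi: float
    temperature_f: float


def meets_engine_contract(fuel: FuelVapor) -> bool:
    """True if the delivered fuel vapor satisfies the agreed contract."""
    return (
        fuel.octane == 87
        and 40 <= fuel.pressure_psi <= 60
        and 40 <= fuel.temperature_f <= 70
    )
```

Note what the contract does not contain: nothing about tanks, pumps, or heaters. Either group can redesign its internals freely as long as this small, shared surface stays satisfied.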
There is still a case to be made about documenting, to some degree, what data exists and more importantly what each component or group does. And it can make sense to allow more access to some people, such as the data scientists, purely for the purpose of exploration and experimentation. This could lead to new ideas about how to combine data or functionality across domain boundaries. However, successful ideas that arise from these experiments should then inform decisions between the leaders of these domains concerning how to collaborate in order to achieve these gains while still defending their domain boundaries in order to keep the coupling minimal. If the gains to be had are worth the complexity of this new coupling, they may choose to proceed and collaborate.
It is the ability to do this kind of collaboration well that truly allows for innovation at industry scale. The final form of this interaction, meaning what work will be done where and which data is transferred, will look very different from the data-sharing pattern used in the data scientist’s proof of concept. And it won’t generally make sense to build a data sharing platform just for these kinds of experiments. Data can be accessed in more conventional, manual ways for this type of throwaway work.
So, in summary, I suggest we need to rein in the data-democratization fanfare. Instead, people should learn more about the crucial concept of encapsulation and how it drives optimal design, which is what really allows us to control complexity and share effectively. Sharing data without focusing on encapsulation will solve one problem only to create a worse one. Once excess coupling is in place, it tends to spread and is very difficult to untangle or reverse. One might find they can build useful systems quickly, but then realize that they cannot modify the subsystems without simultaneously working on many others. This leads to a quick burst of productivity followed by stagnation and an inability to evolve. More attention should be given to design concepts focused on controlling complexity and maximizing flexibility. When we master that art, we can not only build useful systems, but we can keep evolving them quickly to keep up with the changing business world.