Some building blocks of a data platform - I
Govinda Rathi
Software Architect at Johnson Controls | Cloud Computing | Kubernetes | Big Data
You can't just wear nice clothes and look fresh. You need to take a shower too!
Let’s say you are meeting someone for the first time and suddenly dozens of your twins pop up around you. A glitch in the matrix, maybe? To make it worse, these twins are missing some basic physical features. Say one doesn’t have a nose, another is missing an eye, and so on. Since the person you are meeting doesn’t know the real you, they can’t tell which one is real. And what if it is not just your twins around but that person’s too? How would you spot the correct one?
Well, this is exactly what happens when you have dozens of applications running in silos, each trying to maintain its own single source of truth. Since these applications serve different use cases, the definition of the single source of truth, and by extension its mandatory attributes, is not the same for each of them. As a result, you end up with dozens of incomplete clones. This adds not only to storage cost but also to maintenance cost. And let’s not even talk about the effort required to extract meaningful insights, which can only be drawn if all these clones are tied together.
This underscores the importance of having one organisation-wide data platform that is responsible for maintaining a pristine single source of truth, one that can be referred to by anyone and should be referred to by everyone. To get there, you need governance around what can be fed into this platform, and a management process that ensures the governance principles are always followed. But having a single source of truth is one thing; ensuring that it is complete and meaningful is another. To achieve the latter, you need a data dictionary.
What sets you apart from those incomplete twins of yours? What makes you you? If the person had already known your characteristics, they could have easily identified you even with the incomplete twins around, correct? This is what a data dictionary is - a collection of names, definitions, and attributes of the data elements. It helps people understand and identify the right data elements in your platform. It is like the DNA of a data element.
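To make the idea a little more concrete, here is a minimal sketch of what a data dictionary could look like in code, and how it might be used to spot an "incomplete twin". Everything in it is an assumption made for illustration: the `DataElement` class, the `CUSTOMER_DICTIONARY` entries, the `find_violations` helper, and the sample records are all hypothetical, not part of any particular platform or tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """One dictionary entry: a name, a definition, and attributes."""
    name: str
    definition: str
    data_type: type
    mandatory: bool = True


# Hypothetical dictionary for a "customer" entity (illustrative only).
CUSTOMER_DICTIONARY = {
    element.name: element
    for element in [
        DataElement("customer_id", "Unique identifier of the customer", str),
        DataElement("full_name", "Legal name of the customer", str),
        DataElement("email", "Primary contact email address", str),
        DataElement("loyalty_tier", "Loyalty programme tier, if any", str,
                    mandatory=False),
    ]
}


def find_violations(record, dictionary):
    """Return a list of ways in which a record departs from the dictionary."""
    violations = []
    for element in dictionary.values():
        value = record.get(element.name)
        if value is None:
            # Only mandatory attributes are reported when absent.
            if element.mandatory:
                violations.append(
                    f"missing mandatory attribute: {element.name}")
        elif not isinstance(value, element.data_type):
            violations.append(
                f"wrong type for {element.name}: "
                f"expected {element.data_type.__name__}")
    # Attributes the dictionary does not know about are flagged as well.
    for key in record:
        if key not in dictionary:
            violations.append(f"unknown attribute: {key}")
    return violations


# The "real you" versus an incomplete twin from a siloed application.
complete = {
    "customer_id": "C-001",
    "full_name": "Asha Rao",
    "email": "asha@example.com",
}
incomplete_twin = {
    "customer_id": "C-001",
    "full_name": "Asha Rao",
    "nickname": "Ash",
}

print(find_violations(complete, CUSTOMER_DICTIONARY))
print(find_violations(incomplete_twin, CUSTOMER_DICTIONARY))
```

The complete record passes with no violations, while the twin is flagged for a missing `email` and an unrecognised `nickname`. A real platform would likely keep the dictionary in a catalogue or schema registry rather than in code, but the principle is the same: the dictionary is the reference against which every incoming record is judged.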
And once you have this dictionary in place, enforcing governance is a cakewalk. And once you have the governance in place, then and only then have you earned the privilege of developing an ML model. Until then, keep cleaning the mess!