Lakes, Lakehouses, Warehouses and... MDM?
The path to Data Nirvana is very much an amicable one. There is a plethora of powerful tools, languages and frameworks for building reliable, robust data pipelines. The challenge is that some parts of the data pipeline cannot be automated and need a human actor to augment the decisions and play a role. You can probably imagine that this would dramatically slow down data pipelines, as no one has thousands of humans working around the clock, just waiting for a pipeline to demand a human's response. MDM is the part of a data landscape that involves business users in what is inherently an IT-driven data pipeline. The good news, however, is that you can automate huge parts of it over time as data stewards make decisions on how best to treat data quality issues.
Technology stacks can complement each other perfectly, and I would like to explore one of those synergies today.
I have spent the past 15 years as a software engineer. Although I build software in C#, and data processing in C# is already "ok", Python is without doubt the de facto data processing language. The beauty of tools like Apache Spark is that developers can write their data processing code in an approachable language like Python but execute it in a distributed manner on a Spark cluster, largely negating the fact that Python is rather slow compared to a compiled language like C# or C++. What I am getting at is that most data pipeline manipulation requires you to either hard-code your data transformation logic, or look up some external database to make the transformation a bit more dynamic. In these languages, you typically load a dataset, do some pre-processing (e.g. standardising dates), run code that calculates or transforms something, and then spit the result out the other end. Thanks to tools like Apache Spark, we can run this data pipeline "job" on a distributed cluster and transform a huge amount of data in a very short time. Naturally, if you are doing things like looking up a database or calling off to an external service, your Spark job cannot run fast, as it will hit the network a lot - we need to make sure that everything we do can run in memory once the dataset is loaded into a distributed in-memory cluster.
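To make that concrete, here is a minimal PySpark sketch of the kind of job described above. The file paths, the column names (order_date, city) and the date format are purely illustrative assumptions, not taken from any real pipeline.

# Minimal PySpark sketch: load a dataset, standardise a date column,
# apply a simple transformation and write the result back out.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-transform").getOrCreate()

# Assumed input: a CSV file with (at least) 'order_date' and 'city' columns.
df = spark.read.option("header", True).csv("/data/input/orders.csv")

cleaned = (
    df
    # Pre-processing: standardise dates into a single format.
    .withColumn("order_date", F.to_date("order_date", "dd/MM/yyyy"))
    # A trivial transformation: strip stray whitespace from the city column.
    .withColumn("city", F.trim(F.col("city")))
)

# Spark distributes this work across the cluster; nothing here touches the
# network beyond reading the input and writing the output.
cleaned.write.mode("overwrite").parquet("/data/output/orders_clean")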
Let's talk about the problem we want to solve. Imagine we have data coming in through our pipelines and we have a problem normalising said data - e.g. sometimes people spell Copenhagen in English and sometimes in Danish (København). There are only a few ways to automate this:
1: You have an "if" statement that checks whether the value of a cell is København and then changes it to Copenhagen.
2: You reach out to a file, database or external dataset that has a list of "known transformations".
3: You look up an external REST API that takes in a city (e.g. Google Places) and then returns what the external service thinks it should be.
The challenge with option one is that we are only talking about one city for now, but can you imagine how many permutations of a city name there can be across different systems? This is not a scalable solution, as we would be coming back to the data team every time there was a new variation of a value that needed to be normalised.
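As a rough sketch (reusing the illustrative DataFrame from the earlier example), option one ends up looking something like this - every new spelling variation means another hard-coded branch and another deployment:

# Option 1: the normalisation logic is hard-coded inside the job itself.
from pyspark.sql import functions as F

df = df.withColumn(
    "city",
    F.when(F.col("city") == "København", "Copenhagen")
     .when(F.col("city") == "Kobenhavn", "Copenhagen")  # another variant, another branch...
     .otherwise(F.col("city")),
)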
The second option does sound more scalable, but it would require someone to populate this database with the variations. This means manually discovering the permutations in what could be a huge amount of data. Although it seems more scalable, it is rather impractical. We also have the issue of the Spark job reaching out to a database while the job is running, which will pretty much kill any performance that distributed data processing could give us. We could, however, load this database into memory before the Spark job runs to eliminate that particular challenge.
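Here is a sketch of that mitigation, assuming the lookup table holds simple variant-to-canonical pairs: load it once on the driver, broadcast it to the executors and join entirely in memory, so no executor ever calls the database mid-job.

# Option 2: broadcast the "known transformations" so the join happens in memory.
from pyspark.sql import functions as F

# Assumed lookup data; in practice this would be read from the database once,
# before the job starts, rather than defined inline.
lookup = spark.createDataFrame(
    [("København", "Copenhagen"), ("Kobenhavn", "Copenhagen")],
    ["variant", "canonical"],
)

normalised = (
    df.join(F.broadcast(lookup), df["city"] == lookup["variant"], "left")
      .withColumn("city", F.coalesce(F.col("canonical"), F.col("city")))
      .drop("variant", "canonical")
)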
The external REST service (option three) suffers from the same performance issue as the second. We could potentially ask the third-party service for a file-based version of their data, but this is becoming less and less likely to happen. If we did get the data in a file, we could load that file into memory before running the Spark job.
At CluedIn, we have a fourth option, which helps to solve many of the challenges raised here and more. CluedIn is a platform that allows business users to discover and be prescribed fixes in data (e.g. Copenhagen and København are essentially the same) and have business rules automatically created on their behalf. Explainable, predictable, logical business rules. The leap we have to make is that these rules live in a JSON structure, available from a REST API, and what we need is for them to somehow become code, whether that is Java, Python, Go, C# or something else.
Here comes the good news. There are libraries available in all of these languages, and more, that can pull the rule structure from CluedIn and do exactly that - convert it from JSON to code. If this sounds like magic, it kind of is. Well, not magic, as it depends on a technique or language feature known as expression trees. It works much the same way a compiler does, in that a compiler takes text in a file and somehow turns it into a running application. In our case, we are doing something MUCH simpler than a compiler: we are essentially turning a JSON-structured rule into a predicate that can be executed as code. This opens up extremely exciting opportunities for companies to bridge the gap between IT and the business, allowing both parties to talk in a common language while using the different interfaces they are comfortable with. Think of it like a Rosetta Stone, but for logic.
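As a small sketch of the idea in Python (the rule shape and field names below are invented for illustration - CluedIn's actual rule schema and API are richer than this), a rule serialised as JSON can be compiled into an ordinary callable and then applied to rows:

# Sketch: turn a JSON-structured rule into an executable predicate/action.
import json
import operator

OPS = {"equals": operator.eq, "not_equals": operator.ne}

def compile_rule(rule: dict):
    """Build a callable that applies a single normalisation rule to a row."""
    test = OPS[rule["operator"]]
    field, match, replacement = rule["field"], rule["value"], rule["replace_with"]

    def apply(row: dict) -> dict:
        # Predicate: does the rule match this row? If so, apply the fix.
        if test(row.get(field), match):
            row = {**row, field: replacement}
        return row

    return apply

# A hypothetical rule, as it might arrive from a REST API.
rule_json = '{"field": "city", "operator": "equals", "value": "København", "replace_with": "Copenhagen"}'
normalise_city = compile_rule(json.loads(rule_json))

print(normalise_city({"city": "København"}))  # -> {'city': 'Copenhagen'}

A function produced this way could then be applied inside the Spark job itself (for example, via a map over the rows), keeping everything in memory on the cluster.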
This has led to what I think is one of my most exciting revelations in some time. Combining engines like Spark with the data transformation decisions hosted in, and served from, CluedIn is the perfect bridge between IT and the business. This is where MDM fits in this new world of Lakes, Lakehouses and Warehouses.