Lakes, Lakehouses, Warehouses and... MDM?
The path to Data Nirvana is very much an amicable one. There is a plethora of powerful tools, languages and frameworks for building reliable, robust data pipelines. The challenge is that some parts of the data pipeline cannot be automated and need a human actor to augment the decisions and play a role. You can probably imagine that this would dramatically slow down data pipelines, as no one has thousands of humans working around the clock, just waiting for a pipeline to demand a human's response. MDM is the part of a data landscape that involves business users in what is inherently an IT-driven data pipeline. The good news, however, is that you can automate huge parts of it over time as data stewards make decisions on how best to treat data quality issues.
Technology stacks can complement each other perfectly, and I would like to explore one of those synergies today.
I have spent the past 15 years as a software engineer. Although I build software in C#, and data processing in C# is already "ok", Python is without doubt the de facto data processing language. The beauty of tools like Apache Spark is that developers can write their data processing code in an approachable language like Python but execute it in a distributed manner on a Spark cluster, largely negating the fact that Python is rather slow compared to a compiled language like C# or C++. What I am getting at is that most data pipeline manipulation requires you to either hard-code your data transformation logic, or look up some external database to make the transformation a bit more dynamic. In these languages, you typically load a dataset, do some pre-processing (e.g. standardising dates), run code that calculates or transforms something, and then spit the result out the other end. Thanks to tools like Apache Spark, we can run this data pipeline "job" on a distributed cluster and transform a huge amount of data in a very short time. Naturally, if you are doing things like looking up a database or calling off to an external service, your Spark job cannot run fast, as it will hit the network a lot - we need to make sure that everything we do can run in memory once the dataset is loaded into a distributed in-memory cluster.
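To make that concrete, here is a minimal PySpark sketch of the kind of job described above. The file paths, the column names (order_date, city) and the date format are purely illustrative assumptions, not taken from any real pipeline.

# Minimal PySpark sketch: load a dataset, standardise a date column,
# apply a simple transformation and write the result back out.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("simple-transform").getOrCreate()

# Assumed input: a CSV file with (at least) 'order_date' and 'city' columns.
df = spark.read.option("header", True).csv("/data/input/orders.csv")

cleaned = (
    df
    # Pre-processing: standardise dates into a single format.
    .withColumn("order_date", F.to_date("order_date", "dd/MM/yyyy"))
    # A trivial transformation: strip stray whitespace from the city column.
    .withColumn("city", F.trim(F.col("city")))
)

# Spark distributes this work across the cluster; nothing here touches the
# network beyond reading the input and writing the output.
cleaned.write.mode("overwrite").parquet("/data/output/orders_clean")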
Let's talk about the problem we want to solve. Imagine we have data coming in through our pipelines and we have a problem normalising said data - e.g. sometimes people spell Copenhagen in English and sometimes in Danish (København). There are only a few ways to automate this:
1: You have an "if" statement that checks whether the value of a cell is København and then changes it to Copenhagen.
2: You reach out to a file, database or external dataset that has a list of "known transformations".
3: You look up an external REST API that takes in a city (e.g. Google Places) and then returns what the external service thinks it should be.
The challenge with option one is that we are only talking about one city for now, but can you imagine how many permutations of a city name there can be across different systems? This is not a scalable solution, as we would be coming back to the data team every time there was a new variation of a value that needed to be normalised.
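As a rough sketch (reusing the illustrative DataFrame from the earlier example), option one ends up looking something like this - every new spelling variation means another hard-coded branch and another deployment:

# Option 1: the normalisation logic is hard-coded inside the job itself.
from pyspark.sql import functions as F

df = df.withColumn(
    "city",
    F.when(F.col("city") == "København", "Copenhagen")
     .when(F.col("city") == "Kobenhavn", "Copenhagen")  # another variant, another branch...
     .otherwise(F.col("city")),
)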
The second option does sound more scalable, but it would require someone to populate this database with the variations. This means manually discovering the permutations in what could be a huge amount of data. Although it seems more scalable, it is rather impractical. We also have the issue of the Spark job reaching out to a database while the job is running, which will pretty much kill any performance that distributed data processing could give us. We could, however, load this database into memory before the Spark job runs to eliminate that particular challenge.
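Here is a sketch of that mitigation, assuming the lookup table holds simple variant-to-canonical pairs: load it once on the driver, broadcast it to the executors and join entirely in memory, so no executor ever calls the database mid-job.

# Option 2: broadcast the "known transformations" so the join happens in memory.
from pyspark.sql import functions as F

# Assumed lookup data; in practice this would be read from the database once,
# before the job starts, rather than defined inline.
lookup = spark.createDataFrame(
    [("København", "Copenhagen"), ("Kobenhavn", "Copenhagen")],
    ["variant", "canonical"],
)

normalised = (
    df.join(F.broadcast(lookup), df["city"] == lookup["variant"], "left")
      .withColumn("city", F.coalesce(F.col("canonical"), F.col("city")))
      .drop("variant", "canonical")
)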
The external REST service (option three) suffers from the same performance issue as the second. We could potentially ask the third-party service for a file-based version of their data, but this is becoming less and less likely to happen. If we did get the data in a file, we could load that file into memory before running the Spark job.
At CluedIn, we have a fourth option, which helps to solve many of the challenges raised here and more. CluedIn is a platform that allows business users to discover and be prescribed fixes in data (e.g. Copenhagen and København are essentially the same) and have business rules automatically created on their behalf. Explainable, predictable, logical business rules. The leap we have to make is that these rules live in a JSON structure, available from a REST API, and what we need is for them to somehow become code, whether that is Java, Python, Go, C# or something else.
Here comes the good news. There are libraries available in all of these languages, and more, that can pull the rule structure from CluedIn and do exactly that - convert it from JSON to code. If this sounds like magic, it kind of is. Well, not magic, as it depends on a technique or language feature known as expression trees. It works much the same way a compiler does, in that a compiler takes text in a file and somehow turns it into a running application. In our case, we are doing something MUCH simpler than a compiler: we are essentially turning a JSON-structured rule into a predicate that can be executed as code. This opens up extremely exciting opportunities for companies to bridge the gap between IT and the business, allowing both parties to talk in a common language while using the different interfaces they are comfortable with. Think of it like a Rosetta Stone, but for logic.
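As a small sketch of the idea in Python (the rule shape and field names below are invented for illustration - CluedIn's actual rule schema and API are richer than this), a rule serialised as JSON can be compiled into an ordinary callable and then applied to rows:

# Sketch: turn a JSON-structured rule into an executable predicate/action.
import json
import operator

OPS = {"equals": operator.eq, "not_equals": operator.ne}

def compile_rule(rule: dict):
    """Build a callable that applies a single normalisation rule to a row."""
    test = OPS[rule["operator"]]
    field, match, replacement = rule["field"], rule["value"], rule["replace_with"]

    def apply(row: dict) -> dict:
        # Predicate: does the rule match this row? If so, apply the fix.
        if test(row.get(field), match):
            row = {**row, field: replacement}
        return row

    return apply

# A hypothetical rule, as it might arrive from a REST API.
rule_json = '{"field": "city", "operator": "equals", "value": "København", "replace_with": "Copenhagen"}'
normalise_city = compile_rule(json.loads(rule_json))

print(normalise_city({"city": "København"}))  # -> {'city': 'Copenhagen'}

A function produced this way could then be applied inside the Spark job itself (for example, via a map over the rows), keeping everything in memory on the cluster.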
This has led to what I think is one of my most exciting revelations in some time. Combining engines like Spark with the data transformation decisions hosted in, and served from, CluedIn is the perfect bridge between IT and the business. This is where MDM fits in this new world of Lakes, Lakehouses and Warehouses.