Exploring Apache Spark for Master Data Management in CluedIn

While vacationing in Japan, a question from a prospective client lingered in my mind: "Why couldn't CluedIn be entirely built on Apache Spark, eliminating the need to store anything?" It was posed more out of curiosity than necessity, but it sparked (pun intended) a deep dive into the potential of Spark in the realm of Master Data Management (MDM). In this context, the prospective client meant Apache Spark broadly, as in "any engine running Spark", such as Microsoft Fabric.

Apache Spark, Simplified

Apache Spark is an engine designed for running data processing tasks across a cluster of machines, offering built-in parallelism and fault tolerance. It's not typically "always on"; you boot it up, run a job, and then either shut it down or keep it running for more jobs. It's perfect for processing large datasets quickly and efficiently, especially for analytical tasks.

Why Not Build Everything on Spark?

Consider the example of a CRM system, which, like MDM tools, is operational in nature. If built on Spark, every operation, such as creating a new record or performing a search, would involve loading data into Spark, processing it, and then terminating the session. Operational systems need to be always on and responsive, not waking up on demand. This isn't what Spark is optimized for.

However, Spark Has Its Place in CluedIn

That said, incorporating Spark into CluedIn for specific functionalities is under R&D, because those functionalities play directly to its strengths:

  • Deduplication: Running deduplication routines on Spark is highly effective due to its ability to process large datasets quickly.
  • Data Cleaning: Bulk application of data cleaning rules can be efficiently handled in Spark.
  • Calculating Data Quality Metrics: Ideal for non-real-time metrics computations.
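
To illustrate why deduplication suits Spark so well, here is a minimal local Python sketch of the blocking-plus-matching pattern a Spark job would parallelize across a cluster. The record fields and the matching rule are hypothetical examples, not CluedIn's actual logic:

```python
from collections import defaultdict

def blocking_key(record):
    # Group candidates by a cheap key (first 3 letters of surname + postcode)
    # so we never compare every record against every other record.
    return (record["surname"][:3].lower(), record["postcode"])

def find_duplicate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)

    pairs = []
    # Only compare records that share a block. This is the step Spark
    # distributes: each block can be matched independently on any executor.
    for candidates in blocks.values():
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                a, b = candidates[i], candidates[j]
                if a["email"].lower() == b["email"].lower():
                    pairs.append((a["id"], b["id"]))
    return pairs

records = [
    {"id": 1, "surname": "Ward",  "postcode": "2100", "email": "tim@cluedin.com"},
    {"id": 2, "surname": "WARD",  "postcode": "2100", "email": "TIM@cluedin.com"},
    {"id": 3, "surname": "Smith", "postcode": "9000", "email": "sam@example.com"},
]
print(find_duplicate_pairs(records))  # → [(1, 2)]
```

Because blocks are independent, each one can be matched on a different executor without coordination, which is exactly why this workload scales so well on Spark.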


When it comes to batch processing or situations where data can be offloaded to files and then quickly processed, Spark can significantly outperform traditional environments.
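
A minimal local sketch of that offload-then-process pattern, using a JSON-lines file as the staging format. Spark would read the same file in parallel as a DataFrame; the cleaning rules and fields here are hypothetical examples:

```python
import json
import os
import tempfile

# Hypothetical cleaning rules: each takes a record dict and returns a cleaned one.
def trim_whitespace(rec):
    return {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}

def normalize_country(rec):
    aliases = {"dk": "Denmark", "denmark": "Denmark"}
    if rec.get("country"):
        rec["country"] = aliases.get(rec["country"].lower(), rec["country"])
    return rec

RULES = [trim_whitespace, normalize_country]

def offload_and_clean(records, path):
    # Step 1: offload to a file -- the cheap, durable hand-off point.
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    # Step 2: one batch pass applies every rule to every row --
    # exactly the shape of work Spark parallelizes well.
    cleaned = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            for rule in RULES:
                rec = rule(rec)
            cleaned.append(rec)
    return cleaned

records = [{"name": "  Tim ", "country": "dk"}, {"name": "Ana", "country": "Denmark "}]
path = os.path.join(tempfile.mkdtemp(), "staging.jsonl")
print(offload_and_clean(records, path))
# → [{'name': 'Tim', 'country': 'Denmark'}, {'name': 'Ana', 'country': 'Denmark'}]
```

The single sequential read loop is what Spark replaces: the staged file is split across executors, and each partition runs the same rule pipeline in parallel.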

Limitations of Spark in an MDM Context

Spark is not suited for operations requiring immediate response or interactive user interfaces. For example, making remote calls per row or interacting with third-party services is an anti-pattern in Spark, leading to performance bottlenecks. Also, a UI that relies on Spark to load and interact with data would suffer from considerable latency, negatively impacting user experience.
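
The per-row remote call anti-pattern, and the usual batching fix, can be sketched locally like this. The lookup service is simulated with a sleep; in Spark, the batched version roughly corresponds to using `mapPartitions` with one client per partition instead of a `map` that calls out once per row:

```python
import time

def enrich_one(record):
    # Anti-pattern: one round-trip per row. With real network latency,
    # a million rows means a million sequential waits.
    time.sleep(0.001)  # simulated network round-trip
    return {**record, "enriched": True}

def enrich_batch(records):
    # Better: one round-trip per batch/partition.
    time.sleep(0.001)  # simulated round-trip for the whole batch
    return [{**r, "enriched": True} for r in records]

rows = [{"id": i} for i in range(100)]

start = time.perf_counter()
slow = [enrich_one(r) for r in rows]   # 100 round-trips
per_row_time = time.perf_counter() - start

start = time.perf_counter()
fast = enrich_batch(rows)              # 1 round-trip
batched_time = time.perf_counter() - start

assert slow == fast
print(f"per-row: {per_row_time:.3f}s, batched: {batched_time:.3f}s")
```

The two versions produce identical results; only the number of round-trips changes, and that is where the latency goes.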

MDM on Spark: Theoretical Possibilities and Practical Constraints

In theory, an MDM system could be engineered to run on Spark, using Parquet files for storage and Spark for computation-intensive tasks like data cleaning or deduplication. While feasible, this would lead to unconventional user experiences and potentially long wait times. However, the efficiency and cost savings in specific scenarios cannot be ignored.

Why We Investigate Spark's Viability

  • Resource Utilization: Many companies already possess general compute clusters and seek to leverage them effectively.
  • Cost-Effectiveness: Using Spark can be cost-effective as charges apply only when computing tasks are being executed.
  • Data Duplication Concerns: Companies are interested in reducing redundant data storage, although this is more applicable to analytics than operational systems.

Final Thoughts

While Spark offers formidable capabilities for certain types of data processing within CluedIn, it is not suitable as the sole technology for an operational MDM system. However, integrating Spark for specific tasks within the MDM lifecycle? Absolutely.

More articles by Tim Ward