Exploring Apache Spark for Master Data Management in CluedIn

While vacationing in Japan, a question from a prospective client lingered in my mind: "Why couldn't CluedIn be entirely built on Apache Spark, eliminating the need to store anything?" It was posed more out of curiosity than necessity, but it sparked (pun intended) a deep dive into the potential of Spark in the realm of Master Data Management (MDM). In this context, the prospective client meant Apache Spark broadly, as in "any engine running Spark", such as Microsoft Fabric.

Apache Spark, Simplified

Apache Spark is an engine designed for running data processing tasks across a cluster of machines, offering built-in parallelism and fault tolerance. It's not typically "always on"; you boot it up, run a job, and then either shut it down or keep it running for more jobs. It's perfect for processing large datasets quickly and efficiently, especially for analytical tasks.

Why Not Build Everything on Spark?

Consider the example of a CRM system, which, like MDM tools, is operational in nature. If built on Spark, every operation, such as creating a new record or performing a search, would involve loading data into Spark, processing it, and then terminating the session. Operational systems need to be always on and responsive, not waking up on demand. This isn't what Spark is optimized for.

However, Spark Has Its Place in CluedIn

That said, incorporating Spark into CluedIn for specific functionalities is under R&D, because those functionalities play directly to its strengths:

  • Deduplication: Running deduplication routines on Spark is highly effective due to its ability to process large datasets quickly.
  • Data Cleaning: Bulk application of data cleaning rules can be efficiently handled in Spark.
  • Calculating Data Quality Metrics: Ideal for non-real-time metrics computations.
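
To illustrate why deduplication suits Spark so well, here is a minimal local Python sketch of the blocking-plus-matching pattern a Spark job would parallelize across a cluster. The record fields and the matching rule are hypothetical examples, not CluedIn's actual logic:

```python
from collections import defaultdict

def blocking_key(record):
    # Group candidates by a cheap key (first 3 letters of surname + postcode)
    # so we never compare every record against every other record.
    return (record["surname"][:3].lower(), record["postcode"])

def find_duplicate_pairs(records):
    blocks = defaultdict(list)
    for r in records:
        blocks[blocking_key(r)].append(r)

    pairs = []
    # Only compare records that share a block. This is the step Spark
    # distributes: each block can be matched independently on any executor.
    for candidates in blocks.values():
        for i in range(len(candidates)):
            for j in range(i + 1, len(candidates)):
                a, b = candidates[i], candidates[j]
                if a["email"].lower() == b["email"].lower():
                    pairs.append((a["id"], b["id"]))
    return pairs

records = [
    {"id": 1, "surname": "Ward",  "postcode": "2100", "email": "tim@cluedin.com"},
    {"id": 2, "surname": "WARD",  "postcode": "2100", "email": "TIM@cluedin.com"},
    {"id": 3, "surname": "Smith", "postcode": "9000", "email": "sam@example.com"},
]
print(find_duplicate_pairs(records))  # → [(1, 2)]
```

Because blocks are independent, each one can be matched on a different executor without coordination, which is exactly why this workload scales so well on Spark.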


When it comes to batch processing or situations where data can be offloaded to files and then quickly processed, Spark can significantly outperform traditional environments.
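
A minimal local sketch of that offload-then-process pattern, using a JSON-lines file as the staging format. Spark would read the same file in parallel as a DataFrame; the cleaning rules and fields here are hypothetical examples:

```python
import json
import os
import tempfile

# Hypothetical cleaning rules: each takes a record dict and returns a cleaned one.
def trim_whitespace(rec):
    return {k: v.strip() if isinstance(v, str) else v for k, v in rec.items()}

def normalize_country(rec):
    aliases = {"dk": "Denmark", "denmark": "Denmark"}
    if rec.get("country"):
        rec["country"] = aliases.get(rec["country"].lower(), rec["country"])
    return rec

RULES = [trim_whitespace, normalize_country]

def offload_and_clean(records, path):
    # Step 1: offload to a file -- the cheap, durable hand-off point.
    with open(path, "w") as f:
        for rec in records:
            f.write(json.dumps(rec) + "\n")
    # Step 2: one batch pass applies every rule to every row --
    # exactly the shape of work Spark parallelizes well.
    cleaned = []
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            for rule in RULES:
                rec = rule(rec)
            cleaned.append(rec)
    return cleaned

records = [{"name": "  Tim ", "country": "dk"}, {"name": "Ana", "country": "Denmark "}]
path = os.path.join(tempfile.mkdtemp(), "staging.jsonl")
print(offload_and_clean(records, path))
# → [{'name': 'Tim', 'country': 'Denmark'}, {'name': 'Ana', 'country': 'Denmark'}]
```

The single sequential read loop is what Spark replaces: the staged file is split across executors, and each partition runs the same rule pipeline in parallel.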

Limitations of Spark in an MDM Context

Spark is not suited for operations requiring immediate response or interactive user interfaces. For example, making remote calls per row or interacting with third-party services is an anti-pattern in Spark, leading to performance bottlenecks. Also, a UI that relies on Spark to load and interact with data would suffer from considerable latency, negatively impacting user experience.
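
The per-row remote call anti-pattern, and the usual batching fix, can be sketched locally like this. The lookup service is simulated with a sleep; in Spark, the batched version roughly corresponds to using `mapPartitions` with one client per partition instead of a `map` that calls out once per row:

```python
import time

def enrich_one(record):
    # Anti-pattern: one round-trip per row. With real network latency,
    # a million rows means a million sequential waits.
    time.sleep(0.001)  # simulated network round-trip
    return {**record, "enriched": True}

def enrich_batch(records):
    # Better: one round-trip per batch/partition.
    time.sleep(0.001)  # simulated round-trip for the whole batch
    return [{**r, "enriched": True} for r in records]

rows = [{"id": i} for i in range(100)]

start = time.perf_counter()
slow = [enrich_one(r) for r in rows]   # 100 round-trips
per_row_time = time.perf_counter() - start

start = time.perf_counter()
fast = enrich_batch(rows)              # 1 round-trip
batched_time = time.perf_counter() - start

assert slow == fast
print(f"per-row: {per_row_time:.3f}s, batched: {batched_time:.3f}s")
```

The two versions produce identical results; only the number of round-trips changes, and that is where the latency goes.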

MDM on Spark: Theoretical Possibilities and Practical Constraints

In theory, an MDM system could be engineered to run on Spark, using Parquet files for storage and Spark for computation-intensive tasks like data cleaning or deduplication. While feasible, this would lead to unconventional user experiences and potentially long wait times. However, the efficiency and cost savings in specific scenarios cannot be ignored.

Why We Investigate Spark's Viability

  • Resource Utilization: Many companies already possess general compute clusters and seek to leverage them effectively.
  • Cost-Effectiveness: Using Spark can be cost-effective as charges apply only when computing tasks are being executed.
  • Data Duplication Concerns: Companies are interested in reducing redundant data storage, although this is more applicable to analytics than operational systems.

Final Thoughts

While Spark offers formidable capabilities for certain types of data processing within CluedIn, it is not suitable as the sole technology for an operational MDM system. However, integrating Spark for specific tasks within the MDM lifecycle? Absolutely.

More articles by Tim Ward