Distributed Analytics: Enabling More Insightful Solutions
Chuck Freedman
Growing Productive & Successful Developer Communities through Advocacy, Partnerships, and Relationships; Director of Developer Advocacy and Enablement @ MongoDB
At the recent Intel Analytics Summit, our first panel discussion featured a conversation on distributed analytics. This panel offered a lively look at topics that are close to the heart of anyone focused on extracting value from big data. While the topics were wide-ranging, from foundational technologies to applications like machine learning, the centerpiece of the discussion was a deep dive into the advantages of distributed analytics.
So what is distributed analytics? At the most basic level, distributed analytics spreads data analysis workloads over multiple nodes in a cluster of servers, rather than asking a single node to tackle a big problem. The same algorithm runs on each of the nodes, with each node processing its own subset of the data. When the processing concludes, the partial results are aggregated, or brought back together, to generate collective insights.
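The split, process, and aggregate pattern described above can be sketched in a few lines of Python. This is a minimal illustration rather than a production design: worker threads stand in for cluster nodes, and the names `partition`, `analyze_partition`, `aggregate`, and `distributed_mean`, along with the sample heart-rate values, are assumptions made up for this example.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class PartialStats:
    """Summary of one data partition, small enough to send back to a coordinator."""
    count: int
    total: float


def partition(records, num_nodes):
    """Split records into num_nodes roughly equal subsets (round-robin)."""
    return [records[i::num_nodes] for i in range(num_nodes)]


def analyze_partition(records):
    """The same algorithm runs on every node, each over its own subset."""
    return PartialStats(count=len(records), total=float(sum(records)))


def aggregate(partials):
    """Bring the partial results back together into one collective answer."""
    count = sum(p.count for p in partials)
    total = sum(p.total for p in partials)
    return total / count if count else float("nan")


def distributed_mean(records, num_nodes=4):
    # Threads stand in for cluster nodes here (an assumption for illustration);
    # a real deployment would ship each partition to a separate machine.
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(analyze_partition, partition(records, num_nodes)))
    return aggregate(partials)


heart_rates = [72, 88, 95, 61, 79, 102, 84, 90]  # made-up sample values
mean_rate = distributed_mean(heart_rates, num_nodes=3)
print(mean_rate)
```

Each node returns only a small summary (a count and a sum) rather than its raw records, so the aggregation step stays cheap no matter how large the partitions are. The answer is the same whether the data is split across one node or many.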
The advantage of the distributed approach boils down to a simple concept: faster time to insight. By putting multiple nodes to work on the same problem in parallel, you can gain insights from data much more quickly than would be possible when running an analytics application on a single node that processes the data sequentially. This is a huge benefit when you want to get fast answers from massive amounts of data.
Let’s take an example. To improve patient safety and avoid costly patient readmissions, a hospital might want to put distributed analytics to work to compare a patient’s vital signs and symptoms with the records of thousands of other patients who have experienced a similar health issue, including those who were released and later readmitted. It wouldn’t be feasible to do this on a single node, even a very large one, because the results would come back way too slowly to help the clinical staff make timely decisions on releasing the patient. By distributing the workload of processing all relevant data, insights can be obtained fast enough to support this kind of solution, advising the hospital staff whether it is safe to release the patient in a matter of minutes.
At a technology level, the time is ripe for distributed analytics. It is a natural complement to Apache Hadoop and its distributed file system (HDFS), which many organizations use as a repository for large amounts of data. By design, HDFS spreads data over different nodes, which makes it relatively easy to plug in a distributed analytics application.
Distributed analytics is also a natural fit with the popular Apache Spark processing engine, which is often paired with a Hadoop environment. Spark includes built-in modules for data streaming, SQL, machine learning and graph processing. When you pair Spark with a distributed analytics application and a lot of processing power, you’re positioned to run analytics on data as it streams in, to generate insights and answers in near real time.
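To make the idea of analyzing data as it streams in more concrete, here is a small plain-Python sketch of micro-batch processing with incrementally updated statistics. It does not use Spark's API; the names `RunningStats`, `micro_batches`, and `running_means`, and the sample readings, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class RunningStats:
    """State carried between batches, so earlier data never needs re-reading."""
    count: int = 0
    total: float = 0.0

    def update(self, batch):
        # Fold one batch of new readings into the running totals.
        self.count += len(batch)
        self.total += sum(batch)
        return self

    @property
    def mean(self):
        return self.total / self.count if self.count else float("nan")


def micro_batches(stream, batch_size):
    """Group an incoming stream into small batches as items arrive."""
    batch = []
    for item in stream:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, possibly shorter batch
        yield batch


def running_means(stream, batch_size=3):
    """Yield an updated mean after each micro-batch is processed."""
    stats = RunningStats()
    for batch in micro_batches(stream, batch_size):
        stats.update(batch)
        yield stats.mean


readings = [10, 20, 30, 40, 50, 60, 70]  # made-up sample values
means = list(running_means(iter(readings), batch_size=3))
print(means)
```

An updated answer is available after every batch instead of only once all the data has arrived, which is what makes near-real-time insight possible.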
Both Hadoop and Spark are projects of the Apache Software Foundation and are enriched by code contributions from Intel and the broader analytics community. The code that Intel contributes to these projects helps application developers and data scientists take full advantage of the capabilities and performance of the underlying Intel architecture.
To further boost performance, Intel’s contributions include the Intel® Data Analytics Acceleration Library (Intel® DAAL) and the Intel® Math Kernel Library (Intel® MKL). Intel DAAL provides highly tuned functions for deep learning, classical machine learning, and data analytics performance. Intel MKL provides highly optimized, threaded, and vectorized functions to increase performance on Intel processors. These optimization libraries are baked into related projects like the Trusted Analytics Platform (TAP), an Intel-initiated open-source platform that accelerates advanced analytics and machine learning solutions.
Ultimately, distributed analytics is an enabler of more advanced artificial intelligence (AI) solutions that need lightning-fast responses from data processing engines. AI gives us the ability to extend the reach of analytics to encompass not just structured data but also images, video, facial expressions, human speech and other sources of insight.
Let’s close with a key takeaway from the distributed analytics panel discussion at the Intel Analytics Summit: Emerging solutions today aren’t just about processing large amounts of data; they’re about leveraging computing performance and spreading the work across multiple machines or nodes. The practice of distributed analytics helps you capitalize on this data and can set you up to gain faster, more valuable insights.
For a closer look at Intel’s contributions to distributed analytics, including a range of resources for software developers, visit intel.com/machinelearning.
[This article was originally posted at https://itpeernetwork.intel.com/distributed-analytics-enabler-insightful-solutions/ on Nov 8, 2016.]