Introduction to Apache Spark's ML library.

MLlib is Apache Spark's native machine learning library. Being a native library, MLlib integrates tightly with the rest of Spark's APIs and libraries, including the Spark SQL engine, the DataFrame API, and even Structured Streaming. This gives Apache Spark the unique advantage of being a truly unified data analytics platform that can perform every task in the analytics lifecycle, from data ingestion and data transformation, through ad hoc analysis, to building sophisticated ML models and leveraging those models for production use cases.

In the early versions of Apache Spark, MLlib was based on Spark's RDD API. Starting with Spark version 2.0, a new ML library based on the DataFrame API was introduced. In Spark 3.0 and later, the DataFrame API-based MLlib is the standard, while the older RDD-based MLlib is in maintenance mode with no future enhancements planned.
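To make the distinction concrete, here is a minimal sketch (the class choices are just illustrative): the DataFrame-based API lives in the pyspark.ml package, while the legacy RDD-based API lives in pyspark.mllib and should be avoided for new code.

    # DataFrame-based API (the current standard)
    from pyspark.ml.regression import LinearRegression

    # Legacy RDD-based API (maintenance mode only)
    from pyspark.mllib.regression import LabeledPoint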

The DataFrame-based MLlib closely mimics traditional single-machine, Python-based ML libraries such as scikit-learn and consists of three major components: transformers, estimators, and pipelines.

Transformers

A transformer is an algorithm that takes a DataFrame as input, performs processing on the DataFrame columns, and returns another DataFrame. An ML model trained using Spark MLlib is a transformer that takes a raw DataFrame and returns another DataFrame with the original raw data along with the new prediction columns. A typical transformer pipeline is shown in the following diagram:

[Diagram: a typical transformer pipeline applied to a raw DataFrame]

In the previous diagram, a typical transformer pipeline is depicted, where a series of transformer stages, including a VectorIndexer and an already trained Linear Regression Model, are applied to the raw DataFrame. The result is a new DataFrame with all the original columns, along with some new columns containing predicted values.
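As a minimal sketch of the transformer concept (using VectorAssembler rather than the VectorIndexer from the diagram, with made-up column names and data), a transformer's transform() call returns a new DataFrame that keeps the input columns and appends new ones:

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler

    spark = SparkSession.builder.appName("transformer-sketch").getOrCreate()

    # Hypothetical raw DataFrame with two numeric feature columns
    raw_df = spark.createDataFrame(
        [(1.0, 5.0), (2.0, 3.0), (3.0, 7.0)],
        ["x1", "x2"],
    )

    # VectorAssembler is a transformer: transform() returns a new DataFrame
    # containing the original columns plus an assembled "features" vector column
    assembler = VectorAssembler(inputCols=["x1", "x2"], outputCol="features")
    features_df = assembler.transform(raw_df)
    features_df.show()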

Estimators

An estimator is an algorithm that accepts a DataFrame as input and produces a transformer as output. Any ML training algorithm is an estimator in this sense: it learns from a DataFrame of raw data and returns a trained model, and that model is the transformer that turns raw data into actual predictions. An estimator pipeline is depicted in the following diagram:

[Diagram: an estimator pipeline producing a trained model]

In the preceding diagram, a Transformer is first applied to a DataFrame of raw data, producing a DataFrame of Feature Vectors. An Estimator, in the form of a Linear Regression Algorithm, is then applied to the Feature Vector DataFrame, resulting in a Transformer in the form of a newly trained Linear Regression Model.
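A minimal sketch of this estimator-to-transformer flow, assuming small made-up data, might look like the following: LinearRegression (an estimator) is fit on a feature-vector DataFrame and returns a LinearRegressionModel (a transformer) whose transform() appends a prediction column.

    from pyspark.sql import SparkSession
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("estimator-sketch").getOrCreate()

    # Hypothetical labeled data: label is 2*x1 + x2
    raw_df = spark.createDataFrame(
        [(1.0, 5.0, 7.0), (2.0, 3.0, 7.0), (3.0, 7.0, 13.0), (4.0, 2.0, 10.0)],
        ["x1", "x2", "label"],
    )
    features_df = VectorAssembler(
        inputCols=["x1", "x2"], outputCol="features"
    ).transform(raw_df)

    # LinearRegression is an estimator: fit() returns a trained model (a transformer)
    lr = LinearRegression(featuresCol="features", labelCol="label")
    lr_model = lr.fit(features_df)

    # The fitted model is a transformer: transform() appends a "prediction" column
    lr_model.transform(features_df).select("features", "label", "prediction").show()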

Pipelines

An ML pipeline within Spark MLlib chains together several transformer and estimator stages into a DAG that performs an end-to-end ML operation, ranging from data cleansing to feature engineering to actual model training. A pipeline can consist only of transformers, only of estimators, or a mix of the two.

Using the transformers and estimators available within Spark MLlib, an entire end-to-end ML pipeline can be constructed. A typical ML pipeline consists of several stages, from data wrangling and feature engineering through model training and model inferencing, as sketched below.
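As a minimal sketch of such a pipeline (column names and data are made up for illustration), a Pipeline chains a transformer stage and an estimator stage; fitting it returns a PipelineModel, itself a transformer that can be applied end to end to new raw data:

    from pyspark.sql import SparkSession
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.regression import LinearRegression

    spark = SparkSession.builder.appName("pipeline-sketch").getOrCreate()

    train_df = spark.createDataFrame(
        [(1.0, 5.0, 7.0), (2.0, 3.0, 7.0), (3.0, 7.0, 13.0), (4.0, 2.0, 10.0)],
        ["x1", "x2", "label"],
    )

    # Chain a transformer stage (VectorAssembler) and an estimator stage
    # (LinearRegression); the Pipeline itself behaves as an estimator
    pipeline = Pipeline(stages=[
        VectorAssembler(inputCols=["x1", "x2"], outputCol="features"),
        LinearRegression(featuresCol="features", labelCol="label"),
    ])

    # fit() runs the stages in order and returns a PipelineModel (a transformer)
    pipeline_model = pipeline.fit(train_df)

    # The fitted PipelineModel can be applied directly to new raw data for inference
    new_df = spark.createDataFrame([(6.0, 1.0)], ["x1", "x2"])
    pipeline_model.transform(new_df).select("x1", "x2", "prediction").show()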

