Introduction to Apache Spark's ML library.
Naren Castellon
Specialist in Time Series, Machine learning, Deep learning, Data science, Mathematics, Statistics, Finance, Youtuber
MLlib is Apache Spark's native ML library. As a native library, MLlib integrates tightly with the rest of Spark's APIs and libraries, including the Spark SQL engine, the DataFrame and Spark SQL APIs, and even Structured Streaming. This gives Apache Spark the advantage of being a unified data analytics platform that can perform all tasks pertaining to data analytics, from data ingestion and data transformation, through ad hoc analysis of data, to building sophisticated ML models and leveraging those models for production use cases.
In the early versions of Apache Spark, MLlib was based on Spark's RDD API. As of Spark version 2.0, the DataFrame-based API (the spark.ml package) became the primary ML API, and the older RDD-based API (the spark.mllib package) entered maintenance mode. In Spark 3.0 and later, the DataFrame-based MLlib is the standard, and no new features are planned for the RDD-based MLlib.
The DataFrame-based MLlib closely mimics traditional single-machine Python-based ML libraries such as scikit-learn and consists of three major components:
transformers,
estimators,
pipelines.
Transformers
A transformer is an algorithm that takes a DataFrame as input, performs processing on the DataFrame columns, and returns another DataFrame. An ML model trained using Spark MLlib is a transformer that takes a raw DataFrame and returns another DataFrame with the original raw data along with the new prediction columns. A typical transformer pipeline is shown in the following diagram:
The preceding diagram depicts a typical transformer pipeline, in which a series of transformer stages, including a fitted VectorIndexer and an already trained linear regression model, are applied to the raw DataFrame. The result is a new DataFrame with all the original columns, along with new columns containing the predicted values.
Estimators
An estimator is an algorithm that accepts a DataFrame as input and produces a transformer. Any ML learning algorithm is an estimator in that it is fit on a DataFrame of training data and returns a trained model, which is itself a transformer that can turn a DataFrame of raw data into a DataFrame with predictions. An estimator pipeline is depicted in the following diagram:
In the preceding diagram, a transformer is first applied to a DataFrame of raw data to produce a DataFrame of feature vectors. An estimator, in this case a linear regression algorithm, is then fit on the feature-vector DataFrame to produce a transformer: a newly trained linear regression model.
Pipelines
An ML pipeline within Spark MLlib chains together several stages of transformers and estimators into a DAG that performs an end-to-end ML operation ranging from data cleansing, to feature engineering, to actual model training. A pipeline could be a transformer-only pipeline or an estimator-only pipeline or a mix of the two.
Using the transformers and estimators available within Spark MLlib, an entire end-to-end ML pipeline can be constructed. A typical ML pipeline consists of several stages, starting with data wrangling and proceeding through feature engineering, model training, and model inference.