Machine Learning
The use of Machine Learning (ML) has increased substantially in enterprise data analytics to extract valuable insights from business data. It is therefore important to have an ecosystem to build, test, deploy, and maintain enterprise-grade machine learning models in production environments. ML model development involves acquiring data from multiple trusted sources, processing the data to make it suitable for modelling, choosing an algorithm, building the model, computing performance metrics, and selecting the best-performing model. Model maintenance plays a critical role once the model is deployed into production: the model must be kept up to date and relevant as the source data changes, since there is a risk of the model becoming outdated over time. Configuration management of ML models also becomes important as the number of models grows. This article focuses on principles and industry-standard practices, including the tools and technologies used for ML model development, deployment, and maintenance in an enterprise environment.
The Machine Learning (ML) model lifecycle covers the entire process, from source data identification through model development, model deployment, and model maintenance. At a high level, the activities fall into two broad categories: ML Model Development and ML Model Operations.
Machine Learning (ML) model development includes a series of steps, as shown in Fig. 1.
Fig 1: Machine Learning (ML) Model Development Lifecycle
The ML model development lifecycle steps can be broadly classified as data exploration, model building, hyperparameter tuning, and selection of the model with optimum performance.
Exploratory data analysis is an important step that starts once the business hypothesis is ready. This step takes 40-50% of the total project time, as the model outcome depends on the quality of the input data used to train the model. Exploratory data analysis involves attribute identification, data pre-processing, and feature engineering. Attribute identification means identifying the predictor/feature variables (inputs) and the target/class variable (output), along with their data types (string, numeric, or datetime), and classifying features into categorical and continuous variables, which helps the algorithm apply the appropriate treatment to each variable while building the model. Data pre-processing involves identifying missing values and outliers, and filling the gaps by computing the mean or median for quantitative attributes and the mode for qualitative attributes, to improve the predictive power of the model. Outliers inflate the mean and standard deviation; their effect can be reduced by taking the natural log, which dampens the variation caused by extreme values.
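To make this concrete, here is a minimal pre-processing sketch in Python using pandas and NumPy; the input file and column names are hypothetical, for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical input file with a mix of qualitative and quantitative attributes.
df = pd.read_csv("customer_data.csv")

# Separate quantitative (numeric) and qualitative (string) attributes.
numeric_cols = df.select_dtypes(include=[np.number]).columns
categorical_cols = df.select_dtypes(include=["object"]).columns

# Fill missing quantitative values with the median (mean is also an option).
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Fill missing qualitative values with the mode.
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Dampen the effect of extreme outliers on a skewed, non-negative column
# by taking the natural log (log1p handles zero values safely).
df["income_log"] = np.log1p(df["income"])  # 'income' is a hypothetical column
```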
Feature engineering is the next important step in exploratory data analysis, where the raw dataset is processed to convert string, datetime, and numeric data types into numeric vectors that an ML algorithm can understand and use to build an efficient predictive model. Labelled categorical data (e.g., color → red, green, blue; performance → poor, fair, good, very good, excellent; risk → low, medium, high; status → started, in-progress, on-hold, closed; gender → male/female; is_loan_approved/is_claim_fraudulent → yes/no) cannot be understood by an ML algorithm in its true context, so it must be converted into numeric data. Feature encoding is a widely used technique to transform categorical data into continuous (numerical) values, e.g., encode color as 1, 2, 3; performance as 1, 2, 3, 4, 5; risk as 1, 2, 3; status as 1, 2, 3, 4; and gender / is_loan_approved / is_claim_fraudulent as 0/1. It is recommended to use label encoding when the categorical variable has no ordered relationship (e.g., color and status), ordinal encoding when it has an ordered relationship (e.g., performance, risk), and one-hot encoding when the categorical data is binary in nature (e.g., gender, is_loan_approved, is_claim_fraudulent). Libraries are available in R and Python to implement these encoding methods. In some cases, a set of dummy or derived variables is created, especially when handling 'date' data types. Once the categorical text data is converted into numeric data, it is ready to be fed to the model. As a last step, choose the features that help improve model accuracy, using techniques such as Univariate Selection (a statistical measure), Feature Importance (a model property), and the Correlation Matrix (which identifies the features most related to the target variable). These methods detect collinearity, where two variables are highly correlated and contain similar information about the variance within a given dataset. The Variance Inflation Factor (VIF) is useful for detecting multicollinearity, where three or more highly correlated variables are included in the model. The key points in exploratory data analysis are represented in Fig. 2, as shown below.
Fig 2: Exploratory Data Analysis
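The encoding strategies described above can be illustrated with a short Python sketch using pandas and scikit-learn; the example dataframe and its values are made up.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

df = pd.DataFrame({
    "color":  ["red", "green", "blue"],      # nominal: no ordered relationship
    "risk":   ["low", "high", "medium"],     # ordinal: has an order
    "gender": ["male", "female", "female"],  # binary
})

# Label encoding for unordered categories such as color.
df["color_enc"] = LabelEncoder().fit_transform(df["color"])

# Ordinal encoding with an explicit category order for risk.
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])
df["risk_enc"] = encoder.fit_transform(df[["risk"]]).ravel()

# One-hot encoding (via pandas get_dummies) for the binary gender variable.
df = pd.concat([df, pd.get_dummies(df["gender"], prefix="gender")], axis=1)
print(df)
```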
Building an ML model requires splitting the data into a 'training set' and a 'testing set', typically in a ratio of 80:20 or 70:30. A range of supervised (for labelled data) and unsupervised (for unlabelled data) algorithms is available, and the choice depends on the nature of the input data and the business outcome to predict. For labelled data, it is recommended to choose a logistic regression algorithm if the outcome to predict is binary (0/1); a decision tree classifier, random forest classifier, or KNN if the outcome is multi-class (1/2/3/...); and a regression algorithm (e.g., linear regression, decision tree regressor, random forest regressor) if the outcome is continuous. Clustering (unsupervised) algorithms, such as k-means clustering, are preferred for analysing unlabelled/unstructured (text) data. Artificial Neural Network (ANN) algorithms are suggested for other unstructured data types (image/voice): Convolutional Neural Networks (CNN) for image recognition, and Recurrent Neural Networks (RNN) for voice recognition and Natural Language Processing (NLP). The model is built using the training dataset, and predictions are made on the test dataset. Deep learning (neural network) models are preferred over regression (classical ML) models for better performance, as they introduce an extra layer of non-linearity through the Activation Function (AF). Algorithm selection under various scenarios is represented in Fig. 3, as shown below.
Fig 3: Choose the Right Algorithm
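As a minimal illustration of the split-train-predict flow for labelled data, the following Python sketch uses scikit-learn, with a synthetic dataset standing in for real business data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic labelled dataset standing in for prepared business data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)

# 80:20 train/test split, stratified to preserve the class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Binary outcome: start with logistic regression.
binary_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# A multi-class outcome would use a classifier such as random forest instead.
forest_model = RandomForestClassifier(n_estimators=100, random_state=42)
forest_model.fit(X_train, y_train)

# Build on the training set, predict on the test set.
y_pred = binary_model.predict(X_test)
```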
Computation of model performance is the next logical step in choosing the right model. The performance metrics decide the final model selection: accuracy, precision, recall, and F1 score (the weighted average of precision and recall), along with the confusion matrix, for classification models, and the coefficient of determination for regression models. It is not recommended to rely on accuracy to evaluate classification models trained on imbalanced/skewed datasets; precision and recall should be computed instead to choose the right classification model.
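Continuing the illustrative sketch above (binary_model, X_test, and y_test come from the earlier synthetic split), these metrics can be computed with scikit-learn as follows.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_pred = binary_model.predict(X_test)

print("accuracy :", accuracy_score(y_test, y_pred))
print("precision:", precision_score(y_test, y_pred))
print("recall   :", recall_score(y_test, y_pred))
print("F1 score :", f1_score(y_test, y_pred))
print("confusion matrix:\n", confusion_matrix(y_test, y_pred))

# For regression models, the coefficient of determination (R^2) applies:
# from sklearn.metrics import r2_score
```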
Hyperparameter tuning is a highly recommended step in the process, continued until model performance reaches around 80%-85%. For example, the random forest algorithm takes maximum depth, maximum number of features, number of trees, etc., as hyperparameters, which can be tuned intuitively to improve model accuracy. Similarly, a neural network takes the number of layers, batch size, number of epochs, number of samples, etc. The grid-search method is recommended for finding the optimal hyperparameters of a model, i.e., those that yield the most 'accurate' predictions. It is also recommended to perform cross-validation, using the k-fold technique, because an apparent improvement in model accuracy may in fact be due to overfitting (a model too sensitive to the training data, unable to generalize) or underfitting (a model too generalized). To avoid overfitting: increase the training sample size, which introduces more patterns; reduce the number of features, which avoids complexity; or apply regularization using the Ridge and Lasso methods, which add a penalty that reduces error. Similarly, to avoid underfitting: increase model complexity, for example by moving from a linear to a non-linear model or adding more hidden layers to a neural network; or add more features that introduce hidden patterns. Adding more data volume, however, does not solve underfitting; it only hampers model performance.
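Below is a minimal grid-search sketch with k-fold cross-validation, again using scikit-learn and the earlier synthetic split; the parameter grid values are illustrative, not tuned recommendations.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative hyperparameter grid for a random forest.
param_grid = {
    "n_estimators": [100, 200],     # number of trees
    "max_depth": [5, 10, None],     # maximum depth
    "max_features": ["sqrt", 0.5],  # maximum number of features per split
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,            # 5-fold cross-validation guards against overfitting
    scoring="f1",
)
search.fit(X_train, y_train)

print("best params:", search.best_params_)
print("best CV F1 :", search.best_score_)
```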
Finally, choose the model with optimum performance.
ML Model Development Best Practices: The recommended ML model development best practices are:
(1) Form a clear hypothesis for the identified business problem before attribute identification.
(2) Build the model with a basic algorithm first, such as logistic regression or a decision tree, and compile its performance metrics; this gives enough confidence about the relevance of the data before moving to fancier algorithms such as neural networks.
(3) Keep intermediate checkpoints while building the model to track its hyperparameters and the associated performance metrics; this gives the ability to train the model incrementally and to make good judgements about performance versus training time.
(4) Use real-world production data for training the model to improve the correctness of predictions.