Get your machine learning programs right every time - the most comprehensive guide ever (with code)!
Krishna Yogi Kolluru
Data Science Architect | ML | GenAI | Speaker | ex-Microsoft | ex- Credit Suisse | IIT - NUS Alumni | AWS & Databricks Certified Data Engineer | T2 Skilled worker
As a data scientist, I would fumble once in a while when picking the right algorithm for a given problem and end up spending quite a bit of time on the search.
The idea of this article is to follow a scientific approach to navigating the machine learning algorithm search space; a similar article for deep learning will follow.
A modern machine learning pipeline breaks down into the following parts:
- Data Peeking
- Visualization
- Data scaling
- Encoding
- Feature selection
- Train and Test splits
- Performance metrics
- Multi-algo search (Classification and Regression )
- Ensemble methods
- Performance Tuning
- Save and Load models
- Dimensionality Reduction (pending)
Data peeking is the very first step in data science.
After importing the data, here are the things to look at (a short code sketch follows the list):
- data.head() - prints the first 5 rows by default (pass a number for more) so you can physically inspect the data.
- data.describe() - summary statistics (count, mean, standard deviation, min, quartiles, max) for each numeric column.
- data.dtypes - lists the data type of each column (note that dtypes is an attribute, not a method).
- data.skew() - skew of each univariate distribution.
- data.groupby('class').size() - class distribution (useful for classification problems).
- data.corr(method='pearson') - correlations between attributes.
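Here is a minimal sketch of these calls with pandas; the file name data.csv and the 'class' column are placeholders for your own dataset, and the numeric_only flags assume a reasonably recent pandas (1.5+):
import pandas as pd

# load the dataset ('data.csv' and the 'class' column are placeholders for your own data)
data = pd.read_csv('data.csv')

print(data.head())                                      # first 5 rows by default
print(data.describe())                                  # summary statistics per numeric column
print(data.dtypes)                                      # data type of each column
print(data.skew(numeric_only=True))                     # skew of each numeric attribute
print(data.groupby('class').size())                     # class distribution
print(data.corr(method='pearson', numeric_only=True))   # pairwise Pearson correlations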
Visualization helps in literally visualizing the data and often gives far better insights into it.
Univariate Plots
- Histograms
- Density Plots
- Box and Whisker Plots
Multivariate Plots
- Correlation Matrix Plot
- Scatter Plot Matrix
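A minimal sketch of these plots with pandas and matplotlib; the file name is a placeholder, and the layout=(3, 3) grid assumes up to nine numeric columns, so adjust it to your data:
import pandas as pd
from matplotlib import pyplot

data = pd.read_csv('data.csv')   # placeholder file name

# univariate plots
data.hist()                                                             # histograms
data.plot(kind='density', subplots=True, layout=(3, 3), sharex=False)   # density plots
data.plot(kind='box', subplots=True, layout=(3, 3), sharex=False)       # box and whisker plots

# multivariate plots
pyplot.matshow(data.corr(numeric_only=True))                            # correlation matrix plot
pd.plotting.scatter_matrix(data)                                        # scatter plot matrix
pyplot.show()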
Scaling your data to suit the algorithm
Different algorithms require your data to be scaled differently, so you need to be watchful about this.
- Rescale data
Many algorithms, especially those based on gradient descent or distance calculations, perform better when attributes are rescaled to the range 0 to 1.
scaler = MinMaxScaler(feature_range=(0, 1))
- Standardize data
Standardization is a useful technique to transform attributes with a Gaussian distribution and differing means and standard deviations to a standard Gaussian distribution with a mean of 0 and a standard deviation of 1.
StandardScaler().fit(X)
- Normalize data
Normalizing involves rescaling each observation (row) to have a length of 1 (called a unit norm or a vector with the length of 1 in linear algebra). This pre-processing method can be useful for sparse datasets (lots of zeros) with attributes of varying scales when using algorithms that weight input values such as neural networks and algorithms that use distance measures such as k-Nearest Neighbors.
Normalizer().fit(X)
- Binarize data
You can transform your data using a binary threshold. All values above the threshold are marked 1 and all equal to or below are marked as 0.
Binarizer(threshold=0.0).fit(X)
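A minimal sketch of all four transforms on a small made-up feature matrix (a stand-in for your own X):
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer, Binarizer

# a small illustrative feature matrix (stand-in for your own X)
X = np.array([[1.0, 200.0, -3.0],
              [2.0, 150.0,  0.5],
              [3.0, 300.0,  7.0]])

rescaled_X     = MinMaxScaler(feature_range=(0, 1)).fit_transform(X)   # each column rescaled to 0-1
standardized_X = StandardScaler().fit_transform(X)                     # each column to mean 0, std 1
normalized_X   = Normalizer().fit_transform(X)                         # each row rescaled to unit length
binarized_X    = Binarizer(threshold=0.0).fit_transform(X)             # 1 above the threshold, 0 otherwise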
Encoding - Encoding is essential when you have categorical data, because most machine learning algorithms cannot work with categorical values directly.
There are two main types of encoders
Label encoder - Label encoding converts each category to an integer: LabelEncoder().fit_transform()
One hot encoder
One hot encoding is more sophisticated than simple label encoding: it creates a separate binary column for each category, which ensures the algorithm does not treat the labels as ordered numbers and infer a ranking that is not there.
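A minimal sketch of both encoders on a toy column; the sparse_output flag assumes scikit-learn 1.2 or later (older versions call it sparse):
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# a toy categorical column (stand-in for your own data)
colors = pd.DataFrame({'color': ['red', 'green', 'blue', 'green']})

# label encoding: each category becomes an integer, e.g. [2, 1, 0, 1]
labels = LabelEncoder().fit_transform(colors['color'])

# one hot encoding: one binary column per category
# sparse_output=False returns a dense array (scikit-learn 1.2+)
onehot = OneHotEncoder(sparse_output=False).fit_transform(colors[['color']])
print(labels)
print(onehot)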
Feature Selection
Three benefits of performing feature selection before modeling your data are:
Reduces Overfitting: Less redundant data means less opportunity to make decisions based on noise.
Improves Accuracy: Less misleading data means modeling accuracy improves.
Reduces Training Time: Less data means that algorithms train faster.
Techniques
Univariate Selection. Statistical tests can be used to select those features that have the strongest relationship with the output variable.
SelectKBest(score_func=chi2, k=4) (this uses the chi-squared test and selects the 4 best features)
Recursive Feature Elimination works by recursively removing attributes and building a model on those attributes that remain.
Principal Component Analysis (PCA) uses linear algebra to transform the dataset into a compressed form. Generally this is called a data reduction technique.
Feature Importance Bagged decision trees like Random Forest and Extra Trees can be used to estimate the importance of features.
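A minimal sketch of all four techniques, using the built-in iris dataset purely for illustration:
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2, RFE
from sklearn.decomposition import PCA
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)   # small built-in dataset, for illustration only

# univariate selection: keep the k best features by chi-squared score
X_best = SelectKBest(score_func=chi2, k=2).fit_transform(X, y)

# recursive feature elimination around a base estimator
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=2).fit(X, y)
print(rfe.support_)                 # which features were kept

# PCA: project the data onto a smaller number of components
X_pca = PCA(n_components=2).fit_transform(X)

# feature importance from a bagged tree ensemble
model = ExtraTreesClassifier(n_estimators=100, random_state=7).fit(X, y)
print(model.feature_importances_)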
Splitting your data into train and test sets
Train and Test Sets - Simply split your data into a training set and a test set so that you can evaluate the model on data it has not seen during training.
k-fold Cross Validation works by splitting the dataset into k parts (e.g. k = 5 or k = 10). Each split of the data is called a fold. The algorithm is trained on k - 1 folds with one held back and tested on the held back fold. This is repeated so that each fold of the dataset is given a chance to be the held back test set.
Leave One Out Cross-Validation You can configure cross-validation so that the size of the fold is 1 (k is set to the number of observations in your dataset). This variation of cross-validation is called leave-one-out cross-validation. The downside is that it is computationally very expensive.
Repeated Random Test-Train Splits. Another variation on k-fold cross-validation is to create a random split of the data like the train/test split described above, but repeat the process of splitting and evaluation of the algorithm multiple times, like cross-validation. This has the speed of using a train/test split and the reduction in variance in the estimated performance of k-fold cross-validation.
k-fold cross-validation is generally the preferred option, provided you have reasonable time and computing power at hand.
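A minimal sketch of the four resampling strategies, again on the built-in iris dataset for illustration only:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, KFold, LeaveOneOut, ShuffleSplit, cross_val_score
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)           # built-in dataset, for illustration only
model = LogisticRegression(max_iter=1000)

# simple train/test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
print(model.fit(X_train, y_train).score(X_test, y_test))

# k-fold cross-validation (k = 10)
print(cross_val_score(model, X, y, cv=KFold(n_splits=10, shuffle=True, random_state=7)).mean())

# leave-one-out cross-validation (expensive on large datasets)
print(cross_val_score(model, X, y, cv=LeaveOneOut()).mean())

# repeated random train/test splits
print(cross_val_score(model, X, y, cv=ShuffleSplit(n_splits=10, test_size=0.33, random_state=7)).mean())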
Performance Metrics
Performance metrics measure how well a model is performing and drive how it is optimized, which makes them a vital part of model building.
Classification Accuracy Classification accuracy is the number of correct predictions made as a ratio of all predictions made
Logarithmic Loss Logarithmic loss (or logloss) is a performance metric for evaluating the predictions of probabilities of membership to a given class. The scalar probability between 0 and 1 can be seen as a measure of confidence for a prediction by an algorithm.
Area Under ROC Curve
Area under ROC Curve (or AUC for short) is a performance metric for binary classification problems. The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at different classification thresholds, and the AUC summarizes a model's ability to discriminate between the positive and negative classes. An area of 1.0 represents a model that made all predictions perfectly. An area of 0.5 represents a model that is no better than random. The ROC curve can also be read in terms of sensitivity (TPR, or recall) and specificity (1 - FPR).
Confusion Matrix
The confusion matrix is a handy presentation of the accuracy of a model with two or more classes. The table presents predictions on the x-axis and accuracy outcomes on the y-axis. The cells of the table are the number of predictions made by a machine learning algorithm.
Classification Report
The scikit-learn library provides a convenience report when working on classification problems to give you a quick idea of the accuracy of a model using a number of measures. The classification_report() function displays the precision, recall, F1-score and support for each class.
Mean Absolute Error.
The Mean Absolute Error (or MAE) is the average of the absolute differences between predictions and actual values. It gives an idea of the magnitude of the errors, but not their direction.
Mean Squared Error.
The Mean Squared Error (or MSE) is much like the mean absolute error in that it provides a gross idea of the magnitude of the error. Taking the square root of the mean squared error (giving the root mean squared error, or RMSE) converts the units back to the original units of the output variable and can be meaningful for description and presentation.
R2
The R2 (or R squared) metric provides an indication of the goodness of fit of a set of predictions to the actual values. In statistical literature, this measure is called the coefficient of determination. Values close to 1 indicate a good fit and values close to 0 indicate a poor fit (it can even be negative for models that do worse than simply predicting the mean).
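A minimal sketch of how these metrics are computed in scikit-learn, using the built-in breast cancer dataset purely for illustration; regression metrics follow the same pattern via the scoring parameter:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split, KFold
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report

X, y = load_breast_cancer(return_X_y=True)      # binary dataset, for illustration only
model = LogisticRegression(max_iter=5000)
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# classification metrics via the scoring parameter
for metric in ('accuracy', 'neg_log_loss', 'roc_auc'):
    print(metric, cross_val_score(model, X, y, cv=kfold, scoring=metric).mean())

# the confusion matrix and classification report need explicit predictions
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=7)
predictions = model.fit(X_train, y_train).predict(X_test)
print(confusion_matrix(y_test, predictions))
print(classification_report(y_test, predictions))

# regression metrics use the same pattern, e.g. scoring='neg_mean_absolute_error',
# 'neg_mean_squared_error' or 'r2'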
Algo Search - Classification
If you have read this far, thank you for your patience. Having said that, this is arguably the most important part of your ML program.
Here are two linear machine learning algorithms:
Logistic Regression
Logistic regression assumes a Gaussian distribution for the numeric input variables and can model binary classification problems. You can construct a logistic regression model using the LogisticRegression class
Linear Discriminant Analysis
Linear Discriminant Analysis or LDA is a statistical technique for binary and multiclass classification. It too assumes a Gaussian distribution for the numerical input variables.
Then let's look at four nonlinear machine learning algorithms:
k-Nearest Neighbors
The k-Nearest Neighbors algorithm (or KNN) uses a distance metric to find the k most similar instances in the training data for a new instance and takes the majority class (or, for regression, the mean outcome) of those neighbors as the prediction.
Naive Bayes
Naive Bayes calculates the probability of each class and the conditional probability of each class given each input value. These probabilities are estimated for new data and multiplied together, assuming that they are all independent (a simple or naive assumption). When working with real-valued data, a Gaussian distribution is assumed to easily estimate the probabilities for input variables using the Gaussian Probability Density Function.
Classification and Regression Trees
Classification and Regression Trees (CART or just decision trees) construct a binary tree from the training data. Split points are chosen greedily by evaluating each attribute and each value of each attribute in the training data in order to minimize a cost function (like the Gini index).
Support Vector Machines
Support Vector Machines (or SVM) seek a line that best separates two classes. Those data instances that are closest to the line that best separates the classes are called support vectors and influence where the line is placed. SVM has been extended to support multiple classes. Of particular importance is the use of different kernel functions via the kernel parameter. A powerful Radial Basis Function is used by default. You can construct an SVM model using the SVC class
Algo Search - Regression
Linear Regression.
Linear regression assumes that the input variables have a Gaussian distribution. It is also assumed that input variables are relevant to the output variable and that they are not highly correlated with each other (a problem called collinearity). You can construct a linear regression model using the LinearRegression class
Ridge Regression.
Ridge regression is an extension of linear regression where the loss function is modified to minimize the complexity of the model measured as the sum squared value of the coefficient values (also called the L2-norm). You can construct a ridge regression model by using the Ridge class
LASSO Linear Regression
The Least Absolute Shrinkage and Selection Operator (or LASSO for short) is a modification of linear regression, like ridge regression, where the loss function is modified to minimize the complexity of the model measured as the sum the absolute value of the coefficient values (also called the L1-norm). You can construct a LASSO model by using the Lasso class
Elastic Net Regression
ElasticNet is a form of regularization regression that combines the properties of both Ridge Regression and LASSO regression. It seeks to minimize the complexity of the regression model (magnitude and number of regression coefficients) by penalizing the model using both the L2-norm (sum squared coefficient values) and the L1-norm (sum absolute coefficient values).
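Since the comparison harness below is classification-only, here is a minimal regression counterpart, using the built-in diabetes dataset purely for illustration:
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold, cross_val_score
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet

X, y = load_diabetes(return_X_y=True)        # built-in regression dataset, for illustration only
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

models = [('Linear', LinearRegression()),
          ('Ridge', Ridge()),
          ('LASSO', Lasso()),
          ('ElasticNet', ElasticNet())]

# compare the four regressors with the same folds and the same metric
for name, model in models:
    scores = cross_val_score(model, X, y, cv=kfold, scoring='neg_mean_squared_error')
    print('%s: %.1f' % (name, scores.mean()))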
Now let's learn to compare all of these machine learning algorithms on a given problem in one shot.
The key to a fair comparison of machine learning algorithms is ensuring that each algorithm is evaluated in the same way on the same data. You can achieve this by forcing each algorithm to be evaluated on a consistent test harness.
# Compare Algorithms
from pandas import read_csv
from matplotlib import pyplot
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
# X and Y are assumed to be your feature matrix and target vector,
# e.g. loaded with read_csv and split into input and output columns
# prepare models
models = []
models.append(('LR', LogisticRegression(max_iter=1000)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('CART', DecisionTreeClassifier()))
models.append(('NB', GaussianNB()))
models.append(('SVM', SVC()))
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = KFold(n_splits=10, shuffle=True, random_state=7)
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
# boxplot algorithm comparison
fig = pyplot.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
pyplot.boxplot(results)
ax.set_xticklabels(names)
pyplot.show()
Improve Performance with Ensembles
The three most popular methods for combining the predictions from different models are:
Bagging. Building multiple models (typically of the same type) from different subsamples of the training dataset.
Boosting. Building multiple models (typically of the same type) each of which learns to fix the prediction errors of a prior model in the sequence of models.
Voting. Building multiple models (typically of differing types) and using simple statistics (like the mean or a majority vote) to combine their predictions.
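A minimal sketch of all three, using the built-in breast cancer dataset purely for illustration (BaggingClassifier uses decision trees as its default base estimator):
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import BaggingClassifier, AdaBoostClassifier, VotingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)   # built-in dataset, for illustration only
kfold = KFold(n_splits=10, shuffle=True, random_state=7)

# bagging: many trees built on bootstrap samples of the training data
bagging = BaggingClassifier(n_estimators=100, random_state=7)

# boosting: each model tries to correct the errors of the previous one
boosting = AdaBoostClassifier(n_estimators=30, random_state=7)

# voting: combine different model types by majority vote
voting = VotingClassifier(estimators=[('lr', LogisticRegression(max_iter=5000)),
                                      ('cart', DecisionTreeClassifier()),
                                      ('svm', SVC())])

for name, model in [('Bagging', bagging), ('Boosting', boosting), ('Voting', voting)]:
    print(name, cross_val_score(model, X, y, cv=kfold).mean())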
Performance Tuning
Algorithm tuning is the final step in the process of applied machine learning before finalizing your model. It is one of the easiest ways to improve performance, often dramatically.
It is sometimes called hyperparameter optimization where the algorithm parameters are referred to as hyperparameters, whereas the coefficients found by the machine learning algorithm itself are referred to as parameters.
Grid Search Parameter Tuning
Grid search is an approach to parameter tuning that will methodically build and evaluate a model for each combination of algorithm parameters specified in a grid. You can perform a grid search using the GridSearchCV class
Random Search Parameter Tuning
Random search is an approach to parameter tuning that will sample algorithm parameters from a random distribution (i.e. uniform) for a fixed number of iterations. A model is constructed and evaluated for each combination of parameters chosen. You can perform a random search for algorithm parameters using the RandomizedSearchCV class.
Algorithm parameter tuning is an important step for improving algorithm performance right before presenting results or preparing a system for production.
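A minimal sketch of both approaches, tuning the alpha parameter of a Ridge model on the built-in diabetes dataset purely for illustration:
import numpy as np
from scipy.stats import uniform
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = load_diabetes(return_X_y=True)        # built-in dataset, for illustration only

# grid search: evaluate every alpha value in the grid
grid = GridSearchCV(estimator=Ridge(),
                    param_grid={'alpha': np.array([1.0, 0.1, 0.01, 0.001, 0.0001, 0.0])})
grid.fit(X, y)
print(grid.best_score_, grid.best_estimator_.alpha)

# random search: sample alpha from a uniform distribution for a fixed number of iterations
rand = RandomizedSearchCV(estimator=Ridge(),
                          param_distributions={'alpha': uniform()},
                          n_iter=100, random_state=7)
rand.fit(X, y)
print(rand.best_score_, rand.best_estimator_.alpha)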
Save and Load Models
Saving and loading models is often ignored by inexperienced data scientists, but it adds robustness to your work. Once you save your model by pickling it, you can simply load it later instead of retraining from scratch, saving quite a bit of time and compute.
Preserve your model with Pickle (pun intended)
Pickle is the standard way of serializing objects in Python. You can use pickle to serialize your machine learning model and save the serialized format to a file. Later you can load this file to deserialize your model and use it to make new predictions.
from pickle import dump, load
# save the model to disk
filename = 'finalized_model.sav'
dump(model, open(filename, 'wb'))
# load the model from disk
loaded_model = load(open(filename, 'rb'))
result = loaded_model.score(X_test, Y_test)
Finalize Your Model with Joblib
The Joblib library is part of the SciPy ecosystem and provides utilities for pipelining Python jobs. It provides utilities for saving and loading Python objects that make use of NumPy data structures, efficiently. This can be useful for some machine learning algorithms that require a lot of parameters or store the entire dataset (e.g. k-Nearest Neighbors).
from joblib import dump, load   # sklearn.externals.joblib has been removed; use joblib directly
# save the model to disk
filename = 'finalized_model.sav'
dump(model, filename)
# load the model from disk
loaded_model = load(filename)
That's it fellas, hope you liked this. Feel free to correct my pipeline with any comments and suggestions.
References:
Various articles from Medium.com
Jason Brownlee's Machine Learning Mastery blog