Procurement Innovation with Machine Learning!
Gaurav Sharma
I love AI and Procurement, and everything around Spend Analysis, Negotiations and Digital Procurement.
My motivation is to turn procurement managers into data scientists and vice versa! Join in! (Links to previous chapters are at the end of this post.)
Chapter 1 — Day 2
(Noob Day)
Dimensionality Reduction
Dimensionality Reduction is a subfield of unsupervised learning.
In procurement problem statements, we often deal with data of high dimensionality. In simpler terms, this means each sample comes with a large number of its own measurements (or properties).
The higher the dimensionality, the slower the computational performance of our machine learning algorithm. Unsupervised Dimensionality Reduction is a common approach in feature preprocessing. It helps us do the following:
1.) Remove noise from the data. Noise in the data can degrade the predictive performance of the algorithm.
2.) Compress the data onto a smaller dimensional subspace while retaining most of the relevant information.
3.) Make data easier to visualize. For example, 6-dimensional data can be projected onto 3 dimensions for plotting.
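To make this concrete, here is a minimal sketch of one common dimensionality reduction technique, Principal Component Analysis (PCA), using scikit-learn. It compresses the 4-dimensional Iris dataset (which we introduce in the next section) down to 2 dimensions:

```python
# A minimal PCA sketch: compress 4-dimensional data to 2 dimensions
# while keeping the directions of highest variance.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X = load_iris().data          # feature matrix, shape (150, 4)
pca = PCA(n_components=2)     # keep the 2 most informative directions
X_2d = pca.fit_transform(X)   # compressed matrix, shape (150, 2)

print(X.shape, "->", X_2d.shape)
print("variance retained:", pca.explained_variance_ratio_.sum())
```

Even after dropping half the dimensions, most of the variance (i.e. information) in the data is retained.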
Basic Terminology & Notations
Let's begin learning by doing. The Step 1 exercise in any machine learning journey is playing with the Iris dataset.
The Iris dataset is like the "Hello World" of programming languages. It contains the measurements of 150 iris flowers from 3 different species: Setosa, Versicolor and Virginica.
Flower measurements are stored in the columns (also called features) of the dataset. The measurements are in centimeters.
We will use matrix and vector notation to refer to our dataset from now on. Each sample is represented as a separate row in a feature matrix X, and each feature is stored as a separate column.
So, X ∈ ℝ^(150×4): 150 samples (rows) and 4 features (columns).
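Here is a minimal sketch of loading the Iris dataset with scikit-learn and inspecting the 150×4 feature matrix X described above:

```python
# Load the Iris dataset and inspect the feature matrix X.
from sklearn.datasets import load_iris

iris = load_iris()
X = iris.data                 # feature matrix, shape (150, 4)
y = iris.target               # species labels encoded as 0, 1, 2

print(X.shape)                # (150, 4)
print(iris.feature_names)     # sepal/petal length and width, in cm
print(iris.target_names)      # ['setosa' 'versicolor' 'virginica']
```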
Roadmap for building machine learning models
There are 3 major components of building a machine learning model.
a.) Preprocessing: It's all about getting the data into the right shape.
This is one of the most crucial steps in any machine learning model. Our objective is to extract meaningful features from the raw dataset.
In preprocessing, we clean the data first. By cleaning the data, I mean the following:
(i) Removing erroneous values
(ii) Removing blank values
(iii) Normalizing the ranges: transforming the values into the range [0, 1]
(iv) Ignoring the outliers
(v) Ensuring the data is correctly labeled
(vi) Removing highly correlated and redundant data
This is by far not an exhaustive list.
Therefore, Dimensionality Reduction techniques are useful here to compress the features into a lower-dimensional subspace. (Also read about the signal-to-noise ratio.)
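To make the cleaning steps above concrete, here is a minimal sketch with pandas. The file name 'spend.csv' and its column names are hypothetical placeholders for your own procurement data:

```python
# A minimal data-cleaning sketch; 'spend.csv' and its column names
# are hypothetical placeholders, not a real dataset.
import pandas as pd

df = pd.read_csv("spend.csv")

# (ii) remove rows with blank values
df = df.dropna()

# (i) remove obviously erroneous values, e.g. negative prices
df = df[df["unit_price"] >= 0]

# (iii) normalize a numeric column into the range [0, 1]
price = df["unit_price"]
df["unit_price_scaled"] = (price - price.min()) / (price.max() - price.min())
```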
After cleaning the data, we divide our dataset into two parts:
(i) Training Dataset:
The training dataset is used to build and train our machine learning model.
For example, if we are doing regression analysis, the model will learn the relationship between the input features and the target variable from this data.
(ii) Testing Dataset:
The test dataset is used to evaluate our final model.
Often, the split is done on a random basis, as in the sketch below.
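Here is a minimal sketch of such a random split using scikit-learn's train_test_split, holding out 30% of the Iris data for testing (the 70/30 ratio is a common choice, not a rule):

```python
# Randomly split the Iris data into training and test sets.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

print(X_train.shape, X_test.shape)   # (105, 4) (45, 4)
```

The stratify=y argument keeps the class proportions the same in both halves, and random_state makes the split reproducible.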
b.) Learning (Training):
There are many different machine learning algorithms, and the selection of an algorithm depends on many factors, including the business case itself.
In practice, we compare different algorithms in order to train and select the best-performing model. However, we must be clear about how we are going to measure the results (and performance). One commonly used metric is accuracy, defined as the proportion of correctly classified instances.
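As a rough sketch of such a comparison (scored by cross-validation, keeping the test set untouched for the final evaluation), we might compare two candidate classifiers by accuracy:

```python
# Compare two classifiers by mean accuracy across 5 cross-validation folds.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

for model in (LogisticRegression(max_iter=1000), KNeighborsClassifier()):
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    print(type(model).__name__, round(scores.mean(), 3))
```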
Each algorithm comes with its own set of setting parameters, also called hyperparameters. There are default settings to begin with, but we change these hyperparameters according to the performance of our algorithm.
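As an example, here is a minimal hyperparameter-tuning sketch with scikit-learn's GridSearchCV, trying a few values of k for a k-nearest-neighbors classifier (the parameter grid is an illustrative choice, not a recommendation):

```python
# Try several values of n_neighbors and keep the best-scoring one.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

search = GridSearchCV(KNeighborsClassifier(),
                      param_grid={"n_neighbors": [1, 3, 5, 7, 9]},
                      cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```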
c.) Evaluation and Prediction:
After we finalize the best-performing algorithm, we use our test dataset to estimate how well it performs on unseen data, i.e. to estimate the error percentage. Once we are satisfied with this error percentage, we can then use the model to make predictions on new data.
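Putting it together, here is a minimal sketch of this final step on the Iris data (the new flower's measurements below are made up for illustration):

```python
# Evaluate the chosen model on the held-out test set,
# then predict the species of a new, unseen flower.
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1, stratify=y)

model = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))

new_flower = [[5.1, 3.5, 1.4, 0.2]]          # measurements in cm
print("predicted species:", model.predict(new_flower))
```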
Important Python Packages for Machine Learning
We will be using the Python language for this series, as it is one of the most popular languages for machine learning. We will be using the following libraries:
1.) scikit-learn
2.) NumPy
3.) SciPy
4.) Matplotlib
5.) pandas
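If you don't have them yet, all five can typically be installed in one go with pip: pip install scikit-learn numpy scipy matplotlib pandas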
This marks the end of Chapter 1.
In Chapter 2, we will start with the implementation of a classification algorithm, the perceptron.
See you tomorrow!
Link to Chapter 1, Day 1: https://www.dhirubhai.net/pulse/procurement-innovation-machine-learning-gaurav-sharma/
Note: I am using the book Python Machine Learning by Sebastian Raschka and Vahid Mirjalili for this series.