S1E2 : Machine Learning
In the 1st Episode, we covered the evolution and definition of AI, along with AI's components, its applications in the real world, and the types of AI, and we also tried to understand why AI has suddenly become so prominent.
In this 2nd Episode, we will cover ML and its types.
Introduction to Machine Learning
The term Machine Learning was coined by Arthur Samuel in 1959, just three years after the term AI was coined.
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P if its performance at tasks in T, as measured by P, improves with experience E.” - Tom M. Mitchell
Machine Learning (ML), technically speaking,
- is a predefined programming model or algorithm,
- trained on a large amount of data
- to make predictions or suggestions
- without using explicit instructions, relying on patterns and inference instead.
It is based on the idea that systems can be programmed to learn automatically from their experience. By analyzing data and identifying patterns, machines can improve and make better predictions or decisions with minimal human intervention.
In basic terms, ML is the process of training a piece of software, called a model, to make useful predictions using a data set. This predictive model can then serve up predictions about previously unseen data. We use these predictions to take action in a product; for example, the system predicts that a user will like a certain video, so the system recommends that video to the user.
In short, ML uses data to answer specific questions. Today, enjoying something of a resurgence, ML is where a computer system is fed large amounts of data, which it then uses to learn how to carry out a specific task, such as understanding speech or captioning a photograph. It can help automate a lot of processes that humans otherwise have to repeat on a daily basis. Additionally, it can make decisions that are based on statistics and probability and may, in some cases, be better than human decisions, which can be affected by irrationality or bias.
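To make this concrete, here is a minimal, hypothetical sketch (using scikit-learn and made-up toy data, not a real recommender system) of training a model on labeled examples and then asking it for a prediction on previously unseen data:

```python
# A toy sketch: "train a model on past data, then predict on unseen data".
# The features and data here are invented purely for illustration.
from sklearn.tree import DecisionTreeClassifier

# Each row: [minutes_watched, clicked_like_before]  ->  label: did the user finish the video?
X_train = [[30, 1], [2, 0], [25, 1], [1, 0], [40, 1], [3, 0]]
y_train = [1, 0, 1, 0, 1, 0]

model = DecisionTreeClassifier().fit(X_train, y_train)

# Previously unseen data: the model serves up a prediction we could act on
# (e.g. recommend the video if the predicted label is 1).
print(model.predict([[28, 1]]))   # expected: [1]
```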
Development of ML
Two important realizations supported the development of Machine Learning algorithms as a way to train AI entities quickly and efficiently -
- In 1959, Arthur Samuel realized it might be possible for a computer to “teach itself” to learn.
- The second realization came about more recently, and is based on using the Internet, and the incredible amount of digital information available for training AI entities. With the availability of Big Data by way of the Internet, engineers recognized it would be much more efficient to design AI entities to imitate human thinking. They could then be plugged into the Internet, allowing them to learn from a broad, extensive information base.
Are ML and AI the same thing?
Artificial Intelligence and Machine Learning are two popular catchphrases that are often used interchangeably. The two are not the same thing, and the assumption they “are” can lead to confusing breakdowns in communications. Both terms are used frequently when discussing Analytics and Big Data, but the two catchphrases do not have the same meaning.
Artificial Intelligence (AI) came first, as a concept, with Machine Learning (ML), as a method for achieving Artificial Intelligence, emerging later.
AI is a broad process or methodology by which we make machines learn to behave like humans. Machine Learning is one way of achieving this: we feed a machine a lot of data so that it can make its own decisions.
Need for ML
The need for ML has existed since the technical revolution itself. As technology became the center of everything, we began generating an immeasurable amount of data. As per research, we generate around 2.5 quintillion bytes of data every single day, and it's only going to grow from here. By 2020, it was estimated that 1.7 MB of data would be created every second for every person on earth.
With the availability of so much data, it is finally possible to build predictive models that can study and analyse complex data to find useful insights and deliver more accurate results. Top tech companies like Netflix and Amazon build such ML models by using tons of data in order to identify profitable opportunities and avoid unnecessary risks.
A few reasons why ML is so important:
- Increase in data generation – with the excessive production of data, we need a method to structure, analyse and draw useful insights from it; this is where ML comes in, to solve problems and find solutions to the most complex tasks faced by an organisation.
- Need to improve decision making – by making use of various algorithms, ML can be used to support decisions, e.g., sales forecasting, stock market predictions, etc.
- Uncover patterns and trends in data – finding hidden patterns and extracting key information from data is the most important part of ML. By building predictive models and using statistical techniques, ML allows you to dig beneath the surface and explore the data at a very minute scale. Understanding data and extracting patterns manually takes a lot of time, but with ML algorithms we can perform similar computations in a fraction of a second.
- Need to solve complex problems – from detecting the genes linked to deadly diseases to building self-driving cars, ML can be used to solve the most complex problems. It has even been used to spot stars roughly 2,400 light years away from our planet.
ML – Frequently used terms
Algorithm : a set of rules and statistical techniques used to learn patterns from data
Model : a model is what you get by training an ML algorithm on data. The difference between an algorithm and a model is that the algorithm maps all the decisions the model is supposed to take, based on the given input, in order to get the correct output; the model then uses the algorithm to draw insights and produce an outcome that is as precise as possible.
Predictor variable : the feature(s) of the data that can be used to predict the output.
Response variable : the feature or output variable that needs to be predicted by using the predictor variable(s).
Training data : when you feed data to the machine, it is typically divided into three data sets; splitting the data into parts is also known as data splitting. One set is the training data, the second is the validation data and the third is the testing data. Training data helps the model identify the key trends and patterns that are essential to predict the output.
Validation data : the sample of data used to provide an unbiased evaluation of a model fit on the training data set while tuning model hyperparameters. The evaluation becomes more biased as skill on the validation data set is incorporated into the model configuration.
Testing data : the sample of data used to provide an unbiased evaluation of the final model fit on the training data set. After the model is trained, it must be tested to evaluate how accurately it can predict the outcome; the testing data is used to measure the efficiency of the model.
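As a rough illustration of this three-way split, here is a sketch assuming scikit-learn and random placeholder data (the 80/10/10 proportions are a design choice, not a rule):

```python
# A sketch of splitting a data set into training, validation, and test sets
# (roughly 80/10/10 here; the exact proportions are a design choice).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.rand(1000, 5)         # 1000 samples, 5 predictor variables
y = np.random.randint(0, 2, 1000)   # binary response variable

# First carve out the test set, then split the remainder into train / validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.1, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=1/9, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```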
ML Process
"Machine Learning changes the way you think about a problem. The focus shifts from a mathematical science to a natural science, running experiments and using statistics, not logic, to analyse its results." - Peter Norvig - Google Research Director
In traditional software engineering, you can reason from requirements to a workable design, but with machine learning, it will be necessary to experiment to find a workable model.
Models will make strange mistakes that are difficult to debug, due to anything from skewed training data to unexpected interpretations of data during training. Furthermore, when machine-learned models are incorporated into products, the interactions can be complicated, making it difficult to predict and test all possible situations.
To address the challenges of transitioning to ML, it is helpful to think of the ML process as an experiment where we run test after test after test to converge on a workable model. Like an experiment, the process can be exciting, challenging, and ultimately worthwhile.
- Set the research goal.
- Make a hypothesis.
- Collect the data.
- Test your hypothesis.
- Analyze your results.
- Reach a conclusion.
- Refine hypothesis and repeat.
Steps to follow to find a solution
- Define objective-
- What are we trying to predict? Will the output be a continuous variable or a discrete variable?
- What are the target features?
- What is the input data, i.e. the different predictor variables?
- What kind of problem are we facing? Binary classification (categorical – use a classification algorithm, e.g., logistic regression, support vector machines, Naive Bayes), clustering, or regression?
- Data gathering –
- What kind of data is needed?
- Is this data available? If yes, from where and how can we get it?
- Can synthetic data be used?
- Preparing data – involves getting rid of inconsistencies in the data, such as missing values or redundant variables, to make it ready for analysis. Inconsistencies can lead to wrongful predictions and insights. This step typically takes up 80% of the process time.
- Transform data into desired format
- Data cleaning – missing variables, corrupted data, unnecessary data
- Data exploration- exploratory data analysis (EDA) involves understanding the patterns and trends in the data. At this stage all the useful insights are drawn and correlations between the variables are understood.
- Building a model – at this stage a predictive model is built by using an ML algorithm such as linear regression, a decision tree, etc.
- ML model is built by using the training data
- The model is the ML algo that predicts the output using the data fed to it.
- Model evaluation – the efficiency of the model is evaluated and any further improvements to the model are implemented. The testing data set is used to check the accuracy of the model and how accurately it can predict the outcome, using methods like parameter tuning, cross-validation, etc.
- Prediction – the final outcome is predicted after performing performance tuning and improving the accuracy of the model.
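A condensed sketch of these steps on synthetic data is shown below (hypothetical; it assumes scikit-learn and invents a toy data set with some missing values, so the exact numbers are meaningless):

```python
# A condensed, hypothetical walk-through of the steps above on synthetic data:
# prepare -> build -> evaluate -> predict.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# "Data gathering": synthetic stand-in for real data.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X[::20, 0] = np.nan  # simulate missing values that data preparation must handle

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# "Preparing data" + "Building a model": impute missing values, scale, then fit.
model = make_pipeline(SimpleImputer(strategy="mean"),
                      StandardScaler(),
                      LogisticRegression())
model.fit(X_train, y_train)

# "Model evaluation" on the held-out test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# "Prediction" on a new, unseen sample.
print("prediction:", model.predict(X_test[:1]))
```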
Types of ML
The types of machine learning algorithms differ in their approach, the type of data they input and output, and the type of task or problem that they are intended to solve.
Often, people talk about ML as having two paradigms, supervised and unsupervised learning. However, it is more accurate to describe ML problems as falling along a spectrum of supervision between supervised and unsupervised learning.
Supervised
Supervised learning is a type of ML where the model is provided with labeled training data.
The labelled data set acts as the teacher that trains the machine to understand patterns in the data; in other words, the labelled data set is the training data set. Every input fed to the model is labelled, and the outputs are the different labelled classes.
In supervised machine learning, you feed the features and their corresponding labels into an algorithm in a process called training. During training, the algorithm gradually determines the relationship between features and their corresponding labels. This relationship is called the model.
To tie it all together, supervised machine learning finds patterns between data and labels that can be expressed mathematically as functions. Given an input feature, you are telling the system what the expected output label is, thus you are supervising the training. The ML system will learn patterns on this labeled data. In the future, the ML system will use these patterns to make predictions on data that it did not see during training.
Supervised learning algorithms build a mathematical model of a set of data that contains both the inputs and the desired outputs. The data is known as training data, and consists of a set of training examples. Each training example has one or more inputs and a desired output, also known as a supervisory signal. In the mathematical model, each training example is represented by an array or vector, sometimes called a feature vector, and the training data is represented by a matrix. Through iterative optimization of an objective function, supervised learning algorithms learn a function that can be used to predict the output associated with new inputs. An optimal function will allow the algorithm to correctly determine the output for inputs that were not a part of the training data. An algorithm that improves the accuracy of its outputs or predictions over time is said to have learned to perform that task.
In order to solve a given problem of supervised learning, one has to perform the following steps:
- Determine the type of training examples. Before doing anything else, the user should decide what kind of data is to be used as a training set.
- Gather a training set. The training set needs to be representative of the real-world use of the function. Thus, a set of input objects is gathered and corresponding outputs are also gathered, either from human experts or from measurements.
- Determine the input feature representation of the learned function. The accuracy of the learned function depends strongly on how the input object is represented. Typically, the input object is transformed into a feature vector, which contains a number of features that are descriptive of the object. The number of features should not be too large, because of the curse of dimensionality; but should contain enough information to accurately predict the output.
- Determine the structure of the learned function and corresponding learning algorithm. For example, you may choose to use support vector machines or decision trees.
- Complete the design. Run the learning algorithm on the gathered training set. Some supervised learning algorithms require the user to determine certain control parameters. These parameters may be adjusted by optimizing performance on a subset (called a validation set) of the training set, or via cross-validation.
- Evaluate the accuracy of the learned function. After parameter adjustment and learning, the performance of the resulting function should be measured on a test set that is separate from the training set.
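A hypothetical run-through of these steps, assuming scikit-learn and its built-in breast-cancer data set as a stand-in for "gathered training examples": a decision tree is chosen as the structure of the learned function, its control parameter max_depth is adjusted by cross-validation on the training set, and the accuracy of the resulting function is measured on a separate test set.

```python
# A hypothetical run-through of the steps above: choose a decision tree as the
# structure of the learned function, tune its depth by cross-validation on the
# training set, then evaluate the final function on a separate test set.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)          # gathered training examples
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Adjust the control parameter max_depth via 5-fold cross-validation.
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid={"max_depth": [2, 4, 8, None]},
                      cv=5)
search.fit(X_train, y_train)

print("chosen depth:", search.best_params_)
print("test accuracy:", search.score(X_test, y_test))  # evaluate on unseen data
```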
Setup
Let us formalize the supervised machine learning setup, but before that, let us quickly go through a few keywords.
- Label is the variable we are predicting, typically represented by the variable y
- Features are input variables describing our data, typically represented by the variables {x1, x2, ..., xn}
- A labeled example has {features, label}: (x, y). Labeled examples are used to train the model
- An unlabeled example has {features, ?}: (x, ?). Unlabeled examples are used for making predictions on new data
Our training data comes in pairs of inputs (x, y), where x ∈ R^d is the input instance and y its label. The entire training data is denoted as
D = {(x_1, y_1), ..., (x_n, y_n)} ⊆ R^d × C
where:
- R^d is the d-dimensional feature space
- x_i is the input vector of the ith sample
- y_i is the label of the ith sample
- C is the label space
The data points (x_i,y_i) are drawn from some (unknown) distribution P(X,Y). Ultimately we would like to learn a function h such that for a new pair (x,y)~P, we have h(x)=y with high probability (or h(x)≈y). We will get to this later. For now let us go through some examples of X and Y.
Examples of Label Spaces
There are multiple scenarios for the label space C, for example:
- Binary classification: C = {0, 1} or C = {-1, +1}, e.g. spam filtering (an email is either spam or not spam)
- Multi-class classification: C = {1, 2, ..., K} with K ≥ 2, e.g. classifying a document into one of K categories
- Regression: C = R, e.g. predicting a temperature or a person's height
Examples of feature vectors
We call x_i a feature vector. Each of its d dimensions is a feature describing the i-th sample. Let us look at some examples:
- Student data in a school. x_i = (x_i1, x_i2, ..., x_id), where x_i1 ∈ {0, 1} may encode student i's gender, x_i2 could be the height of student i in cm, and x_i3 may be his/her age in years, etc. In this case, d ≤ 100 and the feature vector is dense, i.e., the number of nonzero coordinates in x_i is large relative to d.
- Text document in bag-of-words format. x_i = (x_i1, x_i2, ..., x_id), where x_iα is the number of occurrences of the α-th word of a dictionary in document i (often referred to as term frequencies). In this case, d ≈ 100,000 to 10M and the feature vector is sparse, i.e., x_i consists of mostly zeros. A common way to avoid the use of a dictionary is to use feature hashing instead, which directly hashes any string to a dimension index (the advantage is that no dictionary is needed; a minor disadvantage is that multiple words can be hashed into the same dimension). A popular improvement over bag-of-words features is TF-IDF, which down-scales common words and highlights rare words (see the short sketch after this list).
- Images. Here, the features typically represent pixel values: x_i = (x_i1, x_i2, ..., x_i3k), where x_i,3j-2, x_i,3j-1, and x_i,3j refer to the red, green, and blue values of the j-th pixel in the image. In this case, d ≈ 100,000 to 10M and the feature vector is dense. A 7MP camera results in 7M × 3 = 21M features.
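A small sketch of the text representations mentioned above (bag-of-words, feature hashing, TF-IDF), assuming scikit-learn's feature extraction utilities and two made-up documents:

```python
# A small sketch of text feature representations, using scikit-learn.
from sklearn.feature_extraction.text import (CountVectorizer, HashingVectorizer,
                                              TfidfVectorizer)

docs = ["the cat sat on the mat", "the dog chased the cat"]

bow = CountVectorizer().fit_transform(docs)           # bag-of-words term frequencies
tfidf = TfidfVectorizer().fit_transform(docs)         # down-scales common words
hashed = HashingVectorizer(n_features=1024).fit_transform(docs)  # no dictionary needed

# All three produce sparse feature vectors; most entries are zero.
print(bow.shape, tfidf.shape, hashed.shape)
print("nonzeros in hashed representation:", hashed.nnz)
```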
Hypothesis classes and No Free Lunch
We call the set of possible functions the hypothesis class. By specifying the hypothesis class, we are encoding important assumptions about the type of problem we are trying to learn.
Before we can find a function h, we must specify what type of function it is that we are looking for. It could be an artificial neural network, a decision tree or many other types of classifiers. The No Free Lunch Theorem states that every successful ML algorithm must make assumptions. This also means that there is no single ML algorithm that works for every setting.
There are four major issues to consider in supervised learning:
- Bias-variance tradeoff : the first issue is the tradeoff between bias and variance (see the sketch below). Imagine that we have available several different, but equally good, training data sets. A learning algorithm is biased for a particular input x if, when trained on each of these data sets, it is systematically incorrect when predicting the correct output for x. A learning algorithm has high variance for a particular input x if it predicts different output values when trained on different training sets. The prediction error of a learned classifier is related to the sum of the bias and the variance of the learning algorithm. Generally, there is a tradeoff between bias and variance. A learning algorithm with low bias must be "flexible" so that it can fit the data well. But if the learning algorithm is too flexible, it will fit each training data set differently, and hence have high variance. A key aspect of many supervised learning methods is that they are able to adjust this tradeoff between bias and variance (either automatically or by providing a bias/variance parameter that the user can adjust).
- Function complexity and amount of training data : the second issue is the amount of training data available relative to the complexity of the "true" function (classifier or regression function). If the true function is simple, then an "inflexible" learning algorithm with high bias and low variance will be able to learn it from a small amount of data. But if the true function is highly complex (e.g., because it involves complex interactions among many different input features and behaves differently in different parts of the input space), then the function will only be learnable from a very large amount of training data paired with a "flexible" learning algorithm with low bias and high variance.
- Dimensionality of the input space : If the input feature vectors have very high dimension, the learning problem can be difficult even if the true function only depends on a small number of those features. This is because the many "extra" dimensions can confuse the learning algorithm and cause it to have high variance. Hence, high input dimensionality typically requires tuning the classifier to have low variance and high bias. In practice, if the engineer can manually remove irrelevant features from the input data, this is likely to improve the accuracy of the learned function. In addition, there are many algorithms for feature selection that seek to identify the relevant features and discard the irrelevant ones. This is an instance of the more general strategy of dimensionality reduction, which seeks to map the input data into a lower-dimensional space prior to running the supervised learning algorithm.
- Noise in the output values : If the desired output values are often incorrect (because of human error or sensor errors), then the learning algorithm should not attempt to find a function that exactly matches the training examples. Attempting to fit the data too carefully leads to overfitting. You can overfit even when there are no measurement errors (stochastic noise) if the function you are trying to learn is too complex for your learning model. In such a situation, the part of the target function that cannot be modeled "corrupts" your training data - this phenomenon has been called deterministic noise. When either type of noise is present, it is better to go with a higher bias, lower variance estimator.
In practice, there are several approaches to alleviate noise in the output values, such as early stopping to prevent overfitting, as well as detecting and removing the noisy training examples prior to training the supervised learning algorithm. Several algorithms can identify noisy training examples, and removing the suspected noisy examples prior to training has been shown to decrease generalization error with statistical significance.
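The following sketch illustrates the bias-variance tradeoff on a toy regression problem (assumed setup: a noisy sine function fit with polynomials of different degrees using scikit-learn). A degree-1 polynomial underfits (high bias), a very high degree overfits the 20 noisy training points (high variance), and a moderate degree usually lands in between:

```python
# An illustrative sketch of the bias-variance tradeoff on a toy regression problem.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
x = np.sort(rng.uniform(0, 1, 20)).reshape(-1, 1)
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.2, size=20)  # noisy "true" function

x_test = np.linspace(0, 1, 200).reshape(-1, 1)
y_test = np.sin(2 * np.pi * x_test).ravel()

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(x, y)
    test_err = np.mean((model.predict(x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: test MSE = {test_err:.3f}")
```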
Other factors to consider (important)
- Heterogeneity of the data. If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others. Many algorithms, including Support Vector Machines, linear regression, logistic regression, neural networks, and nearest neighbor methods, require that the input features be numerical and scaled to similar ranges (e.g., to the [-1,1] interval). Methods that employ a distance function, such as nearest neighbor methods and support vector machines with Gaussian kernels, are particularly sensitive to this. An advantage of decision trees is that they easily handle heterogeneous data.
- Redundancy in the data. If the input features contain redundant information (e.g., highly correlated features), some learning algorithms (e.g., linear regression, logistic regression, and distance based methods) will perform poorly because of numerical instabilities. These problems can often be solved by imposing some form of regularization.
- Presence of interactions and non-linearities. If each of the features makes an independent contribution to the output, then algorithms based on linear functions (e.g., linear regression, logistic regression, Support Vector Machines, naive Bayes) and distance functions (e.g., nearest neighbor methods, support vector machines with Gaussian kernels) generally perform well. However, if there are complex interactions among features, then algorithms such as decision trees and neural networks work better, because they are specifically designed to discover these interactions. Linear methods can also be applied, but the engineer must manually specify the interactions when using them.
When considering a new application, compare multiple learning algorithms and experimentally determine which one works best on the problem at hand (see cross validation). Tuning the performance of a learning algorithm can be very time-consuming. Given fixed resources, it is often better to spend more time collecting additional training data and more informative features than it is to spend extra time tuning the learning algorithms.
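A sketch of comparing several learning algorithms by cross-validation, as suggested above (assuming scikit-learn and its built-in wine data set as a stand-in for "the problem at hand"):

```python
# A sketch of comparing several learning algorithms with 5-fold cross-validation.
from sklearn.datasets import load_wine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

candidates = {
    "logistic regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "k-nearest neighbors": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision tree": DecisionTreeClassifier(random_state=0),
}

for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f}")
```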
Loss Functions
There are typically two steps involved in learning a hypothesis function h().
- The first step is to select the type of machine learning algorithm that we think is appropriate for this particular learning problem. This defines the hypothesis class H, i.e. the set of functions we can possibly learn.
- The second step is to find the best function within this class, h∈H. This second step is the actual learning process and often, but not always, involves an optimization problem.
Essentially, we try to find a function h within the hypothesis class that makes the fewest mistakes on our training data. (If more than one function achieves this, we typically try to choose the "simplest" one by some notion of simplicity; we will cover this in more detail in a future article.)
How can we find the best function? For this we need some way to evaluate what it means for one function to be better than another. This is where the loss function (aka risk function) comes in. A loss function evaluates a hypothesis h∈H on our training data and tells us how bad it is. The higher the loss, the worse it is - a loss of zero means it makes perfect predictions.
It is common practice to normalize the loss by the total number of training samples, n, so that the output can be interpreted as the average loss per sample (and is independent of n).
Examples:
Zero-one loss : The simplest loss function is the zero-one loss. It literally counts how many mistakes a hypothesis function h makes on the training set. For every single example it suffers a loss of 1 if it is mispredicted, and 0 otherwise. The normalized zero-one loss returns the fraction of misclassified training samples, also often referred to as the training error. The zero-one loss is often used to evaluate classifiers in multi-class/binary classification settings but is rarely useful to guide optimization procedures, because the function is non-differentiable and non-continuous. Formally, the zero-one loss can be stated as:
L_{0/1}(h) = (1/n) Σ_{i=1}^{n} δ_{h(x_i) ≠ y_i}, where δ_{h(x_i) ≠ y_i} = 1 if h(x_i) ≠ y_i, and 0 otherwise.
This loss function returns the error rate on this data set D. For every example that the classifier misclassifies (i.e. gets wrong) a loss of 1 is suffered, whereas correctly classified samples lead to 0 loss.
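A minimal sketch of the normalized zero-one loss in code (a hypothetical helper, not a library function):

```python
# A minimal sketch of the (normalized) zero-one loss: the fraction of mispredictions.
import numpy as np

def zero_one_loss(predictions, labels):
    predictions, labels = np.asarray(predictions), np.asarray(labels)
    return np.mean(predictions != labels)   # average of 1 per mistake, 0 per correct prediction

y_true = [1, 0, 1, 1, 0]
y_pred = [1, 1, 1, 0, 0]
print(zero_one_loss(y_pred, y_true))  # 2 mistakes out of 5 -> 0.4
```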
Squared loss : The squared loss function is typically used in regression settings. It iterates over all training samples and suffers the loss (h(x_i) − y_i)^2.
The squaring has two effects:
- the loss suffered is always nonnegative;
- the loss suffered grows quadratically with the absolute mispredicted amount.
The latter property encourages no predictions to be really far off (or the penalty would be so large that a different hypothesis function is likely better suited). On the flip side, if a prediction is very close to being correct, the square will be tiny and little attention will be given to that example to obtain zero error. For example, if |h(x_i) − y_i| = 0.001 the squared loss will be even smaller, 0.000001, and will likely never be fully corrected. If, given an input x, the label y is probabilistic according to some distribution P(y|x), then the optimal prediction to minimize the squared loss is to predict the expected value, i.e. h(x) = E_{P(y|x)}[y]. Formally, the squared loss is:
L_sq(h) = (1/n) Σ_{i=1}^{n} (h(x_i) − y_i)^2
Absolute loss : Similar to the squared loss, the absolute loss function is also typically used in regression settings. It suffers the penalties |h(x_i) − y_i|. Because the suffered loss grows linearly with the mispredictions, it is more suitable for noisy data (when some mispredictions are unavoidable and shouldn't dominate the loss). If, given an input x, the label y is probabilistic according to some distribution P(y|x), then the optimal prediction to minimize the absolute loss is to predict the median value, i.e. h(x) = MEDIAN_{P(y|x)}[y]. Formally, the absolute loss can be stated as:
L_abs(h) = (1/n) Σ_{i=1}^{n} |h(x_i) − y_i|
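The same idea in code for the squared and absolute losses (hypothetical helpers; the example also shows how a single large misprediction dominates the squared loss but affects the absolute loss only linearly):

```python
# A sketch of the (normalized) squared and absolute losses, and of how an outlier
# affects them differently: the squared loss is dominated by the large misprediction,
# while the absolute loss grows only linearly with it.
import numpy as np

def squared_loss(predictions, labels):
    return np.mean((np.asarray(predictions) - np.asarray(labels)) ** 2)

def absolute_loss(predictions, labels):
    return np.mean(np.abs(np.asarray(predictions) - np.asarray(labels)))

y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.9, -0.4, 2.1, 12.0])   # last prediction is far off

print("squared loss :", squared_loss(y_pred, y_true))   # ~6.26
print("absolute loss:", absolute_loss(y_pred, y_true))  # ~1.33
```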
Generalization:
Given a loss function, we can then attempt to find the function h that minimizes the loss:
h* = argmin_{h ∈ H} L(h)
A big part of machine learning focuses on the question of how to do this minimization efficiently.
If you find a function h(·) with low loss on your data D, how do you know whether it will still get examples right that are not in D? In general, it may not: a function that simply memorizes D can achieve very low training loss yet perform poorly on new data; this is the problem of overfitting.
Train / Test splits
To resolve the overfitting issue, we usually split D into three subsets: D_TR as the training data, D_VA as the validation data, and D_TE as the test data. Usually, they are split in a proportion of 80%, 10%, and 10%. Then, we choose h(·) based on D_TR, and evaluate h(·) on D_TE.
Why do we need D_VA?
D_VA is used to check whether the h(·) obtained from D_TR suffers from the overfitting issue. h(·) needs to be validated on D_VA; if the loss is too large, h(·) gets revised based on D_TR and validated again on D_VA. This process keeps going back and forth until it gives a low loss on D_VA. There is a trade-off between the sizes of D_TR and D_VA: the training results will be better for a larger D_TR, but the validation will be more reliable (less noisy) if D_VA is larger.
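A sketch of this back-and-forth, assuming scikit-learn, synthetic data, and k-nearest neighbors as the (hypothetical) model being revised: candidate values of k are fit on D_TR, compared on D_VA, and only the chosen one is finally evaluated on D_TE:

```python
# A sketch of model selection with a train / validation / test split (~80/10/10).
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=10, random_state=1)

X_rest, X_te, y_rest, y_te = train_test_split(X, y, test_size=0.1, random_state=1)
X_tr, X_va, y_tr, y_va = train_test_split(X_rest, y_rest, test_size=1/9, random_state=1)

best_k, best_val_acc = None, -1.0
for k in (1, 3, 5, 15, 51):                      # "revise h and validate again"
    model = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    val_acc = model.score(X_va, y_va)
    if val_acc > best_val_acc:
        best_k, best_val_acc = k, val_acc

# Only the chosen model ever touches the test set.
final = KNeighborsClassifier(n_neighbors=best_k).fit(X_tr, y_tr)
print("chosen k:", best_k, "| test accuracy:", final.score(X_te, y_te))
```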
How to Split the Data?
You have to be very careful when you split the data into train, validation, and test sets. The test set must simulate a real test scenario, i.e. you want to simulate the setting that you will encounter in real life. For example, if you want to train an email spam filter, you train a system on past data to predict if future email is spam. Here it is important to split train / test temporally, so that you strictly predict the future from the past. If there is no temporal component, it is often best to split uniformly at random. Definitely never split alphabetically or by feature values.
In short:
- Split by time, if the data is temporally collected. In general, if the data has a temporal component, we must split it by time (a small sketch follows).
- Split uniformly at random, if (and, in general, only if) the data is i.i.d.
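A small sketch of a time-based split (assuming each record carries a timestamp; the dates and labels here are made up):

```python
# A sketch of a time-based split: everything before a cutoff is used for training,
# everything after for testing, so the model strictly predicts the future from the past.
import numpy as np

timestamps = np.array(["2023-01-05", "2023-03-12", "2023-06-01",
                       "2023-09-20", "2023-11-02", "2024-01-15"], dtype="datetime64[D]")
X = np.arange(len(timestamps)).reshape(-1, 1)   # placeholder features
y = np.array([0, 1, 0, 1, 1, 0])                # placeholder labels

order = np.argsort(timestamps)                  # make sure records are in time order
cutoff = np.datetime64("2023-10-01")

train_idx = order[timestamps[order] < cutoff]
test_idx = order[timestamps[order] >= cutoff]

print("train:", timestamps[train_idx])
print("test :", timestamps[test_idx])
```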
The test error (or testing loss) approximates the true generalization error/loss.
Putting everything together:
We train our classifier by minimizing the training loss:
h* = argmin_{h ∈ H} (1/|D_TR|) Σ_{(x,y) ∈ D_TR} ℓ(x, y | h)
where H is the hypothesis class (i.e., the set of all possible classifiers h(·)) and ℓ is the per-example loss. In other words, we are trying to find a hypothesis h which would have performed well on the past/known data.
We evaluate our classifier on the testing loss:
ε_TE = (1/|D_TE|) Σ_{(x,y) ∈ D_TE} ℓ(x, y | h*)
If the samples are drawn i.i.d. from the same distribution P, then the testing loss is an unbiased estimator of the true generalization loss:
ε = E_{(x,y)~P} [ ℓ(x, y | h*) ]
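A sketch of this claim on synthetic data whose distribution P we control (assuming scikit-learn): the zero-one loss measured on a modest i.i.d. test set comes out close to the loss on a much larger, freshly drawn sample from the same P, which stands in for the true generalization loss:

```python
# A sketch: the test loss on an i.i.d. test set approximates the generalization loss,
# estimated here by a much larger, freshly drawn sample from the same distribution.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=60000, n_features=10, random_state=7)
X_pool, X_fresh, y_pool, y_fresh = train_test_split(X, y, test_size=50000, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X_pool, y_pool, test_size=0.2, random_state=7)

h = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

test_loss = np.mean(h.predict(X_te) != y_te)          # zero-one loss on D_TE
fresh_loss = np.mean(h.predict(X_fresh) != y_fresh)   # stand-in for the true generalization loss
print(f"test loss {test_loss:.3f} vs fresh-sample loss {fresh_loss:.3f}")
```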
No free lunch theorem : every ML algorithm has to make assumptions. Which hypothesis class H should you choose? This choice depends on the data and encodes your assumptions about the data set/distribution P. Clearly, there is no one perfect H for all problems.
________________________________________________________________________
The following episodes will cover topics like unsupervised learning, reinforcement learning, algorithms and their limitations, Deep Learning, Neural Networks, etc.
!!Stay Tuned!!