Will the real ML problem please stand up
We are gradually entering a world filled with, and driven by, data that looks like us, acts like us and talks like us. Those of us who are driving this shift know we are not kidding.
There was a time when most of us (some still do) had only a handful of data in a structured format, plenty of time, and a limited set of actions to take on the basis of it. A good percentage of the world still operates that way. Here, though, I would like to steer our thoughts towards what is slowly becoming the industry standard irrespective of the domain, functional knowledge or business use case: feed the data to a black-box machine learning model and try to sell its outcome, with a confidence score attached, to the people sitting on the opposite side of the table.
When a machine takes a decision on the basis of the data it is fed, the person designing the model should either have ample knowledge of the business or have enough data to establish correlations and identify patterns. The tricky part is that the person sitting on the opposite side is often not aware that a machine learning outcome is not handed down by God himself; it rests entirely on the data the model has been fed and the manipulations done on top of that data.
It is good to know the different types of machine learning systems here:
- Whether the algorithm is trained under human supervision (supervised, unsupervised, semi-supervised or reinforcement learning)
- Whether it can learn incrementally on the fly (online vs. batch learning)
- Whether it compares new data points with known ones, or instead detects patterns in the training data and builds a predictive model (instance-based vs. model-based learning)
The classification above, adapted from a snippet found on the internet, covers the familiar categories.
Supervised learning involves training on labelled data, i.e. examples paired with the desired solutions. A classification algorithm, such as a spam filter deciding spam or not spam, does exactly this.
However, if you have to predict a numeric value, regression comes to the rescue. Some of the key supervised learning algorithms one needs to be aware of are:
a) Linear regression: the fitted line's slope is often enough to tell whether the outcome trends positive or negative.
b) Logistic regression: can also be used to solve a classification problem, and returns a confidence score alongside the predicted class.
c) k-nearest neighbor
d) Support vector machines (SVMs)
e) Decision Trees and Random Forests
f) Neural networks
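To make the supervised idea concrete, here is a minimal k-nearest neighbors classifier (one of the algorithms listed above) written in plain Python. The toy points and the "spam"/"ham" labels are invented for illustration.

```python
from collections import Counter

def knn_predict(train_X, train_y, x, k=3):
    # Sort training points by squared Euclidean distance to x.
    nearest = sorted(
        range(len(train_X)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(train_X[i], x)),
    )
    # Majority vote among the k closest labels.
    votes = Counter(train_y[i] for i in nearest[:k])
    return votes.most_common(1)[0][0]

# Toy labelled data: two well-separated groups.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
y = ["ham", "ham", "ham", "spam", "spam", "spam"]

print(knn_predict(X, y, (0.15, 0.1)))  # near the "ham" group -> "ham"
print(knn_predict(X, y, (5.05, 5.1)))  # near the "spam" group -> "spam"
```

Notice there is no training step at all: the labelled examples themselves are the model, which is also why k-NN reappears below as the canonical instance-based learner.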
Unsupervised learning is when the training data is unlabeled. The common families are:
- Clustering: handy if you want to detect groups of similar visitors browsing or buying on your site (k-means, Hierarchical Cluster Analysis (HCA), Expectation Maximization)
- Visualization and dimensionality reduction (Principal Component Analysis (PCA), Kernel PCA, Locally-Linear Embedding, t-SNE (t-distributed Stochastic Neighbor Embedding))
- Association rule learning (Apriori, Eclat)
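As a sketch of clustering, here is a bare-bones k-means in NumPy. No labels are supplied; the algorithm discovers the two groups on its own. The data points are made up.

```python
import numpy as np

def kmeans(X, k=2, iters=10, seed=0):
    rng = np.random.default_rng(seed)
    # Start from k distinct training points as initial centres.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        # Assign each point to its nearest centre.
        labels = np.argmin(((X[:, None] - centers) ** 2).sum(-1), axis=1)
        # Move each centre to the mean of the points assigned to it.
        centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
    return labels, centers

X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],     # one group near origin
              [9.0, 9.0], [9.0, 10.0], [10.0, 9.0]])  # one group far away
labels, centers = kmeans(X, k=2)
print(labels)  # the first three points share one label, the last three another
```

Real implementations add empty-cluster handling and convergence checks, but the assign-then-update loop is the whole idea.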
Semi-supervised learning is what Google Photos uses for labeling your images: you apply labels to a few faces, and it propagates them across the rest of your photos.
Reinforcement learning uses a learning system called an agent. The agent learns by itself, acting according to a policy that is rewarded for each right decision and penalized for each wrong one. It keeps iterating until it can minimize its losses (or maximize its reward) in as few steps as possible.
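The reward-and-penalty loop can be sketched with the simplest reinforcement-style setup, a two-armed bandit. The payout probabilities (0.2 and 0.8) are invented and hidden from the agent, which must learn from rewards alone via an epsilon-greedy policy.

```python
import random

random.seed(42)
payout = [0.2, 0.8]    # true win probability of each arm (unknown to the agent)
value = [0.0, 0.0]     # the agent's running reward estimate per arm
counts = [0, 0]

for step in range(2000):
    # Policy: mostly exploit the best-looking arm, occasionally explore.
    if random.random() < 0.1:
        arm = random.randrange(2)
    else:
        arm = 0 if value[0] > value[1] else 1
    reward = 1 if random.random() < payout[arm] else 0  # environment's reply
    counts[arm] += 1
    value[arm] += (reward - value[arm]) / counts[arm]   # incremental mean update

print(value)   # the estimates drift toward the true payout probabilities
print(counts)  # the agent ends up pulling the better arm far more often
```

Full reinforcement learning adds states and multi-step credit assignment (e.g. Q-learning), but the act / get reward / update-policy cycle is already visible here.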
Batch and Online Learning
In batch learning the system is incapable of learning incrementally: it must be trained on all the available data at once. So in a production system, with each new increment to the batch of data, you would need to:
- Train a fresh model in isolation on the full data set, old plus new
- Take the production system offline
- Retire the older model and replace it with the newly trained one
In online learning, by contrast, the system is fed data instances sequentially and learns incrementally from each one. This is a good fit for cases such as stock price prediction or ad and product recommendation on e-commerce websites, where data arrives as a continuous stream.
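A minimal sketch of online learning: a linear model updated one sample at a time with stochastic gradient descent, so it keeps improving as new data streams in instead of being retrained from scratch. The stream y = 3x + noise is synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
w, b, lr = 0.0, 0.0, 0.01  # model weights and learning rate

for _ in range(5000):                # each iteration = one new data instance
    x = rng.uniform(-1, 1)
    y = 3.0 * x + rng.normal(0, 0.1)  # the "true" relationship plus noise
    err = (w * x + b) - y             # prediction error on this one sample
    w -= lr * err * x                 # gradient step: no full retraining
    b -= lr * err

print(round(w, 1))  # the slope estimate settles near 3.0
```

The learning rate controls the trade-off the batch/online distinction hinges on: set it high and the model adapts fast but forgets quickly; set it low and it is stable but slow to track change.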
Instance-based vs Model-based Learning
In instance-based learning the system learns the training examples by heart and generalizes to new cases using a similarity measure.
In model-based learning you instead build a model from those examples and use it to predict outcomes, like fitting a linear regression to capture the correlation between life expectancy and average income.
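The model-based side can be sketched in a few lines: fit a line to (average income, life expectancy) pairs and predict for an income level not in the data. The numbers below are invented for illustration, not real statistics.

```python
import numpy as np

income = np.array([10.0, 20.0, 30.0, 40.0, 50.0])  # avg income, thousands USD
life = np.array([65.0, 70.0, 74.0, 77.0, 79.0])    # life expectancy, years

slope, intercept = np.polyfit(income, life, 1)     # least-squares line fit

def predict(x):
    # Once fitted, the model alone answers queries; the data can be discarded.
    return slope * x + intercept

print(round(predict(35.0), 2))  # estimate for an unseen income level
```

That last comment is the real contrast with instance-based learning: k-NN must keep every training example around at prediction time, while the fitted model compresses them into two numbers.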
All of this is just a way to stay disciplined in an ever-changing world with so much noise around ML projects, whose main challenges remain data quality, business and functional knowledge, irrelevant or missing data, insufficient quantity of data, and non-representative training data.
I am pretty sure some of us have already been trapped in the dogma of force-fitting an ML model even when a traditional approach would work just fine.
Amen