Basics of Machine Learning
Ranjan Bhasin
Senior Director - Information Technology, Automation & AI at Greystar India
Aspiring data scientists often have many questions when they start their careers. Why should we focus on Machine Learning today, and why is it talked about so much? Did Machine Learning come into existence only recently? There are dozens of definitions available and dozens of reasons why someone should get into it. We don't want to reinvent the wheel here, but we would like to address a few vital aspects relevant to aspiring data scientists.
Why Machine Learning
A humongous amount of data is generated every minute, for example from retail payments, GPS, photos, blogs, videos, e-commerce, investments, insurance, healthcare, accounting, logistics, utilities and much more. Because there is so much data, there is an opportunity to become predictive across all these areas. Being predictive means being ready for the future by taking the right decisions in the present.
Definition
Machine Learning is a field that draws on computer science, statistics and mathematics to either make predictions or cluster data. The most widely used definition is that machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
Glossary
1. Independent Variable: A set of variables/fields (often referred to as features) which, when combined, determine the output. For example, rain at any place can be forecast with the help of multiple fields/variables such as the geographical type of the area (tropical, coastal, mountain, etc.), month of the year, previous day's status, humidity level, etc. These fields/variables are called independent variables or features. There are usually multiple independent variables in any data-set.
2. Dependent Variable: The data point which we intend to predict is the dependent variable. In the above example, the dependent variable is the rain forecast (Yes or No) or how much rain we expect (measured in mm). There is usually one dependent variable in a data-set, although there can be several.
3. Data-set: The combination of independent variables and dependent variables is called a data-set. In other words, many different data points around a business problem, combined together, form a data-set. For a basic machine learning problem, it is usually in tabular form, where every row is one data entry and every column is one variable (a feature or the output). For example, in the above rain forecasting example, 300 examples (300 days of data) as rows and 5 columns (4 features and 1 output) together form the data-set. Every day is a row, and the different dimensions such as humidity, temperature, altitude, dayOfWeek, etc. are columns.
4. Training Data: The complete data-set is divided into 3 parts, and the training data is usually the biggest of the three. It is called training data because the machine learning algorithm works on this set to create its model (technically, an equation).
5. Validation Data: This is the second chunk of the complete data-set, used to validate the accuracy or correctness of the model created. The model or equation (created during training) is run on this validation set, and the results are used to adjust hyper-parameters to improve the accuracy further.
6. Test Data: This is the final chunk of the data-set, on which the model is run to measure its accuracy on data it has not seen.
7. Fitting the data (or training): Whenever anyone says that data is being fit or a model is being trained, it means that the machine learning algorithm is creating a model, i.e. a generalized equation which the data can fit. For example, the equation of a circle in 2-dimensional space is (x-h)^2 + (y-k)^2 = r^2, where r is the radius and the circle is centered at (h,k). This is a generalized equation: every point (x,y) that satisfies it lies on the circle. Similarly, after a model is created, whenever we put in new values of the independent variables, the machine learning model gives the value of the dependent variable.
8. Loss: Loss is the difference between the predicted value and the actual value for a single training record. It gives an estimate of how far the predicted value is from the actual value.
9. Cost Function: The cost function is the mean of the losses across all training samples.
10. Optimization: This is the process of minimizing the loss (cost) by adjusting the weights or parameters. It is achieved by taking partial derivatives (differentiation) of the cost function with respect to each of the weights.
11. Parameters: Parameters are the weights associated with each independent variable. These weights are changed with every iteration of optimization. When the weights no longer change significantly between updates, we assume that optimization is complete.
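As a rough illustration of the training, validation and test chunks described in the glossary, here is a minimal sketch in plain Python. The function name split_dataset and the 70/15/15 proportions are assumptions made for this example; the article does not prescribe specific split sizes.

```python
import random


def split_dataset(rows, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle rows and split them into training, validation and test chunks.

    The 70/15/15 default is an illustrative assumption, not a rule.
    """
    if train_frac <= 0 or val_frac < 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must leave room for a test chunk")
    shuffled = list(rows)  # copy so the caller's data is left untouched
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains is the test chunk
    return train, val, test


# Stand-in for the 300 days of rain data from the glossary example.
days = list(range(300))
train, val, test = split_dataset(days)
print(len(train), len(val), len(test))
```

Shuffling before splitting is a common default, but for time-ordered data (such as daily weather) a chronological split may be more appropriate.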
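The glossary terms loss, cost function, optimization and parameters can be tied together with a small sketch. It assumes a single feature, a straight-line model y = w*x + b, squared error as the loss, and plain gradient descent as the optimizer; the article does not name a specific loss or optimizer, so these are illustrative choices, as are the function names and the learning rate.

```python
def predict(w, b, x):
    """Model (equation) with two parameters: weight w and bias b."""
    return w * x + b


def cost(w, b, xs, ys):
    """Cost = mean of the per-record squared losses."""
    losses = [(predict(w, b, x) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(losses) / len(losses)


def gradient_descent(xs, ys, lr=0.05, max_iters=5000, tol=1e-9):
    """Adjust w and b using partial derivatives of the cost."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(max_iters):
        errors = [predict(w, b, x) - y for x, y in zip(xs, ys)]
        grad_w = 2 / n * sum(e * x for e, x in zip(errors, xs))  # d(cost)/dw
        grad_b = 2 / n * sum(errors)                             # d(cost)/db
        new_w, new_b = w - lr * grad_w, b - lr * grad_b
        # Stop once the parameters no longer change significantly.
        converged = abs(new_w - w) < tol and abs(new_b - b) < tol
        w, b = new_w, new_b
        if converged:
            break
    return w, b


# Toy data generated from y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = gradient_descent(xs, ys)
print(round(w, 3), round(b, 3), cost(w, b, xs, ys))
```

The stopping rule mirrors the glossary entry for parameters: optimization is treated as complete once the weights stop changing by more than a small tolerance.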
Categories of Machine Learning
Let's understand that machine learning works on data, and specifically on numeric data. This means that all text has to be converted into numeric data before a machine learning algorithm is applied. This is discussed in the later Process section.
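One common way to convert text categories into numbers is one-hot encoding. The sketch below is a hand-rolled illustration in plain Python; the function name one_hot_encode and the sample area values are assumptions for this example, and in practice a library routine would usually be used instead.

```python
def one_hot_encode(values):
    """Turn a list of text categories into rows of 0/1 indicators."""
    categories = sorted(set(values))          # fixed, reproducible column order
    index = {cat: i for i, cat in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                 # mark the column for this category
        encoded.append(row)
    return categories, encoded


# Geographical type of the area, as in the rain forecasting example.
areas = ["coastal", "mountain", "tropical", "coastal"]
categories, encoded = one_hot_encode(areas)
for area, row in zip(areas, encoded):
    print(area, row)
```

Each text value becomes one numeric column per category, so no artificial ordering is imposed on the categories.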
1. Supervised Learning: This category of algorithms is used when we have a dependent variable (output) assigned to each set of independent variables (features or columns in the data-set). The problem is to predict the dependent variable (output) given a new set of independent variables (features).
2. Unsupervised Learning: This category of algorithms is used when the data-set needs to be clustered or segregated into categories. For example, suppose we have to categorize the students of a school based on their characteristics (address, height, weight, age, marks obtained last year, drawing skill, sports medals, music skill, wears spectacles or not, etc.). Based on the data in these features, an unsupervised model can group students into 3 or 4 categories, which we might interpret as a) studious, b) athlete, c) artist.
3. Reinforcement Learning: Reinforcement learning is a subset of machine learning where the model explores the possible paths/options to reach a goal and learns to choose the path/option that gives the most rewards (positive points) with the fewest penalties (negative points).
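To make the unsupervised case concrete, here is a minimal sketch of K-Means clustering in plain Python, applied to a made-up student data-set with two features (marks obtained last year, sports medals). The data values, the function names and the choice of K-Means itself are assumptions for illustration; note that the algorithm only produces numbered groups, and naming them (studious, athlete) is left to us.

```python
import random


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point."""
    return min(range(len(centroids)),
               key=lambda i: squared_distance(point, centroids[i]))


def k_means(points, k, iterations=20, seed=0):
    """Group points into k clusters; returns (centroids, labels)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k randomly chosen points
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest_centroid(p, centroids)].append(p)
        for i, members in enumerate(clusters):
            if members:  # an empty cluster keeps its previous centroid
                centroids[i] = tuple(sum(c) / len(members)
                                     for c in zip(*members))
    labels = [nearest_centroid(p, centroids) for p in points]
    return centroids, labels


# Hypothetical students: (marks obtained last year, sports medals).
students = [(90, 0), (85, 1), (88, 0), (50, 6), (45, 7), (55, 5)]
centroids, labels = k_means(students, k=2)
print(labels)
```

No output column is supplied here: the grouping comes only from the features, which is what distinguishes this from the supervised case.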
Algorithms
Below is a list of commonly used algorithms:
1. Linear Regression
2. Logistic Regression
3. Decision Tree
4. SVM
5. Naive Bayes
6. kNN
7. K-Means
8. Random Forest
9. Dimensionality Reduction Algorithms
10. Gradient Boosting algorithms
a) GBM
b) XGBoost
Process
We have explained this process in our Data Science section and, as machine learning is part of Data Science, the steps of Machine Learning are similar to the steps in Data Science. Please read until the end, as the last step is your bonus item.
1. Data Collection: Data collection is the foundational building block of any machine learning problem. Data can be collected in structured formats (databases, available data-sets, internet history) or unstructured formats (videos, blogs, etc.).
2. Cleansing the Data: Cleaning/scrubbing the data refers to the process of ensuring that the data doesn't have NULL values, doesn't have too many outliers, has had irrelevant columns removed, and so on.
3. Exploratory Data Analysis: Visualize the data with the help of charts to identify patterns, outliers or key insights, on the basis of which further actions are taken in the next step (Feature Engineering).
4. Feature Engineering: This step is done to arrive at the right set of features with the help of the following: a) adding more records to the data-set, b) adding more features, c) group operations (e.g. max, min, pivot), d) normalization/scaling of data, e) logarithmic or exponential transformation, f) re-engineering the features, e.g. dimensionality reduction or identifying collinearity, g) one-hot encoding, and a few more.
5. Algorithm Selection: Multiple algorithms can be applied to a single business problem, hence we can test several of them. For example, in the case of classification, we can use logistic regression, decision trees or Naive Bayes, depending on which algorithm gives better accuracy.
6. Modelling: This includes training the model, which means finding the right set of weights (associated with columns/features/data) to create a generalized equation. It also includes tuning the model with the help of cross-validation and updating hyperparameters. This is followed by evaluating the accuracy of the model by running it on unseen data (known as test data). Once the accuracy reaches the threshold expected by the business, the model is moved to production.
7. Production deployment: Production deployment is a key element of any model, yet it is rarely talked about or explained. Below are the aspects we need to consider while deploying the model:
a) Is the model developed for web-based interaction or device-based interaction?
b) Do we need real-time scaling of the model (expecting more users in certain time periods), or will the number of users stay the same?
c) Does it need to be integrated with external devices or systems such as webcams, e-commerce portals, etc.?
d) What are the security implications of using the model?
e) Do we want our customers to initialize the algorithm every time and call a predict function, or do we want to expose the model through REST APIs?
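The cleansing and feature engineering steps above can be sketched in a few lines of plain Python. The sample rows, the column meanings and the function names are all hypothetical; the sketch shows dropping records with missing values, min-max scaling, and a logarithmic transformation, which are only three of the operations listed in the steps.

```python
import math


def drop_incomplete(rows):
    """Cleansing: remove records that contain a missing (None) value."""
    return [row for row in rows if all(v is not None for v in row)]


def min_max_scale(column):
    """Scaling: map values linearly onto the range 0..1."""
    lo, hi = min(column), max(column)
    if hi == lo:
        return [0.0 for _ in column]  # constant column carries no spread
    return [(v - lo) / (hi - lo) for v in column]


def log_transform(column):
    """Logarithmic transformation: log(1 + v), safe for zero values."""
    return [math.log1p(v) for v in column]


# Hypothetical records: (humidity %, temperature C, rain in mm).
rows = [
    (80, 30.0, 12.0),
    (None, 28.0, 0.0),   # missing humidity, dropped during cleansing
    (60, 25.0, 0.0),
    (100, 35.0, 40.0),
]
clean_rows = drop_incomplete(rows)
humidity_scaled = min_max_scale([r[0] for r in clean_rows])
rain_log = log_transform([r[2] for r in clean_rows])
print(humidity_scaled)
print(rain_log)
```

Dropping incomplete records is the simplest option; filling missing values with a mean or median is another common choice when data is scarce.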
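The Modelling step mentions tuning with cross-validation. A minimal sketch of how k-fold cross-validation divides the records is shown below; the function name k_fold_indices is an assumption for this example, and the sketch only produces the index splits, leaving the actual training and scoring of a model to whichever algorithm was selected.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, validation_indices) pairs.

    Each record lands in exactly one validation fold.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and the number of samples")
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds = []
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, val_idx))
        start += size
    return folds


folds = k_fold_indices(10, 3)
for train_idx, val_idx in folds:
    print("train:", train_idx, "validate:", val_idx)
```

The model is trained once per fold and the validation scores are averaged, which gives a steadier estimate for comparing hyperparameter settings than a single validation split.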
General Plan on how to study Machine Learning in detail
1. Use YouTube (to begin with) to learn about Supervised and Unsupervised learning. Park Reinforcement learning for a few months until you gain confidence in Supervised and Unsupervised learning.
2. Enroll in a few courses on Coursera, Udemy or any other online platform. There is no harm in looking for free ones or for those where you can get financial aid; however, do look into the content and match it against the topics above.
3. Practice a few algorithms on multiple different datasets (you can get a lot of free datasets online).
4. Important
a) Create an account on Kaggle and download datasets
b) Create a profile on GitHub and showcase your work there
c) Similarly, update your profile on LinkedIn as well
This has been explained at a fairly basic level, and we recommend practicing at least 2-3 algorithms for each type of machine learning to start gaining confidence.
Digital Transformation through AI and ML | Decarbonization in Energy | Consulting Director
1 year ago
Great summary Ranjan Bhasin - thanks for sharing. You have incorporated an important lesson for me as I reflect on my learning journey: aim to balance course material with real-world practice. Learning concepts in a structured way is valuable, but even more so is getting your hands dirty with the realities of data. Kaggle is a good source of data, but the data can sometimes already be "cleaned" and ready for analysis. Getting datasets from, e.g., government websites (I often use the EPA) shows the realities of data processing and management, which can take up more time than modeling.