ML Models
Darshika Srivastava
Associate Project Manager @ HuQuo | MBA, Amity Business School
******** ML (Machine Learning) ********
1- ML is an application of artificial intelligence (AI).
It provides systems the ability to automatically learn and improve from experience without being explicitly programmed.
2- ML focuses on the development of computer programs that can access data and use it to learn for themselves.
3- In text applications, ML supports semantic analysis, an approach that mimics the human ability to understand the meaning of a text.
The primary aim of ML is to allow computers to learn automatically, without human intervention or assistance, and to adjust their actions accordingly.
Methods of ML -
1- The process of learning begins with observations or data.
2- Machine learning algorithms are most often categorized as supervised or unsupervised, and in practice four categories are commonly distinguished:
a- Supervised machine learning algorithms
b- Unsupervised machine learning algorithms
c- Semi-supervised machine learning algorithms
d- Reinforcement machine learning algorithms
Techniques of ML -
a- Machine Learning Regression (supervised) — we use this technique to predict a continuous, numerical target, and it begins by working on the data-set values we already know (summarized, for example, with a mean or median). A small sketch follows the list below.
We generally observe two types of regression:
1- Linear Regression (we can denote the relationship between a target and predictor as a straight line)
2- Non-linear Regression (we observe a non-linear relationship between a target and predictor)
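As a minimal, purely illustrative sketch of the regression technique, the snippet below fits a linear regression on a tiny made-up data set using scikit-learn (a library assumed here for illustration, not prescribed by this article).

```python
# Minimal linear-regression sketch (hypothetical data; scikit-learn assumed).
import numpy as np
from sklearn.linear_model import LinearRegression

# Predictor (e.g. years of experience) and a continuous, numerical target (e.g. salary).
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([30_000, 35_000, 41_000, 46_000, 52_000])

model = LinearRegression()
model.fit(X, y)                       # learn the straight-line relationship

print(model.coef_, model.intercept_)  # slope and intercept of the fitted line
print(model.predict([[6]]))           # predict the target for an unseen predictor value
```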
b- Machine Learning Classification (supervised) — Classification is a data mining technique that lets us predict group membership for data instances; by ‘prediction’ we mean assigning data to the classes it can belong to.
Methods of Classification -
1- Decision Tree Induction — we build a decision tree from the class-labeled tuples.
2- Rule-based Classification — this classification is based on a set of IF-THEN rules.
3- Classification by Backpropagation — neural network learning (the network iteratively processes data and compares its output with the target value in order to learn).
4- Lazy Learners — the machine stores the training tuples and waits for a test tuple.
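To make decision tree induction concrete, here is a minimal sketch with scikit-learn (assumed for illustration only); the class-labeled tuples are invented.

```python
# Decision-tree induction on toy class-labeled tuples (scikit-learn assumed).
from sklearn.tree import DecisionTreeClassifier

# Each tuple: [age, income]; class label: 0 = "does not buy", 1 = "buys".
X = [[25, 30_000], [35, 60_000], [45, 80_000], [20, 20_000], [50, 90_000]]
y = [0, 1, 1, 0, 1]

tree = DecisionTreeClassifier(max_depth=2)
tree.fit(X, y)                      # build the tree from the labeled tuples

# Predict group membership for a new, unseen instance.
print(tree.predict([[30, 55_000]]))
```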
c- Clustering (unsupervised) — This is exploratory data analysis with no labeled data available. With clustering, we separate unlabeled data into a finite, discrete set of natural, hidden data structures.
We observe two kinds of clustering-
1- Hard Clustering- One object belongs to a single cluster.
2- Soft Clustering- One object may belong to multiple clusters.
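A tiny sketch of both kinds, assuming scikit-learn and made-up points: k-means assigns each point to exactly one cluster (hard), while a Gaussian mixture gives each point a membership probability per cluster (soft).

```python
# Hard vs. soft clustering on unlabeled toy points (scikit-learn assumed).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0], [8.0, 8.2], [7.9, 8.1]])

# Hard clustering: each object belongs to a single cluster.
hard = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(hard.labels_)

# Soft clustering: each object gets a membership probability for every cluster.
soft = GaussianMixture(n_components=2, random_state=0).fit(X)
print(soft.predict_proba(X))
```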
d- Anomaly Detection — An anomaly is something that deviates from its expected course. With machine learning, we sometimes want to spot such an outlier: it raises suspicion precisely because it is not something we were looking for specifically, and anomaly detection is a great way to highlight it.
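A minimal sketch of spotting an outlier, assuming scikit-learn's isolation forest and a made-up series of measurements, one of which clearly deviates from the expected course.

```python
# Anomaly-detection sketch with an isolation forest (scikit-learn assumed; toy data).
import numpy as np
from sklearn.ensemble import IsolationForest

# Mostly "normal" readings plus one value that deviates from the expected course.
X = np.array([[10.1], [9.8], [10.3], [10.0], [9.9], [55.0]])

detector = IsolationForest(contamination=0.2, random_state=0).fit(X)
print(detector.predict(X))   # 1 = normal, -1 = anomaly
```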
********* ML Model ***********
The term ML model refers to the model artifact that is created by the training process. The following process is followed to use an ML model:
Build -> Train -> Deploy
AWS provides a tool for this called SageMaker (https://aws.amazon.com/sagemaker/ ), but it is a very costly and complex process for an end user.
We can categorize this process into the following main parts:
1- Data collection — here we specify where we want to fetch our data from, such as an S3 bucket, MongoDB, etc. (a small sketch follows).
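As a rough sketch of this step, assuming PySpark and a placeholder S3 path (the bucket, file, and s3a connector configuration are hypothetical), data collection might look like this:

```python
# Data-collection sketch: fetching raw data from an S3 bucket with PySpark.
# The bucket name, path, and connector configuration are hypothetical placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-collection").getOrCreate()

raw_df = spark.read.csv(
    "s3a://example-bucket/raw/customers.csv",  # placeholder S3 location
    header=True,
    inferSchema=True,
)
raw_df.show(5)
```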
2- Problem definition — in this phase, we select which ML technique we want to use to build our model, such as regression, classification, etc.
3- Data pre-processing — we can also call this the data-cleaning phase. Here we use Spark's built-in processing libraries (https://spark.apache.org/docs/latest/api/python/pyspark.ml.html ) to clean the data.
Possible tasks in data cleaning
a- reformatting or replacing text
b- performing calculations
c- removing garbage or incomplete data
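A small PySpark sketch of these three cleaning tasks, with a made-up DataFrame and hypothetical column names:

```python
# Data-cleaning sketch with PySpark (data and column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("data-cleaning").getOrCreate()
df = spark.createDataFrame(
    [("  Alice ", "1,200", 2), ("Bob", None, 3), ("??", "900", None)],
    ["name", "amount", "visits"],
)

cleaned = (
    df
    # a- reformatting or replacing text
    .withColumn("name", F.trim(F.regexp_replace("name", r"\?", "")))
    # b- performing calculations (parse the amount, derive a per-visit value)
    .withColumn("amount", F.regexp_replace("amount", ",", "").cast("double"))
    .withColumn("amount_per_visit", F.col("amount") / F.col("visits"))
    # c- removing garbage or incomplete data
    .dropna(subset=["amount", "visits"])
    .filter(F.col("name") != "")
)
cleaned.show()
```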
Problems during this process:
a- performance
b- organizing data flow
Why Spark?
a- scalable
b- powerful framework for data handling
c- no additional cost, on-premises training, flexibility, in-memory operation, high-availability model deployment
Spark schema
a- defines the format of a DataFrame
b- may contain various data types (strings, dates, integers, arrays)
c- can filter garbage data during import
d- improves read performance
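A sketch of defining and applying such a schema with PySpark; the column names, types, and file path are hypothetical, and the DROPMALFORMED read mode is one common way to drop garbage rows during import.

```python
# Spark schema sketch: defining the DataFrame format up front (columns/path hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DateType

spark = SparkSession.builder.appName("schema-example").getOrCreate()

# a/b- define the format of the DataFrame and the data type of each column.
schema = StructType([
    StructField("customer_id", IntegerType(), nullable=False),
    StructField("name", StringType(), nullable=True),
    StructField("signup_date", DateType(), nullable=True),
])

# c/d- applying the schema at read time avoids type inference (faster reads)
# and lets Spark drop rows that do not match the declared format.
df = (
    spark.read
    .schema(schema)
    .option("mode", "DROPMALFORMED")   # filter garbage data during import
    .csv("s3a://example-bucket/raw/customers.csv", header=True)
)
df.printSchema()
```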
4- Feature Engineering — Feature engineering is the process of using domain knowledge of the data to create features. Here we check the correlation between variables and verify which features are important and which are not.
Steps in feature engineering:
a- brainstorm features
b- create features
c- check how the features work with the model
d- start again from the first step until the features work well
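One simple way to do the "check" step, assuming pandas and a made-up housing data set, is to look at each feature's correlation with the target before deciding what to keep, combine, or drop:

```python
# Feature-engineering sketch: checking correlations between variables
# (pandas assumed; the columns and values are hypothetical).
import pandas as pd

df = pd.DataFrame({
    "rooms":      [2, 3, 4, 3, 5],
    "area_sqft":  [800, 1200, 1600, 1100, 2100],
    "year_built": [1990, 2005, 2010, 1998, 2015],
    "price":      [150_000, 220_000, 310_000, 200_000, 420_000],
})

# Correlation of each candidate feature with the target helps decide
# which features matter and which do not.
print(df.corr()["price"].sort_values(ascending=False))

# Create a new feature from domain knowledge and check it as well.
df["price_per_sqft"] = df["price"] / df["area_sqft"]
print(df[["price_per_sqft", "rooms"]].corr())
```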
5- Data Segregation — dividing the data into training and testing subsets: we train the model on one subset and then validate how it performs against new data using the other.
There are many strategies to do this, four of the most common ones are:
a- Use a default or custom ratio to split it into the two subsets, sequentially
b- Use a default or custom ratio to split it into the two subsets via a random seed.
c- Use either of the methods above (sequential vs. random) but also shuffle the records within each dataset.
d- Use a custom injected strategy to split the data, when an explicit control over the separation is needed.
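A small sketch of strategies a to c, assuming scikit-learn's train_test_split and a made-up set of records:

```python
# Data-segregation sketch covering strategies a-c (scikit-learn assumed; toy records).
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # ten records, two attributes each
y = np.arange(10)                  # matching targets

# a- custom ratio, split sequentially (no shuffling).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, shuffle=False)

# b/c- same ratio, but split randomly via a seed, with the records shuffled.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, shuffle=True
)

print(len(X_train), len(X_test))   # 8 training records, 2 test records
```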
6- Model Training — The process of training an ML model involves providing an ML algorithm (that is, the learning algorithm) with training data to learn from. The training data must contain the correct answer, which is known as a target or target attribute. The learning algorithm finds patterns in the training data that map the input data attributes to the target (the answer that you want to predict), and it outputs an ML model that captures these patterns.
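A minimal training sketch, assuming pyspark.ml (in keeping with the pre-processing step above); the columns, values, and choice of logistic regression are purely illustrative. The training data carries the target attribute in the "label" column, and the learning algorithm outputs a model artifact that captures the learned patterns.

```python
# Model-training sketch with pyspark.ml (columns and values are hypothetical).
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("model-training").getOrCreate()

# Training data must contain the correct answer: the "label" target attribute.
train_df = spark.createDataFrame(
    [(25, 30_000.0, 0), (35, 60_000.0, 1), (45, 80_000.0, 1), (20, 20_000.0, 0)],
    ["age", "income", "label"],
)

# Map the input attributes into a single features vector, then let the learning
# algorithm find the patterns that map those features to the target.
assembler = VectorAssembler(inputCols=["age", "income"], outputCol="features")
model = LogisticRegression(featuresCol="features", labelCol="label").fit(
    assembler.transform(train_df)
)
print(model.coefficients)   # the patterns captured by the trained model artifact
```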
7- Model Deployment