Demystify Machine Learning

I am going to talk about the basics of supervised learning, unsupervised learning, and reinforcement learning, along with some evaluation methods and more. Unlike conventional computer programs, machine learning techniques literally learn from the data fed into the program. These are algorithms that can find insights in data by applying mathematical formulas to make sense of it.

And that's what separates a machine learning algorithm from a typical computer program.

Instead of actually telling the machine learning algorithm what to look for, you give it a general procedure to follow and it finds the insights on its own.

Now in this article, we're going to address three major types of machine learning: supervised learning, unsupervised learning, and reinforcement learning. We'll also touch on other topics.

So let's first discuss supervised learning.

Supervised learning uses labelled data to predict a label given some features, and that's the really important part: the data is labelled.

So whenever you think of supervised learning, think labels. If the label is continuous, it's called a regression problem, and if it's categorical, it's called a classification problem.

So let's go ahead and give an example of a classification problem that would fall under supervised learning.

For your data, you'll have some features such as height and weight, and the label could be something like gender.

Your task could then be: given a person's height and weight, predict their gender.

So what does this actually look like?

Well for instance we could just plot out a couple of points here.


Remember, since this is supervised learning and classification, we already know the labels. In this case, our labels are the male and female genders, and we have height and weight as our features.

So for a classification task, our model ends up being trained on some training data. Then in the future we'll get a new point whose features we do know, such as the weight and the height, but we don't know what class it belongs to. Our machine learning algorithm will then predict, according to what it's been trained on, what class it should be.
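As a rough sketch of that idea, the snippet below trains nothing fancier than a nearest-neighbour lookup: it predicts the label of whichever training point is closest to the new one. The data values, the `predict` helper, and the choice of nearest neighbour are all assumptions made for illustration, not the method used in the article's figure.

```python
import math

# Hypothetical labelled training data: (height in cm, weight in kg) -> label.
training_data = [
    ((160.0, 55.0), "female"),
    ((165.0, 60.0), "female"),
    ((158.0, 52.0), "female"),
    ((178.0, 80.0), "male"),
    ((183.0, 88.0), "male"),
    ((175.0, 76.0), "male"),
]


def predict(features, labelled_points):
    """Return the label of the training point closest to `features`."""
    _, nearest_label = min(
        labelled_points,
        key=lambda point: math.dist(point[0], features),
    )
    return nearest_label


# A new person whose features we know but whose label we do not.
new_person = (180.0, 82.0)
predicted_label = predict(new_person, training_data)
print(predicted_label)
```

A real project would more likely reach for a library classifier, but the shape is the same: labelled examples go in, and a label comes out for a new set of features.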

In this case, it predicts male. Then there are also regression problems.


This is again a supervised learning technique, because a regression problem does have a given label based on historical values.

The only difference here is that the label, instead of being categorical such as male and female, is continuous, such as a house price.

So, in this case, we'll have a dataset with features such as the square footage of a house, how many rooms it has, and so on, and we need to predict some continuous value such as the house price. The task is then: given a house's size and number of rooms, predict its selling price. When we plot out this data, it looks like this.

We have price and, let's say, square feet.

So here we're only using one feature.

On the x-axis, we have our feature, the square footage of the house, indicating how big the house is, and on the y-axis, we have the actual label that we're trying to predict.

In this case, the label is continuous because it can't be split up into categorical units; instead, it's a continuous price value.

So your model will end up creating some sort of fit to the data.

In this case, there's a trend: the larger the house, the higher the price.

So then when you get a new house whose price you don't know, but whose features you do know, such as the square footage, you query your model and it returns a predicted price.

So that's how a regression supervised learning algorithm works.
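To make the "fit" concrete, here is a minimal sketch that fits a straight line to one feature using ordinary least squares. The sales figures and the helper names `fit_line` and `predict_price` are invented for illustration, and a straight line is only one of many possible models.

```python
# Hypothetical historical sales: (square feet, sale price in dollars).
sales = [(1000, 200_000), (1500, 290_000), (2000, 410_000), (2500, 500_000)]


def fit_line(points):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in points)
    variance = sum((x - mean_x) ** 2 for x, _ in points)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_price(square_feet, slope, intercept):
    """Read the predicted price off the fitted line."""
    return slope * square_feet + intercept


slope, intercept = fit_line(sales)

# A new house whose size we know but whose price we do not.
predicted = predict_price(1800, slope, intercept)
print(slope, intercept, predicted)
```

The fitted slope is the trend described above: each extra square foot adds roughly that many dollars to the predicted price.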

Again, this is a very basic example. Supervised learning has the model train on historical data that is already labelled, such as those previous house sales.

Once the model is trained on that historical data, it can then be used on new data where only the features are known to attempt a prediction.

So that can be really useful if you're a real estate agent.

You can look up all the features of previous houses that have sold and then match them up to their prices.

Train your model, and then when a new house comes onto the market, you can predict what price it should sell for based on its features.

Now the question arises: what if you don't have historical labels for your data and only have features? Since you technically have no right or correct answer to fit on (that is, you have no label), you actually need to look for patterns in the data and find the structure.

And this is known as an unsupervised learning problem because you don't actually have the labels.

So let's walk through an example of an unsupervised learning problem.

A really common unsupervised learning task is called clustering, where you're given data with just the features and no labels, and your task is to cluster it into similar groups.

So, for example, imagine you're given data that has as features the heights and weights of various breeds of dogs. However, since this is unsupervised learning, you don't actually have the label: you don't know what breeds these are. You just have the features, the heights and weights of these dogs.

So your task is to cluster the data together into similar groups.

It is then up to the data scientist, or whoever is performing this machine learning task, to interpret what the clusters actually mean, which is why unsupervised learning has a lot to do with domain knowledge as far as interpreting the results.

So what does this look like as a really basic example?

Here we plot out all our data points for the various heights and weights of these dogs.

After running your clustering algorithm, you end up with these two clusters. Your machine learning model says, hey, I think the points within each of these two clusters are pretty similar to each other. But you should note that clustering isn't actually able to tell you what the group labels should be.

It can't report back what actual breeds of dogs these are.

All it can tell you is that the points in each cluster are similar to each other based on the features.

So it's a really important thing to know especially when it comes to evaluating unsupervised learning models.
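The dog example can be sketched with k-means, one common clustering algorithm (the article does not say which algorithm produced its figure, so this choice is an assumption). The measurements are made up, and seeding the centroids with the first and last points is a simplification to keep the run deterministic.

```python
import math

# Hypothetical unlabelled data: (height in cm, weight in kg) for some dogs.
dogs = [(25, 4), (28, 5), (30, 6), (27, 5), (60, 30), (65, 34), (62, 31), (68, 36)]


def nearest_centroid(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))


def k_means(points, centroids, iterations=10):
    """Alternate between assigning points to centroids and moving the centroids."""
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for point in points:
            clusters[nearest_centroid(point, centroids)].append(point)
        # Move each centroid to the mean of its cluster (keep it if the cluster is empty).
        centroids = [
            tuple(sum(axis) / len(cluster) for axis in zip(*cluster)) if cluster else old
            for cluster, old in zip(clusters, centroids)
        ]
    assignments = [nearest_centroid(point, centroids) for point in points]
    return centroids, assignments


# Simplification: seed the two centroids with the first and last data points.
centroids, assignments = k_means(dogs, [dogs[0], dogs[-1]])
print(assignments)
```

Note that the output is only a cluster number per dog. Nothing in it says which breed a cluster corresponds to, which is exactly the limitation described above.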

Now you might be wondering about machine learning tasks you've heard or read about, like a computer learning to play a video game or drive a car. That's where reinforcement learning comes into play.

And it's not quite like supervised learning or unsupervised learning.

Reinforcement learning works out, through trial and error, which actions yield the greatest rewards.

So when it comes to reinforcement learning, there are three major components.

Those are the agent, the environment, and the actions, and we'll cover all of this in more depth when we actually show you how to do reinforcement learning in Python.

But to start, the first major component is the agent, and that is the learner or decision-maker. Then the agent has the environment.

And that's what the agent interacts with.

So for reinforcement learning on a self-driving car, the environment may be what it's reading in from the camera feed, such as the street signs.

Or if you're training the agent to play a video game, the environment would be the actual pixels on the screen that it can read. Then you have actions.

And that is what the agent can actually do in response to the environment.

For example, for a self-driving car, the actions could be to brake, hit the accelerator, turn, and so on. For video game reinforcement learning, it would be what button to press based on the environment.

Then for the actual process of reinforcement learning what occurs is that the agent will choose the actions that maximize some specified reward metric over a given amount of time.

Maybe if you're training an agent to play a video game, the actual reward is your high score. What you're going to do is have the agent learn the best policy for the environment, so that it responds with the best actions.
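Here is a small sketch of the agent, environment, and action loop. It uses tabular Q-learning, which is just one reinforcement learning method picked for illustration. The corridor environment, the reward of 1.0, and every constant below are hypothetical.

```python
import random

N_STATES = 5                 # corridor cells 0..4; reaching cell 4 ends the episode
ACTIONS = ["left", "right"]  # what the agent is allowed to do


def step(state, action):
    """Environment: move one cell and pay a reward of 1.0 on reaching the last cell."""
    if action == "left":
        next_state = max(0, state - 1)
    else:
        next_state = min(N_STATES - 1, state + 1)
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose_action(q_values, state, epsilon, rng):
    """Agent: mostly take the best-known action, sometimes explore at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    best = max(q_values[state].values())
    return rng.choice([a for a in ACTIONS if q_values[state][a] == best])


def train(episodes=500, alpha=0.5, gamma=0.9, epsilon=0.3, seed=0):
    """Learn action values by trial and error with the Q-learning update."""
    rng = random.Random(seed)
    q_values = [{a: 0.0 for a in ACTIONS} for _ in range(N_STATES)]
    for _episode in range(episodes):
        state = 0
        for _step in range(50):  # cap the episode length
            action = choose_action(q_values, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            future = 0.0 if done else gamma * max(q_values[next_state].values())
            q_values[state][action] += alpha * (reward + future - q_values[state][action])
            state = next_state
            if done:
                break
    return q_values


q_values = train()
# The learned policy: the best-valued action in each non-terminal cell.
policy = [max(ACTIONS, key=lambda a: q_values[s][a]) for s in range(N_STATES - 1)]
print(policy)
```

The policy that comes out is the agent's answer to "what should I do in each state to collect the most reward", which here means walking right towards the rewarding cell.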

So let's go ahead and walk through the basic machine learning process for a supervised learning problem.

Then afterwards we're going to discuss some key differences for unsupervised learning, as well as holdout datasets.

Most of what we cover at the beginning falls under supervised learning or unsupervised learning.

So it's important that we discuss what that whole process actually looks like, because these are the kinds of problems we'll use TensorFlow and neural nets to solve later on. Reinforcement learning, which we'll discuss much later, kind of has its own particular machine learning process.

It doesn't really fall quite into this general machine learning process.

OK, so let's go step by step through what an ML process looks like.

The first thing you have to do is actually acquire the data.

Now, this really depends on what task you're trying to solve.

If it's something like a regression problem, you need to acquire previous house sales, and maybe you get those from something like Zillow.com. Or if you're trying to classify images into dogs versus cats, you somehow acquire various images of dogs and cats.

Then the next step is to clean and organize the data.

So again maybe you got those actual images but they have too many pixels so it's too much information and it's going to take a really long time to train your model.

So maybe you scale the images down, crop away the edges, or keep just the faces of the dogs and cats instead of the whole body. Or maybe you do things like normalizing the data.

So you do some sort of standard scaling on your data. Unfortunately, a lot of your time is actually spent on cleaning the data and not so much on making the cool models.

So again most of your time is going to be spent here on data cleaning.
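As one small example of a cleaning step, standard scaling can be sketched as follows: subtract the mean and divide by the standard deviation so that a feature ends up centred on zero with unit spread. The `standard_scale` helper and the sample values are assumptions for illustration.

```python
import statistics


def standard_scale(values):
    """Rescale values to have mean 0 and (population) standard deviation 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    if std == 0:
        # Every value is identical, so there is no spread to scale by.
        return [0.0 for _ in values]
    return [(value - mean) / std for value in values]


# Hypothetical feature column: house sizes in square feet.
square_feet = [1000, 1500, 2000, 2500]
scaled = standard_scale(square_feet)
print(scaled)
```

Scaling like this keeps a feature measured in thousands, such as square footage, from drowning out a feature measured in single digits, such as the number of rooms.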

Once you've done that, you do what's called a train test split: you split your data into a training set and a testing set.

There are lots of split ratios you can use. A really common one is to have 30 per cent of your data be the test set and 70 per cent be the training set.

But it really depends on the situation: how clean your data is and how much of it you have access to.
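A 70/30 split can be sketched in a few lines. The `train_test_split` function here is a hand-rolled illustration (libraries such as scikit-learn provide their own version), and the seed and the ten dummy rows are arbitrary.

```python
import random


def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then cut off the first part as the test set."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Ten dummy rows standing in for a cleaned dataset.
rows = list(range(10))
train_rows, test_rows = train_test_split(rows)
print(len(train_rows), len(test_rows))
```

Shuffling before cutting matters: if the data happens to be sorted, for example by sale date or by price, an unshuffled split would test the model on a different kind of row than it trained on.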

We'll also discuss holdout datasets later on. Once you've performed that train test split, it's time to actually train, or fit, your model on the training data.

So you'll have some sort of model.

For that, we could use TensorFlow and neural networks as our model and train that model solely on the training set. (TensorFlow is out of the scope of this article, but stay tuned :)

Then once you've trained that model it becomes time to evaluate that model.

And this is where that test set comes in.

Now, the reason we use a separate test set is so we don't basically cheat. Since the model has already been trained on the training set, we want to assess it fairly against data that it has never seen before, just as it would encounter in the real world once it becomes time to deploy that model.

That's the main idea behind the train test split.

So you train your model on that training data and evaluate your model on data it hasn't seen before such as that test set.

Once you've done that, you go ahead and adjust the model parameters to try to get a better fit on that test set.

So again, train your model on the training data, evaluate how it performs on the test data, and then make improvements and cycle back and forth.

Once you're satisfied with your model, you can then deploy it onto new incoming data.

So that's a very fundamental approach to machine learning: acquire the data, clean the data, split it into a test set and a train set, train your model on the training set, evaluate it against the test set, and adjust the model parameters.

Repeat that process until you're satisfied and ready to deploy the model.

Now let's go ahead and address unsupervised learning. Remember, unsupervised learning deals with datasets that have no labels.

So for unsupervised learning problems, most of the time you're not actually going to do a train test split, because it doesn't really make sense to assess your model against a test set when you don't know the correct labels to evaluate against.

What we end up doing instead is using all the data as training data and then evaluating against that training data with some sort of unsupervised learning metric. We'll discuss those evaluation metrics in just a little bit.

Finally, let's go ahead and discuss a holdout set, sometimes also called an evaluation set. It's a really similar process to a train test split.

The difference is that after we clean the data, we split it into three groups: a training set, a testing set, and what we call a holdout set.

Again, the ratios between train, test, and holdout really differ depending on how much data you have and what the particular situation is.

So there's no real right or wrong answer on what the ratio should be between those three sets as far as their sizes go.

However, the actual process is really similar to what we saw before: we take our data in, we clean it, and we split it into those three sets.

We train our model on the training data and then test it against the test data. Based on those results, we can adjust the model parameters, test again, and go through that little loop. Then once we're ready to deploy our model, we have our holdout dataset.

The purpose of the holdout dataset is to get some sort of final metric or idea of how well your model is going to perform upon deployment.

You can think of it as not trying to cheat again with the test data, because technically we've also been adjusting model parameters against the test data.

We still don't have a true understanding of how well the model performs against data that it's truly never seen before and truly never been adjusted for.

That's why we have that holdout dataset.

Now, the main idea here is that once you evaluate your model against the holdout dataset, you're not really allowed to go back and adjust the model parameters.

The purpose of that holdout dataset is to get some sort of final report, some sort of final metric, to let you know: hey, when we deploy this to the real world, these are the sort of metrics we can expect, because the model has truly never seen this data before and has never had its parameters adjusted for it.
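The whole train, test, and holdout routine can be sketched like this. Everything in it is an assumption for illustration: the synthetic data, the 60/20/20 ratio, and the use of a tiny k-nearest-neighbours model whose setting `k` plays the role of the parameter being adjusted.

```python
import random

# Hypothetical data: one feature x, label is True when x >= 50,
# with a few labels flipped on purpose to act as noise.
NOISY = {12, 47, 53, 88}
rows = [(x, (x >= 50) != (x in NOISY)) for x in range(100)]


def three_way_split(data, test_fraction=0.2, holdout_fraction=0.2, seed=0):
    """Shuffle once, then cut the data into train, test, and holdout parts."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    n_holdout = round(len(shuffled) * holdout_fraction)
    test = shuffled[:n_test]
    holdout = shuffled[n_test:n_test + n_holdout]
    train = shuffled[n_test + n_holdout:]
    return train, test, holdout


def knn_predict(train, x, k):
    """Majority vote among the k training rows whose feature is closest to x."""
    neighbours = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes_for_true = sum(label for _, label in neighbours)
    return votes_for_true * 2 > k


def accuracy(train, rows_to_score, k):
    """Share of rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train, x, k) == label for x, label in rows_to_score)
    return hits / len(rows_to_score)


train_set, test_set, holdout_set = three_way_split(rows)

# Adjust the model setting (k) by checking each candidate against the test set.
candidate_ks = [1, 3, 5]
best_k = max(candidate_ks, key=lambda k: accuracy(train_set, test_set, k))

# Touch the holdout set exactly once, after all adjusting is finished.
final_accuracy = accuracy(train_set, holdout_set, best_k)
print(best_k, final_accuracy)
```

The number printed at the end is the one to report. Going back to try another `k` after seeing it would turn the holdout set into a second test set.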

So that's the purpose of the holdout dataset.

We've been discussing model evaluation a lot. That last step has a lot to do with evaluating the model against either the test data or the holdout data.

So let's quickly dive into more detail on evaluation metrics for certain problems. For supervised learning classification, you have evaluation metrics such as accuracy, recall, and precision, and which metric is the most important really depends on the particular situation.

A lot of the time, we're just going to use accuracy because it's the easiest to understand.

Essentially, accuracy is the number of correctly classified samples divided by the total number of samples given to the model.
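As a quick sketch, the three classification metrics mentioned above can be computed by hand. The label lists are invented, and `classification_metrics` is a hypothetical helper. Precision is the share of predicted positives that were right, and recall is the share of actual positives that were found.

```python
def classification_metrics(actual, predicted, positive):
    """Accuracy, precision, and recall, treating `positive` as the class of interest."""
    pairs = list(zip(actual, predicted))
    correct = sum(a == p for a, p in pairs)
    true_positives = sum(a == positive and p == positive for a, p in pairs)
    predicted_positives = sum(p == positive for _, p in pairs)
    actual_positives = sum(a == positive for a, _ in pairs)
    return {
        "accuracy": correct / len(pairs),
        "precision": true_positives / predicted_positives if predicted_positives else 0.0,
        "recall": true_positives / actual_positives if actual_positives else 0.0,
    }


# Hypothetical true labels and model predictions for six images.
actual = ["cat", "cat", "dog", "dog", "dog", "cat"]
predicted = ["cat", "dog", "dog", "dog", "cat", "cat"]

metrics = classification_metrics(actual, predicted, positive="dog")
print(metrics)
```

Four of the six predictions match, so accuracy is 4/6. Precision and recall for "dog" both come out at 2/3 here, but on imbalanced data they can differ a lot from accuracy, which is why the right metric depends on the situation.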

So again, pretty straightforward.

For regression evaluation, which again falls under supervised learning, there are lots of evaluation metrics: things like mean absolute error (MAE), mean squared error (MSE), and root mean squared error (RMSE).

Essentially, all of these are just measurements of how far off, on average, your prediction is from the correct continuous value.

So mean absolute error, mean squared error, and root mean squared error are all, in some manner or degree, trying to say the same thing: on average, your model's predictions are about this far off numerically.
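These three error metrics are short enough to write out directly. The prices below are made up, and `regression_errors` is a hypothetical helper name.

```python
import math


def regression_errors(actual, predicted):
    """Return (MAE, MSE, RMSE) for paired actual and predicted values."""
    errors = [p - a for a, p in zip(actual, predicted)]
    mae = sum(abs(e) for e in errors) / len(errors)
    mse = sum(e * e for e in errors) / len(errors)
    rmse = math.sqrt(mse)
    return mae, mse, rmse


# Hypothetical house prices in thousands of dollars.
actual_prices = [200.0, 300.0, 400.0, 500.0]
predicted_prices = [210.0, 290.0, 420.0, 500.0]

mae, mse, rmse = regression_errors(actual_prices, predicted_prices)
print(mae, mse, rmse)
```

MAE and RMSE are in the same units as the label (thousands of dollars here), while MSE is in squared units. RMSE comes out larger than MAE in this example because squaring gives the single 20-unit miss more weight.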

So we'll be using these metrics when we do regression tasks.

For unsupervised learning, evaluating the model actually becomes much harder.

And it really depends on the overall goal of the task.

Again remember for unsupervised learning you never really had the correct labels to compare it to.

Nevertheless, you can use things like cluster homogeneity or something called the Rand index to evaluate your unsupervised learning model.
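As a sketch of one of these, the Rand index compares two groupings of the same points by checking every pair: do both groupings agree on whether the pair belongs together? It assumes you have some reference grouping to compare against, which in practice is often unavailable. The label lists below are invented.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of point pairs on which two groupings agree (same group or not)."""
    pairs = list(combinations(range(len(labels_a)), 2))
    agreements = sum(
        (labels_a[i] == labels_a[j]) == (labels_b[i] == labels_b[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


# Hypothetical reference grouping and the output of a clustering algorithm.
reference = ["a", "a", "a", "b", "b", "b"]
clusters = [0, 0, 1, 1, 1, 1]

score = rand_index(reference, clusters)
print(score)
```

The cluster names themselves do not matter, only which points end up together, so a clustering that matches the reference under different names still scores 1.0.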

Now remember, for unsupervised learning, even if you have good metrics, your model may not have performed well. For example, with data shaped like two moons, it's really easy for human eyes to see what the correct clusters should be, but depending on your evaluation metrics, you may get bad splits or bad clustering on your data, where the metrics turn out really well but the actual groupings don't look correct.

So again unsupervised learning clustering is just a really hard problem and evaluating it is also a really hard problem.

Now for reinforcement learning, evaluation is usually a lot more obvious, since the evaluation, or reward metric, is actually built into the training of the model.

So it's typically just how well the model performs against the task it's assigned, such as that particular score in the video game. Again, we'll discuss this a lot more when we actually show you how to perform reinforcement learning in this series.

As a quick review, we discussed machine learning, the types of machine learning, the general machine learning process, and a basic overview of evaluation metrics.
