Machine Learning for Humans: Linear Regression

In an earlier article we discussed the various approaches to machine learning. One of the most common is linear regression, a technique that models a relationship with a straight line. Think back to early algebra, where a line was defined as y = mx + b: "m" is the slope, "b" is the y-intercept, "x" is the input, and "y" is the output. Rename "y" as sales and "x" as advertising expense, and we are defining a relationship between how much we sell and how much we spend on advertising. This definition includes only one input (advertising expense), but other inputs can be added, and the approach stays the same.
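To make the equation concrete, here is a minimal sketch in Python; the slope and intercept values below are invented for illustration, not fitted from real data:

```python
# A tiny sketch of y = m*x + b. The slope (m) and intercept (b) are
# made-up illustration values, not numbers fitted from real data.

def predicted_sales(ad_spend: float, m: float = 3.2, b: float = 10_000.0) -> float:
    """Predict sales (y) from advertising expense (x) with a straight line."""
    return m * ad_spend + b

print(predicted_sales(5_000))  # 3.2 * 5000 + 10000 = 26000.0
```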

Let's see how this is applied. A useful example is deciding on the price of a house. Generally, this is done with a realtor, who reviews recent sales in the neighborhood most comparable to yours. A price range is given, and the owner picks something they feel comfortable accepting. If they feel the house falls short of a target value, they debate which improvements could bring it up to that level. Sometimes this works; sometimes it does not. How can machines be used to improve this valuation decision?

One approach to machine learning is to gather data on as many home sales as possible, along with the attributes of each house (square feet, number of bedrooms, whether it has a pool, etc.). We now have the inputs (the attributes mentioned above) and the output (the sale price of the house in dollars), and we are able to ask the machine to create a model, or equation, that maps the two together. The computer will also be able to tell us which attributes contribute the most (or least) to the value of the house (is it worthwhile to upgrade the kitchen, add a bath, add a pool?). We therefore have the ability to predict the price of a house based on these attributes.

But how accurate is the model? And how do we know? The data scientist will have many candidate models and will need to run trials to determine which is best for the given attributes. The data scientist will 'train' the model on one set of data, then test it against a different, held-out set. By reviewing those results, the data scientist can make modifications and narrow the candidates down to a reasonable model.
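As a rough sketch of that train-and-test workflow (the file home_sales.csv and its column names are hypothetical placeholders, not real data), it might look like this with scikit-learn:

```python
# Hypothetical sketch of the train/test workflow described above.
# "home_sales.csv" and its column names are placeholders.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("home_sales.csv")         # one row per home sale
X = data[["sqft", "bedrooms", "has_pool"]]   # inputs: the attributes
y = data["sale_price"]                       # output: price in dollars

# Train on 80% of the data; hold out 20% for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("R^2 on unseen test data:", model.score(X_test, y_test))

# The fitted coefficients hint at which attribute moves the price most.
print(dict(zip(X.columns, model.coef_)))
```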

In general, machine learning can be broken down into three categories:

  • Supervised Learning - the case where inputs and outputs are known, as in the example above
  • Unsupervised Learning - where only the inputs are known.
  • Reinforcement Learning - where feedback on the outputs arrives only after the model has acted.

We will now go through a Supervised Learning example, with the others reserved for follow-on posts.

Supervised Learning - The input and output pairs of the model are both known. The machine is trained to define a relationship between them, which is then used to predict output values when new inputs are given. Let's put some numbers behind the example above. We will use housing data from the city of Boston to train our model, then use separate Boston data to test it. We will use linear regression to build the model. As the name implies, linear regression is a technique that uses straight lines to define the relationship between the cost of the house and its attributes.

In addition to predicting the cost of the house, we will determine which specific attributes contribute the most to the sales price. We will start off simple, very simple. In the diagram below, we have asked the machine to build a model of the cost of the house based only on the number of rooms. Our thinking is that the bigger the house, the higher the cost. (Referencing the equation above, "y" is the cost of the house and "x" is the number of rooms.)


The model works, sort of. The red line is the model; the dots are the actual house prices in our training data. Though the price of a house does increase with the number of rooms, there are obviously other factors at work. There are clumps of houses in the center of the graph that are difficult to distinguish, and then there are the isolated points at the edges. Those are outliers, and any model will have difficulty predicting them.
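A minimal sketch that reproduces this one-attribute model, assuming the Boston data sits in a local boston.csv with the dataset's conventional RM (average rooms) and MEDV (median home value) columns; the file itself is a placeholder:

```python
# One-attribute model: home value as a straight-line function of rooms.
# Assumes a hypothetical local "boston.csv" with the conventional
# RM (average rooms) and MEDV (median home value) columns.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

boston = pd.read_csv("boston.csv")
X = boston[["RM"]]   # x: average number of rooms
y = boston["MEDV"]   # y: home value

model = LinearRegression().fit(X, y)

# Evaluate the fitted line over the observed range of room counts.
xs = pd.DataFrame({"RM": np.linspace(X["RM"].min(), X["RM"].max(), 100)})

plt.scatter(boston["RM"], y, alpha=0.5)          # the dots: actual prices
plt.plot(xs["RM"], model.predict(xs), color="red")  # the red line: the model
plt.xlabel("Number of rooms")
plt.ylabel("Home value")
plt.show()
```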

All is not lost. Better models can be built by analyzing more data attributes; rarely does a single attribute make an adequate predictor. In practice, the number of attributes is limited mainly by how much reliable data we can collect. Any model will carry some uncertainty, but the goal is to reduce that uncertainty to a workable level.

So now we will run our model with more attributes. Since it is difficult to plot all of the inputs at once, we switch to a residuals-vs-fitted graph. This plots the fitted (predicted) values against the residuals, the differences between the observed and predicted values, so for each prediction it shows how close we were to the actual value. This is a better fit than our first attempt with only the number of rooms: the model explains roughly seventy-five percent of the variation in home prices; the remaining differences are unexplained.
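Sketching that step (continuing the hypothetical boston.csv from above), fitting all of the attributes and plotting residuals against fitted values might look like this:

```python
# Multi-attribute model plus a residuals-vs-fitted plot.
# Continues the hypothetical "boston.csv" from the earlier sketch.
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

boston = pd.read_csv("boston.csv")
X = boston.drop(columns=["MEDV"])   # every attribute except the price
y = boston["MEDV"]

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted              # observed minus predicted

plt.scatter(fitted, residuals, alpha=0.5)
plt.axhline(0, color="red")         # a perfect prediction sits on this line
plt.xlabel("Fitted values")
plt.ylabel("Residuals")
plt.show()

# R^2 near 0.75 corresponds to the article's seventy-five percent figure.
print("R^2:", model.score(X, y))
```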

This, however, is not the endpoint. Perhaps some attributes are not linear, or perhaps they are interdependent. Terms for each of these effects can be added to the model, though improving it is often a matter of trial and error, and sometimes the missing ingredient is an attribute that was never collected.
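One common way to capture non-linear or interdependent attributes is to add squared and interaction terms before the linear fit. A minimal sketch, again using the hypothetical boston.csv:

```python
# Sketch: squared and pairwise-interaction terms for non-linear or
# interdependent attributes. "boston.csv" remains a placeholder.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

boston = pd.read_csv("boston.csv")
X = boston.drop(columns=["MEDV"])
y = boston["MEDV"]

# degree=2 adds each attribute squared plus every pairwise interaction.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print("R^2 with polynomial terms:", model.score(X, y))
```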

This is just one technique that can be used for machine learning. Other techniques will be explored in later parts of this series.

