Using machine learning to fit every nook and cranny of your data

See here for the corresponding Colab Notebook with Python code and explanation

What is the difference between statistics and machine learning? Is there a difference?

In many ways, machine learning is part of, and very much like, traditional statistics: you try to predict something using a mathematical model.

However, today's computing power allows us to use very fine-grained algorithms that were not feasible in the past. I sometimes tell my students that machine learning allows you to fit "every nook and cranny" of your data. To illustrate that, let's take the expression literally. Let's use spatial data in a machine learning algorithm to predict something.

In this article I will show how machine learning algorithms like random forest are vastly superior to linear regression when it comes to representing local structure. What do I mean by "local structure"? Well, let's take house prices, for example. Some cities are obviously more expensive than others. Within those cities, you have rich and poor neighborhoods, which in turn contain rich and poor streets, and so on. So there is structure at various levels. How would you model this?

For this article I will make use of a dataset that was scraped from a website, containing 8412 house prices in California (thanks to Mats van der Burg for providing me with the data).

A linear model

First, let's have a look at our dataset:

[Image: house prices in California plotted by longitude and latitude, on a log10 scale]

Since there are a lot of expensive houses in California, I used a base-10 logarithmic (log10) scale to be able to see the whole range of prices. So 5 translates to 100,000 USD (10^5) and 6 to 1,000,000 USD (10^6). We can clearly see the expensive cities and suburbs of Los Angeles (south) and San Francisco (northwest).
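If you want to make this kind of plot yourself, here is a minimal sketch. It assumes the scraped data sits in a CSV file with columns named longitude, latitude and price; the file name and column names are my assumptions for illustration, not necessarily how the original dataset is organized (see the Colab notebook for the actual code).

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names for the scraped dataset
houses = pd.read_csv("california_houses.csv")

plt.scatter(houses["longitude"], houses["latitude"],
            c=np.log10(houses["price"]), s=5)
plt.colorbar(label="log10(price in USD)")
plt.xlabel("longitude")
plt.ylabel("latitude")
plt.title("House prices in California (log10 scale)")
plt.show()
```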

When we fit a linear model to the data using only the longitude and the latitude as X variables, this is the result:

[Image: predicted house prices on the test data from the linear model]

What's plotted here are the test data, which haven't been used to train the model. So this gives us a fairly objective test of how the model is doing. And it's doing pretty poorly! R-squared is 0.215, meaning that we can explain 21.5% of the variance in the data.

Why doesn't this work? Well, we're only modeling a linear increase (or decrease) in both directions, without any interaction between the two variables. This results in a gradient of increasing house prices, from the northeast to the southwest. Although the global pattern is reasonably accurate, there isn't any local variation.
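As a rough sketch of what this linear fit looks like in code, using scikit-learn and the same hypothetical file and column names as above (the exact R-squared you get will depend on the data and the train/test split):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Hypothetical file and column names for the scraped dataset
houses = pd.read_csv("california_houses.csv")
X = houses[["longitude", "latitude"]]
y = np.log10(houses["price"])          # same log10 scale as in the plots

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

linreg = LinearRegression().fit(X_train, y_train)
print("R-squared on test data:", r2_score(y_test, linreg.predict(X_test)))
```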

Using a machine learning method

The problem with using a linear model is that the relations here are, well, not linear. For example, there are vast differences in price between different neighborhoods within a city. However, the differences in coordinates within a city are minor, relative to the entire map. So if we assume a linear relation between map coordinates and price, then there is no way to model the differences between neighborhoods.

Could we improve our linear model? Well, we could make use of the fact that longitude and latitude correspond to cities, neighborhoods, and streets. We could get the right databases to match coordinates to locations, merge them, and dummy-code the resulting variables so they can go into the model... Argh, this is starting to sound like a lot of work! There must be an easier way...

Which brings us to machine learning. The nice thing about modern machine learning algorithms is that they tend to make few assumptions about the data. They don't assume structures within the data, such as linear relations. As such, they usually do quite well "out of the box" with differently structured data.

For example, here is the result using random forest (R-squared = 0.530):

[Image: predicted house prices on the test data from the random forest model]

Random forest is an ensemble method that combines multiple decision trees. Decision trees are built up of different nodes, each representing a "decision" (for example: "Is the latitude higher than 36?"). The trick is that the tree contains many such decisions in a tree-like structure, so that it is able to model highly local structure. This allows us to correctly predict house prices in small areas.
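Swapping the linear model for a random forest takes only a small change. Here is a self-contained sketch, again with hypothetical file and column names; the number of trees shown is simply the scikit-learn default, not necessarily the setting used for the figure above.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

# Same hypothetical file and column names as in the linear sketch
houses = pd.read_csv("california_houses.csv")
X = houses[["longitude", "latitude"]]
y = np.log10(houses["price"])

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 100 trees is the scikit-learn default; random_state just makes the run reproducible
forest = RandomForestRegressor(n_estimators=100, random_state=42)
forest.fit(X_train, y_train)
print("R-squared on test data:", r2_score(y_test, forest.predict(X_test)))
```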

Conclusion

Although "machine learning" sounds futuristic, it's not necessarily all that new or different from "traditional" statistics. The algorithm I used here, random forest, is over 20 years old (from work by Ho, 1995 and Breiman, 2001, among others). Another effective machine learning algorithm, k-Nearest Neighbor, is actually from 1951 (Fix & Hodges)! Algorithms that can model complicated, non-linear patterns have existed for a long time.

There are two reasons you're hearing more about machine learning these days: bigger datasets, and increases in computational power. For example, in this study (Dabbah et al., 2021) the researchers used random forest to predict mortality due to COVID-19. They included 11,245 participants and 12,000 features. Working with this level of dimensionality would have been infeasible in the past, due to a lack of computational resources. These days, however, we can give highly individualized predictions.

In other words: we're fitting "every nook and cranny" of the data.
