Why scale matters in learning and predictive models

It is broadly believed that the ability to acquire, store, analyze, and make predictions from massive amounts of real-world data will be one of the key capabilities for surviving in future marketplaces. According to research released by IDC (the Digital Universe studies, 2007-2014), data is expected to double every two years over the next decade, hitting 45,000 exabytes by 2020.

At first glance, it seems like common sense that having more data is “better” than having less, assuming the extra data provides additional information of equal or better quality. However, there is another important factor that people often ignore when adding extra data to their learning and predictive models: scale. Mixing data at different scales, in both the spatial and temporal dimensions, usually degrades model skill instead of improving it.
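To make this concrete, here is a minimal synthetic sketch in Python (the data, feature counts, and noise levels are all invented for illustration, not taken from any real study): a model that already has the right coarse-scale predictor is compared against the same model after a batch of fine-scale, noise-dominated predictors is appended. The extra data typically lowers out-of-sample skill.

```python
# Hypothetical illustration: extra predictors at a mismatched (finer, noisier)
# scale can degrade a model that already has the right coarse-scale input.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
n_train, n_test, n_fine = 40, 200, 35   # few samples, many fine-scale features

def make_data(n):
    coarse = rng.normal(size=n)                                       # the right-scale predictor
    fine = coarse[:, None] + rng.normal(scale=5.0, size=(n, n_fine))  # fine-scale: mostly local noise
    y = 2.0 * coarse + rng.normal(scale=0.5, size=n)                  # target responds to the coarse scale
    return coarse[:, None], fine, y

c_tr, f_tr, y_tr = make_data(n_train)
c_te, f_te, y_te = make_data(n_test)

coarse_only = LinearRegression().fit(c_tr, y_tr)
mixed = LinearRegression().fit(np.hstack([c_tr, f_tr]), y_tr)

print("coarse predictor only, test R^2:",
      round(r2_score(y_te, coarse_only.predict(c_te)), 2))
print("coarse + 35 fine-scale predictors, test R^2:",
      round(r2_score(y_te, mixed.predict(np.hstack([c_te, f_te]))), 2))
```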

Good models reveal the most important relationships among real-world entities by simulating them at the right scale.

A few years ago, I was working with dynamic global circulation models to understand and predict hydrologic variables at a very small scale (daily, within a single watershed). The prediction skill was much lower than when I predicted at a weekly, regional scale, even though I used much simpler models for the latter.
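As a rough illustration of why (synthetic numbers only, not the original model output): when the daily values at a point are dominated by local noise on top of a slowly varying signal, a prediction that captures only the signal scores poorly day by day but much better once both series are averaged to weekly values.

```python
# Hypothetical sketch: skill of the same prediction at daily vs. weekly aggregation.
import numpy as np

rng = np.random.default_rng(0)
n_days = 7 * 520                                             # roughly ten years of daily values

signal = np.sin(np.arange(n_days) * 2 * np.pi / 365.0)       # slow seasonal signal
truth = signal + rng.normal(scale=1.5, size=n_days)          # daily truth = signal + heavy local noise
prediction = signal + rng.normal(scale=1.5, size=n_days)     # a model that captures only the signal

def skill(a, b):
    """Pearson correlation as a simple skill score."""
    return np.corrcoef(a, b)[0, 1]

print("daily skill: ", round(skill(prediction, truth), 2))   # noise dominates -> low correlation

weekly_truth = truth.reshape(-1, 7).mean(axis=1)             # average over 7-day blocks
weekly_pred = prediction.reshape(-1, 7).mean(axis=1)
print("weekly skill:", round(skill(weekly_pred, weekly_truth), 2))  # much of the noise averages out
```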

It is easy to understand that you cannot predict variables at a smaller scale using only larger-scale variables.

It is obvious to a climatologist that a global model cannot capture sub-grid-scale climatic mechanisms, no matter how much data you have or how powerful the predictive model is. The scale is simply insufficient for making such predictions.

People in other domains usually find this easy to see and understand. For instance, it is very unlikely that you can predict well how many books a person in SomewhereVille will buy from Amazon on a given day, even with all of Amazon's historical sales data recorded with 100% accuracy. Another example would be predicting whether a patient will suffer a heart attack in a given day or week, even given the patient's complete history and reference data from all similar patients, again with 100% accuracy.

However, people often ignore the flip side: predicting variables at a larger scale using a lot of data at smaller scales is usually a problem too.

Again, in my earlier climate research, one of my trials was to use sea surface temperature at a very fine scale (e.g., a 1 km × 1 km grid, daily) to predict seasonal precipitation at a regional scale (e.g., 100 km × 100 km). It did not consistently (in fact, rarely) outperform a simple linear model relating seasonally averaged sea surface temperature to precipitation at a 30-day lead time.
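A hedged, synthetic version of that comparison (invented data and model choices, not my original experiment) looks like this: a flexible model is given the full fine-scale SST field, while a simple linear regression only sees the regional mean, and with a limited number of seasons the simple model tends to generalize better.

```python
# Hypothetical sketch: complex model on fine-scale SST vs. simple linear model
# on the regionally averaged SST (lead time omitted for simplicity).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n_seasons, n_cells = 60, 400                          # 60 seasons, 400 fine-scale grid cells

regional_sst = rng.normal(size=n_seasons)             # regional-mean SST anomaly per season
fine_sst = regional_sst[:, None] + rng.normal(scale=3.0, size=(n_seasons, n_cells))
precip = 3.0 * regional_sst + rng.normal(scale=1.0, size=n_seasons)  # responds to the regional mean

f_tr, f_te, a_tr, a_te, y_tr, y_te = train_test_split(
    fine_sst, regional_sst[:, None], precip, test_size=0.33, random_state=0)

complex_model = RandomForestRegressor(n_estimators=300, random_state=0).fit(f_tr, y_tr)
simple_model = LinearRegression().fit(a_tr, y_tr)

print("complex model, fine-scale grid, test R^2:",
      round(r2_score(y_te, complex_model.predict(f_te)), 2))
print("simple linear model, averaged SST, test R^2:",
      round(r2_score(y_te, simple_model.predict(a_te)), 2))
```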

This is where people working with big data often make mistakes in scale matching. For example: has someone told you that they can precisely predict and optimize a transportation system if they can train their ANN models on data collected from everybody's mobile devices, plus data from other monitoring devices? Has someone told you that if all genomic and environmental data about a person is collected and fed into a complex model, they can predict every disease that person is going to have?

It is simply not going to work. Why? Because big data plus a complex model does not simulate the relevant relationships at the right scale, and it is not possible to emulate the exact physics controlling such complex relationships. Scale-mismatched learning models capture more noise than predominant signal.

This is analogous to overfitting a model in statistics.
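The analogy is easy to see in a toy experiment (again synthetic and purely illustrative): a high-capacity model trained on fine-scale, noise-dominated inputs shows the classic overfitting signature, near-perfect skill on the training data and little skill on new data.

```python
# Hypothetical sketch: the overfitting signature of a scale-mismatched model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
n_train, n_test, n_features = 50, 200, 200

def make_data(n):
    driver = rng.normal(size=n)                                         # the coarse-scale driver
    X = driver[:, None] + rng.normal(scale=4.0, size=(n, n_features))   # fine-scale, noise-dominated inputs
    y = driver + rng.normal(scale=0.5, size=n)                          # target follows the coarse driver only
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("train R^2:", round(r2_score(y_tr, model.predict(X_tr)), 2))   # near 1: the noise has been memorized
print("test  R^2:", round(r2_score(y_te, model.predict(X_te)), 2))   # much lower: the noise does not generalize
```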


