Why scale matters in learning and predictive models

It is broadly believed that the ability to acquire, store, analyze, and make predictions from massive amounts of real-world data will be one of the key capabilities for surviving in future marketplaces. According to research released by IDC (the Digital Universe studies, 2007-2014), data is expected to double every two years over the next decade, hitting 45,000 exabytes by 2020.

At first glance, it seems like common sense that having more data is “better” than having less, assuming the extra data provides additional information of equal or better quality. However, there is another important factor that people often ignore when adding extra data to their learning and predictive models: scale. Mixing data at different scales, in both the spatial and temporal dimensions, usually degrades model skill instead of improving it.
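To make this concrete, here is a minimal synthetic sketch in Python (the data, feature counts, and noise levels are all invented for illustration, not taken from any real study): a model that already has the right coarse-scale predictor is compared against the same model after a batch of fine-scale, noise-dominated predictors is appended. The extra data typically lowers out-of-sample skill.

```python
# Hypothetical illustration: extra predictors at a mismatched (finer, noisier)
# scale can degrade a model that already has the right coarse-scale input.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

rng = np.random.default_rng(3)
n_train, n_test, n_fine = 40, 200, 35   # few samples, many fine-scale features

def make_data(n):
    coarse = rng.normal(size=n)                                       # the right-scale predictor
    fine = coarse[:, None] + rng.normal(scale=5.0, size=(n, n_fine))  # fine-scale: mostly local noise
    y = 2.0 * coarse + rng.normal(scale=0.5, size=n)                  # target responds to the coarse scale
    return coarse[:, None], fine, y

c_tr, f_tr, y_tr = make_data(n_train)
c_te, f_te, y_te = make_data(n_test)

coarse_only = LinearRegression().fit(c_tr, y_tr)
mixed = LinearRegression().fit(np.hstack([c_tr, f_tr]), y_tr)

print("coarse predictor only, test R^2:",
      round(r2_score(y_te, coarse_only.predict(c_te)), 2))
print("coarse + 35 fine-scale predictors, test R^2:",
      round(r2_score(y_te, mixed.predict(np.hstack([c_te, f_te]))), 2))
```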

Good models reveal the most important relationships among real-world entities by simulating them at the right scale.

A few years ago, I was working with dynamic global circulation models to understand and predict hydrologic variables at a very small scale (daily, within a single watershed). The prediction skill was much lower than when I predicted at a weekly, regional scale, even though I used much simpler models for the latter.
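As a rough illustration of why (synthetic numbers only, not the original model output): when the daily values at a point are dominated by local noise on top of a slowly varying signal, a prediction that captures only the signal scores poorly day by day but much better once both series are averaged to weekly values.

```python
# Hypothetical sketch: skill of the same prediction at daily vs. weekly aggregation.
import numpy as np

rng = np.random.default_rng(0)
n_days = 7 * 520                                             # roughly ten years of daily values

signal = np.sin(np.arange(n_days) * 2 * np.pi / 365.0)       # slow seasonal signal
truth = signal + rng.normal(scale=1.5, size=n_days)          # daily truth = signal + heavy local noise
prediction = signal + rng.normal(scale=1.5, size=n_days)     # a model that captures only the signal

def skill(a, b):
    """Pearson correlation as a simple skill score."""
    return np.corrcoef(a, b)[0, 1]

print("daily skill: ", round(skill(prediction, truth), 2))   # noise dominates -> low correlation

weekly_truth = truth.reshape(-1, 7).mean(axis=1)             # average over 7-day blocks
weekly_pred = prediction.reshape(-1, 7).mean(axis=1)
print("weekly skill:", round(skill(weekly_pred, weekly_truth), 2))  # much of the noise averages out
```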

It is easy to understand that you cannot predict variables at a smaller scale using only larger-scale variables.

It is obvious to a climatologist that a global model cannot capture sub-grid-scale climatic mechanisms, no matter how much data you have or how powerful the predictive model is. The scale is simply insufficient for making such predictions.

People in other domains usually find this easy to see and understand. For instance, it is very unlikely that you can predict well how many books a person in SomewhereVille will buy from Amazon on a given day, even with all of Amazon's historical sales data recorded with 100% accuracy. Another example would be predicting whether a patient will suffer a heart attack in a given day or week, even given the patient's complete history and reference data from all similar patients, again with 100% accuracy.

However, people often ignore the flip side: predicting variables at a larger scale using a lot of data at smaller scales is usually a problem too.

Again, in my earlier climate research, one of my trials was to use sea surface temperature at a very fine scale (e.g., a 1 km × 1 km grid, daily) to predict seasonal precipitation at a regional scale (e.g., 100 km × 100 km). It did not consistently (in fact, rarely) outperform a simple linear model relating seasonally averaged sea surface temperature to precipitation at a 30-day lead time.
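A hedged, synthetic version of that comparison (invented data and model choices, not my original experiment) looks like this: a flexible model is given the full fine-scale SST field, while a simple linear regression only sees the regional mean, and with a limited number of seasons the simple model tends to generalize better.

```python
# Hypothetical sketch: complex model on fine-scale SST vs. simple linear model
# on the regionally averaged SST (lead time omitted for simplicity).
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.default_rng(1)
n_seasons, n_cells = 60, 400                          # 60 seasons, 400 fine-scale grid cells

regional_sst = rng.normal(size=n_seasons)             # regional-mean SST anomaly per season
fine_sst = regional_sst[:, None] + rng.normal(scale=3.0, size=(n_seasons, n_cells))
precip = 3.0 * regional_sst + rng.normal(scale=1.0, size=n_seasons)  # responds to the regional mean

f_tr, f_te, a_tr, a_te, y_tr, y_te = train_test_split(
    fine_sst, regional_sst[:, None], precip, test_size=0.33, random_state=0)

complex_model = RandomForestRegressor(n_estimators=300, random_state=0).fit(f_tr, y_tr)
simple_model = LinearRegression().fit(a_tr, y_tr)

print("complex model, fine-scale grid, test R^2:",
      round(r2_score(y_te, complex_model.predict(f_te)), 2))
print("simple linear model, averaged SST, test R^2:",
      round(r2_score(y_te, simple_model.predict(a_te)), 2))
```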

This is where people working with big data often make mistakes in scale matching. For example: has someone told you that they can precisely predict and optimize a transportation system if they can train their ANN models on data collected from everybody's mobile devices, plus data from other monitoring devices? Has someone told you that if all genomic and environmental data about a person is collected and fed into a complex model, they can predict every disease that person is going to have?

It is simply not going to work. Why? Because big data plus a complex model does not simulate the relevant relationships at the right scale, and it is not possible to emulate the exact physics controlling such complex relationships. Scale-mismatched learning models capture more noise than predominant signal.

This is analogous to overfitting a model in statistics.
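The analogy is easy to see in a toy experiment (again synthetic and purely illustrative): a high-capacity model trained on fine-scale, noise-dominated inputs shows the classic overfitting signature, near-perfect skill on the training data and little skill on new data.

```python
# Hypothetical sketch: the overfitting signature of a scale-mismatched model.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(2)
n_train, n_test, n_features = 50, 200, 200

def make_data(n):
    driver = rng.normal(size=n)                                         # the coarse-scale driver
    X = driver[:, None] + rng.normal(scale=4.0, size=(n, n_features))   # fine-scale, noise-dominated inputs
    y = driver + rng.normal(scale=0.5, size=n)                          # target follows the coarse driver only
    return X, y

X_tr, y_tr = make_data(n_train)
X_te, y_te = make_data(n_test)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("train R^2:", round(r2_score(y_tr, model.predict(X_tr)), 2))   # near 1: the noise has been memorized
print("test  R^2:", round(r2_score(y_te, model.predict(X_te)), 2))   # much lower: the noise does not generalize
```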


