Keynote Address - Connecting the Dots with Geospatial Foundation Models

As a follow-up to my address last month in NYC, I wanted to highlight the central hypothesis of the presentation: AI companies have exhausted human knowledge for training, and satellite data is the new fuel for AI models.

The theory is that in the race to build ever-more capable AI, companies have scraped up enormous datasets from all corners of the digital world. Recently, tech figures like Elon Musk have claimed that AI developers have “exhausted the cumulative sum of human knowledge” in training their models.


This bold statement raises pressing questions: How do AI companies gather and use data? Is it really possible for an AI to consume all human knowledge, or is this an exaggeration? What limitations persist even after training on virtually everything? This post delves into these issues, drawing on expert insights and studies to assess whether we’re genuinely hitting a “knowledge saturation” point in AI – and what that means for the future of satellite data.

Modern AI models, especially large language models (LLMs), are trained on massive datasets drawn from a wide array of sources. For example, OpenAI’s GPT-3 was trained on about 45 terabytes of text from sources like a filtered version of Common Crawl (a repository of billions of web pages), large collections of books, Wikipedia articles, and more.

Common Crawl data alone supplied roughly 60% of GPT-3’s training corpus (after filtering), with curated books and Wikipedia making up a significant portion of the rest.

In essence, AI companies deploy web crawlers and leverage open datasets to vacuum up online text from news sites, social media posts, discussion forums, code repositories, academic papers, and public-domain works. All this raw text becomes the “food” for AI training.

However, more data isn’t automatically better – data quality matters. Companies invest heavily in data preprocessing to clean and organize these gigantic corpora. This involves parsing HTML to extract textual content, removing non-text noise (like navigation menus or duplicate pages), filtering out offensive or nonsensical text, and deduplicating to avoid repetition.
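
For readers who want a concrete picture of what this cleanup involves, below is a minimal, illustrative Python sketch, assuming raw HTML pages as input. The function names, the 200-character quality threshold, and the exact-hash deduplication are my own simplifications for illustration, not any particular company’s pipeline.

```python
import hashlib
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text from an HTML page, skipping script/style/navigation blocks."""
    SKIP_TAGS = {"script", "style", "nav", "footer"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def clean_page(html: str) -> str:
    """HTML in, whitespace-normalised plain text out."""
    parser = TextExtractor()
    parser.feed(html)
    return re.sub(r"\s+", " ", " ".join(parser.parts)).strip()


def build_corpus(raw_pages):
    """Clean, filter, and exact-deduplicate a collection of raw HTML pages."""
    seen = set()
    corpus = []
    for html in raw_pages:
        text = clean_page(html)
        if len(text) < 200:                # drop near-empty or boilerplate-only pages (assumed threshold)
            continue
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest in seen:                 # skip exact duplicates
            continue
        seen.add(digest)
        corpus.append(text)
    return corpus
```

Production pipelines go much further, adding language identification, quality and toxicity classifiers, and near-duplicate detection, but the overall shape is the same: extract, filter, deduplicate.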

In addition to public web data, AI firms incorporate proprietary and domain-specific datasets when available. For example, Bloomberg collected decades of internally curated financial documents to train BloombergGPT, a domain-specific language model for finance.

Such proprietary data can give models expertise in specialized areas (e.g. financial reports, legal contracts, scientific literature) that aren’t well-covered in the general internet crawl. The overall strategy is to cast a wide net over “the sum of human knowledge” that is digitally accessible – from classic books to Reddit threads – and then carefully clean and balance this data for training.


The notion that AI has gobbled up all human knowledge stems from the observation that we have only a finite amount of text data available online. As Musk put it, “The cumulative sum of human knowledge has been exhausted in AI training. That happened basically last year.”

This claim aligns with the views of other AI experts: Ilya Sutskever (co-founder of OpenAI) recently described the industry as reaching “peak data,” meaning we’ve tapped most of the easily available real-world data for training.

Sutskever noted that “we have but one internet,” comparing data to a finite resource like fossil fuel, and predicted that “pre-training as we know it will end” once this resource is fully used.

In practical terms, companies like OpenAI have already ingested much of the public web into models like GPT, so each new model has fewer new websites or books to learn from. One indicator: the largest LLM training sets have been growing exponentially (50%+ per year), whereas the internet’s total textual content grows only about 7% per year and is slowing down.
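
To make that arithmetic concrete, here is a small, illustrative projection in Python using only the two growth rates above. The starting token counts are placeholder assumptions chosen for the example, not measured figures, so treat the output as a shape rather than a forecast.

```python
# Illustrative projection: when does training-data demand outgrow the public web?
# The two growth rates come from the article; the starting stocks are assumed round numbers.
TOKENS_IN_LARGEST_TRAINING_SET = 15e12   # assumption: ~15 trillion tokens today
TOKENS_OF_USABLE_PUBLIC_TEXT = 300e12    # assumption: ~300 trillion tokens today
DEMAND_GROWTH = 1.50                     # training sets growing 50%+ per year
SUPPLY_GROWTH = 1.07                     # web text growing ~7% per year

demand, supply, years = TOKENS_IN_LARGEST_TRAINING_SET, TOKENS_OF_USABLE_PUBLIC_TEXT, 0
while demand < supply and years < 100:
    demand *= DEMAND_GROWTH
    supply *= SUPPLY_GROWTH
    years += 1

print(f"Under these assumptions, demand overtakes supply in roughly {years} years.")
```

Whatever the exact starting numbers, a quantity growing 50% per year overtakes one growing 7% per year within about a decade, which is why “peak data” warnings are being taken seriously.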

In conclusion, AI companies have indeed swept up unprecedented quantities of human-created text and media to train their models – so much so that they are now bumping against the limits of what’s readily available. Claims that we’ve “exhausted the sum of human knowledge” in AI training are somewhat hyperbolic, but they contain a kernel of truth: for certain domains (especially high-quality English text), we are running out of new data to ingest.

My next article will focus on Satellite data’s role in filling this gap.

Darin David

Growth Executive | Commercial Sales @Airbus Defence and Space | Business Development, Strategic Planning | Geospatial Intelligence, Infrastructure Asset Integrity | Driving Digitalization and Industry Transition

1 week

One recommendation to consider on the hypothesis: do not conflate parallel technological innovations. Indeed, the prevalence of Earth Observation and all other geo-data will fuel advancements in ML/deep learning to extract more patterns, trends, and insights at high granularity. LLMs and GenAI, in turn, will assist user interpretation and analysis of deep learning on geo-data, synthesizing and providing contextual awareness derived from published and/or proprietary geospatial IP. Access to run deep learning and foundation models on continuous geo-data like Earth Observation will not fill the gap of exhausted online and published textual content for building more capable LLMs. Rather, deep learning on geo-data will enhance our capacity for timely measurement and identification of variables, patterns, trends, and risks at scale, building specialized GeoAI that draws on textual, language, and image foundation models for actionable insight, predictive analysis, and automated processes that depend on location and spatial relationships.

John Metzger

CaaS / Earth Monitoring (EM) and Geomatics / New Business Program Development

1 week

"Exhausted Human Knowledge for Training" The models created by the limited minds of the model creators ... what do the models say about the other models .. can they actively compare and contrast across themselves -- humans can do that in a room in minutes, and often by intuition. Is there #Ai_intuition ? .. still a lot of work to do ......

Glenn Stowe

Co-Founder/Vice President @ CubeWerx Inc. | Consulting & Product Strategy, Geospatial, Earth Observation

1 week

Up to a point. Satellite data is very important in many domains, but it doesn't have the breadth of use covered by the types of data AI models have been trained on to date. I'd expand that hypothesis to say that all geospatial data are the new fuel. Particularly reality capture, 3D models, digital twins. Things closer to earth that affect more people in their daily lives. The problem of course is harmonizing all that information for AI training.
