Why can't Google Maps find grass?


The end of the pixel king


Fig 1. A screenshot of Google Maps where a search for “grass” returns “No results,” even though the grass is obvious in the satellite image.

Google Maps excels at providing detailed information on restaurants, live updates on public transport, and more. Yet it struggles to identify something as simple as grass, or an entire forest, which appears as a gray blob:

Fig 2. The same search on the normal map layer: the entire forest and the grass show up as gray blobs.

Of course Google Maps could create land-use maps like that. But then what about finding tulips? Or corals? And how do we keep them updated as the map updates? The point here is not about one or a few specific semantics; the point is being able to find at least the same semantics any human can instantly and easily see.

This is not about picking on Google Maps; our entire geospatial toolset suffers from similar limitations, which leave a significant gap in our understanding and use of Earth data.

> We’ve increased the amount of Earth data much faster than the amount of useful information... and with higher barriers to entry.

Imagine if we could easily locate trees, grass, deforestation, muddy roads, coral bleaching, floods, and more. The implications for climate change, sustainability, biodiversity, and environmental monitoring are profound. Despite the influx of data from satellites, planes, and drones, the tools to process this data have not advanced at the same pace, leaving a growing gap between data availability and useful information extraction.

The Promise of AI for Earth

“How many trees have been cut this year?”, “Have the crops changed over the years due to climate change?”, … We know we have the data to answer these questions, but it is locked in an ever larger pile of data that remains slow, expensive, and complex to work with. It’s like getting access to the Library of Alexandria, only to find out it’s all written in a dead language.

AI for Earth can bridge the gap. By leveraging advanced AI models, we can create extremely small summaries of images—called embeddings—that retain most of the information while being significantly smaller in size. For instance, our Clay model can reduce an image to an embedding of just 768 numbers, making it possible to process and analyze vast amounts of data quickly and efficiently. In our example, AI is the librarian that has read, classified, and summarized all the books, and is ready to hand you the right book, and the right page, every time, instantly.

Let's go down this rabbit hole.

>Reader’s note: This is a long, deep overview of embeddings in the context of AI for Earth, the work we do at our nonprofit Clay. I try to keep the headings and first paragraphs of each section as easy, fast, and nontechnical reads as possible, but I also go pretty deep into the most minute details and conceptual tools. I hope to give increasingly more technical information to help readers approach this new, amazing, and unexplored world of AI for Earth.

Images Are Deceptively Simple

Before diving deeper into embeddings, I want to make the point that looking at images carries a strong bias: the power of the human eye and brain is often taken for granted, which puts computer vision at a strong handicap.

We do not realize it, but when we look at an image we leverage millions of years of evolution, plus our own years of education. We lean on that, and in less than a second we can capture millions of possible semantics. We don't really understand how; this book is the closest attempt to understand it.

It also took millions of dollars to develop, fly, and operate digital cameras on planes (or satellites) and capture the signals our eyes can use, as if we were standing where the camera took the picture.

Look at these images:

Fig 3. Six random locations in California in high resolution, captured from planes by the NAIP program.

These are random locations across California. Any person can easily and immediately understand what’s in these images: pools in suburban settings, agricultural fields, a lake, ... Depending on your location, background, and experience, you might even identify the types of trees or the crops planted, and a million other things. The roots of geospatial and remote sensing lie in understanding images like these: geospatial is the “what is where”.

Imagine now that your job is to count all the trees in California. It’s easy enough to count or estimate the trees in a single image.

Fig 4. A random location in California from NAIP, as in Fig 3. It clearly shows countable trees.

If you want to count trees in the whole of California, you then have to do that "easy count" 20 million times; that's roughly the number of image chips that cover California with tiles like the one above. Obviously, we use computers, computer vision, and geospatial tools to do this. The best current approach without AI is to build a bespoke machine-learning tree detector based on circular shapes and colors, segment each image, and count the trees. And we've done this for decades, explicitly encoding exactly what we want.

To really understand a remote sensing image, you also need to understand the physics of what you are looking at (e.g. the way sediments appear tells you about river flows), what you are looking through (the effect of the atmosphere), and what you are looking with (the instrument, the optics, the sensor, ...). This is especially important with data from advanced, less common but very powerful instruments and techniques like SAR, hyperspectral, pansharpening, NDVI, ... Amazingly, our brains do, or can learn to, understand most of that intuitively or intellectually. It takes incredible effort to rebuild that capacity with computers, since we need to start from scratch. And we need to do it so we can scale this process up.

But it’s even worse. Imagine that after you finish, you also need to categorize types of crops. Or count swimming pools. Or trace roads. Or find lakes. Again, the human brain is deceptively quick to switch tasks, and impossible to scale efficiently. Computer vision, on the other hand, in most cases requires redoing the whole process from scratch with the same images. We might have some common tools, but computer vision is largely a "picture to outputs" pipeline.

> While images are innately easy to understand, computer vision needs entire dedicated pipelines for each output
Fig 5. Depiction of the typical geospatial pipeline from images to outputs. Usually these pipelines are largely adapted to specific inputs and outputs.

This is not only an intuitive duplication of effort; it also wastes time and resources and is more prone to errors.

This is precisely where embeddings offer a faster, cheaper, less redundant way to create outputs across pipelines.

Image Encoding

Now let’s look at those images the way a computer does. Images are made of pixels, each pixel has three bands (red, green, and blue), and each band value is digitally encoded using 8 bits, which allows for numbers from 0 to 255. In our case, each image is 256x256 (width and height) x 3 (red, green, blue); that’s roughly 200K numbers, and each value takes 1 byte, or 8 bits (from 0 to 255). So this image comes to ~200KB (or ~200K numbers, each anywhere from 0 to 255). There’s a lot you can do with that many numbers. What if I told you we can reduce any image to ~1KB (a thousand numbers) yet retain most of the information? Look what happens if we force one of our images down to less than 4KB (1KB was impossible).

Fig 6. The same image as Fig 4, downscaled to 4KB from ~200KB.

There really isn't much one can do with that image.

Yet the answer to finding grass, or pretty much anything else in that image, is a new ability to create extreme summaries of the image using AI. They are so efficient that, in the case of Clay, they are just 768 numbers. These summaries are called “embeddings”, and they are the new kings of geospatial.

Here's the catch: embeddings are not an image but a list of numbers. An embedding literally looks like this:

-0.13493, 0.03277, 0.12231, 0.02537, 0.09245, 0.12471, 0.07522, 0.21141, -0.02121, -0.05623,...

Despite being roughly 0.4% of the size, embeddings let AI tools retrieve results very similar to those obtained from the full images. Here are some examples from tests we've run with Clay: more than 90% agreement on biomass estimates using embeddings instead of full images, more than 90% agreement on land cover maps, and detection of more than 90% of aquaculture locations.
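As a quick sanity check on that ratio, here is the back-of-the-envelope arithmetic, a minimal sketch assuming the 256x256x3 chips and 768-number embeddings described above:

```python
# Back-of-the-envelope size comparison (assumptions: a 256x256 RGB chip with
# one byte per band value, and a 768-number Clay v1 embedding).
width, height, bands = 256, 256, 3
raw_values = width * height * bands            # 196,608 values, ~192 KB at 1 byte each

embedding_dim = 768
ratio = embedding_dim / raw_values             # ~0.004, i.e. roughly 0.4%

print(f"raw chip: {raw_values:,} values (~{raw_values / 1024:.0f} KB)")
print(f"embedding: {embedding_dim} numbers ({100 * ratio:.2f}% as many numbers)")
```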

But there's even more: it takes hundreds, thousands, even tens of thousands of times less time to recreate these outputs from embeddings than from the full images.

At Clay we released an AI model to do this: Clay Model v1, which to our knowledge is both the largest training run and the only model that is global, instrument agnostic, and fully open. We also just released a demo for anyone to click around and do semantic search.

https://explore.madewithclay.org/

Embeddings are too promising not to learn about them.

One way to think of embeddings is that they are highly abstracted semantic summaries of the images, in mathematical terms. Images are pixels, and only the interpretation of those values and patterns defines semantics. Embeddings encode those abstractions directly, as numbers. This means they already embed much of the computation one would otherwise need to detect things, which is partly why retrieving and computing semantics with embeddings is extremely fast.

Fig 8. Comparison to Fig 5. When using foundational AI models like Clay v1 and their embeddings, we can reuse the model and switch inputs and outputs much more easily, all with the same model or even the same embeddings.

They are also extremely new, especially in geospatial, so we don't really understand them well yet: how to work with them, or how to create the best ones. But it's already patently clear that they encode highly abstracted semantics at a very small fraction of the size.

That's why I'm writing this article: to help myself better understand embeddings, and to bring others along in figuring out if and how to use them.

Where do embeddings come from?

Embeddings are not the point of an AI for Earth model, but their utility and modular nature have drawn a lot of attention to them as assets in their own right. Embeddings are the highly abstracted summaries of the input data, the narrow neck the model uses to perform the task it is given. These models tend to have a "U" shape, with a wide input and output and a narrow middle that serves as the choke point forcing the model to learn by abstracting the most useful semantics.

Fig 9. Sketch of a foundational AI model like Clay, which has a mirror structure: an encoder going from a large input image down to a small embedding, and then back up on the decoder side to recreate the input image and compare the output. The difference between input and output gives the model its task to learn.

The AI model takes an image of some size (say 512x512 pixels in width and height) with several bands (say red, green, and blue). For each band we typically have 8 bits, so a number from 0 (black) to 255 (white). In our case that's 256x256x3 ≈ 200K dimensions in total. At the end of the encoding process we will have figured out how to summarize the entire image into just 768 dimensions. That's quite impressive: roughly 0.4% of the size, yet it contains most of the information. It is even more impressive with 13 bands, as in Sentinel-2 satellite images, where the ratio drops below 0.1%.

Fig 10. When processing an input image, Transformer models like Clay split it into small chunks. Embeddings are made at this level, for each chunk, looking both at the content of the chunk and at all other chunks and their relative positions.

The image gets split into chunks, of size 8x8 in the case of Clay. These image chunks are the units of the embeddings: we actually create an embedding for each chunk and then average them all to create the final embedding for the whole image. Why the average? There is no hard rule, and we can certainly improve on this aspect; it seems very crude to me to just average them all.
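A minimal sketch of that pooling step, assuming the encoder has already produced one 768-dimensional embedding per 8x8-pixel chunk of a 256x256 chip (the array here is a random stand-in, not Clay's actual output):

```python
import numpy as np

patch_size = 8
n_patches = (256 // patch_size) ** 2            # 1,024 chunks per 256x256 chip
dim = 768

# Stand-in for the per-chunk embeddings the encoder would produce.
patch_embeddings = np.random.randn(n_patches, dim).astype(np.float32)

# The whole-image embedding is simply the mean over all chunk embeddings.
image_embedding = patch_embeddings.mean(axis=0)
print(image_embedding.shape)                    # (768,)
```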

But why does averaging work at all? Because a Transformer-based model learns to embed in each chunk not only the semantics within it, but also the context of all the chunks around it and their relative positions (this is called "self-attention", and the idea proved so powerful that the paper introducing it is called "Attention Is All You Need"). This means the embedding of a patch will also include semantics from outside itself. That makes it really powerful for some applications, but also confusing when you only care about what’s within a specific chunk.

After the embedding, there is usually a decoder that mirrors the encoder and brings the embedding back up into a reconstruction of the input image. The difference between that reconstruction and the input is literally the "loss" to minimize. The model looks at which changes help the loss go down, and slowly updates its millions of parameters to make this loss as small as possible.

After pretraining, you can also replace the decoder with another architecture whose output is, for example, the amount of biomass in the image (a regression problem), or the land cover class (a segmentation problem), … Because you already have an encoder (or embeddings), these decoders are much lighter, faster, and more flexible than traditional methods, where each output requires building a whole pipeline starting from the input image.
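For intuition, here is a minimal sketch of how light such a downstream "decoder" can be once embeddings exist; the arrays and the Ridge regression are placeholders for illustration, not Clay's actual downstream heads:

```python
import numpy as np
from sklearn.linear_model import Ridge

# Placeholder data: 1,000 image embeddings and fake biomass targets.
# In practice X would come from the Clay encoder and y from field measurements.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 768)).astype(np.float32)
y = rng.uniform(0, 300, size=1000)              # e.g. biomass in t/ha

# A simple linear head trained on embeddings replaces a whole bespoke pipeline.
head = Ridge(alpha=1.0).fit(X[:800], y[:800])
print("held-out R^2:", head.score(X[800:], y[800:]))
```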


How are Earth Semantics learned?

I believe Earth semantics are learned into embeddings through 3 main mechanisms:

  1. The actual value of the pixels: In text, embeddings literally start as random numbers. In vision transformers like Clay and other Transformer-based models, embeddings start with a linear projection of the actual pixel values. Hence these embeddings are rooted on the ground, not floating around without anchors. I think this means Earth embeddings are much more similar across model runs than text embeddings are.
  2. The context around them: This is exactly the same mechanism as in all other Transformer-based models. The value of an embedding depends on the values of the embeddings around it, and on their relative positions. In our case the context is strictly limited to the size of the image. This means that the unit of embedding is the image; patches see other patches, but only within the same artificially tiled bounds of the image. They cannot usually see patches beyond their image, even though Earth is continuous. The only way such a model learns across images is through the latitude and longitude metadata.
  3. Masking: The task we ask the model to solve is to reconstruct an input image after compressing it into a much smaller embedding, but to make the task harder and the learning-by-context stronger, we actually mask out up to 70% of the image, so the model needs to extrapolate the missing parts with access to only 30% of the image (see the sketch after this list). This is easy in some cases (like deserts), very relevant in others (a highway crossing the image), and impossible in some (an isolated boat in the water).
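Here is a minimal sketch of the masking idea from item 3, assuming 8x8-pixel patches over a 256x256 chip and a 70% mask ratio (the exact numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches = (256 // 8) ** 2                     # patches per 256x256 chip
mask_ratio = 0.70

n_masked = int(mask_ratio * n_patches)
masked = rng.choice(n_patches, size=n_masked, replace=False)
visible = np.setdiff1d(np.arange(n_patches), masked)

# The encoder only sees the `visible` patches; the decoder must reconstruct
# the `masked` ones, which is what forces the model to learn context.
print(f"{len(visible)} visible patches, {n_masked} patches to reconstruct")
```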

The way the model learns is also affected by other factors, for example how many images we use before allowing the model to update the way it makes its guesses ("backpropagation of model weights") to achieve high scores on our task (with stochastic gradient descent). If we update the model with every image, the learning will be very noisy and bumpy, trying to learn from all errors, even those from very rare cases. If we update the model after averaging the errors over too many examples, we will improve very smoothly, but miss many opportunities to pay attention to rarer but still important examples.

Embeddings, embeddings, embeddings

I've often said lately that bringing AI to geospatial, "geoAI" or "AI for Earth", is the end of pixels. Of course we will always use pixels, since images are really powerful for telling stories, but with AI comes the power of embeddings, with advantages so obvious that we must at least seriously consider them.

But how can we try to understand embeddings? It's impossible to imagine a vector of 768 dimensions. So let's take a bunch of them and see if we can get some intuition by looking at how groups of embeddings behave. For one of our tests at Clay we created embeddings for the entire state of California at ~1m resolution, in little tiles of 256x256 pixels (the ones at the top of this post). That's 20,851,044 tiles. Let's see what we can learn from having ~20M embeddings.

If we plot the length (norm) of each vector, we see that the vast majority of them have roughly the same length, around 3.45.
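Computing that histogram is a one-liner once the embeddings are stacked into an array; a minimal sketch, assuming a hypothetical file holding the California embeddings as an (N, 768) array:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical file with the (N, 768) California embeddings stacked row-wise.
embeddings = np.load("california_embeddings.npy")

norms = np.linalg.norm(embeddings, axis=1)      # one length per embedding
plt.hist(norms, bins=200)
plt.xlabel("embedding norm")
plt.ylabel("number of chips")
plt.show()
```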

Fig 11. Histogram of the length (norm) of all ~20M embedding vectors representing California, encoded with Clay v1. Most embeddings have a length of around 3.45.

Quite literally most embeddings have the same length, even though they have 768 dimensions to play with. Let's see what the extremes look like.

These are particularly short embeddings:

Fig 12. Six random examples of Clay v1 embeddings of California whose length is less than 3.1. There is no obvious common pattern across them.

Note: The really short ones (norm < 3.1) are actually chips with errors where most of the image is black (edges of the source raster). That is also the source of the little bump around 3.3 in the histogram.

There's no common pattern here.

And the longest ones:

Fig 13. Six random examples of Clay v1 embeddings of California whose length is more than 3.6. There is an obvious common water pattern across them all.

Long vectors are clearly mostly water. Since water doesn't seem particularly hard to describe, it seems safe to assume that the length of the vector is not that relevant, especially when the vast majority of the vectors have the same length. It turns out this is on purpose, since normalizing all vectors to roughly the same length has many computational advantages; for example, computers struggle when they need to divide by very small numbers.

The implication is that, when working with embeddings, angles between vectors are much more relevant than “straight” distances like the Euclidean distance. These “flat” distances might be useful when working very locally, but since the embedding space has such a well defined overall shape, using them at global scale tells you more about the topology than about the semantics. In other words, you don’t measure the distance between Boston and Madrid through the Earth; you measure it along the surface.
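In code, the distinction is simply which formula you rank neighbours by; a minimal sketch with two toy 3-dimensional vectors:

```python
import numpy as np

a = np.array([3.4, 0.1, 0.2])
b = np.array([3.3, 0.4, 0.1])

euclidean = np.linalg.norm(a - b)                          # "straight" distance
cosine = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))   # angle-based similarity

# When all vectors share roughly the same norm, ranking by cosine similarity
# ranks by angle, which is the meaningful comparison on the "crust".
print(euclidean, cosine)
```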

We can imagine the embedding space roughly as a hollow sphere (in 768 dimensions, not just 3) with a radius ~3.45 and a populated crust of less than 1/10 of the radius (~0.3). On that crust, somehow, we have all the semantics, like clusters of dots making strange patterns.

Fig 14. Sketch of the topology of the 768-dimensional embeddings, all with virtually the same length, as a 3D hollow sphere.

On that “crust” all our embeddings are grouped by similarity, and these groups are themselves arranged by similarity. It’s easy to imagine a cluster of, say, only water, another of only land, and a stream of dots in between with more and more coast… but one can also imagine islands, or pools, … where would we put them? I think this is where having 768 dimensions really helps the model find as many directions as needed to build separate links across semantics. But it’s not an easy intuition. Are these dots all packed together? Are they spread across the entire crust? I imagine that distributing the dots across the entire space allows for even more ways to encode relationships of similarity (even in 2D there are infinite directions from which to approach a dot). So, for example, is the whole sphere surface populated? One way to check is to reduce the dimensions, with PCA (which retains the most data variability in the fewest dimensions), t-SNE (which keeps distances between points while reducing dimensions), or UMAP (similar to t-SNE, but it tries to keep the global structure more intact).
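A minimal sketch of those three reductions on a random 1% sample, assuming the same hypothetical embeddings file as before (UMAP needs the separate umap-learn package):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
# import umap  # from the umap-learn package

embeddings = np.load("california_embeddings.npy")          # hypothetical (N, 768)
rng = np.random.default_rng(0)
sample = embeddings[rng.choice(len(embeddings), size=len(embeddings) // 100,
                               replace=False)]

pca_3d = PCA(n_components=3).fit_transform(sample)                # keeps most variance
tsne_3d = TSNE(n_components=3, init="pca").fit_transform(sample)  # keeps local distances
# umap_3d = umap.UMAP(n_components=3).fit_transform(sample)       # keeps global structure
```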

Fig 15. Left: Random 1% sample of all California embeddings with 768 dimensions, reduced via Principal Component Analysis (PCA). PCA1 is the x axis, PCA2 the y axis, and color is PCA3. PCA reduces dimensions by choosing axes that keep the most variance. Right: Histogram of the first PCA components.

The way I read this PCA (and the histogram on the right) is that the first two components, PCA1 and PCA2, are clearly bounded for most of the space. This aligns with the hollow sphere hypothesis. We also see that the left part of the blob is more opaque (the blue histogram peaks around -1), hence denser, which would mean the crust of that hollow sphere is denser in some places. We also have a weak component at high PCA2: maybe something besides the sphere that doesn't follow the same pattern. In fact it seems to belong to a different distribution, since PCA3 (color) aligns with it more than with the blob that makes up most of PCA1 and PCA2.


Fig 16. Left: Random 1% sample of all California embeddings with 768 dimensions, reduced via t-distributed stochastic neighbor embedding (t-SNE). t-SNE1 is the x axis, t-SNE2 the y axis, and color is t-SNE3. t-SNE reduces dimensions aiming to keep the relative distances between points. Right: Histogram of the t-SNE components.

This t-SNE graph also aligns with the PCA analysis. Because t-SNE works by trying to keep the distances between points, it also tells us that the embedding space is extremely rich in concepts, with many small clusters, which sometimes group into clusters of clusters, and with a few very strong compact clusters, which I assume are water tiles or ice. The more distributed but still compact clusters are probably urban or similar semantics that are much more differentiated than the rest of the land.

Fig 17. Left: Random 1% sample of all California embeddings with 768 dimensions, reduced via Uniform Manifold Approximation and Projection (UMAP). UMAP1 is the x axis, UMAP2 the y axis, and color is UMAP3. UMAP is similar to t-SNE but aims to keep the global structure more intact, not just individual relative distances. Right: Histogram of the UMAP components.

UMAP is like t-SNE but gives more weight to retaining large-scale structure. A bundle of bounded dots seems to confirm the hollow sphere, plus a separate distribution with rich internal clusters of semantics. There are also some isolated, very separated semantics, which again I would imagine are blank images, water, snow, or similar near-identical clones of empty semantics.

Dimensionality reduction seems to be a good way to build intuition. Let’s pull the images from, for example, that stream at the top of the PCA.

Fig 18. Top left, top right, and bottom left: PCA, t-SNE, and UMAP scatterplots respectively, all with a red box selecting the same set of images. Bottom right: 9 random images from the selection depicted by the red box.

We defined a region on the PCA scatter, calculated the corresponding bounds on t-SNE and UMAP, and pulled 9 random example images from within it. This cluster clearly seems to be water, and it is clearly differentiated in all three dimensionality reductions. That makes sense: water is very distinct. It also makes sense that in all three reductions there is a stream toward the rest of the main pack; these would be coasts, lakes, and other images with some water.

Let’s take another example and pick the two dots on the middle right of the t-SNE:

Fig 19. Top left, top right, and bottom left: PCA, t-SNE, and UMAP scatterplots respectively, all with a red box selecting the same set of images. Bottom right: 9 random images from the selection depicted by the red box.

These are clearly agricultural plots, with roads. There are two dots, which I checked and which seem to correspond to images of agricultural plots with and without roads or paths. We also see that what forms a tight semantic cluster on t-SNE corresponds to a wide region on UMAP and PCA. This makes sense: imagine organizing books by year, genre, author, or cover color… a tight group of books in one classification might be widely distributed in another.

Let’s look for an expected semantic, like urban areas. I took the image of an urban location and plotted its 8 closest neighbors.
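Under the hood this is just a cosine-similarity lookup; a minimal sketch, again assuming the hypothetical embeddings array and an arbitrary chip index as the query:

```python
import numpy as np

embeddings = np.load("california_embeddings.npy")        # hypothetical (N, 768)
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

query_idx = 123_456                                       # an urban chip, say
scores = unit @ unit[query_idx]                           # cosine similarity to all chips
neighbours = np.argsort(-scores)[1:9]                     # 8 closest, skipping the query
print(neighbours)
```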

Fig 20. Top left, top right, and bottom left: PCA, t-SNE, and UMAP scatterplots respectively, all with a red cross marking the chosen image from the grid on the bottom-right panel. Bottom right: the top-left tile is the chosen image, marked by the red cross on the scatterplots; the other 8 are its closest images by cosine similarity across the entire 20M dataset.

This semantic seems to sit in the middle of the pack, which might make sense given how many ways an urban image can vary. In fact all 9 examples are uncannily similar, with diagonal roads and big and small buildings… Remember that the embedding has 768 dimensions, so somewhere in there you might find trees, parked red cars, or zebra crossings…

We can plot the same location, but zoom in on the scatter plot:

Fig 21. Same as Fig 20, with the axes zoomed in around the selected location.

Again we see that some dimensionality reductions make the clustering much less obvious, here PCA or t-SNE, but on UMAP we can clearly see a dense cluster, which might represent this type of urban setting. Because there are 768 dimensions, each with many possible values, there are lots of ways to articulate similarities across semantics. Dimensionality reduction is a crude, forced attempt to make a continuous semantic space fit a much more limited range of options.

But some semantics reduce much better than others, probably because of their limited variability, like solar panels:

Fig 22. Top left, top right, and bottom left: PCA, t-SNE, and UMAP scatterplots respectively, all with a red cross marking the chosen image from the grid on the bottom-right panel. Bottom right: the top-left tile is the chosen image, marked by the red cross on the scatterplots; the other 8 are its closest images by cosine similarity across the entire 20M dataset.

Solar panels are clearly isolated on UMAP, but not on PCA or t-SNE. I wonder what semantics sit close to solar panels in the other reductions. I’m also glad that on UMAP this concept is clearly isolated. Let’s zoom in:

Fig 23. Top left, top right, and bottom left: PCA, t-SNE, and UMAP scatterplots respectively, all with a red cross marking the chosen image from the grid on the bottom-right panel. Bottom right: the top-left tile is the chosen image, marked by the red cross on the scatterplots; the other 8 are its closest images by cosine similarity across the entire 20M dataset.

At close range, solar panels are indeed alone on UMAP, and also somewhat differentiated on t-SNE. We can only imagine what all this looks like in the full 768 dimensions, but this is a great example showing that these semantics are computable. We clearly need only one, or a few, examples to define a bounded concept (especially in UMAP) that finds solar panels. We could tag this cluster of dots as “solar panels”. We could even easily count how many there are, and therefore find and quantify all the solar panels in California.

This is the dream we talk about: a method to index abstract concepts. If Google Maps had these tagged embeddings, it would be extremely easy to find solar panels, or any other cluster. We are then limited only by our capacity to find and label these clusters. Or, if we don’t want to label them, we can just give the system a few examples to find similar ones.

Naming the semantic clusters

So far we've gone from images to embeddings of semantics. We still cannot search for the word "grass", even though we now know that the semantics of "grass" are encoded in the embeddings. In practice, embeddings are written in mathematics, not human language. This is not a problem, since there are several approaches we can take to bridge embeddings and text. One we've tried, with success in some cases, is literally forcing them to align: for each image of Earth we pull the information from a normal map (we use OpenStreetMap), things like "road here", "house there", "lake here". We then take the embedding of the image and build a text encoder that produces an embedding of that description. We then ask the text encoder to learn to tweak its text embedding until it matches the embedding of the image. Thus we can go from image to text and vice versa. We can even take an image, compute its image embedding, find the closest text embedding that describes it, and then find other images whose descriptions yield the closest text embeddings. A bit of a roundabout, but essentially a similarity search based on map descriptions.
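A very rough sketch of that alignment idea, keeping the image embeddings fixed and training a small text encoder toward them. Everything here (vocabulary size, pooling, loss) is an assumption for illustration, not Clay's actual training code:

```python
import torch
import torch.nn as nn

class TextEncoder(nn.Module):
    """Toy text encoder: token embeddings, mean pooling, projection to 768-d."""
    def __init__(self, vocab_size=30_000, dim=768):
        super().__init__()
        self.tokens = nn.Embedding(vocab_size, dim)
        self.project = nn.Linear(dim, dim)

    def forward(self, token_ids):                  # token_ids: (batch, length)
        pooled = self.tokens(token_ids).mean(dim=1)
        return self.project(pooled)                # (batch, 768)

text_encoder = TextEncoder()
optimizer = torch.optim.Adam(text_encoder.parameters(), lr=1e-4)

def alignment_step(token_ids, image_embeddings):
    """Pull each map-description embedding toward its paired (frozen) image embedding."""
    text_embeddings = text_encoder(token_ids)
    loss = 1 - nn.functional.cosine_similarity(text_embeddings, image_embeddings).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```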

Computable semantics

How do we find similar images? As we’ve seen above, 768 dimensions are far more than we can build an intuition of clusters for, let alone relationships between clusters. It is therefore hard even to conceptualize how to operate with them. Is the average embedding of water and desert a beach? It seemed so, and when I check, it is, but why? Why is the midpoint of those semantics the expected one? I don’t know. I suspect that in this case it is not a new concept but rather both concepts present in the same image, just as the midpoint of tree and parking lot is a parking lot with trees. But why isn’t it something else completely random? It’s like walking from an extremely poor village toward a very rich one and expecting to pass through the suburbs. It seems too good, and unpredictable.
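The experiment itself is easy to reproduce once you have the embeddings; a minimal sketch, where the two chip indices are placeholders for a hand-picked water chip and a hand-picked desert chip:

```python
import numpy as np

embeddings = np.load("california_embeddings.npy")        # hypothetical (N, 768)
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

water_idx, desert_idx = 111, 222                          # placeholder chip indices

midpoint = (unit[water_idx] + unit[desert_idx]) / 2
midpoint /= np.linalg.norm(midpoint)

# Which chips sit closest to the midpoint of "water" and "desert"?
closest = np.argsort(-(unit @ midpoint))[:9]
print(closest)
```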

What's even crazier is that these semantics operate with highly abstracted concepts. We can retrieve land cover classes, find floods, or estimate biomass within the image... We are still very early in understanding Earth embeddings.

I believe part of the challenge in understanding Earth semantics is that they inherit known properties of other types of embeddings (like text embeddings) but are also unique in other ways. One of them is what I call polysemy versus semantic colocation:

Polysemy vs semantic colocation

One of the biggest differences between text embeddings and Earth embeddings is how we deal with embeddings of concepts that must contain several different meanings.

In text, the word "bank" could mean the place where you keep your money, the side of a river, or a group of fish. A word with many meanings (polysemy) is really common. In our case, the embedding of the word "bank" needs to encode all those meanings. The models we talk about here, Transformers, are literally trained to look at the context of the word ("self-attention"). This means that while the word itself needs to encode different meanings, the context in each case defines which of them applies. This makes the task simpler. My intuition here is that the embedding can locate the different meanings in different places within the embedding vector, so when the model performs self-attention it just zeroes out the irrelevant parts. But if you take the word embedding on its own, it carries all those meanings, so if you operate with embeddings directly, without the model, you'll need to deal with many meanings. It's hard to imagine the many dimensions of an embedding, but my intuition is that it's easy to encode them along different axes: if our embeddings had 3 dimensions, we could dedicate one to each meaning. This makes it easy to encode similarities to all the related words in all directions. The bottom line is that this problem is well known and there are many ways to deal with it.

But on Earth data, we have a different problem, and I think we have not yet figured out how to solve this. Let's take this example:

Fig 24. Four examples of houses in different contexts, as seen in high-resolution satellite images.

All four images contain the semantic "house" in different contexts (desert in California, crops in France, soil in Mongolia, and water in the Maldives). We can split images, and in fact the model does, but we will never have a unit that contains just the concept "house" and carries the core of that concept (with or without different meanings). With Earth data we have both absolute anchors in the actual pixels of a location and relative anchors in the information around it. We never have isolated "words", or tokens; it is always patterns and their surroundings. From that, the model must learn the concepts of "house", "water", "crops", ... The semantics of Earth images are more deeply rooted in both pixels and context than the semantics of text, where words can live isolated in the abstract; in fact we split text by them (or by sub-words, tokens). Moreover, I believe this colocation has very small variability: most things tend to be close to only a few other things, e.g. houses and roads, not houses and corals. This makes learning to reconstruct Earth locations from embeddings easy, but isolating Earth semantics more difficult.

Let's consider the case where we want to find "houses". If we pick one image with a house and look at which other images are closest, we might also get images with "desert" in them. If I use all the examples except the Mongolian yurt in the bottom right, we might average out the surroundings of the house, but we will also reinforce the idea that houses are only squares, and we'll miss the circular yurt. In essence, I believe it is hard to define semantics precisely in Earth data, and that this is fundamentally different from text.

One approach we follow is to search with both positive and negative examples. If we take an image of a house surrounded by grass as a positive example, and then give an image of only grass as a negative example, we get much closer to the concept of a house, without reinforcing the specific houses in the examples.

> A negative example in embeddings means staying as far away as possible from that point, just as a positive example means staying as close as possible. We must be careful to remember that a negative example is not an example of the opposite concept (if such a thing exists); embeddings of opposite concepts are not necessarily in opposite locations in the embedding space. A person might consider water and desert opposites, but in the embedding space they might actually be close to each other. It is also worth noting that embeddings cannot encode negative concepts (e.g. "not a house"); embeddings are abstractions of the data, and therefore cannot encode the specific ways data might be missing.
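A minimal sketch of searching with positive and negative examples, using placeholder chip indices for the hand-picked examples; the simple "mean of positives minus mean of negatives" query is one possible formulation for illustration, not necessarily the one we ship:

```python
import numpy as np

embeddings = np.load("california_embeddings.npy")        # hypothetical (N, 768)
unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

positive_idx = [1001]            # e.g. a house surrounded by grass
negative_idx = [2002]            # e.g. only grass

query = unit[positive_idx].mean(axis=0) - unit[negative_idx].mean(axis=0)
query /= np.linalg.norm(query)

# Chips that look like the positives and unlike the negatives.
hits = np.argsort(-(unit @ query))[:20]
print(hits)
```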

Because of this, a while ago I tried to increase the quality of our similarity search with what I called "semantic pruning": basically, use the few available examples of a semantic to find out which dimensions of the embedding matter most for that semantic, and drop the rest. In theory, similarity searches over fewer dimensions would be faster and cleaner. It's quite simple to do: I took the few examples I had and fit a Random Forest classifier (this method picks random dimensions and random thresholds to divide the data into ever smaller buckets, and the answer is the set of random choices that yields the most accurate buckets with the right labels). This method also tells you which dimensions are most important ("feature importance", i.e. which splits divide the data most accurately toward the labels). Since Random Forest is very fast, we can filter out dimensions after every new example and repeat the process. Long story short, it yielded no improvements in overall speed or quality.
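For reference, here is a minimal sketch of that pruning experiment; the labelled arrays are placeholders, and keeping the top 64 dimensions is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Placeholder labelled subset: embeddings plus 0/1 labels for one semantic.
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 768)).astype(np.float32)
y = rng.integers(0, 2, size=5000)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Keep only the dimensions the forest found most useful for this semantic,
# then run the (now narrower) similarity search on the pruned vectors.
keep = np.argsort(-forest.feature_importances_)[:64]
X_pruned = X[:, keep]
print(X_pruned.shape)                                     # (5000, 64)
```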

We know Earth embeddings are extremely useful, and we also know we don’t yet know how to work with them well.

There is no road ahead, explorers needed

We have the tools, the data, and the demand to fundamentally disrupt and improve what we know about what is happening where on Earth, and how we know it. This approach can capture very nuanced semantics, extremely fast, cheaply, and openly. But the process is still very new, poorly understood, not yet robust, and extremely different from the well tried and tested tools we use today. As I hope this article shows, the first results are just too promising not to explore further.

But there is no road ahead; everything is new, and we are building as we go. What we see is promising, but we also see clear gaps, like working with semantic colocation, or how to tweak the models to get the most out of the embeddings. And we know there are many unexplored directions.

To me, it’s clear that a future where we leverage these tools is a much better one, but it is also clear that it won’t happen easily. For one, there are not many people with skills in both AI and geo to build, or even travel, this path. My main intention here is to shed some light on the challenges and opportunities of geoAI. The AI ecosystem is mostly focused on text applications, and some on generating images; Earth AI is different, and we need different tools.

Moreover, we believe the basis of this technology and its services should be a public asset, not a for-profit one. We do see tremendous potential for commercial services at many points of this stack, and in the applications, but we believe the fastest path to grow the field, and to generate profit and impact from the market around it, is to create a common base that seeds the field.

AI for Earth is like clay: able to become so many things, ready to be shaped. That’s why we chose that name for our NGO…

Are you ready to make Clay?

https://madewithclay.org/
