Applying language models to satellite imagery for fast open-vocabulary detection
Jeff Faudi
Solving Earth challenges with Satellite Imagery and Deep Learning
After a refreshing summer break, I’m excited to return with a new blog post about real-time object detection on satellite images. This time, I’m diving into the world of Pleiades satellite imagery, the Fused platform, GroundingDINO, and Modal.com.
Today, foundation models and large language models are at the forefront of technological discussions, especially since tools like ChatGPT shook up the industry. The idea is to take a model pre-trained on vast amounts of data and either use it as-is or fine-tune it for a specific purpose. This approach is often more efficient than building a model from scratch.
For this project, I’m building on a foundation model known as DINO. A word of caution on the name: the DINO in question is the Transformer-based object detector (DETR with Improved DeNoising Anchor Boxes) from the International Digital Economy Academy (IDEA), not the self-supervised vision model of the same name from Facebook AI. As a detector, DINO localizes objects precisely, but only for the closed set of categories it was trained on, so it cannot be asked to find arbitrary objects.
To address this, the IDEA team developed a framework called GroundingDINO. GroundingDINO extends the DINO detector with grounded pre-training, which allows it to detect arbitrary objects based on textual inputs, such as category names or descriptions. The framework pairs DINO with a text encoder in a Transformer-based architecture that tightly fuses the visual and language modalities, so it returns bounding boxes for whatever the prompt describes. GroundingDINO is open-source and is provided under the Apache 2.0 license.
The core of GroundingDINO’s functionality lies in its innovative approach to fusing features from both image and text modalities. It operates by dividing the traditional object detection model into three key phases: the neck (feature enhancement), query initialization, and the head (region refinement). By introducing language-guided query selection and cross-modality decoders, GroundingDINO effectively combines visual features with text-based prompts. This allows for a more generalized and versatile object detection process, which is particularly useful in open-set scenarios where the categories of objects might not have been encountered during training.
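To make the language-guided query selection step more concrete, here is a minimal pure-Python sketch. It assumes every image token and every text token has already been embedded in the same feature space. The function name, the toy vectors and `num_queries` are illustrative only; the real model performs the equivalent operation on batched tensors.

```python
from typing import List, Sequence


def select_queries(
    image_feats: Sequence[Sequence[float]],
    text_feats: Sequence[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most relevant to the text prompt.

    Each image token is scored by its highest dot-product similarity to any
    text token; the top `num_queries` tokens are kept as decoder queries.
    """
    scores = []
    for img_vec in image_feats:
        # Similarity of this image token to every text token.
        sims = [
            sum(a * b for a, b in zip(img_vec, txt_vec))
            for txt_vec in text_feats
        ]
        scores.append(max(sims))
    # Rank token indices by descending score and keep the best ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: 4 image tokens, 2 text tokens, 2-dimensional features.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.1, 0.1], [0.9, 0.8]]
text_feats = [[1.0, 0.0], [0.0, 0.5]]
selected = select_queries(image_feats, text_feats, num_queries=2)
```

The selected image tokens are the ones the prompt "cares about" most, which is why changing the text changes where the detector looks.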
DINO, and by extension GroundingDINO, was originally trained on standard photographs, but probably also on some satellite images. So provided the resolution is high enough for common objects to be recognizable, GroundingDINO will do a decent job.
Check out this app for more examples: https://dl4eo-open-detection-optical-satellite.hf.space/. This demo uses GroundingDINO without any fine-tuning. You can try out different scenarios here and upload your own images. The model requires both an image and a prompt specifying what you want to find in the image. Depending on the object you’re searching for, the results may vary in accuracy.
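For readers who want to script this themselves, two small conventions matter when calling the open-source GroundingDINO inference utilities: categories are usually passed as one lowercase caption with names separated by periods, and boxes come back as normalized (center x, center y, width, height) values. The helpers below are a sketch under those assumptions; their names are hypothetical and they are not part of the GroundingDINO package.

```python
from typing import Iterable, Sequence, Tuple


def build_caption(categories: Iterable[str]) -> str:
    """Join category names into one lowercase caption, e.g. 'car . tree .'."""
    cleaned = [c.strip().lower() for c in categories if c.strip()]
    if not cleaned:
        raise ValueError("at least one non-empty category is required")
    return " . ".join(cleaned) + " ."


def cxcywh_to_xyxy(
    box: Sequence[float], width: int, height: int
) -> Tuple[float, float, float, float]:
    """Convert a normalized (cx, cy, w, h) box to pixel (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    x_min = (cx - w / 2.0) * width
    y_min = (cy - h / 2.0) * height
    x_max = (cx + w / 2.0) * width
    y_max = (cy + h / 2.0) * height
    return x_min, y_min, x_max, y_max


caption = build_caption(["Car", " tree "])
pixel_box = cxcywh_to_xyxy((0.5, 0.5, 0.2, 0.4), width=1000, height=500)
```

Keeping the caption short and concrete tends to matter more than the exact wording, which is consistent with the varying accuracy mentioned above.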
Building on the same architecture as in my previous posts, I successfully deployed a GroundingDINO model on a Modal.com endpoint. This setup provides an API that accepts an image and a prompt, then returns a list of bounding boxes with corresponding object classes.
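A client for an endpoint of this kind could look like the sketch below, which uses only the Python standard library. The URL and the JSON field names (`image`, `prompt`, `boxes`, `labels`, `scores`) are placeholders chosen for illustration; they are assumptions, not the actual contract of the deployed endpoint.

```python
import base64
import json
import urllib.request
from typing import Dict, List

# Placeholder URL: substitute the address of your own deployment.
ENDPOINT_URL = "https://example.invalid/detect"


def build_payload(image_bytes: bytes, prompt: str) -> bytes:
    """Encode the image as base64 and wrap it with the prompt in a JSON body."""
    body = {
        "image": base64.b64encode(image_bytes).decode("ascii"),
        "prompt": prompt,
    }
    return json.dumps(body).encode("utf-8")


def parse_detections(response_body: bytes) -> List[Dict]:
    """Turn the (assumed) response format into a list of detections."""
    data = json.loads(response_body)
    return [
        {"box": box, "label": label, "score": score}
        for box, label, score in zip(
            data["boxes"], data["labels"], data["scores"]
        )
    ]


def detect(image_bytes: bytes, prompt: str, url: str = ENDPOINT_URL) -> List[Dict]:
    """POST the image and prompt, then parse the returned bounding boxes."""
    request = urllib.request.Request(
        url,
        data=build_payload(image_bytes, prompt),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return parse_detections(response.read())
```

Splitting payload construction and response parsing from the network call keeps both halves easy to test without a live GPU endpoint.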
Now, let’s move on to a larger satellite image. To demonstrate the use of this Open Set Detector on satellite imagery, I selected a large CNES Pleiades satellite image of Hong Kong with a 50 cm resolution. The image also features a slight acquisition angle, which allows the objects to appear more as they would in a photograph, rather than a traditional overhead view. A special thanks to Airbus Defence and Space for granting permission to use this image for the demo.
In Fused.io, I’ve developed two powerful User Defined Functions (UDFs) to bring this project to life.
The first UDF converts the original TIFF imagery into WebMercator tiles. The imagery, stored in the cloud-optimized GeoTIFF (COG) format, is reprojected to WebMercator, making it easy to display on web mapping applications.
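The tiling itself follows the standard XYZ ("slippy map") scheme used by most web maps. As a rough illustration of the arithmetic involved, and not the code of the UDF itself, the following sketch converts a longitude/latitude to a tile index and computes the geographic bounds of a tile. The function names are illustrative.

```python
import math
from typing import Tuple


def lonlat_to_tile(lon: float, lat: float, zoom: int) -> Tuple[int, int]:
    """Return the (x, y) index of the XYZ tile containing a lon/lat point."""
    n = 2 ** zoom
    x = int((lon + 180.0) / 360.0 * n)
    lat_rad = math.radians(lat)
    y = int((1.0 - math.asinh(math.tan(lat_rad)) / math.pi) / 2.0 * n)
    return x, y


def tile_bounds(x: int, y: int, zoom: int) -> Tuple[float, float, float, float]:
    """Return (west, south, east, north) of an XYZ tile in degrees."""
    n = 2 ** zoom
    west = x / n * 360.0 - 180.0
    east = (x + 1) / n * 360.0 - 180.0
    # Tile rows are counted from the top, so row y is the northern edge.
    north = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * y / n))))
    south = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * (y + 1) / n))))
    return west, south, east, north


# Example: find the zoom-16 tile covering a point in Hong Kong.
tile_x, tile_y = lonlat_to_tile(114.17, 22.30, 16)
west, south, east, north = tile_bounds(tile_x, tile_y, 16)
```

The sketch does not clamp points at the poles or on the antimeridian, which a production implementation would need to handle.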
The second UDF takes it a step further by calling my model’s endpoint for each tile at zoom level 16, which corresponds to sending a roughly 2048x2048 image and a prompt. It then returns the results as bounding boxes, indicating the detected objects.
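Once the endpoint has answered, each pixel-space box has to be mapped back to geographic coordinates before it can be drawn on the map. Here is a minimal sketch of that step, assuming the tile bounds are known in degrees and the box is given as pixel (x_min, y_min, x_max, y_max); the function name and property keys are hypothetical. Interpolating latitude linearly is only an approximation, though a small one at the scale of a single zoom-16 tile.

```python
from typing import Dict, Sequence, Tuple


def pixel_box_to_feature(
    box_xyxy: Sequence[float],
    bounds: Tuple[float, float, float, float],
    tile_size: int,
    label: str,
    score: float,
) -> Dict:
    """Map a pixel-space box inside a tile to a GeoJSON polygon Feature.

    `bounds` is (west, south, east, north) of the tile in degrees.
    """
    west, south, east, north = bounds
    x_min, y_min, x_max, y_max = box_xyxy

    # Pixel x grows eastward; pixel y grows downward, i.e. southward.
    left = west + (east - west) * x_min / tile_size
    right = west + (east - west) * x_max / tile_size
    top = north - (north - south) * y_min / tile_size
    bottom = north - (north - south) * y_max / tile_size

    # Counterclockwise exterior ring, closed on its first point.
    ring = [
        [left, bottom],
        [right, bottom],
        [right, top],
        [left, top],
        [left, bottom],
    ]
    return {
        "type": "Feature",
        "geometry": {"type": "Polygon", "coordinates": [ring]},
        "properties": {"label": label, "score": score},
    }


feature = pixel_box_to_feature(
    (0, 0, 1024, 1024), (0.0, 0.0, 2.0, 2.0), 2048, "car", 0.9
)
```

A list of such features forms a GeoJSON FeatureCollection, which is straightforward to serve as vector data for a web map.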
When you put it all together in a web mapping application, it’s quite remarkable. You can enter what you’re searching for in the upper left corner—try terms like ‘building,’ ‘car,’ or ‘tree.’ Fused.io will tile the imagery, send it to the API endpoint where the GroundingDINO model processes it on a GPU, and return the results as a VectorTile, ready to be displayed on any web mapping application.
Check out the final demo here: https://fused.deepl.earth/hongkong. Start by searching for ‘car,’ ‘building,’ ‘tree,’ and then let your imagination run wild! You will soon notice that semantically close prompts produce very similar results.
While this is still a prototype, the potential is immense. The next steps will involve fine-tuning GroundingDINO with more satellite imagery or leveraging a foundation model trained specifically on such imagery, like Clay.
If you’re excited about the possibilities of real-time object detection on satellite images, I’d love to hear your thoughts and ideas—what applications do you envision for this technology? Let’s connect and explore the future of satellite imagery and large language models together!