FlySight Collaborates with Data Machine Intelligence to Validate the Gap Between Synthetic and Real-World Data

FlySight and DMI Benchmark Synthetic Data, Achieving Training Results on Par with the Renowned Singapore Maritime Dataset

Announcement

The core objective of this collaboration, which spans fusing virtual scenarios with real-world data and rigorously validating the "GAP" between the virtual simulations created by DMI and actual real-world scenarios, is to ascertain how closely DMI's virtual representations can mirror reality, thereby enhancing the fidelity and applicability of virtual simulations.

The essence of this collaboration lies in accurately quantifying and minimizing the discrepancies between synthetic datasets and real-world observations. Through a meticulous comparison process, the initiative aims to refine the precision of virtual models, ensuring they align closely with real-world data. This is crucial for fields that demand exacting standards of data validation and model reliability, and it provides a concrete basis for applying virtual models to real-world scenarios.

THE CONTEXT: Overcoming Data Challenges in AI - The Role of Synthetic Training Data

In the field of safety-critical AI development, the scarcity of suitable training data represents a significant obstacle. This is especially pertinent for AI systems that must be prepared for events that are not only infrequent but also potentially dangerous to human safety. For example, an autonomous Search and Rescue (SAR) system needs to reliably identify sinking ships. To effectively train a computer vision system for this task, images of ships in distress, often in extreme weather conditions, are essential.

To avoid the necessity of sinking ships just for photographic purposes, generating artificial training data presents a promising alternative. This method could significantly simplify the development process and reduce costs.

Recognizing this potential, FlySight and Data Machine Intelligence have conducted a test to ascertain whether synthetic data can be produced to a standard where its training effectiveness is on par with that of actual datasets.

Targeting Precision: Bridging the Virtual-Real Divide

The metric leveraged by FlySight in this venture is mean Average Precision (mAP). This metric quantifies the accuracy of object detection models by gauging their precision in identifying and classifying targets within images. mAP provides a standardized, objective measure of how virtual data stands up against its real-world counterpart, facilitating a thorough validation of the "GAP" between virtual and real.
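For readers unfamiliar with the metric, the following is a minimal sketch of how the average precision for a single object class can be computed from IoU-matched detections. The helper names, the 0.5 IoU threshold, and the VOC-style interpolation are illustrative assumptions, not details disclosed about this test:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def average_precision(detections, ground_truth, iou_thr=0.5):
    """AP for one class.

    detections:   list of (image_id, confidence, box)
    ground_truth: dict image_id -> list of boxes
    """
    detections = sorted(detections, key=lambda d: d[1], reverse=True)
    used = {img: [False] * len(b) for img, b in ground_truth.items()}
    n_gt = sum(len(b) for b in ground_truth.values())
    tp = np.zeros(len(detections))
    fp = np.zeros(len(detections))
    for i, (img, _, box) in enumerate(detections):
        gts = ground_truth.get(img, [])
        ious = [iou(box, g) for g in gts]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thr and not used[img][best]:
            tp[i], used[img][best] = 1, True   # first sufficient match wins
        else:
            fp[i] = 1                          # duplicate detection or miss
    recall = np.cumsum(tp) / max(n_gt, 1)
    precision = np.cumsum(tp) / np.maximum(np.cumsum(tp) + np.cumsum(fp), 1e-9)
    # All-point interpolation: monotone precision envelope, then the
    # area under the precision-recall curve.
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 1, 0, -1):
        p[i - 1] = max(p[i - 1], p[i])
    steps = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[steps + 1] - r[steps]) * p[steps + 1]))

# mAP is then simply the mean of average_precision over all object classes.
```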

About the test

The initiative ventures into simulating complex scenarios where targets appear at differing distances and may be partially obscured by other targets or obstacles, like buoys. This nuanced approach aims to replicate real-world complexities within virtual simulations, offering a robust platform for validation.

The dataset pivotal to this collaboration has been meticulously captured under varied environmental conditions, predominantly during daylight hours. A commitment to realism drives the endeavor to simulate a broad spectrum of lighting conditions, alongside introducing elements of unpredictability, such as a 10% inclusion of images with light haze effects.

Scene setups are thoughtfully constructed, ranging from stationary cameras positioned on the shore to dynamic setups on board speed boats. In scenarios involving movement, particular attention is paid to adjusting camera angles to accurately capture the influence of environmental factors, such as wave movements, thus enriching the virtual data with real-world complexities.

Test results

We trained three different YOLO-based object detectors using these data variations:

  1. Singapore real data
  2. Synthetic data
  3. Hybrid dataset: 100 randomly selected real images plus all the synthetic data

The results show that the real-data and synthetic-data models perform comparably (31% vs. 32% mAP), whereas the hybrid model performs considerably better (39% mAP).
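The article does not disclose the exact training setup, so the following is only a rough sketch of how the three runs could be reproduced. It uses the open-source Ultralytics API as a stand-in for whichever YOLO variant was actually trained; the dataset YAML files, epoch count, and image size are hypothetical:

```python
from ultralytics import YOLO

# Hypothetical dataset configs: each YAML points at a different training
# split, while all three share the same real-data validation split.
datasets = {
    "real": "singapore_real.yaml",
    "synthetic": "dmi_synthetic.yaml",
    "hybrid": "hybrid_100real_plus_synthetic.yaml",
}

for name, cfg in datasets.items():
    model = YOLO("yolov8n.pt")              # any YOLO-family checkpoint
    model.train(data=cfg, epochs=50, imgsz=640)
    metrics = model.val()                   # evaluated on real validation data
    print(f"{name}: mAP@0.5 = {metrics.box.map50:.2f}")
```

Holding the validation split fixed across all three runs is what makes the mAP figures directly comparable.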

Conclusion

The results confirm that, together, FlySight and Data Machine Intelligence have found a reliable and scalable way to generate synthetic data for AI model training, whether used on its own or combined with a small amount of real data.

Based on these results, we will take the next steps and plan to expand our collaboration.

DMI Labs automation development platform

The synthetic dataset was created using the DMI Labs automation development platform. Data Machine Intelligence leveraged its proprietary simulation engine, connected to Unreal Engine for rendering, to produce highly accurate datasets. To reproduce an environment similar to that captured in the Singapore Maritime Dataset, the simulation used a ray-tracing engine adapted to emulate realistic camera characteristics. This process resulted in an annotated dataset with 20,000 instances, comprising 2,893 images of synthetic data and 4,062 images of real data.

Key Features of the Synthetic Dataset

● Boat Categories and Frequencies: Four categories of boats were generated to match the frequencies observed in the real data.

● Simulated Environment: A 5 km × 5 km field was created, where boats were placed randomly to reflect natural distribution.

● Position Probability Enhancements: Parameters were set to increase the likelihood of finding boats in their natural harbor positions. Sailing boats, boats, and speedboats were positioned nearer to the camera, while container ships were placed further away. These thresholds were adjusted throughout the simulation.

● Dynamic Boat Placement: Boats were placed and removed after each frame, creating high frame-to-frame variation.

● Water and Weather Variability: Two water lines were simulated to vary water roughness, with parameters adjusted throughout. Sky conditions, sun positions, and weather elements were generated procedurally, with particular attention to fog, using four different models.

● Realistic Occlusions: Static environment objects, such as buoys, were added in some frames to simulate real data conditions and introduce occlusions.

● Bounding Box Adjustments: Encapsulated bounding boxes were merged when fully included in another of the same type, as sketched below.
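The last rule is concrete enough to sketch. Below is a minimal, hypothetical implementation of the merge step, assuming annotations are (class, [x1, y1, x2, y2]) pairs; DMI's actual pipeline is not public:

```python
def contains(outer, inner):
    """True if `inner` [x1, y1, x2, y2] lies fully inside `outer`."""
    return (outer[0] <= inner[0] and outer[1] <= inner[1]
            and outer[2] >= inner[2] and outer[3] >= inner[3])

def merge_contained(annotations):
    """Drop any box fully enclosed by another box of the same class,
    keeping only the enclosing box (e.g. a distant boat whose box falls
    entirely within the box of a nearer boat of the same category).
    Note: two identical same-class boxes would drop each other here;
    a production version would break that tie."""
    keep = []
    for i, (cls_i, box_i) in enumerate(annotations):
        enclosed = any(
            j != i and cls_j == cls_i and contains(box_j, box_i)
            for j, (cls_j, box_j) in enumerate(annotations)
        )
        if not enclosed:
            keep.append((cls_i, box_i))
    return keep
```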

This advanced simulation process ensured that the synthetic dataset closely mirrored the real-world conditions of the reference dataset, providing a robust foundation for our benchmarking test.

We are truly satisfied with the results we have achieved. The test shows the possibilities of synthetic data and has given us clear confirmation that the additional improvements on our roadmap are attainable and will provide real value. In the FlySight simulation we used an optical camera, but there are technically no limits regarding sensor types. For example, very shortly we will be able to provide simulation in thermal imaging, radar, and lidar within the same scenario, to also train and assess sensor fusion. Our mission is to accelerate the development of safe and robust AI systems, and having a fine-tuned dataset generation engine at hand is a big step forward on this path.

Matteo Marone, CTO Synthetic Data, Data Machine Intelligence

FlySight's Method of Validation

Experiments were conducted using different sets of data to fine-tune the YOLO-X object detector. Specifically, fine-tuning was tested with real data only, synthetic data only, and a combination of both, and all models were tested on real data. The models fine-tuned with real data alone and with synthetic data alone achieved similar performance (31% and 32% mean average precision, respectively). However, combining 100 randomly selected real images with the synthetic data boosted performance to 39%.
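For concreteness, assembling such a hybrid training split could look roughly like the snippet below; the directory layout and the fixed seed are hypothetical, as the article only states that the 100 real images were drawn at random:

```python
import random
from pathlib import Path

# Hypothetical locations of the two image pools.
real_images = sorted(Path("data/singapore_real/train").glob("*.jpg"))
synthetic_images = sorted(Path("data/dmi_synthetic/train").glob("*.jpg"))

random.seed(42)  # illustrative seed, for a reproducible draw
hybrid_split = random.sample(real_images, 100) + synthetic_images
print(f"Hybrid training split: {len(hybrid_split)} images")
```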

The first thing that comes to mind when someone asks us to develop an AI method is: where do we get the data for training and testing? My first choice is always to look for a public dataset. However, I often encounter the same questions: Is the data sufficient in quantity? Is there enough variety in the data? Considering privacy and ownership, can I use this data for training and testing my algorithm? What if I need more data? Synthetic data generation has the potential to address all these issues and many more. Our tests with DMI have shown that this is not just a possibility for the future; it is a reality right now. We are very interested in continuing this collaboration and exploring new scenarios and contexts.

Niccolò Camarlinghi, Ph.D., Head of Research, FlySight


