Data efficiency: solving a problem of visual analytics
Artificial intelligence methods can select and treat data based on their information content.

A consistent pattern in AI research is the need for large volumes of training data. This is a particular problem for visual AI, which typically needs thousands of manually labeled images to learn how to identify and classify the objects they contain. However, there are ways to improve data efficiency. Here is the story of how Samsung SDS researched and developed a different AI model that is far more efficient with its data.

Identifying the problem

Computer vision and visual analytics are becoming increasingly valuable across many industries. Observing and identifying objects in images and video has traditionally been a human-driven job, requiring people to work long hours staring at screens. It is routine, repetitive work that an AI engine could do. Indeed, in the world of surveillance, supply chains, and asset diagnostics, visual AI has been deployed, successfully in many cases.

The main remaining issue is data efficiency. Visual information is difficult to interpret without the mental schemas that humans develop naturally. Machines are still able to read images and identify subjects within them, or rather, mark out features of objects based on learned data. To accomplish this, they need data to learn from, in the form of images that depict the subjects the AI is meant to identify or demarcate.

These images also need to be labeled. The AI can’t simply attach meaning or identification to abstract blobs, even if humans are aware that the blob in question is a “cat”. The features of that cat have to be highlighted or marked in some manner and identified as such. This creates a new challenge: labeling the data, like analyzing the images in the first place, is repetitive, time-consuming, and expensive.

Learning Methods

This new challenge stems from supervised learning. The AI platform or neural network requires direct human input to train it. In the case of computer vision, that input takes the form of manually labeled images or videos. For other types of AI, it is huge spreadsheets, vast seas of data, much of it carefully curated to ensure that the AI processes the information and is then capable of working with new, unstructured data in the future.
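To make the supervised setup concrete, here is a minimal sketch, assuming scikit-learn is available: every training example must carry a human-provided label before the model can learn anything. The small digits dataset stands in for a labeled image collection; it is an illustration, not the method described later in this article.

```python
# Minimal supervised-learning sketch: the model only learns from
# examples that already have human-provided labels (y_train).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)  # small labeled image set (8x8 digit images)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)            # training requires every label up front
accuracy = clf.score(X_test, y_test)
```

The cost hidden in this sketch is the creation of `y_train`: for real-world visual AI, each of those labels is a human looking at an image and annotating it.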

Naturally, this method of learning is incredibly data- and time-intensive. It is also inefficient with the data itself. It takes a great many images to train a visual AI platform, all of which have to be checked and supervised by a technician. That is expensive, defeating much of the cost and time savings of using a machine in the first place.

But what if the data used could be more efficient?

Data efficiency

By experimenting with active learning and Gaussian process (GP) classifiers, Samsung SDS has identified a way to improve the data efficiency of visual AI. This method focuses on the relative information that can be obtained from each subsequent data entry.

Each new image that the AI is given provides progressively less new information to learn from. The first image of a cat in the above example teaches the AI a large amount: cats have fur, four legs, two eyes, whiskers, and so on. The next image may diversify that description, covering different breeds or sizes of cats. But as the images go on and the visual data becomes increasingly repetitive, the AI learns less and less useful information about cats. Because bulk image input is largely random, rather than selective about the data being applied, this diminishing return isn’t being accounted for. They’re just images of cats, after all.
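One standard way to quantify how informative an unlabeled image would be is the entropy of the model's predicted class probabilities: a confident prediction carries little new information, while an ambiguous ("confusing") one carries a lot. This is a generic sketch of that idea, not necessarily the exact scoring used by Samsung SDS.

```python
import numpy as np

def predictive_entropy(probs):
    """Shannon entropy of a class-probability vector; higher = more 'confusing'."""
    p = np.clip(probs, 1e-12, 1.0)
    return float(-np.sum(p * np.log(p)))

# An image the model already classifies confidently adds little...
confident = predictive_entropy(np.array([0.98, 0.01, 0.01]))   # low entropy
# ...while an ambiguous image is the most informative to label next.
ambiguous = predictive_entropy(np.array([0.34, 0.33, 0.33]))   # near the maximum, log(3)
```

Ranking the unlabeled pool by a score like this is what lets an active learner spend its labeling budget where it matters most.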

First, the researchers assembled a pool of roughly 14,000 images, priming the AI with this body of images for pre-classification. Then, around 140 images were selected at random from the pool as the initial active training set, each labeled correctly. The GP classifier is trained on these images and set against the pool of 14,000; this kind of classifier is better at handling imbalanced image sets, where some classes have few representatives. The trained GP then evaluates how much new information was obtained from the training set, along with the accuracy of the classifier. The results can be compared back against the full pool, where the GP classifier reveals which images it finds most novel or confusing. The next set of images fed into the classifier are the ones flagged as most “confusing,” since these provide the highest degree of new information to the platform. This process is repeated until there is high certainty in the accuracy of the classifier.
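The loop described above can be sketched as follows. This is an illustrative reconstruction using scikit-learn's `GaussianProcessClassifier`, with the digits dataset standing in for the ~14,000-image pool; the pool size, initial set size, and batch size here are stand-ins, not the actual Samsung SDS configuration.

```python
# Pool-based active learning with a GP classifier (illustrative sketch).
import numpy as np
from sklearn.datasets import load_digits
from sklearn.gaussian_process import GaussianProcessClassifier

X_pool, y_pool = load_digits(return_X_y=True)   # stands in for the unlabeled pool
rng = np.random.default_rng(0)

# Start from a small random, correctly labeled subset of the pool.
labeled = rng.choice(len(X_pool), size=100, replace=False).tolist()

for _ in range(3):                               # a few acquisition rounds
    gp = GaussianProcessClassifier().fit(X_pool[labeled], y_pool[labeled])
    probs = gp.predict_proba(X_pool)             # classifier's view of the whole pool
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    entropy[labeled] = -np.inf                   # never re-select already-labeled images
    most_confusing = np.argsort(entropy)[-50:]   # flag the most informative images
    labeled.extend(most_confusing.tolist())      # "label" them and retrain
```

In a real deployment, the `labeled.extend(...)` step is where a human annotator labels only the flagged batch, which is what keeps the total labeling effort small.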

The result of this process is that instead of manually labeling thousands of images, Autolabel is taught using only 10% of the total. By selecting the most “confusing”, or rather the most informative, images to label and feed back in, the AI’s training is accelerated. This maximizes data efficiency and avoids the copious time, data, and headaches that the previous methods required.

Where to next?

The applications of this method extend across all of visual analytics. Training time can be cut drastically, allowing cost-effective and quick onboarding of the platform. This benefits the end-user of visual AI by giving them a functioning algorithm as soon as possible.

The implications for AI learning algorithms and methodologies are interesting as well. With further experimentation, it may be possible to develop semi-supervised learning, where the platform learns not just the patterns in images but the schemas needed to interpret new images after only limited supervised input.

This data efficiency also allows for far more complex visual AI. By maximizing data efficiency, additional data sets can be used to teach more and more concepts. Instead of merely cats, why not dogs, or sheep? From a time and data perspective, this allows more flexibility and greater interpretive power for each platform within the same training time as before.

Want to read more?

PatchNet: Unsupervised Object Discovery based on Patch Embedding

Highly Efficient Representation and Active Learning Framework for Imbalanced Data and its Application to COVID-19 X-Ray Classification



Credits: This article was written by Ryan Cann.
