Overcoming Data Scarcity: Doing More With Less Using Data-Centric AI
“Instead of focusing on the code, companies should focus on developing systematic engineering practices for improving data in ways that are reliable, efficient, and systematic. In other words, companies need to move from a model-centric approach to a data-centric approach.” - Andrew Ng
The Traditional Approach to ML
For a decade now, the dominant approach to machine learning has focused on the architecture of models and the myriad ways to train them. The process involves tweaking ML algorithms to get more accurate results while keeping the data fixed. Model training is treated as a continuous process, revisited throughout the lifecycle of the ML project, while working with data remains a largely static step once the preliminary handling is complete.
There have been significant advances using this approach, so much so that ML models are now widely available not only to practitioners in the field but also to amateurs, who apply them across a host of different domains. With the large-scale commodification of ML models, this approach appears to have been pushed close to its limits, and a fresh perspective is required to advance the field further. Different points of view are especially needed in an enterprise context, where the domain shifts away from images and videos and the data becomes more context-centric.
Obstacles Faced by Model-Centric AI
In practice, it is often overlooked that data plays a vital role in the ML pipeline: it is the first ingredient in any AI system, and any issues with the data invariably seep into the entire process. Even the performance of models ultimately depends on the data being fed to them; if that data happens to be noisy and inconsistent, so will be the outcome. A shortage of data, especially labeled data, is another prominent challenge faced by ML engineers.
Procuring and labeling additional data is expensive, time-consuming, and in certain situations simply not possible. This makes a model-centric stance a liability, given the data-hungry nature of ML models.
The scarcity of data has plagued diverse fields, from core industries such as steel manufacturing to medical services, computer vision, and NLP. One of the most severely hit domains is financial services, where it is imperative for ML applications to work efficiently with less data because of its sensitive nature and the many privacy considerations owed to customers.
Is Transitioning to a Data-Centric Perspective the Answer?
These factors have resulted in a shift of focus to all things data related, such as data labeling, curation, and augmentation. Many AI experts agree that a transition from big data to good data is necessary. The philosophy of data centricity is to treat data processing as an iterative process and to continuously improve it using incoming information.
It doesn’t just advocate making data good; more importantly, it advocates keeping data good throughout the project.
It appears to address a number of challenges faced by a model-governed approach, such as insignificant improvements in accuracy rates, overfitting, and a lack of data for training sets.
How Does Being Data-Centric Help with the Issue of Data Scarcity? Data Augmentation Is the Key!
There are a few ways to deal with data scarcity, such as the newly popularized use of synthetic data as well as data augmentation.
Data augmentation is an integral part of data-centric AI. At its most basic level, it is an efficient way to obtain additional data by applying mindful, label-preserving transformations to the data already available.
Apart from enlarging the training set, this leads to a more general model that performs better in real-life scenarios, thanks to higher data coverage and more use cases. It also guards against overfitting and against errors creeping in from biased datasets.
The procedure undertaken for data augmentation is easiest to discern through a simple example from image analysis.
If the model must identify cats, then first some images are chosen from the initial training dataset. These undergo various changes, such as added noise, shifts in hue and saturation, translation, or even a combination of these. The new, artificially created images are added to the training dataset, allowing the model to recognize a cat in different scenarios.
The above example also illustrates how critical it is to interact with subject matter experts during this operation to get usable results and to avoid adding redundant data points. For instance, it is highly improbable that the model would ever face a cat rotated upside down, so a vertical flip adds little real-world coverage. Domain experts can catch choices like these and give actionable insight to straighten out the data augmentation process, yielding more accurate results without additional computational effort.
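To make this concrete, here is a minimal Python sketch of such an augmentation step, using Pillow and NumPy. The transforms and parameter ranges (noise scale, saturation factors, shift limits) are illustrative assumptions rather than values from any particular production pipeline, and the vertical flip is deliberately left out, encoding exactly the kind of domain knowledge described above.

```python
import numpy as np
from PIL import Image, ImageEnhance

def augment_cat_image(img: Image.Image, rng: np.random.Generator) -> list:
    """Create augmented copies of one training image via plausible transforms."""
    augmented = []

    # 1. Additive Gaussian noise: simulates sensor noise or low-light capture.
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + rng.normal(0.0, 10.0, arr.shape)
    augmented.append(Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8)))

    # 2. Saturation jitter: covers different cameras and lighting conditions.
    augmented.append(ImageEnhance.Color(img).enhance(rng.uniform(0.6, 1.4)))

    # 3. Translation: shifts the subject so the model is not anchored to one spot.
    dx, dy = (int(v) for v in rng.integers(-20, 21, size=2))
    augmented.append(img.transform(img.size, Image.AFFINE, (1, 0, dx, 0, 1, dy)))

    # Deliberately NO vertical flip: as noted above, an upside-down cat is an
    # implausible input, so that transform would add noise, not coverage.
    return augmented

# Usage sketch (hypothetical file name):
# new_images = augment_cat_image(Image.open("cat_001.jpg"), np.random.default_rng(0))
```

Each transform preserves the label (the image still shows a cat), which is what makes the new samples legitimate additions to the training set.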
As a result, data augmentation helps immensely in resolving data scarcity: it yields better models, enables more accurate predictions and, most importantly, staves off pesky privacy issues. This becomes especially significant when dealing with large-scale, real-time complex systems, where it can get quite tricky to keep track of everything that can go wrong across data accumulation, storage, data pipelines, and business-logic transformations. The task is further complicated by the fact that all these systems are interconnected and need to be handled holistically.
How VuNet Systems Uses Data-Centric AI to Do Better with Less Data
At VuNet Systems, we strive to find better and more robust methods for dealing with data shortages by incorporating the innovative solutions that data-centric AI puts forward, while also valuing our customers’ privacy and the sensitivity of the data we handle.
Through our extensive research and our tie-ups with leading academic institutions, we equip ourselves with an ever-expanding arsenal against the current challenges in the Fintech sector.
Operating in the space of business journey observability, VuNet helps enterprises improve user experience by optimizing transaction turnaround times and reducing failure rates. Our platform handles billions of digital transactions monthly, which also means handling terabytes of logs, metrics, and traces.
Our current research on ML models at scale centers on handling varied time-series data and metrics, automated parsing of unstructured logs, and accelerated root cause analysis. The aim is to deliver business and operational outcomes: reducing the mean time to detect and resolve failures, and providing early warning indicators, from incident prediction to better capacity planning.
A practical concern in our space is the lack of labeled data on events, alerts, and other metrics, and the absence of good-quality data or telemetry. Beyond this, we also need to seamlessly integrate a great deal of enterprise business, domain, and environment context into real-time metrics to get meaningful results.
All this has required our research team to take a data-centric AI approach and blend ML (deep learning based) and statistical models to good effect. Data simulation and augmentation are used heavily in our areas of work to recreate production-type scenarios, observe trends and patterns at daily, weekly, and monthly granularity, and drive our domain-centric models to production with increasing accuracy. We also augment our datasets with knowledge from domain experts and our solution architects, taking a programmatic approach to labeling that is more iterative and closer to real time.
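As an illustration of what these two ideas can look like in code, below is a minimal sketch pairing simple time-series augmentation with rule-based (programmatic) labeling. The transforms, rule names, and thresholds are hypothetical stand-ins for illustration; they are not VuNet’s actual models or rules.

```python
import numpy as np

ABSTAIN, NORMAL, ANOMALY = -1, 0, 1

def augment_series(series: np.ndarray, rng: np.random.Generator) -> list:
    """Label-preserving transforms for a 1-D metric series (e.g. per-minute latency)."""
    jittered = series + rng.normal(0.0, 0.02 * series.std(), series.shape)  # measurement noise
    scaled = series * rng.uniform(0.9, 1.1)                                 # amplitude drift
    shifted = np.roll(series, int(rng.integers(-5, 6)))                     # small phase shift
    return [jittered, scaled, shifted]

# Programmatic labeling: knowledge from domain experts is encoded as small
# voting functions instead of labeling every window by hand.
def lf_latency_spike(window: np.ndarray) -> int:
    """Hypothetical rule: recent values far above the typical level."""
    return ANOMALY if window[-10:].mean() > 3 * np.median(window) else ABSTAIN

def lf_flatline(window: np.ndarray) -> int:
    """Hypothetical rule: a dead telemetry feed shows near-zero variance."""
    return ANOMALY if window.std() < 1e-6 else ABSTAIN

def weak_label(window: np.ndarray) -> int:
    """Combine the rules' votes; any firing rule marks the window anomalous."""
    votes = [lf(window) for lf in (lf_latency_spike, lf_flatline)]
    fired = [v for v in votes if v != ABSTAIN]
    return max(fired, default=NORMAL)
```

The augmented copies expand scarce training data without touching real customer records, and the weak labels give supervised models something to learn from while genuine incident labels are still accumulating.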
For the most significant and effective results, it is vital to optimize every ingredient of the ML project, from the data to the model architecture. We believe in a hybrid approach that combines the decade-old wisdom of the model-centric perspective with the cutting-edge advances of data-centricity, giving our customers the utmost value as we step into the ever-changing AI landscape.