Tabular data as a challenge
Leonid Nekhymchuk
Co-Founder & CEO at Datuum.ai | Experienced Chief Technology Officer | Helping data teams automate data onboarding, freeing them to focus on innovation
The abundance of tabular data that enterprises accumulate internally and online poses a major data integration and analysis challenge for companies that strive to be data-driven.
While AI/ML research and the community have focused mainly on processing unstructured data over the last several years, tabular data is still where most of the time and money in the data integration world is spent.
These days every company, big or small, runs dozens of different systems (data sources). Many of them have great integration capabilities, but every day people still export data from various systems and ingest it into proprietary enterprise Data Lakes, Data Warehouses, reporting databases, and so on.
That requires a lot of time and a lot of qualified data engineers.
To keep up with quickly changing business requirements and growing data volumes, companies and communities have developed many great technologies and tools: the big data stack, data wrangling tools, quality and observability tooling, visualization, and NLU interfaces to the data.
All of these technologies made it possible to collect and process unprecedented amounts of data. They also made the modern data engineering stack very complex, which in turn results in long and expensive development cycles that don't meet business requirements.
These tools assume that the user understands the data: knows how it's structured and how to cross-reference data that comes from different sources.
And that is probably the biggest problem in today's corporate world. To do anything with data, you have to understand it, know where it is, and know how it's structured. It's a misconception that there are people who fully understand all the nuances of a company's data and can efficiently manage it.
And it becomes exponentially more difficult as the company grows and people change.
In its work, Datuum targets this challenge specifically by applying state-of-the-art AI models to help data workers handle the variety and complexity of tabular data. Datuum speeds up and optimizes the integration of multiple data sources by helping data workers find the data they need and connect the dots between source and destination. With every iteration of interacting with your data, our AI learns more about that data, so institutional knowledge is not lost.
Why is it not easy?
Imagine that you have two different data sources that store seemingly similar data, for instance from the medical or insurance domains, where a single table can contain hundreds or even thousands of columns. Our major task is to unify this data into a single source of truth.
The cornerstone of this unification is the process of column matching – discovering combinations of columns from different data sources which store the same information (Fig. 1).
While it could be trivial to match columns with generalized data like name, email or address, it's not the case for columns containing domain-specific information – to map this kind of data, we usually need domain expertise and manual intervention.
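To make this concrete, here is a deliberately naive column matcher, a toy illustration rather than anything Datuum actually ships; the table schemas, the scoring weights, and the use of plain string similarity are all assumptions made for the example:

```python
from difflib import SequenceMatcher

import pandas as pd

# Two hypothetical sources that store the same information under different schemas.
source = pd.DataFrame({"e_mail_addr": ["a@x.com", "b@y.com"], "dob": ["1990-01-01", "1985-06-12"]})
target = pd.DataFrame({"email": ["c@z.com", "a@x.com"], "date_of_birth": ["1970-03-02", "1990-01-01"]})

def name_similarity(a: str, b: str) -> float:
    """Character-level similarity between two column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(col_a: pd.Series, col_b: pd.Series) -> float:
    """Jaccard overlap of the distinct values stored in two columns."""
    a, b = set(col_a.dropna().astype(str)), set(col_b.dropna().astype(str))
    return len(a & b) / len(a | b) if a | b else 0.0

# Score every (source, target) column pair and keep the best match per source column.
for s in source.columns:
    best = max(target.columns,
               key=lambda t: 0.5 * name_similarity(s, t) + 0.5 * value_overlap(source[s], target[t]))
    print(f"{s} -> {best}")
```

Heuristics like these work for generic columns such as email or date of birth, but they break down as soon as the same domain-specific information is encoded differently in each source.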
The other important phase of the unification pipeline would be record matching – the process in which records from different data sources are merged into a single entity. For highly regulated domains like healthcare, this step could be extremely important. And it's even harder to get right when large datasets are being processed, even for human annotators.
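Record matching can be sketched in a similarly simplified way; the record fields and the similarity threshold below are made up, and production systems would add normalization, blocking, and probabilistic scoring:

```python
from difflib import SequenceMatcher

# Hypothetical patient records coming from two systems.
records_a = [{"name": "Jon Smith", "dob": "1990-01-01"}]
records_b = [{"name": "John Smith", "dob": "1990-01-01"},
             {"name": "Jane Doe", "dob": "1992-07-20"}]

def same_entity(r1: dict, r2: dict, threshold: float = 0.6) -> bool:
    # Exact match on date of birth plus a fuzzy match on the name.
    if r1["dob"] != r2["dob"]:
        return False
    name_sim = SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()
    return name_sim >= threshold

matches = [(a, b) for a in records_a for b in records_b if same_entity(a, b)]
print(matches)  # the two Smith records are linked as the same entity
```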
Transformers to the rescue!
Thankfully, the latest research in large language models gives us the key to simplifying the matching process. It doesn't yet fully remove the need for human experts (especially in complex domains), but it can give them a substantial boost over the standard setting.
By definition, language modeling is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models are trained by analyzing text to correctly predict the next or a masked word in a sequence (Fig. 2).
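As a quick illustration of masked-word prediction, a pretrained masked language model can be queried directly; this sketch assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is specific to our pipeline:

```python
from transformers import pipeline

# Load a pretrained masked language model and ask it to fill in the blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Tabular data is stored in rows and [MASK]."):
    # Each prediction carries the proposed token and the model's score for it.
    print(prediction["token_str"], round(prediction["score"], 3))
```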
The task of language modeling is not new, but in recent years it has become one of the main areas of research and development in machine learning, driven above all by the introduction of the Transformer architecture, which made it practical to train far larger models on far more data.
As a result, transformer models have found use in a wide range of tasks, from question answering to video understanding. Naturally, people looked at applying the architecture to tabular data as well.
Tabular data is textual in nature. One could argue that numeric columns do not constitute text, but numerical values are really nothing more than sequences of specialized text characters. This means that tabular data, just like text, can be effectively modeled as a language using Transformers.
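A minimal sketch of that idea: a column header together with a sample of its values can be flattened into an ordinary string that any text model can consume. The serialization format below is invented for illustration and is not the exact scheme used by models like TABBIE or TaBERT:

```python
import pandas as pd

table = pd.DataFrame({
    "patient_id": [1042, 1043, 1044],
    "diagnosis_code": ["E11.9", "I10", "J45.909"],
})

def serialize_column(df: pd.DataFrame, column: str, sample_size: int = 3) -> str:
    """Turn a column into a plain-text sequence a language model can consume."""
    values = df[column].astype(str).head(sample_size)
    return f"column: {column} | values: " + " , ".join(values)

print(serialize_column(table, "diagnosis_code"))
# column: diagnosis_code | values: E11.9 , I10 , J45.909
```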
Modeling tabular data with Transformers also makes it easier to solve downstream tasks such as column type prediction or column name recovery. Some of the proposed models can additionally learn efficient dense representations of the parts of a table: its columns and rows.
Deep learning journey
Machine learning approaches to tabular data can hardly be called new. For many years, tree-based solutions such as random forests or gradient boosting algorithms have been used for various classification and regression tasks, such as credit score prediction or fraud detection.
However, the complexity of these models only allows us to learn and understand patterns in the data at the observation (row) level.
For a much deeper understanding of the data, such as the relations between columns in a table or between tables in a database, neural networks come in very handy.
In the pre-transformer era, deep learning models had limited ability to process sequential data such as column values or names and, hence, to learn contextual information from it. As a result, they often relied on manually engineered features for model input and processed each column separately.
One prominent example of such a model is Sherlock [1] – the first successful application of deep learning to the task of semantic type classification. Sherlock is a regular feed-forward neural network that uses a set of statistical features calculated from column values to predict one of 78 semantic column types. It was the first deep learning model used to improve column matching in Datuum.
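In the same spirit, though far simpler than the real Sherlock and its large hand-crafted feature set, a feed-forward classifier over basic column statistics might look like the sketch below; the features, labels, and training examples are invented for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def column_features(values: list[str]) -> list[float]:
    """A handful of simple statistics describing a column's values."""
    lengths = [len(v) for v in values]
    return [
        float(np.mean(lengths)),                                        # average value length
        float(np.std(lengths)),                                         # length variability
        float(np.mean([v.replace(".", "").isdigit() for v in values])), # share of numeric values
        len(set(values)) / len(values),                                 # uniqueness ratio
    ]

# Tiny invented training set: feature vectors plus semantic-type labels.
columns = [
    (["john@x.com", "mary@y.org", "bob@z.net"], "email"),
    (["10001", "94105", "60614"], "zip_code"),
    (["Alice Brown", "Tom Lee", "Ann Wu"], "name"),
]
X = [column_features(values) for values, _ in columns]
y = [label for _, label in columns]

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print(clf.predict([column_features(["90210", "10463", "30301"])]))  # expected: ['zip_code']
```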
Despite being quite effective for common column types like name, address, or zip code, Sherlock had a number of limitations, such as a high reliance on annotated data, especially in previously unseen domains, and an inability to take table context and column names into account.
That was an obstacle to rolling Datuum out quickly in new verticals, since the architecture required additional training on the new data.
Fortunately, Transformer-based models proved to be a perfect alternative to Sherlock.
The case for transformers
Our history with applying transformers to the task of column matching started with extensive research into related work. As a result, we chose the TABBIE [2] and TaBERT [3] models for further experiments. It should be noted that at the beginning of 2022, almost all SOTA approaches to column type classification were based on Transformers (Fig. 3).
By fine-tuning these models on multiple relevant datasets, we confirmed the superiority of transformer-based methods over Sherlock. For example, TABBIE's F1-macro score on the TURL dataset [4] was almost 20% higher than Sherlock's.
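For reference, F1-macro averages the per-class F1 scores with equal weight, so rare column types count as much as common ones; with scikit-learn (invented labels) it can be computed like this:

```python
from sklearn.metrics import f1_score

y_true = ["email", "zip_code", "name", "name", "address"]
y_pred = ["email", "zip_code", "name", "address", "address"]

# Macro averaging computes F1 per class and then takes the unweighted mean,
# so rare column types count as much as common ones.
print(f1_score(y_true, y_pred, average="macro"))
```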
In fact, the classification performance of transformers wasn't even what we were looking for. The property that caught our attention most was the representational ability of these models.
For instance, by training transformer-based models like TABBIE on a domain-specific dataset for the task of semantic column type classification, we were able to learn dense representations of the elements of these tables: rows and columns. This, in turn, allowed us to extend the model to previously unseen domains without having to annotate new data.
Our column matching pipeline is specifically designed for a system that has a regular inflow of new data from previously unseen domains (Fig. 4).
Generally, our approach builds on the dense column representations described above: columns from both the source and the destination are embedded with the fine-tuned model and then matched by similarity in the embedding space, as sketched below.
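Below is a minimal sketch of such an embedding-based matcher. It stands in for the fine-tuned table encoder with an off-the-shelf sentence-transformers model; the column names, the serialization format, and the model choice are all assumptions for illustration, not the actual Datuum implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

# Stand-in encoder; in the real pipeline this would be the fine-tuned table model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_columns(columns: dict[str, list[str]]) -> np.ndarray:
    """Serialize each column (header + sample values) and embed it as one vector."""
    texts = [f"{name}: " + " , ".join(map(str, values)) for name, values in columns.items()]
    return encoder.encode(texts, normalize_embeddings=True)

source = {"e_mail_addr": ["a@x.com", "b@y.com"], "dob": ["1990-01-01", "1985-06-12"]}
target = {"date_of_birth": ["1970-03-02", "1990-01-01"], "email": ["c@z.com", "a@x.com"]}

src_emb, tgt_emb = embed_columns(source), embed_columns(target)
similarity = src_emb @ tgt_emb.T  # cosine similarity, since embeddings are normalized

# One-to-one assignment that maximizes total similarity between source and target columns.
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(list(source)[r], "->", list(target)[c], round(float(similarity[r, c]), 2))
```

Using a one-to-one assignment rather than a per-column argmax keeps two source columns from being mapped onto the same destination column.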
This pipeline has a few important benefits compared to the Sherlock-based one, namely:
- it does not depend on annotated data, so it extends to previously unseen domains without additional labeling;
- it takes column names and the surrounding table context into account rather than only per-column statistics;
- it delivers noticeably higher matching accuracy.
Combined, these advantages set a new standard for the accuracy and adaptability of the column matching employed in Datuum.
Conclusion
The representational ability of Transformer-based language models is bringing new capabilities to almost all tasks that rely on textual data, and table cross-referencing is no exception.
By employing a Transformer model in our column-matching pipeline, we were able to increase both the matching accuracy and the generalization ability of the system without depending on annotated data, which is especially useful in domains where data access is restricted.