Tabular data as a challenge
Leonid Nekhymchuk
Co-Founder & CEO at Datuum.ai | Experienced Chief Technology Officer | Helping data teams automate data onboarding, freeing them to focus on innovation
The abundance of tabular data that enterprises accumulate internally and online poses a major data integration and analysis challenge for companies that strive to be data-driven.
While AI/ML research and the community have focused mainly on processing unstructured data over the last several years, tabular data is still where most of the time and money in the data integration world is spent.
These days every company, big or small, runs dozens of different systems (data sources). Many of them have great integration capabilities, but every day people still export data from various systems and ingest it into proprietary enterprise Data Lakes, Data Warehouses, reporting databases, and so on.
That requires a lot of time and a lot of qualified data engineers.
To keep up with quickly changing business requirements and growing data volumes, companies and communities have developed many great technologies and tools: the big data stack, data wrangling tools, quality and observability tooling, visualization, and NLU interfaces to the data.
All of these technologies made it possible to collect and process unprecedented amounts of data. They also made the modern data engineering stack very complex, which in turn results in long and expensive development cycles that don't meet business requirements.
These tools assume that the user understands the data: knows how it's structured and how to cross-reference data that comes from different sources.
And that is probably the biggest problem in today's corporate world. To do anything with data, you have to understand it, know where it is, and know how it's structured. It's a misconception that there are people who fully understand all the nuances of a company's data and can efficiently manage it.
And it becomes exponentially more difficult as the company grows and people change.
In its work, Datuum targets this challenge specifically by applying state-of-the-art AI models to help data workers handle the variety and complexity of tabular data. Datuum speeds up and optimizes the integration of multiple data sources by helping data workers find the data they need and connect the dots between source and destination. With every iteration of interacting with your data, our AI learns more about that data, so institutional knowledge is not lost.
Why is it not easy?
Imagine that you have two different data sources that store seemingly similar data, for instance from the medical or insurance domains, where a single table can contain hundreds or even thousands of columns. Our major task is to unify this data into a single source of truth.
The cornerstone of this unification is the process of column matching – discovering combinations of columns from different data sources which store the same information (Fig. 1).
While it could be trivial to match columns with generalized data like name, email or address, it's not the case for columns containing domain-specific information – to map this kind of data, we usually need domain expertise and manual intervention.
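To make this concrete, here is a deliberately naive column matcher, a toy illustration rather than anything Datuum actually ships; the table schemas, the scoring weights, and the use of plain string similarity are all assumptions made for the example:

```python
from difflib import SequenceMatcher

import pandas as pd

# Two hypothetical sources that store the same information under different schemas.
source = pd.DataFrame({"e_mail_addr": ["a@x.com", "b@y.com"], "dob": ["1990-01-01", "1985-06-12"]})
target = pd.DataFrame({"email": ["c@z.com", "a@x.com"], "date_of_birth": ["1970-03-02", "1990-01-01"]})

def name_similarity(a: str, b: str) -> float:
    """Character-level similarity between two column names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def value_overlap(col_a: pd.Series, col_b: pd.Series) -> float:
    """Jaccard overlap of the distinct values stored in two columns."""
    a, b = set(col_a.dropna().astype(str)), set(col_b.dropna().astype(str))
    return len(a & b) / len(a | b) if a | b else 0.0

# Score every (source, target) column pair and keep the best match per source column.
for s in source.columns:
    best = max(target.columns,
               key=lambda t: 0.5 * name_similarity(s, t) + 0.5 * value_overlap(source[s], target[t]))
    print(f"{s} -> {best}")
```

Heuristics like these work for generic columns such as email or date of birth, but they break down as soon as the same domain-specific information is encoded differently in each source.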
The other important phase of the unification pipeline would be record matching – the process in which records from different data sources are merged into a single entity. For highly regulated domains like healthcare, this step could be extremely important. And it's even harder to get right when large datasets are being processed, even for human annotators.
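Record matching can be sketched in a similarly simplified way; the record fields and the similarity threshold below are made up, and production systems would add normalization, blocking, and probabilistic scoring:

```python
from difflib import SequenceMatcher

# Hypothetical patient records coming from two systems.
records_a = [{"name": "Jon Smith", "dob": "1990-01-01"}]
records_b = [{"name": "John Smith", "dob": "1990-01-01"},
             {"name": "Jane Doe", "dob": "1992-07-20"}]

def same_entity(r1: dict, r2: dict, threshold: float = 0.6) -> bool:
    # Exact match on date of birth plus a fuzzy match on the name.
    if r1["dob"] != r2["dob"]:
        return False
    name_sim = SequenceMatcher(None, r1["name"].lower(), r2["name"].lower()).ratio()
    return name_sim >= threshold

matches = [(a, b) for a in records_a for b in records_b if same_entity(a, b)]
print(matches)  # the two Smith records are linked as the same entity
```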
Transformers to the rescue!
Thankfully, the latest research in large language models gives us the key to simplifying the matching process. It doesn't yet fully remove the need for human experts (especially in complex domains), but it can give them a substantial boost over the standard setting.
By definition, language modeling is the use of various statistical and probabilistic techniques to determine the probability of a given sequence of words occurring in a sentence. Language models are trained by analyzing text to correctly predict the next or a masked word in a sequence (Fig. 2).
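As a quick illustration of masked-word prediction, a pretrained masked language model can be queried directly; this sketch assumes the Hugging Face transformers library and the bert-base-uncased checkpoint, neither of which is specific to our pipeline:

```python
from transformers import pipeline

# Load a pretrained masked language model and ask it to fill in the blank.
fill_mask = pipeline("fill-mask", model="bert-base-uncased")

for prediction in fill_mask("Tabular data is stored in rows and [MASK]."):
    # Each prediction carries the proposed token and the model's score for it.
    print(prediction["token_str"], round(prediction["score"], 3))
```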
The task of language modeling is not new, but in recent years it has become one of the main areas of research and development in machine learning, driven above all by the introduction of the Transformer architecture, which made it practical to train far larger models on far more data.
As a result, transformer models have found use in a wide range of tasks, from question answering to video understanding. Naturally, people looked at applying the architecture to tabular data as well.
Tabular data is textual in nature. One could argue that numeric columns do not constitute text, but numerical values are really nothing more than sequences of specialized text characters. This means that tabular data, just like text, can be effectively modeled as a language using Transformers.
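A minimal sketch of that idea: a column header together with a sample of its values can be flattened into an ordinary string that any text model can consume. The serialization format below is invented for illustration and is not the exact scheme used by models like TABBIE or TaBERT:

```python
import pandas as pd

table = pd.DataFrame({
    "patient_id": [1042, 1043, 1044],
    "diagnosis_code": ["E11.9", "I10", "J45.909"],
})

def serialize_column(df: pd.DataFrame, column: str, sample_size: int = 3) -> str:
    """Turn a column into a plain-text sequence a language model can consume."""
    values = df[column].astype(str).head(sample_size)
    return f"column: {column} | values: " + " , ".join(values)

print(serialize_column(table, "diagnosis_code"))
# column: diagnosis_code | values: E11.9 , I10 , J45.909
```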
Modeling tabular data with Transformers also makes it easier to solve downstream tasks such as column type prediction or column name recovery. Some of the proposed models can additionally learn efficient dense representations of the parts of a table: its columns and rows.
Deep learning journey
Machine learning approaches to tabular data can hardly be called new. For many years, tree-based solutions such as random forests or gradient boosting algorithms have been used for various classification and regression tasks, such as credit score prediction or fraud detection.
However, the complexity of these models only allows us to learn and understand patterns in the data at the observation (row) level.
For a much deeper understanding of the data, such as the relations between columns in a table or between tables in a database, neural networks come in very handy.
In the pre-transformer era, deep learning models had limited ability to process sequential data such as column values or names and, hence, to learn contextual information from it. As a result, they often relied on manually engineered features for model input and processed each column separately.
One prominent example of such a model is Sherlock [1] – the first successful application of deep learning to the task of semantic type classification. Sherlock is a regular feed-forward neural network that uses a set of statistical features calculated from column values to predict one of 78 semantic column types. It was the first deep learning model used to improve column matching in Datuum.
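In the same spirit, though far simpler than the real Sherlock and its large hand-crafted feature set, a feed-forward classifier over basic column statistics might look like the sketch below; the features, labels, and training examples are invented for illustration:

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

def column_features(values: list[str]) -> list[float]:
    """A handful of simple statistics describing a column's values."""
    lengths = [len(v) for v in values]
    return [
        float(np.mean(lengths)),                                        # average value length
        float(np.std(lengths)),                                         # length variability
        float(np.mean([v.replace(".", "").isdigit() for v in values])), # share of numeric values
        len(set(values)) / len(values),                                 # uniqueness ratio
    ]

# Tiny invented training set: feature vectors plus semantic-type labels.
columns = [
    (["john@x.com", "mary@y.org", "bob@z.net"], "email"),
    (["10001", "94105", "60614"], "zip_code"),
    (["Alice Brown", "Tom Lee", "Ann Wu"], "name"),
]
X = [column_features(values) for values, _ in columns]
y = [label for _, label in columns]

clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0).fit(X, y)
print(clf.predict([column_features(["90210", "10463", "30301"])]))  # expected: ['zip_code']
```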
Despite being quite effective for common column types like name, address, or zip code, Sherlock had a number of limitations, such as a high reliance on annotated data, especially in previously unseen domains, and an inability to take table context and column names into account.
That was an obstacle to rolling Datuum out quickly in new verticals, since the architecture required additional training on the new data.
Fortunately, Transformer-based models proved to be a perfect alternative to Sherlock.
The case for transformers
Our history with applying transformers to the task of column matching started with extensive research into related work. As a result, we chose the TABBIE [2] and TaBERT [3] models for further experiments. It should be noted that at the beginning of 2022, almost all SOTA approaches to column type classification were based on Transformers (Fig. 3).
By fine-tuning these models on multiple relevant datasets, we confirmed the superiority of transformer-based methods over Sherlock. For example, TABBIE's F1-macro score on the TURL dataset [4] was almost 20% higher than Sherlock's.
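For reference, F1-macro averages the per-class F1 scores with equal weight, so rare column types count as much as common ones; with scikit-learn (invented labels) it can be computed like this:

```python
from sklearn.metrics import f1_score

y_true = ["email", "zip_code", "name", "name", "address"]
y_pred = ["email", "zip_code", "name", "address", "address"]

# Macro averaging computes F1 per class and then takes the unweighted mean,
# so rare column types count as much as common ones.
print(f1_score(y_true, y_pred, average="macro"))
```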
In fact, the classification performance of transformers wasn't even what we were looking for. The property that caught our attention most was the representational ability of these models.
For instance, by training transformer-based models like TABBIE on a domain-specific dataset for the task of semantic column type classification, we were able to learn dense representations of the elements of these tables: rows and columns. This, in turn, allowed us to extend the model to previously unseen domains without having to annotate new data.
Our column matching pipeline is specifically designed for a system that has a regular inflow of new data from previously unseen domains (Fig. 4).
Generally, our approach builds on the dense column representations described above: columns from both the source and the destination are embedded with the fine-tuned model and then matched by similarity in the embedding space, as sketched below.
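Below is a minimal sketch of such an embedding-based matcher. It stands in for the fine-tuned table encoder with an off-the-shelf sentence-transformers model; the column names, the serialization format, and the model choice are all assumptions for illustration, not the actual Datuum implementation:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sentence_transformers import SentenceTransformer

# Stand-in encoder; in the real pipeline this would be the fine-tuned table model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def embed_columns(columns: dict[str, list[str]]) -> np.ndarray:
    """Serialize each column (header + sample values) and embed it as one vector."""
    texts = [f"{name}: " + " , ".join(map(str, values)) for name, values in columns.items()]
    return encoder.encode(texts, normalize_embeddings=True)

source = {"e_mail_addr": ["a@x.com", "b@y.com"], "dob": ["1990-01-01", "1985-06-12"]}
target = {"date_of_birth": ["1970-03-02", "1990-01-01"], "email": ["c@z.com", "a@x.com"]}

src_emb, tgt_emb = embed_columns(source), embed_columns(target)
similarity = src_emb @ tgt_emb.T  # cosine similarity, since embeddings are normalized

# One-to-one assignment that maximizes total similarity between source and target columns.
rows, cols = linear_sum_assignment(-similarity)
for r, c in zip(rows, cols):
    print(list(source)[r], "->", list(target)[c], round(float(similarity[r, c]), 2))
```

Using a one-to-one assignment rather than a per-column argmax keeps two source columns from being mapped onto the same destination column.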
This pipeline has a few important benefits compared to the Sherlock-based one, namely:
- it does not depend on annotated data, so it extends to previously unseen domains without additional labeling;
- it takes column names and the surrounding table context into account rather than only per-column statistics;
- it delivers noticeably higher matching accuracy.
Combined, these advantages set a new standard for the accuracy and adaptability of the column matching employed in Datuum.
Conclusion
The representational ability of Transformer-based language models is bringing new capabilities to almost all tasks that rely on textual data, and table cross-referencing is no exception.
By employing a Transformer model in our column-matching pipeline, we were able to increase both the matching accuracy and the generalization ability of the system without depending on annotated data, which is especially useful in domains where data access is restricted.