AI Content Engineering and Complex Tabular Data Processing

It's no surprise that AI models can be trained to process data in various formats and with various semantics, including structured, semi-structured, and unstructured data, depending on the particular needs and goals to be achieved. Of course, not every model pursues all of these goals: some models may be fine-tuned and trained to perform repetitive work on top of structured data, such as a translator that converts code from Java to COBOL (or vice versa) in the PDLC field. Other models act on unstructured data, such as a text-to-speech generator that converts typed characters into voice, or LLMs such as GPT-4 that can summarize, synthesize, produce a resume, or even give an opinion on a given unstructured text.

But there's a certain area where even a typical billion-parameter model such as GPT-4 can fail: processing multi-tabular formatted data that may include nested data in the form of tables of tables, or matrices, possibly with images embedded in them, or even instructions on how the table needs to be read. This leads us to the topic of multi-tabular complex data processing, a topic that belongs to the field of content engineering, as part of a content pipeline and data ingestion for either RAG or training/fine-tuning purposes.

Let's clarify with a simple example what this concept is about. An image is worth more than a million words:

Nested Tabular Data


The table above shows a simple example with a two-level nested table; however, in the real world there can be more complex scenarios with three- or four-level tables that embed tables within tables.

Of course, such complexity can be handled by involving human-in-the-loop (HITL) processing as part of the content engineering workflow within the Content Hub. However, HITL processing means higher costs - specialized, trained people to work with the content and transform it into a readable structured format, such as JSON - but it also implies higher levels of lag and delay within our content pipeline, because manual tasks are by nature lengthier than any automated task. Also, in terms of overall AI system scalability, involving HITL means that if we want the system to scale, we will need more people - costs can quickly become ridiculously high when, on the other hand, our TCO mandate is to keep costs low; otherwise the AI solution can become non-viable.
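To make the target concrete, here is a minimal sketch of how a two-level nested table could be represented in JSON once extracted; all field names and values are illustrative, not a prescribed schema:

{
  "supplier": "ACME Corp",
  "columns": ["region", "quarterly_sales"],
  "rows": [
    {
      "region": "EMEA",
      "quarterly_sales": {
        "columns": ["quarter", "revenue"],
        "rows": [["Q1", 120000], ["Q2", 135000]]
      }
    }
  ]
}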

PDF ingestion and Content Extraction on Content Pipeline


So we need to rely on bot-based processing for our pipeline if we want to scale properly. Either we rely on some solution that is already on the market, or we build our own - the classic buy-versus-build dichotomy. If we go the buy route, there are some interesting offers from hyperscalers, such as Azure Form Recognizer - recently renamed Azure Document Intelligence - which enables us to extract complex data in nested-tabular format from input documents efficiently; of course, at a cost. This could solve the problem but, as expected, you become locked to the vendor for that functionality.

Azure Document Intelligence Architecture
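For a quick taste of that route, here is a minimal sketch using the azure-ai-formrecognizer Python SDK (v3.x); the endpoint, key, and file name are placeholders, and "prebuilt-layout" is the general layout model that returns detected tables:

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholders: use your own resource endpoint and key.
endpoint = "https://<your-resource>.cognitiveservices.azure.com/"
key = "YOUR_KEY"

client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

# Analyze a PDF with the prebuilt layout model, which detects tables.
with open("nested_tables.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

# Walk every detected table cell with its grid coordinates.
for i, table in enumerate(result.tables):
    print(f"Table {i}: {table.row_count} x {table.column_count}")
    for cell in table.cells:
        print(cell.row_index, cell.column_index, repr(cell.content))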


There are also some open-source data extraction tools that can help with extracting complex content such as the nested tables we are dealing with. One downside is that some of them can still struggle with complex OCR scenarios; it depends on how far you push them.

  • Apache PDFBox: Apache PDFBox is an open-source Java library for working with PDF documents. While it's more of a developer tool, it can be used to extract text from PDFs, including tables.
  • Camelot: Camelot is a Python library that leverages OpenCV and other libraries to extract tables from PDFs. It can handle simple and complex tables, including nested ones - see the short sketch after this list.
  • PDFTables: PDFTables is a web-based tool that can convert PDFs with tables into Excel, CSV, or JSON formats. It can handle nested tables to some extent.
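As a quick illustration of the Camelot route, here is a minimal sketch; the file name and page range are placeholders, and "lattice" mode assumes the tables have ruled borders:

import camelot

# Extract tables from page 1 of a sample PDF; "lattice" mode relies on
# ruling lines, while "stream" mode infers columns from whitespace.
tables = camelot.read_pdf("nested_tables.pdf", pages="1", flavor="lattice")

print(tables[0].parsing_report)  # accuracy / whitespace metrics
print(tables[0].df)              # the table as a pandas DataFrame
tables.export("extracted.csv", f="csv")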

Remember that while these tools can handle nested tables to varying degrees, extraction accuracy depends on the complexity and formatting of the tables in the PDF document. It's a good idea to try a few options to see which works best for your specific use case.

Going our own way

To extract information from smaller documents, configuring deep learning models or writing computer vision algorithms is time-consuming. Instead, we can use regular expressions in Python to extract text from PDF documents. Also, remember that this technique does not work for images; we can only use it to extract information from HTML files or PDF documents. This is because a regular expression needs to match the content against the source text to extract information; with images there is no text to match, and the regular expressions will fail. Let's now work with a simple PDF document and extract information from the tables in it.

In the first step, we load the PDF into our program. Once that's done, we convert the PDF to HTML so that we can directly use regular expressions to extract content from the tables. For this, we use the pdfminer module (pdfminer.six on Python 3), which reads content from the PDF and converts it into an HTML string.

Below is the code snippet:

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO
import re


def convert_pdf_to_html(path):
    """Convert the PDF at `path` into a single HTML string."""
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()          # HTMLConverter writes encoded bytes
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0                # 0 means process all pages
    caching = True
    pagenos = set()             # empty set means all page numbers
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)
    device.close()
    html = retstr.getvalue().decode(codec)
    retstr.close()
    return html

We imported several modules, including the regular-expression and PDF-related libraries. In the method convert_pdf_to_html, we pass the path of the PDF file that needs to be converted to HTML. The output of the method is an HTML string, as shown below:

'<span style="font-family: XZVLBD+GaramondPremrPro-LtDisp; font-size:12px">Changing Echoes\n<br>7632 Pool Station Road\n<br>Angels Camp, CA 95222\n<br>(209) 785-3667\n<br>Intake: (800) 633-7066\n<br>SA </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> TX DT BU </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> RS RL OP PH </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> CO CJ \n<br></span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> SF PI </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> AH SP\n<br></span></div>'        

Regular expressions are one of the trickiest and most powerful programming techniques used for pattern matching. They are widely used in several applications, for example for code formatting, web scraping, and validation purposes. Before we start extracting content from our HTML tables, let's quickly look at a small example of applying regular expressions to the output above.
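As a minimal sketch against the sample HTML output above: the patterns below pull the facility name and the phone numbers. Both patterns are tuned to this particular document and would need adjusting for other layouts.

html = convert_pdf_to_html("sample.pdf")

# The facility name is the first chunk of text inside a 12px span.
name_match = re.search(r'font-size:12px">([^<\n]+)', html)

# Phone numbers follow the usual (NNN) NNN-NNNN shape in this document.
phones = re.findall(r'\(\d{3}\)\s*\d{3}-\d{4}', html)

if name_match:
    print(name_match.group(1))   # e.g. Changing Echoes
print(phones)                    # e.g. ['(209) 785-3667', '(800) 633-7066']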

Challenges in Table Detection and Extraction

Table Detection

In this phase, we identify exactly where the tables are present in the given input. The input can be of any format, such as images, PDF/Word documents, and sometimes videos. We use different techniques and algorithms to detect the tables, either by lines or by coordinates. In some cases we might encounter tables with no borders at all, where we need to opt for different methods. Besides these, here are a few other challenges:

  • Image Transformation: Image transformation is a primary step in detecting tables. This includes enhancing the data and borders present in the table. We need to choose proper preprocessing algorithms based on the data presented in the table. For example, when we are working with images, we need to apply thresholding and edge detectors - see the short preprocessing sketch after this list. This transformation step helps us find the content more precisely. In some cases the contours might go wrong and the algorithms fail to enhance the image. Hence, choosing the right image transformation and preprocessing steps is crucial.
  • Image Quality: When we scan tables for information extraction, we need to make sure these documents are scanned in well-lit environments, which ensures good-quality images. When the lighting conditions are poor, CV and DL algorithms might fail to detect tables in the given inputs. If we are using deep learning, we need to make sure the dataset is consistent and has a good set of standard images. If we use these models on tables printed on old, crumpled paper, then we first need to preprocess the pictures and eliminate the noise.
  • Variety of Structural Layouts and Templates: All tables are not unique. One cell can span several cells, either vertically or horizontally, and combinations of spanning cells can create a vast number of structural variations. Also, emphasis features of the text and table lines can affect the way the table's structure is understood; for example, horizontal lines or bold text may indicate multiple headers in the table. The structure of the table visually defines the relationships between cells, and these visual relationships make it difficult to computationally find the related cells and extract information from them. Hence it's important to build algorithms that are robust in handling different table structures.
  • Cell Padding, Margins, Borders: These are the essentials of any table - paddings, margins, and borders will not always be the same. Some tables have a lot of padding inside cells, and some do not. Using good quality images and preprocessing steps will help the table extraction process to run smoothly.
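Here's a minimal preprocessing sketch along those lines, using OpenCV; the file name is a placeholder, and the threshold parameters would need tuning per document:

import cv2

# Load a scanned page, convert to grayscale, then apply adaptive
# thresholding and Canny edge detection so table borders stand out
# for later contour-based table detection.
img = cv2.imread("table_page.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thresh = cv2.adaptiveThreshold(gray, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                               cv2.THRESH_BINARY_INV, 11, 2)
edges = cv2.Canny(thresh, 50, 150)
cv2.imwrite("table_page_edges.png", edges)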

Table Extraction

This is the phase where the information is extracted after the tables are identified. There are many factors regarding how the content is structured and what content is present in the table, so it's important to understand all the challenges before building an algorithm.

  • Dense Content: The content of the cells can be either numeric or textual. However, textual content is usually dense, containing ambiguous short chunks of text with acronyms and abbreviations. To understand tables, the text needs to be disambiguated, and abbreviations and acronyms need to be expanded - see the toy expansion sketch after this list.
  • Different Fonts and Formats: Fonts are usually of different styles, colors, and heights. We need to make sure that these are generic and easy to identify. A few font families, especially those that fall under cursive or handwritten styles, are hard to extract. Hence, using a good font and proper formatting helps the algorithm identify the information more accurately.
  • Multiple Page PDFs and Page Breaks: Grouping text lines into tables is sensitive to a predefined threshold, and with cells spanning multiple pages it becomes difficult to identify the tables. On a multi-table page it is difficult to distinguish different tables from each other, and sparse or irregular tables are hard to work with. Therefore, graphic ruling lines and content layout should be used together as important cues for spotting table regions.
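As a toy illustration of expanding dense cell content: the mapping below is invented for the kind of service codes seen in the HTML output earlier; a real pipeline would source it from a domain glossary.

# Hypothetical glossary for the short service codes (SA, DT, ...).
ABBREVIATIONS = {
    "SA": "Substance Abuse",
    "DT": "Detox",
    "OP": "Outpatient",
}

def expand_cell(cell_text):
    # Replace each known code with its expansion; leave the rest as-is.
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in cell_text.split())

print(expand_cell("SA DT OP"))  # Substance Abuse Detox Outpatient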


Table Conversion

The last phase involves converting the extracted table information into an editable document, such as an Excel spreadsheet, or loading it into other software. Let's look at a few challenges.

  • Set Layouts: When tables in different formats are extracted from scanned documents, we need a proper table layout to push the content into. Sometimes the algorithm fails to extract information from the cells, so designing a proper layout is equally important.
  • Variety of value presentation patterns: Values in cells can be presented using different syntactic patterns. Consider a cell containing the text 6 ± 2: the algorithm might fail to convert that particular information. Hence, extracting numerical values requires knowledge of the possible presentation patterns - see the small pattern sketch after this list.
  • Representation for visualization: Most representation formats for tables, such as the markup languages in which tables can be described, are designed for visualization, which makes it challenging to process tables automatically.
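A small sketch for that one pattern - a value of the form mean ± deviation; the regex and field names are illustrative, and a real converter would maintain a catalog of such patterns:

import re

# Matches values like "6 ± 2" or "6 +/- 2" and splits them into
# a mean and a deviation instead of losing the structure.
VALUE_PM = re.compile(r"(?P<mean>\d+(?:\.\d+)?)\s*(?:±|\+/-)\s*(?P<dev>\d+(?:\.\d+)?)")

m = VALUE_PM.search("6 ± 2")
if m:
    print(float(m.group("mean")), float(m.group("dev")))  # 6.0 2.0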

These are the challenges we face during the table extraction process using traditional techniques. Now let's see how to overcome them with the help of deep learning, an approach that is being widely researched across various sectors.

Using the NanoNets® API

NanoNets provides a full and scalable solution for dealing with complex tables across the three areas of table detection, extraction, and conversion. It provides a robust API that can be integrated into your code. You can try it yourself by visiting https://app.nanonets.com/#/signup
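As a rough sketch of what an integration could look like over plain HTTP with the requests library - the endpoint path, model ID, and response shape below follow NanoNets' documented OCR API pattern, but treat all of them as assumptions to verify against the current API docs:

import requests
from requests.auth import HTTPBasicAuth

API_KEY = "YOUR_NANONETS_API_KEY"   # placeholder credential
MODEL_ID = "YOUR_MODEL_ID"          # placeholder model identifier

# Upload a PDF for extraction; the API key is passed as the basic-auth
# username with an empty password.
url = f"https://app.nanonets.com/api/v2/OCR/Model/{MODEL_ID}/LabelFile/"
with open("nested_tables.pdf", "rb") as f:
    response = requests.post(url,
                             auth=HTTPBasicAuth(API_KEY, ""),
                             files={"file": f})

print(response.status_code)
print(response.json())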






Florencia Cattelani

Chief Operations Officer (COO) at Cloudgaia | MuleSoft Ambassador | MuleSoft Meetup Leader


Interesting, Willy! From my point of view, we should consider AI as another powerful application or tool. However, it still requires deep technical knowledge to configure and make the most out of it. From an integration perspective, it's amazing how much more we can do now with the level of automation and complex processing.

Christian Di Costanzo

Solutions Architect / Software Architect / .Net Architect / AWS Architect / Fintech / Banking


Nice post Guille. Another interesting topic is how to integrate AI with SQL data.

