AI Content Engineering and Complex Tabular Data Processing
Guillermo Wrba
Autor de "Designing and Building Solid Microservice Ecosystems", Consultor Independiente y arquitecto de soluciones ,evangelizador de nuevas tecnologias, computacion distribuida y microservicios.
It's no surprise that AI models can be trained to process data in various formats and with various semantics, including structured, semi-structured and unstructured data, depending on the particular needs and goals to be achieved. Of course, not every model has to cover all of these: some models may be fine-tuned and trained to perform repetitive work on top of structured data, such as a translator that converts code from Java to COBOL, or vice versa, in the PDLC field. Other models act on unstructured data, such as a text-to-speech generator that converts typed characters into voice, or LLMs such as GPT-4 that can summarize, synthesize, produce an abstract, or even give an opinion on a piece of unstructured text.
But there is a certain area where even a typical billion-parameter model such as GPT-4 can fail: the processing of multi-tabular formatted data, which can include nested data in the form of tables of tables, or matrices, and which may also contain embedded images or even instructions on how the table has to be read. This leads us to the topic of multi-tabular complex data processing, a topic that belongs to the field of content engineering, as part of a content pipeline and data-ingestion process for either RAG or training/fine-tuning purposes.
Let's clarify what this concept is about with a simple example; an image is worth more than a thousand words:
The table above shows a simple example with a two-level nested table; in real-world scenarios, however, there can be more complex cases with three- or four-level structures that include tables inside tables.
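To make the target representation concrete, below is a minimal, purely hypothetical sketch (written as a Python dictionary, close to the JSON we will aim for later) of how such a two-level nested table could be captured in a machine-readable structure; all field names are invented for illustration.

# Hypothetical representation of a two-level nested table: the outer table
# has rows, and one of its cells contains another table (the nesting level).
nested_table = {
    "caption": "Facilities by region",        # illustrative caption
    "columns": ["Region", "Facilities"],
    "rows": [
        {
            "Region": "California",
            # this cell is itself a table - the second nesting level
            "Facilities": {
                "columns": ["Name", "Phone"],
                "rows": [
                    {"Name": "Changing Echoes", "Phone": "(209) 785-3667"},
                ],
            },
        },
    ],
}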
Of course, such complexity can be handled by involving human-in-the-loop (HITL) processing as part of the content engineering workflow within the Content Hub. However, HITL processing means higher costs - specialized, trained people have to work with the content and transform it into a readable structured format, such as JSON - and it also introduces higher levels of lag and delay in our content pipeline, because manual tasks are by nature lengthier than automated ones. In terms of overall AI system scalability, involving HITL also means that if we want the system to scale, we will need more people; costs can quickly become ridiculously high when, on the other hand, our TCO mandate is to keep costs low, otherwise the AI solution can become non-viable.
So we need to rely on bot-based processing for our pipeline if we want to scale properly. Either we rely on some solution that is already on the market, or we build our own - the classical buy-versus-build dichotomy. If we go for the buy route, there are some interesting offers from hyperscalers such as Azure Form Recognizer - now part of Azure Document Intelligence - which enables us to extract complex data from input documents in nested-tabular format efficiently, of course at a cost. This can solve the problem, but as expected, you become locked to the vendor to perform such functionality.
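As a rough illustration of how the buy route can look, here is a minimal sketch using the azure-ai-formrecognizer Python SDK and its prebuilt layout model; the endpoint, key and file name are placeholders, and the exact client surface may differ between SDK versions, so treat this as a starting point rather than production code.

# Minimal sketch: extracting tables with Azure Document Intelligence
# (formerly Form Recognizer). Endpoint, key and file path are placeholders.
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

endpoint = "https://<your-resource>.cognitiveservices.azure.com/"   # placeholder
key = "<your-api-key>"                                               # placeholder
client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))

with open("nested_tables.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-layout", document=f)
result = poller.result()

# Each detected table exposes its cells with row/column indices,
# which we can reshape into a plain row-oriented structure.
for table in result.tables:
    rows = [["" for _ in range(table.column_count)] for _ in range(table.row_count)]
    for cell in table.cells:
        rows[cell.row_index][cell.column_index] = cell.content
    print(rows)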
There are also some OSS data extraction tools offered on the open-source side that can help with extracting complex content such as the nested tables we are dealing with. One of the downsides is that some of them can still become challenging when it comes to resolving complex OCR scenarios; it depends on how far you want to go with them.
Remember that while these tools can handle nested tables to varying degrees, the accuracy of extraction may vary depending on the complexity and formatting of the tables in the PDF document. It's a good idea to try out a few options to see which works best for your specific use case.
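As a quick example of the open-source route, a library such as pdfplumber can pull out simple tables in a few lines; how well it copes with nesting, merged cells or scanned pages depends heavily on the document, so the sketch below (with a placeholder file name) is a starting point rather than a general solution.

# Sketch: basic table extraction with the open-source pdfplumber library.
# Nested or borderless tables may need custom table_settings or post-processing.
import pdfplumber

with pdfplumber.open("nested_tables.pdf") as pdf:      # placeholder file name
    for page_number, page in enumerate(pdf.pages, start=1):
        for table in page.extract_tables():
            print(f"Page {page_number}: {len(table)} rows")
            for row in table:
                print(row)    # each row is a list of cell strings (or None)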
Going our own way
To extract information from smaller documents, it is time-consuming to configure deep learning models or write computer vision algorithms. Instead, we can use regular expressions in Python to extract text from PDF documents. Also, remember that this technique does not work for images; we can only use it to extract information from HTML files or PDF documents. This is because, when you use a regular expression, you need to match the content against the source to extract information, and with images there is no text to match, so the regular expressions will fail. Let's now work with a simple PDF document and extract information from the tables in it. Below is the image:
In the first step, we load the PDF into our program. Once that's done, we convert the PDF to HTML so that we can directly use regular expressions to extract content from the tables. For this we use the pdfminer module, which reads the content of a PDF and converts it into an HTML file.
Below is the code snippet:
from pdfminer.pdfinterp import PDFResourceManager
from pdfminer.pdfinterp import PDFPageInterpreter
from pdfminer.converter import HTMLConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import BytesIO
import re  # used later for the regex-based extraction


def convert_pdf_to_html(path):
    # Set up the pdfminer resource manager and an in-memory buffer
    # that will receive the generated HTML.
    rsrcmgr = PDFResourceManager()
    retstr = BytesIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = HTMLConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0  # 0 means all pages
    caching = True
    pagenos = set()
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages,
                                      password=password, caching=caching,
                                      check_extractable=True):
            interpreter.process_page(page)
    device.close()
    html = retstr.getvalue().decode(codec)
    retstr.close()
    return html
We imported several modules, including the regular-expression and PDF-related libraries. In the method convert_pdf_to_html, we pass the path of the PDF file that needs to be converted. The output of the method is an HTML string, as shown below:
'<span style="font-family: XZVLBD+GaramondPremrPro-LtDisp; font-size:12px">Changing Echoes\n<br>7632 Pool Station Road\n<br>Angels Camp, CA 95222\n<br>(209) 785-3667\n<br>Intake: (800) 633-7066\n<br>SA </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> TX DT BU </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> RS RL OP PH </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> CO CJ \n<br></span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> SF PI </span><span style="font-family: GDBVNW+Wingdings-Regular; font-size:11px">s</span><span style="font-family: UQGGBU+GaramondPremrPro-LtDisp; font-size:12px"> AH SP\n<br></span></div>'
Regular expressions are one of the trickiest and most powerful programming techniques used for pattern matching. They are widely used in many applications, for example for code formatting, web scraping, and validation. Before we move on to the broader challenges of table extraction, let's quickly see how a regular expression can pull data out of the string above.
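For instance, a couple of simple patterns are enough to pull the phone numbers and the facility name out of the HTML string shown above; the patterns below are only a sketch tied to this particular output and would need to be generalized for other documents. The file name passed to convert_pdf_to_html is a placeholder.

import re

# `html` is the string returned by the convert_pdf_to_html() function above.
html = convert_pdf_to_html("directory.pdf")    # placeholder path

# Phone numbers in the sample output follow the (NNN) NNN-NNNN pattern.
phones = re.findall(r"\(\d{3}\)\s*\d{3}-\d{4}", html)
print(phones)            # ['(209) 785-3667', '(800) 633-7066']

# The facility name is the text right after the first <span ...> tag,
# up to the first line break or tag.
match = re.search(r"<span[^>]*>([^\n<]+)", html)
if match:
    print(match.group(1))    # 'Changing Echoes'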
Challenges in Table Detection and Extraction
Table Detection
In this phase, we identify where exactly the tables are present in the given input. The input can be in any format, such as images, PDF/Word documents and sometimes even videos. Different techniques and algorithms are used to detect the tables, either by ruling lines or by coordinates; a line-based pass can look roughly like the sketch below. In some cases we might encounter tables with no borders at all, where we need to opt for different methods, and detection itself brings a number of additional challenges beyond these.
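To make the line-based approach a bit more concrete, here is a rough OpenCV sketch that isolates horizontal and vertical ruling lines and combines them into a table mask; the kernel sizes and size thresholds are arbitrary assumptions that usually need tuning per document, and borderless tables will slip through exactly as described above.

# Rough sketch of line-based table detection with OpenCV.
# Kernel sizes and thresholds are assumptions and typically need tuning.
import cv2

image = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)    # placeholder input image
# Invert and binarize so that ink (lines, text) becomes white on black.
_, binary = cv2.threshold(image, 0, 255, cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)

# Extract long horizontal and vertical strokes with morphological opening.
horizontal_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
vertical_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, horizontal_kernel)
vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, vertical_kernel)

# The union of both line masks approximates the table grid;
# its external contours give candidate table bounding boxes.
grid = cv2.add(horizontal, vertical)
contours, _ = cv2.findContours(grid, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
for contour in contours:
    x, y, w, h = cv2.boundingRect(contour)
    if w > 100 and h > 50:           # crude size filter for table-like regions
        print("Candidate table region:", (x, y, w, h))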
Table Extraction
This is the phase where the information is extracted once the tables have been identified. There are many factors in play regarding how the content is structured and what content is present in the table, so it is important to understand all of these challenges before building an algorithm.
Table Conversion
The last phase consists of converting the extracted information from the tables into an editable document, either in Excel or in a format handled by other software; a minimal version of this step is sketched below.
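As a small illustration of this last step, rows that were already extracted (for example the lists of cells returned by pdfplumber earlier) can be reshaped with pandas and written out to Excel or JSON; the rows and column names below are sample data for illustration only.

# Sketch: turning extracted rows into an editable document with pandas.
# The rows and column names are illustrative sample data.
import pandas as pd

rows = [
    ["Changing Echoes", "7632 Pool Station Road, Angels Camp, CA 95222", "(209) 785-3667"],
]
df = pd.DataFrame(rows, columns=["Name", "Address", "Phone"])

df.to_excel("extracted_tables.xlsx", index=False)    # requires openpyxl
df.to_json("extracted_tables.json", orient="records", indent=2)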
These are the challenges we face during the table extraction process when using traditional techniques. Now let's see how to overcome them with the help of deep learning, an approach that is being widely researched across various sectors.
Using the NanoNets® API
NanoNets provides a complete and scalable solution for dealing with complex tables across the three areas of table detection, extraction and conversion. It offers a robust API that can be integrated into your code; you can try it yourself by visiting https://app.nanonets.com/#/signup
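In practice, a call to the API boils down to an authenticated HTTP request with the document attached. The sketch below follows the endpoint pattern shown in NanoNets' public examples, but the exact URL, model identifier and response shape should be treated as assumptions and checked against the current documentation.

# Hedged sketch of calling the NanoNets OCR API with the requests library.
# The URL pattern and model id are assumptions; verify against the current docs.
import requests

API_KEY = "<your-nanonets-api-key>"    # placeholder
MODEL_ID = "<your-model-id>"           # placeholder
url = f"https://app.nanonets.com/api/v2/OCR/Model/{MODEL_ID}/LabelFile/"

with open("nested_tables.pdf", "rb") as f:
    response = requests.post(
        url,
        auth=requests.auth.HTTPBasicAuth(API_KEY, ""),
        files={"file": f},
    )

print(response.status_code)
print(response.json())    # inspect the returned predictions / table structure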