Unleashing the Power of HtmlRAG: Transforming RAG with HTML Enhanced Table Data from UnDatas.io

Alex Zhang

Founder of UnDatas.IO | Unstructured Data Processing & Financial Modeling Expertise | Driving Business Value Through Data & Analytics | Empowering Businesses with Data-Driven Insights

Published: February 27, 2025

In the dynamic landscape of natural language processing, Retrieval-Augmented Generation (RAG) has emerged as a game-changer, offering a promising solution to enhance the capabilities of Large Language Models (LLMs) and mitigate their notorious hallucination problem. Today, we’ll explore how HtmlRAG, in conjunction with the data extraction capabilities of UnDatas.io, is revolutionizing the RAG paradigm, particularly when it comes to handling table data in HTML format.

The Rise of RAG and the Challenge of Hallucination

LLMs have demonstrated remarkable prowess in various natural language tasks. However, hallucination, where a model generates plausible but factually incorrect information, remains a significant hurdle. RAG addresses this by retrieving external knowledge and integrating it into the generation process. Traditional RAG systems often rely on plain text as the format for retrieved knowledge, but this approach can lose crucial information, especially when dealing with complex documents such as financial reports or technical manuals that contain tables.

HtmlRAG: Why HTML is a Game-Changer for RAG

HtmlRAG takes the concept of using HTML in RAG systems to the next level. The idea is simple yet profound: instead of converting HTML to plain text, we use HTML directly as the format for retrieved knowledge in RAG. This approach offers several benefits.

First, HTML can better represent the original document’s structure and semantics compared to plain text. In the context of table data, HTML tags can clearly define table headers, rows, and cells, providing a more structured input for LLMs. This structured input helps LLMs understand the data better and reduces the likelihood of misinterpreting the information, thus alleviating the hallucination problem.
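To make the difference concrete, here is a minimal sketch that flattens a small HTML table to plain text. The table contents and the `flatten_to_text` helper are made up for illustration and are not taken from UnDatas.io or HtmlRAG; the tag stripping is a deliberately naive conversion.

```python
import re

# Illustrative table (made-up figures); the last Net income cell is empty.
html_table = (
    "<table>"
    "<tr><th>Quarter</th><th>Revenue</th><th>Net income</th></tr>"
    "<tr><td>Q1-2025</td><td>120</td><td>15</td></tr>"
    "<tr><td>Q2-2025</td><td>135</td><td></td></tr>"
    "</table>"
)

def flatten_to_text(html: str) -> str:
    """Naive plain-text conversion: replace tags with spaces, collapse whitespace."""
    text = re.sub(r"<[^>]+>", " ", html)
    return " ".join(text.split())

plain = flatten_to_text(html_table)
print(plain)
# Quarter Revenue Net income Q1-2025 120 15 Q2-2025 135
```

In the flattened string the empty cell disappears and row boundaries are gone, so nothing says which column 135 belongs to. In the HTML version, the `<th>`, `<tr>`, and `<td>` tags keep that relationship explicit.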

Second, LLMs have already encountered HTML during their pre-training. This means they have an inherent ability to understand HTML without the need for extensive fine-tuning. As modern LLMs are evolving to support longer input windows, it has become increasingly feasible to input more comprehensive HTML documents, including complex tables.

The HtmlRAG Workflow

Overview of the HtmlRAG pipeline

The HtmlRAG workflow is designed to make the most of HTML-formatted data. It starts with retrieving HTML documents, which can be the table-rich documents extracted by UnDatas.io. However, raw HTML documents often contain a lot of noise, such as CSS styles, JavaScript, and other elements that are not relevant to the knowledge extraction process.

To address this, HtmlRAG incorporates an HTML cleaning module. This module removes the extraneous content while preserving the essential structural and semantic information. Even after cleaning, the HTML document is often still too long for LLMs, so HtmlRAG further refines it using a two-step, block-tree-based pruning strategy.
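The cleaning idea can be sketched with Python's standard `html.parser`. This is a minimal illustration, not HtmlRAG's actual cleaning module: the `HtmlCleaner` class and `clean_html` function are hypothetical names, and dropping every tag attribute is an assumption about what counts as noise.

```python
from html import escape
from html.parser import HTMLParser

class HtmlCleaner(HTMLParser):
    """Drops <script>/<style> content, comments, and all tag attributes."""
    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0  # > 0 while inside a skipped element

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(f"<{tag}>")  # attributes are intentionally discarded

    def handle_endtag(self, tag):
        if tag in self.SKIP:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep visible text only; whitespace-only runs are dropped.
        if self.skip_depth == 0 and data.strip():
            self.out.append(escape(data.strip(), quote=False))

    # Comments are dropped because the default handle_comment does nothing.

def clean_html(raw: str) -> str:
    cleaner = HtmlCleaner()
    cleaner.feed(raw)
    cleaner.close()
    return "".join(cleaner.out)

raw = (
    '<div class="x" style="color:red"><style>.x{color:red}</style>'
    "<script>var a = 1;</script><!-- note -->"
    '<table id="t"><tr><td>Q1-2025</td><td>120</td></tr></table></div>'
)
cleaned = clean_html(raw)
```

Here the table structure survives while the stylesheet, script, comment, and attributes are removed, which shortens the input without touching the cells an LLM needs.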

The first step involves pruning based on text embedding. This step calculates the similarity between different parts of the HTML document and the user’s query, removing less relevant blocks. The second step, generative fine-grained block pruning, uses a generative model to further refine the HTML, ensuring that only the most relevant information is retained.
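The first pruning step can be approximated as follows. This sketch makes two loud simplifications: it scores a flat list of HTML blocks rather than a block tree, and it uses word-count vectors with cosine similarity as a stand-in for a real text embedding model. The names `embed`, `cosine`, and `prune_blocks` are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts, tags stripped."""
    text = re.sub(r"<[^>]+>", " ", text)
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def prune_blocks(blocks, query, keep=2):
    """Keep the `keep` blocks most similar to the query, in document order."""
    q = embed(query)
    ranked = sorted(
        range(len(blocks)),
        key=lambda i: cosine(embed(blocks[i]), q),
        reverse=True,
    )
    return [blocks[i] for i in sorted(ranked[:keep])]

blocks = [
    "<p>Company history and mission statement</p>",
    "<table><tr><th>Quarter</th><th>Revenue</th></tr>"
    "<tr><td>Q1-2025</td><td>120</td></tr></table>",
    "<p>Contact us by email</p>",
]
kept = prune_blocks(blocks, "revenue in Q1-2025", keep=1)
```

With the made-up blocks above, only the table block shares terms with the query, so it is the one that is kept. A production system would swap `embed` for a real embedding model and stop pruning once the remaining HTML fits the context budget.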

UnDatas.io: A Gateway to Structured Data in HTML

UnDatas.io is a powerful platform that has been making waves in the data analysis and RAG space. One of its key strengths lies in its ability to extract data from various sources, including PDF documents. Notably, when it comes to table extraction, UnDatas.io doesn’t just provide plain text; it preserves the data in HTML format. This is a significant advantage because HTML retains the structural and semantic information of the table, which is often lost during the conversion to plain text. For example, in a financial report, tables might contain crucial financial figures, and the relationships between different columns and rows are vital for accurate analysis. UnDatas.io ensures that all this information is intact in the extracted HTML-based table data. This HTML-formatted table data becomes the foundation for more informed and accurate interactions with LLMs in the RAG framework.

Data Extraction Results

rendered HTML
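Because the extracted table stays in HTML, its row and column relationships can be recovered programmatically. The sketch below uses Python's standard `html.parser` on a made-up table; `TableRows` and `table_to_records` are hypothetical helpers, not part of the UnDatas.io SDK, and the code assumes a simple table whose first row is the header.

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collects the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read
        self._cell = None  # text fragments of the cell being read

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def table_to_records(html: str):
    """Return one dict per body row, keyed by the header row's cell text."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]

sample = (
    "<table><tr><th>Quarter</th><th>Revenue</th></tr>"
    "<tr><td>Q1-2025</td><td>120</td></tr>"
    "<tr><td>Q2-2025</td><td></td></tr></table>"
)
records = table_to_records(sample)
```

Each record ties a value to its column header, and an empty cell stays an empty string instead of silently vanishing, which is exactly the information a plain-text dump loses.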

Experimental Validation

The effectiveness of HtmlRAG has been thoroughly tested in experiments. Researchers have conducted tests on six different QA datasets, comparing HtmlRAG with various baselines. The results are compelling: HtmlRAG outperforms traditional RAG systems that rely on plain text in most cases. When using HTML-formatted table data from UnDatas.io in the HtmlRAG framework, LLMs are able to generate more accurate answers, reducing the incidence of hallucination.

For instance, in datasets where questions require extracting specific information from tables, HtmlRAG’s use of HTML-based table data enables LLMs to precisely identify and extract the relevant information, leading to higher exact match scores and better overall performance.
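For readers unfamiliar with the metric, here is one common way to compute an exact match score, using the answer normalization popularized by SQuAD-style evaluation (lowercasing, removing punctuation and English articles). This is a sketch of a widely used definition, not necessarily the exact variant used in the HtmlRAG experiments; `normalize` and `exact_match` are illustrative names.

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    answer = answer.lower()
    answer = "".join(ch for ch in answer if ch not in string.punctuation)
    answer = re.sub(r"\b(a|an|the)\b", " ", answer)
    return " ".join(answer.split())

def exact_match(prediction: str, gold_answers) -> bool:
    """True if the normalized prediction equals any normalized gold answer."""
    return any(normalize(prediction) == normalize(g) for g in gold_answers)

def exact_match_score(predictions, gold_lists) -> float:
    """Fraction of predictions that exactly match one of their gold answers."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_lists))
    return hits / len(predictions) if predictions else 0.0
```

Under this definition a prediction such as "$120 Million" matches the gold answer "120 million", while a full sentence containing the figure does not, which is why precise extraction from the table matters for the score.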

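from openai import OpenAI

# Assumption: the standard OpenAI client, which reads OPENAI_API_KEY from the environment.
client = OpenAI()

# `result` is assumed to hold the UnDatas.io extraction output from an earlier
# step, with the HTML-formatted table data in `result.data`.
query = 'The amount of revenue in Q1-2025.'

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        # The extracted HTML is passed to the model inside the system prompt.
        {"role": "system", "content": "You are a data analysis expert. Please extract information from the data provided by the user. Note that only the information asked by the user should be returned, and nothing else should be returned. Data: %s" % (result.data,)},
        {"role": "user", "content": query},
    ],
    stream=False,
)

# The answer is the message text of the first returned choice.
res_data = response.choices[0].message.content
res_data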

Code running result:

Conclusion and Future Outlook

In conclusion, the combination of UnDatas.io’s HTML-based table data extraction and HtmlRAG’s innovative approach to using HTML in RAG systems is a powerful solution for enhancing the performance of LLMs and reducing hallucination. As the field of natural language processing continues to evolve, we can expect HtmlRAG to play an even more significant role in shaping the future of RAG-based applications.

Future research could focus on further optimizing the HtmlRAG workflow, exploring how to better integrate other types of structured data in HTML format, and improving the efficiency of the pruning algorithms. With these advancements, we can look forward to more reliable and accurate language-based applications that leverage the full potential of HTML-enhanced RAG.
