2024 Build LLM Applications: Preprocessing Unstructured Data [2 min HTML Data Extraction]

In the era of large language models (LLMs) and AI applications, one critical challenge is handling unstructured data effectively. Whether it's text from web pages, PDFs, or other sources, transforming this raw information into a structured format is crucial for training and using LLMs effectively.

What Does a Document Contain?

Documents come in various formats, but at their core, they encapsulate information organized in a hierarchy of sections, paragraphs, lists, tables, and more. This structure is often represented using markup languages like HTML or XML, which provide semantic tags to delineate the different elements.

Consider an HTML file from a popular blogging platform like Medium. It might contain titles, author information, body text, images, and more. While the raw HTML is human-readable, it's not the most efficient format for an LLM to consume and process.

How to Extract Data from an HTML File?

The key is to transform the unstructured HTML into a structured representation, such as JSON or a Python data structure. This process, known as data extraction, involves parsing the HTML and identifying the relevant elements and their hierarchical relationships.
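To make the idea of "parsing the HTML and identifying the relevant elements" concrete, here is a minimal sketch that uses only Python's standard html.parser module. It is not how the Unstructured library works internally; the SimpleExtractor class, the tag-to-type mapping, and the sample HTML are all assumptions made up for illustration.

```python
from html.parser import HTMLParser


class SimpleExtractor(HTMLParser):
    """Hypothetical helper: collects the text of heading and paragraph
    tags as a list of {"type", "text"} records."""

    # Illustrative mapping from HTML tags to element type labels.
    TAG_TYPES = {"h1": "Title", "h2": "Title", "h3": "Title", "p": "NarrativeText"}

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # type label of the tag we are inside, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        if tag in self.TAG_TYPES:
            self._current = self.TAG_TYPES[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Only close the record on a heading/paragraph end tag, so inline
        # tags such as <b> inside a paragraph do not split the text.
        if self._current is not None and tag in self.TAG_TYPES:
            text = "".join(self._buffer).strip()
            if text:
                self.records.append({"type": self._current, "text": text})
            self._current = None


sample_html = "<html><body><h1>My Post</h1><p>Hello <b>world</b>.</p></body></html>"
extractor = SimpleExtractor()
extractor.feed(sample_html)
records = extractor.records
print(records)
```

A real page has far more edge cases (nested blocks, scripts, navigation), which is why a dedicated library is the practical choice.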

Here's an example of how we can use the partition_html function from the Unstructured library to extract elements from an HTML file:

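from unstructured.partition.html import partition_html
import json

# Path to the HTML file to process
filename = "example_files/medium_blog.html"
# Parse the HTML into a list of Element objects
elements = partition_html(filename=filename)
# Convert each Element into a plain Python dictionary
element_dict = [el.to_dict() for el in elements]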
        
The partition_html function takes the filename as input and returns a list of Element objects, each representing a specific element in the HTML structure (e.g., titles, paragraphs, images).

We then convert each Element object into a Python dictionary using the to_dict method, resulting in a list of dictionaries (element_dict). This structured representation makes it easier to process and analyze the data using Python or pass it to an LLM.
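As a rough illustration of passing the structured data to an LLM, the sketch below joins element text into a single context string. The sample records are hand-written with the same type and text keys that to_dict() output carries, and build_context is a hypothetical helper, not part of the Unstructured library.

```python
# Hand-written sample records; the values are made up for illustration.
element_dict = [
    {"type": "Title", "text": "Preprocessing Unstructured Data"},
    {"type": "NarrativeText", "text": "HTML pages mix content with navigation."},
    {"type": "Title", "text": "Sign up"},
]


def build_context(elements, skip_texts=frozenset({"Sign up", "Open in app"})):
    """Hypothetical helper: join element text into one block, one line per
    element, tagging each line with its type and dropping known UI labels."""
    lines = [
        f"[{el['type']}] {el['text']}"
        for el in elements
        if el["text"] not in skip_texts
    ]
    return "\n".join(lines)


context = build_context(element_dict)
print(context)
```

Dropping button labels by exact text is a deliberately crude filter; it only shows that plain dictionaries are easy to clean up before they reach a prompt.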

If we print element_dict, the output begins as follows (only the first two elements are shown):

[{'type': 'Title', 'element_id': '7100b12091b2d2bea5e2d50c46ba4438', 'text': 'Open in app', 'metadata': {'category_depth': 0, 'last_modified': '2024-07-01T10:04:41', 'link_texts': ['Open in app'], 'link_urls': ['https://rsci.app.link/?%24canonical_url=https%3A%2F%2Fmedium.com%2Fp%2F6c2659eda4af&%7Efeature=LoOpenInAppButton&%7Echannel=ShowPostUnderCollection&source=---two_column_layout_nav----------------------------------'], 'page_number': 1, 'languages': ['eng'], 'file_directory': 'example_files', 'filename': 'medium_blog.html', 'filetype': 'text/html'}}, {'type': 'Title', 'element_id': '5e2b8e96503d722e7ebf61b9bc3e9988', 'text': 'Sign up', 'metadata': {'category_depth': 0, 'last_modified': '2024-07-01T10:04:41', 'emphasized_text_contents': ['Sign up'], 'emphasized_text_tags': ['span'], 'page_number': 1, 'languages': ['eng'], 'file_directory': 'example_files', 'filename': 'medium_blog.html', 'filetype': 'text/html'}}        

Let's pretty-print the first element of the extracted data:

example_output = json.dumps(element_dict[0:1], indent=2)
print(example_output)        

As you can see, the extracted data is presented in a structured JSON format, making it easier to process and analyze. Each element is represented as a dictionary with keys like type, element_id, text, and metadata. The metadata field contains additional information about the element, such as its category depth, last-modified date, page number, and source filename.
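The keys described above can be read with ordinary dictionary access. The following sketch uses two records abbreviated from the printed output above (long URLs and some metadata keys are omitted); the summaries list is just an illustrative name.

```python
# Two records abbreviated from the printed output above.
element_dict = [
    {
        "type": "Title",
        "text": "Open in app",
        "metadata": {"category_depth": 0, "link_texts": ["Open in app"],
                     "page_number": 1, "filename": "medium_blog.html"},
    },
    {
        "type": "Title",
        "text": "Sign up",
        "metadata": {"category_depth": 0, "emphasized_text_tags": ["span"],
                     "page_number": 1, "filename": "medium_blog.html"},
    },
]

summaries = []
for el in element_dict:
    meta = el["metadata"]
    # Not every element carries every metadata key, so use .get() with a default.
    has_links = bool(meta.get("link_texts", []))
    summaries.append((el["type"], el["text"], meta["page_number"], has_links))

for row in summaries:
    print(row)
```

Note that the two elements do not share the same metadata keys (only the first has link_texts), so defensive access with .get() is a reasonable default.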


In Summary:

Three steps are crucial for transforming unstructured HTML data into a structured, machine-readable format: parsing the file with partition_html, converting the elements with to_dict(), and serializing the result with json.dumps.

The first step is:

elements = partition_html(filename=filename)        

This line of code uses the partition_html function from the Unstructured library to parse the HTML file specified by filename and extract its elements (titles, paragraphs, images, etc.). The partition_html function returns a list of Element objects, where each object represents a specific element in the HTML structure.

The purpose of this step is to transform the unstructured HTML data into a more structured representation. HTML is designed for browsers to render into visually appealing pages, and its markup mixes the content with navigation, styling, and other page furniture, which makes it challenging for machines to process directly. Parsing the HTML and extracting its elements gives us a representation that programs can work with.

The second step is:

element_dict = [el.to_dict() for el in elements]        

This line of code iterates over the list of Element objects returned by partition_html and converts each Element object into a Python dictionary using the to_dict() method. The resulting element_dict is a list of dictionaries, where each dictionary represents an HTML element with its associated metadata.

The reason for this step is to further enhance the structured representation of the data. While the Element objects provide a structured way to represent HTML elements, converting them to dictionaries makes the data even more accessible and easier to work with in Python. Dictionaries are a fundamental data structure in Python, and they provide a convenient way to store and access key-value pairs.

By converting the Element objects to dictionaries, we can more easily access and manipulate the data using standard Python operations. For example, we can iterate over the list of dictionaries, filter or sort the elements based on specific criteria, or pass the structured data to other Python libraries or Machine Learning models.
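The operations mentioned above (iterating, filtering, sorting) might look like the following sketch. The sample list is hand-written in the same shape as element_dict; its values, and the NarrativeText type label used for body text, are illustrative assumptions.

```python
from collections import Counter

# Hand-written sample in the same shape as element_dict.
element_dict = [
    {"type": "Title", "text": "Sign up", "metadata": {"category_depth": 0}},
    {"type": "NarrativeText", "text": "Structured data is easier to query.", "metadata": {}},
    {"type": "Title", "text": "Data Extraction", "metadata": {"category_depth": 1}},
    {"type": "NarrativeText", "text": "Short note.", "metadata": {}},
]

# Filter: keep only the text of the titles.
titles = [el["text"] for el in element_dict if el["type"] == "Title"]

# Sort: order narrative text from longest to shortest.
narrative = sorted(
    (el for el in element_dict if el["type"] == "NarrativeText"),
    key=lambda el: len(el["text"]),
    reverse=True,
)

# Count: how many elements of each type were extracted.
type_counts = Counter(el["type"] for el in element_dict)

print(titles)
print([el["text"] for el in narrative])
print(type_counts)
```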

The final step is:

example_output = json.dumps(element_dict[11:15], indent=2)        

This line of code demonstrates how we can work with the structured data. It takes a slice of element_dict (indices 11 through 14; the end index 15 is excluded) and converts it to a JSON string using the json.dumps function from the built-in json module. The earlier example used element_dict[0:1] to show only the first element; any slice works the same way. The indent=2 argument pretty-prints the resulting JSON string with 2-space indentation for better readability.
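The slicing and indentation behavior can be checked on a small stand-in list. In this sketch the 20 records are generated placeholders, not real extracted elements.

```python
import json

# Stand-in list of 20 tiny records; real element dictionaries carry more keys.
element_dict = [{"type": "Title", "text": f"element {i}"} for i in range(20)]

subset = element_dict[11:15]            # indices 11, 12, 13, 14 (15 is excluded)
compact = json.dumps(subset)            # everything on a single line
pretty = json.dumps(subset, indent=2)   # nested levels indented by 2 spaces

print(len(subset))
print(pretty)
```

Both strings decode to the same data; indent only changes the whitespace, so it is a readability choice rather than a change in content.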

JSON (JavaScript Object Notation) is a widely used data format for representing structured data. By converting the structured data to JSON, we can easily share or store it, integrate it with other systems or APIs, or even use it as input for machine learning models that accept JSON data.
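Storing the JSON and loading it again could be sketched as follows. The records are abbreviated from the article's output, and the elements.json file name and the temporary directory are assumptions chosen so the example leaves no files behind.

```python
import json
import tempfile
from pathlib import Path

# Records abbreviated from the output shown earlier in the article.
element_dict = [
    {"type": "Title", "text": "Open in app", "metadata": {"page_number": 1}},
    {"type": "Title", "text": "Sign up", "metadata": {"page_number": 1}},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "elements.json"  # hypothetical output file name
    out_path.write_text(json.dumps(element_dict, indent=2), encoding="utf-8")
    # Reading the file back yields the same list of dictionaries.
    restored = json.loads(out_path.read_text(encoding="utf-8"))

print(restored == element_dict)
```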

This structured representation paves the way for further processing and analysis, such as training LLMs, building search engines, or creating knowledge bases. By transforming unstructured data into a structured format, we unlock the power of AI and enable a wide range of applications.
