2024 Build LLM Applications: Preprocessing Unstructured Data [2 min HTML Data Extraction]

Yiman H.

Gen AI Developer | Full-Stack Developer | Changing the World with AI | My Bilibili: @德国Viviane

Published: July 2, 2024

In the era of large language models (LLMs) and AI applications, one critical challenge is effectively handling unstructured data. Whether it's text from web pages, PDFs, or other sources, transforming this raw information into a structured format is crucial for training and utilizing LLMs effectively.

What Does a Document Contain?

Documents come in various formats, but at their core, they encapsulate information organized in a hierarchy of sections, paragraphs, lists, tables, and more. This structure is often represented using markup languages like HTML or XML, which provide semantic tags to delineate the different elements.

Consider an HTML file from a popular blogging platform like Medium. It might contain titles, author information, body text, images, and more. While the raw HTML is human-readable, it's not the most efficient format for an LLM to consume and process.

How to Extract Data from an HTML File?

The key is to transform the unstructured HTML into a structured representation, such as JSON or a Python data structure. This process, known as data extraction, involves parsing the HTML and identifying the relevant elements and their hierarchical relationships.
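To make the idea concrete before turning to a library, here is a minimal sketch that uses only Python's standard html.parser module. It is not how the Unstructured library works internally; the TAG_TYPES mapping, the SimpleExtractor class, and the sample HTML are all assumptions made up for illustration.

```python
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to element types (an assumption for
# illustration, not the Unstructured library's actual rules).
TAG_TYPES = {"h1": "Title", "h2": "Title", "p": "NarrativeText", "li": "ListItem"}


class SimpleExtractor(HTMLParser):
    """Collect the text of selected tags as a flat list of dictionaries."""

    def __init__(self):
        super().__init__()
        self.elements = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []     # text fragments seen inside that tag

    def handle_starttag(self, tag, attrs):
        # Start capturing only when no other tracked tag is open.
        if tag in TAG_TYPES and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == self._current:
            # Join the fragments and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.elements.append({"type": TAG_TYPES[tag], "text": text})
            self._current = None


def extract_elements(html):
    parser = SimpleExtractor()
    parser.feed(html)
    parser.close()
    return parser.elements


sample_html = """
<html><body>
  <h1>Preprocessing Unstructured Data</h1>
  <p>HTML mixes <b>content</b> with markup.</p>
  <ul><li>Parse the file</li><li>Collect the elements</li></ul>
</body></html>
"""

for element in extract_elements(sample_html):
    print(element)
```

Real pages need far more care (nested blocks, navigation noise, metadata), which is the reason to reach for a dedicated library.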

Here's an example of how we can use the partition_html function from the Unstructured library to extract elements from an HTML file:

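import json

from unstructured.partition.html import partition_html

# Path to the HTML file to process
filename = "example_files/medium_blog.html"
# Parse the HTML into a list of Element objects
elements = partition_html(filename=filename)
# Convert each Element into a plain Python dictionary
element_dict = [el.to_dict() for el in elements]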

The partition_html function takes the filename as input and returns a list of Element objects, each representing a specific element in the HTML structure (e.g., titles, paragraphs, images).

We then convert each Element object into a Python dictionary using the to_dict method, resulting in a list of dictionaries (element_dict). This structured representation makes it easier to process and analyze the data using Python or pass it to an LLM.

If we print element_dict, the output begins as follows:

[{'type': 'Title', 'element_id': '7100b12091b2d2bea5e2d50c46ba4438', 'text': 'Open in app', 'metadata': {'category_depth': 0, 'last_modified': '2024-07-01T10:04:41', 'link_texts': ['Open in app'], 'link_urls': ['https://rsci.app.link/?%24canonical_url=https%3A%2F%2Fmedium.com%2Fp%2F6c2659eda4af&%7Efeature=LoOpenInAppButton&%7Echannel=ShowPostUnderCollection&source=---two_column_layout_nav----------------------------------'], 'page_number': 1, 'languages': ['eng'], 'file_directory': 'example_files', 'filename': 'medium_blog.html', 'filetype': 'text/html'}}, {'type': 'Title', 'element_id': '5e2b8e96503d722e7ebf61b9bc3e9988', 'text': 'Sign up', 'metadata': {'category_depth': 0, 'last_modified': '2024-07-01T10:04:41', 'emphasized_text_contents': ['Sign up'], 'emphasized_text_tags': ['span'], 'page_number': 1, 'languages': ['eng'], 'file_directory': 'example_files', 'filename': 'medium_blog.html', 'filetype': 'text/html'}}

Let's inspect a few elements from the extracted data:

example_output = json.dumps(element_dict[0:1], indent=2)
print(example_output)

As you can see, the extracted data is presented in a structured JSON format, making it easier to process and analyze. Each element is represented as a dictionary with keys like type, text, and metadata. The metadata field contains additional information about the element, such as its category depth, modification date, page number, and more.
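As a small sketch of reading those keys, the snippet below uses two records abbreviated from the printed output above, so it runs without the Unstructured library installed. Only a few metadata keys are kept, and the variable names type_counts, meta, and links are my own.

```python
from collections import Counter

# Two records abbreviated from the printed output earlier in the article;
# only a few metadata keys are kept here for brevity.
element_dict = [
    {
        "type": "Title",
        "element_id": "7100b12091b2d2bea5e2d50c46ba4438",
        "text": "Open in app",
        "metadata": {
            "category_depth": 0,
            "link_texts": ["Open in app"],
            "page_number": 1,
            "filename": "medium_blog.html",
        },
    },
    {
        "type": "Title",
        "element_id": "5e2b8e96503d722e7ebf61b9bc3e9988",
        "text": "Sign up",
        "metadata": {
            "category_depth": 0,
            "emphasized_text_tags": ["span"],
            "page_number": 1,
            "filename": "medium_blog.html",
        },
    },
]

# Count how many elements of each type were extracted.
type_counts = Counter(el["type"] for el in element_dict)
print(type_counts)

for el in element_dict:
    meta = el["metadata"]
    # Metadata keys vary per element, so .get() avoids a KeyError.
    links = meta.get("link_texts", [])
    print(el["text"], "| page", meta["page_number"], "| links:", links)
```

Note that the two records do not share the same metadata keys (only the first has link_texts), so defensive access with .get() is a reasonable default.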

In Summary:

Three steps are crucial for transforming unstructured HTML data into a structured, machine-readable format: parsing the file with partition_html, converting each element with to_dict(), and serializing the result with json.dumps.

The first step is:

elements = partition_html(filename=filename)

This line of code uses the partition_html function from the Unstructured library to parse the HTML file specified by filename and extract its elements (titles, paragraphs, images, etc.). The partition_html function returns a list of Element objects, where each object represents a specific element in the HTML structure.

The purpose of this step is to transform the unstructured HTML data into a more structured representation. HTML is designed for browsers to render pages for human readers, so the content is interleaved with layout and navigation markup that is hard for machines to process directly. By parsing the HTML and extracting its elements, we can create a more structured and machine-readable representation of the data.

The second step is:

element_dict = [el.to_dict() for el in elements]

This line of code iterates over the list of Element objects returned by partition_html and converts each Element object into a Python dictionary using the to_dict() method. The resulting element_dict is a list of dictionaries, where each dictionary represents an HTML element with its associated metadata.

The reason for this step is to further enhance the structured representation of the data. While the Element objects provide a structured way to represent HTML elements, converting them to dictionaries makes the data even more accessible and easier to work with in Python. Dictionaries are a fundamental data structure in Python, and they provide a convenient way to store and access key-value pairs.

By converting the Element objects to dictionaries, we can more easily access and manipulate the data using standard Python operations. For example, we can iterate over the list of dictionaries, filter or sort the elements based on specific criteria, or pass the structured data to other Python libraries or machine learning models.
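The operations just described can be sketched as follows. The records are hypothetical (the texts are made up, in the same shape as element_dict), and filter_by_type is a helper name introduced here for illustration, not part of the Unstructured library.

```python
# Hypothetical records in the same shape as element_dict; the texts are made up.
element_dict = [
    {"type": "Title", "text": "Sign up", "metadata": {"category_depth": 0}},
    {"type": "NarrativeText", "text": "Documents organize information in a hierarchy.", "metadata": {}},
    {"type": "Title", "text": "What Does a Document Contain?", "metadata": {"category_depth": 1}},
    {"type": "NarrativeText", "text": "Short note.", "metadata": {}},
]


def filter_by_type(elements, element_type):
    """Return only the elements whose 'type' matches element_type."""
    return [el for el in elements if el["type"] == element_type]


titles = filter_by_type(element_dict, "Title")
paragraphs = filter_by_type(element_dict, "NarrativeText")

# Sort paragraphs by text length, longest first.
longest_first = sorted(paragraphs, key=lambda el: len(el["text"]), reverse=True)

# Join the paragraph text in document order, e.g. to place it in an LLM prompt.
body_text = "\n\n".join(el["text"] for el in paragraphs)

print([el["text"] for el in titles])
print(longest_first[0]["text"])
print(body_text)
```

Filtering by type is a simple way to drop navigation fragments such as "Sign up" before the text reaches a model, though in practice the right filter depends on the page.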

The final step is:

example_output = json.dumps(element_dict[11:15], indent=2)

This line of code demonstrates how we can work with the structured data. It takes a slice of element_dict (from index 11 to 15, excluding 15) and converts it to a JSON string using the json.dumps function from the built-in json module. The indent=2 argument ensures that the resulting JSON string is pretty-printed with a 2-space indentation for better readability.

JSON (JavaScript Object Notation) is a widely used data format for representing structured data. By converting the structured data to JSON, we can easily share or store it, integrate it with other systems or APIs, or even use it as input for machine learning models that accept JSON data.
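Here is a minimal sketch of the slice and the JSON round trip, using placeholder records so it runs without the Unstructured library. The "Heading N" texts are invented, and ensure_ascii=False is an optional addition of mine that keeps non-ASCII text readable in the output.

```python
import json

# Placeholder records standing in for element_dict (texts are invented).
element_dict = [
    {"type": "Title", "text": f"Heading {i}", "metadata": {"page_number": 1}}
    for i in range(20)
]

# The slice [11:15] selects indexes 11, 12, 13, and 14 (15 is excluded).
subset = element_dict[11:15]

# Serialize to a pretty-printed JSON string.
example_output = json.dumps(subset, indent=2, ensure_ascii=False)
print(example_output)

# Parsing the string back yields equal Python objects.
restored = json.loads(example_output)
print(restored == subset)
```

Because the dictionaries contain only strings, numbers, lists, and nested dictionaries, the round trip through json.dumps and json.loads is lossless here.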

This structured representation paves the way for further processing and analysis, such as training LLMs, building search engines, or creating knowledge bases. By transforming unstructured data into a structured format, we unlock the power of AI and enable a wide range of applications.
