2024 Build LLM Applications: Preprocessing Unstructured Data [2 min PPT/PDF/EXCEL Data Extraction]

Yiman H.

Gen AI Developer | Full-Stack Developer | Changing the World with AI | My Bilibili @ 德国Viviane

Published: July 3, 2024

In the ever-evolving landscape of AI and large language models (LLMs), one critical challenge is handling diverse types of unstructured data effectively. Plain text is well supported, but rich formats such as PowerPoint presentations, PDFs, and Excel spreadsheets each bring their own layout and structure, which makes data extraction and preprocessing harder.

The Importance of Data Extraction

To get the most out of LLMs, we need to transform unstructured data into a structured format that these models can process efficiently. This is where data extraction comes in: it pulls the meaningful content out of each file format and represents it in a form that AI models can consume.

Extracting Data from PowerPoint Presentations

So how do we extract information from a PowerPoint file?

Let's start with an example of extracting data from a PowerPoint presentation using the Unstructured library. Imagine you have a presentation file named "msft_openai.pptx" that contains valuable information you want to leverage for your AI application. Here's how you can extract the data:

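import json
from unstructured.partition.pptx import partition_pptx

filename = "example_files/msft_openai.pptx"
# Split the presentation into a list of Element objects
elements = partition_pptx(filename=filename)
# Convert each element to a dictionary so it can be serialized as JSON
element_dict = [el.to_dict() for el in elements]
print(json.dumps(element_dict, indent=2))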

The partition_pptx function from the Unstructured library takes the PowerPoint file as input and returns a list of Element objects, each representing one piece of the presentation (e.g., titles, narrative text, list items).

We then convert each Element object into a Python dictionary using the to_dict method, resulting in a list of dictionaries (element_dict). This structured representation allows us to work with the data more efficiently, filter it, or pass it to an LLM for further processing.
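As a rough illustration of the filtering step, the sketch below works on a small hand-written list shaped like the output of to_dict(). The sample records and the helper name filter_by_type are made up for this example and are not taken from msft_openai.pptx; the "type", "text", and "metadata" keys are assumed to match what Unstructured elements typically produce.

```python
# Hypothetical sample data shaped like the output of el.to_dict();
# real output carries more metadata fields.
sample_elements = [
    {"type": "Title", "text": "ChatGPT", "metadata": {"page_number": 1}},
    {"type": "ListItem", "text": "Chatbot built on a large language model", "metadata": {"page_number": 1}},
    {"type": "Title", "text": "Open Source Models", "metadata": {"page_number": 2}},
    {"type": "NarrativeText", "text": "Several vendors release model weights.", "metadata": {"page_number": 2}},
]


def filter_by_type(elements, wanted_type):
    """Return only the element dictionaries whose "type" equals wanted_type."""
    return [el for el in elements if el.get("type") == wanted_type]


# Keep only the slide titles and print them with their slide number
titles = filter_by_type(sample_elements, "Title")
for el in titles:
    print(el["metadata"]["page_number"], el["text"])
```

The same helper could be pointed at element_dict from the example above to pull out only the titles, list items, or narrative text before passing them to an LLM.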

Finally, we use the json.dumps function to convert the list of dictionaries into a JSON string, which can be easily shared, stored, or integrated with other systems or APIs.
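To make the "stored and shared" part concrete, here is a minimal sketch that writes the element dictionaries to a JSON file and reads them back. The helper names save_elements and load_elements, the file name elements.json, and the one-record sample list are assumptions for illustration only.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical stand-in for the list built from el.to_dict()
element_dict = [
    {"type": "Title", "text": "Example slide title", "metadata": {"page_number": 1}},
]


def save_elements(elements, path):
    """Write the element dictionaries to a UTF-8 encoded JSON file."""
    text = json.dumps(elements, indent=2, ensure_ascii=False)
    Path(path).write_text(text, encoding="utf-8")


def load_elements(path):
    """Read the element dictionaries back from a JSON file."""
    return json.loads(Path(path).read_text(encoding="utf-8"))


# Round-trip through a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "elements.json"
    save_elements(element_dict, out_path)
    restored = load_elements(out_path)

print(restored == element_dict)
```

Passing ensure_ascii=False keeps non-ASCII slide text readable in the saved file instead of escaping it.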

Extracting Data from PDFs

PDFs are another common file format that often contains valuable information. Let's explore how to extract data from a PDF file named "CoT.pdf" using the Unstructured API:

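import json
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# The API key and server URL are read from environment variables;
# the variable names used here are placeholders.
s = UnstructuredClient(
    api_key_auth=os.environ["UNSTRUCTURED_API_KEY"],
    server_url=os.environ["UNSTRUCTURED_API_URL"],
)

filename = "example_files/CoT.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",               # high-resolution, layout-aware strategy
    pdf_infer_table_structure=True,  # try to recover table structure
    languages=["eng"],
)

# Note: this call shape follows the unstructured-client releases used in the
# referenced course; newer releases may expect a wrapped request object.
try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements, indent=2))
except SDKError as e:
    print(e)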

In this example, we create a client for the Unstructured API, authenticated with an API key, and wrap the content of the PDF file in a Files object. We then define a PartitionParameters object that specifies the file to process, the partitioning strategy ('hi_res' for high resolution), whether to infer table structures within the PDF, and the document language.

Next, we call the partition method with the PartitionParameters object. The API extracts the elements from the PDF and returns them in the response as resp.elements, a list of dictionaries rather than Element objects.

Finally, because the elements are plain dictionaries, we can print them directly as a JSON string with json.dumps. Any SDKError raised by the call is caught and printed. This structured representation can be further processed, analyzed, or integrated into your AI application.
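One reason to enable pdf_infer_table_structure is that table elements then usually carry an HTML rendering of the table in their metadata, under the key text_as_html. The sketch below shows one way to collect those tables. The sample records are hand-written and not taken from CoT.pdf, and the helper name extract_tables is hypothetical.

```python
# Hypothetical sample shaped like resp.elements; the values are invented.
sample_pdf_elements = [
    {"type": "Title", "text": "Results", "metadata": {"page_number": 3}},
    {
        "type": "Table",
        "text": "Model Accuracy A 0.50",
        "metadata": {
            "page_number": 3,
            "text_as_html": "<table><tr><td>Model</td><td>Accuracy</td></tr></table>",
        },
    },
    {"type": "NarrativeText", "text": "Accuracy improves with scale.", "metadata": {"page_number": 3}},
]


def extract_tables(elements):
    """Collect (page_number, html) pairs for every Table element that has HTML."""
    tables = []
    for el in elements:
        if el.get("type") != "Table":
            continue
        metadata = el.get("metadata", {})
        html = metadata.get("text_as_html")
        if html:  # skip tables for which no structure was inferred
            tables.append((metadata.get("page_number"), html))
    return tables


tables = extract_tables(sample_pdf_elements)
for page_number, html in tables:
    print(page_number, html)
```

Keeping the HTML alongside the page number makes it possible to hand a table to an LLM with its row and column structure intact, rather than as a flattened string.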

The Power of Structured Data

By transforming unstructured data from various file formats into a structured representation, we make it usable across a wide range of AI applications. Whether you are training LLMs, building search engines, or creating knowledge bases, consistent structured input removes much of the format-specific handling that each of these would otherwise need.
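As one possible next step toward a search index or knowledge base, extracted elements can be grouped into sections under their nearest preceding title. The sketch below is a plain-Python simplification written for this article, not the Unstructured library's own chunking; the helper name group_by_title and the sample records are hypothetical.

```python
def group_by_title(elements):
    """Group element texts into sections, starting a new section at each Title."""
    sections = []
    current = {"title": None, "texts": []}
    for el in elements:
        if el.get("type") == "Title":
            # Close the section in progress before starting a new one
            if current["title"] is not None or current["texts"]:
                sections.append(current)
            current = {"title": el.get("text"), "texts": []}
        else:
            current["texts"].append(el.get("text", ""))
    # Append the final section if it holds anything
    if current["title"] is not None or current["texts"]:
        sections.append(current)
    return sections


# Hypothetical element dictionaries in document order
sample_doc = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "LLMs need structured input."},
    {"type": "ListItem", "text": "PDFs"},
    {"type": "Title", "text": "Method"},
    {"type": "NarrativeText", "text": "Partition, then group."},
]

sections = group_by_title(sample_doc)
for section in sections:
    print(section["title"], "->", " ".join(section["texts"]))
```

Each resulting section can then be embedded or indexed as a single unit, so that retrieved text stays attached to the heading it appeared under.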

Reference: https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications
