2024 Build LLM Applications: Preprocessing Unstructured Data [2 min PPT/PDF/EXCEL Data Extraction]

In the ever-evolving landscape of AI and large language models (LLMs), one of the critical challenges we face is effectively handling diverse types of unstructured data. While plain text is commonly addressed, richer formats like PowerPoint presentations, PDFs, and Excel spreadsheets pose their own difficulties for data extraction and preprocessing.


The Importance of Data Extraction

To fully leverage the power of LLMs, we need to transform unstructured data into a structured format that these models can understand and process efficiently. This is where data extraction comes into play, enabling us to extract meaningful information from various file formats and represent it in a way that AI models can consume.

Extracting Data from PowerPoint Presentations

So how do we extract information from a slide deck like this one?

[Image: example PowerPoint slide, from deeplearning.ai]

Let's start with an example of extracting data from a PowerPoint presentation using the Unstructured library. Imagine you have a presentation file named "msft_openai.pptx" that contains valuable information you want to leverage for your AI application. Here's how you can extract the data:

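import json
from unstructured.partition.pptx import partition_pptx

# Partition the presentation into a list of Element objects
filename = "example_files/msft_openai.pptx"
elements = partition_pptx(filename=filename)

# Convert each element to a plain dictionary and print the result as JSON
element_dict = [el.to_dict() for el in elements]
print(json.dumps(element_dict, indent=2))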

The partition_pptx function from the Unstructured library takes the PowerPoint file as input and returns a list of Element objects, representing the different elements within the presentation (e.g., titles, narrative text, list items, tables).

We then convert each Element object into a Python dictionary using the to_dict method, resulting in a list of dictionaries (element_dict). This structured representation allows us to work with the data more efficiently, filter it, or pass it to an LLM for further processing.

Finally, we use the json.dumps function to convert the list of dictionaries into a JSON string, which can be easily shared, stored, or integrated with other systems or APIs.
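Once the elements are plain dictionaries, filtering them takes only a few lines. The sketch below is illustrative and runs without the Unstructured library: sample_elements is a hypothetical stand-in for element_dict, and it assumes each dictionary has "type", "text", and "metadata" keys, with the slide number stored under metadata["page_number"]. The helper names texts_of_type and group_text_by_page are made up for this example.

```python
from collections import defaultdict

# Hypothetical sample shaped like the dictionaries produced by to_dict();
# real output carries more metadata fields than shown here.
sample_elements = [
    {"type": "Title", "element_id": "e1", "text": "Slide one title",
     "metadata": {"page_number": 1}},
    {"type": "ListItem", "element_id": "e2", "text": "First bullet",
     "metadata": {"page_number": 1}},
    {"type": "ListItem", "element_id": "e3", "text": "Second bullet",
     "metadata": {"page_number": 1}},
    {"type": "Title", "element_id": "e4", "text": "Slide two title",
     "metadata": {"page_number": 2}},
]


def texts_of_type(element_dicts, wanted_type):
    """Return the text of every element whose "type" equals wanted_type."""
    return [el["text"] for el in element_dicts if el.get("type") == wanted_type]


def group_text_by_page(element_dicts):
    """Map each page (slide) number to the element texts found on it."""
    pages = defaultdict(list)
    for el in element_dicts:
        page = el.get("metadata", {}).get("page_number")
        pages[page].append(el["text"])
    return dict(pages)


titles = texts_of_type(sample_elements, "Title")
by_page = group_text_by_page(sample_elements)
```

With real data, you would pass element_dict in place of sample_elements, for example to hand only the slide titles to an LLM as an outline.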


Extracting Data from PDFs

PDFs are another common file format that often contains valuable information. Let's explore how to extract data from a PDF file named "CoT.pdf" using the Unstructured API:

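import json
import os

from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError

# Assumption: the API key is stored in an environment variable named
# UNSTRUCTURED_API_KEY; adjust this to however you manage secrets.
# This follows the unstructured-client interface used in the referenced
# course; newer client releases build the request differently.
s = UnstructuredClient(api_key_auth=os.environ["UNSTRUCTURED_API_KEY"])

# Read the PDF into a Files object for upload
filename = "example_files/CoT.pdf"
with open(filename, "rb") as f:
    files = shared.Files(content=f.read(), file_name=filename)

# Describe how the API should partition the document
req = shared.PartitionParameters(
    files=files,
    strategy="hi_res",
    pdf_infer_table_structure=True,
    languages=["eng"],
)

try:
    resp = s.general.partition(req)
    print(json.dumps(resp.elements, indent=2))
except SDKError as e:
    print(e)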

In this example, we first create a client for the Unstructured API, authenticated with an API key. We then read the PDF into a Files object and define a PartitionParameters object that specifies the file to process, the partitioning strategy ('hi_res', the high-resolution strategy, which uses layout detection to identify elements), the document language, and whether to infer table structure within the PDF.

Next, we call the partition method, passing the PartitionParameters object. The API extracts the elements from the PDF and returns them as a list of dictionaries, one per element.

Finally, we print that list as a JSON string using json.dumps. This structured representation can be further processed, analyzed, or integrated into your AI application.
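Because table inference was requested, one natural next step is to pull the tables out of the response. The sketch below is a minimal illustration under stated assumptions: it expects table elements to have "type" equal to "Table" and to carry an HTML rendering under metadata["text_as_html"], and it uses a hypothetical sample_pdf_elements list in place of resp.elements so that it runs on its own. The helper name extract_tables is made up for this example.

```python
# Hypothetical sample shaped like the element dictionaries returned by the API;
# the text and HTML values are placeholders, not real extraction output.
sample_pdf_elements = [
    {"type": "Title", "text": "Results", "metadata": {"page_number": 3}},
    {"type": "Table", "text": "Column A Column B 1 2",
     "metadata": {
         "page_number": 3,
         "text_as_html": "<table><tr><td>Column A</td><td>Column B</td></tr>"
                         "<tr><td>1</td><td>2</td></tr></table>",
     }},
    {"type": "NarrativeText", "text": "Some discussion of the table.",
     "metadata": {"page_number": 3}},
]


def extract_tables(element_dicts):
    """Collect page number, HTML, and plain text for every Table element."""
    tables = []
    for el in element_dicts:
        if el.get("type") != "Table":
            continue
        metadata = el.get("metadata") or {}
        tables.append({
            "page_number": metadata.get("page_number"),
            # May be None if no table structure was inferred for this element
            "html": metadata.get("text_as_html"),
            "text": el.get("text", ""),
        })
    return tables


tables = extract_tables(sample_pdf_elements)
```

Keeping the HTML alongside the plain text preserves the row and column structure, which is often easier for an LLM to interpret than the flattened cell text.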


The Power of Structured Data

By transforming unstructured data from various file formats into a structured representation, we unlock a world of possibilities for building powerful AI applications. Whether it's training LLMs, building search engines, or creating knowledge bases, having access to structured data is a game-changer.
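As one illustration of what structured elements make possible, the sketch below groups element text into sections, starting a new section at each Title element. This is a simplified plain-Python stand-in for the kind of chunking a search or knowledge-base pipeline might do, not the Unstructured library's own chunking; the function name chunk_by_section and the sample data are hypothetical.

```python
def chunk_by_section(element_dicts):
    """Group element text into sections, opening a new one at each Title."""
    sections = []
    current = None
    for el in element_dicts:
        text = (el.get("text") or "").strip()
        if not text:
            continue  # skip elements with no usable text
        if el.get("type") == "Title":
            current = {"title": text, "parts": []}
            sections.append(current)
        else:
            if current is None:
                # Text that appears before any title gets an untitled section
                current = {"title": None, "parts": []}
                sections.append(current)
            current["parts"].append(text)
    return [{"title": s["title"], "text": "\n".join(s["parts"])} for s in sections]


# Hypothetical element dictionaries with placeholder text
sample_doc = [
    {"type": "Title", "text": "Introduction"},
    {"type": "NarrativeText", "text": "First paragraph."},
    {"type": "NarrativeText", "text": "Second paragraph."},
    {"type": "Title", "text": "Method"},
    {"type": "ListItem", "text": "Step one"},
]

chunks = chunk_by_section(sample_doc)
```

Each resulting chunk pairs a heading with its body text, a convenient unit to embed for search or to store as a knowledge-base entry.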

Reference: https://learn.deeplearning.ai/courses/preprocessing-unstructured-data-for-llm-applications

Author: Yiman H.