Using Kor (LangChain Extension), Generative Language Models & Prompt Engineering
Photo by the author: a sunny day at Kerry Park with a view of Downtown and Mount Rainier, Seattle, Washington (April 2023)


Author: Kaushik Shakkari (reposted with the author's permission; original post)

Background:

In this digital age, businesses are immersed in a sea of data. Every interaction, transaction, and online engagement generates vast amounts of data. This data comes in various forms like text, images, audio, and video. While some data is structured and well-organized, a significant portion remains unstructured or semi-structured, presenting unique business challenges and opportunities.

Unstructured data refers to information that lacks a predefined format or organization. It includes sources like emails, social media posts, customer feedback, and even multimedia content. Semi-structured data, on the other hand, possesses some structure but lacks the level of organization found in structured data. Examples of semi-structured data include invoices, utility bills, and resumes.

Interested in learning about searching across unstructured data? Check out my 10-article semantic search series!

Businesses can leverage unstructured and semi-structured data to gain valuable insights, enhance competitiveness, and improve decision-making. However, extracting meaningful and actionable information from this vast, untamed data landscape is challenging and requires advanced techniques to identify, organize, and extract relevant information.


Introduction to Generative Models, LangChain & Kor:

The use of generative models is growing in popularity. These models, such as GPT, can generate text that closely resembles human writing when given a prompt. This makes them highly useful for applications such as information extraction, question answering, and text summarization.

LangChain is an open-source platform that simplifies developing applications using language models. Its reusable components and integrations with external data enable users to build pipelines for complex applications quickly. Kor is a library built on LangChain that helps extract text from unstructured and semi-structured data into a custom structured format.

In my last article, I introduced generative models and LangChain. In this article, I'll show how to write simple code using Kor and OpenAI's GPT 3.5 model to extract relevant invoice data. We'll also see how Kor makes writing prompts easy and efficient so that you can get the most out of the model.

Code in Action:

[Image: sample invoice]

Let's use the above sample invoice. From an invoice, one is typically interested in extracting the invoice number, item details (description, quantity, unit price, total), total balance details, and addresses (company, billing, and shipping).

The first step is to extract text from the invoice. We can use open-source libraries like pytesseract, PyPDF2, or PDF Miner, or commercial services like AWS Textract, Azure Form Recognizer, and the Google Cloud Vision API. Let's use PDF Miner on the above sample.

Link to execute the below code: colab notebook

To start, load the PDF file from Google Drive and use PDF Miner to extract the raw text. Note: you need to download the invoice to your drive first.

# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive/', force_remount=True)
%cd gdrive/MyDrive

# Install the PDF Miner package
!pip install pdfminer.six

# Import extract_text from PDF Miner and extract the raw text
from pdfminer.high_level import extract_text
text = extract_text('sample_invoice.pdf')

# Basic processing: replace newlines with spaces
processed_text = " ".join(text.split("\n"))
print(processed_text)

The above code should print the following text.

SAMPLE INVOICE GPT Solutions 123 Marvel Street Los Angeles, CA, 90007 [email protected] 213–7654–9876 05/14/2023 INV-28913 <Payment terms (due on receipt, due in X days)> BILL TO Pepper Potts Stark Industries SHIP TO Happy Hogan Stark Industries Robo Street, Malibu, CA 90265 Iron Street, Malibu, CA 90265 213–546–3610 123–456–3601 DESCRIPTION QTY UNIT PRICE TOTAL Lambda Scalar 4U AMD GLU 16 inch MacBook Pro — Space Gray 12.99 inch iPad Pro 2nd generation Apple Pencil Space Gray AirPods Max Service Fee Remarks / Payment Instructions: 1 2 2 1 1 1 $160,090.00 $160,090.00 $3,500.00 $1,200.00 $130.00 $550.00 250.00 $7,000.00 $2400.00 $130.00 $550.00 $250.00 SUBTOTAL $170420.00 DISCOUNT TAX RATE LABOR FEE SHIPPING/HANDLING 0.00 10.00% $0.0 100.00 Balance Due $187562.00


Please note that the text is not extracted in reading order: the table header was read row-wise while the values were read column-wise (notice the run of quantities followed by the run of prices in the above output). This is a prevalent issue when extracting information from structured documents, as these tools often do not preserve reading order. Despite this issue, the GPT 3.5 model performed very well, as shown in the code below.
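To see why reading order matters, here is a toy sketch of my own (not part of the original pipeline): when the extractor emits all descriptions in one run and all quantities and prices in separate runs, values must be re-associated by position rather than read left to right.

```python
# Toy illustration (not the article's code): the extractor emitted
# descriptions, quantities, and prices as separate runs, so a naive
# left-to-right read mis-pairs them; re-associate by position instead.
descriptions = ["16 inch MacBook Pro", "12.9 inch iPad Pro", "Apple Pencil"]
quantities = [2, 2, 1]
unit_prices = [3500.00, 1200.00, 130.00]

rows = [
    {"description": d, "qty": q, "unit_price": p, "line_total": q * p}
    for d, q, p in zip(descriptions, quantities, unit_prices)
]
for row in rows:
    print(row)
```

A generative model given good schema descriptions and examples can often do this re-association implicitly, which is what we rely on below.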

The second step is to install the required packages and load libraries for modeling.

# Install packages using pip
!pip install openai
!pip install langchain
!pip install kor

# Create and view your API keys at
# https://platform.openai.com/account/api-keys
# Replace '' with your OpenAI API key
openai_api_key = ''

# Import the LangChain ChatOpenAI module
from langchain.chat_models import ChatOpenAI

# Load the GPT 3.5 model
llm = ChatOpenAI(
    model_name="gpt-3.5-turbo",
    temperature=0,
    max_tokens=2000,
    openai_api_key=openai_api_key
)

# Import the necessary components from Kor
from kor.extraction import create_extraction_chain
from kor.nodes import Object, Text, Number

Ensure you replace the empty string with your OpenAI API key (create your keys here). ChatOpenAI has many parameters; here are some from LangChain:

  1. model_name: the name of the chat model.
  2. temperature: a hyperparameter that controls the randomness of the generated text. A higher temperature produces more random text, while a lower temperature produces more predictable text.
  3. max_tokens: a hyperparameter that caps the length of the generated text. Increasing max_tokens allows longer outputs; decreasing it forces shorter ones.

You can find more models and parameters here. We also imported the extraction-chain module and the node types Object, Text, and Number from Kor.

The next step is to create a schema and provide examples.

Let's first create a schema for the invoice number.

# Object creation for invoice_number
invoice_schema = Object(
    id="invoice_extraction",
    description="extraction of relevant information from invoice",
    attributes=[
        Text(
            id="invoice_number",
            description="unique number (identifier) of given invoice",
            examples=[
                ("Invoice Number: INV-23490", "INV-23490"),
                ("INVNO-76890", "INVNO-76890"),
                ("Invoice: INV-100021", "INV-100021")
            ]
        )
    ],
    many=False,
)

In the above code, an invoice_number attribute is added with a description. We also provide examples so the language model understands the format and semantics. The next step is to create a chain and run it on a sample text. Chains, which are core components of LangChain, allow us to combine multiple components into the required application. We also set many=False, as an invoice has only one invoice number.

invoice_chain = create_extraction_chain(llm, invoice_schema)
invoice_chain.predict_and_parse(text="the invoice Number is INV-32489")["data"]

#output
{'invoice_extraction': [{'invoice_number': 'INV-32489'}]}        

Note: from the above example, we can see that even though the sample text contains filler words like "the" and "is," the GPT model ignored them and extracted the invoice number in the format we instructed.

We can also see the prompt generated by Kor from our object definition to pass it to the GPT model.

print(invoice_chain.prompt.format_prompt(text="[user input]").to_string())

#output
"""
Your goal is to extract structured information from the user's input \
that matches the form described below. When extracting information \
please make sure it matches the type information exactly. \
Do not add any attributes that do not appear in the schema shown below.

```TypeScript

invoice_extraction: { // extraction of relevant information from invoice
 invoice_number: string // unique number (identifier) of given invoice
}
```

Please output the extracted information in CSV format in Excel dialect. \
Please use a | as the delimiter. 
Do NOT add any clarifying information. 
Output MUST follow the schema above. 
Do NOT add any additional columns that do not appear in the schema.

Input: Invoice Number: INV-23490
Output: invoice_number
INV-23490

Input: INVNO-76890
Output: invoice_number
INVNO-76890

Input: Invoice: INV-100021
Output: invoice_number
INV-100021
"""        

Awesome! Let's now create address_schema and run on our extracted invoice text.

address_schema = Object(
    id="address",
    description="address details",
    attributes=[
        Text(id="name", description="the name of person and organization"),
        Text(id="address_line", description="the local delivery information such as street, building number, PO box, or apartment portion of a postal address"),
        Text(id="city", description="the city portion of the address"),
        Text(id="state_province_code", description="the two-letter code for the US state in the address"),
        Number(id="postal_code", description="the postal code portion of the address")
    ],
    examples=[
        (
            "James Bond, Bond Industries 5000 Forbes Avenue Pittsburgh, PA 15213",
            {
                "name": "James Bond, Bond Industries",
                "address_line": "Bond Industries 5000 Forbes Avenue",
                "city": "Pittsburgh",
                "state_province_code": "PA",
                "postal_code": "15213",
            },
        ),
        (
            "Kaushik Shakkari 840 Childs Way, Los Angeles, CA 90089",
            {
                "name": "Kaushik Shakkari",
                "address_line": "840 Childs Way",
                "city": "Los Angeles",
                "state_province_code": "CA",
                "postal_code": "90089",
            },
        ),
        
        (
            "Shakkari Solutions PO Box 1234 Atlanta GA 30033",
            {
                "name": "Shakkari Solutions",
                "address_line": "PO Box 1234",
                "city": "Atlanta",
                "state_province_code": "GA",
                "postal_code": "30033",
            },
        )
    ],
    many=True,
)

We have both Text and Number data types above. This object should extract the different addresses from the invoice. Let's create the chain and run prediction on the processed text we extracted earlier with PDF Miner.

address_chain = create_extraction_chain(llm, address_schema)
print(address_chain.predict_and_parse(text=processed_text)['data'])

#output
"""
{'address': [{'name': 'GPT Solutions',
   'address_line': '123 Marvel Street',
   'city': 'Los Angeles',
   'state_province_code': 'CA',
   'postal_code': '90007'},
  {'name': 'Pepper Potts, Stark Industries',
   'address_line': 'Stark Industries Iron Street, Malibu',
   'city': 'Malibu',
   'state_province_code': 'CA',
   'postal_code': '90265'},
  {'name': 'Happy Hogan, Stark Industries',
   'address_line': 'Stark Industries Robo Street, Malibu',
   'city': 'Malibu',
   'state_province_code': 'CA',
   'postal_code': '90265'}]}
"""        

Different addresses are extracted accurately. Let's say we want to extract just the billing address. For this, we will reuse the address schema defined above.

billing_address_schema = address_schema.replace(
    id="billing_address", description="where the bill for a product or service is sent so it can be paid by the recipient"
)
billing_address_chain = create_extraction_chain(llm, billing_address_schema)
billing_address_chain.predict_and_parse(text=processed_text)['data']
        

#generated output

"""
{'billing_address': [{'name': 'Pepper Potts',
   'address_line': 'Stark Industries',
   'city': 'Iron Street, Malibu',
   'state_province_code': 'CA',
   'postal_code': '90265'}]}
"""        

The billing address is accurately extracted based on our description prompt.

Now, let's move to the products schema. Like the address schema, the products schema has many=True.

We will also pass on an example so the GPT model understands it well.

products_schema = Object(
    id="bill",
    description="the details of the bill",
    attributes=[
        Text(id="product_description", description="the description of the product or service"),
        Text(id="count", description="number of units bought for the product"),
        Text(id="unit_item_price", description="price per unit"),
        Text(id="product_total_price", description="the total price, which is number of units * unit_price"),
    ],
    examples=[
        (
            "iphone 14 pro black 2 $1200.00 $2400.00",
            {
                "product_description": "iphone 14 pro black",
                "count": 2,
                "unit_item_price": 1200,
                "product_total_price": 2400,
            },
        ),
    ],
    many=True
)

products_chain = create_extraction_chain(llm, products_schema)
products_chain.predict_and_parse(text=processed_text)['data']
        

#generated output

"""
{'bill': [{'product_description': 'Lambda Scalar 4U AMD GLU',
   'count': '1',
   'unit_item_price': '160090',
   'product_total_price': '160090'},
  {'product_description': '16 inch MacBook Pro - Space Gray',
   'count': '2',
   'unit_item_price': '3500',
   'product_total_price': '7000'},
  {'product_description': '12.99 inch iPad Pro',
   'count': '2',
   'unit_item_price': '1200',
   'product_total_price': '2400'},
  {'product_description': '2nd generation Apple Pencil',
   'count': '1',
   'unit_item_price': '130',
   'product_total_price': '130'},
  {'product_description': 'Space Gray AirPods Max',
   'count': '1',
   'unit_item_price': '550',
   'product_total_price': '550'},
  {'product_description': 'Service Fee',
   'count': '1',
   'unit_item_price': '250',
   'product_total_price': '250'}]}
"""        

Despite providing just one example, the GPT model could extract item details accurately.
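Note that the chain returned the numeric fields as strings, since the schema declared them as Text attributes. A small post-processing step (my own sketch, not part of Kor or the original article) can coerce them back to numbers:

```python
# Coerce string-valued numeric fields in the extraction output
# (hypothetical helper, not part of Kor's API).
def coerce_numbers(items, keys=("count", "unit_item_price", "product_total_price")):
    return [
        {k: (float(v) if k in keys else v) for k, v in item.items()}
        for item in items
    ]

bill = [{"product_description": "Service Fee", "count": "1",
         "unit_item_price": "250", "product_total_price": "250"}]
print(coerce_numbers(bill))
```

Alternatively, declaring these attributes as Number in the schema nudges the model toward numeric output, as we do for the total bill below.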

Isn’t this easy? Let's now create our final schema, the final bill!

total_bill_schema = Object(
    id="total_bill",
    description="the details of total amount, discounts and tax",
    attributes=[
        Number(id="total", description="the total amount before tax and delivery charges"),
        Number(id="discount_amount", description="discount amount is total cost * discount %"),
        Number(id="tax_amount", description="tax amount is tax_percentage * (total - discount_amount). If discount_amount is 0, then its tax_percentage * total"),
        Number(id="delivery_charges", description="the cost of shipping products"),
        Number(id="final_total", description="the final balance after subtracting the discount and adding tax and delivery charges to the total"),
    ],
    examples=[
        (
            "total $100000.00 discount 0% tax 5 percentage delivery cost $100.00 final_total $105100.00",
            {
                "total": 100000,
                "discount_amount": 0,
                "tax_amount": 5000,
                "delivery_charges": 100,
                "final_total": 105100
            },
        ),
    ],
    many=False
)

total_chain = create_extraction_chain(llm, total_bill_schema)
total_chain.predict_and_parse(text=processed_text)['data']
        

#generated output

"""
{'total_bill': [{'total': '170420',
   'discount_amount': '0',
   'tax_amount': '17042',
   'delivery_charges': '100',
   'final_total': '187562'}]}
"""        

Note that the invoice doesn't list a tax amount. However, following our attribute description, the model calculated tax_amount from the total, the discount, and the tax percentage. Also note that many=False, as there is only one total per invoice.

Creating definitions and example prompts is an iterative process. Check the outputs for your prompts and revise them for better predictions. To reduce hallucinations, provide more examples so the extraction is robust and accurate.
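Because language models can hallucinate numbers, it is also worth sanity-checking the extracted figures arithmetically before using them downstream. The sketch below (my own helper, not part of Kor or the article) verifies the invoice arithmetic on the outputs shown above:

```python
# Sanity-check extracted invoice numbers (hypothetical helper, not part of Kor):
# each line total must equal count * unit price, the subtotal must equal the
# sum of line totals, and the balance must follow the discount/tax/delivery math.
def validate_invoice(items, total_bill):
    for item in items:
        assert item["product_total_price"] == item["count"] * item["unit_item_price"], item
    subtotal = sum(i["product_total_price"] for i in items)
    assert total_bill["total"] == subtotal
    expected_final = (subtotal
                      - total_bill["discount_amount"]
                      + total_bill["tax_amount"]
                      + total_bill["delivery_charges"])
    assert total_bill["final_total"] == expected_final

# Values from the extraction outputs shown earlier in this article
items = [
    {"count": 1, "unit_item_price": 160090, "product_total_price": 160090},
    {"count": 2, "unit_item_price": 3500, "product_total_price": 7000},
    {"count": 2, "unit_item_price": 1200, "product_total_price": 2400},
    {"count": 1, "unit_item_price": 130, "product_total_price": 130},
    {"count": 1, "unit_item_price": 550, "product_total_price": 550},
    {"count": 1, "unit_item_price": 250, "product_total_price": 250},
]
total_bill = {"total": 170420, "discount_amount": 0, "tax_amount": 17042,
              "delivery_charges": 100, "final_total": 187562}
validate_invoice(items, total_bill)
print("all checks passed")
```

If any assertion fails, revise the schema descriptions or add more examples and re-run the chain.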

Nested Objects Schema:

We can also pack the schemas defined above into one nested schema. For this, we need to use JSON encoding instead of Kor's default CSV encoding (refer to the Kor documentation for more details). This way, we save cost because we don't need to pass the entire text to the GPT model multiple times, once per schema.

invoice_schema = Object(
    id="invoice_information",
    description="relevant invoice parsing from raw extracted text",
    attributes=[
        Text(
            id="invoice_number",
            description= "unique number (identifier) of given invoice",
        examples=[
            ("Invoice Number: INV-23490", "INV-23490"),
            ("INVNO-76890", "INVNO-76890"),
            ("Invoice: INV-100021", "INV-100021")
        ]),
        billing_address_schema,
        products_schema,
        total_bill_schema
    ],
    many=True,
)

invoice_chain = create_extraction_chain(llm, invoice_schema, encoder_or_encoder_class="json")
invoice_chain.predict_and_parse(text=processed_text)['data']
        

#generated output

"""
{'invoice_information': [{'invoice_number': 'INV-28913'},
  {'billing_address': [{'name': 'GPT Solutions',
     'address_line': '123 Marvel Street',
     'city': 'Los Angeles',
     'state_province_code': 'CA',
     'postal_code': '90007'}]},
  {'total_bill': {'total': 170420,
    'discount_amount': 0,
    'tax_amount': 17042,
    'delivery_charges': 100,
    'final_total': 187562}},
  {'bill': [{'product_description': 'Lambda Scalar 4U AMD GLU',
     'count': 1,
     'unit_item_price': 160090,
     'product_total_price': 160090},
    {'product_description': '16 inch MacBook Pro - Space Gray',
     'count': 2,
     'unit_item_price': 2400,
     'product_total_price': 4800},
    {'product_description': '12.99 inch iPad Pro',
     'count': 2,
     'unit_item_price': 130,
     'product_total_price': 260},
    {'product_description': '2nd generation Apple Pencil',
     'count': 1,
     'unit_item_price': 1200,
     'product_total_price': 1200},
    {'product_description': 'Space Gray AirPods Max',
     'count': 1,
     'unit_item_price': 550,
     'product_total_price': 550},
    {'product_description': 'Service Fee',
     'count': 1,
     'unit_item_price': 250,
     'product_total_price': 250}]}]}
"""

Conclusion:

Photo by the author: a pleasant evening at Eatonville, WA (April 2023)

This article focused on Kor, a library built on top of LangChain that extracts text from unstructured and semi-structured data (here, invoices) and presents it in a structured form. It demonstrated, through a worked code example, how easy and efficient prompt-based extraction can be.

Kudos, you have learned how to build a custom parser!

Stay tuned for more articles on generative language modeling!!!
