Create a RAG Knowledge Base from Large Documents with ChatGPT and Code Analysis
An overview of how to extract, index, and generate the desired JSON content from a PDF


Sometimes you may want to create a knowledge base for future queries, either to assist a custom GPT or to establish a RAG system. Such a knowledge base requires normalized information in a consistent structure, for example multiple JSON files sharing a common set of metadata attributes.

This article shows how to take a research paper, in this case a catalog of diverse prompt patterns, break it down into separate JSON files with uniform attributes, and compile those files into a zip archive using only ChatGPT and no external tools.

Extraction

The first step involves extracting the PDF files. For this, we can use a prompt similar to the following.

Using code analysis, extract each page of this PDF as plain text
in a separate file

This step is important as ChatGPT may not always be able to read the entire PDF. Now, all the text is accessible in separate text files for download.
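For reference, the kind of script code analysis tends to run for this prompt looks roughly like the sketch below. It assumes the upload is available as /mnt/data/paper.pdf and that the pypdf library is present in the sandbox; both are assumptions, and ChatGPT writes its own variant on the fly.

from pathlib import Path
from pypdf import PdfReader

# Assumed input path; adjust to the actual uploaded file name.
reader = PdfReader("/mnt/data/paper.pdf")
out_dir = Path("/mnt/data/pages")
out_dir.mkdir(exist_ok=True)

# Write each page as its own plain-text file.
for i, page in enumerate(reader.pages, start=1):
    text = page.extract_text() or ""  # extract_text() can return None for empty pages
    (out_dir / f"page_{i:02d}.txt").write_text(text, encoding="utf-8")

print(f"Extracted {len(reader.pages)} pages into {out_dir}")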

Reading the pages

Now we ask GPT to read the extracted text files. This will load the paper content as context.

Please read the different extracted text pages so that you can
answer questions about them.

Indexing

We want GPT to index each page according to a pattern. This helps GPT identify the prompt patterns and determine which parts of the original PDF to use when generating knowledge. Depending on the PDF, the index can sometimes be built programmatically with Python, for example by keying on section headers (see the sketch after the prompt below). In this paper there are no clear headers to key on, so we ask GPT to build the index manually.

Create an index that maps the prompt patterns to the different
available pages        
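For PDFs that do have clear headers, a small Python pass run through code analysis can build the index instead. This is a heuristic sketch only: it assumes the page files from the extraction step live in /mnt/data/pages and that pattern names appear as lines ending in "Pattern", neither of which holds for every paper.

import re
from pathlib import Path

index = {}  # pattern name -> list of page numbers
for page_file in sorted(Path("/mnt/data/pages").glob("page_*.txt")):
    page_number = int(re.search(r"\d+", page_file.stem).group())
    for line in page_file.read_text(encoding="utf-8").splitlines():
        # Heuristic: treat any line ending in "Pattern" as a section header.
        match = re.match(r"^\s*\d*\.?\s*(.+ Pattern)\s*$", line)
        if match:
            index.setdefault(match.group(1).strip(), []).append(page_number)

for pattern, pages in sorted(index.items()):
    print(f"{pattern}: pages {pages}")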

Generation

I’ve found that the most effective way to generate content is to focus on the specific parts you want to produce. In my case, I want a description, a template, and a usage example for each prompt pattern.

Here are the various prompts I’ve utilized for this task.

Now for each pattern create a small description explaining it        
Now for each pattern create a small description of when it would be
appropriate to use it.
Now for each prompt pattern provide an example using this pattern. 
Make sure the example is the actual prompt we would provide to the LLM. 
Make sure to use examples targeted to the lay person, like creating a 
nutritionist expert in the persona pattern. Or generating a field trip 
plan with the flipped interaction pattern.        

ChatGPT will promptly respond to the request and generate the desired content for each pattern listed in the index.

Formatting

The final step is to compile all this information and format it into JSON files. To do this, provide a prompt that details the desired JSON structure.

Now for each pattern create a json structure using the text 
previously generated. The structure looks like this:

{
   "description": "The small description previously generated",
   "template": "The template for the pattern",
   "example": "The example of usage of the pattern"
}        

With this, ChatGPT will start generating the JSON structures. It may take multiple runs before it produces all the desired content.
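If you prefer, you can also have code analysis write the files out deterministically once the text exists; the underlying step is just a json.dump per pattern. A minimal sketch, assuming the generated text has been collected into a Python dict (the single entry shown reuses the placeholder strings from the structure above, not real output from the paper):

import json
from pathlib import Path

# Illustrative placeholder; in practice one entry per pattern in the index.
patterns = {
    "persona": {
        "description": "The small description previously generated",
        "template": "The template for the pattern",
        "example": "The example of usage of the pattern",
    },
}

out_dir = Path("/mnt/data/knowledge_base")
out_dir.mkdir(exist_ok=True)
for name, content in patterns.items():
    (out_dir / f"{name}.json").write_text(
        json.dumps(content, indent=2), encoding="utf-8"
    )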

You can now copy these JSON files into your preferred knowledge base platform. Depending on your language model’s limits and the size of the collection, you may also ask ChatGPT to compile the files into a zip archive.
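The zip step itself is a one-liner for code analysis. A sketch, assuming the JSON files were written to /mnt/data/knowledge_base (the folder name is an assumption):

import shutil

# Bundle the folder of JSON files into knowledge_base.zip for download.
archive = shutil.make_archive("/mnt/data/knowledge_base", "zip", "/mnt/data/knowledge_base")
print(f"Knowledge base archived at {archive}")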

Even though this method is powerful, it requires a thorough understanding of the content you’re extracting and some manual prompting. Always review the original source and make sure the output meets your expectations. You may need to adjust some prompts until you achieve the desired result.
