Create a RAG Knowledge Base from Large Documents with ChatGPT and Code Analysis
Johan de Bruin
Generative AI | Experienced Software Engineer & Team Leader | Ex-Riot, Ex-Age of Learning
Sometimes, you may want to create a knowledge base for future queries, either to assist a custom GPT or to establish a RAG system. This knowledge base requires normalized information in a relevant structure, such as multiple JSON files that share the same metadata attributes.
This article takes a research paper that catalogs a diverse set of prompt patterns, breaks it down into separate JSON files with uniform attributes, and compiles these files into a zip file, using only ChatGPT and no external tools.
Extraction
The first step is to extract the text from the PDF file. For this, we can use a prompt similar to the following.
Using code analysis, extract each page of this PDF as plain text
in a separate file
This step is important because ChatGPT may not always be able to read the entire PDF. Once it completes, the text of every page is available in its own text file for download.
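For readers curious about what the code analysis step might run behind the scenes, here is a minimal sketch. It assumes the third-party pypdf package is available; the function names, the `page_001.txt` naming scheme, and the output folder are illustrative choices, not something ChatGPT is guaranteed to produce.

```python
from pathlib import Path


def extract_page_texts(pdf_path):
    """Return the plain text of each page, in order (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported here so write_pages works without pypdf

    reader = PdfReader(pdf_path)
    # extract_text() may return an empty string for scanned or image-only pages
    return [page.extract_text() or "" for page in reader.pages]


def write_pages(page_texts, out_dir):
    """Write one UTF-8 text file per page and return the created paths."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for number, text in enumerate(page_texts, start=1):
        # Hypothetical naming scheme: page_001.txt, page_002.txt, ...
        path = out / f"page_{number:03d}.txt"
        path.write_text(text, encoding="utf-8")
        paths.append(path)
    return paths
```

A call such as `write_pages(extract_page_texts("paper.pdf"), "pages")` would then produce one file per page, where `paper.pdf` and `pages` are placeholder names.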
Next, we ask GPT to read the extracted text files. This loads the paper's content into its context.
Please read the different extracted text pages so that you can
answer questions about it.
Index
We want GPT to index each page according to a pattern. This helps GPT identify the prompt patterns and determine which parts of the original PDF to use when generating knowledge. Depending on the PDF, the index cannot always be built with Python. This paper has no clear headers to index on, so we ask GPT to index the pages itself.
Create an index that maps the prompt patterns to the different
available pages
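For documents that do name each pattern consistently, a Python-based index might be enough. The sketch below is an assumption about how such an index could be built, using a simple case-insensitive keyword match. It is not what GPT does when it indexes manually, and `build_index` is a hypothetical helper.

```python
def build_index(pages, pattern_names):
    """Map each pattern name to the 1-based page numbers that mention it."""
    index = {name: [] for name in pattern_names}
    for number, text in enumerate(pages, start=1):
        lowered = text.lower()
        for name in pattern_names:
            # Plain substring match; this breaks down when names are ambiguous
            if name.lower() in lowered:
                index[name].append(number)
    return index


# Illustrative page texts, not taken from the paper
pages = [
    "The Persona pattern gives the LLM a role to play.",
    "Flipped Interaction lets the LLM ask the questions. Combine it with a persona.",
]
print(build_index(pages, ["Persona", "Flipped Interaction"]))
```

When the names do not appear verbatim in the text, as in this paper, keyword matching falls short and the manual GPT index is the better choice.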
Generation
I’ve discovered that the most effective way to generate content is to focus on one specific part at a time. In my case, I want a description, a template, and an example for each prompt pattern.
Here are the various prompts I’ve utilized for this task.
Now for each pattern create a small description explaining it.
Now for each pattern create a small description of when it would be
appropriate to use it.
Now for each prompt pattern provide an example using this pattern.
Make sure the example is the actual prompt we would provide to the LLM.
Make sure to use examples targeted to the lay person, like creating a
nutritionist expert in the persona pattern. Or generating a field trip
plan with the flipped interaction pattern.
ChatGPT will promptly respond to the request and generate the desired content for each pattern listed in the index.
Formatting
The final step is to compile all this information and format it as JSON files. The prompt for this step spells out the desired JSON structure.
Now for each pattern create a json structure using the text
previously generated. The structure looks like this:
{
"description": "The small description previously generated",
"template": "The template for the pattern",
"example": "The example of usage of the pattern"
}
With this prompt, ChatGPT will start generating the JSON structures. It might take multiple runs before it produces all the desired content.
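Because the output can take several runs, it may help to check that every generated record really has the same attributes before saving it. The following is a minimal sketch under that assumption. The three keys come from the structure in the prompt, while `save_patterns`, `slugify`, and the file naming are hypothetical.

```python
import json
import re
from pathlib import Path

# The attributes requested in the formatting prompt
REQUIRED_KEYS = {"description", "template", "example"}


def slugify(name):
    """Turn a pattern name into a safe file stem, e.g. 'Flipped Interaction' -> 'flipped_interaction'."""
    return re.sub(r"[^a-z0-9]+", "_", name.lower()).strip("_")


def save_patterns(patterns, out_dir):
    """Validate each record's keys, then write one JSON file per pattern."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for name, record in patterns.items():
        if set(record) != REQUIRED_KEYS:
            raise ValueError(
                f"{name}: expected keys {sorted(REQUIRED_KEYS)}, got {sorted(record)}"
            )
        path = out / f"{slugify(name)}.json"
        path.write_text(json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8")
        paths.append(path)
    return paths
```

A record with a missing or extra attribute raises an error instead of being written, which makes incomplete runs easy to spot.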
You can now copy and save these JSON files to your preferred knowledge base platform. Depending on your language model’s limitations and the compilation size, you may ask ChatGPT to compile these files into a zip file.
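If you ask ChatGPT for a zip file, the code it runs is likely close to the sketch below, which uses Python's standard `zipfile` module. The function name and the flat archive layout are assumptions made for illustration.

```python
import zipfile
from pathlib import Path


def zip_json_files(json_dir, zip_path):
    """Bundle every .json file in json_dir into one compressed zip archive."""
    files = sorted(Path(json_dir).glob("*.json"))
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        for file in files:
            # arcname keeps only the file name so the archive has a flat layout
            archive.write(file, arcname=file.name)
    return [file.name for file in files]
```

The returned list of file names gives a quick way to confirm that every pattern made it into the archive.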
Even though this method is potent, it requires a thorough understanding of the content you’re extracting and some manual prompting. Always review the original source and ensure the output meets your expectations. You might need to adjust some prompts until you achieve the desired output.