Generative AI-based RAG with the Llama 2 model using Amazon SageMaker JumpStart and Amazon Kendra as the data source, with multilingual features
Satish Srinivasan
Cloud Architect | Cloud Security Analyst | Specialist - AWS & Azure Cloud. AWS Community Builder | AWS APN Ambassador
Generative AI is a type of artificial intelligence technology that can create new content, such as text, images, code, or music, based on existing data and patterns.
Generative AI (GenAI) and large language models (LLMs), such as those available via Amazon, are transforming the way developers and enterprises solve traditionally complex challenges in natural language processing and understanding. Among the benefits LLMs offer are more capable and compelling conversational AI experiences for customer service applications, and improved employee productivity through more intuitive and accurate responses.
For these use cases, however, it's critical for the GenAI applications implementing the conversational experiences to meet two key criteria: limit the responses to the user's domain-specific data, thereby mitigating model hallucinations (incorrect statements), and filter responses according to the end user's content access permissions.
To restrict the GenAI application's responses to domain-specific data only, we need to use a technique called Retrieval Augmented Generation (RAG). An application using the RAG approach retrieves the information most relevant to the user's request from the enterprise knowledge base or content, bundles it as context along with the user's request in a prompt, and then sends it to the LLM to get a GenAI response. LLMs have limits on the maximum word count of the input prompt, so choosing the right passages from among thousands or millions of enterprise documents has a direct impact on the LLM's accuracy.
In designing effective RAG, content retrieval is a critical step to ensure the LLM receives the most relevant and concise context from enterprise content to generate accurate responses.
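As a minimal illustration of the retrieve-then-bundle step described above (the prompt wording and size limit here are my own choices, not the app's exact code):

```python
def build_rag_prompt(question: str, passages: list, max_chars: int = 6000) -> str:
    """Bundle the retrieved passages with the user's question into one prompt.

    The context is truncated so the prompt stays within the model's input limit.
    """
    context = "\n\n".join(passages)[:max_chars]
    return (
        "Answer the question using only the context below. "
        "If the answer is not in the context, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )
```

The instruction to answer only from the context is what curbs hallucinations; the truncation is what keeps the prompt within the LLM's input limit.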
We will be using Amazon Kendra as the data source (vector data store). Amazon Kendra is a fully managed service that provides out-of-the-box semantic search capabilities for state-of-the-art ranking of documents and passages. We can use the high-accuracy search in Amazon Kendra to source the most relevant content and documents, maximizing the quality of the RAG payload and yielding better LLM responses than conventional or keyword-based search solutions. Amazon Kendra offers easy-to-use deep learning search models that are pre-trained on 14 domains and don’t require any ML expertise, so there’s no need to deal with word embeddings, document chunking, and other lower-level complexities typically required for RAG implementations. Amazon Kendra provides the Retrieve API, designed for the RAG use case. There are pre-built connectors to popular data sources such as Amazon Simple Storage Service (Amazon S3), SharePoint, Confluence, and websites, with support for common document formats such as HTML, Word, PowerPoint, PDF, Excel, and plain text files.
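As a sketch, the Retrieve API can be called through boto3 like this (index ID, region, and result count are placeholders):

```python
def extract_passages(response: dict) -> list:
    """Pull the passage text out of a Kendra Retrieve API response."""
    return [item["Content"] for item in response.get("ResultItems", [])]

def retrieve_passages(index_id: str, query: str, region: str = "us-east-1",
                      top_k: int = 3) -> list:
    """Fetch the most relevant passages for a query from a Kendra index."""
    import boto3  # requires AWS credentials with kendra:Retrieve permission
    kendra = boto3.client("kendra", region_name=region)
    resp = kendra.retrieve(IndexId=index_id, QueryText=query, PageSize=top_k)
    return extract_passages(resp)
```

Unlike the Query API, Retrieve returns longer passages, which makes it the better fit for building an LLM context.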
Architecture overview
The architecture diagram below shows the bot using the Llama 2 JumpStart model endpoint with Kendra as the data source.
Prerequisites
An Amazon SageMaker notebook instance, an Amazon Kendra index, and an EC2 instance on which we will deploy the Streamlit app.
Walkthrough
First, let us create an S3 bucket and upload the domain data we will be using for this generative AI bot.
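The upload can also be scripted; a small sketch (the bucket name matches the one used later in this walkthrough, the file path is a placeholder):

```python
def upload_domain_data(local_path: str, bucket: str, key: str,
                       region: str = "us-east-1") -> None:
    """Upload one local document to the S3 bucket that Kendra will crawl later."""
    import boto3  # requires AWS credentials with s3:PutObject on the bucket
    boto3.client("s3", region_name=region).upload_file(local_path, bucket, key)

# Example: upload_domain_data("mydoc.pdf", "rag-demo-bkt", "docs/mydoc.pdf")
```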
Next, we will create the SageMaker notebook instance, which we will use for deploying the LLM endpoint.
Go to Amazon SageMaker.
Notebook -> Notebook instances.
Press “Create notebook instance”.
Llama2-demo is the instance name.
Now, under IAM role, select “Create a new role” from the drop-down.
Press “Create role”.
The new role is created. Make a note of this role; we will need it later when we create the EC2 instance and associate a role to build the UI interface.
Press “Create notebook instance”.
Wait for the instance to be created and become ready.
The instance is now ready for deploying the model. Press “Open JupyterLab”.
Select the conda_pytorch_p310 kernel.
Let us create the SageMaker endpoint we will be using for this demo. Write the following code in the Jupyter notebook and run it.
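The full notebook is in the linked GitHub repo; as a rough sketch of what the deployment looks like with the SageMaker Python SDK (the model ID below is the chat-tuned 7B variant, and the instance type is illustrative):

```python
# JumpStart model ID for the chat-tuned 7B variant; other sizes follow the same pattern.
MODEL_ID = "meta-textgeneration-llama-2-7b-f"

def deploy_llama2(model_id: str = MODEL_ID,
                  instance_type: str = "ml.g5.2xlarge"):
    """Deploy a Llama 2 JumpStart model to a real-time endpoint."""
    from sagemaker.jumpstart.model import JumpStartModel  # needs the sagemaker SDK
    model = JumpStartModel(model_id=model_id)
    return model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        accept_eula=True,  # Llama 2 requires accepting Meta's license
    )
```

`deploy()` returns a predictor object whose `endpoint_name` is the value we note down in the next steps.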
I will upload the Python notebook and the associated LangChain code to GitHub for reference.
The endpoint has been created.
Go to Amazon SageMaker -> Inference.
Under Inference, first go to “Models”.
Click “Models” and you will see your deployed model.
Next, go to “Endpoint configurations” and click on it.
Next, go to “Endpoints” and click on it.
Make a note of the endpoint name; we will need it to invoke the LLM.
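For reference, the endpoint can be invoked with boto3's SageMaker runtime client. The payload shape below matches the Llama 2 chat models on JumpStart; the generation parameters are illustrative defaults:

```python
import json

def build_llama2_payload(question: str, max_new_tokens: int = 512,
                         temperature: float = 0.1) -> dict:
    """Request body for the Llama 2 chat models: a list of dialogs."""
    return {
        "inputs": [[{"role": "user", "content": question}]],
        "parameters": {"max_new_tokens": max_new_tokens,
                       "temperature": temperature, "top_p": 0.9},
    }

def ask_llama2(endpoint_name: str, question: str, region: str = "us-east-1") -> str:
    """Invoke the deployed endpoint and return the assistant's reply."""
    import boto3  # requires AWS credentials with sagemaker:InvokeEndpoint
    runtime = boto3.client("sagemaker-runtime", region_name=region)
    resp = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps(build_llama2_payload(question)),
        CustomAttributes="accept_eula=true",  # Llama 2 endpoints require this
    )
    return json.loads(resp["Body"].read())[0]["generation"]["content"]
```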
The model is now successfully deployed. Close the Jupyter notebook and stop the notebook instance to save costs.
Press “Stop”.
The status will change to “Stopped”.
If not required, you can delete the instance.
Next, we will create a Kendra index and load the contents of the file into Kendra to build our RAG application.
Create the Kendra index we will use for loading the data.
Press “Create Index”.
Give the index a unique name (“demo-80” here). Select “Create a new role” and give it a unique name.
Press “Next”.
Press “Next”.
Press “Next”.
Press “Create”.
The index is ready; now we have to add the data source.
Press “Data sources”. We see the screen below.
Select the S3 data source connector.
Press “Next”. “DS-Data80” is just a name; it needs to be unique among the data sources previously created in this Kendra account.
Choose “Create a new role” and give a unique name. Press “Next”.
I am showing the S3 bucket from my account for reference.
Choose the S3 bucket in which we placed our PDF (“rag-demo-bkt” in my case) and set the Sync run schedule to “Run on demand”. Press “Next”.
Press “Next”.
I will cover content enrichment and using converted audio files as a data source for RAG with Kendra in a separate blog.
Press “Add data source”.
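The same connector can also be created programmatically with boto3; a sketch (the name matches the console walkthrough, while the role ARN is a placeholder you must supply):

```python
def s3_source_config(bucket: str) -> dict:
    """Connector configuration for an S3 bucket data source."""
    return {"S3Configuration": {"BucketName": bucket}}

def create_s3_data_source(index_id: str, name: str, bucket: str, role_arn: str,
                          region: str = "us-east-1") -> str:
    """Create an S3 data source connector and return its ID.

    Omitting the Schedule parameter leaves the sync schedule as run-on-demand.
    """
    import boto3  # requires AWS credentials with Kendra permissions
    kendra = boto3.client("kendra", region_name=region)
    resp = kendra.create_data_source(
        IndexId=index_id,
        Name=name,                       # e.g. "DS-Data80"
        Type="S3",
        RoleArn=role_arn,                # the data source role created above
        Configuration=s3_source_config(bucket),
    )
    return resp["Id"]
```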
Let us sync now to load the data into the Kendra index.
Let us wait for this to complete.
The sync has completed. Please make a note of the Kendra index ID, as we did for the SageMaker endpoint; we will need it for our demo.
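An on-demand sync can also be started and monitored from code rather than the console; a sketch (the 30-second poll interval is my own choice):

```python
import time

def sync_finished(status: str) -> bool:
    """A sync job is done once it reaches a terminal status."""
    return status in ("SUCCEEDED", "FAILED", "ABORTED")

def sync_and_wait(index_id: str, data_source_id: str,
                  region: str = "us-east-1") -> str:
    """Start an on-demand sync job and poll until it reaches a terminal state."""
    import boto3  # requires AWS credentials with Kendra permissions
    kendra = boto3.client("kendra", region_name=region)
    job_id = kendra.start_data_source_sync_job(
        Id=data_source_id, IndexId=index_id)["ExecutionId"]
    while True:
        history = kendra.list_data_source_sync_jobs(
            Id=data_source_id, IndexId=index_id)["History"]
        status = next(j["Status"] for j in history if j["ExecutionId"] == job_id)
        if sync_finished(status):
            return status
        time.sleep(30)
```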
We have to set the values of three variables, shown below, which we will use to run the demo.
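A small helper makes the three settings explicit. The variable names below are assumptions following common conventions for this kind of sample; adjust them to match what app.py actually reads:

```python
def read_config(env) -> dict:
    """Collect the three settings the demo needs from an environment mapping."""
    return {
        "region": env["AWS_REGION"],                # e.g. "us-east-1"
        "kendra_index_id": env["KENDRA_INDEX_ID"],  # the Kendra index ID noted earlier
        "endpoint_name": env["LLAMA_2_ENDPOINT"],   # the SageMaker endpoint noted earlier
    }
```

Export the three variables in the shell before launching the app, e.g. `export AWS_REGION=us-east-1`, then call `read_config(os.environ)` at startup.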
Next, we will move to the EC2 instance.
Create a role and add the following policies. These are not designed for least privilege and need to be tightened for production use.
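As a rough sketch of the permissions the app actually exercises (echoing the disclaimer above, this is not least-privilege; the `Resource` entries should be scoped down to your index and endpoint ARNs):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["kendra:Retrieve", "kendra:Query"], "Resource": "*" },
    { "Effect": "Allow", "Action": "sagemaker:InvokeEndpoint", "Resource": "*" }
  ]
}
```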
The execution policy is the one we created when we created the Notebook.
The security group inbound rules that you need to set are given below.
I am allowing access from my public IP only; set this to your own public IP. Press “Save”.
Next, we will create the EC2 instance.
Press “Launch instance” and wait for the instance to reach the running state.
Next, we will log in to the instance and install the necessary packages.
Create a virtual environment named demo.
>> python3 -m venv demo
Go to the folder demo and activate it.
>> cd demo
>> source ~/demo/bin/activate
Now let us install the packages
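The exact dependency list is in the repo; a minimal assumed set for this app would be:

```text
# requirements.txt (assumed minimal set; see the GitHub repo for exact versions)
streamlit
langchain
boto3
```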
The source code for the file will be checked into GitHub.
Next, we will use PuTTY to log in to the EC2 instance and start the application.
>> cd demo
>> source ~/demo/bin/activate
This activates the virtual environment we created previously.
Next, we will run the export commands we saved previously.
In the demo folder, create an images folder and copy the images into it.
Now we are ready to run the bot.
>> streamlit run app.py llama2
Open the Google Chrome browser and navigate to the URL.
https://54.208.229.156:8501 — we see the screen below. Let us start asking questions.
First question.
Second question.
Now we will see the same Q&A bot working in German. The questions and answers will be in German, but the text data behind the scenes remains in English.
Let us open Google Translate and translate the English questions into German.
Now that we have this information, let us start the demo, this time using the German bot. This could be made generic by getting the target language from the UI and configuring the bot automatically; that is an idea for future improvement.
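One simple way to realize that improvement, sketched here as an assumption rather than the app's actual mechanism, is to prefix the question with a language instruction before sending it to the LLM:

```python
def wrap_with_language(question: str, target_language: str) -> str:
    """Prefix an instruction so the model answers in the requested language,
    even though the retrieved context stays in English."""
    return f"Answer in {target_language}. {question}"
```

A Streamlit select box could then feed `target_language` from the UI.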
Let us ask the first question in German.
We get the answer back in German even though the knowledge base is in English.
Second question.
Cleaning Up
For the EC2 instance, we can stop it to save costs.
Please delete the endpoint, endpoint configuration, and model, in that order.
Endpoints
Press “Delete”.
Endpoint configurations.
Press “Delete”.
Models.
Press “Delete”.
For Kendra, we also need to follow the cleanup steps; otherwise we will end up with a large bill.
Next, we will show the steps to delete the data source and remove the index. This is a very important step, as Kendra has a high running cost and should be deleted once you are done.
Go back to the home page.
Click “demo-80”.
The above screen opens. Under “Data management”, click “Data sources”.
Select “DS-Data80” and delete the data source.
Press “Delete”.
Type “Delete” and Press the “Delete” button.
It is now deleting the data source. Wait for it to complete; this usually takes 15 to 30 minutes.
The data source has been deleted. Now we need to delete the index.
Go back to the home page, select the index, and press “Delete”.
Choose the option “Just curious, no real use case at the moment” and then press “Delete”.
Wait for it to be completed.
The Index has been successfully deleted.
The source code is available at https://github.com/satishk01/LLMsample/tree/main/llama2kendrarag.
Conclusion
In this blog, we learned how to create a JumpStart-based Llama 2 endpoint, load data into a Kendra data store, and run a bot using Streamlit. In the next post, we will see how to convert audio files to text, index them with Kendra, and build a RAG-based bot on top of them.
I would like to thank Ashutosh Dubey from AWS for the technical guidance and help during this development.