Build a chatbot that retrieves and provides answers using SageMaker Canvas and AWS Data Wrangler
Cost-Efficient, Minimal Customization Use Case Using SageMaker Canvas and AWS Data Wrangler
Situation:
You want to build a chatbot that retrieves and provides answers from multiple SQL tables by generating embeddings and using Retrieval-Augmented Generation (RAG) with minimal customization. By leveraging Amazon SageMaker Canvas for simplified ML model building and AWS Data Wrangler for data preparation, you aim to reduce complexity and manage costs.
Task:
Leverage Amazon SageMaker Canvas for embedding generation and AWS Data Wrangler for data extraction and preparation. Then integrate the solution with other AWS services for storing embeddings, building a chatbot, and retrieving data using RAG.
Action Plan:
Step 1: Data Extraction and Preparation with AWS Data Wrangler
Use AWS Data Wrangler (now the AWS SDK for pandas), a Python library that integrates with AWS Glue and Amazon Athena, to efficiently extract data from your SQL database into a pandas DataFrame.
Use Data Wrangler’s built-in functions to clean, preprocess, and merge the data. This simplifies handling the overlapping columns between the SQL tables.
Perform basic data transformations like joins, filters, and aggregations to prepare the data for embedding generation. Store the cleaned data in Amazon S3.
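As a rough illustration, the extract-transform-load flow might look like the sketch below. It assumes the tables are already registered in an Athena database; the database, table, column, and bucket names are hypothetical placeholders.

```python
# Minimal sketch of Step 1: extract via Athena, prepare with pandas, land in S3.
# All database, table, column, and S3 names are hypothetical placeholders.
import awswrangler as wr

# Extract: pull two overlapping SQL tables into DataFrames via Athena.
orders = wr.athena.read_sql_query(
    "SELECT order_id, customer_id, product, amount FROM orders",
    database="chatbot_source_db",
)
customers = wr.athena.read_sql_query(
    "SELECT customer_id, name, region FROM customers",
    database="chatbot_source_db",
)

# Transform: join on the overlapping column, drop duplicates, filter nulls.
merged = (
    orders.merge(customers, on="customer_id", how="inner")
    .drop_duplicates()
    .dropna(subset=["product"])
)

# Aggregation example: total spend per customer and region.
summary = merged.groupby(["customer_id", "region"], as_index=False)["amount"].sum()

# Load: store the cleaned data in S3 as Parquet for the embedding step.
wr.s3.to_parquet(
    df=merged,
    path="s3://my-chatbot-bucket/prepared/",
    dataset=True,
    mode="overwrite",
)
```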
Step 2: Embedding Generation with SageMaker Canvas
Use Amazon SageMaker Canvas, a no-code tool that enables non-ML experts to build machine learning models.
Load the preprocessed SQL data from Amazon S3 into SageMaker Canvas. You can either build a text-based ML model or use a pre-trained model to generate embeddings from your SQL data.
SageMaker Canvas provides a simplified interface for generating embeddings without needing deep ML expertise. Use Canvas to convert your SQL table rows into vector embeddings that represent each record.
Step 3: Embedding Storage
After generating embeddings using SageMaker Canvas, store them in Amazon S3. You can also store metadata and smaller datasets in Amazon DynamoDB for faster retrieval during chatbot interactions.
If embedding search is needed, you can later integrate a more specialized search solution such as Amazon Kendra (for contextual document search) or k-NN search in Amazon OpenSearch Service. For cost optimization, however, start with S3 or DynamoDB.
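For illustration, persisting embeddings to DynamoDB might look like the following minimal sketch. It assumes a pre-created table named record-embeddings with a record_id partition key; the table name, key, and sample values are all hypothetical.

```python
# Minimal sketch: store one record's embedding vector plus metadata in DynamoDB.
# Table name, key schema, and sample values are hypothetical placeholders.
import boto3
from decimal import Decimal

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("record-embeddings")

def store_embedding(record_id: str, embedding: list, metadata: dict) -> None:
    """Persist one record's vector plus lookup metadata."""
    table.put_item(
        Item={
            "record_id": record_id,
            # DynamoDB has no native float type, so vectors are stored as Decimals.
            "embedding": [Decimal(str(x)) for x in embedding],
            "metadata": metadata,
        }
    )

store_embedding(
    "orders#1001",
    [0.12, -0.03, 0.88],  # truncated example vector
    {"source_table": "orders", "s3_uri": "s3://my-chatbot-bucket/prepared/"},
)
```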
Step 4: Retrieval-Augmented Generation (RAG)
For the Retrieval-Augmented Generation (RAG) approach, you can leverage a SageMaker endpoint with a pre-trained generative model (such as a GPT-style model available through SageMaker JumpStart) to retrieve relevant embeddings and generate answers.
Use Amazon Athena to query data stored in S3, retrieve relevant information, and augment the query with the pre-generated embeddings.
In the RAG setup, when a user query comes in via the chatbot, generate an embedding for the query using the SageMaker model and find the most relevant pre-generated embeddings in DynamoDB (or by querying Athena if the data is in S3). The records behind the best-matching embeddings are then passed as context to the generative model to construct a response.
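A minimal sketch of that retrieval flow follows. It assumes two already-deployed SageMaker endpoints; the endpoint names, JSON payload shapes, and response keys below are hypothetical, since each JumpStart model defines its own request/response schema.

```python
# Minimal sketch of the RAG query flow: embed the question, rank stored
# records by cosine similarity, and prompt a generative endpoint.
# Endpoint names and payload/response shapes are hypothetical assumptions.
import json
import boto3
import numpy as np

runtime = boto3.client("sagemaker-runtime")

def embed(text: str) -> np.ndarray:
    resp = runtime.invoke_endpoint(
        EndpointName="embedding-endpoint",   # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),   # assumed payload shape
    )
    return np.array(json.loads(resp["Body"].read())["embedding"])  # assumed key

def top_k(query_vec: np.ndarray, records: list, k: int = 3) -> list:
    """Rank stored {'vector': ..., 'text': ...} records by cosine similarity."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return sorted(records, key=lambda r: cos(query_vec, r["vector"]), reverse=True)[:k]

def answer(question: str, records: list) -> str:
    context = "\n".join(r["text"] for r in top_k(embed(question), records))
    resp = runtime.invoke_endpoint(
        EndpointName="generation-endpoint",  # assumed endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": f"Context:\n{context}\n\nQuestion: {question}"}),
    )
    return json.loads(resp["Body"].read())["generated_text"]  # assumed key
```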
Step 5: Chatbot Interface with Amazon Lex and AWS Lambda
Use Amazon Lex to build the chatbot interface. Lex will handle user input and trigger the backend logic via AWS Lambda.
AWS Lambda will call the SageMaker endpoint to generate an embedding for the user's query and retrieve relevant data from your storage service (S3 or DynamoDB).
Lex will provide the user with responses, leveraging the embeddings and RAG model to ensure that the answers are both contextually relevant and accurate.
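A minimal sketch of the Lambda fulfillment hook for Lex V2 might look like this; retrieve_answer is a hypothetical helper standing in for the embedding and RAG lookup sketched earlier.

```python
# Minimal sketch of a Lex V2 fulfillment Lambda. retrieve_answer() is a
# hypothetical placeholder for the embedding + RAG lookup shown above.
def retrieve_answer(question: str) -> str:
    # In practice: embed the query, fetch nearest records from DynamoDB/S3,
    # and call the generative endpoint (see the RAG sketch above).
    return "Placeholder answer for: " + question

def lambda_handler(event, context):
    question = event.get("inputTranscript", "")
    intent = event["sessionState"]["intent"]["name"]

    answer_text = retrieve_answer(question)

    # Lex V2 expects this response shape to close the intent with a message.
    return {
        "sessionState": {
            "dialogAction": {"type": "Close"},
            "intent": {"name": intent, "state": "Fulfilled"},
        },
        "messages": [{"contentType": "PlainText", "content": answer_text}],
    }
```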
Service Selection and Cost Consideration:
By using Amazon SageMaker Canvas and AWS Data Wrangler, you significantly reduce custom development and the need for a complex machine learning pipeline.
Amazon DynamoDB and Amazon S3 are highly cost-effective, serverless storage options, allowing you to store embeddings and data at minimal cost.
SageMaker's managed endpoints for both embedding generation and RAG ensure that you only pay for what you use (with auto-scaling capabilities).
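As an illustration of the auto-scaling point, a target-tracking policy on the endpoint's invocations per instance can be configured through Application Auto Scaling; the endpoint and variant names below are hypothetical.

```python
# Minimal sketch: target-tracking auto-scaling for a SageMaker endpoint
# variant. Endpoint and variant names are hypothetical placeholders.
import boto3

autoscaling = boto3.client("application-autoscaling")
resource_id = "endpoint/generation-endpoint/variant/AllTraffic"

# Register the endpoint variant's instance count as a scalable target.
autoscaling.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Scale on invocations per instance so capacity (and cost) tracks load.
autoscaling.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 70.0,
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerVariantInvocationsPerInstance"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```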
Result:
By leveraging SageMaker Canvas and AWS Data Wrangler, you minimize customization and development time while achieving a scalable, low-cost solution for embedding generation and data retrieval in your chatbot. With Lex for chatbot interaction and Lambda for backend orchestration, this approach provides a manageable, cost-efficient solution.
AWS Services Involved:
AWS Data Wrangler (AWS SDK for pandas), Amazon SageMaker Canvas, Amazon SageMaker endpoints (JumpStart), Amazon S3, Amazon DynamoDB, Amazon Athena, Amazon Lex, and AWS Lambda.
This approach ensures minimal operational complexity while maintaining cost-efficiency, leveraging AWS's managed services to handle much of the complexity under the hood.
Engage with the latest in AWS and DevSecOps by subscribing to the newsletter and following for more insights.
Join the AWS DevSecOps Community: https://lnkd.in/dDsf4rCv
Follow on LinkedIn: https://lnkd.in/gy8xy2Gb
Book a 1:1 Connect: Ramandeep Chandna
Remember to like, share, and comment to help spread valuable knowledge further. Let's keep learning and growing together.