Building a Comprehensive Text Analysis & Retrieval-Augmented Generation (RAG) Pipeline: A Behind-the-Scenes Look
Introduction
Over the past few months, I’ve been steadily working on a comprehensive Machine Learning portfolio project that integrates text classification, summarization, unsupervised clustering, and a retrieval-augmented generation (RAG) system. It’s been a journey with plenty of ups and downs, but through it all, I’ve gained invaluable insights. In this blog post, I’ll take you behind the scenes, sharing the architecture, the tools, the code, and (most importantly) the lessons learned.
Part 1: Data Ingestion & Preprocessing
1.1 Project Motivation
I initially set out to build an end-to-end text analysis system that could handle raw PDF files, images (through OCR), and plain text documents, then classify them or summarize them. This quickly expanded into training a custom summarization model, experimenting with unsupervised clustering, and eventually creating a retrieval-augmented generation pipeline.
The why behind it is simple: many real-world tasks involve messy documents (scans, PDFs, text, etc.) that need classification, summarization, and advanced retrieval. Building all these components gave me a well-rounded portfolio piece and a ton of hands-on experience with different libraries and AWS services.
1.2 Data Extraction: PDF & Image Processing
One of the first challenges was ingesting documents in different formats (PDFs, images, text files). I wrote a Python script (preprocess_data.py) that combines OCR (Tesseract) for images with text extraction for PDFs and plain-text files, normalizing everything into a common representation.
I struggled a bit with Tesseract’s path configuration and with making sure I properly caught exceptions for broken PDFs. But after a series of debugging sessions, I got a robust pipeline that reliably extracts text from various file types and saves them as JSON (with optional metadata).
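For reference, here is a minimal sketch of that extraction flow, assuming pytesseract for OCR and pdfplumber for PDFs; the actual preprocess_data.py may use different libraries, and the directory paths are placeholders:

```python
import json
from pathlib import Path

import pdfplumber                      # assumption: the project may use PyPDF2 or similar instead
import pytesseract
from PIL import Image


def extract_text(path: Path) -> str:
    """Extract raw text from a PDF, image, or plain-text file."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        try:
            with pdfplumber.open(path) as pdf:
                return "\n".join(page.extract_text() or "" for page in pdf.pages)
        except Exception as exc:       # broken/corrupt PDFs are logged and skipped
            print(f"Skipping broken PDF {path}: {exc}")
            return ""
    if suffix in {".png", ".jpg", ".jpeg", ".tiff"}:
        return pytesseract.image_to_string(Image.open(path))
    return path.read_text(encoding="utf-8", errors="ignore")


def preprocess(input_dir: str, output_file: str) -> None:
    records = []
    for path in Path(input_dir).rglob("*"):
        if path.is_file():
            text = extract_text(path)
            if text.strip():
                records.append({"source": str(path), "text": text})
    Path(output_file).write_text(json.dumps(records, indent=2))


if __name__ == "__main__":
    preprocess("data/raw", "data/processed/documents.json")
```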
Key takeaway: Incorporating OCR from the start means you can handle a wide variety of real-world documents, but definitely test with sample data that covers every edge case (blank pages, rotated scans, etc.).
1.3 Text Augmentation
I added synonym-based augmentation (via nlpaug) so that smaller datasets could be expanded artificially. The function augment_text() randomly replaces up to 10% of words with synonyms. This helped me avoid overfitting on a small supervised dataset. Configuring the augmentation probability (e.g., aug_p=0.1) took some trial and error, as high augmentation rates can degrade data quality rather than help.
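The augment_text() helper in the repo likely wraps something like the following, a minimal sketch using nlpaug's WordNet synonym augmenter with aug_p=0.1 (the exact parameters in the project may differ):

```python
import nlpaug.augmenter.word as naw

# WordNet-based synonym replacement; aug_p=0.1 keeps changes modest (~10% of words)
aug = naw.SynonymAug(aug_src="wordnet", aug_p=0.1)

text = "The invoice total is due within thirty days of receipt."
augmented = aug.augment(text)          # recent nlpaug versions return a list of strings
print(augmented[0] if isinstance(augmented, list) else augmented)
```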
Lesson learned: Data augmentation can be powerful, but keep it modest. Over-augmenting can produce nonsensical text that hurts model performance.
1.4 Directory Structure
I separated the data into dedicated directories for each stage of the pipeline (raw documents, extracted JSON, and downstream outputs).
This structure kept my code organized and made it much easier to debug and maintain each piece.
Part 2: Building a Text Classification Pipeline
2.1 Data Preparation & Tokenization
I used the Hugging Face transformers library, specifically DistilBERT, to classify documents into categories like invoices, memos, and emails. The script data_prep.py handles loading the documents, building the label-to-ID mapping, tokenizing the text, and splitting it into train, validation, and test sets.
One big hurdle was ensuring consistent label mapping across train, validation, and test sets. I overcame this by saving a JSON label map (label_map.json) during training, then reloading it for evaluation.
Key tip: Always store your label-to-ID mapping. Otherwise, you’ll get mismatched predictions (e.g., “LABEL_2” might be a different class if you train a second time).
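Here is roughly what that looks like in practice (a sketch; the file name label_map.json matches the project, but the surrounding variable names are placeholders):

```python
import json

# train_docs is assumed to be the loaded training set, each doc carrying a string label
labels = sorted({doc["label"] for doc in train_docs})
label2id = {label: idx for idx, label in enumerate(labels)}

# Persist the mapping at training time...
with open("label_map.json", "w") as f:
    json.dump(label2id, f, indent=2)

# ...and reload the exact same mapping at evaluation/inference time
with open("label_map.json") as f:
    label2id = json.load(f)
id2label = {idx: label for label, idx in label2id.items()}
```

Passing id2label and label2id into the model config (via from_pretrained) also makes the endpoint return human-readable labels instead of LABEL_0, which comes up again in section 2.4.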
2.2 Model Training & Evaluation
The script train_distilbert.py orchestrates the fine-tuning and evaluation of the DistilBERT classifier.
I occasionally ran into GPU memory issues (especially when I forgot to reduce batch_size on a smaller GPU). Lowering the batch size and sequence length helped. Also, I discovered that using fp16 mixed-precision can speed up training significantly on compatible GPUs.
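For illustration, a sketch of the relevant TrainingArguments (the actual values in train_distilbert.py will differ depending on the GPU):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-doc-classifier",   # placeholder output path
    per_device_train_batch_size=8,            # lower this on smaller GPUs to avoid OOM
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,                                # mixed precision: noticeably faster on compatible GPUs
)
```

Sequence length is controlled at tokenization time (e.g., tokenizer(texts, truncation=True, max_length=256)), which is the other lever when memory gets tight.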
2.3 Model Deployment to SageMaker
I wrote deploy_model.py to package the trained DistilBERT model into a model.tar.gz and push it to S3, then create a SageMaker endpoint. The container environment variable HF_TASK="text-classification" told the Hugging Face inference container how to handle requests.
Lesson: The Hugging Face SageMaker integration is powerful but can be tricky if you deviate from the standard tasks. Make sure you set the environment variables (HF_TASK, transformers_version, pytorch_version or tensorflow_version) properly.
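The deployment boils down to something like this (a sketch using the SageMaker Hugging Face SDK; the S3 path, IAM role, instance type, and version pins are placeholders):

```python
from sagemaker.huggingface import HuggingFaceModel

# model.tar.gz (model + tokenizer) was uploaded to S3 beforehand
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/distilbert-classifier/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder IAM role
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={"HF_TASK": "text-classification"},
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```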
2.4 Inference Testing
Using test_summar_endpoint.py (though ironically named for summarization) or a similar approach, I tested classification by sending a JSON payload {"inputs":"Some text to classify"} to the endpoint. I found the returned label strings (like "LABEL_0") needed a friendly label map for final user-facing results.
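A minimal way to poke the endpoint from Python, assuming boto3 and a placeholder endpoint name:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="doc-classifier-endpoint",        # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": "Please find attached the invoice for April."}),
)
result = json.loads(response["Body"].read())
# e.g. [{"label": "LABEL_0", "score": 0.97}] -> map back to a friendly name via label_map.json
print(result)
```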
Part 3: Summarization with a Custom Seq2Seq Model
3.1 Summarization Data & BART
I decided to create a custom summarizer because I had an interesting variety of text documents (like invoices, shipping orders, news articles). Using the script generate_pseudo_summaries.py, I generated “silver-standard” or “pseudo” summaries for unlabeled text with a pretrained model (facebook/bart-large-cnn). This allowed me to have at least some form of target summaries to fine-tune on.
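Conceptually, generate_pseudo_summaries.py does something like this (a sketch; the generation parameters are illustrative):

```python
from transformers import pipeline

# Pretrained summarizer used to produce "silver-standard" targets for unlabeled text
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def pseudo_summary(text: str) -> str:
    # bart-large-cnn tops out around 1024 tokens, so long documents are truncated here
    return summarizer(text, max_length=128, min_length=30, truncation=True)[0]["summary_text"]
```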
Then, with train_summarizer.py, I fine-tuned a smaller BART or DistilBART on the pseudo-labeled data. This was a highlight: I learned to carefully handle max_input_length (to avoid out-of-memory errors) and to measure a basic validation metric.
3.2 Summarizer Deployment
Just like classification, I used a Hugging Face SageMaker container but set HF_TASK="summarization". The final endpoint accepted JSON {"inputs": "Some text"} and returned a {"summary_text": "..."} structure.
Hardest part: ensuring I saved both the model and the tokenizer. Missing the tokenizer can cause incorrect tokenization at inference time.
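The fix is a two-liner worth remembering (assuming model and tokenizer are the fine-tuned objects about to be packaged into model.tar.gz):

```python
# Both pieces must end up in the artifact that goes into model.tar.gz
model.save_pretrained("summarizer_model")
tokenizer.save_pretrained("summarizer_model")
```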
Part 4: Unsupervised Clustering (K-Means)
4.1 Why Clustering?
Not all documents were labeled. So, I used sentence-transformers embeddings to cluster the unlabeled set. This is in train_clustering/clustering_with_embeddings.py.
The biggest challenge was deciding k (the number of clusters). I used a combination of domain knowledge and the silhouette score to guess a good n_clusters. Ultimately, I stored the cluster assignments back into the JSON files ("predicted_cluster" and "predicted_label").
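A condensed sketch of that workflow (the embedding model name and the range of k are assumptions; clustering_with_embeddings.py may differ):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [doc["text"] for doc in unlabeled_docs]          # unlabeled_docs assumed loaded from JSON
model = SentenceTransformer("all-MiniLM-L6-v2")          # assumption: any sentence-transformers model works
embeddings = model.encode(texts, convert_to_numpy=True)

# Try a few values of k and keep the one with the best silhouette score
best_k, best_score = None, -1.0
for k in range(3, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

# Final clustering with the chosen k; write assignments back onto the documents
clusters = KMeans(n_clusters=best_k, random_state=42, n_init=10).fit_predict(embeddings)
for doc, cluster_id in zip(unlabeled_docs, clusters):
    doc["predicted_cluster"] = int(cluster_id)
```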
4.2 Deploying as a Training Job
I wanted to run the clustering on SageMaker, so I created run_clustering_job.py, a script that packages the code into a Docker container or uses the Hugging Face CPU/GPU container. This gave me exposure to running large embedding tasks on an ml.p3.2xlarge instance (GPU-based).
Pro tip: For large-scale embeddings, consider batching or using a multi-GPU approach. Sentence-Transformers is quite memory efficient, but you can still run out of memory if your dataset is huge.
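In practice that mostly means leaning on the batching parameters of encode(), for example:

```python
# Encoding in batches keeps GPU memory bounded even for large corpora
embeddings = model.encode(
    texts,
    batch_size=64,               # tune to your GPU; smaller batches trade speed for headroom
    show_progress_bar=True,
    convert_to_numpy=True,
)
```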
Part 5: Retrieval-Augmented Generation (RAG)
5.1 Combining Summaries & FAISS Index
One of the most exciting parts was building a RAG pipeline. I chunked large documents into smaller summaries (chunk_and_summarize.py), then stored them in a FAISS index (build_rag_index_multi.py). For each chunk, I stored its embedding in the index along with metadata kept in a parallel store, so every retrieved result can be traced back to its source document.
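Index construction looks roughly like this (a sketch: the encoder model, file names, and the "summary" metadata field are assumptions, not necessarily what build_rag_index_multi.py uses):

```python
import json

import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # assumption: same encoder as the clustering step

chunk_texts = [c["summary"] for c in chunks]             # chunks assumed loaded from the chunk_and_summarize output
embeddings = encoder.encode(chunk_texts, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])           # exact L2 search; fine for modest corpora
index.add(embeddings)

faiss.write_index(index, "rag_index.faiss")
with open("rag_metadata.json", "w") as f:                # keeps row i of the index aligned with chunk i
    json.dump(chunks, f, indent=2)
```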
5.2 Querying with a Generative Model
The script rag_query.py shows how I embed the user's question, retrieve the top-k most similar chunks from the FAISS index, and feed the retrieved context plus the question into a generative model (BART) to produce the final answer.
Gotcha: BART has a forced_bos_token_id by default, which can cause truncated outputs if you feed in custom prompts. Disabling that token ID in the config was crucial.
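Putting the pieces together, here is a sketch of the query path with that fix applied (file names and the prompt format are illustrative; rag_query.py is the source of truth):

```python
import json

import faiss
from sentence_transformers import SentenceTransformer
from transformers import BartForConditionalGeneration, BartTokenizer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("rag_index.faiss")
with open("rag_metadata.json") as f:
    chunks = json.load(f)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
generator = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
generator.config.forced_bos_token_id = None              # the gotcha above: disable forced BOS for custom prompts


def answer(question: str, k: int = 3) -> str:
    query_vec = encoder.encode([question], convert_to_numpy=True).astype("float32")
    _, ids = index.search(query_vec, k)                   # retrieve the k nearest chunks
    context = "\n".join(chunks[i]["summary"] for i in ids[0])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = generator.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```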
5.3 Dockerizing & SageMaker Inference
Finally, in the rag_deployment folder, I created a custom Docker image (a Dockerfile plus an app.py inference server) that bundles the embedding model, the FAISS index, and the generator for deployment on SageMaker.
Deploying a custom RAG solution was tricky because you can’t rely on a standard Hugging Face container for multi-step logic (embedding + retrieval + generation). But building your own Docker image with app.py was a powerful way to handle custom flows.
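The core of such an app.py is the SageMaker container contract: a /ping health check and an /invocations handler listening on port 8080. A minimal sketch (here with FastAPI; the real app.py in rag_deployment will differ, and the import of answer() refers to the helper sketched in 5.2):

```python
# app.py - minimal SageMaker-compatible inference server (sketch only)
from fastapi import FastAPI, Request
import uvicorn

from rag_query import answer   # assumption: the RAG helper sketched in 5.2 lives in rag_query.py

app = FastAPI()


@app.get("/ping")
def ping():
    # SageMaker health check: must return 200 when the container is ready
    return {"status": "ok"}


@app.post("/invocations")
async def invocations(request: Request):
    payload = await request.json()
    question = payload["inputs"]
    return {"answer": answer(question)}


if __name__ == "__main__":
    # SageMaker routes traffic to port 8080 inside the container
    uvicorn.run(app, host="0.0.0.0", port=8080)
```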
Part 6: Creating a Local Demo Server
6.1 The FastAPI Interface
To provide a user-friendly front-end, I built a small FastAPI server (server.py in the my_local_server folder). It exposes endpoints for uploading documents and for forwarding them to the classification, summarization, and RAG endpoints in the cloud.
Local testing was so much simpler with this approach. I struggled a bit with CORS issues and file upload logic, but the final result was a neat local app that calls out to the cloud.
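The CORS and upload pieces ended up looking roughly like this (a sketch; route and endpoint names are placeholders):

```python
import json

import boto3
from fastapi import FastAPI, File, UploadFile
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allowing the front-end origin fixed the CORS errors mentioned above
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],                 # fine for a local demo; restrict in production
    allow_methods=["*"],
    allow_headers=["*"],
)

runtime = boto3.client("sagemaker-runtime")


@app.post("/summarize")
async def summarize(file: UploadFile = File(...)):
    text = (await file.read()).decode("utf-8", errors="ignore")
    response = runtime.invoke_endpoint(
        EndpointName="summarizer-endpoint",          # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return json.loads(response["Body"].read())
```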
6.2 UI & Next Steps
Eventually, I plan to add file drag-and-drop, better error handling for huge files, and a streaming UI for RAG outputs. But for now, it’s a decent proof-of-concept.
Biggest Lessons Learned
The lessons that stuck with me most: keep data augmentation modest, because over-augmenting produces nonsensical text; always persist your label-to-ID mapping alongside the model; save the tokenizer together with the model, or inference will tokenize incorrectly; test OCR ingestion against real edge cases (blank pages, rotated scans, broken PDFs); and set the Hugging Face container environment variables (HF_TASK, framework versions) carefully when deploying to SageMaker.
Conclusion & Next Steps
This project taught me how to handle the full lifecycle of an NLP system—from dirty PDF ingestion, to classification, to summarization, to advanced retrieval-augmented question-answering. The system is certainly not perfect, but it’s a robust starting point that can handle varied document types, scale on AWS, and serve real-time predictions.
If you’re curious or want to try it yourself, feel free to explore my GitHub repository where the code resides (scripts are heavily commented). Connect with me on LinkedIn to see more frequent updates on new features, especially as I refine the user interface and add advanced capabilities like sentence-level search or multilingual support.
Thank you for reading, and I hope my learnings can help you in your own end-to-end NLP journey!