Building a Comprehensive Text Analysis & Retrieval-Augmented Generation (RAG) Pipeline: A Behind-the-Scenes Look
Introduction
Over the past few months, I’ve been steadily working on a comprehensive Machine Learning portfolio project that integrates text classification, summarization, unsupervised clustering, and a retrieval-augmented generation (RAG) system. It’s been a journey with plenty of ups and downs, but through it all, I’ve gained invaluable insights. In this blog post, I’ll take you behind the scenes, sharing the architecture, the tools, the code, and (most importantly) the lessons learned.
Part 1: Data Ingestion & Preprocessing
1.1 Project Motivation
I initially set out to build an end-to-end text analysis system that could handle raw PDF files, images (through OCR), and plain text documents, then classify them or summarize them. This quickly expanded into training a custom summarization model, experimenting with unsupervised clustering, and eventually creating a retrieval-augmented generation pipeline.
The why behind it is simple: many real-world tasks involve messy documents (scans, PDFs, text, etc.) that need classification, summarization, and advanced retrieval. Building all these components gave me a well-rounded portfolio piece and a ton of hands-on experience with different libraries and AWS services.
1.2 Data Extraction: PDF & Image Processing
One of the first challenges was ingesting documents in different formats (PDFs, images, text files). I wrote a Python script (preprocess_data.py) that combines OCR (Tesseract) for images with text extraction for PDFs and plain-text files, normalizing everything into a common representation.
I struggled a bit with Tesseract’s path configuration and with making sure I properly caught exceptions for broken PDFs. But after a series of debugging sessions, I got a robust pipeline that reliably extracts text from various file types and saves them as JSON (with optional metadata).
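For reference, here is a minimal sketch of that extraction flow, assuming pytesseract for OCR and pdfplumber for PDFs; the actual preprocess_data.py may use different libraries, and the directory paths are placeholders:

```python
import json
from pathlib import Path

import pdfplumber                      # assumption: the project may use PyPDF2 or similar instead
import pytesseract
from PIL import Image


def extract_text(path: Path) -> str:
    """Extract raw text from a PDF, image, or plain-text file."""
    suffix = path.suffix.lower()
    if suffix == ".pdf":
        try:
            with pdfplumber.open(path) as pdf:
                return "\n".join(page.extract_text() or "" for page in pdf.pages)
        except Exception as exc:       # broken/corrupt PDFs are logged and skipped
            print(f"Skipping broken PDF {path}: {exc}")
            return ""
    if suffix in {".png", ".jpg", ".jpeg", ".tiff"}:
        return pytesseract.image_to_string(Image.open(path))
    return path.read_text(encoding="utf-8", errors="ignore")


def preprocess(input_dir: str, output_file: str) -> None:
    records = []
    for path in Path(input_dir).rglob("*"):
        if path.is_file():
            text = extract_text(path)
            if text.strip():
                records.append({"source": str(path), "text": text})
    Path(output_file).write_text(json.dumps(records, indent=2))


if __name__ == "__main__":
    preprocess("data/raw", "data/processed/documents.json")
```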
Key takeaway: Incorporating OCR from the start means you can handle a wide variety of real-world documents, but definitely test with sample data that covers every edge case (blank pages, rotated scans, etc.).
1.3 Text Augmentation
I added synonym-based augmentation (via nlpaug) so that smaller datasets could be expanded artificially. The function augment_text() randomly replaces up to 10% of words with synonyms. This helped me avoid overfitting on a small supervised dataset. Configuring the augmentation probability (e.g., aug_p=0.1) took some trial and error, as high augmentation rates can degrade data quality rather than help.
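The augment_text() helper in the repo likely wraps something like the following, a minimal sketch using nlpaug's WordNet synonym augmenter with aug_p=0.1 (the exact parameters in the project may differ):

```python
import nlpaug.augmenter.word as naw

# WordNet-based synonym replacement; aug_p=0.1 keeps changes modest (~10% of words)
aug = naw.SynonymAug(aug_src="wordnet", aug_p=0.1)

text = "The invoice total is due within thirty days of receipt."
augmented = aug.augment(text)          # recent nlpaug versions return a list of strings
print(augmented[0] if isinstance(augmented, list) else augmented)
```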
Lesson learned: Data augmentation can be powerful, but keep it modest. Over-augmenting can produce nonsensical text that hurts model performance.
1.4 Directory Structure
I separated the data into dedicated directories for each stage of the pipeline (raw documents, extracted JSON, and downstream outputs).
This structure kept my code organized and made it much easier to debug and maintain each piece.
Part 2: Building a Text Classification Pipeline
2.1 Data Preparation & Tokenization
I used the Hugging Face transformers library, specifically DistilBERT, to classify documents into categories like invoices, memos, and emails. The script data_prep.py handles loading the documents, building the label-to-ID mapping, tokenizing the text, and splitting it into train, validation, and test sets.
One big hurdle was ensuring consistent label mapping across train, validation, and test sets. I overcame this by saving a JSON label map (label_map.json) during training, then reloading it for evaluation.
Key tip: Always store your label-to-ID mapping. Otherwise, you’ll get mismatched predictions (e.g., “LABEL_2” might be a different class if you train a second time).
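Here is roughly what that looks like in practice (a sketch; the file name label_map.json matches the project, but the surrounding variable names are placeholders):

```python
import json

# train_docs is assumed to be the loaded training set, each doc carrying a string label
labels = sorted({doc["label"] for doc in train_docs})
label2id = {label: idx for idx, label in enumerate(labels)}

# Persist the mapping at training time...
with open("label_map.json", "w") as f:
    json.dump(label2id, f, indent=2)

# ...and reload the exact same mapping at evaluation/inference time
with open("label_map.json") as f:
    label2id = json.load(f)
id2label = {idx: label for label, idx in label2id.items()}
```

Passing id2label and label2id into the model config (via from_pretrained) also makes the endpoint return human-readable labels instead of LABEL_0, which comes up again in section 2.4.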
2.2 Model Training & Evaluation
The script train_distilbert.py orchestrates the fine-tuning and evaluation of the DistilBERT classifier.
I occasionally ran into GPU memory issues (especially when I forgot to reduce batch_size on a smaller GPU). Lowering the batch size and sequence length helped. Also, I discovered that using fp16 mixed-precision can speed up training significantly on compatible GPUs.
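For illustration, a sketch of the relevant TrainingArguments (the actual values in train_distilbert.py will differ depending on the GPU):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="distilbert-doc-classifier",   # placeholder output path
    per_device_train_batch_size=8,            # lower this on smaller GPUs to avoid OOM
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,                                # mixed precision: noticeably faster on compatible GPUs
)
```

Sequence length is controlled at tokenization time (e.g., tokenizer(texts, truncation=True, max_length=256)), which is the other lever when memory gets tight.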
2.3 Model Deployment to SageMaker
I wrote deploy_model.py to package the trained DistilBERT model into a model.tar.gz and push it to S3, then create a SageMaker endpoint. The container environment variable HF_TASK="text-classification" told the Hugging Face inference container how to handle requests.
Lesson: The Hugging Face SageMaker integration is powerful but can be tricky if you deviate from the standard tasks. Make sure you set the environment variables (HF_TASK, transformers_version, pytorch_version or tensorflow_version) properly.
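The deployment boils down to something like this (a sketch using the SageMaker Hugging Face SDK; the S3 path, IAM role, instance type, and version pins are placeholders):

```python
from sagemaker.huggingface import HuggingFaceModel

# model.tar.gz (model + tokenizer) was uploaded to S3 beforehand
huggingface_model = HuggingFaceModel(
    model_data="s3://my-bucket/distilbert-classifier/model.tar.gz",
    role="arn:aws:iam::123456789012:role/MySageMakerRole",   # placeholder IAM role
    transformers_version="4.26",
    pytorch_version="1.13",
    py_version="py39",
    env={"HF_TASK": "text-classification"},
)

predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
```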
2.4 Inference Testing
Using test_summar_endpoint.py (though ironically named for summarization) or a similar approach, I tested classification by sending a JSON payload {"inputs":"Some text to classify"} to the endpoint. I found the returned label strings (like "LABEL_0") needed a friendly label map for final user-facing results.
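A minimal way to poke the endpoint from Python, assuming boto3 and a placeholder endpoint name:

```python
import json

import boto3

runtime = boto3.client("sagemaker-runtime")

response = runtime.invoke_endpoint(
    EndpointName="doc-classifier-endpoint",        # placeholder endpoint name
    ContentType="application/json",
    Body=json.dumps({"inputs": "Please find attached the invoice for April."}),
)
result = json.loads(response["Body"].read())
# e.g. [{"label": "LABEL_0", "score": 0.97}] -> map back to a friendly name via label_map.json
print(result)
```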
Part 3: Summarization with a Custom Seq2Seq Model
3.1 Summarization Data & BART
I decided to create a custom summarizer because I had an interesting variety of text documents (like invoices, shipping orders, news articles). Using the script generate_pseudo_summaries.py, I generated “silver-standard” or “pseudo” summaries for unlabeled text with a pretrained model (facebook/bart-large-cnn). This allowed me to have at least some form of target summaries to fine-tune on.
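Conceptually, generate_pseudo_summaries.py does something like this (a sketch; the generation parameters are illustrative):

```python
from transformers import pipeline

# Pretrained summarizer used to produce "silver-standard" targets for unlabeled text
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")


def pseudo_summary(text: str) -> str:
    # bart-large-cnn tops out around 1024 tokens, so long documents are truncated here
    return summarizer(text, max_length=128, min_length=30, truncation=True)[0]["summary_text"]
```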
Then, with train_summarizer.py, I fine-tuned a smaller BART or DistilBART on the pseudo-labeled data. This was a highlight: I learned to carefully handle max_input_length (to avoid out-of-memory errors) and to measure a basic validation metric.
3.2 Summarizer Deployment
Just like classification, I used a Hugging Face SageMaker container but set HF_TASK="summarization". The final endpoint accepted JSON {"inputs": "Some text"} and returned a {"summary_text": "..."} structure.
Hardest part: ensuring I saved both the model and the tokenizer. Missing the tokenizer can cause incorrect tokenization at inference time.
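The fix is a two-liner worth remembering (assuming model and tokenizer are the fine-tuned objects about to be packaged into model.tar.gz):

```python
# Both pieces must end up in the artifact that goes into model.tar.gz
model.save_pretrained("summarizer_model")
tokenizer.save_pretrained("summarizer_model")
```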
Part 4: Unsupervised Clustering (K-Means)
4.1 Why Clustering?
Not all documents were labeled. So, I used sentence-transformers embeddings to cluster the unlabeled set. This is in train_clustering/clustering_with_embeddings.py.
The biggest challenge was deciding k (the number of clusters). I used a combination of domain knowledge and the silhouette score to guess a good n_clusters. Ultimately, I stored the cluster assignments back into the JSON files ("predicted_cluster" and "predicted_label").
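A condensed sketch of that workflow (the embedding model name and the range of k are assumptions; clustering_with_embeddings.py may differ):

```python
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

texts = [doc["text"] for doc in unlabeled_docs]          # unlabeled_docs assumed loaded from JSON
model = SentenceTransformer("all-MiniLM-L6-v2")          # assumption: any sentence-transformers model works
embeddings = model.encode(texts, convert_to_numpy=True)

# Try a few values of k and keep the one with the best silhouette score
best_k, best_score = None, -1.0
for k in range(3, 11):
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(embeddings)
    score = silhouette_score(embeddings, labels)
    if score > best_score:
        best_k, best_score = k, score

# Final clustering with the chosen k; write assignments back onto the documents
clusters = KMeans(n_clusters=best_k, random_state=42, n_init=10).fit_predict(embeddings)
for doc, cluster_id in zip(unlabeled_docs, clusters):
    doc["predicted_cluster"] = int(cluster_id)
```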
4.2 Deploying as a Training Job
I wanted to run the clustering on SageMaker, so I created run_clustering_job.py, a script that packages the code into a Docker container or uses the Hugging Face CPU/GPU container. This gave me exposure to running large embedding tasks on an ml.p3.2xlarge instance (GPU-based).
Pro tip: For large-scale embeddings, consider batching or using a multi-GPU approach. Sentence-Transformers is quite memory efficient, but you can still run out of memory if your dataset is huge.
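In practice that mostly means leaning on the batching parameters of encode(), for example:

```python
# Encoding in batches keeps GPU memory bounded even for large corpora
embeddings = model.encode(
    texts,
    batch_size=64,               # tune to your GPU; smaller batches trade speed for headroom
    show_progress_bar=True,
    convert_to_numpy=True,
)
```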
Part 5: Retrieval-Augmented Generation (RAG)
5.1 Combining Summaries & FAISS Index
One of the most exciting parts was building a RAG pipeline. I chunked large documents into smaller summaries (chunk_and_summarize.py), then stored them in a FAISS index (build_rag_index_multi.py). For each chunk, I stored its embedding in the index along with metadata kept in a parallel store, so every retrieved result can be traced back to its source document.
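Index construction looks roughly like this (a sketch: the encoder model, file names, and the "summary" metadata field are assumptions, not necessarily what build_rag_index_multi.py uses):

```python
import json

import faiss
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")        # assumption: same encoder as the clustering step

chunk_texts = [c["summary"] for c in chunks]             # chunks assumed loaded from the chunk_and_summarize output
embeddings = encoder.encode(chunk_texts, convert_to_numpy=True).astype("float32")

index = faiss.IndexFlatL2(embeddings.shape[1])           # exact L2 search; fine for modest corpora
index.add(embeddings)

faiss.write_index(index, "rag_index.faiss")
with open("rag_metadata.json", "w") as f:                # keeps row i of the index aligned with chunk i
    json.dump(chunks, f, indent=2)
```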
5.2 Querying with a Generative Model
The script rag_query.py shows how I embed the user's question, retrieve the top-k most similar chunks from the FAISS index, and feed the retrieved context plus the question into a generative model (BART) to produce the final answer.
Gotcha: BART has a forced_bos_token_id by default, which can cause truncated outputs if you feed in custom prompts. Disabling that token ID in the config was crucial.
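Putting the pieces together, here is a sketch of the query path with that fix applied (file names and the prompt format are illustrative; rag_query.py is the source of truth):

```python
import json

import faiss
from sentence_transformers import SentenceTransformer
from transformers import BartForConditionalGeneration, BartTokenizer

encoder = SentenceTransformer("all-MiniLM-L6-v2")
index = faiss.read_index("rag_index.faiss")
with open("rag_metadata.json") as f:
    chunks = json.load(f)

tokenizer = BartTokenizer.from_pretrained("facebook/bart-large-cnn")
generator = BartForConditionalGeneration.from_pretrained("facebook/bart-large-cnn")
generator.config.forced_bos_token_id = None              # the gotcha above: disable forced BOS for custom prompts


def answer(question: str, k: int = 3) -> str:
    query_vec = encoder.encode([question], convert_to_numpy=True).astype("float32")
    _, ids = index.search(query_vec, k)                   # retrieve the k nearest chunks
    context = "\n".join(chunks[i]["summary"] for i in ids[0])
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=1024)
    output_ids = generator.generate(**inputs, max_new_tokens=128, num_beams=4)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```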
5.3 Dockerizing & SageMaker Inference
Finally, in the rag_deployment folder, I created a custom Docker image (a Dockerfile plus an app.py inference server) that bundles the embedding model, the FAISS index, and the generator for deployment on SageMaker.
Deploying a custom RAG solution was tricky because you can’t rely on a standard Hugging Face container for multi-step logic (embedding + retrieval + generation). But building your own Docker image with app.py was a powerful way to handle custom flows.
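The core of such an app.py is the SageMaker container contract: a /ping health check and an /invocations handler listening on port 8080. A minimal sketch (here with FastAPI; the real app.py in rag_deployment will differ, and the import of answer() refers to the helper sketched in 5.2):

```python
# app.py - minimal SageMaker-compatible inference server (sketch only)
from fastapi import FastAPI, Request
import uvicorn

from rag_query import answer   # assumption: the RAG helper sketched in 5.2 lives in rag_query.py

app = FastAPI()


@app.get("/ping")
def ping():
    # SageMaker health check: must return 200 when the container is ready
    return {"status": "ok"}


@app.post("/invocations")
async def invocations(request: Request):
    payload = await request.json()
    question = payload["inputs"]
    return {"answer": answer(question)}


if __name__ == "__main__":
    # SageMaker routes traffic to port 8080 inside the container
    uvicorn.run(app, host="0.0.0.0", port=8080)
```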
Part 6: Creating a Local Demo Server
6.1 The FastAPI Interface
To provide a user-friendly front-end, I built a small FastAPI server (server.py in the my_local_server folder). It exposes endpoints for uploading documents and for forwarding them to the classification, summarization, and RAG endpoints in the cloud.
Local testing was so much simpler with this approach. I struggled a bit with CORS issues and file upload logic, but the final result was a neat local app that calls out to the cloud.
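The CORS and upload pieces ended up looking roughly like this (a sketch; route and endpoint names are placeholders):

```python
import json

import boto3
from fastapi import FastAPI, File, UploadFile
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Allowing the front-end origin fixed the CORS errors mentioned above
app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],                 # fine for a local demo; restrict in production
    allow_methods=["*"],
    allow_headers=["*"],
)

runtime = boto3.client("sagemaker-runtime")


@app.post("/summarize")
async def summarize(file: UploadFile = File(...)):
    text = (await file.read()).decode("utf-8", errors="ignore")
    response = runtime.invoke_endpoint(
        EndpointName="summarizer-endpoint",          # placeholder endpoint name
        ContentType="application/json",
        Body=json.dumps({"inputs": text}),
    )
    return json.loads(response["Body"].read())
```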
6.2 UI & Next Steps
Eventually, I plan to add file drag-and-drop, better error handling for huge files, and a streaming UI for RAG outputs. But for now, it’s a decent proof-of-concept.
Biggest Lessons Learned
The lessons that stuck with me most: keep data augmentation modest, because over-augmenting produces nonsensical text; always persist your label-to-ID mapping alongside the model; save the tokenizer together with the model, or inference will tokenize incorrectly; test OCR ingestion against real edge cases (blank pages, rotated scans, broken PDFs); and set the Hugging Face container environment variables (HF_TASK, framework versions) carefully when deploying to SageMaker.
Conclusion & Next Steps
This project taught me how to handle the full lifecycle of an NLP system—from dirty PDF ingestion, to classification, to summarization, to advanced retrieval-augmented question-answering. The system is certainly not perfect, but it’s a robust starting point that can handle varied document types, scale on AWS, and serve real-time predictions.
If you’re curious or want to try it yourself, feel free to explore my GitHub repository where the code resides (scripts are heavily commented). Connect with me on LinkedIn to see more frequent updates on new features, especially as I refine the user interface and add advanced capabilities like sentence-level search or multilingual support.
Thank you for reading, and I hope my learnings can help you in your own end-to-end NLP journey!