Building Production-Ready RAG Pipelines: A Comprehensive Guide from Design to Deployment

In the era of large language models (LLMs), Retrieval-Augmented Generation (RAG) has become a cornerstone for building intelligent applications that ground responses in custom knowledge bases. But what if your data isn’t just a few documents—it’s a massive corpus of unstructured text, like PDFs, emails, or scraped web content? Scaling RAG while maintaining accuracy, efficiency, and conversational flow is no small feat.

In this blog post, we’ll walk through creating a full-stack RAG application that:

- Ingests and processes large volumes of unstructured data using a fine-tuned embedding model.
- Stores vectors in Qdrant, a high-performance vector database.
- Retrieves relevant context, applies re-ranking for precision, and feeds it to an LLM for generation.
- Incorporates short-term (in-session) and long-term (persistent) memory for natural conversational chats.
- Uses Streamlit for an intuitive UI.
- Adds a “judge” model to validate responses for hallucination and relevance.

By the end, you’ll have a production-ready prototype that’s efficient for large-scale data and user-friendly. We’ll use Python, LangChain for orchestration, Hugging Face for embeddings, and open-source tools to keep it accessible.

Why This Stack?

Judge Model: An LLM-based evaluator (e.g., GPT-4o-mini) scores responses on faithfulness and completeness.
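To make the judge idea concrete, here is a minimal sketch of a score-and-threshold gate. It is not the repository's implementation: the prompt wording, the 0-10 scale, the `SCORE:` reply format, and the names `parse_score`, `judge`, and `call_llm` are all assumptions made for illustration. The default threshold of 6.0 matches the JUDGE_THRESHOLD setting listed in the environment variables of this post.

```python
import re
from typing import Callable, Optional

JUDGE_THRESHOLD = 6.0  # same value as the JUDGE_THRESHOLD setting in this post

# Hypothetical prompt; the 0-10 scale and reply format are assumptions.
JUDGE_PROMPT = (
    "Rate the ANSWER from 0 to 10 for faithfulness to the CONTEXT and "
    "completeness with respect to the QUESTION. Reply as 'SCORE: <number>'.\n"
    "QUESTION: {question}\nCONTEXT: {context}\nANSWER: {answer}"
)


def parse_score(reply: str) -> Optional[float]:
    """Extract the first 'SCORE: x' value from the judge's reply, if present."""
    match = re.search(r"SCORE:\s*(\d+(?:\.\d+)?)", reply)
    return float(match.group(1)) if match else None


def judge(
    question: str,
    context: str,
    answer: str,
    call_llm: Callable[[str], str],  # any function that sends a prompt to an LLM
    threshold: float = JUDGE_THRESHOLD,
) -> dict:
    """Ask the judge model for a score and compare it with the threshold."""
    prompt = JUDGE_PROMPT.format(question=question, context=context, answer=answer)
    score = parse_score(call_llm(prompt))
    # An unparseable reply counts as a failure rather than a silent pass.
    passed = score is not None and score >= threshold
    return {"score": score, "passed": passed}
```

Passing the LLM call in as a function keeps the gate independent of the provider, so the same logic would work with Ollama or a hosted model.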

Fine-tuned Embeddings: Off-the-shelf models like sentence-transformers/all-MiniLM-L6-v2 work well, but fine-tuning on your domain data can boost retrieval accuracy by around 10-20%.

Qdrant: Handles billions of vectors with fast similarity search; perfect for 10GB-scale ingestion.

Re-Ranking: Post-retrieval filtering (e.g., via Cohere’s re-ranker) ensures top-k results are truly relevant.
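As a rough illustration of post-retrieval re-ranking, the sketch below re-scores a list of retrieved candidates against the query and keeps the best ones. The `token_overlap` scorer is a toy stand-in chosen so the example runs without any model; a real pipeline would plug in a cross-encoder or a hosted re-ranker instead. The function names here are hypothetical.

```python
from typing import Callable, List, Tuple


def token_overlap(query: str, doc: str) -> float:
    """Toy relevance score: fraction of query tokens that appear in the document."""
    query_tokens = set(query.lower().split())
    doc_tokens = set(doc.lower().split())
    return len(query_tokens & doc_tokens) / len(query_tokens) if query_tokens else 0.0


def rerank(
    query: str,
    candidates: List[str],
    score_fn: Callable[[str, str], float] = token_overlap,
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Score every candidate against the query and return the top_k by score."""
    scored = [(doc, score_fn(query, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

The usual pattern is to retrieve generously from the vector store (say, a few dozen candidates) and let the re-ranker narrow them down to the handful that go into the prompt.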

Memory Layers: Short-term for chat history; long-term via a separate Qdrant collection for past conversations.
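For the short-term layer, one simple approach is a sliding window over the most recent turns. The class below is a minimal sketch under that assumption and is not taken from the repository; the `max_turns` parameter and method names are illustrative. Long-term memory, which this post stores in a separate Qdrant collection, is not shown here.

```python
from collections import deque
from typing import Deque, Dict, List


class ShortTermMemory:
    """Keeps only the most recent chat messages for the current session."""

    def __init__(self, max_turns: int = 5) -> None:
        # One turn is a user message plus an assistant message.
        self._messages: Deque[Dict[str, str]] = deque(maxlen=max_turns * 2)

    def add(self, role: str, content: str) -> None:
        """Append a message; the oldest one is dropped once the window is full."""
        self._messages.append({"role": role, "content": content})

    def as_messages(self) -> List[Dict[str, str]]:
        """Return the window in chronological order, ready to prepend to a prompt."""
        return list(self._messages)
```

Bounding the window keeps the prompt size predictable no matter how long the session runs.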

Streamlit UI: Quick to prototype interactive chats.

Prerequisites

1. Code Setup

Clone the repo from https://github.com/ekluvtech/fullrag and follow the README.txt file to set up the environment. For more detail on environment setup, please refer to my previous blogs.

2. Hardware Recommendations

- Ingestion (50GB+ unstructured data): 16GB+ RAM and a multi-core CPU (use MAX_WORKERS=8 for parallel chunking). A GPU (e.g., NVIDIA with 8GB VRAM) speeds up embedding computation.
- Runtime: 8GB RAM minimum for chat/inference; a GPU accelerates Ollama/LLM calls.
- Storage: ~100GB disk for models and indexed vectors (Qdrant compresses efficiently).
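The MAX_WORKERS setting mentioned above suggests a worker pool for the chunking stage. A minimal sketch of that pattern, assuming a thread pool and a hypothetical per-document `worker` function, could look like this:

```python
import os
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List, Optional, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def process_in_parallel(
    items: Iterable[T],
    worker: Callable[[T], R],
    max_workers: Optional[int] = None,
) -> List[R]:
    """Apply worker to every item using a pool sized by MAX_WORKERS."""
    # Fall back to the MAX_WORKERS environment variable, defaulting to 8.
    workers = max_workers or int(os.getenv("MAX_WORKERS", "8"))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the same order as the input items.
        return list(pool.map(worker, items))
```

Threads help most when the work is I/O-bound, such as reading files from disk. For CPU-heavy parsing in pure Python, `ProcessPoolExecutor` from the same module is the usual alternative.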

5. Environment Variables

Set these in a .env file or export them in your shell (the python-dotenv library is optional: pip install python-dotenv):
- APP_TITLE: Custom UI title (default: “Full RAG Chat”).
- CHUNK_SIZE=1200, CHUNK_OVERLAP=200: For text splitting.
- QDRANT_URL, QDRANT_API_KEY: DB connection.
- EMBEDDING_MODEL, EMBEDDING_DIM: Embedder setup.
- RERANKER_MODEL: Local reranker.
- LLM_PROVIDER=ollama, OLLAMA_HOST, OLLAMA_MODEL=llama3.2, LLM_TEMPERATURE=0.2.
- JUDGE_ENABLED=true, JUDGE_MODEL=llama3.2, JUDGE_THRESHOLD=6.0.
For 10GB scale: Increase MAX_WORKERS based on CPU cores to avoid bottlenecks.
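To show what CHUNK_SIZE and CHUNK_OVERLAP control, here is a simplified character-based splitter. It is only a sketch: the project uses LangChain for orchestration, and LangChain's splitters also try to break on separators such as paragraphs and sentences, which this version does not attempt. The function name `chunk_text` is illustrative.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 1200, chunk_overlap: int = 200) -> List[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With the defaults, each new chunk starts 1,000 characters after the previous one, so the last 200 characters of one chunk are repeated at the start of the next. The overlap reduces the chance that a sentence relevant to a query is cut in half at a chunk boundary.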

Below is a high-level design diagram that visualizes the complete architecture:

Step 1: Configs

Here are the configurations for all the services. Please customize and adjust them as needed to match your local environment for the LLM, Ollama, Qdrant, and Judge.
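As an illustration of how such a configuration module might be organized, the sketch below reads the environment variables listed earlier into a single immutable object. It is not the repository's config file. The `Settings` class and `load_settings` function are hypothetical names, and the fallback values for QDRANT_URL, OLLAMA_HOST, EMBEDDING_MODEL, and EMBEDDING_DIM are assumptions (the usual local ports for Qdrant and Ollama, and the 384-dimensional all-MiniLM-L6-v2 model mentioned above). The remaining defaults come from the variable list in this post.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

_TRUTHY = {"1", "true", "yes", "on"}


@dataclass(frozen=True)
class Settings:
    app_title: str
    chunk_size: int
    chunk_overlap: int
    qdrant_url: str
    qdrant_api_key: Optional[str]
    embedding_model: str
    embedding_dim: int
    ollama_host: str
    ollama_model: str
    llm_temperature: float
    judge_enabled: bool
    judge_model: str
    judge_threshold: float


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Build Settings from environment variables, falling back to defaults."""
    env = os.environ if env is None else env
    return Settings(
        app_title=env.get("APP_TITLE", "Full RAG Chat"),
        chunk_size=int(env.get("CHUNK_SIZE", "1200")),
        chunk_overlap=int(env.get("CHUNK_OVERLAP", "200")),
        qdrant_url=env.get("QDRANT_URL", "http://localhost:6333"),  # assumed default
        qdrant_api_key=env.get("QDRANT_API_KEY"),
        embedding_model=env.get(
            "EMBEDDING_MODEL", "sentence-transformers/all-MiniLM-L6-v2"  # assumed default
        ),
        embedding_dim=int(env.get("EMBEDDING_DIM", "384")),  # assumed default
        ollama_host=env.get("OLLAMA_HOST", "http://localhost:11434"),  # assumed default
        ollama_model=env.get("OLLAMA_MODEL", "llama3.2"),
        llm_temperature=float(env.get("LLM_TEMPERATURE", "0.2")),
        judge_enabled=env.get("JUDGE_ENABLED", "true").strip().lower() in _TRUTHY,
        judge_model=env.get("JUDGE_MODEL", "llama3.2"),
        judge_threshold=float(env.get("JUDGE_THRESHOLD", "6.0")),
    )
```

Accepting an optional mapping instead of always reading `os.environ` makes the loader easy to exercise in tests without touching the real environment.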

| Method | Input | Output | Description |
| --- | --- | --- | --- |
| __init__ | Optional model name | N/A | Loads the model on GPU or CPU; the model name defaults to the value from config. |
| embed_texts | List of strings | List of float vectors | Encodes texts in batches, normalizes the vectors, and converts the NumPy arrays to lists for storage. |
| embed_text | Single string | Single float vector | Convenience wrapper that embeds one text via the batch method. |

Step 2: Data Ingestion and Embeddings

IngestionPipeline

Ingest Method (Batch Mode)

ingest_stream Method (Streaming Mode)

_process_batch Method (Private Helper)

Key Components

Methods

Strengths

Step 3: Retrieval, Re-Ranking, and LLM Generation

rerank Method

build_messages Method

chat Method

Step 4: Adding Short- and Long-Term Memory for Conversational Chat

ShortTermMemory Class

LongTermMemory Class

Step 5: Judge Model for Response Validation

validate_response Method

Usage Example

Step 6: Streamlit UI for Interactive Chat

Session State Initialization

Service Initialization

Ingestion pipeline

Output

Run streamlit app

Output

Conclusion: