Why an LLM alone is not enough
Modern LLMs (GPT-5.2, Claude Sonnet 5, Gemini 3 Pro) carry vast general knowledge, but they have two structural limits that block enterprise deployment.
- Knowledge cutoff. The model only knows what was in its training data. Anything that happened after the cutoff is invisible.
- Hallucination. When the model does not know the answer, it often invents one — confidently and convincingly, but wrong.
In a business setting that is unacceptable. If your chatbot tells a customer the wrong return policy or cites a policy document that does not exist, that is a real liability.
RAG (Retrieval-Augmented Generation) addresses both limits directly: instead of answering from memory, the model first retrieves relevant documents from your own knowledge base, then generates an answer grounded in those sources. Rather than inventing an answer, it works from text it can cite.
The RAG pipeline, step by step
A RAG system has two phases: indexing (one-time per document) and querying (every time a user asks a question).
Phase 1 — Indexing
The flow: Documents → Chunking → Embedding → Vector database.
Document ingestion
First, gather the documents that form the knowledge base. Common sources:
- PDFs (policies, contracts, technical documentation)
- Web pages (help center, FAQ)
- Confluence and Notion pages
- Curated email archives
- Database records
Each source needs its own loader. LangChain and LlamaIndex both ship a wide library of built-in loaders, so you rarely write one from scratch.
Chunking
The LLM context window is finite, and retrieval accuracy collapses when you store huge unbroken text blocks. The fix: split documents into smaller pieces called chunks.
| Strategy | Description | When to use |
|---|---|---|
| Fixed-size | 512 tokens, 10-20% overlap | General purpose, good starting point |
| Semantic | Groups sentences by similarity | Mixed-content documents |
| Recursive | Splits at chapter → paragraph → sentence | Structured documents |
| Agentic | LLM decides split points | Complex, varied corpora |
| Late chunking | Embed full doc first, then split | Context-sensitive content |
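As a rough illustration of the fixed-size strategy in the table, the sketch below slides a window over a token list and lets consecutive chunks share an overlap. The function names `chunk_tokens` and `chunk_text` are hypothetical, and whitespace-separated words stand in for real model tokens, which is an assumption made only to keep the example self-contained.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.1):
    """Split a token list into fixed-size chunks that share an overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


def chunk_text(text, chunk_size=512, overlap_ratio=0.1):
    # Assumption: whitespace-separated words approximate model tokens.
    words = text.split()
    return [" ".join(c) for c in chunk_tokens(words, chunk_size, overlap_ratio)]


print(chunk_text("a b c d e", chunk_size=2, overlap_ratio=0.5))
# ['a b', 'b c', 'c d', 'd e']
```

In a real pipeline you would count tokens with the tokenizer of your embedding model, since word counts and token counts can differ noticeably.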
Embedding generation
Each chunk is converted by an embedding model into a vector, a dense numerical representation of the text's meaning. Semantically similar text ends up close together in vector space.
Popular embedding models in 2026:
- OpenAI text-embedding-3-large: 3072 dimensions, excellent quality, paid API.
- OpenAI text-embedding-3-small: 1536 dimensions, strong cost-quality balance.
- Cohere Embed v4: multimodal (text + image), 128K context, 100+ languages.
- Voyage AI voyage-3-large: SOTA across 100 datasets, Matryoshka dimensions (256-2048), quantization support.
- Open-source: nomic-embed-text-v2-moe (MoE, multilingual), BGE-M3 (dense + sparse + multi-vector retrieval, 100+ languages).
Vector database
Embeddings live in a vector database that supports fast similarity search (cosine similarity, dot product, or Euclidean distance).
| Database | Type | Strength | When to choose |
|---|---|---|---|
| Pinecone | Managed SaaS | Scaling, simplicity | When you do not want to manage infrastructure |
| Qdrant | Open-source | Performance, filtering | When speed and self-hosting matter |
| Weaviate | Open-source | Hybrid search, modular | When keyword + vector search is required |
| ChromaDB | Open-source | Simplicity, dev-friendly | Prototyping, small to mid scale |
| Milvus | Open-source | Enterprise scale, 35K+ stars | Large datasets, enterprise |
| pgvector | PostgreSQL ext. | Reuses existing PG infra | When PostgreSQL is already in place |
Benchmark numbers (2026): Pinecone reaches roughly 50,000 insertions/s and 5,000 queries/s. Qdrant lands close behind at ~45,000 / ~4,500. ChromaDB wins on developer convenience but tops out around 25,000 / 2,000 at scale.
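To make the similarity search concrete, here is a minimal brute-force sketch using cosine similarity over a tiny in-memory index. The three-dimensional vectors and the chunk ids are invented for illustration, and the helper names are hypothetical. Production vector databases typically rely on approximate nearest-neighbor indexes rather than a full scan, so treat this as a picture of the idea, not of any specific engine.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=3):
    """index maps chunk_id -> vector; returns [(chunk_id, score)], best first."""
    scored = [(cid, cosine_similarity(query_vec, vec)) for cid, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (made up); real models produce hundreds or thousands of dimensions.
index = {
    "returns": [0.9, 0.1, 0.0],
    "shipping": [0.1, 0.9, 0.0],
    "privacy": [0.0, 0.1, 0.9],
}
best = top_k([0.8, 0.2, 0.0], index, k=2)
print([chunk_id for chunk_id, _ in best])  # ['returns', 'shipping']
```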
Phase 2 — Querying
The flow: Question → Embedding → Vector search → Top-K documents → LLM response.
- Embed the user's question with the same model used during indexing.
- Vector search: find the most similar chunks, usually top 3-10.
- Insert the retrieved chunks as context in the LLM prompt.
- The LLM generates an answer grounded in that context.
# Simple RAG query (pseudocode)
question = "What is our return policy?"
# 1. Question embedding
question_embedding = embed(question)
# 2. Vector search
relevant_chunks = vector_db.search(question_embedding, top_k=5)
# 3. Prompt construction (join the chunk texts; assumes each result is a string)
context = "\n\n".join(relevant_chunks)
prompt = f"""Answer the question based on the following documents.
If the documents don't contain the answer, say you don't know.
Documents:
{context}
Question: {question}"""
# 4. LLM response generation
answer = llm.generate(prompt)
Hybrid search: vector + keyword
Pure vector search is not always enough. When a user searches for an exact term — say "GDPR Article 17" — semantic search may miss it, because vector similarity optimizes for meaning, not exact word matches.
Hybrid search combines two retrievers:
- Vector search for semantic similarity
- Keyword search (BM25) for exact term matches
The two result sets are merged with reciprocal rank fusion (RRF) or a weighted average. Both Weaviate and Qdrant support hybrid search natively.
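Reciprocal rank fusion can be sketched in a few lines: each document earns 1 / (k + rank) from every ranked list it appears in, and the totals decide the final order. The function name is hypothetical, and k = 60 is a commonly used default rather than a value taken from this guide.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; each list is ordered best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Documents that rank high in several lists accumulate the most score.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)


vector_hits = ["a", "b", "c"]   # hypothetical result of the vector retriever
keyword_hits = ["b", "d", "a"]  # hypothetical result of the BM25 retriever
print(reciprocal_rank_fusion([vector_hits, keyword_hits]))
# ['b', 'a', 'd', 'c']
```

Because only ranks are used, the two retrievers do not need comparable score scales, which is one reason RRF is a convenient default for merging.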
Reranking: the second filter
Vector search is fast but not always precise. Reranking adds a second, more accurate model that re-scores the retrieved documents against the original question and keeps only the best ones.
Popular rerankers:
- Cohere Rerank v3.5 — SaaS, simple API, excellent quality, supports reasoning.
- BGE Reranker v2 — open-source, solid value.
- Cross-encoder models — slower than bi-encoder embeddings but more accurate.
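The two-stage shape of reranking can be shown without calling any particular service. In the sketch below, `rerank` takes the candidates from the first retrieval stage and re-scores each one against the question with a pluggable `score_fn`. The `overlap_score` function is only a stand-in for a real cross-encoder or hosted reranker; both names are hypothetical.

```python
def rerank(question, candidates, score_fn, top_n=3):
    """Re-score first-stage candidates against the question; keep the best top_n."""
    scored = [(doc, score_fn(question, doc)) for doc in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]


def overlap_score(question, doc):
    # Stand-in scorer (assumption): fraction of question words present in the doc.
    # A real reranker would run a cross-encoder over the (question, doc) pair.
    q_words = set(question.lower().split())
    d_words = set(doc.lower().split())
    return len(q_words & d_words) / len(q_words) if q_words else 0.0


candidates = [
    "Our shipping policy covers Europe",
    "The return policy for shoes is 30 days",
    "Shoes are available in many sizes",
]
print(rerank("return policy for shoes", candidates, overlap_score, top_n=1))
# ['The return policy for shoes is 30 days']
```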
Advanced RAG patterns
Multi-step RAG
User questions are not always clean. Multi-step RAG first rewrites or expands the query, then runs multiple retrieval rounds.
Original question: "What's happening with the project?"
→ Rewritten query 1: "What is the current status of the Alpha project?"
→ Rewritten query 2: "What are the open tasks for the Alpha project?"
→ Unified answer based on both searches
Self-RAG
The LLM evaluates its own response — does the retrieved context really support the answer, and does the answer faithfully reflect the sources? If not, it re-retrieves or self-corrects.
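One way to picture this loop is shown below. It is a minimal sketch under stated assumptions: `retrieve`, `generate`, and `is_supported` are hypothetical callables you would back with your own retriever and LLM calls, and widening the search by doubling `top_k` is just one possible retry strategy, not a description of any specific Self-RAG implementation.

```python
def self_rag(question, retrieve, generate, is_supported, max_rounds=3):
    """Retrieve, answer, self-check; retry with a wider search if unsupported."""
    top_k = 5
    for _ in range(max_rounds):
        context = retrieve(question, top_k)    # list of chunk texts
        answer = generate(question, context)   # draft answer from the LLM
        if is_supported(answer, context):      # self-check against the sources
            return answer
        top_k *= 2  # assumption: widen retrieval before the next attempt
    return "I don't know."  # no draft passed the check within max_rounds
```

In practice `is_supported` is usually another LLM call that judges whether the retrieved context backs the draft answer, so each extra round adds latency and cost.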
Agentic RAG
The agent decides the search strategy on its own: which sources to query, what filters to apply, and whether multi-round retrieval is needed. It combines the ReAct pattern with RAG.
Graph RAG
Graph RAG models the relationships between documents as a knowledge graph. It is useful when entity connections matter, for example "Who works on the Alpha project?", where the answer is scattered across many documents.
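A toy version of that idea: facts extracted from different documents are stored as (subject, relation, object) triples, and a question about one entity becomes a lookup across all of them. The people, the second project, and the helper names below are invented for illustration, and the extraction step itself (usually done by an LLM) is assumed to have already happened.

```python
from collections import defaultdict


def build_graph(triples):
    """Index (subject, relation, object) triples by (relation, object)."""
    graph = defaultdict(set)
    for subject, relation, obj in triples:
        graph[(relation, obj)].add(subject)
    return graph


def who_works_on(graph, project):
    # .get avoids creating empty entries for unknown projects
    return sorted(graph.get(("works_on", project), set()))


# Each triple is assumed to come from a different source document.
triples = [
    ("Ana", "works_on", "Alpha"),  # e.g. from a status report
    ("Ben", "works_on", "Alpha"),  # e.g. from a meeting note
    ("Ben", "works_on", "Beta"),   # e.g. from a planning page
]
graph = build_graph(triples)
print(who_works_on(graph, "Alpha"))  # ['Ana', 'Ben']
```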
Evaluation: how do you know it works?
Evaluating a RAG system is not trivial. Four metrics matter most.
| Metric | Question it answers |
|---|---|
| Faithfulness | Does the answer match the retrieved sources, or is the LLM hallucinating? |
| Answer relevancy | Does the response actually address the question? |
| Context precision | Are the retrieved documents actually relevant? |
| Context recall | Did the system find all the relevant documents? |
Frameworks like RAGAS and LangSmith automate evaluation across these metrics.
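For intuition, the two retrieval-side metrics can be approximated with plain set arithmetic when you have a labeled list of relevant document ids for a question. This is a simplified sketch, not how RAGAS or LangSmith compute their scores; the function names and ids are hypothetical.

```python
def context_precision(retrieved_ids, relevant_ids):
    """Share of retrieved documents that are actually relevant."""
    if not retrieved_ids:
        return 0.0
    relevant = set(relevant_ids)
    hits = sum(1 for doc_id in retrieved_ids if doc_id in relevant)
    return hits / len(retrieved_ids)


def context_recall(retrieved_ids, relevant_ids):
    """Share of relevant documents that the retriever found."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    return len(set(retrieved_ids) & relevant) / len(relevant)


retrieved = ["d1", "d2", "d3", "d4"]  # what the system returned
relevant = ["d1", "d3", "d7"]         # hand-labeled ground truth
print(context_precision(retrieved, relevant))  # 0.5
print(round(context_recall(retrieved, relevant), 2))  # 0.67
```

Faithfulness and answer relevancy need a judgment about the generated text itself, which is why evaluation frameworks typically use an LLM as the judge for those.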
Practical example: a 200-person knowledge base
A concrete reference build for a RAG-powered internal knowledge base at a 200-person company.
1. Data sources
- 500+ internal documents (PDF, Word)
- Confluence wiki (300+ pages)
- Historical support tickets (5,000+)
- Curated email archive
2. Tech stack
- Embedding: OpenAI text-embedding-3-small (cost-effective)
- Vector DB: Qdrant (self-hosted, Docker)
- LLM: Claude Sonnet 5 (strong reasoning)
- Framework: LangChain v0.3 + FastAPI
- Frontend: custom chat interface
3. Pipeline
Confluence API + PDFs + Support tickets
↓
Chunking (512 tokens, 10% overlap)
↓
Embedding generation (text-embedding-3-small)
↓
Qdrant vector DB
↓
Hybrid search (vector + BM25)
↓
Reranking (Cohere Rerank v3.5)
↓
LLM response generation (Claude Sonnet 5)
↓
Response + source citations
4. Monthly cost estimate
| Item | Monthly cost (USD) |
|---|---|
| Qdrant hosting (VPS) | $30 – $50 |
| OpenAI embedding API | $10 – $30 (500 docs) |
| Claude API | $50 – $200 (100-500 q/day) |
| Total | $90 – $280 |
That is a small fraction of what a single dedicated support hire would cost — and it serves the whole company at once.
- 20-30%: answer relevancy lift from reranking
- 50,000/s: Pinecone insert throughput (2026 benchmarks)
- $90-280: monthly cost for a 200-person knowledge base
When NOT to use RAG
RAG is not the right tool for every problem. Skip it when:
- Questions are simple and static. If a plain FAQ page solves the problem, a RAG system is overkill.
- You have very little data. For 10-20 documents, put the full text directly into the prompt — no vector store needed.
- You need real-time data. RAG indexes documents, not live data. For real-time you want API integration plus an agent.
- 100% accuracy is mandatory. RAG reduces hallucination but does not eliminate it. Critical decisions still need a human in the loop.
- Your data is already structured. If the answer lives in a SQL table, use Text-to-SQL instead of vectorizing rows.
Costs and scaling
Embedding costs
Embedding is essentially a one-time cost unless you constantly update documents:
- OpenAI text-embedding-3-small: ~$0.02 / 1M tokens
- OpenAI text-embedding-3-large: ~$0.13 / 1M tokens
- Open-source self-hosted: infrastructure cost only, no per-token fees
LLM costs (response generation)
The main ongoing cost — every question triggers an LLM call:
- Claude Sonnet 5: ~$3 / 1M input tokens, $15 / 1M output tokens
- GPT-5.2: ~$1.75 / 1M input tokens, $14 / 1M output tokens
- GPT-5 mini: ~$0.25 / 1M input tokens, $2 / 1M output tokens
- Open-source (Llama 4, Mistral Large 2): infrastructure cost only
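The per-token prices above translate into a monthly figure with simple arithmetic. The sketch below assumes roughly 3,000 input tokens (question plus retrieved chunks) and 300 output tokens per question; those two numbers are illustrative assumptions, not measurements, and the function name is hypothetical.

```python
def monthly_llm_cost(questions_per_day, input_tokens, output_tokens,
                     input_price_per_m, output_price_per_m, days=30):
    """Estimated monthly spend in USD for answer generation."""
    questions = questions_per_day * days
    input_cost = questions * input_tokens / 1_000_000 * input_price_per_m
    output_cost = questions * output_tokens / 1_000_000 * output_price_per_m
    return input_cost + output_cost


# 100 questions/day at $3 per 1M input tokens and $15 per 1M output tokens
cost = monthly_llm_cost(100, 3000, 300, 3.0, 15.0)
print(f"${cost:.2f} per month")  # $40.50 per month
```

Input tokens dominate here because every question carries its retrieved context, so trimming top-K or chunk size tends to move the bill more than shortening answers.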
Scaling tips
- Cache answers for frequently asked questions.
- Smaller model when possible. Response generation does not always need the largest model — route easy questions to mini models.
- Batch indexing. Do not embed in real time; schedule batches.
- Tune chunk size. Smaller chunks mean more vectors and pricier search, but more precise retrieval. There is a sweet spot per corpus.
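The caching tip can be sketched as a small in-memory wrapper around the RAG call. The class name `AnswerCache` is hypothetical, and normalizing by lowercasing and collapsing whitespace is an assumption: it only catches near-identical phrasings, not paraphrases.

```python
import hashlib


class AnswerCache:
    """Caches answers keyed by a normalized form of the question."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(question):
        # Lowercase and collapse whitespace so trivial variations share a key.
        normalized = " ".join(question.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, question, compute):
        key = self._key(question)
        if key not in self._store:
            self._store[key] = compute(question)  # only runs on a cache miss
        return self._store[key]


calls = []

def slow_rag_answer(question):
    calls.append(question)  # stands in for retrieval plus an LLM call
    return "Returns are accepted within 30 days."

cache = AnswerCache()
cache.get_or_compute("What is our return policy?", slow_rag_answer)
cache.get_or_compute("  what is OUR return policy? ", slow_rag_answer)
print(len(calls))  # 1
```

A production cache would also need expiry or invalidation when the underlying documents are re-indexed, otherwise it can keep serving outdated answers.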
For a deeper budget breakdown, see our AI development cost guide.
The bottom line
RAG is not magic — it is the most reliable way to enrich an LLM with your own data. The architecture matters: proper chunking, the right embedding model, a fast vector database, and the hybrid search + reranking combo. In 2026 RAG is no longer experimental. It is a production-grade pattern shipped by the largest enterprises on the planet. The interesting question is no longer "does it work?" but "how do you fit it into your stack?"
Frequently asked questions
What is RAG and how does it differ from a plain LLM?
RAG (Retrieval-Augmented Generation) retrieves relevant documents from your own knowledge base before the LLM generates an answer. A plain LLM answers from its training data and can hallucinate; a RAG system grounds its responses in your actual sources and cites them. That grounding is what makes RAG suitable for enterprise production use, although it reduces hallucination rather than eliminating it.
What chunk size should I use for a RAG system?
Start with 512-token fixed-size chunks and 10-20% overlap. That works well for most documentation, knowledge bases, and FAQ corpora. If retrieval quality is low on mixed-content documents, switch to semantic or recursive chunking. Chunking strategy affects retrieval quality at least as much as the embedding model itself.
Which vector database should I pick in 2026?
Pinecone for managed simplicity (~50,000 inserts/s, ~5,000 queries/s). Qdrant for self-hosted speed and rich filtering. Weaviate when you need native hybrid search. ChromaDB for prototypes under 1M vectors. pgvector when you already run PostgreSQL and want to avoid a new system. Milvus for enterprise-scale deployments.
How much does it cost to run a RAG system per month?
A RAG-powered internal knowledge base for a 200-person company typically runs $90-280/month: ~$30-50 for self-hosted Qdrant on a VPS, $10-30 for embeddings via OpenAI text-embedding-3-small, and $50-200 for Claude or GPT inference at 100-500 questions per day. That is a fraction of a single dedicated support hire.
Does reranking actually improve answer quality?
Yes — reranking with Cohere Rerank v3.5 or BGE Reranker v2 typically improves answer relevancy by 20-30%. Vector search retrieves a wide candidate set fast; the reranker re-scores those candidates with a more accurate cross-encoder model. Small additional cost, large quality gain.
When should I NOT use RAG?
Skip RAG when an FAQ page solves the problem, when you have fewer than 20 documents (just put them in the prompt), when you need real-time data (use API integration plus an agent instead), when 100% accuracy is mandatory, or when your data is already structured and queryable via SQL — in that case use Text-to-SQL.
Can I switch embedding models later?
Technically yes, but it forces a full re-embed of your entire corpus. Pick the embedding model carefully up front. OpenAI text-embedding-3-small is the cost-effective default; text-embedding-3-large or Voyage AI voyage-3-large are the quality picks; BGE-M3 or nomic-embed-text-v2-moe are strong open-source options for self-hosted setups.



