RAG Systems: How to Build an Intelligent Enterprise Knowledge Base

RAG grounds the LLM in your own documents: far fewer hallucinations, and every answer points to its sources. Architecture, chunking, hybrid search, reranking, and a real $90-280/month build.

9 min read · By Boncz Bálint

Why an LLM alone is not enough

Modern LLMs (GPT-5.2, Claude Sonnet 5, Gemini 3 Pro) carry vast general knowledge, but they have two structural limits that block enterprise deployment.

  1. Knowledge cutoff. The model only knows what was in its training data. Anything that happened after the cutoff is invisible.
  2. Hallucination. When the model does not know the answer, it often invents one — confidently and convincingly, but wrong.

In a business setting that is unacceptable. If your chatbot tells a customer the wrong return policy or cites a policy document that does not exist, that is a real liability.

RAG (Retrieval-Augmented Generation) addresses both limits directly: instead of answering from memory, the model first retrieves relevant documents from your own knowledge base, then generates an answer grounded in those sources. The goal is a model that cites rather than invents.

The RAG pipeline, step by step

A RAG system has two phases: indexing (one-time per document) and querying (every time a user asks a question).

Phase 1 — Indexing

The flow: Documents → Chunking → Embedding → Vector database.

Document ingestion

First, gather the documents that form the knowledge base. Common sources:

  • PDFs (policies, contracts, technical documentation)
  • Web pages (help center, FAQ)
  • Confluence and Notion pages
  • Curated email archives
  • Database records

Each source needs its own loader. LangChain and LlamaIndex both ship a wide library of built-in loaders, so you rarely write one from scratch.

Chunking

The LLM context window is finite, and retrieval accuracy collapses when you store huge unbroken text blocks. The fix: split documents into smaller pieces called chunks.

| Strategy | Description | When to use |
|---|---|---|
| Fixed-size | 512 tokens, 10-20% overlap | General purpose, good starting point |
| Semantic | Groups sentences by similarity | Mixed-content documents |
| Recursive | Splits at chapter → paragraph → sentence | Structured documents |
| Agentic | LLM decides split points | Complex, varied corpora |
| Late chunking | Embed full doc first, then split | Context-sensitive content |
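
As a rough illustration of the fixed-size strategy in the table, the sketch below slides a 512-token window with 10% overlap over a token list. The function name `chunk_fixed` is hypothetical, and whitespace splitting stands in for a real tokenizer, so treat the token counts as approximate.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: float = 0.1) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` of their length."""
    if size <= 0 or not 0 <= overlap < 1:
        raise ValueError("size must be positive and overlap must be in [0, 1)")
    step = max(1, int(size * (1 - overlap)))  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):  # this window already reaches the end
            break
    return chunks


# Assumption: whitespace-separated words stand in for model tokens.
words = [f"w{i}" for i in range(1200)]
chunks = chunk_fixed(words, size=512, overlap=0.1)
chunk_lengths = [len(c) for c in chunks]
```

With 1,200 tokens this produces three chunks; each new chunk starts 460 tokens after the previous one, so neighbouring chunks share 52 tokens of context.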

Embedding generation

An embedding model converts each chunk into a vector, a dense numerical representation of the text's meaning. Semantically similar text ends up close together in vector space.

Popular embedding models in 2026:

  • OpenAI text-embedding-3-large: 3072 dimensions, excellent quality, paid API.
  • OpenAI text-embedding-3-small: 1536 dimensions, strong cost-quality balance.
  • Cohere Embed v4: multimodal (text + image), 128K context, 100+ languages.
  • Voyage AI voyage-3-large: SOTA across 100 datasets, Matryoshka dimensions (256-2048), quantization support.
  • Open-source: nomic-embed-text-v2-moe (MoE, multilingual), BGE-M3 (dense + sparse + multi-vector retrieval, 100+ languages).

Vector database

Embeddings live in a vector database that supports fast similarity search (cosine similarity, dot product, or Euclidean distance).

| Database | Type | Strength | When to choose |
|---|---|---|---|
| Pinecone | Managed SaaS | Scaling, simplicity | When you do not want to manage infrastructure |
| Qdrant | Open-source | Performance, filtering | When speed and self-hosting matter |
| Weaviate | Open-source | Hybrid search, modular | When keyword + vector search is required |
| ChromaDB | Open-source | Simplicity, dev-friendly | Prototyping, small to mid scale |
| Milvus | Open-source | Enterprise scale, 35K+ stars | Large datasets, enterprise |
| pgvector | PostgreSQL ext. | Reuses existing PG infra | When PostgreSQL is already in place |

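Benchmark numbers (2026): Pinecone reaches roughly 50,000 insertions/s and 5,000 queries/s. Qdrant lands close behind at ~45,000 / ~4,500. ChromaDB wins on developer convenience but tops out around 25,000 / 2,000 at scale. Treat these as rough orders of magnitude: throughput depends heavily on hardware, vector dimensions, and index settings.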
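
To make the similarity measures mentioned in this section concrete, here is a minimal pure-Python sketch of dot product, cosine similarity, and Euclidean distance, plus a brute-force top-k search. The three-dimensional vectors and document ids are made up for illustration; production vector databases replace this linear scan with approximate nearest-neighbour indexes.

```python
import math


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: list[float], b: list[float]) -> float:
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a: list[float], b: list[float]) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query: list[float], vectors: dict[str, list[float]], k: int = 3) -> list[tuple[str, float]]:
    """Brute-force nearest neighbours by cosine similarity; returns (id, score) pairs."""
    scored = [(doc_id, cosine_similarity(query, vec)) for doc_id, vec in vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Illustrative vectors only; real embeddings have hundreds or thousands of dimensions.
vectors = {
    "returns": [0.9, 0.1, 0.0],
    "shipping": [0.6, 0.4, 0.1],
    "careers": [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.0, 0.0], vectors, k=2)
hit_ids = [doc_id for doc_id, _ in hits]
```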

Phase 2 — Querying

The flow: Question → Embedding → Vector search → Top-K documents → LLM response.

  1. Embed the user's question with the same model used during indexing.
  2. Vector search: find the most similar chunks, usually top 3-10.
  3. Insert the retrieved chunks as context in the LLM prompt.
  4. The LLM generates an answer grounded in that context.
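
```python
# Simple RAG query (pseudocode: embed, vector_db, and llm stand for your own clients)
question = "What is our return policy?"

# 1. Question embedding (same model as used during indexing)
question_embedding = embed(question)

# 2. Vector search (assumed to return a list of chunk texts)
relevant_chunks = vector_db.search(question_embedding, top_k=5)

# 3. Prompt construction: join the chunks so the prompt holds text, not a list repr
documents = "\n\n".join(relevant_chunks)
prompt = f"""Answer the question based on the following documents.
If the documents don't contain the answer, say you don't know.

Documents:
{documents}

Question: {question}"""

# 4. LLM response generation
answer = llm.generate(prompt)
```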
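
For readers who want something that runs end to end, the following toy version of the four steps may help. It is a sketch under loud assumptions: `embed` is a bag-of-words counter rather than a real embedding model, the chunk texts and ids are invented sample data, and the LLM call is left out, so the output is only the prompt that would be sent.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts (a stand-in for a real embedding model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    shared = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return shared / norm if norm else 0.0


def search(question: str, chunks: dict[str, str], top_k: int = 2) -> list[str]:
    """Return the ids of the `top_k` chunks most similar to the question."""
    q = embed(question)
    ranked = sorted(chunks, key=lambda cid: cosine(q, embed(chunks[cid])), reverse=True)
    return ranked[:top_k]


def build_prompt(question: str, chunks: dict[str, str], ids: list[str]) -> str:
    """Label each chunk with its id so the model can cite its sources."""
    context = "\n".join(f"[{cid}] {chunks[cid]}" for cid in ids)
    return (
        "Answer the question based on the following documents and cite the [id] you used.\n"
        "If the documents don't contain the answer, say you don't know.\n\n"
        f"Documents:\n{context}\n\nQuestion: {question}"
    )


# Invented sample chunks, for illustration only.
chunks = {
    "policy-1": "Items can be returned within 30 days of delivery.",
    "policy-2": "Refunds are issued to the original payment method.",
    "hr-7": "Employees accrue vacation days monthly.",
}
question = "Can items be returned after delivery?"
ids = search(question, chunks, top_k=2)
prompt = build_prompt(question, chunks, ids)
```

Tagging each chunk with an id in the prompt is one simple way to get source citations back in the answer.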

Pure vector search is not always enough. When a user searches for an exact term — say "GDPR Article 17" — semantic search may miss it, because vector similarity optimizes for meaning, not exact word matches.

Hybrid search combines two retrievers:

  • Vector search for semantic similarity
  • Keyword search (BM25) for exact term matches

The two result sets are merged with reciprocal rank fusion (RRF) or a weighted average. Both Weaviate and Qdrant support hybrid search natively.
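
Reciprocal rank fusion is short enough to sketch directly. In the version below each ranked list contributes 1 / (k + rank) to a document's score; k = 60 is a commonly used default, and the document ids are placeholders.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked id lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda d: scores[d], reverse=True)


vector_hits = ["doc_a", "doc_b", "doc_c"]  # semantic ranking
bm25_hits = ["doc_c", "doc_a", "doc_d"]    # keyword ranking
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
```

Documents that appear in both lists (`doc_a`, `doc_c`) rise above those found by only one retriever, which is the point of the fusion: RRF needs only ranks, so the two retrievers' raw scores never have to be put on a common scale.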

Reranking: the second filter

Vector search is fast but not always precise. Reranking is a second, more accurate model that re-scores the retrieved documents against the original question.

Popular rerankers:

  • Cohere Rerank v3.5 — SaaS, simple API, excellent quality, supports reasoning.
  • BGE Reranker v2 — open-source, solid value.
  • Cross-encoder models — slower than bi-encoder embeddings but more accurate.
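
The two-stage shape of retrieve-then-rerank can be sketched without any external service. Here `overlap_score` is a hypothetical stand-in for a cross-encoder or a hosted reranker: it only counts shared words, but it is called the same way, once per (question, candidate) pair.

```python
from typing import Callable


def rerank(question: str, candidates: list[str],
           score: Callable[[str, str], float], top_n: int = 3) -> list[str]:
    """Re-score every (question, candidate) pair and keep the best `top_n`."""
    return sorted(candidates, key=lambda text: score(question, text), reverse=True)[:top_n]


def overlap_score(question: str, text: str) -> float:
    """Stand-in scorer: fraction of question words that also occur in the text."""
    q_words = set(question.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / len(q_words) if q_words else 0.0


# Candidates as a first-stage vector search might return them (illustrative texts).
candidates = [
    "shipping times for international orders",
    "how to return an item for a refund",
    "return policy for digital goods",
]
best = rerank("return an item", candidates, overlap_score, top_n=2)
```

Because the scorer sees the question and the candidate together, it can be slower and more accurate than the first-stage search, which is why it is applied only to a small candidate set.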

Advanced RAG patterns

Multi-step RAG

User questions are not always clean. Multi-step RAG first rewrites or expands the query, then runs multiple retrieval rounds.

Original question: "What's happening with the project?"
  → Rewritten query 1: "What is the current status of the Alpha project?"
  → Rewritten query 2: "What are the open tasks for the Alpha project?"
  → Unified answer based on both searches
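
The merging half of this pattern might look like the sketch below. The query rewriting itself (an LLM call) is not shown; `fake_index` is a hypothetical lookup table standing in for the vector database, and the chunk ids are invented.

```python
from typing import Callable


def multi_query_retrieve(queries: list[str],
                         retrieve: Callable[[str], list[str]],
                         limit: int = 5) -> list[str]:
    """Run every rewritten query and merge the hits, dropping duplicates in first-seen order."""
    merged: list[str] = []
    for query in queries:
        for chunk_id in retrieve(query):
            if chunk_id not in merged:
                merged.append(chunk_id)
    return merged[:limit]


# Hypothetical retriever: a lookup table instead of a real vector search.
fake_index = {
    "What is the current status of the Alpha project?": ["status-report", "roadmap"],
    "What are the open tasks for the Alpha project?": ["task-board", "status-report"],
}
context_ids = multi_query_retrieve(list(fake_index), lambda q: fake_index.get(q, []))
```

The merged, de-duplicated chunk list then feeds a single generation step, which produces the unified answer.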

Self-RAG

The LLM evaluates its own response — does the retrieved context really support the answer, and does the answer faithfully reflect the sources? If not, it re-retrieves or self-corrects.
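
A simplified control loop for this idea is sketched below. It is not the published Self-RAG method, only an illustration of "check, then re-retrieve": the `retrieve`, `generate`, and `is_supported` functions are hypothetical stubs for the vector store, the LLM, and an LLM-based grader.

```python
from typing import Callable


def answer_with_self_check(
    question: str,
    retrieve: Callable[[str, int], list[str]],
    generate: Callable[[str, list[str]], str],
    is_supported: Callable[[str, list[str]], bool],
    max_rounds: int = 2,
) -> str:
    """Generate, check the answer against its context, and widen retrieval if the check fails."""
    top_k = 3
    for _ in range(max_rounds):
        context = retrieve(question, top_k)
        answer = generate(question, context)
        if is_supported(answer, context):
            return answer
        top_k *= 2  # fetch more chunks before the next attempt
    return "I don't know."


# Stubs with invented data; real versions would call the vector store and the LLM.
docs = ["office hours", "parking rules", "travel policy", "the warranty lasts two years"]


def retrieve(question: str, top_k: int) -> list[str]:
    return docs[:top_k]


def generate(question: str, context: list[str]) -> str:
    return "two years" if any("warranty" in c for c in context) else "five years"


def is_supported(answer: str, context: list[str]) -> bool:
    return any(answer in c for c in context)


result = answer_with_self_check("How long is the warranty?", retrieve, generate, is_supported)
```

In this run the first round retrieves three chunks, none of which support the answer, so the loop doubles `top_k` and succeeds on the second round.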

Agentic RAG

The agent decides the search strategy on its own: which sources to query, what filters to apply, and whether multi-round retrieval is needed. It combines the ReAct pattern with RAG.

Graph RAG

Graph RAG models the relationships between documents as a knowledge graph. It is useful when entity connections matter, for example "Who works on the Alpha project?", where the answer is scattered across many documents.
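
A tiny sketch of the idea, assuming the relationships have already been extracted into (subject, relation, object) triples. The names and triples are invented, and a real system would use a graph database plus an extraction step that is not shown here.

```python
from collections import defaultdict

# Invented triples; in practice these come from entity extraction over many documents.
triples = [
    ("Anna", "works_on", "Alpha"),
    ("Ben", "works_on", "Alpha"),
    ("Ben", "works_on", "Beta"),
    ("Alpha", "owned_by", "Platform team"),
]


def build_index(triples: list[tuple[str, str, str]]) -> dict[tuple[str, str], set[str]]:
    """Index subjects by (relation, object) so entity questions become a lookup."""
    index: dict[tuple[str, str], set[str]] = defaultdict(set)
    for subject, relation, obj in triples:
        index[(relation, obj)].add(subject)
    return index


def who(index: dict[tuple[str, str], set[str]], relation: str, obj: str) -> list[str]:
    return sorted(index.get((relation, obj), set()))


index = build_index(triples)
people = who(index, "works_on", "Alpha")
```

Each fact may come from a different document, yet the question is answered with one lookup, which is what plain chunk retrieval struggles to do.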

Evaluation: how do you know it works?

Evaluating a RAG system is not trivial. Four metrics matter most.

| Metric | Question it answers |
|---|---|
| Faithfulness | Does the answer match the retrieved sources, or is the LLM hallucinating? |
| Answer relevancy | Does the response actually address the question? |
| Context precision | Are the retrieved documents actually relevant? |
| Context recall | Did the system find all the relevant documents? |

Frameworks like RAGAS and LangSmith automate evaluation across these metrics.
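
The two retrieval-side metrics can be approximated by hand when you have a small labeled set. The sketch below uses a simplified set-based definition and assumes you know which chunk ids are relevant for each question; evaluation frameworks typically compute these metrics differently, often with an LLM as judge, so the numbers are not directly comparable.

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are relevant."""
    if not retrieved:
        return 0.0
    return sum(1 for cid in retrieved if cid in relevant) / len(retrieved)


def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of relevant chunks that were retrieved."""
    if not relevant:
        return 1.0
    return len(set(retrieved) & relevant) / len(relevant)


# Hypothetical labels for one test question.
retrieved = ["c1", "c2", "c3", "c4"]
relevant = {"c1", "c3", "c9"}
precision = context_precision(retrieved, relevant)  # 2 of 4 retrieved chunks are relevant
recall = context_recall(retrieved, relevant)        # 2 of 3 relevant chunks were found
```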

Practical example: a 200-person knowledge base

A concrete reference build for a RAG-powered internal knowledge base at a 200-person company.

1. Data sources

  • 500+ internal documents (PDF, Word)
  • Confluence wiki (300+ pages)
  • Historical support tickets (5,000+)
  • Curated email archive

2. Tech stack

  • Embedding: OpenAI text-embedding-3-small (cost-effective)
  • Vector DB: Qdrant (self-hosted, Docker)
  • LLM: Claude Sonnet 5 (strong reasoning)
  • Framework: LangChain v0.3 + FastAPI
  • Frontend: custom chat interface

3. Pipeline

Confluence API + PDFs + Support tickets
    ↓
Chunking (512 tokens, 10% overlap)
    ↓
Embedding generation (text-embedding-3-small)
    ↓
Qdrant vector DB
    ↓
Hybrid search (vector + BM25)
    ↓
Reranking (Cohere Rerank v3.5)
    ↓
LLM response generation (Claude Sonnet 5)
    ↓
Response + source citations

4. Monthly cost estimate

| Item | Monthly cost (USD) |
|---|---|
| Qdrant hosting (VPS) | $30 – $50 |
| OpenAI embedding API | $10 – $30 (500 docs) |
| Claude API | $50 – $200 (100-500 q/day) |
| Total | $90 – $280 |

That is a small fraction of what a single dedicated support hire would cost — and it serves the whole company at once.

  • 20-30%: answer relevancy lift from reranking
  • 50,000/s: Pinecone insert throughput (2026 benchmarks)
  • $90-280: monthly cost for a 200-person knowledge base

When NOT to use RAG

RAG is not the right tool for every problem. Skip it when:

  • Questions are simple and static. If a plain FAQ page solves the problem, a RAG system is overkill.
  • You have very little data. For 10-20 documents, put the full text directly into the prompt — no vector store needed.
  • You need real-time data. RAG indexes documents, not live data. For real-time you want API integration plus an agent.
  • 100% accuracy is mandatory. RAG reduces hallucination but does not eliminate it. Critical decisions still need a human in the loop.
  • Your data is already structured. If the answer lives in a SQL table, use Text-to-SQL instead of vectorizing rows.

Costs and scaling

Embedding costs

Embedding is essentially a one-time cost unless you constantly update documents:

  • OpenAI text-embedding-3-small: ~$0.02 / 1M tokens
  • OpenAI text-embedding-3-large: ~$0.13 / 1M tokens
  • Open-source self-hosted: infrastructure cost only, no per-token fees
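
As a worked example of the per-token prices above, the sketch below estimates a one-time indexing bill. The corpus size (20,000 chunks of 512 tokens) is an assumption chosen for illustration, not a measurement, and the function name is hypothetical.

```python
def embedding_cost_usd(num_chunks: int, tokens_per_chunk: int, price_per_million: float) -> float:
    """One-time cost of embedding a corpus at a given price per 1M tokens."""
    return num_chunks * tokens_per_chunk / 1_000_000 * price_per_million


# Assumed corpus: 20,000 chunks x 512 tokens = 10.24M tokens.
small = embedding_cost_usd(20_000, 512, 0.02)  # text-embedding-3-small
large = embedding_cost_usd(20_000, 512, 0.13)  # text-embedding-3-large
```

Under these assumptions the one-time bill is about $0.20 with the small model and about $1.33 with the large one; re-indexing and per-query embeddings add to that over time.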

LLM costs (response generation)

The main ongoing cost — every question triggers an LLM call:

  • Claude Sonnet 5: ~$3 / 1M input tokens, $15 / 1M output tokens
  • GPT-5.2: ~$1.75 / 1M input tokens, $14 / 1M output tokens
  • GPT-5 mini: ~$0.25 / 1M input tokens, $2 / 1M output tokens
  • Open-source (Llama 4, Mistral Large 2): infrastructure cost only
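
The same arithmetic applies to generation. The sketch below plugs the Claude Sonnet 5 prices listed above into an assumed workload of 300 questions per day, 3,000 input tokens (question plus retrieved chunks) and 400 output tokens per question; all three workload numbers are illustrative assumptions.

```python
def monthly_llm_cost_usd(questions_per_day: int, input_tokens: int, output_tokens: int,
                         input_price: float, output_price: float, days: int = 30) -> float:
    """Monthly generation cost; prices are in USD per 1M tokens."""
    per_question = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return questions_per_day * days * per_question


# Assumed workload: 300 q/day, 3,000 input and 400 output tokens per question.
cost = monthly_llm_cost_usd(300, 3_000, 400, input_price=3.0, output_price=15.0)
```

That works out to roughly $0.015 per question, or about $135 per month. Input tokens dominate as you add more retrieved chunks, which is one reason top-k and chunk size are worth tuning.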

Scaling tips

  • Cache answers for frequently asked questions.
  • Smaller model when possible. Response generation does not always need the largest model — route easy questions to mini models.
  • Batch indexing. Do not embed in real time; schedule batches.
  • Tune chunk size. Smaller chunks mean more vectors and pricier search, but more precise retrieval. There is a sweet spot per corpus.
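
The first tip can be sketched as a small in-memory cache keyed on a normalized form of the question. The class name `AnswerCache` and the `slow_pipeline` stub are hypothetical; a production setup would more likely use a shared store with expiry, and possibly semantic rather than exact matching.

```python
import re
from typing import Callable


class AnswerCache:
    """Caches answers under a normalized question so trivial rewordings hit the cache."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}

    @staticmethod
    def _key(question: str) -> str:
        # Lowercase and drop punctuation so "Return policy?" and "return policy" match.
        return " ".join(re.findall(r"[a-z0-9]+", question.lower()))

    def get_or_compute(self, question: str, compute: Callable[[str], str]) -> str:
        key = self._key(question)
        if key not in self._store:
            self._store[key] = compute(question)
        return self._store[key]


calls: list[str] = []


def slow_pipeline(question: str) -> str:
    calls.append(question)  # record how often the expensive RAG path runs
    return "Placeholder answer from the RAG pipeline."


cache = AnswerCache()
first = cache.get_or_compute("What is our return policy?", slow_pipeline)
second = cache.get_or_compute("what is our return policy", slow_pipeline)
```

Both calls return the same answer, but the expensive pipeline runs only once.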

For a deeper budget breakdown, see our AI development cost guide.

The bottom line

RAG is not magic — it is the most reliable way to enrich an LLM with your own data. The architecture matters: proper chunking, the right embedding model, a fast vector database, and the hybrid search + reranking combo. In 2026 RAG is no longer experimental. It is a production-grade pattern shipped by the largest enterprises on the planet. The interesting question is no longer "does it work?" but "how do you fit it into your stack?"

Planning a RAG-powered knowledge base or chatbot? The AppForge team handles the full implementation — data ingestion, vector database selection, evaluation, and production deployment. Start with a free consultation: request a quote or browse our AI portfolio for real deployments.

Frequently asked questions

What is RAG and how does it differ from a plain LLM?

RAG (Retrieval Augmented Generation) retrieves relevant documents from your own knowledge base before the LLM generates an answer. A plain LLM answers from its training data and can hallucinate; a RAG system grounds every response in your actual sources and cites them. That difference is what makes RAG production-safe for enterprise use.

What chunk size should I use for a RAG system?

Start with 512-token fixed-size chunks and 10-20% overlap. That works well for most documentation, knowledge bases, and FAQ corpora. If retrieval quality is low on mixed-content documents, switch to semantic or recursive chunking. Chunking strategy affects retrieval quality at least as much as the embedding model itself.

Which vector database should I pick in 2026?

Pinecone for managed simplicity (~50,000 inserts/s, ~5,000 queries/s). Qdrant for self-hosted speed and rich filtering. Weaviate when you need native hybrid search. ChromaDB for prototypes under 1M vectors. pgvector when you already run PostgreSQL and want to avoid a new system. Milvus for enterprise-scale deployments.

How much does it cost to run a RAG system per month?

A RAG-powered internal knowledge base for a 200-person company typically runs $90-280/month: ~$30-50 for self-hosted Qdrant on a VPS, $10-30 for embeddings via OpenAI text-embedding-3-small, and $50-200 for Claude or GPT inference at 100-500 questions per day. That is a fraction of a single dedicated support hire.

Does reranking actually improve answer quality?

Yes — reranking with Cohere Rerank v3.5 or BGE Reranker v2 typically improves answer relevancy by 20-30%. Vector search retrieves a wide candidate set fast; the reranker re-scores those candidates with a more accurate cross-encoder model. Small additional cost, large quality gain.

When should I NOT use RAG?

Skip RAG when an FAQ page solves the problem, when you have fewer than 20 documents (just put them in the prompt), when you need real-time data (use API integration plus an agent instead), when 100% accuracy is mandatory, or when your data is already structured and queryable via SQL — in that case use Text-to-SQL.

Can I switch embedding models later?

Technically yes, but it forces a full re-embed of your entire corpus. Pick the embedding model carefully up front. OpenAI text-embedding-3-small is the cost-effective default; text-embedding-3-large or Voyage AI voyage-3-large are the quality picks; BGE-M3 or nomic-embed-text-v2-moe are strong open-source options for self-hosted setups.
