
RAG Systems: How to Build an Intelligent Enterprise Knowledge Base

By AppForge Team · 7 min read

[Figure: RAG pipeline processing documents into AI responses]

Why an LLM Alone Isn’t Enough

LLMs (GPT-5.2, Claude Sonnet 5, Gemini 3 Pro) possess vast knowledge - but they have two fundamental limitations:

  1. Knowledge cutoff: The model only “knows” what was in its training data - anything that happened after the cutoff is invisible to it.
  2. Hallucination: When it doesn’t know the answer, it often makes one up - confidently and convincingly, but falsely.

In enterprise environments, this is catastrophic. If your AI chatbot gives a customer false information or references outdated internal documents, that’s a business risk.

RAG (Retrieval Augmented Generation) solves exactly this problem: instead of the LLM answering from its “head,” it first retrieves relevant documents, then generates a response based on them. It doesn’t invent - it cites.

The RAG Pipeline Step by Step

A RAG system consists of two main phases: indexing (one-time) and querying (on every question).

Phase 1: Indexing

Documents → Chunking → Embedding → Vector Database

Document Ingestion

The first step: gather the documents that form the foundation of your knowledge base. These can be:

  • PDFs (policies, contracts, documentation)
  • Web pages (help center, FAQ)
  • Confluence/Notion pages
  • Email archives
  • Database records

Each source requires a different loader. Both LangChain and LlamaIndex offer plenty of built-in loaders.

Chunking

The LLM’s context window is limited, and retrieval accuracy degrades dramatically when storing massive text blocks. The solution: split documents into smaller pieces (chunks).

Chunking strategy is critically important - research shows that chunk size and overlap have at least as much impact on retrieval quality as the embedding model itself.

Main strategies:

| Strategy | Description | When to use |
|----------|-------------|-------------|
| Fixed-size | 512 tokens, 10-20% overlap | General purpose, good starting point |
| Semantic | Groups sentences by similarity | Mixed-content documents |
| Recursive | Splits at chapter → paragraph → sentence level | Structured documents |
| Agentic | LLM decides split points | Complex, varied documents |
| Late chunking | Full document embedding first, then split | Context-sensitive content |

Practical recommendation: Start with 512-token fixed-size chunks with 10-20% overlap. This works well for most use cases. If quality is unsatisfactory, try semantic chunking.
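
To make that recommendation concrete, here is a minimal fixed-size chunker with overlap - a sketch that uses tiktoken for token counting (the 512/64 defaults match the 10-20% overlap advice above):

# Minimal fixed-size chunker with token overlap (sketch)
import tiktoken

def chunk_text(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    step = chunk_size - overlap  # slide forward, reusing `overlap` tokens as context
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(enc.decode(tokens[start:start + chunk_size]))
        if start + chunk_size >= len(tokens):
            break  # last window already reached the end of the document
    return chunks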

Embedding Generation

Each chunk is converted into a vector by an embedding model - a dense numerical representation that captures the text’s meaning. Vectors for semantically similar text end up close together in vector space.

Popular embedding models (2026):

  • OpenAI text-embedding-3-large: 3072 dimensions, excellent quality, paid API
  • OpenAI text-embedding-3-small: 1536 dimensions, good value for money
  • Cohere Embed v4: Multimodal (text + image), 128K context, 100+ languages
  • Voyage AI voyage-3-large: SOTA performance across 100 datasets, Matryoshka dimensions (256-2048), quantization support
  • Open-source: nomic-embed-text-v2-moe (MoE architecture, multilingual), BGE-M3 (dense + sparse + multi-vector retrieval, 100+ languages)

Important: The embedding model choice is effectively permanent - if you switch later, you need to re-embed all documents.
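
A minimal sketch of batch embedding with the OpenAI Python SDK (assumes an OPENAI_API_KEY in the environment; the input file is a placeholder, and chunk_text is the chunker sketched above):

# Sketch: embed a batch of chunks with text-embedding-3-small
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

chunks = chunk_text(open("return_policy.txt").read())  # placeholder source document
response = client.embeddings.create(
    model="text-embedding-3-small",  # 1536-dimensional vectors
    input=chunks,                    # batched call: one vector per chunk
)
vectors = [item.embedding for item in response.data]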

Vector Database

Embeddings are stored in a vector database that enables fast similarity search (cosine similarity, dot product, euclidean distance).

Vector database comparison (2026):

| Database | Type | Strength | When to choose |
|----------|------|----------|----------------|
| Pinecone | Managed SaaS | Scaling, simplicity | When you don’t want to manage infrastructure |
| Qdrant | Open-source | Performance, filtering | When speed and self-hosting matter |
| Weaviate | Open-source | Hybrid search, modular | When keyword + vector search is needed |
| ChromaDB | Open-source | Simplicity, developer-friendly | Prototyping, small-medium scale |
| Milvus | Open-source | Enterprise scale, 35K+ GitHub stars | Large datasets, enterprise |
| pgvector | PostgreSQL ext. | Existing PG infrastructure | When you already have PostgreSQL |

Benchmark data (2026): Pinecone shows ~50,000 insertions/s and ~5,000 queries/s, Qdrant performs similarly (~45,000 / ~4,500), while ChromaDB excels in developer convenience but is slower at scale (~25,000 / ~2,000).
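
As an illustration, a sketch of indexing and querying with Qdrant (assumes a local instance on localhost:6333; the collection name is arbitrary, vectors and chunks continue the embedding sketch above, and question_vec is an embedded user question):

# Sketch: store embeddings in Qdrant and run a similarity search
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, PointStruct, VectorParams

qdrant = QdrantClient(url="http://localhost:6333")

qdrant.create_collection(
    collection_name="docs",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

qdrant.upsert(
    collection_name="docs",
    points=[
        PointStruct(id=i, vector=vec, payload={"text": chunk})
        for i, (vec, chunk) in enumerate(zip(vectors, chunks))
    ],
)

# Cosine similarity search for the 5 chunks nearest to the question embedding
hits = qdrant.search(collection_name="docs", query_vector=question_vec, limit=5)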

Phase 2: Querying

Question → Embedding → Vector Search → Top-K Documents → LLM Response Generation
  1. Embed the user’s question (using the same model as indexing)
  2. Vector search: find the most similar chunks (typically top 3-10)
  3. Insert the retrieved chunks as context into the LLM prompt
  4. The LLM generates a response based on the context

# Simple RAG query (pseudocode - embed(), vector_db and llm stand in for
# whichever embedding model, vector store and LLM client you use)
question = "What is our return policy?"

# 1. Question embedding (same model as at indexing time)
question_embedding = embed(question)

# 2. Vector search: the 5 most similar chunks
relevant_chunks = vector_db.search(question_embedding, top_k=5)

# 3. Prompt construction: join the chunk texts into one context block
context = "\n\n".join(chunk.text for chunk in relevant_chunks)
prompt = f"""Answer the question based on the following documents.
If the documents don't contain the answer, say you don't know.

Documents:
{context}

Question: {question}"""

# 4. LLM response generation
answer = llm.generate(prompt)

Hybrid Search: Vector + Keyword

Pure vector search isn’t always enough. When a user searches for an exact term (e.g., “GDPR Article 17”), semantic search may not find it - because vector search optimizes for meaning, not exact word matching.

Hybrid search combines:

  • Vector search: Semantic similarity (based on meaning)
  • Keyword search (BM25): Exact word matching (based on terms)

The two result sets are combined using reciprocal rank fusion (RRF) or weighted averaging. Weaviate and Qdrant natively support hybrid search.
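
RRF itself is only a few lines. A sketch, assuming each retriever returns a list of document IDs ordered best-first (k=60 is the constant from the original RRF paper):

# Sketch: reciprocal rank fusion over several ranked ID lists
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

fused_ids = rrf([vector_result_ids, bm25_result_ids])  # IDs from each retriever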

Reranking: The Second Filter

Vector search is fast but not always precise. Reranking adds a second, more accurate model that re-scores the retrieved documents in light of the question.

Popular reranker models:

  • Cohere Rerank v3.5: SaaS, simple API, excellent quality, reasoning capabilities
  • BGE Reranker v2: Open-source, good value
  • Cross-encoder models: Slower but more accurate than bi-encoder embeddings

Reranking typically yields a 20-30% improvement in answer relevancy - small investment, big impact.
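
A sketch of open-source reranking with a cross-encoder via sentence-transformers (the model name is illustrative; question and candidate_chunks are assumed to come from the retrieval step):

# Sketch: re-score retrieved chunks with a cross-encoder, keep the best 5
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")

scores = reranker.predict([(question, chunk) for chunk in candidate_chunks])
ranked = sorted(zip(scores, candidate_chunks), key=lambda p: p[0], reverse=True)
top_chunks = [chunk for _, chunk in ranked[:5]]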

Advanced RAG Patterns

Multi-step RAG

The user’s question isn’t always clear. Multi-step RAG first clarifies the question (query rewriting, query expansion), then searches in multiple rounds.

Original question: "What's happening with the project?"
  → Rewritten query 1: "What is the current status of the Alpha project?"
  → Rewritten query 2: "What are the open tasks for the Alpha project?"
  → Unified answer based on both searches
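
A sketch of the rewrite-then-search loop, reusing the placeholder embed / vector_db / llm interfaces from the pseudocode above:

# Sketch: rewrite a vague question, search once per rewrite, de-duplicate
rewrite_prompt = f"""Rewrite the following question into 2-3 specific,
self-contained search queries, one per line.

Question: {question}"""

queries = llm.generate(rewrite_prompt).splitlines()

seen, merged = set(), []
for q in queries:
    for chunk in vector_db.search(embed(q), top_k=5):
        if chunk.id not in seen:  # keep each chunk only once
            seen.add(chunk.id)
            merged.append(chunk)
# `merged` now feeds the final answer-generation prompt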

Self-RAG

The LLM evaluates its own response: it checks whether the retrieved documents are actually relevant and whether the answer faithfully reflects the sources. If not, it searches again or self-corrects.

Agentic RAG

The agent autonomously decides on search strategy: which sources to search, what filters to apply, whether multi-round search is needed. This is the combination of the ReAct pattern and RAG.

Graph RAG

Models relationships between documents (knowledge graph). Particularly useful when entity connections matter - e.g., for questions like “Who works on the Alpha project?” where the answer is scattered across multiple documents.

Evaluation: How Do You Know It Works?

Evaluating a RAG system isn’t trivial. The key metrics:

Faithfulness

Does the generated answer match the source documents? Is the LLM hallucinating information?

Answer Relevancy

Does the response actually answer the question? Does it stay on topic?

Context Precision

Are the retrieved documents actually relevant to the question?

Context Recall

Did the system find all relevant documents?

Evaluation frameworks: RAGAS and LangSmith provide automated evaluation on these metrics.
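
A sketch of a RAGAS evaluation run (the exact interface differs between RAGAS versions - this follows the 0.1.x style, with one hand-written test case as a placeholder):

# Sketch: scoring one question/answer pair on the four RAG metrics
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (answer_relevancy, context_precision,
                           context_recall, faithfulness)

data = Dataset.from_dict({
    "question":     ["What is our return policy?"],
    "answer":       ["Items can be returned within 30 days of purchase."],
    "contexts":     [["Policy 4.2: customers may return items within 30 days."]],
    "ground_truth": ["Returns are accepted within 30 days of purchase."],
})

result = evaluate(data, metrics=[faithfulness, answer_relevancy,
                                 context_precision, context_recall])
print(result)  # dict-like scores per metric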

Practical Example: Enterprise Knowledge Base

Let’s say you’re building a RAG-powered internal knowledge base for a 200-person company. Here’s what the project looks like:

1. Data Sources

  • 500+ internal documents (PDF, Word)
  • Confluence wiki (300+ pages)
  • Historical support tickets (5,000+)
  • Email archive (curated)

2. Tech Stack

  • Embedding: OpenAI text-embedding-3-small (cost-effective)
  • Vector DB: Qdrant (self-hosted, Docker)
  • LLM: Claude Sonnet 5 (excellent reasoning)
  • Framework: LangChain v0.3 + FastAPI
  • Frontend: Custom chat interface

3. Pipeline

Confluence API + PDFs + Support tickets
  → Chunking (512 tokens, 10% overlap)
  → Embedding generation (text-embedding-3-small)
  → Qdrant vector DB
  → Hybrid search (vector + BM25)
  → Reranking (Cohere Rerank v3.5)
  → LLM response generation (Claude Sonnet 5)
  → Response + source citations

4. Cost Estimate (Monthly)

  • Qdrant hosting (VPS): ~$30-50/month
  • OpenAI embedding API: ~$10-30/month (for 500 documents)
  • Claude API: ~$50-200/month (100-500 questions per day)
  • Total: ~$90-280/month

That’s a fraction of what a dedicated support team member would cost.

When NOT to Use RAG

RAG isn’t for everything. Skip it if:

  • Questions are simple and static: If an FAQ page solves the problem, don’t build a RAG system
  • Not enough data: For 10-20 documents, don’t build RAG - put the full text in the prompt
  • Real-time data is needed: RAG stores documents, not real-time data. For that, you need API integration + agents
  • 100% accuracy is required: RAG reduces hallucination but doesn’t eliminate it. Critical decisions need human approval
  • Data is structured: If the data is queryable via SQL, don’t vectorize it - use a Text-to-SQL approach

Costs and Scaling

Embedding Costs

Embedding is a one-time cost (unless you frequently update documents):

  • OpenAI text-embedding-3-small: ~$0.02 / 1M tokens
  • OpenAI text-embedding-3-large: ~$0.13 / 1M tokens
  • Open-source (self-hosted): infrastructure cost only, no per-token fees

LLM Costs (Response Generation)

This is the main ongoing cost, as every question requires an LLM call:

  • Claude Sonnet 5: ~$3 / 1M input tokens, $15 / 1M output tokens
  • GPT-5.2: ~$1.75 / 1M input tokens, $14 / 1M output tokens
  • GPT-5 mini: ~$0.25 / 1M input tokens, $2 / 1M output tokens
  • Open-source (Llama 4, Mistral Large 2): infrastructure cost only, no API fees

Scaling Tips

  • Cache: Cache responses for frequently asked questions (see the sketch after this list)
  • Smaller model: Response generation doesn’t always need the most powerful model
  • Batch indexing: Don’t embed in real-time - schedule batch processing
  • Chunk size optimization: Smaller chunks = more vectors = more expensive search, but more precise
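
The caching tip can start as simple as an exact-match lookup. A naive sketch (rag_answer is a hypothetical function wrapping the full pipeline; production systems often use semantic caching instead, matching questions by embedding similarity):

# Sketch: naive exact-match response cache
cache: dict[str, str] = {}

def answer_with_cache(question: str) -> str:
    key = question.strip().lower()         # normalize the question text
    if key not in cache:
        cache[key] = rag_answer(question)  # hypothetical full RAG pipeline call
    return cache[key]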

Summary

RAG isn’t magic - but it’s the best method for enriching LLMs with your own data. The key is the right architecture: proper chunking, the right embedding model, a fast vector database, and the hybrid search + reranking combination.

In 2026, RAG is no longer experimental technology - it’s a production-ready solution used by the world’s largest enterprises. The question isn’t “does it work,” but “how do you integrate it into your system.”

If you’re planning a RAG-powered knowledge base or chatbot, the AppForge team can help with the full implementation - from data processing to vector database selection to production deployment.
