Enterprise RAG: the complete pipeline that actually works

How to build a production RAG with semantic chunking, hybrid search and reranking. The real decisions that determine whether your system retrieves well or fails silently.

Contributors: Ivan Garcia Villar, Carlos Hernandez Prieto

Every company has valuable knowledge trapped in documents: internal policies, contracts, technical manuals, support ticket history, product specifications. RAG is the mechanism that connects that knowledge with a language model so it can use it in real time. The question isn’t whether to implement RAG — it’s how to build it right from the start so that knowledge is truly accessible and accurate. In this post I’ll explain the complete pipeline with the decisions that matter most, in the order they matter.

The complete pipeline at a glance

Before diving into each decision, it’s worth having the complete architecture clear. Most teams I’ve seen build their first RAG component by component without seeing the whole system, and that leads to optimizing what doesn’t need optimization while ignoring what actually fails.

[Figure: the complete RAG pipeline, from ingestion (chunking, embeddings, vector store) through retrieval (hybrid search, reranking) to context construction, with evaluation observing each stage]

The diagram makes clear where each decision intervenes: chunking happens at ingestion, embeddings transform chunks before storing them, hybrid search operates at retrieval time, reranking reorders results before building the context, and evaluation observes each stage. Each of those decisions has a different impact on the final result. The first two — chunking and embeddings — are the foundation of the system and determine the quality ceiling that any later optimization can reach.

The foundation of the system: semantic chunking

The first thing I implemented was semantic chunking, and it’s where I saw the most difference compared to what I’d built before. Most start by splitting documents every N characters because it’s the simplest to implement. The problem is that it destroys context: an idea spanning three paragraphs ends up fragmented into chunks that separately mean nothing.

To make it concrete: a technical manual describing a four-step procedure might get split in half of the second step. The chunk containing steps 1-2 (incomplete) and the chunk containing steps 3-4 (without context of the first ones) are useless for retrieval. When the user asks how to execute that procedure, the system retrieves fragments that answer nothing by themselves.

Semantic chunking respects document structure — sections, paragraphs, complete entities — and produces fragments that have meaning on their own.

To see the difference without code, take this paragraph from a real internal policy:

“The approval process for purchase orders above €50,000 requires validation from the technical area within a maximum of 48 hours. If no response is received within that timeframe, the order remains in blocked status and must be escalated to the operations director for resolution.”

Character-based chunking (80-character chunks):

  • The approval process for purchase orders above €50,000 requires validation fro

  • m the technical area within a maximum of 48 hours. If no response is received w

  • ithin that timeframe, the order remains in blocked status and must be escalate

  • d to the operations director for resolution.

Semantic chunking (complete paragraph as unit):

  • The approval process for purchase orders above €50,000 requires validation from the technical area within a maximum of 48 hours. If no response is received within that timeframe, the order remains in blocked status and must be escalated to the operations director for resolution.

No explanation needed. One fragment makes sense. The other four don’t.

The three strategies and when to use each

| Strategy | Use cases | Advantages | Limitations |
| --- | --- | --- | --- |
| By characters | Quick prototyping; homogeneous, dense corpus | Simple, no external dependencies | Breaks context, produces chunks without semantic meaning |
| By tokens | Corpus where respecting LLM limits matters | Aligned with the model’s context window | Doesn’t respect the semantic structure of the document |
| Semantic | Enterprise, technical, contract and policy documents | Preserves context, visibly improves retrieval precision | Requires more initial configuration and corpus knowledge |

The correct strategy depends on document type. For documents with clear structure — contracts, manuals, policies with defined sections — chunking by sections or paragraphs works well because structure already delimits semantic units. For free-form text — emails, notes, conversation logs — you need a more careful approach using contextual signals (topic changes, argument inflection points) to determine chunk boundaries. For mixed documents — policies with tables and narrative paragraphs — the best result comes from treating each content type with its own fragmentation logic.
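As an illustration, a minimal paragraph-aware chunker might look like this. It is a sketch, not a full implementation: the blank-line paragraph boundary and the `max_chars` budget are assumptions to adapt per document type.

```python
import re

def chunk_by_sections(text: str, max_chars: int = 1500) -> list[str]:
    """Split on blank lines (paragraph boundaries), then pack whole
    paragraphs into chunks that never exceed max_chars.
    A paragraph is never cut in half: it either fits or starts a new chunk."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], ""
    for para in paragraphs:
        if current and len(current) + len(para) + 2 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks
```

For documents with explicit headings, the same packing logic applies with sections as the unit instead of paragraphs.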

Overlap: necessary, but not arbitrary

Overlap (shared content between adjacent chunks) compensates for the risk that a chunk boundary cuts right in the middle of an idea spanning from one paragraph to the next. Without overlap, if relevant information falls between two chunks, retrieval loses it.

The correct overlap size isn’t a fixed number. It depends on document type and the typical length of sentences connecting adjacent ideas. In dense technical documentation, a 10-15% overlap of chunk size usually works well. In documents with very independent sections (like a FAQ with separated questions), overlap can be minimal or none.

What to avoid is overlap so large that chunks are nearly duplicates — that doesn’t help retrieval, just bloats the vector store and adds noise.
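A sketch of how overlap can be applied over pre-split units (sentences or paragraphs), with the 10-15% ratio from above as the default. The unit granularity is an assumption; adapt it to your corpus.

```python
import math

def chunk_with_overlap(units: list[str], chunk_size: int = 6,
                       overlap_ratio: float = 0.15) -> list[list[str]]:
    """Pack pre-split units (sentences, paragraphs) into chunks of chunk_size,
    repeating the last few units of each chunk at the start of the next one."""
    # Overlap as a fraction of chunk size; 10-15% is a reasonable default
    # for dense technical documentation, near zero for FAQ-style documents.
    overlap = math.ceil(chunk_size * overlap_ratio) if overlap_ratio > 0 else 0
    step = max(1, chunk_size - overlap)
    chunks = []
    for start in range(0, len(units), step):
        chunks.append(units[start:start + chunk_size])
        if start + chunk_size >= len(units):
            break
    return chunks
```

Keeping overlap proportional to chunk size (rather than a fixed number) is what prevents the near-duplicate chunks described above.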

Embedding strategy: the decision with the most impact

Chunking determines the structure of what you store. The embedding model determines whether retrieval can find it.

Not all embedding models capture all domains equally well. A generic model trained on general text can work perfectly for English documentation on common topics and fail systematically with a company’s internal terminology — proprietary acronyms, product names, domain-specific jargon.

In a real project I faced this problem head-on. The system received questions about “OC urgentes pendientes de validar por el área técnica” (urgent POs pending technical validation). The generic embedding model didn’t associate “OC” (purchase order, the client’s internal term) with the concept of “purchase order” or “pedido” — the vectors fell in a different semantic space than the documents where information was stored as “órdenes de compra” or “pedidos de aprovisionamiento”. Retrieval returned zero relevant results for a query that had a direct answer in the corpus.

The solution wasn’t tuning chunking or adjusting hybrid search. It was switching the embedding model to one evaluated on a real sample of the client’s vocabulary.

How to choose the right model before building everything on top

The test I always use before committing to an embedding model: take 20-30 representative questions from the real use case, along with the documents containing the answers, and measure whether the model positions those documents in the top retrieval results. You don’t need a complete evaluation framework for this — it’s a manual test taking an afternoon that can save weeks of work built on the wrong foundation.

If the correct documents don’t appear in the top 5 results in more than 70% of questions, the embedding model isn’t right for that corpus. No later optimization compensates.
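That manual test can be scripted in a few lines. `retrieve_fn` here is a placeholder for whatever retrieval call is being evaluated:

```python
def validate_embedding_model(questions: list[str], expected_doc_ids: list[str],
                             retrieve_fn, k: int = 5,
                             threshold: float = 0.70) -> bool:
    """retrieve_fn(question) -> ranked list of doc ids.
    Passes when the expected document appears in the top-k results for at
    least `threshold` of the questions (recall@k)."""
    hits = sum(
        1 for question, expected in zip(questions, expected_doc_ids)
        if expected in retrieve_fn(question)[:k]
    )
    return hits / len(questions) >= threshold
```

Run this once per candidate embedding model on the same 20-30 questions and keep only the models that pass.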

Open-source vs API, and the real trade-offs

Open-source models like those from sentence-transformers, E5 or BGE have the advantage of running locally and not depending on an external provider — which matters if the corpus is confidential. API models like OpenAI or Cohere offer high-quality embeddings without infrastructure management.

The factor that most changes the equation isn’t the abstract quality of the model but the cost depending on when in the pipeline it runs. At ingestion time, embeddings are generated once per document and latency cost isn’t critical. At retrieval time, embeddings are generated on each user query and latency does matter. A model taking 500ms to generate an embedding might be acceptable for ingestion and completely unviable for real-time retrieval.

Vector stores: the right choice by context

| Vector store | Ideal scenario | Main advantage | Limitation |
| --- | --- | --- | --- |
| Chroma | Development, medium corpus, local setup | Zero configuration, fast iteration | Not designed for production scale |
| Pinecone | Large corpus, production, team without ops | Managed SaaS, scales without infrastructure work | Cost with very large corpus, vendor dependency |
| Weaviate | Infrastructure control, native hybrid search | Open-source, integrated BM25 module | More operations overhead than Chroma |
| pgvector | Already running PostgreSQL in production | Adds vector capability without new infrastructure | Not optimized for pure vector retrieval at large scale |

The right choice crosses three dimensions: corpus size, scale requirements, and setup simplicity. For the initial phase, Chroma lets you iterate without operational friction. For production with large corpus, Pinecone or Weaviate are more solid options. If your team already manages PostgreSQL and the corpus doesn’t exceed a certain threshold, pgvector is the option with lowest operational cost.

Optimizations that multiply retrieval

With chunking and embeddings configured correctly, the system should already retrieve relevant information in most cases. What’s missing is solving cases semantic search alone can’t handle, and improving the quality of context reaching the LLM.

Hybrid search: when semantics alone isn’t enough

Semantic search is excellent at finding information related by meaning. It fails when the query contains specific technical terms, proper names, product codes or internal acronyms that the embedding model doesn’t directly associate with the right documents.

The concrete case that made me understand why I needed hybrid search: searching for “RMA-2024-0392” — the number of a return case — and semantic retrieval returned documents about returns in general but not that specific case. It makes perfect sense: “RMA-2024-0392” has no obvious semantic neighbors in vector space. It’s an exact text string that either exists in documents or doesn’t.

BM25 solves exactly that case. Exact term search with BM25 finds the document containing “RMA-2024-0392” with surgical precision.

Combining both — semantic for conceptual questions, BM25 for exact terms — produces retrieval covering both cases without sacrificing either. The result is a system finding information both by meaning and exact terms.

Implementing hybrid retrieval in Python has this structure:

# Hybrid retrieval: semantic + BM25 with configurable weights
from rank_bm25 import BM25Okapi

def hybrid_retrieve(query: str, chunks: list[str], embedder, vector_store,
                    alpha: float = 0.5, top_k: int = 20) -> list[str]:
    """
    alpha: semantic retrieval weight (0.0 = BM25 only, 1.0 = semantic only).
    Assumes vector_store returns (chunk_index, similarity_score) pairs,
    with higher scores meaning more similar.
    """
    # Semantic retrieval
    query_embedding = embedder.encode(query)
    semantic_results = vector_store.similarity_search_with_score(
        query_embedding, k=top_k
    )
    semantic_scores = {doc_id: score for doc_id, score in semantic_results}

    # BM25 retrieval over the full chunk list
    tokenized_corpus = [chunk.lower().split() for chunk in chunks]
    bm25 = BM25Okapi(tokenized_corpus)
    bm25_scores_raw = bm25.get_scores(query.lower().split())

    # Normalize both score sets to [0, 1] before combining
    sem_max = max(semantic_scores.values(), default=1) or 1
    bm25_max = bm25_scores_raw.max() or 1

    combined = {}
    for i in range(len(chunks)):
        sem = semantic_scores.get(i, 0) / sem_max
        bm25_score = bm25_scores_raw[i] / bm25_max
        combined[i] = alpha * sem + (1 - alpha) * bm25_score

    top_indices = sorted(combined, key=combined.get, reverse=True)[:top_k]
    return [chunks[i] for i in top_indices]

The alpha parameter lets you adjust the weight of each strategy to your corpus. For a corpus with lots of exact technical terminology, lowering alpha to 0.3-0.4 gives more weight to BM25. For a corpus of conceptual, free-form text, raising it to 0.7-0.8 prioritizes semantics.
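One way to pick alpha empirically rather than by feel is a small sweep over a labeled test set. `retrieve_fn` is a placeholder wrapping the hybrid retrieval call with a given alpha:

```python
def sweep_alpha(test_set, retrieve_fn,
                alphas=(0.3, 0.4, 0.5, 0.6, 0.7), k: int = 5) -> float:
    """test_set: list of (query, relevant_chunk) pairs.
    retrieve_fn(query, alpha) -> ranked list of chunks.
    Returns the alpha with the best recall@k on the test set."""
    def recall_at_k(alpha: float) -> float:
        hits = sum(1 for query, relevant in test_set
                   if relevant in retrieve_fn(query, alpha)[:k])
        return hits / len(test_set)
    return max(alphas, key=recall_at_k)
```

The same fixed test set used for embedding validation works here; the point is comparing alphas on identical queries.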

Reranking: improving context quality before the LLM

After hybrid search, retrieved fragments are ordered by a combined similarity score. That order isn’t necessarily what maximizes final answer quality — because similarity to the query doesn’t equal real relevance for the specific question.

Reranking applies a second model — slower than embedding but more precise at capturing relevance — to reorder those fragments before building the context the LLM receives. Until I added reranking I didn’t see the real difference on complex questions requiring information from multiple documents. The model received well-prioritized context, and answer quality visibly improved.

Reranking adds the most value when the corpus is large (many candidates to reorder), questions are complex or multifaceted, and the context reaching the LLM has strict token limits. When the corpus is small and questions are straightforward, it can be skipped in early phases without losing much.

Models like Cohere Rerank or sentence-transformers cross-encoders (like cross-encoder/ms-marco-MiniLM-L-6-v2) are common options. Open-source cross-encoders have the advantage of not depending on external APIs, which matters if corpus is confidential.
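A reranking step can be kept model-agnostic by injecting the scoring function. The commented lines show how a sentence-transformers cross-encoder would plug in, assuming the library is installed:

```python
def rerank(query: str, candidates: list[str], score_fn,
           top_n: int = 5) -> list[str]:
    """score_fn(query, doc) -> relevance score, higher is more relevant.
    Injected as a parameter so the pipeline stays testable without a model."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc),
                    reverse=True)
    return ranked[:top_n]

# With a real cross-encoder this could be wired up as (not executed here):
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# reranked = rerank(query, candidates,
#                   score_fn=lambda q, d: model.predict([(q, d)])[0])
```

Batching all (query, candidate) pairs into a single `predict` call is cheaper in practice than scoring one pair at a time.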

Continuous evaluation: the layer that makes everything measurable

A RAG without metrics operates blind. You don’t know if improving chunking helped, if the new embedding model was better, or if reranking is doing anything. Metrics are the feedback mechanism that converts building into an iterative process.

Three metrics matter and together give a complete picture of where the system fails:

Faithfulness: Is the answer grounded in retrieved documents or is the model inventing? An answer can be correct but not supported by the context — that’s hallucination, even if it happens to be right. This metric specifically measures whether the model invents or relies on what it retrieved.

Answer relevancy: Does the answer actually answer the question? Not if it’s correct in the abstract, but if it answers what the user asked. A model can generate a coherent, well-supported answer that doesn’t address the specific question.

Context precision: Are retrieved fragments relevant to the question? This metric diagnoses retrieval independently of generation — if context precision is low, the problem is in chunking, embeddings or hybrid search, not the LLM.

Combining all three lets you diagnose which pipeline stage has the problem. Low context precision points to retrieval. Low faithfulness points to generation or insufficient context. Low answer relevancy could be retrieval, generation, or how the final prompt is constructed.

# Evaluation with RAGAS — the three main metrics
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

def evaluar_pipeline_rag(preguntas, respuestas, contextos, respuestas_referencia):
    dataset = Dataset.from_dict({
        "question": preguntas,
        "answer": respuestas,
        "contexts": contextos,          # list of lists: fragments retrieved per question
        "ground_truth": respuestas_referencia
    })

    resultado = evaluate(
        dataset=dataset,
        metrics=[faithfulness, answer_relevancy, context_precision]
    )
    return resultado

# Reference thresholds to decide if an iteration improved the system
UMBRALES = {
    "faithfulness": 0.85,       # below: review generation or insufficient context
    "answer_relevancy": 0.80,   # below: review retrieval or prompt construction
    "context_precision": 0.75   # below: review chunking or embeddings
}

When a metric drops, diagnostic order matters. First check context precision — if retrieval isn’t bringing relevant fragments, no later adjustment compensates. If context precision is fine but faithfulness drops, verify whether context is sufficient to answer the question or the model is extrapolating. If the first two are fine and answer relevancy drops, the problem usually lies in how the final prompt is constructed or the system retrieving related information but not the specific answer.
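That diagnostic order can be encoded directly, using the threshold values from the evaluation code above. The returned strings are illustrative labels, not a real API:

```python
def diagnose(metrics: dict[str, float], thresholds: dict[str, float]) -> str:
    """Bottom-up diagnosis: retrieval first, then generation, then prompt."""
    # Retrieval first: if context precision is low, nothing downstream compensates
    if metrics["context_precision"] < thresholds["context_precision"]:
        return "review chunking, embeddings or hybrid search"
    # Then generation: context may be insufficient or the model extrapolating
    if metrics["faithfulness"] < thresholds["faithfulness"]:
        return "review generation or context sufficiency"
    # Finally the prompt: related information retrieved, but not the specific answer
    if metrics["answer_relevancy"] < thresholds["answer_relevancy"]:
        return "review prompt construction or retrieval specificity"
    return "all metrics above thresholds"
```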

Common mistakes building an enterprise RAG

Evaluating the embedding model on a generic benchmark, not your real corpus

The most frequent mistake I see: choosing the embedding model based on public benchmarks (MTEB, for example) without validating it on your system’s actual vocabulary. Benchmarks measure general performance; they don’t capture internal terminology, proprietary acronyms or linguistic patterns specific to each company. The model leading MTEB might perform worst on your particular corpus. Manual validation with 20-30 representative questions before committing to the model is non-negotiable.

Fixed chunk size for entire corpus

Applying the same chunk size to documents with radically different structures (a 50-page contract, a 3-paragraph email, a technical spec with tables) produces inconsistent results. Short documents end up as one giant chunk without structure; long, complex documents fragment without respecting their natural semantic units. The solution is a chunking strategy per document type, not a single global parameter.

Ignoring corpus updates

An enterprise RAG works with documents that change: policies update, prices vary, cases resolve. If the vector store doesn’t have a re-ingestion strategy, it starts answering with outdated versions of the information and there’s no way to know: the system answers confidently but with stale data.

The right strategy is detecting modified documents (by hash, timestamp or explicit signal), removing their chunks from the vector store, reprocessing and reinserting. Without this, the system silently degrades over time.
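A hash-based change detector is enough for the first part of that strategy. The dict-based interfaces here are assumptions standing in for your document loader and metadata store:

```python
import hashlib

def detect_changes(current_docs: dict[str, str],
                   stored_hashes: dict[str, str]) -> list[str]:
    """current_docs: doc_id -> raw text as loaded from the source.
    stored_hashes: doc_id -> hash recorded at last ingestion.
    Returns the doc_ids whose chunks must be deleted and re-ingested."""
    changed = []
    for doc_id, text in current_docs.items():
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if stored_hashes.get(doc_id) != digest:
            changed.append(doc_id)
    return changed
```

New documents are caught too: a doc_id absent from `stored_hashes` never matches and is returned as changed.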

Adding reranking before validating base retrieval

Reranking improves the order of fragments that retrieval already found. If base retrieval isn’t bringing the right fragments, reranking can’t invent them. I’ve seen teams add a cross-encoder expecting it to solve problems actually in chunking or the embedding model. Correct diagnosis is always bottom-up: chunking → embeddings → retrieval → reranking.

Not having a fixed test set before iterating

Without a set of reference questions with known answers, any pipeline change is a blind leap. It might seem like improvement or degradation with no objective way to measure it. Building that test set — 50-100 representative questions with expected answers — is the first task before optimizing any component.

Phase-based building framework

The complete system doesn’t build all at once. The following phases order decisions from greatest to least impact, with concrete criteria for knowing when to advance.

Phase 1 — Minimum viable functionality

This phase’s goal isn’t having the best possible system, but having a system that works measurably and can be iterated on.

  • Semantic chunking adapted to your corpus’s documentation type

  • Embedding model validated over a sample of 20-30 representative questions

  • Local vector store (Chroma) to iterate without operational friction

  • Test set of 50-100 questions with reference answers

  • Basic evaluation of faithfulness and answer relevancy over that set

Advancement criterion: faithfulness > 0.80 and answer relevancy > 0.75 consistently on test set. If not reached, the problem is in chunking or embeddings — not what comes next.

Phase 2 — Optimize retrieval

With the foundation validated, retrieval optimizations have a clear baseline to measure impact.

  • Hybrid search (semantic + BM25) with alpha adjusted to corpus type

  • Reranking for complex questions or when context has strict token limits

  • Production-capable vector store if corpus exceeds 100K documents

  • Extended metrics: add context precision to evaluation set

Advancement criterion: measurable improvement in context precision (> 0.75) without degrading faithfulness. If context precision rises but faithfulness drops, something in reranking or context construction is breaking coherence.

Phase 3 — Continuous evaluation and scale

This phase doesn’t add new retrieval capabilities — it consolidates the system to maintain quality over time and at scale.

  • Automated evaluation pipeline running in CI with every system change

  • Production metric monitoring (at least daily sample of real queries)

  • Re-ingestion strategy for documents that update

  • Chunking strategy and vector store review if corpus exceeds 1M documents

Advancement criterion: system detects metric regressions before reaching users. If a pipeline change degrades faithfulness by 5 points, CI detects it before deployment.

Implementation checklist

  • Chunking respects document’s semantic units (sections, complete paragraphs, entities)

  • Embedding model is validated over a sample of questions representative of the real domain

  • Vector store is configured for current corpus size with scaling migration plan

  • Overlap between chunks is sized by document type, not a global fixed value

  • Hybrid search has configurable alpha parameter adjusted to system’s query type

  • Reranking is applied before building final context the LLM receives

  • Test set of 50-100 questions with reference answers exists before iterating

  • Faithfulness, answer relevancy and context precision metrics measured with every change

  • Re-ingestion strategy for updated documents is defined and documented

Frequently Asked Questions

When to use RAG vs fine-tuning?

This is the question that paralyzes teams the most, and it has a more precise answer than it seems. The confusion comes from treating RAG and fine-tuning as alternatives, when they really answer different questions.

Fine-tuning modifies model weights: it changes how it thinks or speaks, not what it knows. RAG doesn’t touch the model — it adds context at inference time with information the model didn’t have in training. That architectural difference determines when to use each.

RAG is the right answer when knowledge changes frequently (prices, policies, cases that resolve), when corpus is extensive and doesn’t fit in the model’s context, when traceability matters — you need to know what specific fragment grounds each answer — or when knowledge is confidential and you can’t send it to a provider for fine-tuning.

Fine-tuning is the right answer when you want to change behavior, tone or response format consistently, when knowledge is stable and well-defined (internal taxonomies, specific output format), or when you need very low latency and RAG’s additional context is the bottleneck.

The most effective production pattern combines both: fine-tuning so the model speaks the company’s language (terminology, response format, tone), RAG so it knows what the company knows today. The practical criterion: if the team’s question is “what does the model know?”, RAG. If the question is “how does the model respond?”, fine-tuning. If both questions have an answer, both tools.

Why is “what’s the right chunk size?” a poorly formed question?

Because there’s no correct number — there’s a correct number for each combination of document type, embedding model, and typical query length in your system.

The useful question isn’t “what size do I use?” but “how do I find the right size for my corpus?” The answer: measure context precision with three or four candidate chunk sizes on your test set. The size maximizing context precision without producing chunks too small to have meaning is correct for that specific corpus. In enterprise technical documentation, ranges of 256-512 tokens usually work well, but these are starting points for measurement, not final answers.
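A minimal sketch of that measurement loop, where `build_index_fn` and `eval_fn` are placeholders for your own ingestion pipeline and context-precision evaluation:

```python
def sweep_chunk_size(candidate_sizes, build_index_fn, eval_fn):
    """build_index_fn(size) -> a retrieval pipeline indexed at that chunk size.
    eval_fn(pipeline) -> context precision measured on the fixed test set.
    Returns the best size and the full score map for inspection."""
    results = {size: eval_fn(build_index_fn(size)) for size in candidate_sizes}
    best = max(results, key=results.get)
    return best, results
```

Inspecting the full score map matters: if two sizes score nearly the same, prefer the larger one to keep chunks self-contained.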

How to handle documents that update?

Without an explicit re-ingestion strategy, the RAG starts answering with outdated versions of the information. The mechanism is simple: keep a record of each document in the vector store with its hash or timestamp, detect when a source document is modified, delete all its chunks from the vector store, then reprocess and reinsert from scratch.

The critical part often overlooked is deleting the old chunks before inserting the new ones. If they aren’t deleted, the vector store ends up with multiple versions of the same document, and retrieval returns fragments mixed from different versions. Most vector stores support deletion by metadata (filtering by doc_id or source), which makes this operation straightforward.

How does the system scale as corpus grows?

Some decisions scale well from the start and others need review as the corpus grows. Semantic chunking and embeddings scale without changes: the ingestion process is the same for 10K or 1M documents, just slower.

What changes with scale is the vector store and indexing strategy. Chroma works well through tens of thousands of documents; beyond that, retrieval latency becomes perceptible and it’s time to migrate to Pinecone, Weaviate or a solution with HNSW indexing optimized for large corpus. Hybrid search with BM25 also needs scale review — BM25 on millions of documents requires an efficient inverted index (Elasticsearch or Weaviate with BM25 module, not in-memory implementation).

The metric to monitor as the corpus grows isn’t just retrieval latency but retrieval quality: with more documents, the probability of noisy fragments appearing in the top results rises, and reranking shifts from an optimization to a necessary component.