RAG Optimization Notes (First-Person)
After reviewing recent RAG optimization materials, my conclusion is straightforward:
The bottleneck of RAG is no longer “can it run,” but “can it hit reliably, stay controllable, and remain measurable in production.”
I now break RAG optimization into four layers:
- Pre-retrieval optimization (Query + Chunk)
- Retrieval-time optimization (Recall + Rank)
- Post-retrieval optimization (Context Packing + Compression)
- Production loop optimization (Evaluation + Feedback)
1) Pre-Retrieval Optimization: Fix Input and Corpus Quality First
What I focus on
- Semantic chunking
  - I no longer use fixed 300/500-token hard cuts.
  - I chunk by semantic paragraphs, code boundaries, and heading hierarchy.
  - My goal is to make each chunk self-contained and independently citable.
- Query rewriting
  - Normalize colloquial user questions into domain terms.
  - Handle abbreviations, aliases, and typo normalization.
  - Decompose complex questions into sub-queries.
- HyDE (Hypothetical Document Embeddings)
  - Generate an “ideal answer draft” first.
  - Retrieve using the draft embedding, not only the short user query.
  - I treat HyDE as a recall-boost switch, enabled only in low-recall scenarios.
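To make the chunking point concrete, here is a minimal sketch of heading-aware chunking. It assumes the corpus is Markdown-like text with `#` headings and fenced code blocks. The names `Chunk` and `chunk_by_headings` are my own illustrations, not from any library. Each chunk keeps its heading path so it can be cited on its own, and lines inside a code fence are never treated as headings.

```python
import re
from dataclasses import dataclass

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # code-fence marker (three backticks)


@dataclass
class Chunk:
    heading_path: list[str]  # e.g. ["Guide", "Install"], used for citation
    text: str


def chunk_by_headings(doc: str) -> list[Chunk]:
    """Split a Markdown-like document at headings, keeping code fences intact."""
    chunks: list[Chunk] = []
    path: list[str] = []  # current heading hierarchy
    buf: list[str] = []   # lines collected for the current chunk
    in_fence = False

    def flush() -> None:
        body = "\n".join(buf).strip()
        if body:
            chunks.append(Chunk(list(path), body))
        buf.clear()

    for line in doc.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            del path[level - 1:]  # drop headings at the same or a deeper level
            path.append(match.group(2).strip())
        else:
            buf.append(line)
    flush()
    return chunks
```

In practice I would still add a size cap on top of this (splitting oversized sections by paragraph), but the heading path is what makes a chunk self-contained.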
My assessment
If pre-retrieval is weak, reranking/compression/caching are mostly damage control.
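The "HyDE as a switch" idea can be sketched as a fallback: try the plain query first, and only generate a hypothetical draft when the best hit looks weak. Everything here is an assumption for illustration. `embed`, `search`, and `draft_answer` are injected callables standing in for an embedding model, a vector index, and an LLM call, and `min_top_score` is an arbitrary threshold that would need tuning per corpus.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Hit = tuple[str, float]  # (doc_id, similarity score), best hit first


def retrieve_with_hyde_fallback(
    query: str,
    embed: Callable[[str], Vector],
    search: Callable[[Vector, int], list[Hit]],
    draft_answer: Callable[[str], str],
    k: int = 5,
    min_top_score: float = 0.35,  # hypothetical low-recall threshold
) -> tuple[list[Hit], bool]:
    """Return (hits, used_hyde). HyDE runs only when the plain query looks weak."""
    hits = search(embed(query), k)
    if hits and hits[0][1] >= min_top_score:
        return hits, False

    # Low-recall path: embed a hypothetical answer draft instead of the short query.
    draft = draft_answer(query)
    hyde_hits = search(embed(draft), k)
    return (hyde_hits or hits), True
```

Returning the `used_hyde` flag makes it easy to log how often the switch fires, which is the number I would watch before enabling it more broadly.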
2) Retrieval-Time Optimization: Multi-Path Recall + Rerank, Not Vector-Only
My current approach
- Hybrid search
  - Dense vectors for semantic recall.
  - Sparse retrieval (BM25/keywords) to recover exact-match cases.
  - Fuse results before reranking.
- Two-stage ranking (Recall L1 -> Rank L2)
  - Stage 1 maximizes recall (better to over-fetch).
  - Stage 2 reranker narrows to top-k precision.
- Cross-encoder / API rerank
  - Score query-doc pairs directly.
  - More stable than pure embedding similarity, especially on long chunks.
My assessment
In production, the issue is often not “nothing found,” but “too many low-precision hits.” Rerank is not optional; it is a quality gate.
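A minimal sketch of "fuse, then rerank" follows. My notes do not name a fusion method, so I use Reciprocal Rank Fusion (RRF) here as one common choice. The function names are illustrative, and `score_pair` is a placeholder for whatever cross-encoder or rerank API is in use.

```python
from collections import defaultdict
from typing import Callable


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc_id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def two_stage_retrieve(
    query: str,
    dense_ranking: list[str],
    sparse_ranking: list[str],
    texts: dict[str, str],
    score_pair: Callable[[str, str], float],  # stand-in for a reranker
    top_k: int = 3,
) -> list[str]:
    """Stage 1 over-fetches via fusion; stage 2 scores (query, doc) pairs directly."""
    candidates = rrf_fuse([dense_ranking, sparse_ranking])
    reranked = sorted(
        candidates,
        key=lambda d: score_pair(query, texts[d]),
        reverse=True,
    )
    return reranked[:top_k]
```

RRF only needs ranks, not comparable scores, which is why I find it a convenient default when dense and sparse scores live on different scales. Weighted score fusion is the alternative if I want tunable fusion weights.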
3) Post-Retrieval Optimization: Turn Context into High-Density Evidence
Three things I optimize
- Evidence compression
  - Rerank first, then compress.
  - Remove weakly relevant sentences, template noise, and duplicates.
  - Keep entities, numbers, and conclusion-bearing sentences.
- Context packing strategy
  - Do not concatenate by raw retrieval order.
  - Repack by “question sub-intent -> evidence groups.”
  - Tag each evidence block with source IDs for traceability.
- Cache-friendly prompt assembly
  - Place stable system prefixes and static background first.
  - Maximize prefix reuse and cache hit rate (cost + latency benefits).
My assessment
RAG cost is often dominated not by retrieval itself, but by sending low-value context to the LLM. Post-retrieval refinement is one of the most direct cost levers.
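Here is a rough sketch of packing and prompt assembly under these rules. It assumes each evidence item already carries a sub-intent label and a rerank score. The `Evidence` type, the character budget (a crude stand-in for a token budget), and the output layout are all illustrative choices of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    source_id: str   # e.g. "doc1#2", kept for traceability
    sub_intent: str  # which part of the question this evidence supports
    text: str
    score: float     # rerank score, higher is better


def pack_context(evidence: list[Evidence], max_chars: int = 2000) -> str:
    """Deduplicate, enforce a size budget, and group evidence by sub-intent."""
    seen: set[str] = set()
    groups: dict[str, list[Evidence]] = {}
    used = 0
    for ev in sorted(evidence, key=lambda e: e.score, reverse=True):
        key = " ".join(ev.text.lower().split())  # whitespace/case-insensitive dedupe
        if key in seen or used + len(ev.text) > max_chars:
            continue
        seen.add(key)
        used += len(ev.text)
        groups.setdefault(ev.sub_intent, []).append(ev)

    lines: list[str] = []
    for intent, items in groups.items():
        lines.append(f"Evidence for: {intent}")
        lines.extend(f"[{ev.source_id}] {ev.text}" for ev in items)
    return "\n".join(lines)


def build_prompt(system_prefix: str, static_background: str,
                 context: str, question: str) -> str:
    """Stable parts first so a prefix cache can reuse them across requests."""
    return "\n\n".join(
        [system_prefix, static_background, context, f"Question: {question}"]
    )
```

The ordering in `build_prompt` is the whole point: anything that changes per request (context, question) goes after the parts that never change.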
4) Production Loop Optimization: Make RAG a System, Not a Demo
My evaluation perspective
- Retrieval-layer metrics
  - Recall@k
  - MRR / nDCG
  - Hit-rate buckets (short query / long query / code query)
- Generation-layer metrics
  - Faithfulness (is the answer grounded in evidence?)
  - Answer relevance (does it answer the actual question?)
  - Context precision (how much retrieved context is truly useful?)
- System-layer metrics
  - P95 latency
  - Per-query token cost
  - Cache hit rate
  - Fallback-routing ratio (share of queries that need backup retrieval or web search)
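For reference, the retrieval-layer metrics can be computed with a few lines of plain Python. This is a minimal sketch using the standard definitions. The function names are mine, and the nDCG version assumes graded gains are supplied per document ID and that each ranking contains no duplicate IDs.

```python
import math


def recall_at_k(ranked: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant documents that appear in the top k."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant document, or 0 if none was retrieved."""
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: list[tuple[list[str], set[str]]]) -> float:
    """MRR over (ranking, relevant set) pairs, one pair per query."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked: list[str], gains: dict[str, float], k: int) -> float:
    """DCG of the ranking divided by the DCG of the ideal ordering."""
    dcg = sum(
        gains.get(doc_id, 0.0) / math.log2(i + 1)
        for i, doc_id in enumerate(ranked[:k], start=1)
    )
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 1) for i, g in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

I compute these per bucket (short, long, code queries) rather than as one global number, since an average can hide a bucket that is quietly failing.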
My feedback loop
- User query -> recall -> rerank -> generate answer
- Evaluator scores answer and evidence automatically
- Low-score samples flow into a hard-case dataset
- Weekly regression over retrieval params, chunking policy, and reranker setup
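The "low-score samples flow into a hard-case dataset" step can be as simple as appending to a JSONL file. This is a sketch under assumptions: the evaluator is taken to return scores in [0, 1], and the `ScoredSample` fields, the 0.6 threshold, and the file layout are all hypothetical.

```python
import json
from dataclasses import asdict, dataclass
from pathlib import Path


@dataclass
class ScoredSample:
    query: str
    answer: str
    evidence_ids: list[str]
    faithfulness: float       # evaluator score, assumed to be in [0, 1]
    answer_relevance: float   # evaluator score, assumed to be in [0, 1]


def route_hard_cases(samples: list[ScoredSample], path: Path,
                     threshold: float = 0.6) -> int:
    """Append samples whose weakest score is below the threshold; return the count."""
    hard = [
        s for s in samples
        if min(s.faithfulness, s.answer_relevance) < threshold
    ]
    with path.open("a", encoding="utf-8") as fh:
        for sample in hard:
            fh.write(json.dumps(asdict(sample), ensure_ascii=False) + "\n")
    return len(hard)
```

Using the minimum of the two scores means a fluent but ungrounded answer still lands in the hard-case set, which is the failure mode I most want the weekly regression to catch.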
Vendor/Framework Recommendations I Use as Baseline
I prioritize official vendor/framework docs over second-hand summaries.
- Microsoft Learn: Build Advanced Retrieval-Augmented Generation Systems
  - End-to-end advanced RAG workflow
  - Strong emphasis on query rewriting, post-retrieval processing, and evaluation loops
- Azure Architecture Center: Develop a RAG Solution - Information-Retrieval Phase
  - Systematic retrieval-phase guidance
  - Explicitly covers query augmentation/decomposition/rewriting/HyDE
- Anthropic Engineering: Contextual Retrieval
  - Practical guidance on hybrid retrieval and context utilization
  - Clearly addresses “retrieved is not equal to used correctly”
- Anthropic Help: Retrieval Augmented Generation (RAG) for Projects
  - Checklist-oriented practical recommendations for productization
- Cohere Docs: Best Practices for using Rerank
  - Practical rerank guidance for input organization and deployment
- Paper: Lost in the Middle
  - Evidence for middle-context utilization degradation
  - Supports the need for reranking, compression, and packing
- Foundational retrieval+generation paradigm
How I Integrate These Optimizations into Real AI Application Iteration
I run a weekly optimization loop:
Step 0: Define scenario buckets and baseline
- Build 100–300 real QA samples (bucketed by scenario).
- Record baseline: retrieval hit quality, answer quality, latency, and cost.
Step 1: Change only one variable per iteration
I modify one parameter at a time:
- Chunking policy
- Query rewriting switch
- Hybrid fusion weights
- Reranker model/threshold
- Context compression ratio
This avoids confounded results.
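One way to enforce the one-variable rule mechanically is to diff the candidate config against the baseline before an experiment is allowed to run. A minimal sketch follows. The config keys used in the usage example are hypothetical names for the parameters listed above.

```python
_MISSING = object()  # sentinel so a missing key differs from an explicit None


def changed_keys(baseline: dict, candidate: dict) -> list[str]:
    """Names of config keys whose values differ between the two runs."""
    keys = sorted(set(baseline) | set(candidate))
    return [
        k for k in keys
        if baseline.get(k, _MISSING) != candidate.get(k, _MISSING)
    ]


def single_change(baseline: dict, candidate: dict) -> str:
    """Return the one changed key, or raise if zero or several keys changed."""
    diff = changed_keys(baseline, candidate)
    if len(diff) != 1:
        raise ValueError(f"expected exactly one changed variable, got {diff}")
    return diff[0]
```

For example, with a baseline of `{"chunking": "heading", "query_rewrite": False}`, a candidate that only flips `query_rewrite` passes, while one that also swaps the chunking policy is rejected before any evaluation budget is spent.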
Step 2: Pass offline evaluation first
- No offline pass, no online rollout.
- I check three dimensions: quality gain, latency impact, cost impact.
Step 3: Online canary with rollback thresholds
- Roll out on small traffic.
- Set automatic rollback thresholds (P95, complaint rate, empty-answer rate).
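The rollback check itself can be sketched as a pure function over canary counters. The threshold values, field names, and the nearest-rank P95 definition are assumptions on my part. A real setup would read these numbers from whatever monitoring system is in place.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CanaryThresholds:
    max_p95_ms: float
    max_complaint_rate: float
    max_empty_answer_rate: float


def p95(latencies_ms: list[float]) -> float:
    """Nearest-rank 95th percentile."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def rollback_reasons(latencies_ms: list[float], complaints: int,
                     empty_answers: int, total: int,
                     limits: CanaryThresholds) -> list[str]:
    """Return the breached thresholds; an empty list means the canary may continue."""
    if total <= 0:
        raise ValueError("total must be positive")
    reasons: list[str] = []
    if p95(latencies_ms) > limits.max_p95_ms:
        reasons.append("p95_latency")
    if complaints / total > limits.max_complaint_rate:
        reasons.append("complaint_rate")
    if empty_answers / total > limits.max_empty_answer_rate:
        reasons.append("empty_answer_rate")
    return reasons
```

Returning the list of reasons rather than a bare boolean keeps the rollback decision auditable, which matters when I later review why a change was reverted.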
Step 4: Convert wins into engineering assets
I persist proven improvements into:
- Retrieval config templates
- Prompt/context assembly conventions
- RAG regression scripts
- Failure case datasets and labeling rules
My Conclusion
My final view on RAG optimization:
- Pre-retrieval defines the ceiling (is the question represented correctly?)
- Retrieval-time defines hit quality (are we finding the right evidence?)
- Post-retrieval defines cost and usability (is high-density evidence delivered to the LLM?)
- Production loop defines sustainability (can quality keep improving?)
One-line summary:
RAG optimization is not "just tune model parameters"; it is engineering governance across retrieval, reranking, context construction, evaluation, and feedback.