Most RAG demos are theater
A bot answers three curated questions correctly and the room claps. Then a real user asks the fourth question — and the system invents a refund policy.
RAG is not magic. It's a small number of choices, made well.
The choices that move the needle
- Chunking strategy — semantic chunking beats fixed windows for most knowledge bases. Title-aware splitters beat semantic chunking for docs with strong hierarchy.
- Hybrid search — BM25 + dense vectors recovers the keyword-heavy queries that pure embeddings miss. The cost is one extra index.
- Rerankers — a cross-encoder over the top 50 results is the single cheapest accuracy upgrade in the stack.
- Citations — answers without verifiable source links are not answers. They're suggestions.
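To make the hybrid search bullet concrete, here is a minimal sketch of one common way to merge a BM25 ranking with a dense-vector ranking: reciprocal rank fusion (RRF). The post does not name a fusion method, so RRF is an assumption here, as are the function name, the document IDs, and the constant k=60 (a widely used default, not a tuned value).

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into a single ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Documents found by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results for one query from two separate indexes.
bm25_hits = ["refund-policy", "shipping-faq", "returns-form"]
dense_hits = ["returns-form", "refund-policy", "warranty"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
print(fused)
# ['refund-policy', 'returns-form', 'shipping-faq', 'warranty']
```

RRF only needs ranks, not comparable scores, which is why it is a convenient default when BM25 scores and cosine similarities live on different scales. The fused list is also a natural input for the reranker step.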
The eval set you'll wish you had
Start with 100 real questions from real users. Tag each one with the expected source document. Now you can:
- Measure recall@k for the retriever in isolation.
- Measure answer faithfulness (is every claim in the answer supported by the retrieved passages?) for the generator in isolation.
- Catch regressions before your customers do.
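The retriever half of that list fits in a few lines. The sketch below assumes each eval item carries exactly one expected source document, in which case recall@k is simply the fraction of questions whose expected source shows up in the top k results. The field names, the `fake_retrieve` stub, and the sample questions are all hypothetical; swap in your own retriever.

```python
def recall_at_k(eval_set, retrieve, k=5):
    """Fraction of questions whose expected source is in the top-k results.

    `retrieve` is any callable that maps a question to a ranked list of
    document IDs. Assumes one expected source per question.
    """
    if not eval_set:
        raise ValueError("eval_set is empty")
    hits = 0
    for item in eval_set:
        top_k = retrieve(item["question"])[:k]
        if item["expected_source"] in top_k:
            hits += 1
    return hits / len(eval_set)


# Hypothetical eval set: real questions tagged with their expected source.
eval_set = [
    {"question": "How do I get a refund?", "expected_source": "refund-policy"},
    {"question": "Where is my order?", "expected_source": "shipping-faq"},
    {"question": "Is my blender under warranty?", "expected_source": "warranty"},
    {"question": "Can I return a gift?", "expected_source": "returns-form"},
]

# Stand-in for a real retriever: canned rankings keyed by question.
CANNED = {
    "How do I get a refund?": ["refund-policy", "returns-form", "warranty"],
    "Where is my order?": ["returns-form", "shipping-faq", "warranty"],
    "Is my blender under warranty?": ["refund-policy", "returns-form", "warranty"],
    "Can I return a gift?": ["refund-policy", "shipping-faq", "warranty"],
}


def fake_retrieve(question):
    return CANNED[question]


print(recall_at_k(eval_set, fake_retrieve, k=1))  # 0.25
print(recall_at_k(eval_set, fake_retrieve, k=3))  # 0.75
```

Run this on every change to chunking, indexing, or reranking, and a drop in the number flags a retrieval regression before any generated answer is involved.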
What we build
We ship RAG systems that quote their sources, fall back gracefully when confidence is low, and tell you — in plain English — when they don't know. That last part is the hardest, and the most valuable.


