Agentic AI Evaluation: From Development to Production
A Practitioner’s Guide for Engineers and Technical Leaders
Grounded in: EquipmentIQ — a production multi-agent RAG system for CNC machinery predictive maintenance, built on 1,702 real vibration recordings from the Bosch CNC Machining Dataset (CC-BY-4.0). Every principle in this guide was learned from a real failure in a live system.
Table of Contents
- Why Agentic AI Systems Require a Different Evaluation Model
- The Evaluation Stack: Three Layers, Three Purposes
- The Evaluation Lifecycle: Four Phases
- The Metrics Reference
- The Human Feedback Loop
- Failure Mode Taxonomy
- The Live Debugging Cycle
- Domain Coverage Testing
- Key Principles for Technical Leaders
- Real-World Results: EquipmentIQ Benchmark
1. Why Agentic AI Systems Require a Different Evaluation Model
Traditional software systems are deterministic: a given input always produces the same output. A sorting function returns the same sorted list every time. A query against a relational database returns the same rows. Testing is straightforward — write assertions, run them, read pass/fail.
Agentic AI systems break this contract at every layer.
The Five Sources of Non-Determinism
| Source | Description | Impact on evaluation |
|---|---|---|
| Intent classification | The same query can be routed to different agents depending on model state, prompt phrasing, and context window content | Routing accuracy must be measured independently of retrieval quality |
| Retrieval | Vector similarity rankings shift when embeddings change, collections are updated, or the query distribution evolves | Retrieval metrics must be re-run after every knowledge base change |
| Reranking | Cross-encoder scores vary with model version, chunk boundaries, and co-occurring context | Reranker model versions must be pinned and change-controlled |
| Generation | LLM output is stochastic — the same prompt produces different answers across calls | Generation quality requires statistical sampling, not single-query testing |
| Feedback loops | The system changes over time as golden sets grow, prompts are tuned, and models are updated | Evaluation itself must be versioned and regression-tested |
The Consequence for Engineering Teams
A system can produce plausible-looking answers while being fundamentally broken. In the EquipmentIQ build, the orchestrator was returning answers within the expected latency range with no errors — yet retrieval was returning 0.00 NDCG across all collections. The answers were hallucinated. Without a structured evaluation pipeline, this failure would have reached users.
The core principle: Evaluation is not validation that the system works. It is the definition of what “working” means.
2. The Evaluation Stack: Three Layers, Three Purposes
Every agentic AI system must be evaluated at three distinct layers. These layers are hierarchical — failures at Layer 1 make Layer 2 measurements meaningless, and failures at Layer 2 make Layer 3 measurements misleading.
┌──────────────────────────────────────────────────────────────────┐
│ LAYER 3: System Health │
│ Latency · Token Cost · Embedding Drift · Feedback Rate │
│ Measured: continuously in production │
├──────────────────────────────────────────────────────────────────┤
│ LAYER 2: Generation Quality │
│ Faithfulness · Answer Relevance · LLM-as-Judge Score │
│ Measured: on sampled production traffic and golden set │
├──────────────────────────────────────────────────────────────────┤
│ LAYER 1: Retrieval Quality │
│ NDCG@K · Hit Rate@K · MRR · Routing Accuracy │
│ Measured: against golden set before every deployment │
└──────────────────────────────────────────────────────────────────┘
Fix Layer 1 before measuring Layer 2.
Fix Layer 2 before trusting Layer 3.
Layer Interaction — Why Order Matters
| Scenario | Layer 1 state | Layer 2 state | What it means |
|---|---|---|---|
| Retrieval broken, generation good | NDCG = 0.00 | Faithfulness appears high | Faithfulness is measuring wrong context — false positive |
| Routing broken, retrieval correct | Hit Rate = 0.00 | Answers empty | Correct agent never runs — routing must be fixed first |
| Retrieval good, generation broken | NDCG = 1.00 | Faithfulness = 0.10 | Context not reaching synthesiser — code bug, not quality issue |
| All layers healthy | NDCG ≥ 0.70 | Faithfulness ≥ 0.80 | System is functioning correctly |
What Each Layer Catches
Layer 1 catches: wrong agent routing, empty retrieval, embedding mismatches, golden set ID mismatches, similarity floor miscalibration.
Layer 2 catches: hallucination, incomplete answers, ignored context, synthesis prompt regressions, model degradation.
Layer 3 catches: latency regressions, cost overruns, query distribution drift, embedding space degradation, user satisfaction decline.
3. The Evaluation Lifecycle: Four Phases
Evaluation is not a one-time activity. It runs continuously across the full system lifecycle, with different goals and methods at each phase.
| Phase 1: Ground Truth | Phase 2: Development Evaluation | Phase 3: Pre-Deploy Gates | Phase 4: Production Monitoring |
|---|---|---|---|
| Build golden set before writing code | Unit tests · Retrieval eval · Routing checks · Embedding consistency | Numerical gates · CI/CD blocking · Load testing · Regression suite | Real-time traces · Sampled scoring · Drift detection · Feedback loop |
Phase 1 — Ground Truth Before Code
The most common and most damaging evaluation mistake is building the ground truth test set after the system is already working. This creates circular validation: the system is tuned until it passes tests written while observing the system’s output. The tests then measure the system’s behaviour, not its correctness.
Ground truth must be built before the system runs.
What the Golden Set Contains
A golden set is a curated collection of question-answer pairs where the correct answer is known independently of what the system produces. For a multi-agent RAG system, each entry specifies:
| Field | Description | Example |
|---|---|---|
| query | The natural language question | “What does error SPN-CR-001 mean?” |
| agent | Which agent should handle this query | software |
| expected_doc_ids | IDs of chunks that must appear in retrieval results | ["SPN-CR-001_chunk_0"] |
| ground_truth_answer | The correct answer from authoritative source | “CRITICAL — spindle bearing failure…” |
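Assembled into a single record, one golden set entry built from the fields above might look like the following. The dict layout is purely illustrative; it reuses the example values from the table and does not show the project's actual on-disk format.

```python
# Illustrative golden set entry using the example values from the table above.
golden_entry = {
    "query": "What does error SPN-CR-001 mean?",
    "agent": "software",
    "expected_doc_ids": ["SPN-CR-001_chunk_0"],
    "ground_truth_answer": "CRITICAL — spindle bearing failure…",
}
```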
Golden Set Sizing
| System size | Minimum entries | Entries per agent | Rationale |
|---|---|---|---|
| 1-agent system | 20 | 20 | Provides statistical stability for NDCG |
| 3-agent system | 30 | 10 per agent | One failure moves NDCG by 0.10 — detectable |
| 5-agent system | 50 | 10 per agent | Cross-domain pairs needed in addition |
| Production system | 90+ | 30 per agent | Organic growth from user feedback |
The Four Golden Set Rules
| Rule | Description | Risk if violated |
|---|---|---|
| Use live collection IDs | Extract expected_doc_ids by querying the live vector store — never write from memory | ID format mismatches produce NDCG = 0.00 regardless of retrieval quality |
| Grow from failures | Every production failure with a known correct answer becomes a new golden set entry | Test set becomes increasingly disconnected from real usage |
| Version control | Golden set changes require review — never update to make failing tests pass | Evaluation fraud: tests pass but system quality has not improved |
| Minimum density | At least 10 entries per agent | Fewer entries make metric noise indistinguishable from real signal |
Phase 2 — Development Evaluation
During development, evaluation has a single goal: get retrieval and routing verifiably correct before touching generation quality.
The Development Evaluation Loop
Write component
↓
Run unit tests (fast, mocked, every commit)
↓
Run retrieval evaluation (against live collections)
↓
┌─── NDCG ≥ 0.70? ───No──→ Diagnose layer 1 failure → Fix
│
Yes
↓
Run routing evaluation (40 labelled queries)
↓
┌─── Accuracy ≥ 95%? ──No──→ Fix intent classifier prompt → Retest
│
Yes
↓
Run generation evaluation (sampled)
↓
Commit
Unit Tests vs Evaluation Tests — Critical Distinction
| Dimension | Unit tests | Evaluation tests |
|---|---|---|
| Purpose | Verify code correctness | Measure system quality |
| Output | Binary pass/fail | Numerical score vs threshold |
| Speed | Milliseconds | Seconds to minutes |
| Frequency | Every commit | Scheduled or triggered |
| Dependencies | All mocked | Real collections and APIs |
| Failure action | Block commit | Diagnose and tune |
| Example pass | NDCG formula returns 0.0 when no match | NDCG ≥ 0.70 on 10 golden queries |
| Example fail | NDCG formula returns 1.28 (formula bug) | NDCG = 0.27 on support collection |
Both are required. Neither replaces the other. In EquipmentIQ, 91 unit tests ran in under 60 seconds with all external dependencies mocked. A separate evaluation suite ran against live collections and real APIs. The unit tests found the NDCG formula bug (returning 1.28). The evaluation suite found the routing collapse (software domain scoring 30%).
The NDCG Formula — A Common Implementation Bug
NDCG is the industry standard retrieval quality metric. It rewards systems that rank relevant documents at the top of the result list. The formula has one common implementation error that produces values above 1.0 — a mathematical impossibility that signals a bug in the evaluation code itself.
| Implementation | Formula | Result range | Correctness |
|---|---|---|---|
| Incorrect | relevance / (rank + 1) | [0, ∞) | ❌ Can exceed 1.0 |
| Correct | relevance / log₂(rank + 1) | [0, 1.0] | ✅ Mathematically bounded |
The correct NDCG computation:
For each retrieved document at rank r (starting at 1):
DCG += relevance(r) / log₂(r + 1)
For the ideal ranking of the same documents:
IDCG += 1.0 / log₂(r + 1) for each relevant doc in top-K positions
NDCG = DCG / IDCG (clamped to [0.0, 1.0])
Always add a hard clamp to [0.0, 1.0]. Any value above 1.0 is a formula bug — not an exceptionally good retrieval result.
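A minimal sketch of the corrected computation in Python, assuming binary relevance (a retrieved document either is or is not in the expected set). The function name and signature are illustrative, not the project's evaluation API.

```python
import math

def ndcg_at_k(retrieved_ids: list[str], expected_ids: list[str], k: int = 5) -> float:
    """NDCG@K with binary relevance, hard-clamped to [0.0, 1.0]."""
    relevant = set(expected_ids)
    # DCG: discount each hit by log2(rank + 1), NOT by (rank + 1).
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(retrieved_ids[:k], start=1)
        if doc_id in relevant
    )
    # IDCG: the same discount applied to an ideal ranking with every
    # relevant document stacked at the top of the result list.
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    if idcg == 0.0:
        return 0.0
    # Any value above 1.0 signals a formula bug, never a great retrieval result.
    return max(0.0, min(1.0, dcg / idcg))
```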
Routing Accuracy — The Overlooked Development Metric
In multi-agent systems, routing accuracy is as consequential as retrieval quality. A query routed to the wrong agent produces a confidently wrong answer — often more harmful than an empty response, because it signals false certainty.
| Routing scenario | User experience | Risk level |
|---|---|---|
| Correct agent, relevant chunks | Accurate answer with citations | Low |
| Correct agent, irrelevant chunks | INSUFFICIENT_CONTEXT response | Medium — user knows to ask differently |
| Wrong agent, relevant chunks | Partial answer from wrong domain | High — misleading but not empty |
| Wrong agent, irrelevant chunks | INSUFFICIENT_CONTEXT from wrong domain | High — user has no path forward |
| Cross-domain when ambiguous | Parallel retrieval, merged context | Low — designed for ambiguity |
Target: 95% routing accuracy on a labelled test set of 40 queries (10 per domain). This test must run after every change to the intent classification prompt.
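Measuring this requires nothing more than a labelled query list and a loop. The sketch below assumes a classify_intent(query) callable that returns a domain string; both the callable and the example cases are placeholders, not the project's actual interface.

```python
# Hypothetical labelled routing cases: (query, expected_domain), 10 per domain.
ROUTING_CASES = [
    ("What does error SPN-CR-001 mean?", "software"),
    ("What are the ISO 10816-3 vibration zones?", "mechanical"),
    # ... remaining labelled queries ...
]

def routing_accuracy(classify_intent) -> float:
    """Fraction of labelled queries routed to the expected agent."""
    correct = sum(
        1 for query, expected in ROUTING_CASES
        if classify_intent(query) == expected
    )
    return correct / len(ROUTING_CASES)

# Gate AC-004: re-run after every intent prompt change and require >= 0.95.
```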
Embedding Consistency — The Silent Production Killer
The most architecturally damaging failure in EquipmentIQ was an embedding dimension mismatch: documents were ingested using a 1536-dimensional embedding model, but the retrieval layer was querying using a 384-dimensional model. Every retrieval call silently returned empty results. The system appeared to function — no errors, normal latency — but answers were hallucinated.
| Scenario | Symptom | Root cause |
|---|---|---|
| Dimension mismatch | Empty retrieval, random results | Different embedding model used for ingestion vs query |
| Model version mismatch | Declining NDCG after model update | Stored embeddings incompatible with new query embeddings |
| Distance metric mismatch | Scores outside expected range | Collection created with L2 distance, queried with cosine |
The rule: one embedding model, one distance metric, specified in configuration, referenced by every component that touches embeddings. Never hardcode embedding model names in individual scripts.
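One way to enforce the rule is a single frozen configuration object that every ingestion script and retrieval client imports. The class and instance names below are illustrative; the model name and dimension match the EquipmentIQ configuration described later in this guide.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingConfig:
    """One embedding model, one dimension, one distance metric for the whole system."""
    model_name: str = "text-embedding-3-small"
    dimensions: int = 1536
    distance_metric: str = "cosine"

# Every component that touches embeddings imports this single instance
# instead of hardcoding its own model name.
EMBEDDING_CONFIG = EmbeddingConfig()
```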
Phase 3 — Pre-Deployment Gates
Every change that reaches production — new prompt, new chunk size, new embedding model, new documents added, any code change — must pass a set of numerical gates before deployment proceeds.
The Gate Framework
| Gate ID | Metric | Target | What it catches | Failure consequence |
|---|---|---|---|---|
| AC-001 | NDCG@5 per collection | ≥ 0.70 | Retrieval degradation from any cause | Deployment blocked |
| AC-002 | Hit Rate@5 per collection | ≥ 0.85 | Retrieval coverage loss | Deployment blocked |
| AC-003 | Generation Faithfulness | ≥ 0.80 | Synthesis hallucination regression | Deployment blocked |
| AC-004 | Routing accuracy | ≥ 95% on 40 queries | Intent classifier regression | Deployment blocked |
| AC-005 | P95 latency (single agent) | ≤ 10 seconds | Performance regression | Deployment blocked |
| AC-006 | P95 latency (cross-domain) | ≤ 20 seconds | Parallel retrieval overhead | Advisory |
| AC-007 | Domain Q&A coverage | ≥ 90% per domain | Knowledge base coverage gaps | Advisory, investigate |
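In CI, the blocking gates can be enforced by a short script that reads the evaluation suite's output and exits non-zero on any failure. Only the thresholds below come from the table; the metric names and dictionary shape are assumptions for the sketch.

```python
import sys

# Blocking thresholds from the gate framework table (AC-001 to AC-004).
BLOCKING_GATES = {
    "ndcg_at_5": 0.70,
    "hit_rate_at_5": 0.85,
    "faithfulness": 0.80,
    "routing_accuracy": 0.95,
}

def check_gates(metrics: dict) -> bool:
    """Print every failing gate; return True only if all blocking gates pass."""
    ok = True
    for name, threshold in BLOCKING_GATES.items():
        value = metrics.get(name, 0.0)
        if value < threshold:
            print(f"GATE FAILED: {name} = {value:.2f} < {threshold:.2f}")
            ok = False
    return ok

if __name__ == "__main__":
    # In practice these values would be loaded from the evaluation suite's report.
    metrics = {"ndcg_at_5": 0.72, "hit_rate_at_5": 0.90,
               "faithfulness": 0.81, "routing_accuracy": 0.97}
    sys.exit(0 if check_gates(metrics) else 1)
```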
Gate Calibration Strategy
Gates set too high block legitimate deployments. Gates set too low allow regressions to reach production. The recommended calibration approach:
| System maturity | NDCG target | Faithfulness target | Routing target |
|---|---|---|---|
| Early development | ≥ 0.50 | ≥ 0.60 | ≥ 80% |
| Stable development | ≥ 0.70 | ≥ 0.80 | ≥ 95% |
| Production | ≥ 0.80 | ≥ 0.85 | ≥ 95% |
| Mature production | ≥ 0.90 | ≥ 0.90 | ≥ 97% |
Raise targets as the system matures. Do not start at maximum targets — this creates evaluation debt where teams route around gates rather than fixing underlying quality.
What Triggers a Gate Run
| Trigger | Gates that run | Rationale |
|---|---|---|
| Any prompt change | AC-001 to AC-004 | Prompts affect routing and synthesis quality |
| Chunk size / overlap change | AC-001, AC-002 | Chunking affects what retrieval can find |
| New documents added | AC-001, AC-002, AC-007 | New content may disturb existing retrieval |
| Embedding model change | AC-001 to AC-005 | Requires full collection rebuild |
| LLM model version change | AC-003, AC-004 | New models behave differently |
| Scheduled weekly | AC-001 to AC-007 | Catch silent degradation |
Phase 4 — Continuous Production Monitoring
Production monitoring detects when a system that was working stops working, before users notice. The challenge is cost: running full evaluation on every production query is prohibitively expensive. The solution is tiered monitoring.
The Tiered Monitoring Architecture
| Tier | Trigger | What runs | LLM calls | Approx. cost/run |
|---|---|---|---|---|
| Real-time | Every query | Latency per node, routing domain, chunk count, similarity scores | 0 | $0.00 |
| Sampled online | 10–15% of traffic | Faithfulness, answer relevance, LLM-as-Judge | 2 | ~$0.004 |
| Feedback-triggered | Every user feedback submission | Failure mode classification, metric correlation | 1 | ~$0.001 |
| Nightly batch | Daily scheduled | Full NDCG/MRR across golden set, drift detection | 4–8 | ~$0.02 |
| Weekly regression | Weekly scheduled | Full golden set evaluation vs prior week baseline | 10–20 | ~$0.10 |
| Deployment gate | Every change | Full golden set + latency + routing accuracy | 15–25 | ~$0.10 |
At this cost structure, a mature system running 1,000 production queries per day spends approximately $0.60–1.00 per day on evaluation infrastructure — less than 0.1% of typical LLM API spend.
Embedding Drift Detection
Embedding drift occurs when the distribution of production queries diverges from the distribution of indexed documents. This is not a code failure — it is a natural consequence of a system being used in ways its designers did not anticipate. If unchecked, it produces a gradual decline in NDCG that appears in metrics weeks after the underlying cause.
DRIFT DETECTION MECHANISM
At collection creation:
Compute mean embedding vector of all indexed documents (baseline centroid)
Save baseline to file
Nightly:
Compute current mean embedding vector of production query log
Measure cosine distance between current and baseline centroids
Distance < 0.10 → No action
Distance 0.10–0.15 → Log and monitor
Distance > 0.15 → Alert: review query distribution, consider re-indexing
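A sketch of the nightly check with NumPy, assuming the baseline centroid was computed and saved at collection creation. The file name and function name are illustrative.

```python
import numpy as np

def embedding_drift(baseline_centroid: np.ndarray, query_embeddings: np.ndarray) -> float:
    """Cosine distance between the stored baseline centroid and the centroid
    of recent production query embeddings (query_embeddings: shape [n, dim])."""
    current_centroid = query_embeddings.mean(axis=0)
    cosine_similarity = float(
        np.dot(baseline_centroid, current_centroid)
        / (np.linalg.norm(baseline_centroid) * np.linalg.norm(current_centroid))
    )
    return 1.0 - cosine_similarity

# Nightly check against the thresholds above:
# drift = embedding_drift(np.load("baseline_centroid.npy"), todays_query_embeddings)
# drift > 0.15 -> alert; 0.10 to 0.15 -> log and monitor; otherwise no action.
```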
Production Observability — What to Monitor
For each production query, a structured trace should capture the following signals:
| Signal | Where captured | What it indicates when anomalous |
|---|---|---|
| Classification domain | Routing node | Shift toward cross_domain → prompt drift |
| Classification confidence | Routing node | Declining confidence → ambiguity in new query types |
| Chunks retrieved per agent | Agent nodes | Count dropping → embedding drift or collection issue |
| Top similarity score | Agent nodes | Score declining → query distribution shift |
| Synthesis latency | Synthesis node | Rising → context window growing, reduce top-K |
| Total tokens (in + out) | Synthesis node | Rising → cost management trigger |
| Citations in answer | Synthesis node | Declining → synthesis prompt regression |
| Answer length | Output | Declining sharply → INSUFFICIENT_CONTEXT pattern |
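These signals map naturally onto one structured trace record per query. The field names below are illustrative, not the project's actual schema.

```python
from dataclasses import dataclass, field

@dataclass
class QueryTrace:
    """Per-query production trace covering the signals in the table above."""
    query_id: str
    domain: str                    # classification domain from the routing node
    confidence: float              # classification confidence
    chunks_per_agent: dict = field(default_factory=dict)  # agent name -> chunk count
    top_similarity: float = 0.0    # best similarity score across agents
    synthesis_latency_ms: float = 0.0
    total_tokens: int = 0          # input + output tokens
    citation_count: int = 0
    answer_length: int = 0
```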
4. The Metrics Reference
Retrieval Metrics
| Metric | Full name | Formula | Range | Interpretation |
|---|---|---|---|---|
| NDCG@K | Normalised Discounted Cumulative Gain | DCG / IDCG, where DCG = Σ rel(r) / log₂(r+1) | [0, 1] | Measures ranking quality — rewards relevant documents at top positions. 1.0 = perfect ranking. |
| Hit Rate@K | Hit Rate at K | 1 if any expected doc in top-K, else 0 | [0, 1] | Binary — did the system find at least one relevant document? |
| MRR | Mean Reciprocal Rank | Mean of 1/rank(first relevant doc) | [0, 1] | Measures how high the first correct answer appears. 1.0 = always rank 1. |
| Routing Accuracy | — | Correct routings / total queries | [0, 1] | Fraction of queries sent to the correct agent. |
Generation Metrics
| Metric | What it measures | Interpretation |
|---|---|---|
| Faithfulness | Are factual claims in the answer grounded in retrieved context? | Low = hallucination risk. High = answer is evidence-based. |
| Answer Relevance | Does the answer directly address the question? | Low = topic drift or incomplete answers. |
| LLM-as-Judge (1–5) | Overall answer quality on a structured rubric | 4+ = acceptable for production. Below 3 = systematic prompt issue. |
| Context Precision | What fraction of retrieved chunks were used in the answer? | Low = retrieval returning noise. High = tight retrieval-synthesis alignment. |
The Faithfulness Contradiction Pattern
A critical diagnostic signal: when faithfulness and LLM-as-Judge scores contradict each other, the cause is almost always a bug in the evaluation pipeline, not in the system.
| Faithfulness | LLM Judge | Diagnosis | Action |
|---|---|---|---|
| High (≥ 0.80) | High (≥ 4.0) | System working correctly | No action |
| Low (< 0.40) | High (≥ 4.0) | Context not reaching faithfulness scorer | Debug context passing in evaluation code |
| High (≥ 0.80) | Low (< 3.0) | Answer is grounded but incomplete or poorly structured | Improve synthesis prompt |
| Low (< 0.40) | Low (< 3.0) | Both retrieval and synthesis have issues | Fix retrieval first, then synthesis |
System Health Metrics
| Metric | Normal range | Alert threshold | Response |
|---|---|---|---|
| Embedding drift (cosine distance) | 0.00 – 0.10 | > 0.15 | Review query distribution, update baselines |
| P95 latency — single agent | < 5 seconds | > 10 seconds | Reduce top-K, check collection size |
| P95 latency — cross-domain | < 10 seconds | > 20 seconds | Parallelism degraded, check async execution |
| Routing accuracy | > 97% | < 95% | Prompt regression, run routing test suite |
| Discordant feedback rate | < 10% | > 20% | Automated metrics miscalibrated |
| INSUFFICIENT_CONTEXT rate | < 5% | > 15% | Knowledge base coverage gap |
5. The Human Feedback Loop
Automated metrics scale to 100% of production traffic. They cannot capture whether an answer was actually useful for a specific user’s situation. Human feedback provides the ground truth signal that calibrates whether automated metrics are measuring the right things.
The Four-Stage Feedback Pipeline
| Stage 1: Capture Signal | Stage 2: Extract | Stage 3: Correlate with Metrics | Stage 4: Grow Golden Set |
|---|---|---|---|
| Thumbs up/down + free text | LLM classifies failure mode from free text | Compare human vs automated scores | Negative feedback with known answer → new golden entry |
Stage 2 — Failure Mode Classification
Raw free-text feedback is noisy and inconsistent. Structuring it into classified failure modes enables systematic diagnosis.
| Failure mode | Description | System implication |
|---|---|---|
| wrong_answer | Answer is factually incorrect | Retrieval returning wrong content, or synthesis hallucinating |
| incomplete | Answer addresses part of the question but misses key information | Context window too small, or relevant chunks below similarity floor |
| hallucinated | Answer contains information not in retrieved context | Synthesis prompt not enforcing grounding |
| out_of_scope | System answered a question it should have declined | Similarity floor too low, accepting irrelevant chunks |
| correct | Answer is accurate and complete | System functioning as intended |
Stage 3 — Metric Calibration via Discordant Cases
| Human rating | Automated score | Label | Meaning |
|---|---|---|---|
| Positive | High faithfulness | Concordant positive | System working, metrics calibrated |
| Negative | Low faithfulness | Concordant negative | System failing, metrics correctly detecting |
| Negative | High faithfulness | Discordant | Metrics not measuring what matters to users |
| Positive | Low faithfulness | Discordant | Metrics measuring something wrong |
A discordant rate above 20% signals that automated metrics need recalibration. The most common cause: faithfulness measures whether answers are grounded in context, but users care whether answers are complete and actionable. A grounded-but-incomplete answer scores high on faithfulness but low on user satisfaction.
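The discordant rate itself is a simple aggregate over paired human and automated signals. The record shape and the threshold default below are illustrative.

```python
def discordant_rate(records: list[tuple[bool, float]],
                    faithfulness_threshold: float = 0.80) -> float:
    """Fraction of feedback records where the human rating (positive/negative)
    disagrees with the automated faithfulness score (high/low)."""
    if not records:
        return 0.0
    discordant = sum(
        1 for human_positive, faithfulness in records
        if human_positive != (faithfulness >= faithfulness_threshold)
    )
    return discordant / len(records)

# A rate above 0.20 means the automated metrics need recalibration.
```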
The Feedback-Evaluation Relationship
Human feedback calibrates automated metrics. Automated metrics scale human judgment. Neither replaces the other.
6. Failure Mode Taxonomy
This table documents every failure mode encountered in a production multi-agent RAG system, its root cause, and the correct fix. The most important column is “incorrect fix” — what teams do when they address the symptom rather than the cause.
| Symptom | Root cause | Correct fix | Incorrect fix | Risk of incorrect fix |
|---|---|---|---|---|
| NDCG = 0.00 across all collections | Golden set IDs don’t match live collection IDs | Rebuild golden set by querying live collections | Re-ingest all collections | Ingestion works fine — wasted effort, IDs still wrong |
| NDCG > 1.00 | DCG formula uses 1/(rank+1) instead of 1/log₂(rank+1) | Correct formula and add clamp | Lower evaluation targets | Metrics permanently untrustworthy |
| Faithfulness = 0.00, LLM Judge = 4+ | Context not passed to faithfulness scorer (code bug) | Debug context passing in evaluation pipeline | Tune synthesis prompt | Synthesis prompt is not the problem |
| Support queries route to cross_domain | Intent classifier has insufficient support-domain examples | Add examples + categorical rule to prompt | Lower confidence threshold | Ambiguous queries now route incorrectly to single agents |
| Software domain scores 30% on coverage test | Parameter/MID queries have vocabulary that sounds mechanical | Add software examples for parameter IDs, MID numbers | Increase top-K retrieval | Retrieval is not the problem — routing is |
| Empty retrieval despite populated collection | Embedding dimension mismatch (ingestion vs query) | Standardise embedding model across all components | Add more documents | Adding documents with the same mismatch fixes nothing |
| Retrieval quality degrades after prompt update | New examples broke disambiguation of edge cases | Narrow example scope; add routing regression tests | Revert all prompt changes | Routing regression becomes permanent risk |
| Knowledge base content gap | Specific topic documented in source but not indexed | Add supplementary content file; re-ingest | Tune retrieval parameters | Parameters cannot retrieve content that was never ingested |
| Correct domain, INSUFFICIENT_CONTEXT answer | Query phrasing semantically distant from indexed text | Add synonymous phrasing to supplementary content | Lower similarity floor | Lower floor adds irrelevant chunks without helping |
| Good NDCG, poor user satisfaction | System answers question asked, not question intended | Review query intent; improve synthesis prompt | Tune retrieval | Retrieval is correct — intent interpretation is the issue |
7. The Live Debugging Cycle
When evaluation reveals a mismatch, the debugging process must follow a strict order. Every step skipped creates new problems while appearing to fix the current one.
The Six-Step Cycle
| Step | Action | Output | Common mistake |
|---|---|---|---|
| 1. Observe | Document exactly what the system did vs what it should have done | Precise, falsifiable problem statement | Jumping to a fix before fully characterising the problem |
| 2. Localise | Test each layer independently to find where the failure originates | Layer identified: routing / retrieval / synthesis / evaluation | Testing the wrong layer and concluding no problem |
| 3. Fix | Make exactly one change targeting the identified root cause | One changed file, one changed configuration value | Making multiple changes simultaneously |
| 4. Verify | Run the specific failing query or test | Pass/fail on the originally failing case | Declaring success without running the specific case |
| 5. Regression | Run the full golden set and coverage test | Confirmation that no previously-passing tests broke | Skipping regression and shipping a fix that breaks other tests |
| 6. Document | Update golden set, change log, and repository history | Permanent record of root cause, fix, and verification | Undocumented fix that is silently broken six months later |
Localisation: Testing Each Layer Independently
When an answer is wrong, the first task is identifying which layer produced the failure. Each layer answers one specific diagnostic question:
| Layer | Diagnostic question | Evidence of layer failure |
|---|---|---|
| Routing | Was the query sent to the correct agent? | Wrong domain in routing output, or cross_domain with low confidence |
| Retrieval | Did the correct agent return relevant chunks? | Zero chunks, wrong source documents, or similarity scores below 0.15 |
| Synthesis | Did the LLM use the retrieved context faithfully? | Answer contains facts not in any retrieved chunk |
| Evaluation | Is the metric itself computing correctly? | NDCG > 1.0, faithfulness = 0.0 everywhere, golden set IDs mismatched |
This layered approach was critical in diagnosing the Zone C routing failure in EquipmentIQ. The query “What error codes are triggered in Zone C?” was routed to the software agent at 85% confidence and returned INSUFFICIENT_CONTEXT. The layer-by-layer diagnosis, fix, and verification took 25 minutes:
| Step | Finding |
|---|---|
| Observe | Domain = software, confidence = 85%, answer = INSUFFICIENT_CONTEXT |
| Localise (routing) | Classifier reasoning: “query asks about error codes → software” — wrong |
| Localise (retrieval) | Software collection has no ISO zone documentation — retrieval correctly empty |
| Root cause | Vocabulary collision: “error codes” mapped to software, but ISO 10816-3 zones live in mechanical PDFs |
| Fix | Added 3 examples + 1 disambiguation rule to intent classification prompt |
| Verify | Query now routes to mechanical at 87% confidence with correct answer |
| Regression | All 30 golden set entries still passing — no regressions |
The One-Change Rule
Making multiple changes in a single debugging cycle is the most common cause of evaluation regressions that cannot be explained. When the next evaluation run shows a different failure, there is no way to know which change caused it.
| Scenario | What happened | Result |
|---|---|---|
| One change, problem fixed | Root cause identified and resolved | Trustworthy fix — can be documented and explained |
| One change, problem persists | Change did not address root cause | Clear signal to re-localise at the correct layer |
| Two changes, problem fixed | Either change A or B fixed it — unknown which | False confidence; the other change may have introduced a regression |
| Two changes, problem persists | Unknown interaction | Starting over from scratch is faster than debugging the combination |
8. Domain Coverage Testing
The golden set measures retrieval precision on specific known queries. It does not measure whether the system can answer the full breadth of questions users will actually ask. Domain coverage testing fills this gap.
Golden Set vs Coverage Testing
| Dimension | Golden set evaluation | Domain coverage testing |
|---|---|---|
| Purpose | Verify retrieval quality on known queries | Verify the system handles the full query space |
| Query source | Carefully curated, verified ground truth | Representative sample of real user queries |
| Pass criterion | NDCG ≥ threshold | Answer contains expected key terms |
| Frequency | Every deployment (gate) | After major content or routing changes |
| Failure action | Block deployment | Diagnose failure type, fix root cause |
| Scale | 30–90 entries | 20+ per domain (80+ for 3-agent systems) |
A system can achieve NDCG = 1.00 on its golden set and still fail 30% of domain coverage tests — because the golden set only covers the topics it was designed around.
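The “answer contains expected key terms” pass criterion can be implemented as a simple case-insensitive containment check. The function and the case structure below are illustrative; the expected terms would be drawn from the authoritative source document, not invented.

```python
def coverage_pass(answer: str, expected_terms: list[str]) -> bool:
    """A coverage query passes when every expected key term appears in the answer."""
    answer_lower = answer.lower()
    return all(term.lower() in answer_lower for term in expected_terms)

# Illustrative coverage case (query topic taken from the mechanical test below).
case = {"query": "What is the Zone D RMS range and action?",
        "expected_terms": ["Zone D"]}
# passed = coverage_pass(system_answer, case["expected_terms"])
```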
The Four Coverage Failure Types
| Failure type | Description | How to diagnose | Fix |
|---|---|---|---|
| Routing failure | Query goes to wrong agent | Check routing output for correct domain | Add examples to intent classifier prompt |
| Retrieval failure (empty) | Correct agent, zero chunks returned | Inspect similarity scores; check embedding consistency | Fix embedding pipeline; lower similarity floor |
| Content gap | Query topic not in knowledge base | Check source documents for coverage | Add supplementary content; re-ingest |
| Semantic mismatch | Content exists but query phrasing doesn’t match indexed text | Run direct collection query; compare phrasing | Add synonymous phrasing to supplementary content |
EquipmentIQ Coverage Test Results
Two coverage tests were run against the EquipmentIQ system — a 20-query mechanical domain test and an 80-query full-system test.
Test 1: 20-Query Mechanical Domain (DOC-EIQ-005 Vibration Monitoring)
| # | Query topic | Result | Failure type |
|---|---|---|---|
| 1 | Vibration classification standard | ✅ PASS | — |
| 2 | ISO 10816-3 machine group | ✅ PASS | — |
| 3 | Zone D RMS range and action | ✅ PASS | — |
| 4 | Error codes triggered in Zone C | ✅ PASS | — |
| 5 | Zone B vs Zone B Upper | ❌ FAIL | Content gap — “Zone B Upper” not explicitly documented |
| 6 | Statistical features per axis | ✅ PASS | — |
| 7 | Best early-fault indicator | ✅ PASS | — |
| 8 | Kurtosis alarm threshold | ✅ PASS | — |
| 9 | Crest Factor formula | ❌ FAIL | Semantic mismatch — formula exists but phrasing differs |
| 10 | Purpose of Mean feature | ✅ PASS | — |
| 11 | Fault category for tooth-pass frequency | ✅ PASS | — |
| 12 | Spindle bearing fault parameters | ✅ PASS | — |
| 13 | Operations affected by tool wear | ✅ PASS | — |
| 14 | Actuator fault indicators | ❌ FAIL | Routing failure — routed to cross_domain |
| 15 | Process anomaly signal | ✅ PASS | — |
| 16 | Normal range for P064 | ✅ PASS | — |
| 17 | VIB-SR-001 required action | ❌ FAIL | Missing data — error code not ingested |
| 18 | Critical range for P004 | ✅ PASS | — |
| 19 | Bosch dataset size | ❌ FAIL | Routing failure — dataset stats routed to cross_domain |
| 20 | Normal vs fault sample breakdown | ❌ FAIL | Routing failure — same cause as #19 |
Initial score: 14/20 (70%). After five targeted fixes: 20/20 (100%).
Test 2: 80-Query Full-System Validation
| Domain | Initial score | After Tier 1 fixes | Target |
|---|---|---|---|
| Mechanical | 16/20 (80%) | 17/20 (85%) | ≥ 90% |
| Software | 6/20 (30%) | 14/20 (70%) | ≥ 90% |
| Support | 14/20 (70%) | 15/20 (75%) | ≥ 90% |
| Cross-domain | 20/20 (100%) | 20/20 (100%) | ≥ 95% |
| Overall | 56/80 (70%) | 66/80 (83%) | ≥ 90% |
Key diagnostic insight from the 80-query test: Cross-domain (parallel retrieval across all three agents) scored 100% while software single-agent scored 30%. This pattern is diagnostic — it means the knowledge base contains the right content, but the routing layer is preventing the software agent from being called for software-domain queries. The fix is in the intent classifier, not in retrieval or the knowledge base.
The Coverage Test as a Living Asset
Coverage tests must be re-run after:
| Change | Why re-run |
|---|---|
| New documents added to any collection | New content may conflict with existing retrieval, or queries about new content may fail |
| Intent prompt updated | New examples may break disambiguation of previously-correct queries |
| Chunk size or overlap changed | Different chunking distributes content differently, affecting semantic matches |
| Similarity floor adjusted | Threshold changes affect which queries return chunks vs INSUFFICIENT_CONTEXT |
| New query patterns observed in production logs | Emerging usage patterns may reveal new coverage gaps |
9. Key Principles for Technical Leaders
These principles distil the lessons from building, breaking, and repairing a production multi-agent RAG system. They are relevant for anyone making architectural, resourcing, or quality decisions about agentic AI systems.
Principle 1 — Evaluation Defines Correctness
You cannot verify a system works by looking at it. In agentic systems, a system producing answers with normal latency and no errors can simultaneously be returning hallucinated content with 0.00 retrieval quality. The only way to know is to measure. Evaluation is not a testing activity — it is the engineering definition of what the system is supposed to do.
Principle 2 — Layer Order is Non-Negotiable
Always diagnose and fix in layer order: routing → retrieval → generation → system health. Generation metrics are meaningless when retrieval is broken. A team that tunes the synthesis prompt while NDCG is 0.00 is not making the system better — it is adding noise on top of a broken foundation.
Principle 3 — Ground Truth Must Be Independent of the System
A golden set built by observing the system’s output is not ground truth — it is a description of the system’s current behaviour. True ground truth requires knowing the correct answer from an authoritative source before the system runs. Golden sets built after the fact measure conformance to past behaviour, not correctness.
Principle 4 — Automated Metrics and Human Feedback Serve Different Purposes
| Automated metrics | Human feedback |
|---|---|
| Scale to 100% of traffic | Capture real-world usefulness |
| Consistent and reproducible | Noisy but authentic |
| Fast and cheap | Slow and expensive |
| Measure proxy signals | Measure actual user value |
| Can be fooled by metric gaming | Cannot be fooled |
| Tell you what changed | Tell you what matters |
Run both. When they disagree, human feedback is right and the automated metric needs recalibration.
Principle 5 — The Confidence Threshold is Not a Quality Control
A routing confidence threshold determines how much ambiguity the system tolerates before routing to parallel retrieval. It is not a mechanism for correcting routing errors. Lowering the threshold to fix a routing failure shifts ambiguous queries to arbitrary single agents — it does not fix the classifier. Every routing fix must happen in the classifier examples, not in the threshold parameter.
Principle 6 — Evaluation Infrastructure is Production Code
Evaluation code has the same failure modes as system code. The NDCG formula bug (returning 1.28) gave false confidence about retrieval quality for an entire development sprint. Evaluation scripts need the same engineering rigour as the system they evaluate: unit tests, version control, code review, and change management.
Principle 7 — Coverage Debt Accumulates Silently
A system with NDCG = 1.00 on its 30-entry golden set can have 30% of real user queries failing — because the golden set was built around the topics the team thought about, not the topics users actually ask about. Domain coverage tests, run regularly and grown organically from production failures, are the mechanism for keeping evaluation aligned with real usage.
10. Real-World Results: EquipmentIQ Benchmark
EquipmentIQ is a three-agent RAG system for CNC machinery predictive maintenance. The following results reflect the system state at the close of Sprint 3 (evaluation pipeline complete) and the subsequent coverage testing phase.
System Architecture
| Component | Specification |
|---|---|
| Orchestration | LangGraph StateGraph, 8 nodes, conditional routing |
| Agents | 3 specialised (mechanical, software/error codes, customer support) |
| Vector store | 3 isolated ChromaDB collections |
| Embedding model | OpenAI text-embedding-3-small (1536 dimensions, cosine distance) |
| Reranker | cross-encoder/ms-marco-MiniLM-L-6-v2 (top-8 → top-5) |
| Synthesis model | Anthropic Claude (Haiku for evaluation, Sonnet for production) |
| Confidence threshold | 0.75 (cross-domain below this) |
| Similarity floor | 0.15 (chunks below this score rejected) |
Knowledge Base
| Collection | Source | Documents | Content type |
|---|---|---|---|
| mechanical_collection | 6 technical PDFs + supplementary | 66 chunks | Machine specs, wiring, maintenance, ISO standards |
| software_collection | 96 error code JSON files | 96 documents | Error codes, severity levels, parameters, diagnostics |
| support_collection | 150 complaint records CSV | 150 documents | Phone notes, investigation notes, RMA data, remedies |
Retrieval Evaluation Results (Sprint 3 Close)
| Collection | NDCG@5 | Hit Rate@5 | MRR | Gate |
|---|---|---|---|---|
| mechanical_collection | 1.00 | 1.00 | 1.00 | ✅ PASS (≥ 0.70) |
| software_collection | 1.00 | 1.00 | 1.00 | ✅ PASS (≥ 0.70) |
| support_collection | 1.00 | 1.00 | 1.00 | ✅ PASS (≥ 0.70) |
| Embedding drift | 0.00 (baseline) | — | — | ✅ OK (< 0.15) |
Domain Coverage Test Results
| Domain | Queries tested | Initial pass | After fixes | Fix types applied |
|---|---|---|---|---|
| Mechanical | 20 | 14/20 (70%) | 20/20 (100%) | 3 routing, 1 content gap, 1 ingestion |
| Software | 20 | 6/20 (30%) | 14/20 (70%) | 6 routing, ongoing |
| Support | 20 | 14/20 (70%) | 15/20 (75%) | 2 routing, 4 transient errors |
| Cross-domain | 20 | 20/20 (100%) | 20/20 (100%) | None required |
| Total | 80 | 54/80 (67%) | 69/80 (86%) | — |
Issues Encountered and Resolved
| Issue | Symptom | Root cause | Resolution |
|---|---|---|---|
| NDCG = 0.00 everywhere | All golden set queries scored zero | Golden set used short IDs; collection stored full filename IDs | Rebuilt golden set by querying live collections |
| NDCG > 1.00 | Score of 1.28 returned | DCG formula used 1/(rank+1) instead of 1/log₂(rank+1) | Corrected formula, added clamp |
| Faithfulness = 0.015 | Near-zero despite accurate answers | Faithfulness scorer received empty context list | Fixed context extraction in evaluation pipeline |
| Embedding mismatch | Empty retrieval across all collections | Ingestion used 1536-dim model; retrieval defaulted to 384-dim | Standardised embedding client across all components |
| Support NDCG = 0.27 | Support queries retrieving wrong content | Support queries routed to cross_domain (confidence 0.65–0.72) | Added complaint-specific examples to intent classifier |
| Software coverage 30% | Parameter/MID queries routed to mechanical | Classifier lacked software examples for these query patterns | Added 9 software examples + disambiguation rule |
| Zone C routing failure | “Error codes in Zone C” → software agent | “error codes” keyword overrode vibration context | Added mechanical examples for ISO zone queries |
| Zone B Upper content gap | No chunks matched Zone B Upper threshold | Specific sub-zone not explicitly named in source PDFs | Created supplementary content file with explicit zone table |
Test Infrastructure
| Component | Count | Coverage |
|---|---|---|
| Unit tests | 91 | Config, ingestion, agents, orchestrator, evaluation, feedback |
| Integration tests | 12 | End-to-end with live API calls |
| Golden set entries | 37 | 10+ per agent, grown from failures |
| Coverage test queries | 80 | 20 per domain |
| Evaluation tiers | 6 | Real-time through weekly regression |
Appendix A — Metric Quick Reference
| Metric | Layer | Formula summary | Target | Alert threshold |
|---|---|---|---|---|
| NDCG@5 | Retrieval | Σ rel(r)/log₂(r+1) normalised | ≥ 0.70 | < 0.50 |
| Hit Rate@5 | Retrieval | Any relevant in top-5 | ≥ 0.85 | < 0.70 |
| MRR | Retrieval | Mean 1/rank(first relevant) | ≥ 0.60 | < 0.40 |
| Routing accuracy | Retrieval | Correct routings / total | ≥ 95% | < 90% |
| Faithfulness | Generation | Claims grounded in context | ≥ 0.80 | < 0.60 |
| Answer relevance | Generation | Answer addresses question | ≥ 0.75 | < 0.55 |
| LLM Judge | Generation | 1–5 rubric score | ≥ 3.5/5 | < 2.5/5 |
| Embedding drift | System | Cosine distance vs baseline | < 0.10 | > 0.15 |
| P95 latency (single) | System | 95th percentile ms | < 10,000 ms | > 15,000 ms |
| Domain coverage | Coverage | Pass rate on domain Q&A test | ≥ 90% | < 75% |
| Discordant rate | Feedback | Human-metric disagreement % | < 10% | > 20% |
Appendix B — Diagnostic Decision Tree
WRONG OR EMPTY ANSWER
│
├─── Step 1: Was the query routed to the correct agent?
│ │
│ ├─── NO → Routing failure
│ │ Fix: Add examples to intent classifier prompt
│ │ Do NOT: Lower confidence threshold
│ │
│ └─── YES → Continue to Step 2
│
├─── Step 2: Did the correct agent retrieve relevant chunks?
│ │
│ ├─── ZERO CHUNKS → Check embedding consistency and similarity floor
│ │ Check embedding dimensions match (ingestion vs query)
│ │
│ ├─── WRONG CHUNKS → Check golden set IDs match live collection
│ │ Check if content exists in knowledge base at all
│ │
│ └─── RIGHT CHUNKS → Continue to Step 3
│
├─── Step 3: Did synthesis use the retrieved context?
│ │
│ ├─── CONTEXT IGNORED → Tighten synthesis prompt grounding instruction
│ │
│ ├─── HALLUCINATED → Verify context is passed correctly to synthesiser
│ │
│ └─── INSUFFICIENT_CONTEXT → Content gap: add supplementary content
│
└─── Step 4: Is the evaluation metric itself correct?
│
├─── NDCG > 1.0 → Formula bug: use log₂(rank+1) not (rank+1)
│
├─── Faithfulness = 0.0 everywhere → Debug context argument to scorer
│
└─── All metrics pass but answers wrong → Fix golden set IDs
Mohcine Madkour, PhD — Senior AI/ML Engineer & Architect
EquipmentIQ Project — Multi-Agent RAG for Industrial Predictive Maintenance
Bosch CNC Machining Dataset (CC-BY-4.0) — Tnani et al., Procedia CIRP 2022, 107, 131–136