<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://mohcinemadkour.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://mohcinemadkour.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-25T17:18:32+00:00</updated><id>https://mohcinemadkour.github.io/feed.xml</id><title type="html">Mohcine Madkour</title><subtitle>Senior AI/ML Engineer &amp; Architect | PhD in Computer Science | LangGraph · LangSmith · RAG · MLOps · Healthcare AI </subtitle><entry><title type="html">Why Most Agentic AI Systems Fail in Production (And How to Fix That)</title><link href="https://mohcinemadkour.github.io/writing/2026/agentic-ai-evaluation-intro/" rel="alternate" type="text/html" title="Why Most Agentic AI Systems Fail in Production (And How to Fix That)"/><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://mohcinemadkour.github.io/writing/2026/agentic-ai-evaluation-intro</id><content type="html" xml:base="https://mohcinemadkour.github.io/writing/2026/agentic-ai-evaluation-intro/"><![CDATA[<p>Most teams ship agentic AI systems without knowing if they actually work.</p> <p>They run a few manual tests, see the demo produce plausible outputs, and call it production-ready. Then the hallucinations start. Retrieval degrades silently. The agent loops. Users lose trust.</p> <p>The problem isn’t the LLM. The problem is the absence of engineering discipline around evaluation, monitoring, and continuous improvement.</p> <h2 id="the-evaluation-pyramid">The Evaluation Pyramid</h2> <p>I think about agentic AI quality in four layers:</p> <ol> <li><strong>Retrieval quality</strong> — Is the vector store returning relevant documents? (Hit Rate@K, MRR, NDCG)</li> <li><strong>Generation quality</strong> — Is the LLM using retrieved context faithfully? (RAGAS Faithfulness, Answer Relevancy)</li> <li><strong>System-level quality</strong> — Does the full pipeline meet latency and cost requirements?</li> <li><strong>Human quality</strong> — Do real users find the outputs useful and trustworthy?</li> </ol> <p>Most teams only operate at layer 4 — they notice problems after users complain. By then, trust is already damaged.</p> <p>The goal of a production evaluation framework is to catch problems at layers 1–3 automatically, before they reach users.</p> <h2 id="retrieval-where-it-usually-breaks-first">Retrieval: Where It Usually Breaks First</h2> <p>Retrieval failures are insidious because they’re invisible at the application layer. The LLM still produces fluent, confident-sounding output — it just doesn’t have the right information to work with.</p> <p>The fix is systematic retrieval evaluation:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">ragas.metrics</span> <span class="kn">import</span> <span class="n">ContextPrecision</span><span class="p">,</span> <span class="n">ContextRecall</span>
<span class="kn">from</span> <span class="n">langsmith</span> <span class="kn">import</span> <span class="n">Client</span>

<span class="c1"># Evaluate retrieval on a test set
</span><span class="n">results</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span>
    <span class="n">dataset</span><span class="o">=</span><span class="n">test_questions</span><span class="p">,</span>
    <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="nc">ContextPrecision</span><span class="p">(),</span> <span class="nc">ContextRecall</span><span class="p">()],</span>
    <span class="n">llm</span><span class="o">=</span><span class="n">your_llm</span><span class="p">,</span>
    <span class="n">embeddings</span><span class="o">=</span><span class="n">your_embeddings</span>
<span class="p">)</span>
</code></pre></div></div> <p>Track these metrics over time in LangSmith. When they drop — and they will drop, as your data and query distribution shift — you’ll know before users do.</p> <h2 id="generation-the-llm-as-judge-pattern">Generation: The LLM-as-Judge Pattern</h2> <p>For generation quality, automated LLM-as-Judge pipelines are the current best practice. The key is using a separate, stronger model as evaluator, and prompt-engineering the rubric carefully to minimize self-serving bias.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">ragas.metrics</span> <span class="kn">import</span> <span class="n">Faithfulness</span><span class="p">,</span> <span class="n">AnswerRelevancy</span>

<span class="c1"># Is the answer grounded in the retrieved context?
# Does it actually answer what was asked?
</span><span class="n">generation_metrics</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span>
    <span class="n">dataset</span><span class="o">=</span><span class="n">qa_pairs_with_context</span><span class="p">,</span>
    <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="nc">Faithfulness</span><span class="p">(),</span> <span class="nc">AnswerRelevancy</span><span class="p">()]</span>
<span class="p">)</span>
</code></pre></div></div> <p>Faithfulness below 0.85 is a red flag. It means your LLM is hallucinating — generating claims not supported by the retrieved context.</p> <h2 id="cicd-quality-gates">CI/CD Quality Gates</h2> <p>The highest-leverage intervention is blocking production deployments when evaluation metrics fail:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># .github/workflows/eval.yml</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Run evaluation suite</span>
  <span class="na">run</span><span class="pi">:</span> <span class="s">python evaluate.py --dataset test_set.json</span>

<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Enforce quality gates</span>
  <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">python -c "</span>
    <span class="s">import json</span>
    <span class="s">results = json.load(open('eval_results.json'))</span>
    <span class="s">assert results['hit_rate_5'] &gt; 0.80, 'Retrieval Hit Rate@5 below threshold'</span>
    <span class="s">assert results['faithfulness'] &gt; 0.85, 'Faithfulness below threshold'</span>
    <span class="s">print('All quality gates passed')</span>
    <span class="s">"</span>
</code></pre></div></div> <p>This turns evaluation from a manual ritual into an automated guardrail.</p> <h2 id="whats-next">What’s Next</h2> <p>In the coming posts, I’ll go deeper on each layer of the evaluation pyramid — with code, real results, and the lessons learned deploying these systems in healthcare and industrial settings.</p> <p>If you’re building agentic AI systems and want to talk through your evaluation strategy, <a href="https://linkedin.com/in/mohcine-madkour-83a642b2/">reach out on LinkedIn</a>.</p>]]></content><author><name></name></author><category term="writing"/><category term="agentic-ai"/><category term="RAG"/><category term="LangGraph"/><category term="LangSmith"/><category term="MLOps"/><summary type="html"><![CDATA[Agentic AI demos look great. Production deployments rarely do. Here's the engineering discipline that bridges the gap.]]></summary></entry><entry><title type="html">RAG Retrieval Metrics You Should Actually Be Tracking</title><link href="https://mohcinemadkour.github.io/writing/2026/rag-retrieval-metrics/" rel="alternate" type="text/html" title="RAG Retrieval Metrics You Should Actually Be Tracking"/><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://mohcinemadkour.github.io/writing/2026/rag-retrieval-metrics</id><content type="html" xml:base="https://mohcinemadkour.github.io/writing/2026/rag-retrieval-metrics/"><![CDATA[<p>If you’re building a RAG system and not tracking retrieval metrics, you’re flying blind.</p> <p>Most teams measure answer quality — they ask “does the LLM give a good response?” But answer quality is a lagging indicator. Retrieval quality is the leading indicator, and it’s where the problems start.</p> <p>Here are the three retrieval metrics I use on every RAG project, and what each one tells you.</p> <h2 id="hit-ratek">Hit Rate@K</h2> <p><strong>What it measures:</strong> Does the correct document appear anywhere in the top-K retrieved results?</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hit_rate_at_k</span><span class="p">(</span><span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">relevant_doc_id</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
    <span class="n">top_k_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="nb">id</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">retrieved_docs</span><span class="p">[:</span><span class="n">k</span><span class="p">]]</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">relevant_doc_id</span> <span class="ow">in</span> <span class="n">top_k_ids</span> <span class="k">else</span> <span class="mi">0</span>

<span class="c1"># Average over your test set
</span><span class="n">hit_rate</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="nf">hit_rate_at_k</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">rel</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
               <span class="k">for</span> <span class="n">r</span><span class="p">,</span> <span class="n">rel</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">relevant_ids</span><span class="p">))</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div></div> <p><strong>When it matters most:</strong> When your LLM has a large context window and can synthesize across multiple retrieved chunks. If you’re passing top-5 to the LLM, Hit Rate@5 is your primary metric.</p> <p><strong>Typical target:</strong> &gt; 0.80 for a well-tuned system.</p> <h2 id="mrr-mean-reciprocal-rank">MRR (Mean Reciprocal Rank)</h2> <p><strong>What it measures:</strong> How high does the correct document rank? Being first is better than being fifth, even if both count as a “hit.”</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">reciprocal_rank</span><span class="p">(</span><span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">relevant_doc_id</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">doc</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">start</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">doc</span><span class="p">.</span><span class="nb">id</span> <span class="o">==</span> <span class="n">relevant_doc_id</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">rank</span>
    <span class="k">return</span> <span class="mi">0</span>

<span class="n">mrr</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="nf">reciprocal_rank</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">rel</span><span class="p">)</span>
          <span class="k">for</span> <span class="n">r</span><span class="p">,</span> <span class="n">rel</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">relevant_ids</span><span class="p">))</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div></div> <p><strong>When it matters most:</strong> When your LLM only uses the top-1 or top-2 results. If the correct document is ranked 5th, MRR penalizes that even if Hit Rate@5 counts it as a success.</p> <h2 id="ndcg-normalized-discounted-cumulative-gain">NDCG (Normalized Discounted Cumulative Gain)</h2> <p><strong>What it measures:</strong> Ranking quality across multiple relevant documents, with higher-ranked documents weighted more.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">sklearn.metrics</span> <span class="kn">import</span> <span class="n">ndcg_score</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># relevance_scores: list of [1, 0, 1, 0, 0] for each retrieved doc
</span><span class="n">ndcg</span> <span class="o">=</span> <span class="nf">ndcg_score</span><span class="p">(</span>
    <span class="n">y_true</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">([</span><span class="n">relevance_scores</span><span class="p">]),</span>
    <span class="n">y_score</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">([</span><span class="n">retrieval_scores</span><span class="p">])</span>
<span class="p">)</span>
</code></pre></div></div> <p><strong>When it matters most:</strong> When queries have multiple relevant documents (e.g., “summarize everything about X”) and you need to know if the most relevant ones rank highest.</p> <h2 id="putting-it-together-in-langsmith">Putting It Together in LangSmith</h2> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">langsmith</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="n">langsmith.evaluation</span> <span class="kn">import</span> <span class="n">evaluate</span>

<span class="k">def</span> <span class="nf">retrieval_evaluator</span><span class="p">(</span><span class="n">run</span><span class="p">,</span> <span class="n">example</span><span class="p">):</span>
    <span class="n">retrieved</span> <span class="o">=</span> <span class="n">run</span><span class="p">.</span><span class="n">outputs</span><span class="p">[</span><span class="sh">"</span><span class="s">retrieved_docs</span><span class="sh">"</span><span class="p">]</span>
    <span class="n">relevant</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="n">outputs</span><span class="p">[</span><span class="sh">"</span><span class="s">relevant_doc_ids</span><span class="sh">"</span><span class="p">]</span>

    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">hit_rate_5</span><span class="sh">"</span><span class="p">:</span> <span class="nf">hit_rate_at_k</span><span class="p">(</span><span class="n">retrieved</span><span class="p">,</span> <span class="n">relevant</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">mrr</span><span class="sh">"</span><span class="p">:</span> <span class="nf">reciprocal_rank</span><span class="p">(</span><span class="n">retrieved</span><span class="p">,</span> <span class="n">relevant</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
    <span class="p">}</span>

<span class="n">results</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span>
    <span class="n">retrieval_pipeline</span><span class="p">,</span>
    <span class="n">data</span><span class="o">=</span><span class="sh">"</span><span class="s">my-rag-test-dataset</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">evaluators</span><span class="o">=</span><span class="p">[</span><span class="n">retrieval_evaluator</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div> <p>Track these weekly. When Hit Rate@5 drops below 0.80, investigate: did your data change? Did query distribution shift? Did someone modify the chunking strategy?</p> <p>Retrieval metrics give you the early warning system that answer quality alone can’t.</p>]]></content><author><name></name></author><category term="writing"/><category term="RAG"/><category term="retrieval"/><category term="metrics"/><category term="LangSmith"/><category term="Python"/><summary type="html"><![CDATA[Hit Rate@K, MRR, NDCG — what they measure, when each one matters, and how to implement them for your RAG system.]]></summary></entry><entry><title type="html">MLOps Monitoring: The Tools Actually Worth Using in 2026</title><link href="https://mohcinemadkour.github.io/writing/2026/mlops-monitoring/" rel="alternate" type="text/html" title="MLOps Monitoring: The Tools Actually Worth Using in 2026"/><published>2026-02-15T00:00:00+00:00</published><updated>2026-02-15T00:00:00+00:00</updated><id>https://mohcinemadkour.github.io/writing/2026/mlops-monitoring</id><content type="html" xml:base="https://mohcinemadkour.github.io/writing/2026/mlops-monitoring/"><![CDATA[<p>Production ML monitoring is crowded with tools that promise everything and deliver complexity. Here’s what I’ve actually used across healthcare AI and industrial ML projects, and what each tool is genuinely good at.</p> <h2 id="the-four-tools-i-use">The Four Tools I Use</h2> <h3 id="evidently--best-for-classic-ml-drift">Evidently — Best for Classic ML Drift</h3> <p>If you have a traditional ML model (classification, regression) in production, Evidently is the most practical starting point. It generates readable HTML reports on data drift, target drift, and data quality without much setup.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">evidently.report</span> <span class="kn">import</span> <span class="n">Report</span>
<span class="kn">from</span> <span class="n">evidently.metric_preset</span> <span class="kn">import</span> <span class="n">DataDriftPreset</span>

<span class="n">report</span> <span class="o">=</span> <span class="nc">Report</span><span class="p">(</span><span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="nc">DataDriftPreset</span><span class="p">()])</span>
<span class="n">report</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="n">reference_data</span><span class="o">=</span><span class="n">train_df</span><span class="p">,</span> <span class="n">current_data</span><span class="o">=</span><span class="n">production_df</span><span class="p">)</span>
<span class="n">report</span><span class="p">.</span><span class="nf">save_html</span><span class="p">(</span><span class="sh">"</span><span class="s">drift_report.html</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div> <p>I’ve used this at Intuitive Surgical to monitor sensor feature distributions for the surgical robot predictive maintenance system. When a robot gets a firmware update that changes sensor calibration, Evidently catches the resulting distribution shift within days.</p> <p><strong>Limitation:</strong> Not built for LLMs or agentic systems. Text drift is possible but awkward.</p> <h3 id="langsmith--the-standard-for-llmrag-monitoring">LangSmith — The Standard for LLM/RAG Monitoring</h3> <p>If you’re building with LangChain or LangGraph, LangSmith is the obvious choice. It traces every step of your chain — retrieval, generation, tool calls — and stores structured run data that you can query and evaluate against.</p> <p>What makes it worth using: the combination of <strong>tracing + evaluation datasets + CI/CD integration</strong>. You can define evaluation rubrics, run them against your traces, and set up automated testing pipelines.</p> <p>I use LangSmith on every agentic AI project now. The ability to replay a failing trace and understand exactly which retrieval step went wrong is invaluable.</p> <h3 id="arize-phoenix--best-for-embedding-drift">Arize Phoenix — Best for Embedding Drift</h3> <p>When your vector store ages and queries stop matching well, the symptom is retrieval degradation. Arize Phoenix is the best tool I’ve found for visualizing embedding space drift and identifying which query clusters are underperforming.</p> <p>It also has UMAP visualizations of your embeddings over time — genuinely useful for diagnosing retrieval regression when you re-index or change your embedding model.</p> <h3 id="azure-ml-monitor--production-mlops-in-azure">Azure ML Monitor — Production MLOps in Azure</h3> <p>If your stack is Azure (Azure ML, Databricks, Azure Event Hubs), Azure ML Monitor gives you model performance tracking and data drift monitoring that integrates natively with your deployment infrastructure. The dashboard is less polished than Evidently’s reports, but the integration with Azure ML pipelines is seamless.</p> <p>Used this at Cummins for fleet-level model health monitoring across thousands of connected engines.</p> <h2 id="my-current-stack">My Current Stack</h2> <table> <thead> <tr> <th>Use Case</th> <th>Tool</th> </tr> </thead> <tbody> <tr> <td>Classic ML drift</td> <td>Evidently</td> </tr> <tr> <td>LLM/RAG tracing</td> <td>LangSmith</td> </tr> <tr> <td>Embedding drift</td> <td>Arize Phoenix</td> </tr> <tr> <td>Azure production</td> <td>Azure ML Monitor</td> </tr> </tbody> </table> <p>No single tool covers everything. The combination of Evidently (classical ML) + LangSmith (LLM layer) covers 90% of what most teams need.</p> <h2 id="the-monitoring-you-actually-need">The Monitoring You Actually Need</h2> <p>Tooling aside, here’s the minimum viable monitoring setup I recommend:</p> <ol> <li><strong>Data drift alert</strong> on your top 10 input features (Evidently)</li> <li><strong>Retrieval Hit Rate@5</strong> tracked weekly (LangSmith + custom evaluator)</li> <li><strong>Faithfulness score</strong> tracked on a sample of production queries (RAGAS via LangSmith)</li> <li><strong>Latency p50/p99</strong> for your full pipeline (any APM tool)</li> <li><strong>Error rate</strong> on LLM API calls and vector store queries</li> </ol> <p>If those five metrics are green and trending right, your system is almost certainly behaving. 
If any one of them degrades, you have a signal to investigate before users notice.</p>]]></content><author><name></name></author><category term="writing"/><category term="MLOps"/><category term="monitoring"/><category term="Evidently"/><category term="LangSmith"/><category term="Arize"/><category term="Azure ML"/><summary type="html"><![CDATA[A practitioner's comparison of Evidently, Arize Phoenix, Azure ML Monitor, and LangSmith for production ML and LLM monitoring — based on real deployments.]]></summary></entry></feed>