<?xml version="1.0" encoding="utf-8"?><feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en"><generator uri="https://jekyllrb.com/" version="4.4.1">Jekyll</generator><link href="https://mohcinemadkour.github.io/feed.xml" rel="self" type="application/atom+xml"/><link href="https://mohcinemadkour.github.io/" rel="alternate" type="text/html" hreflang="en"/><updated>2026-04-25T17:18:32+00:00</updated><id>https://mohcinemadkour.github.io/feed.xml</id><title type="html">Mohcine Madkour</title><subtitle>Senior AI/ML Engineer &amp; Architect | PhD in Computer Science | LangGraph · LangSmith · RAG · MLOps · Healthcare AI </subtitle><entry><title type="html">Why Most Agentic AI Systems Fail in Production (And How to Fix That)</title><link href="https://mohcinemadkour.github.io/writing/2026/agentic-ai-evaluation-intro/" rel="alternate" type="text/html" title="Why Most Agentic AI Systems Fail in Production (And How to Fix That)"/><published>2026-04-01T00:00:00+00:00</published><updated>2026-04-01T00:00:00+00:00</updated><id>https://mohcinemadkour.github.io/writing/2026/agentic-ai-evaluation-intro</id><content type="html" xml:base="https://mohcinemadkour.github.io/writing/2026/agentic-ai-evaluation-intro/"><![CDATA[<p>Most teams ship agentic AI systems without knowing if they actually work.</p> <p>They run a few manual tests, see the demo produce plausible outputs, and call it production-ready. Then the hallucinations start. Retrieval degrades silently. The agent loops. Users lose trust.</p> <p>The problem isn’t the LLM. The problem is the absence of engineering discipline around evaluation, monitoring, and continuous improvement.</p> <h2 id="the-evaluation-pyramid">The Evaluation Pyramid</h2> <p>I think about agentic AI quality in four layers:</p> <ol> <li><strong>Retrieval quality</strong> — Is the vector store returning relevant documents? (Hit Rate@K, MRR, NDCG)</li> <li><strong>Generation quality</strong> — Is the LLM using retrieved context faithfully? (RAGAS Faithfulness, Answer Relevancy)</li> <li><strong>System-level quality</strong> — Does the full pipeline meet latency and cost requirements?</li> <li><strong>Human quality</strong> — Do real users find the outputs useful and trustworthy?</li> </ol> <p>Most teams only operate at layer 4 — they notice problems after users complain. By then, trust is already damaged.</p> <p>The goal of a production evaluation framework is to catch problems at layers 1–3 automatically, before they reach users.</p> <h2 id="retrieval-where-it-usually-breaks-first">Retrieval: Where It Usually Breaks First</h2> <p>Retrieval failures are insidious because they’re invisible at the application layer. The LLM still produces fluent, confident-sounding output — it just doesn’t have the right information to work with.</p> <p>The fix is systematic retrieval evaluation:</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">ragas.metrics</span> <span class="kn">import</span> <span class="n">ContextPrecision</span><span class="p">,</span> <span class="n">ContextRecall</span>
<span class="kn">from</span> <span class="n">langsmith</span> <span class="kn">import</span> <span class="n">Client</span>

<span class="c1"># Evaluate retrieval on a test set
</span><span class="n">results</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span>
    <span class="n">dataset</span><span class="o">=</span><span class="n">test_questions</span><span class="p">,</span>
    <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="nc">ContextPrecision</span><span class="p">(),</span> <span class="nc">ContextRecall</span><span class="p">()],</span>
    <span class="n">llm</span><span class="o">=</span><span class="n">your_llm</span><span class="p">,</span>
    <span class="n">embeddings</span><span class="o">=</span><span class="n">your_embeddings</span>
<span class="p">)</span>
</code></pre></div></div> <p>Track these metrics over time in LangSmith. When they drop — and they will drop, as your data and query distribution shift — you’ll know before users do.</p> <h2 id="generation-the-llm-as-judge-pattern">Generation: The LLM-as-Judge Pattern</h2> <p>For generation quality, automated LLM-as-Judge pipelines are the current best practice. The key is using a separate, stronger model as evaluator, and prompt-engineering the rubric carefully to minimize self-serving bias.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">ragas.metrics</span> <span class="kn">import</span> <span class="n">Faithfulness</span><span class="p">,</span> <span class="n">AnswerRelevancy</span>

<span class="c1"># Is the answer grounded in the retrieved context?
# Does it actually answer what was asked?
</span><span class="n">generation_metrics</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span>
    <span class="n">dataset</span><span class="o">=</span><span class="n">qa_pairs_with_context</span><span class="p">,</span>
    <span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="nc">Faithfulness</span><span class="p">(),</span> <span class="nc">AnswerRelevancy</span><span class="p">()]</span>
<span class="p">)</span>
</code></pre></div></div> <p>Faithfulness below 0.85 is a red flag. It means your LLM is hallucinating — generating claims not supported by the retrieved context.</p> <h2 id="cicd-quality-gates">CI/CD Quality Gates</h2> <p>The highest-leverage intervention is blocking production deployments when evaluation metrics fail:</p> <div class="language-yaml highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="c1"># .github/workflows/eval.yml</span>
<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Run evaluation suite</span>
  <span class="na">run</span><span class="pi">:</span> <span class="s">python evaluate.py --dataset test_set.json</span>

<span class="pi">-</span> <span class="na">name</span><span class="pi">:</span> <span class="s">Enforce quality gates</span>
  <span class="na">run</span><span class="pi">:</span> <span class="pi">|</span>
    <span class="s">python -c "</span>
    <span class="s">import json</span>
    <span class="s">results = json.load(open('eval_results.json'))</span>
    <span class="s">assert results['hit_rate_5'] &gt; 0.80, 'Retrieval Hit Rate@5 below threshold'</span>
    <span class="s">assert results['faithfulness'] &gt; 0.85, 'Faithfulness below threshold'</span>
    <span class="s">print('All quality gates passed')</span>
    <span class="s">"</span>
</code></pre></div></div> <p>This turns evaluation from a manual ritual into an automated guardrail.</p> <h2 id="whats-next">What’s Next</h2> <p>In the coming posts, I’ll go deeper on each layer of the evaluation pyramid — with code, real results, and the lessons learned deploying these systems in healthcare and industrial settings.</p> <p>If you’re building agentic AI systems and want to talk through your evaluation strategy, <a href="https://linkedin.com/in/mohcine-madkour-83a642b2/">reach out on LinkedIn</a>.</p>]]></content><author><name></name></author><category term="writing"/><category term="agentic-ai"/><category term="RAG"/><category term="LangGraph"/><category term="LangSmith"/><category term="MLOps"/><summary type="html"><![CDATA[Agentic AI demos look great. Production deployments rarely do. Here's the engineering discipline that bridges the gap.]]></summary></entry><entry><title type="html">RAG Retrieval Metrics You Should Actually Be Tracking</title><link href="https://mohcinemadkour.github.io/writing/2026/rag-retrieval-metrics/" rel="alternate" type="text/html" title="RAG Retrieval Metrics You Should Actually Be Tracking"/><published>2026-03-20T00:00:00+00:00</published><updated>2026-03-20T00:00:00+00:00</updated><id>https://mohcinemadkour.github.io/writing/2026/rag-retrieval-metrics</id><content type="html" xml:base="https://mohcinemadkour.github.io/writing/2026/rag-retrieval-metrics/"><![CDATA[<p>If you’re building a RAG system and not tracking retrieval metrics, you’re flying blind.</p> <p>Most teams measure answer quality — they ask “does the LLM give a good response?” But answer quality is a lagging indicator. Retrieval quality is the leading indicator, and it’s where the problems start.</p> <p>Here are the three retrieval metrics I use on every RAG project, and what each one tells you.</p> <h2 id="hit-ratek">Hit Rate@K</h2> <p><strong>What it measures:</strong> Does the correct document appear anywhere in the top-K retrieved results?</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">hit_rate_at_k</span><span class="p">(</span><span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">relevant_doc_id</span><span class="p">,</span> <span class="n">k</span><span class="p">):</span>
    <span class="n">top_k_ids</span> <span class="o">=</span> <span class="p">[</span><span class="n">doc</span><span class="p">.</span><span class="nb">id</span> <span class="k">for</span> <span class="n">doc</span> <span class="ow">in</span> <span class="n">retrieved_docs</span><span class="p">[:</span><span class="n">k</span><span class="p">]]</span>
    <span class="k">return</span> <span class="mi">1</span> <span class="k">if</span> <span class="n">relevant_doc_id</span> <span class="ow">in</span> <span class="n">top_k_ids</span> <span class="k">else</span> <span class="mi">0</span>

<span class="c1"># Average over your test set
</span><span class="n">hit_rate</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="nf">hit_rate_at_k</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">rel</span><span class="p">,</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">)</span>
               <span class="k">for</span> <span class="n">r</span><span class="p">,</span> <span class="n">rel</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">relevant_ids</span><span class="p">))</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div></div> <p><strong>When it matters most:</strong> When your LLM has a large context window and can synthesize across multiple retrieved chunks. If you’re passing top-5 to the LLM, Hit Rate@5 is your primary metric.</p> <p><strong>Typical target:</strong> &gt; 0.80 for a well-tuned system.</p> <h2 id="mrr-mean-reciprocal-rank">MRR (Mean Reciprocal Rank)</h2> <p><strong>What it measures:</strong> How high does the correct document rank? Being first is better than being fifth, even if both count as a “hit.”</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">reciprocal_rank</span><span class="p">(</span><span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">relevant_doc_id</span><span class="p">):</span>
    <span class="k">for</span> <span class="n">rank</span><span class="p">,</span> <span class="n">doc</span> <span class="ow">in</span> <span class="nf">enumerate</span><span class="p">(</span><span class="n">retrieved_docs</span><span class="p">,</span> <span class="n">start</span><span class="o">=</span><span class="mi">1</span><span class="p">):</span>
        <span class="k">if</span> <span class="n">doc</span><span class="p">.</span><span class="nb">id</span> <span class="o">==</span> <span class="n">relevant_doc_id</span><span class="p">:</span>
            <span class="k">return</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">rank</span>
    <span class="k">return</span> <span class="mi">0</span>

<span class="n">mrr</span> <span class="o">=</span> <span class="nf">sum</span><span class="p">(</span><span class="nf">reciprocal_rank</span><span class="p">(</span><span class="n">r</span><span class="p">,</span> <span class="n">rel</span><span class="p">)</span>
          <span class="k">for</span> <span class="n">r</span><span class="p">,</span> <span class="n">rel</span> <span class="ow">in</span> <span class="nf">zip</span><span class="p">(</span><span class="n">results</span><span class="p">,</span> <span class="n">relevant_ids</span><span class="p">))</span> <span class="o">/</span> <span class="nf">len</span><span class="p">(</span><span class="n">results</span><span class="p">)</span>
</code></pre></div></div> <p><strong>When it matters most:</strong> When your LLM only uses the top-1 or top-2 results. If the correct document is ranked 5th, MRR penalizes that even if Hit Rate@5 counts it as a success.</p> <h2 id="ndcg-normalized-discounted-cumulative-gain">NDCG (Normalized Discounted Cumulative Gain)</h2> <p><strong>What it measures:</strong> Ranking quality across multiple relevant documents, with higher-ranked documents weighted more.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">sklearn.metrics</span> <span class="kn">import</span> <span class="n">ndcg_score</span>
<span class="kn">import</span> <span class="n">numpy</span> <span class="k">as</span> <span class="n">np</span>

<span class="c1"># relevance_scores: list of [1, 0, 1, 0, 0] for each retrieved doc
</span><span class="n">ndcg</span> <span class="o">=</span> <span class="nf">ndcg_score</span><span class="p">(</span>
    <span class="n">y_true</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">([</span><span class="n">relevance_scores</span><span class="p">]),</span>
    <span class="n">y_score</span><span class="o">=</span><span class="n">np</span><span class="p">.</span><span class="nf">array</span><span class="p">([</span><span class="n">retrieval_scores</span><span class="p">])</span>
<span class="p">)</span>
</code></pre></div></div> <p><strong>When it matters most:</strong> When queries have multiple relevant documents (e.g., “summarize everything about X”) and you need to know if the most relevant ones rank highest.</p> <h2 id="putting-it-together-in-langsmith">Putting It Together in LangSmith</h2> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">langsmith</span> <span class="kn">import</span> <span class="n">Client</span>
<span class="kn">from</span> <span class="n">langsmith.evaluation</span> <span class="kn">import</span> <span class="n">evaluate</span>

<span class="k">def</span> <span class="nf">retrieval_evaluator</span><span class="p">(</span><span class="n">run</span><span class="p">,</span> <span class="n">example</span><span class="p">):</span>
    <span class="n">retrieved</span> <span class="o">=</span> <span class="n">run</span><span class="p">.</span><span class="n">outputs</span><span class="p">[</span><span class="sh">"</span><span class="s">retrieved_docs</span><span class="sh">"</span><span class="p">]</span>
    <span class="n">relevant</span> <span class="o">=</span> <span class="n">example</span><span class="p">.</span><span class="n">outputs</span><span class="p">[</span><span class="sh">"</span><span class="s">relevant_doc_ids</span><span class="sh">"</span><span class="p">]</span>

    <span class="k">return</span> <span class="p">{</span>
        <span class="sh">"</span><span class="s">hit_rate_5</span><span class="sh">"</span><span class="p">:</span> <span class="nf">hit_rate_at_k</span><span class="p">(</span><span class="n">retrieved</span><span class="p">,</span> <span class="n">relevant</span><span class="p">[</span><span class="mi">0</span><span class="p">],</span> <span class="n">k</span><span class="o">=</span><span class="mi">5</span><span class="p">),</span>
        <span class="sh">"</span><span class="s">mrr</span><span class="sh">"</span><span class="p">:</span> <span class="nf">reciprocal_rank</span><span class="p">(</span><span class="n">retrieved</span><span class="p">,</span> <span class="n">relevant</span><span class="p">[</span><span class="mi">0</span><span class="p">]),</span>
    <span class="p">}</span>

<span class="n">results</span> <span class="o">=</span> <span class="nf">evaluate</span><span class="p">(</span>
    <span class="n">retrieval_pipeline</span><span class="p">,</span>
    <span class="n">data</span><span class="o">=</span><span class="sh">"</span><span class="s">my-rag-test-dataset</span><span class="sh">"</span><span class="p">,</span>
    <span class="n">evaluators</span><span class="o">=</span><span class="p">[</span><span class="n">retrieval_evaluator</span><span class="p">],</span>
<span class="p">)</span>
</code></pre></div></div> <p>Track these weekly. When Hit Rate@5 drops below 0.80, investigate: did your data change? Did query distribution shift? Did someone modify the chunking strategy?</p> <p>Retrieval metrics give you the early warning system that answer quality alone can’t.</p>]]></content><author><name></name></author><category term="writing"/><category term="RAG"/><category term="retrieval"/><category term="metrics"/><category term="LangSmith"/><category term="Python"/><summary type="html"><![CDATA[Hit Rate@K, MRR, NDCG — what they measure, when each one matters, and how to implement them for your RAG system.]]></summary></entry><entry><title type="html">MLOps Monitoring: The Tools Actually Worth Using in 2026</title><link href="https://mohcinemadkour.github.io/writing/2026/mlops-monitoring/" rel="alternate" type="text/html" title="MLOps Monitoring: The Tools Actually Worth Using in 2026"/><published>2026-02-15T00:00:00+00:00</published><updated>2026-02-15T00:00:00+00:00</updated><id>https://mohcinemadkour.github.io/writing/2026/mlops-monitoring</id><content type="html" xml:base="https://mohcinemadkour.github.io/writing/2026/mlops-monitoring/"><![CDATA[<p>Production ML monitoring is crowded with tools that promise everything and deliver complexity. Here’s what I’ve actually used across healthcare AI and industrial ML projects, and what each tool is genuinely good at.</p> <h2 id="the-four-tools-i-use">The Four Tools I Use</h2> <h3 id="evidently--best-for-classic-ml-drift">Evidently — Best for Classic ML Drift</h3> <p>If you have a traditional ML model (classification, regression) in production, Evidently is the most practical starting point. It generates readable HTML reports on data drift, target drift, and data quality without much setup.</p> <div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">from</span> <span class="n">evidently.report</span> <span class="kn">import</span> <span class="n">Report</span>
<span class="kn">from</span> <span class="n">evidently.metric_preset</span> <span class="kn">import</span> <span class="n">DataDriftPreset</span>

<span class="n">report</span> <span class="o">=</span> <span class="nc">Report</span><span class="p">(</span><span class="n">metrics</span><span class="o">=</span><span class="p">[</span><span class="nc">DataDriftPreset</span><span class="p">()])</span>
<span class="n">report</span><span class="p">.</span><span class="nf">run</span><span class="p">(</span><span class="n">reference_data</span><span class="o">=</span><span class="n">train_df</span><span class="p">,</span> <span class="n">current_data</span><span class="o">=</span><span class="n">production_df</span><span class="p">)</span>
<span class="n">report</span><span class="p">.</span><span class="nf">save_html</span><span class="p">(</span><span class="sh">"</span><span class="s">drift_report.html</span><span class="sh">"</span><span class="p">)</span>
</code></pre></div></div> <p>I’ve used this at Intuitive Surgical to monitor sensor feature distributions for the surgical robot predictive maintenance system. When a robot gets a firmware update that changes sensor calibration, Evidently catches the resulting distribution shift within days.</p> <p><strong>Limitation:</strong> Not built for LLMs or agentic systems. Text drift is possible but awkward.</p> <h3 id="langsmith--the-standard-for-llmrag-monitoring">LangSmith — The Standard for LLM/RAG Monitoring</h3> <p>If you’re building with LangChain or LangGraph, LangSmith is the obvious choice. It traces every step of your chain — retrieval, generation, tool calls — and stores structured run data that you can query and evaluate against.</p> <p>What makes it worth using: the combination of <strong>tracing + evaluation datasets + CI/CD integration</strong>. You can define evaluation rubrics, run them against your traces, and set up automated testing pipelines.</p> <p>I use LangSmith on every agentic AI project now. The ability to replay a failing trace and understand exactly which retrieval step went wrong is invaluable.</p> <h3 id="arize-phoenix--best-for-embedding-drift">Arize Phoenix — Best for Embedding Drift</h3> <p>When your vector store ages and queries stop matching well, the symptom is retrieval degradation. Arize Phoenix is the best tool I’ve found for visualizing embedding space drift and identifying which query clusters are underperforming.</p> <p>It also has UMAP visualizations of your embeddings over time — genuinely useful for diagnosing retrieval regression when you re-index or change your embedding model.</p> <h3 id="azure-ml-monitor--production-mlops-in-azure">Azure ML Monitor — Production MLOps in Azure</h3> <p>If your stack is Azure (Azure ML, Databricks, Azure Event Hubs), Azure ML Monitor gives you model performance tracking and data drift monitoring that integrates natively with your deployment infrastructure. The dashboard is less polished than Evidently’s reports, but the integration with Azure ML pipelines is seamless.</p> <p>Used this at Cummins for fleet-level model health monitoring across thousands of connected engines.</p> <h2 id="my-current-stack">My Current Stack</h2> <table> <thead> <tr> <th>Use Case</th> <th>Tool</th> </tr> </thead> <tbody> <tr> <td>Classic ML drift</td> <td>Evidently</td> </tr> <tr> <td>LLM/RAG tracing</td> <td>LangSmith</td> </tr> <tr> <td>Embedding drift</td> <td>Arize Phoenix</td> </tr> <tr> <td>Azure production</td> <td>Azure ML Monitor</td> </tr> </tbody> </table> <p>No single tool covers everything. The combination of Evidently (classical ML) + LangSmith (LLM layer) covers 90% of what most teams need.</p> <h2 id="the-monitoring-you-actually-need">The Monitoring You Actually Need</h2> <p>Tooling aside, here’s the minimum viable monitoring setup I recommend:</p> <ol> <li><strong>Data drift alert</strong> on your top 10 input features (Evidently)</li> <li><strong>Retrieval Hit Rate@5</strong> tracked weekly (LangSmith + custom evaluator)</li> <li><strong>Faithfulness score</strong> tracked on a sample of production queries (RAGAS via LangSmith)</li> <li><strong>Latency p50/p99</strong> for your full pipeline (any APM tool)</li> <li><strong>Error rate</strong> on LLM API calls and vector store queries</li> </ol> <p>If those five metrics are green and trending right, your system is almost certainly behaving. 
If any one of them degrades, you have a signal to investigate before users notice.</p>]]></content><author><name></name></author><category term="writing"/><category term="MLOps"/><category term="monitoring"/><category term="Evidently"/><category term="LangSmith"/><category term="Arize"/><category term="Azure ML"/><summary type="html"><![CDATA[A practitioner's comparison of Evidently, Arize Phoenix, Azure ML Monitor, and LangSmith for production ML and LLM monitoring — based on real deployments.]]></summary></entry></feed>