
Agentic AI Evaluation: Production-Ready Systems — Udemy Course
Coming to Udemy  ·  New course 2026

Agentic AI Evaluation
Production-Ready Systems
LangGraph · LangSmith · RAGAS · CI/CD Quality Gates

Move beyond "it works in the demo." Build automated, rigorous evaluation pipelines for your RAG and agentic AI systems — measuring retrieval quality, generation faithfulness, latency, and cost, all the way to GitHub Actions deployment gates.

10 Modules
2 Tracks · Core + Advanced
9+ Capstone deliverables
$0 Tool licensing cost
System: Agentic RAG → Retrieval: RAGAS Eval → Generation: LLM-as-Judge → Tracing: LangSmith → Gate: CI/CD Block → Monitor: Drift Alert

Why this course exists

Shipping an agentic AI system without evaluation infrastructure is like deploying a web service with no monitoring. You will not know when it breaks — until your users tell you.

📉
Silent production failure
Retrieval drifts, hallucinations slip through, costs balloon — all without a single error log. Evaluation infrastructure is the only way to catch this before users do.
🔬
Metrics that actually matter
Beyond vibe checks and manual spot-testing — NDCG, Hit Rate@K, faithfulness, answer relevancy, and LLM-as-Judge scoring built into your CI pipeline.
🏗️
Production patterns, not toy demos
Every module produces working, deployable artifacts — a retrieval evaluator, a CI quality gate, an embedding drift detector — that you can plug into a real system on day one.
🤖
Agentic and multi-hop ready
Standard RAG evaluation is not enough. This course covers trajectory evaluation, tool call correctness, and consistency testing for multi-agent systems running in production.

Built for engineers
who ship to production

This is not an introduction to LLMs. It is an engineering course for practitioners who have already built agentic or RAG systems and need to make them measurably reliable.

ML engineers & AI architects maintaining RAG or agentic pipelines in production who need systematic quality measurement, not ad-hoc testing.
Senior developers who have shipped LLM-powered features and now need to instrument them for drift, hallucination, and cost visibility.
AI architects designing evaluation strategies for enterprise agentic systems — including multi-agent orchestration with LangGraph.
Data scientists with Python and LLM API experience who want to move into production MLOps for AI systems beyond classical ML monitoring.
Healthcare and clinical AI engineers who need domain-specific evaluation rubrics — the M9 case study covers hallucination detection in a production clinical RAG system.
Platform and DevOps engineers who own CI/CD pipelines and need to understand what quality gates for AI systems should look like before implementing them.
Prerequisites — you need these coming in
Python (intermediate) · LLM API basics · RAG fundamentals · Git & command line · Docker basics (M8+)

Real evaluators, real code

Every module produces working code you can deploy. Here's a taste of the retrieval evaluator you will build in Module 3.

retrieval_evaluator.py  ·  Module 3 — Building Retrieval Evaluators
# Evaluate retrieval quality with Hit Rate@K, MRR, and NDCG
from ragas import evaluate
from ragas.metrics import ContextPrecision, ContextRecall
from datasets import Dataset
import numpy as np

def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """Normalized Discounted Cumulative Gain @K"""
    hits = [1 if doc in relevant_ids else 0 for doc in retrieved_ids[:k]]
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg else 0.0

# Run the full eval against the golden dataset (questions, retrieved contexts,
# and reference answers are built earlier in the module)
eval_data = Dataset.from_dict({
    "question": golden_questions,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})
results = evaluate(eval_data, metrics=[ContextPrecision(), ContextRecall()])
# → Publishes scores to LangSmith experiment dashboard

10 modules across two tracks

Track A builds the foundation — retrieval evaluation, generation scoring, and CI/CD gates. Track B extends into drift detection, cost optimization, multi-agent evaluation, and a healthcare case study.

Track A — Core Curriculum (Modules 1–5)
M1
Why Evaluation Matters
~60 min
5 lessons
The evaluation problem in production · The Evaluation Pyramid · What "production-ready" actually means · Course roadmap walkthrough
M2
Metrics Demystified
~80 min
6 lessons · 1 lab
Retrieval vs generation metrics · Precision, Recall, F1 for RAG · LangSmith setup & tracing · First experiment run
M3
Building Retrieval Evaluators
~100 min
7 lessons · 1 lab
Hit Rate@K · MRR implementation · NDCG from scratch · RAGAS context precision & recall · Golden dataset construction
M4
Generation Quality & LLM-as-Judge
~110 min
8 lessons · 1 lab
RAGAS faithfulness scoring · Answer relevancy metric · LLM-as-Judge pipeline design · Bias mitigation strategies · Human-in-the-loop calibration
M5
CI/CD Quality Gates
~90 min
7 lessons · 1 lab
GitHub Actions eval workflow · Threshold configuration · Blocking deployments on quality drop · LangSmith experiment comparison
Track B — Advanced Modules (Modules 6–10)
M6
Embedding Drift & Re-indexing
~90 min
6 lessons · 1 lab
Cosine centroid drift detection · Spearman correlation monitoring · Chunk-level citation tracking · Re-indexing strategy
M7
Latency & Cost Optimization
~80 min
6 lessons · 1 lab
Pipeline profiling · Cost-per-query analysis · Latency vs quality tradeoff · Caching strategies
M8
Multi-Agent Evaluation
~100 min
7 lessons · 1 lab
Trajectory evaluation · Tool call correctness scoring · Consistency testing across agents · LangGraph tracing with LangSmith
M9
Healthcare RAG Case Study
~90 min
6 lessons · case study
Clinical safety rubric design · Hallucination caught in production · Domain-specific eval patterns · Regulatory & compliance framing
M10
Capstone Project
~120 min
9 deliverables · portfolio
Build your own eval framework · 9-deliverable checklist · LangSmith dashboard setup · CI gate deployment · Portfolio writeup

Seven production artifacts
you will build

Labs are not toy exercises. Each produces a deployable artifact you can extend, adapt, and put in a portfolio — or plug directly into a system you already run.

01
🔍
Retrieval Evaluator
A working evaluator measuring Hit Rate@K, MRR, and NDCG against a golden dataset — publishable directly to a LangSmith experiment for tracking over time.
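To make the metrics concrete, here is a minimal sketch of Hit Rate@K and MRR in the same style as the NDCG function shown earlier; the function names and list-of-IDs inputs are illustrative, not the lab's exact API.

# Illustrative Hit Rate@K and MRR helpers for a single query.
# Inputs: ranked retrieved doc IDs and the set of relevant doc IDs.
def hit_rate_at_k(retrieved_ids, relevant_ids, k=5):
    """1.0 if any relevant document appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in retrieved_ids[:k]) else 0.0

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc in enumerate(retrieved_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# Average both scores across the golden dataset for dataset-level numbers.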
02
⚖️
LLM-as-Judge Pipeline
An automated generation quality scorer that rates faithfulness and answer relevancy at scale — with calibration against human labels and bias mitigation built in.
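To illustrate the judge pattern (not the exact Module 4 pipeline, which layers calibration and bias mitigation on top of this idea), here is a minimal faithfulness judge using the OpenAI client; the prompt wording, model name, and 1 to 5 scale are assumptions for the sketch.

# Minimal LLM-as-Judge sketch; prompt, model, and scale are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Rate faithfulness from 1 (contradicts the context) to 5 (fully grounded).
Reply with the number only."""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())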
03
🚦
CI/CD Quality Gate
A GitHub Actions workflow that runs your eval suite on every PR and blocks deployment if retrieval or generation scores fall below configured thresholds.
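Conceptually, the gate is a small script the workflow runs after the eval suite: compare aggregate scores to thresholds and exit nonzero so the Actions job fails and the deploy is blocked. A minimal sketch, assuming a prior step wrote scores to a JSON file (the filename and threshold values are placeholders):

# check_thresholds.py: illustrative quality-gate step for a CI workflow.
import json
import sys

THRESHOLDS = {"context_precision": 0.80, "faithfulness": 0.85}  # placeholder values

with open("eval_results.json") as f:  # written by the eval step earlier in the job
    scores = json.load(f)

failures = [
    f"{metric}: {scores.get(metric, 0.0):.3f} < {minimum:.2f}"
    for metric, minimum in THRESHOLDS.items()
    if scores.get(metric, 0.0) < minimum
]

if failures:
    print("Quality gate FAILED:\n  " + "\n  ".join(failures))
    sys.exit(1)  # nonzero exit fails the GitHub Actions job and blocks the deploy

print("Quality gate passed.")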
04
📡
Embedding Drift Detector
A monitor that tracks cosine centroid shift across your vector index over time and triggers re-indexing workflows when drift exceeds a configurable threshold.
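The core signal is straightforward: embed a reference sample and a current sample of your corpus with the same model, then compare centroid cosine similarity. A minimal sketch with sentence-transformers and numpy (the model name, sample texts, and threshold are illustrative):

# Illustrative centroid-drift check; model, samples, and threshold are examples.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def centroid(texts):
    # normalize_embeddings=True keeps the cosine computation well-behaved
    return model.encode(texts, normalize_embeddings=True).mean(axis=0)

def centroid_drift(reference_texts, current_texts):
    """Cosine distance between the reference and current corpus centroids."""
    ref, cur = centroid(reference_texts), centroid(current_texts)
    cosine_sim = float(np.dot(ref, cur) / (np.linalg.norm(ref) * np.linalg.norm(cur)))
    return 1.0 - cosine_sim

baseline_chunks = ["sample chunk from the index at baseline", "another baseline chunk"]
latest_chunks = ["newly ingested chunk", "another recent chunk"]
if centroid_drift(baseline_chunks, latest_chunks) > 0.05:  # placeholder threshold
    print("Centroid drift exceeds threshold: trigger the re-indexing workflow")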
05
🧩
Multi-Agent Trajectory Evaluator
A LangGraph-native evaluator that scores agent trajectories for tool call correctness, step consistency, and final answer quality across multi-hop reasoning chains.
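One component of that evaluator, tool call correctness, boils down to comparing the tools the agent actually invoked against an expected reference trajectory. A minimal, framework-agnostic sketch (the Module 8 lab scores real LangGraph traces via LangSmith; the names here are illustrative):

# Illustrative tool-call correctness score; the lab reads actual trajectories
# from LangGraph traces in LangSmith rather than plain lists.
def tool_call_correctness(expected_calls, actual_calls):
    """Fraction of expected tool calls matched in order within the actual
    trajectory (a simple order-preserving match; stricter variants exist)."""
    matched, idx = 0, 0
    for call in actual_calls:
        if idx < len(expected_calls) and call == expected_calls[idx]:
            matched += 1
            idx += 1
    return matched / len(expected_calls) if expected_calls else 1.0

expected = ["search_guidelines", "fetch_patient_record", "summarize"]
actual = ["search_guidelines", "search_guidelines", "fetch_patient_record", "summarize"]
print(tool_call_correctness(expected, actual))  # 1.0 despite the retried search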
06
🏥
Clinical Safety Rubric
A domain-specific evaluation rubric for healthcare RAG systems — covering hallucination risk tiers, clinical accuracy scoring, and the regulatory framing for AI-in-the-loop systems.
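As a flavor of what a domain rubric can look like in code, here is a minimal sketch of hallucination risk tiers expressed as a data structure that a judge model or human reviewer scores against; the tier names, criteria, and gate rule are illustrative, not the Module 9 rubric.

# Illustrative clinical risk tiers; names and criteria are examples only.
from dataclasses import dataclass
from enum import Enum

class RiskTier(Enum):
    NONE = 0         # claim fully grounded in retrieved clinical sources
    MINOR = 1        # unsupported but clinically inert detail
    SIGNIFICANT = 2  # unsupported claim that could alter interpretation
    CRITICAL = 3     # unsupported dosage, contraindication, or diagnosis claim

@dataclass
class RubricFinding:
    claim: str
    tier: RiskTier
    citation_found: bool

def passes_safety_gate(findings: list[RubricFinding]) -> bool:
    """Block release if any finding reaches SIGNIFICANT or worse."""
    return all(f.tier.value < RiskTier.SIGNIFICANT.value for f in findings)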
07
📦
Full Eval Framework
The capstone deliverable — a complete, modular evaluation framework integrating all prior artifacts, documented and packaged for deployment to any agentic AI system you own.

All open source, all free

LangGraph
LangSmith
RAGAS
LangChain
ChromaDB
Anthropic API
OpenAI API
GitHub Actions
sentence-transformers
Python 3.11+
TruLens
Hugging Face

Built by someone who
caught the failures in production

Mohcine Madkour, PhD
Senior AI/ML Engineer & Architect · Biomedical Informatics
I have spent a decade building AI systems that run on real data — from the Da Vinci surgical robotics RAG system at Intuitive Surgical, to predictive maintenance pipelines at Cummins ($700K annual savings), to surgical risk prediction at UF Shands (AUC 0.82–0.94). In every one of those systems, the hardest problems were evaluation problems: knowing when retrieval drifted, catching hallucinations before clinicians did, and proving to stakeholders that the system was improving, not just changing. This course is what I wish had existed when I was building those systems.
PhD, Computer Science · Postdoc, UTHealth Houston · Intuitive Surgical · Cummins ($700K savings) · UF Shands, MySurgeryRisk · AI/ML Boot Camp Instructor · SharpestMinds Mentor

Early access — launching 2026

Stop shipping agentic AI
without measuring it

In 10 modules you will go from manual spot-testing to a fully automated evaluation pipeline — with retrieval metrics, LLM-as-Judge scoring, CI/CD gates, and drift detection running on every deployment. All tools are free. All labs produce deployable code.

$19.99
List price $129.99
Launch discount · early access

Join the waitlist on Udemy
30-day Udemy money-back guarantee · lifetime access · certificate of completion