
Agentic AI Evaluation: Production-Ready Systems — Udemy Course
Coming to Udemy  ·  New course 2026

Agentic AI Evaluation
Production-Ready Systems
LangGraph · LangSmith · RAGAS · CI/CD Quality Gates

Move beyond "it works in the demo." Build automated, rigorous evaluation pipelines for your RAG and agentic AI systems — measuring retrieval quality, generation faithfulness, latency, and cost, all the way to GitHub Actions deployment gates.

10 Modules
2 Tracks · Core + Advanced
9+ Capstone deliverables
$0 Tool licensing cost
System: Agentic RAG → Retrieval: RAGAS Eval → Generation: LLM-as-Judge → Tracing: LangSmith → Gate: CI/CD Block → Monitor: Drift Alert

Why this course exists

Shipping an agentic AI system without evaluation infrastructure is like deploying a web service with no monitoring. You will not know when it breaks — until your users tell you.

📉
Silent production failure
Retrieval drifts, hallucinations slip through, costs balloon — all without a single error log. Evaluation infrastructure is the only way to catch this before users do.
🔬
Metrics that actually matter
Beyond vibe checks and manual spot-testing — NDCG, Hit Rate@K, faithfulness, answer relevancy, and LLM-as-Judge scoring built into your CI pipeline.
🏗️
Production patterns, not toy demos
Every module produces working, deployable artifacts — a retrieval evaluator, a CI quality gate, an embedding drift detector — that you can plug into a real system on day one.
🤖
Agentic and multi-hop ready
Standard RAG evaluation is not enough. This course covers trajectory evaluation, tool call correctness, and consistency testing for multi-agent systems running in production.

Built for engineers
who ship to production

This is not an introduction to LLMs. It is an engineering course for practitioners who have already built agentic or RAG systems and need to make them measurably reliable.

ML engineers & AI architects maintaining RAG or agentic pipelines in production who need systematic quality measurement, not ad-hoc testing.
Senior developers who have shipped LLM-powered features and now need to instrument them for drift, hallucination, and cost visibility.
AI architects designing evaluation strategies for enterprise agentic systems — including multi-agent orchestration with LangGraph.
Data scientists with Python and LLM API experience who want to move into production MLOps for AI systems beyond classical ML monitoring.
Healthcare and clinical AI engineers who need domain-specific evaluation rubrics — the M9 case study covers hallucination detection in a production clinical RAG system.
Platform and DevOps engineers who own CI/CD pipelines and need to understand what quality gates for AI systems should look like before implementing them.
Prerequisites — you need these coming in
Python (intermediate) · LLM API basics · RAG fundamentals · Git & command line · Docker basics (M8+)

Real evaluators, real code

Every module produces working code you can deploy. Here's a taste of the retrieval evaluator you will build in Module 3.

retrieval_evaluator.py  ·  Module 3 — Building Retrieval Evaluators
# Evaluate retrieval quality with Hit Rate@K, MRR, and NDCG
from ragas import evaluate
from ragas.metrics import ContextPrecision, ContextRecall
from datasets import Dataset
import numpy as np

def ndcg_at_k(retrieved_ids, relevant_ids, k=5):
    """Normalized Discounted Cumulative Gain @K"""
    hits = [1 if doc in relevant_ids else 0 for doc in retrieved_ids[:k]]
    dcg = sum(h / np.log2(i + 2) for i, h in enumerate(hits))
    idcg = sum(1 / np.log2(i + 2) for i in range(min(len(relevant_ids), k)))
    return dcg / idcg if idcg else 0.0

# Run the full eval against the golden dataset (questions, retrieved contexts,
# and reference answers are built earlier in the module)
eval_data = Dataset.from_dict({
    "question": golden_questions,
    "contexts": retrieved_contexts,
    "ground_truth": reference_answers,
})
results = evaluate(eval_data, metrics=[ContextPrecision(), ContextRecall()])
# → Publishes scores to LangSmith experiment dashboard

10 modules across two tracks

Track A builds the foundation — retrieval evaluation, generation scoring, and CI/CD gates. Track B extends into drift detection, cost optimization, multi-agent evaluation, and a healthcare case study.

Track A — Core Curriculum (Modules 1–5)
M1
Why Evaluation Matters
~60 min
5 lessons
The evaluation problem in production · The Evaluation Pyramid · What "production-ready" actually means · Course roadmap walkthrough
M2
Metrics Demystified
~80 min
6 lessons · 1 lab
Retrieval vs generation metrics · Precision, Recall, F1 for RAG · LangSmith setup & tracing · First experiment run
M3
Building Retrieval Evaluators
~100 min
7 lessons · 1 lab
Hit Rate@K · MRR implementation · NDCG from scratch · RAGAS context precision & recall · Golden dataset construction
M4
Generation Quality & LLM-as-Judge
~110 min
8 lessons · 1 lab
RAGAS faithfulness scoring · Answer relevancy metric · LLM-as-Judge pipeline design · Bias mitigation strategies · Human-in-the-loop calibration
M5
CI/CD Quality Gates
~90 min
7 lessons · 1 lab
GitHub Actions eval workflow · Threshold configuration · Blocking deployments on quality drop · LangSmith experiment comparison
Track B — Advanced Modules (Modules 6–10)
M6
Embedding Drift & Re-indexing
~90 min
6 lessons · 1 lab
Cosine centroid drift detection · Spearman correlation monitoring · Chunk-level citation tracking · Re-indexing strategy
M7
Latency & Cost Optimization
~80 min
6 lessons · 1 lab
Pipeline profiling · Cost-per-query analysis · Latency vs quality tradeoff · Caching strategies
M8
Multi-Agent Evaluation
~100 min
7 lessons · 1 lab
Trajectory evaluation · Tool call correctness scoring · Consistency testing across agents · LangGraph tracing with LangSmith
M9
Healthcare RAG Case Study
~90 min
6 lessons · case study
Clinical safety rubric design · Hallucination caught in production · Domain-specific eval patterns · Regulatory & compliance framing
M10
Capstone Project
~120 min
9 deliverables · portfolio
Build your own eval framework · 9-deliverable checklist · LangSmith dashboard setup · CI gate deployment · Portfolio writeup

Seven production artifacts
you will build

Labs are not toy exercises. Each produces a deployable artifact you can extend, adapt, and put in a portfolio — or plug directly into a system you already run.

01
🔍
Retrieval Evaluator
A working evaluator measuring Hit Rate@K, MRR, and NDCG against a golden dataset — publishable directly to a LangSmith experiment for tracking over time.
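To make the metrics concrete, here is a minimal sketch of Hit Rate@K and MRR in the same style as the NDCG function shown earlier; the function names and list-of-IDs inputs are illustrative, not the lab's exact API.

# Illustrative Hit Rate@K and MRR helpers for a single query.
# Inputs: ranked retrieved doc IDs and the set of relevant doc IDs.
def hit_rate_at_k(retrieved_ids, relevant_ids, k=5):
    """1.0 if any relevant document appears in the top-k results, else 0.0."""
    return 1.0 if any(doc in relevant_ids for doc in retrieved_ids[:k]) else 0.0

def mrr(retrieved_ids, relevant_ids):
    """Reciprocal rank of the first relevant document (0.0 if none retrieved)."""
    for rank, doc in enumerate(retrieved_ids, start=1):
        if doc in relevant_ids:
            return 1.0 / rank
    return 0.0

# Average both scores across the golden dataset for dataset-level numbers.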
02
⚖️
LLM-as-Judge Pipeline
An automated generation quality scorer that rates faithfulness and answer relevancy at scale — with calibration against human labels and bias mitigation built in.
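To illustrate the judge pattern (not the exact Module 4 pipeline, which layers calibration and bias mitigation on top of this idea), here is a minimal faithfulness judge using the OpenAI client; the prompt wording, model name, and 1 to 5 scale are assumptions for the sketch.

# Minimal LLM-as-Judge sketch; prompt, model, and scale are illustrative.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a RAG answer for faithfulness.
Question: {question}
Retrieved context: {context}
Answer: {answer}
Rate faithfulness from 1 (contradicts the context) to 5 (fully grounded).
Reply with the number only."""

def judge_faithfulness(question: str, context: str, answer: str) -> int:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any capable judge model works here
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            question=question, context=context, answer=answer)}],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())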
03
🚦
CI/CD Quality Gate
A GitHub Actions workflow that runs your eval suite on every PR and blocks deployment if retrieval or generation scores fall below configured thresholds.
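Conceptually, the gate is a small script the workflow runs after the eval suite: compare aggregate scores to thresholds and exit nonzero so the Actions job fails and the deploy is blocked. A minimal sketch, assuming a prior step wrote scores to a JSON file (the filename and threshold values are placeholders):

# check_thresholds.py: illustrative quality-gate step for a CI workflow.
import json
import sys

THRESHOLDS = {"context_precision": 0.80, "faithfulness": 0.85}  # placeholder values

with open("eval_results.json") as f:  # written by the eval step earlier in the job
    scores = json.load(f)

failures = [
    f"{metric}: {scores.get(metric, 0.0):.3f} < {minimum:.2f}"
    for metric, minimum in THRESHOLDS.items()
    if scores.get(metric, 0.0) < minimum
]

if failures:
    print("Quality gate FAILED:\n  " + "\n  ".join(failures))
    sys.exit(1)  # nonzero exit fails the GitHub Actions job and blocks the deploy

print("Quality gate passed.")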
04
📡
Embedding Drift Detector
A monitor that tracks cosine centroid shift across your vector index over time and triggers re-indexing workflows when drift exceeds a configurable threshold.
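The core signal is straightforward: embed a reference sample and a current sample of your corpus with the same model, then compare centroid cosine similarity. A minimal sketch with sentence-transformers and numpy (the model name, sample texts, and threshold are illustrative):

# Illustrative centroid-drift check; model, samples, and threshold are examples.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

def centroid(texts):
    # normalize_embeddings=True keeps the cosine computation well-behaved
    return model.encode(texts, normalize_embeddings=True).mean(axis=0)

def centroid_drift(reference_texts, current_texts):
    """Cosine distance between the reference and current corpus centroids."""
    ref, cur = centroid(reference_texts), centroid(current_texts)
    cosine_sim = float(np.dot(ref, cur) / (np.linalg.norm(ref) * np.linalg.norm(cur)))
    return 1.0 - cosine_sim

baseline_chunks = ["sample chunk from the index at baseline", "another baseline chunk"]
latest_chunks = ["newly ingested chunk", "another recent chunk"]
if centroid_drift(baseline_chunks, latest_chunks) > 0.05:  # placeholder threshold
    print("Centroid drift exceeds threshold: trigger the re-indexing workflow")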
05
🧩
Multi-Agent Trajectory Evaluator
A LangGraph-native evaluator that scores agent trajectories for tool call correctness, step consistency, and final answer quality across multi-hop reasoning chains.
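One component of that evaluator, tool call correctness, boils down to comparing the tools the agent actually invoked against an expected reference trajectory. A minimal, framework-agnostic sketch (the Module 8 lab scores real LangGraph traces via LangSmith; the names here are illustrative):

# Illustrative tool-call correctness score; the lab reads actual trajectories
# from LangGraph traces in LangSmith rather than plain lists.
def tool_call_correctness(expected_calls, actual_calls):
    """Fraction of expected tool calls matched in order within the actual
    trajectory (a simple order-preserving match; stricter variants exist)."""
    matched, idx = 0, 0
    for call in actual_calls:
        if idx < len(expected_calls) and call == expected_calls[idx]:
            matched += 1
            idx += 1
    return matched / len(expected_calls) if expected_calls else 1.0

expected = ["search_guidelines", "fetch_patient_record", "summarize"]
actual = ["search_guidelines", "search_guidelines", "fetch_patient_record", "summarize"]
print(tool_call_correctness(expected, actual))  # 1.0 despite the retried search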
06
🏥
Clinical Safety Rubric
A domain-specific evaluation rubric for healthcare RAG systems — covering hallucination risk tiers, clinical accuracy scoring, and the regulatory framing for AI-in-the-loop systems.
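As a flavor of what a domain rubric can look like in code, here is a minimal sketch of hallucination risk tiers expressed as a data structure that a judge model or human reviewer scores against; the tier names, criteria, and gate rule are illustrative, not the Module 9 rubric.

# Illustrative clinical risk tiers; names and criteria are examples only.
from dataclasses import dataclass
from enum import Enum

class RiskTier(Enum):
    NONE = 0         # claim fully grounded in retrieved clinical sources
    MINOR = 1        # unsupported but clinically inert detail
    SIGNIFICANT = 2  # unsupported claim that could alter interpretation
    CRITICAL = 3     # unsupported dosage, contraindication, or diagnosis claim

@dataclass
class RubricFinding:
    claim: str
    tier: RiskTier
    citation_found: bool

def passes_safety_gate(findings: list[RubricFinding]) -> bool:
    """Block release if any finding reaches SIGNIFICANT or worse."""
    return all(f.tier.value < RiskTier.SIGNIFICANT.value for f in findings)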
07
📦
Full Eval Framework
The capstone deliverable — a complete, modular evaluation framework integrating all prior artifacts, documented and packaged for deployment to any agentic AI system you own.

All open source, all free

LangGraph
LangSmith
RAGAS
LangChain
ChromaDB
Anthropic API
OpenAI API
GitHub Actions
sentence-transformers
Python 3.11+
TruLens
Hugging Face

Built by someone who
caught the failures in production

Mohcine Madkour, PhD
Senior AI/ML Engineer & Architect · Biomedical Informatics
I have spent a decade building AI systems that run on real data — from the Da Vinci surgical robotics RAG system at Intuitive Surgical, to predictive maintenance pipelines at Cummins ($700K annual savings), to surgical risk prediction at UF Shands (AUC 0.82–0.94). In every one of those systems, the hardest problems were evaluation problems: knowing when retrieval drifted, catching hallucinations before clinicians did, and proving to stakeholders that the system was improving, not just changing. This course is what I wish had existed when I was building those systems.
PhD, Computer Science · Postdoc, UTHealth Houston · Intuitive Surgical · Cummins ($700K savings) · UF Shands, MySurgeryRisk · AI/ML Boot Camp Instructor · SharpestMinds Mentor

Early access — launching 2026

Stop shipping agentic AI
without measuring it

In 10 modules you will go from manual spot-testing to a fully automated evaluation pipeline — with retrieval metrics, LLM-as-Judge scoring, CI/CD gates, and drift detection running on every deployment. All tools are free. All labs produce deployable code.

$19.99
List price $129.99
Launch discount · early access

Join the waitlist on Udemy
30-day Udemy money-back guarantee · lifetime access · certificate of completion