Analysing Model Performance from ROC and Precision-Recall Curves

ROC and PR curves are commonly used to present results for binary decision problems in machine learning. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The area under the ROC curve (AUC) is a summary measure of performance: it equals the probability that a randomly chosen positive instance is ranked higher than a randomly chosen negative instance.
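With scikit-learn, the curve and its AUC can be obtained directly; the labels and scores below are toy values for illustration:

```python
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

# Toy labels and classifier scores (assumed for illustration)
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

fpr, tpr, thresholds = roc_curve(y_true, y_score)  # one point per threshold
auc = roc_auc_score(y_true, y_score)
```

Plotting tpr against fpr gives the ROC curve itself.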

When dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm’s performance. There is a deep connection between ROC space and PR space: a curve dominates in ROC space if and only if it dominates in PR space.
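As a sketch of the PR counterpart (scikit-learn assumed, with a toy class-skewed dataset in which only 2 of 10 samples are positive):

```python
import numpy as np
from sklearn.metrics import precision_recall_curve, average_precision_score

# Skewed toy data: 2 positives out of 10
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_score = np.array([0.10, 0.20, 0.15, 0.30, 0.25, 0.05, 0.40, 0.35, 0.80, 0.60])

precision, recall, thresholds = precision_recall_curve(y_true, y_score)
ap = average_precision_score(y_true, y_score)  # area-under-PR-curve summary
```

Here both positives outrank every negative, so the average precision is 1.0 despite the skew.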

Sensitivity (positive in disease)

Sensitivity is the ability of a test to correctly classify an individual as ‘diseased’:

Sensitivity = true_positive / (true_positive + false_negative)
            = Probability of being test positive when disease present

Specificity (negative in health)

The ability of a test to correctly classify an individual as disease-free:

Specificity = true_negative / (true_negative + false_positive)
            = Probability of being test negative when disease absent

Sensitivity and specificity trade off against each other: as the decision threshold is lowered, sensitivity increases while specificity decreases.
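A small sketch of this trade-off, with toy diagnostic scores (all names and values assumed):

```python
def sens_spec(y_true, y_score, thr):
    """Sensitivity and specificity at one decision threshold."""
    tp = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s >= thr)
    fn = sum(1 for t, s in zip(y_true, y_score) if t == 1 and s < thr)
    tn = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s < thr)
    fp = sum(1 for t, s in zip(y_true, y_score) if t == 0 and s >= thr)
    return tp / (tp + fn), tn / (tn + fp)

y_true = [1, 1, 1, 0, 0, 0]
y_score = [0.9, 0.6, 0.4, 0.5, 0.3, 0.1]

strict = sens_spec(y_true, y_score, 0.5)   # higher threshold
loose = sens_spec(y_true, y_score, 0.25)   # lower threshold
# lowering the threshold raises sensitivity at the cost of specificity
```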

Positive Predictive Value (PPV)

The percentage of patients with a positive test who actually have the disease:

PPV = true_positive / (true_positive + false_positive)

Negative Predictive Value (NPV)

The percentage of patients with a negative test who do not have the disease:

NPV = true_negative / (false_negative + true_negative)

PPV and NPV are influenced by prevalence. In a high-prevalence setting, PPV increases and NPV decreases.
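A quick Bayes'-rule sketch makes this dependence concrete (the 90%-sensitive, 90%-specific test below is an assumed example):

```python
def ppv(sens, spec, prev):
    # Bayes' rule: P(disease | positive test)
    return sens * prev / (sens * prev + (1 - spec) * (1 - prev))

# The same test applied at two prevalences
low_prev = ppv(0.9, 0.9, 0.01)   # rare disease: most positives are false alarms
high_prev = ppv(0.9, 0.9, 0.20)  # common disease: PPV is far higher
```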

Methods to Find the Optimal Threshold

Three criteria are commonly used:

  1. Closest point to (0,1) on the ROC curve
  2. Youden Index — maximizes sensitivity + specificity - 1
  3. Minimize cost criterion — accounts for financial and ethical costs
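The first criterion can be sketched directly from the ROC arrays (the toy fpr/tpr/threshold values below are assumed):

```python
import numpy as np

def closest_to_corner(fpr, tpr, thresholds):
    # Euclidean distance from each ROC point to the ideal corner (0, 1)
    dist = np.sqrt(fpr ** 2 + (1 - tpr) ** 2)
    return thresholds[np.argmin(dist)]

fpr = np.array([0.0, 0.2, 0.5, 1.0])
tpr = np.array([0.0, 0.8, 0.9, 1.0])
thr = np.array([1.0, 0.6, 0.4, 0.0])
best = closest_to_corner(fpr, tpr, thr)  # the point (0.2, 0.8) is nearest
```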

The Youden Index is the most commonly used because it reflects maximizing the correct classification rate:

J = max[sensitivity + specificity - 1]
j = model_metric.loc[model_metric['yod_index'].idxmax(), 'thres']  # threshold at the peak

A secondary cutoff for risk stratification can be computed using accuracy stability:

def cal_cutoff2(data):
    """Return the first threshold at which accuracy has stabilised,
    i.e. it changes by less than 0.002 over the next ten threshold steps."""
    val = 0
    for i in range(len(data) - 10):
        if abs(data['acc'].iloc[i] - data['acc'].iloc[i + 10]) < 0.002:
            val = data['thres'].iloc[i]
            break
    return val

Python Code for Performance Metrics

import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def calculate_metric(outcome, score):
    obser = [1.0 if x == 1 else 0.0 for x in outcome]
    score = [float(i) for i in score]
    auc = roc_auc_score(obser, score)  # threshold-independent: compute once
    thres = np.arange(0.01, 1.01, 0.01)
    results = []
    for t in thres:
        m = ROC_parameters(obser, score, t)
        m['thres'] = t
        m['auc'] = auc
        results.append(m)
    return pd.DataFrame(results)

def ROC_parameters(obser, score, thr):
    """Confusion-matrix metrics at a single decision threshold."""
    pred = [1 if s >= thr else 0 for s in score]
    p_ind = [i for i, x in enumerate(obser) if x == 1]
    n_ind = [i for i, x in enumerate(obser) if x == 0]
    TP = sum(pred[i] == 1 for i in p_ind)
    FP = sum(pred[i] == 1 for i in n_ind)
    TN = sum(pred[i] == 0 for i in n_ind)
    FN = sum(pred[i] == 0 for i in p_ind)
    acc = (TP + TN) / len(pred)
    ppv = TP / (TP + FP) if TP + FP > 0 else float('nan')
    npv = TN / (TN + FN) if TN + FN > 0 else float('nan')
    sen = TP / (TP + FN) if TP + FN > 0 else float('nan')
    spe = TN / (TN + FP) if TN + FP > 0 else float('nan')
    recall = sen
    precision = ppv
    F1 = 2 * recall * precision / (recall + precision) if (recall + precision) > 0 else float('nan')
    yod = sen + spe - 1
    return dict(acc=acc, ppv=ppv, npv=npv, sensitivity=sen,
                specificity=spe, yod_index=yod, recall=recall,
                precision=precision, F1=F1)

Calculating Confidence Intervals via Bootstrapping

def calculate_metric_bootstrap(outcome, score, n_bootstraps=1000):
    rng = np.random.RandomState(42)
    bootstrap_results = []
    for _ in range(n_bootstraps):
        indices = rng.randint(0, len(outcome), len(outcome))
        if len(np.unique(np.array(outcome)[indices])) < 2:
            continue
        bootstrap_results.append(
            calculate_metric(np.array(outcome)[indices],
                             np.array(score)[indices])
        )
    return bootstrap_results

def confidence_interval(bootstrap_results, ci=0.95):
    lower_p = (1 - ci) / 2
    upper_p = 1 - lower_p
    all_metrics = pd.concat(bootstrap_results)
    # group by threshold so each interval is taken across bootstrap
    # replicates at the same operating point, not across thresholds
    grouped = all_metrics.groupby('thres')
    ci_lower = grouped.quantile(lower_p)
    ci_upper = grouped.quantile(upper_p)
    return ci_lower, ci_upper
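For a single summary metric such as AUC, the same percentile-bootstrap idea reduces to a few lines (the synthetic labels and scores are assumed for illustration):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(0)
y = rng.randint(0, 2, 200)            # synthetic labels
s = y * 0.5 + rng.rand(200)           # scores loosely correlated with labels

aucs = []
for _ in range(500):
    idx = rng.randint(0, len(y), len(y))      # resample with replacement
    if len(np.unique(y[idx])) < 2:            # need both classes for AUC
        continue
    aucs.append(roc_auc_score(y[idx], s[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])     # 95% percentile interval
```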

The Bias-Variance Trade-off

Bias and variance are inherent properties of estimators. Two useful diagnostic tools:

Learning Curves — show validation and training scores for varying training set sizes. If both scores converge to a similarly low value, the model underfits (high bias) and a more complex model is likely needed. If the training score sits well above the validation score, the model has high variance and adding more data will likely help.
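A minimal learning-curve sketch with scikit-learn (synthetic data and a logistic-regression model assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=300, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5), cv=5)
# plot train_scores.mean(axis=1) and val_scores.mean(axis=1) against sizes
```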

Model Complexity Plots — show how performance changes with model complexity, helping identify the sweet spot between underfitting and overfitting.
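Such a plot can be sketched with scikit-learn's validation_curve, sweeping tree depth as the complexity knob (synthetic data and a decision-tree model assumed):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import validation_curve
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
depths = [1, 2, 4, 8, 16]
train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    param_name="max_depth", param_range=depths, cv=5)
# the gap between mean train and validation scores widens as depth
# grows past the sweet spot, signalling overfitting
```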