Analysing Model Performance from ROC and Precision-Recall Curves
ROC and PR curves are commonly used to present results for binary decision problems in machine learning. The ROC curve is created by plotting the true positive rate (TPR) against the false positive rate (FPR) at various threshold settings. The area under the ROC curve (AUC) is a summary measure of performance: it equals the probability that a randomly chosen positive example is ranked above a randomly chosen negative example.
When dealing with highly skewed datasets, Precision-Recall (PR) curves give a more informative picture of an algorithm’s performance. There is a deep connection between ROC space and PR space: a curve dominates in ROC space if and only if it dominates in PR space.
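The effect of class skew can be seen in a small sketch. The dataset and score distributions below are synthetic, chosen only to illustrate how ROC AUC can look strong while average precision (a summary of the PR curve) tells a more sobering story:

```python
# Sketch: on a heavily skewed dataset, ROC AUC can look strong while
# average precision (area under the PR curve) reveals weaker performance.
# The data and score distributions are synthetic, for illustration only.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(0)
n_neg, n_pos = 9900, 100          # 1% prevalence
y = np.concatenate([np.zeros(n_neg), np.ones(n_pos)])
# Negatives score low on average, positives slightly higher, with overlap.
scores = np.concatenate([rng.normal(0.3, 0.1, n_neg),
                         rng.normal(0.5, 0.1, n_pos)])

print("ROC AUC:", round(roc_auc_score(y, scores), 3))
print("Average precision:", round(average_precision_score(y, scores), 3))
```

On data like this the ROC AUC is high while the average precision is far lower, because precision is dragged down by the flood of negatives.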
Sensitivity (positive in disease)
Sensitivity is the ability of a test to correctly classify an individual as ‘diseased’:
Sensitivity = true_positive / (true_positive + false_negative)
= Probability of being test positive when disease present
Specificity (negative in health)
The ability of a test to correctly classify an individual as disease-free:
Specificity = true_negative / (true_negative + false_positive)
= Probability of being test negative when disease absent
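Both definitions follow directly from the 2x2 confusion table. A minimal sketch, with made-up counts:

```python
# Minimal sketch: sensitivity and specificity from a 2x2 confusion table.
# The counts below are made up for illustration.
TP, FN = 90, 10    # diseased: 90 correctly flagged, 10 missed
TN, FP = 80, 20    # healthy: 80 correctly cleared, 20 false alarms

sensitivity = TP / (TP + FN)   # P(test positive | disease present)
specificity = TN / (TN + FP)   # P(test negative | disease absent)
print(sensitivity, specificity)  # 0.9 0.8
```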
Sensitivity and specificity are inversely related: lowering the decision threshold raises sensitivity at the expense of specificity, and vice versa.
Positive Predictive Value (PPV)
The percentage of patients with a positive test who actually have the disease:
PPV = true_positive / (true_positive + false_positive)
Negative Predictive Value (NPV)
The percentage of patients with a negative test who do not have the disease:
NPV = true_negative / (false_negative + true_negative)
PPV and NPV are influenced by prevalence. In a high-prevalence setting, PPV increases and NPV decreases.
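The prevalence effect can be made concrete with Bayes' rule. The sensitivity and specificity values below (0.90 and 0.80) are assumed, not from the text, and serve only to show the trend:

```python
# Sketch of how prevalence shifts PPV and NPV for a fixed test.
# Sensitivity 0.90 and specificity 0.80 are assumed illustrative values.
def predictive_values(sens, spec, prevalence):
    tp = sens * prevalence               # expected true-positive fraction
    fp = (1 - spec) * (1 - prevalence)   # expected false-positive fraction
    tn = spec * (1 - prevalence)         # expected true-negative fraction
    fn = (1 - sens) * prevalence         # expected false-negative fraction
    return tp / (tp + fp), tn / (tn + fn)

for prev in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.80, prev)
    print(f"prevalence={prev:.2f}  PPV={ppv:.3f}  NPV={npv:.3f}")
```

As prevalence rises from 1% to 50%, PPV climbs while NPV falls, exactly as stated above.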
Methods to Find the Optimal Threshold
Three criteria are commonly used:
- Closest point to (0,1) on the ROC curve
- Youden Index — maximizes sensitivity + specificity - 1
- Minimize cost criterion — accounts for financial and ethical costs
The Youden Index is the most commonly used because it reflects maximizing the correct classification rate:
J = max[sensitivity + specificity - 1]
# Threshold at which the Youden index is maximal
# (idxmax returns an index label, so select with .loc, not .iloc):
j = model_metric.loc[model_metric['yod_index'].idxmax(), 'thres']
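As a self-contained sketch of the same idea, the Youden-optimal threshold can be found by sweeping thresholds directly. The data below are synthetic and only illustrative:

```python
# Self-contained sketch of Youden-index threshold selection;
# the labels and score distributions are synthetic.
import numpy as np

rng = np.random.default_rng(1)
y = np.concatenate([np.zeros(500), np.ones(500)])
scores = np.concatenate([rng.normal(0.4, 0.15, 500),
                         rng.normal(0.6, 0.15, 500)])

thresholds = np.arange(0.01, 1.0, 0.01)
youden = []
for t in thresholds:
    pred = scores >= t
    sens = np.mean(pred[y == 1])      # TPR among true positives
    spec = np.mean(~pred[y == 0])     # TNR among true negatives
    youden.append(sens + spec - 1)

best = thresholds[int(np.argmax(youden))]
print("Youden-optimal threshold:", round(best, 2))
```

With symmetric score distributions like these, the selected threshold lands near the midpoint between the two class means.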
A secondary cutoff for risk stratification can be computed using accuracy stability:
def cal_cutoff2(data):
    """Return the first threshold at which accuracy stabilises,
    i.e. changes by less than 0.002 over the next 10 thresholds."""
    val = 0
    for i in range(len(data) - 10):
        if abs(data['acc'].iloc[i] - data['acc'].iloc[i + 10]) < 0.002:
            val = data['thres'].iloc[i]
            break
    return val
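To see what this criterion selects, here is a sketch run on a synthetic accuracy curve that climbs and then plateaus (the curve shape is invented purely to exercise the stopping rule):

```python
# Demo of the accuracy-stability criterion: scan thresholds until
# accuracy changes by less than 0.002 over the next 10 steps.
# The accuracy curve here is synthetic, just to exercise the logic.
import numpy as np
import pandas as pd

thres = np.arange(0.01, 1.0, 0.01)
acc = 0.5 + 0.4 * (1 - np.exp(-6 * thres))   # climbs, then plateaus
data = pd.DataFrame({'thres': thres, 'acc': acc})

cutoff = 0.0
for i in range(len(data) - 10):
    if abs(data['acc'].iloc[i] - data['acc'].iloc[i + 10]) < 0.002:
        cutoff = data['thres'].iloc[i]
        break
print("secondary cutoff:", round(cutoff, 2))
```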
Python Code for Performance Metrics
import numpy as np
import pandas as pd
from sklearn.metrics import roc_auc_score

def calculate_metric(outcome, score):
    """Sweep thresholds from 0.01 to 1.00 and collect metrics at each."""
    obser = [1.0 if x == 1 else 0.0 for x in outcome]
    score = [float(i) for i in score]
    auc = roc_auc_score(obser, score)  # threshold-independent; compute once
    results = []
    for t in np.arange(0.01, 1.01, 0.01):
        m = ROC_parameters(obser, score, t)
        m['thres'] = t
        m['auc'] = auc
        results.append(m)
    return pd.DataFrame(results)
def ROC_parameters(obser, score, thr):
    """Confusion-matrix-derived metrics at a single threshold."""
    pred = [1 if s >= thr else 0 for s in score]
    p_ind = [i for i, x in enumerate(obser) if x == 1]
    n_ind = [i for i, x in enumerate(obser) if x == 0]
    TP = sum(pred[i] == 1 for i in p_ind)
    FP = sum(pred[i] == 1 for i in n_ind)
    TN = sum(pred[i] == 0 for i in n_ind)
    FN = sum(pred[i] == 0 for i in p_ind)
    acc = (TP + TN) / len(pred)
    ppv = TP / (TP + FP) if TP + FP > 0 else float('nan')
    npv = TN / (TN + FN) if TN + FN > 0 else float('nan')
    sen = TP / (TP + FN) if TP + FN > 0 else float('nan')
    spe = TN / (TN + FP) if TN + FP > 0 else float('nan')
    recall = sen
    precision = ppv
    F1 = 2 * recall * precision / (recall + precision) if (recall + precision) > 0 else float('nan')
    yod = sen + spe - 1
    return dict(acc=acc, ppv=ppv, npv=npv, sensitivity=sen,
                specificity=spe, yod_index=yod, recall=recall,
                precision=precision, F1=F1)
Calculating Confidence Intervals via Bootstrapping
def calculate_metric_bootstrap(outcome, score, n_bootstraps=1000):
    """Resample with replacement and recompute the metric table each time."""
    rng = np.random.RandomState(42)
    outcome, score = np.array(outcome), np.array(score)
    bootstrap_results = []
    for _ in range(n_bootstraps):
        indices = rng.randint(0, len(outcome), len(outcome))
        if len(np.unique(outcome[indices])) < 2:
            continue  # skip resamples that contain only one class
        bootstrap_results.append(
            calculate_metric(outcome[indices], score[indices])
        )
    return bootstrap_results
def confidence_interval(bootstrap_results, ci=0.95):
    """Percentile confidence interval per threshold across replicates."""
    lower_p = (1 - ci) / 2
    upper_p = 1 - lower_p
    all_metrics = pd.concat(bootstrap_results)
    # Group by threshold so quantiles are taken across bootstrap
    # replicates, not pooled across different thresholds.
    grouped = all_metrics.groupby('thres')
    ci_lower = grouped.quantile(lower_p)
    ci_upper = grouped.quantile(upper_p)
    return ci_lower, ci_upper
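The same idea in a compact, self-contained form for a single summary metric (ROC AUC). The data are synthetic and 1000 resamples is an arbitrary default:

```python
# Compact sketch of a bootstrap 95% CI for one summary metric (ROC AUC);
# data are synthetic, and 1000 resamples is an arbitrary default.
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.RandomState(42)
y = np.concatenate([np.zeros(300), np.ones(300)])
scores = np.concatenate([rng.normal(0.4, 0.15, 300),
                         rng.normal(0.6, 0.15, 300)])

aucs = []
for _ in range(1000):
    idx = rng.randint(0, len(y), len(y))
    if len(np.unique(y[idx])) < 2:   # need both classes in the resample
        continue
    aucs.append(roc_auc_score(y[idx], scores[idx]))

lower, upper = np.percentile(aucs, [2.5, 97.5])
print(f"AUC 95% CI: ({lower:.3f}, {upper:.3f})")
```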
The Bias-Variance Trade-off
Bias and variance are inherent properties of estimators. Two useful diagnostic tools:
Learning Curves — show validation and training scores for varying training set sizes. If both scores converge to a low value, the model is under-fitting (high bias) and a more complex model is likely needed. If the training score stays well above the validation score, the model has high variance, and adding more data will likely help.
Model Complexity Plots — show how performance changes with model complexity, helping identify the sweet spot between underfitting and overfitting.
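A minimal learning-curve sketch with scikit-learn; the dataset and the logistic-regression model are arbitrary stand-ins for whatever estimator is being diagnosed:

```python
# Sketch of a learning-curve diagnostic with scikit-learn; the dataset
# and logistic-regression model are arbitrary illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes,
                     train_scores.mean(axis=1),
                     val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  val={va:.3f}")
# A large, persistent train-val gap suggests high variance (more data may
# help); both scores plateauing low suggests high bias (try a richer model).
```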