In machine learning, we often reduce classifier performance to a single number—accuracy, F1-score, or some other point metric. But a single number hides crucial information: it reflects model behavior at exactly one decision threshold.
Consider a fraud detection model that outputs probability scores between 0 and 1. If we classify transactions as fraudulent when the score exceeds 0.5, we get one confusion matrix. Change the threshold to 0.3, and we get an entirely different confusion matrix with different tradeoffs. Which threshold is 'correct'? The answer depends entirely on the cost of false positives versus false negatives—costs that may be unknown or variable.
The Receiver Operating Characteristic (ROC) curve elegantly solves this problem by evaluating the classifier's performance across all possible thresholds simultaneously. It answers a fundamental question: Given a classifier's predicted scores, how well does it separate positive from negative examples—independent of any specific threshold choice?
By the end of this page, you will understand the mathematical foundations of ROC curves, their historical origins in signal detection theory, the step-by-step algorithm for constructing them from predicted scores, their graphical interpretation, and the deep connection between ROC curves and the ranking quality of classifiers.
The ROC curve has a fascinating origin story that predates modern machine learning by decades. Understanding this history illuminates why the ROC curve works the way it does and what it fundamentally measures.
During World War II, radar operators faced a critical challenge: distinguishing genuine enemy aircraft (signals) from random noise (static, birds, weather patterns). This wasn't a deterministic problem—both signals and noise produced uncertain readings on radar screens.
The fundamental tradeoff: Setting a low detection threshold caught more enemy aircraft but also triggered more false alarms. Setting a high threshold reduced false alarms but missed genuine threats. Neither extreme was acceptable—missed aircraft meant casualties; too many false alarms meant scrambling fighters for nothing, wasting resources and inducing alert fatigue.
Psychophysicists and engineers developed Signal Detection Theory (SDT) to analyze this tradeoff systematically. They conceptualized the problem as two overlapping probability distributions: the distribution of readings produced by noise alone, and the distribution of readings produced when a genuine signal is present.
The overlap between these distributions determines decision difficulty. If they're completely separated, perfect detection is possible. If they heavily overlap, errors are inevitable.
The term 'Receiver Operating Characteristic' comes from radar receiver evaluation. Different radar systems had different 'operating characteristics'—their inherent ability to discriminate signals from noise. The ROC curve captured this discriminative capacity independent of any specific threshold, allowing fair comparison between receivers.
After WWII, ROC analysis spread to other domains:
Medical Diagnosis (1960s-1970s): Radiologists interpreting X-rays faced the same signal/noise problem—distinguishing tumors from benign abnormalities. ROC curves became standard for evaluating diagnostic tests.
Psychology (1970s-1980s): Recognition memory experiments used ROC analysis to separate true memory from guessing.
Machine Learning (1990s-present): As probabilistic classifiers became prevalent, ML researchers adopted ROC curves to evaluate ranking quality.
This cross-domain applicability isn't coincidental—ROC curves capture a fundamental property of any binary discrimination task, regardless of domain.
| Domain | Positive Class | Negative Class | Detection | False Alarm |
|---|---|---|---|---|
| Radar | Enemy aircraft | Noise/clutter | Hit | False alarm |
| Medicine | Disease present | Healthy | Sensitivity | 1 - Specificity |
| Machine Learning | Positive example | Negative example | True Positive Rate | False Positive Rate |
| Information Retrieval | Relevant document | Non-relevant | Recall | Fall-out |
Before constructing ROC curves, we must precisely define the quantities they display. These definitions form the bedrock upon which all ROC analysis rests.
Given a binary classifier and a labeled dataset, the confusion matrix tabulates four quantities:
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actual Positive | True Positive (TP) | False Negative (FN) |
| Actual Negative | False Positive (FP) | True Negative (TN) |
Let's define P = TP + FN (the number of actual positives) and N = FP + TN (the number of actual negatives).
The True Positive Rate measures the proportion of actual positives correctly identified:
TPR = TP / (TP + FN) = TP / P

Interpretation: Of all actual positive examples, what fraction did we correctly classify as positive?

Also known as:
- Sensitivity (medicine)
- Recall (information retrieval)
- Hit rate (signal detection)
- Power (statistics)

The False Positive Rate measures the proportion of actual negatives incorrectly classified as positive:

FPR = FP / (FP + TN) = FP / N = 1 - Specificity

Interpretation: Of all actual negative examples, what fraction did we incorrectly classify as positive?

Also known as:
- Fall-out (information retrieval)
- False alarm rate (signal detection)
- Type I error rate (statistics)
- 1 - Specificity (medicine)

Notice that TPR involves only actual positives (denominator is P), while FPR involves only actual negatives (denominator is N). This separation is crucial: TPR and FPR are independent of class balance. A dataset with 99% negatives has the same FPR definition as one with 50% negatives. This class-balance independence is a core strength of ROC analysis.
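As a concrete check of these definitions, here is a minimal sketch (plain NumPy, with hypothetical labels and thresholded predictions) that computes TPR and FPR directly from the confusion-matrix counts:

```python
import numpy as np

# Hypothetical true labels and thresholded predictions
y_true = np.array([1, 1, 1, 0, 0, 0, 0, 0])
y_pred = np.array([1, 1, 0, 1, 0, 0, 0, 0])

tp = np.sum((y_pred == 1) & (y_true == 1))  # correctly flagged positives
fn = np.sum((y_pred == 0) & (y_true == 1))  # missed positives
fp = np.sum((y_pred == 1) & (y_true == 0))  # false alarms
tn = np.sum((y_pred == 0) & (y_true == 0))  # correctly rejected negatives

tpr = tp / (tp + fn)  # = TP / P
fpr = fp / (fp + tn)  # = FP / N
print(f"TPR = {tpr:.2f}, FPR = {fpr:.2f}")  # TPR = 0.67, FPR = 0.20
```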
Most modern classifiers output a score (probability, decision function value, or other numeric measure) rather than a discrete class label. The score represents confidence that an example belongs to the positive class.
To convert scores to predictions, we choose a threshold τ:
Predicted class = Positive if score(x) ≥ τ
Negative if score(x) < τ
As τ varies from -∞ to +∞ (or 0 to 1 for probabilities):
- At τ = +∞ (above the maximum score), nothing is classified positive, so TPR = 0 and FPR = 0.
- At τ = -∞ (below the minimum score), everything is classified positive, so TPR = 1 and FPR = 1.
For intermediate thresholds, we get intermediate TPR and FPR values.
The ROC curve plots (FPR(τ), TPR(τ)) as τ varies across all possible values:
ROC: τ → (FPR(τ), TPR(τ)) ∈ [0,1] × [0,1]
This is a parametric curve with threshold as the parameter. Each point on the curve corresponds to a specific threshold choice, and the curve connects all such points.
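To make the parametric view concrete, the short sketch below (using small hypothetical score and label arrays) evaluates (FPR(τ), TPR(τ)) at a few thresholds by applying the decision rule above:

```python
import numpy as np

scores = np.array([0.95, 0.85, 0.75, 0.60, 0.40, 0.20])  # hypothetical scores
labels = np.array([1, 0, 1, 1, 0, 0])                    # hypothetical labels
P, N = labels.sum(), (1 - labels).sum()

for tau in [1.1, 0.9, 0.7, 0.5, 0.3, 0.0]:
    pred = (scores >= tau).astype(int)      # decision rule: positive if score >= tau
    tpr = ((pred == 1) & (labels == 1)).sum() / P
    fpr = ((pred == 1) & (labels == 0)).sum() / N
    print(f"tau = {tau:.1f} -> (FPR, TPR) = ({fpr:.2f}, {tpr:.2f})")
```

At the highest threshold the point is (0, 0); at the lowest it is (1, 1); intermediate thresholds trace out the points in between.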
Now we turn theory into practice with a precise algorithm for constructing ROC curves. The algorithm is surprisingly elegant once understood.
We have:
- A set of n examples with true binary labels yᵢ ∈ {0, 1}
- A classifier score sᵢ for each example (higher means more likely positive)
Let P = number of actual positives (yᵢ = 1) and N = number of actual negatives (yᵢ = 0).
```python
def construct_roc_curve(scores, labels):
    """
    Construct ROC curve from classifier scores and true labels.

    Parameters:
    -----------
    scores : array-like of shape (n_samples,)
        Classifier output scores (higher = more likely positive)
    labels : array-like of shape (n_samples,)
        True binary labels (0 or 1)

    Returns:
    --------
    fpr : array of FPR values
    tpr : array of TPR values
    thresholds : array of threshold values
    """
    import numpy as np

    # Convert to numpy arrays
    scores = np.array(scores)
    labels = np.array(labels)

    # Count positives and negatives
    P = np.sum(labels == 1)
    N = np.sum(labels == 0)

    # Sort examples by descending score
    sorted_indices = np.argsort(-scores)  # Descending order
    sorted_labels = labels[sorted_indices]
    sorted_scores = scores[sorted_indices]

    # Initialize counters and result lists
    tp = 0
    fp = 0
    tpr_list = [0.0]  # Start at (0,0)
    fpr_list = [0.0]
    thresholds = [sorted_scores[0] + 1]  # Threshold above all scores

    # Process examples in order of decreasing score
    for i in range(len(sorted_labels)):
        # "Lower" the threshold to include this example
        if sorted_labels[i] == 1:
            tp += 1
        else:
            fp += 1

        # Compute current TPR and FPR
        current_tpr = tp / P if P > 0 else 0
        current_fpr = fp / N if N > 0 else 0

        # Add point only at the last example of each tied score group
        # (handles ties correctly)
        if i == len(sorted_labels) - 1 or sorted_scores[i] != sorted_scores[i + 1]:
            tpr_list.append(current_tpr)
            fpr_list.append(current_fpr)
            # This point is achieved by thresholding at the current score
            # (classify positive when score >= sorted_scores[i])
            thresholds.append(sorted_scores[i])

    return np.array(fpr_list), np.array(tpr_list), np.array(thresholds)
```

The algorithm exploits a key insight: as we lower the threshold, we 'accept' examples in order of their scores. By sorting examples by descending score, we can efficiently compute TPR and FPR at each unique threshold.
Let's trace through an example:
| Example | Score | Label | After including... | TP | FP | TPR | FPR |
|---|---|---|---|---|---|---|---|
| - | (τ > 0.95) | - | Nothing | 0 | 0 | 0.00 | 0.00 |
| A | 0.95 | 1 | A | 1 | 0 | 0.33 | 0.00 |
| B | 0.85 | 0 | A, B | 1 | 1 | 0.33 | 0.33 |
| C | 0.75 | 1 | A, B, C | 2 | 1 | 0.67 | 0.33 |
| D | 0.60 | 1 | A, B, C, D | 3 | 1 | 1.00 | 0.33 |
| E | 0.40 | 0 | A, B, C, D, E | 3 | 2 | 1.00 | 0.67 |
| F | 0.20 | 0 | All | 3 | 3 | 1.00 | 1.00 |
Here P = 3 (positives: A, C, D) and N = 3 (negatives: B, E, F).
The ROC curve for this dataset passes through: (0,0) → (0, 0.33) → (0.33, 0.33) → (0.33, 0.67) → (0.33, 1.00) → (0.67, 1.00) → (1.00, 1.00)
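To verify the table, we can run the construct_roc_curve function defined above on these six examples (a usage sketch assuming that function is in scope):

```python
scores = [0.95, 0.85, 0.75, 0.60, 0.40, 0.20]   # examples A-F
labels = [1, 0, 1, 1, 0, 0]

fpr, tpr, thr = construct_roc_curve(scores, labels)
for f, t, th in zip(fpr, tpr, thr):
    print(f"threshold >= {th:.2f}: (FPR, TPR) = ({f:.2f}, {t:.2f})")
# Expected points: (0,0), (0,0.33), (0.33,0.33), (0.33,0.67),
# (0.33,1.00), (0.67,1.00), (1.00,1.00)
```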
The ROC construction algorithm runs in O(n log n) time due to sorting, plus O(n) for the single pass through sorted examples. Space complexity is O(n) for storing the curve points. For very large datasets, approximate methods exist that sample thresholds.
Tied scores—multiple examples with identical classifier outputs—require careful handling. The approach depends on whether you want a step function (discrete curve) or a continuous curve.
When multiple examples share the same score, they must all be classified the same way (all positive or all negative) for any given threshold. There's no threshold that separates them.
Consider a scenario in which several examples, some positive and some negative, all share the score 0.5. As the threshold crosses 0.5, every one of these tied examples flips from predicted negative to predicted positive at once. We jump from one point on the ROC curve to another with no intermediate points, because no threshold produces intermediate behavior.
The correct interpretation is that when the threshold crosses a tied score, the ROC curve jumps instantaneously to the next achievable point: the jump is vertical if the tied group contains only positives, horizontal if it contains only negatives, and moves both up and to the right if it contains a mix. This is mathematically accurate because there really is no threshold that achieves intermediate TPR/FPR values.
```python
def construct_roc_with_ties(scores, labels):
    """
    ROC construction with proper tie handling.

    When examples share scores, we either:
    1. Include all tied examples together (step function)
    2. Or interpolate linearly between (continuous approximation)
    """
    import numpy as np
    from collections import defaultdict

    scores = np.array(scores)
    labels = np.array(labels)

    P = np.sum(labels == 1)
    N = np.sum(labels == 0)

    # Group by score
    score_groups = defaultdict(lambda: {'pos': 0, 'neg': 0})
    for s, l in zip(scores, labels):
        if l == 1:
            score_groups[s]['pos'] += 1
        else:
            score_groups[s]['neg'] += 1

    # Sort scores descending
    unique_scores = sorted(score_groups.keys(), reverse=True)

    tpr_list = [0.0]
    fpr_list = [0.0]
    cumulative_tp = 0
    cumulative_fp = 0

    for score in unique_scores:
        group = score_groups[score]
        # Add all examples with this score at once
        cumulative_tp += group['pos']
        cumulative_fp += group['neg']

        tpr_list.append(cumulative_tp / P if P > 0 else 0)
        fpr_list.append(cumulative_fp / N if N > 0 else 0)

    return np.array(fpr_list), np.array(tpr_list)
```

Some implementations linearly interpolate between points when visualizing ROC curves. While visually smoother, this misrepresents achievable performance—no threshold actually produces those interpolated points. For rigorous analysis, use the step function representation.
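A quick illustration, using a small hypothetical dataset in which three examples share the score 0.5 (two positives and one negative), shows the single jump produced by the tied group. The sketch assumes construct_roc_with_ties is in scope:

```python
import numpy as np

scores = [0.9, 0.5, 0.5, 0.5, 0.2]   # hypothetical scores with a three-way tie at 0.5
labels = [1,   1,   1,   0,   0]

fpr, tpr = construct_roc_with_ties(scores, labels)
for f, t in zip(fpr, tpr):
    print(f"({f:.2f}, {t:.2f})")
# (0.00, 0.00), (0.00, 0.33), (0.50, 1.00), (1.00, 1.00):
# the tied group at 0.5 produces one jump both up and to the right.
```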
The ROC plot encodes rich information about classifier behavior. Let's decode what different curve shapes tell us.
Axes: the x-axis is the False Positive Rate (FPR) and the y-axis is the True Positive Rate (TPR), each ranging from 0 to 1.
Goal: Maximize TPR (go up) while minimizing FPR (stay left). The ideal point is (0, 1) — perfect classification.
Curve hugging the top-left corner: strong discrimination; high TPR is achievable at low FPR.
Curve following the diagonal: no discriminative ability; performance equivalent to random guessing.
Curve bowing below the diagonal: worse than random; the score ordering is systematically inverted, so flipping the predictions would improve performance.
'Staircase' appearance (jagged curve): typical of small test sets, where each individual example produces a visible step.
Smooth appearance: typical of large test sets with many distinct score values.
Every point above the diagonal represents a classifier that outperforms random guessing. The further into the upper-left corner, the better. A perfectly separable problem has a curve that goes straight up from (0,0) to (0,1), then right to (1,1) — forming an inverted 'L' shape.
Each point on the ROC curve corresponds to a specific threshold. Reading its coordinates gives the FPR (x) and the TPR (y) achieved at that threshold.
At point (0.1, 0.8): the threshold correctly identifies 80% of positives while misclassifying 10% of negatives as positive.
Moving right along the curve (lowering the threshold): more examples are classified positive, so both TPR and FPR increase or stay the same; moving left does the opposite.
The slope at any point indicates the local tradeoff: the ratio of true positives gained to false positives gained when slightly lowering the threshold. Steeper is better (more TP per FP gained).
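As a rough sketch of this idea (assuming fpr and tpr arrays from any of the construction routines above), the local tradeoff can be examined by computing the slope of each curve segment:

```python
import numpy as np

def segment_slopes(fpr, tpr):
    """Slope (delta TPR / delta FPR) of each ROC segment; inf for vertical segments."""
    d_tpr = np.diff(tpr)
    d_fpr = np.diff(fpr)
    # Vertical segments (no FPR change) have infinite slope: pure gain in TPR
    return np.where(d_fpr > 0, d_tpr / np.where(d_fpr > 0, d_fpr, 1.0), np.inf)

# Curve points from the worked example above
fpr = np.array([0.0, 0.0, 0.33, 0.33, 0.33, 0.67, 1.0])
tpr = np.array([0.0, 0.33, 0.33, 0.67, 1.0, 1.0, 1.0])
print(segment_slopes(fpr, tpr))  # steep (inf) early segments, flat (0) later ones
```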
A profound insight connects ROC curves to ranking quality: the ROC curve measures how well the classifier ranks positive examples above negative examples.
Consider the classifier as producing a ranking of all examples by their score. A perfect classifier would rank:
[All Positives] > [All Negatives]
Every positive has a higher score than every negative. The ROC curve for such a ranking goes (0,0) → (0,1) → (1,1).
A random ranker would intermix positives and negatives unpredictably:
[Mixed: P, N, P, N, P, N, ...]
Its ROC curve follows the diagonal.
The area under the ROC curve (AUC) equals the probability that a randomly chosen positive example is ranked higher than a randomly chosen negative example:
AUC = P(score(random positive) > score(random negative))
This is the Wilcoxon-Mann-Whitney statistic, connecting ROC to non-parametric statistics.
Imagine randomly selecting one positive and one negative example. The classifier assigns them scores. What's the probability the positive has a higher score?
This probability is exactly the AUC.
The ranking interpretation explains ROC's invariance properties:
Invariant to score monotonic transformations: Only the ranking matters, not absolute score values. Apply any strictly increasing function to scores—the ranking (and ROC curve) doesn't change.
Invariant to class balance: The ranking quality between positives and negatives doesn't depend on how many of each exist. A 90-10 imbalanced dataset produces the same ROC if the classifier ranks identically.
Interpretable probability: AUC as a probability is intuitive—"There's an 85% chance the model ranks a random positive above a random negative."
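As a quick sanity check of the first property, the sketch below (assuming scikit-learn is available, with randomly generated hypothetical data) applies a strictly increasing transformation to the scores and confirms the AUC is unchanged; the ranking-based computation that follows makes the probabilistic interpretation explicit.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
# Hypothetical scores: positives tend to score higher than negatives
scores = rng.normal(loc=labels.astype(float), scale=1.0)

auc_raw = roc_auc_score(labels, scores)
auc_transformed = roc_auc_score(labels, 1 / (1 + np.exp(-3 * scores)))  # monotone map

print(f"AUC (raw scores)         = {auc_raw:.4f}")
print(f"AUC (transformed scores) = {auc_transformed:.4f}")  # identical: only ranking matters
```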
```python
def auc_via_ranking(scores, labels):
    """
    Compute AUC directly from ranking interpretation.

    Count all positive-negative pairs and fraction
    where positive is ranked higher.
    """
    import numpy as np

    scores = np.array(scores)
    labels = np.array(labels)

    pos_scores = scores[labels == 1]
    neg_scores = scores[labels == 0]

    # Count pairs where positive is ranked higher
    correct_pairs = 0
    total_pairs = len(pos_scores) * len(neg_scores)

    for pos_score in pos_scores:
        for neg_score in neg_scores:
            if pos_score > neg_score:
                correct_pairs += 1
            elif pos_score == neg_score:
                correct_pairs += 0.5  # Tie counts as half

    return correct_pairs / total_pairs if total_pairs > 0 else 0.5

# Note: This O(P × N) algorithm is pedagogical.
# Efficient O(n log n) methods exist using sorting.
```

The shape of the ROC curve is directly determined by the score distributions for positive and negative classes. Understanding this connection provides deep insight into classifier behavior.
Imagine two probability distributions over the classifier's score: f₀(s), the score density for negative examples, and f₁(s), the score density for positive examples.
For a threshold τ:
TPR(τ) = ∫_τ^∞ f₁(s) ds (the fraction of the positive-score distribution above τ)
FPR(τ) = ∫_τ^∞ f₀(s) ds (the fraction of the negative-score distribution above τ)
As τ decreases from +∞ to -∞, both integrals go from 0 to 1.
The ROC curve's shape reflects how these distributions overlap:
Complete Separation (f₀ and f₁ don't overlap):
All negative scores < All positive scores. The ROC curve rises straight to (0, 1) and then across to (1, 1); perfect classification is achievable and AUC = 1.
Complete Overlap (f₀ = f₁):
Scores are independent of class. The ROC curve follows the diagonal and AUC = 0.5; no threshold does better than random guessing.
Partial Overlap (typical case):
The ROC curve bows above the diagonal, somewhere between the two extremes; more overlap pulls it toward the diagonal, less overlap pushes it toward the top-left corner.
Bimodal score distributions (common in ensembles that are 'confident') produce ROC curves with distinct flat regions. Unimodal distributions (common in simple models) produce smoother curves. The derivative of the ROC curve at any point relates to the likelihood ratio f₁(s)/f₀(s) at the corresponding threshold.
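To make the distributional view concrete, the sketch below assumes two Gaussian score distributions (a common textbook simplification, not a claim about real classifiers) and traces the theoretical ROC by sweeping the threshold:

```python
import numpy as np
from scipy.stats import norm

# Assumed score densities: negatives ~ N(0, 1), positives ~ N(1.5, 1)
mu_neg, mu_pos, sigma = 0.0, 1.5, 1.0

taus = np.linspace(-4, 6, 200)                 # sweep the threshold
fpr = norm.sf(taus, loc=mu_neg, scale=sigma)   # mass of f0 above tau
tpr = norm.sf(taus, loc=mu_pos, scale=sigma)   # mass of f1 above tau
# Plotting (fpr, tpr) would show a smooth curve bowing toward the top-left corner.

# For this binormal model, AUC has a closed form:
# AUC = P(S_pos > S_neg) = Phi((mu_pos - mu_neg) / (sigma * sqrt(2)))
auc = norm.cdf((mu_pos - mu_neg) / (sigma * np.sqrt(2)))
print(f"AUC for this separation: {auc:.3f}")   # about 0.86
```

Increasing the gap between the two means shrinks the overlap and pushes the curve (and the AUC) toward the perfect corner.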
While understanding the algorithm is crucial, in practice we use optimized library implementations. Let's see how to construct and visualize ROC curves using scikit-learn.
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_curve, roc_auc_score

# Generate synthetic dataset
X, y = make_classification(
    n_samples=2000,
    n_features=20,
    n_informative=10,
    n_redundant=5,
    n_classes=2,
    weights=[0.7, 0.3],  # Class imbalance
    random_state=42
)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

# Train two classifiers
lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

lr.fit(X_train, y_train)
rf.fit(X_train, y_train)

# Get probability scores for positive class
lr_probs = lr.predict_proba(X_test)[:, 1]
rf_probs = rf.predict_proba(X_test)[:, 1]

# Construct ROC curves
# Returns: FPR array, TPR array, Threshold array
lr_fpr, lr_tpr, lr_thresholds = roc_curve(y_test, lr_probs)
rf_fpr, rf_tpr, rf_thresholds = roc_curve(y_test, rf_probs)

# Calculate AUC
lr_auc = roc_auc_score(y_test, lr_probs)
rf_auc = roc_auc_score(y_test, rf_probs)

# Visualize
plt.figure(figsize=(10, 8))

plt.plot(lr_fpr, lr_tpr, label=f'Logistic Regression (AUC = {lr_auc:.3f})', linewidth=2)
plt.plot(rf_fpr, rf_tpr, label=f'Random Forest (AUC = {rf_auc:.3f})', linewidth=2)

# Reference line (random classifier)
plt.plot([0, 1], [0, 1], 'k--', label='Random (AUC = 0.500)')

plt.xlim([-0.01, 1.01])
plt.ylim([-0.01, 1.01])
plt.xlabel('False Positive Rate', fontsize=12)
plt.ylabel('True Positive Rate', fontsize=12)
plt.title('ROC Curve Comparison', fontsize=14)
plt.legend(loc='lower right', fontsize=11)
plt.grid(alpha=0.3)

plt.tight_layout()
plt.savefig('roc_comparison.png', dpi=150)
plt.show()
```

The roc_curve function returns three arrays:
- fpr: false positive rates, one per curve point
- tpr: true positive rates, one per curve point
- thresholds: the score thresholds at which those points are achieved, in decreasing order
Important note: the first threshold is set above the maximum score (older scikit-learn versions used max(scores) + 1; recent versions use np.inf) so that nothing is classified positive, producing the (0, 0) point. Thresholds then decrease along the arrays.
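A tiny example (the same hypothetical scores and labels as the worked example earlier) makes the output format easy to inspect:

```python
import numpy as np
from sklearn.metrics import roc_curve

y_true  = np.array([1, 0, 1, 1, 0, 0])
y_score = np.array([0.95, 0.85, 0.75, 0.60, 0.40, 0.20])

# drop_intermediate=False keeps every achievable point, matching the table above
fpr, tpr, thresholds = roc_curve(y_true, y_score, drop_intermediate=False)
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold {th}: (FPR, TPR) = ({f:.2f}, {t:.2f})")
# The first row is the "classify nothing as positive" point (0, 0).
```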
Often we want to identify which threshold best balances TPR and FPR:
```python
# Method 1: Youden's J statistic (maximize TPR - FPR)
j_scores = lr_tpr - lr_fpr
best_idx_j = np.argmax(j_scores)
best_threshold_j = lr_thresholds[best_idx_j]
print(f"Youden's J: Best threshold = {best_threshold_j:.3f}")
print(f"  TPR = {lr_tpr[best_idx_j]:.3f}, FPR = {lr_fpr[best_idx_j]:.3f}")

# Method 2: Closest to (0, 1) — Euclidean distance
distances = np.sqrt(lr_fpr**2 + (1 - lr_tpr)**2)
best_idx_dist = np.argmin(distances)
best_threshold_dist = lr_thresholds[best_idx_dist]
print(f"\nClosest to (0,1): Best threshold = {best_threshold_dist:.3f}")
print(f"  TPR = {lr_tpr[best_idx_dist]:.3f}, FPR = {lr_fpr[best_idx_dist]:.3f}")

# Method 3: Fixed FPR constraint (find threshold for FPR ≤ 0.1)
target_fpr = 0.1
valid_indices = np.where(lr_fpr <= target_fpr)[0]
if len(valid_indices) > 0:
    # Choose the one with highest TPR among valid
    best_constrained_idx = valid_indices[np.argmax(lr_tpr[valid_indices])]
    print(f"\nConstrained (FPR ≤ {target_fpr}):")
    print(f"  Threshold = {lr_thresholds[best_constrained_idx]:.3f}")
    print(f"  TPR = {lr_tpr[best_constrained_idx]:.3f}")
    print(f"  FPR = {lr_fpr[best_constrained_idx]:.3f}")
```

There is no universally 'best' threshold. Youden's J maximizes balanced accuracy; minimizing distance to (0,1) treats TPR and FPR equally. In practice, business constraints (e.g., 'we can only handle 5% false alarm rate') or cost matrices (asymmetric misclassification costs) should guide threshold selection.
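When misclassification costs are known, even roughly, a cost-based selection can replace the geometric heuristics above. The sketch below assumes hypothetical per-error costs and reuses y_test, lr_fpr, lr_tpr, and lr_thresholds from the earlier example:

```python
# Hypothetical, assumed costs: a missed positive is 10x worse than a false alarm
cost_fn = 10.0   # cost of a false negative
cost_fp = 1.0    # cost of a false positive

n_pos = np.sum(y_test == 1)
n_neg = np.sum(y_test == 0)

# Expected total cost at each candidate threshold on the ROC curve
expected_cost = cost_fn * (1 - lr_tpr) * n_pos + cost_fp * lr_fpr * n_neg
best_idx_cost = np.argmin(expected_cost)

print(f"Cost-minimizing threshold = {lr_thresholds[best_idx_cost]:.3f}")
print(f"  TPR = {lr_tpr[best_idx_cost]:.3f}, FPR = {lr_fpr[best_idx_cost]:.3f}")
```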
We've built a comprehensive understanding of ROC curve construction from the ground up. Let's consolidate the key insights:
- A single point metric reflects one threshold; the ROC curve evaluates a classifier across all thresholds at once.
- TPR and FPR are computed separately over positives and negatives, making the curve independent of class balance.
- The curve is constructed by sorting examples by descending score and sweeping the threshold, with tied scores producing jumps rather than intermediate points.
- The ROC curve measures ranking quality: the area under it equals the probability that a random positive is scored above a random negative.
- Threshold selection is a separate decision that should reflect costs and operational constraints, not the curve alone.
What's next:
Now that we can construct ROC curves, the next page explores AUC (Area Under the Curve) — how to summarize an entire ROC curve as a single number, its statistical interpretation, its strengths and limitations, and when it's the right (or wrong) metric to use.
You now understand the mathematical foundations, historical context, construction algorithm, and graphical interpretation of ROC curves. This knowledge forms the foundation for the deeper evaluation concepts that follow.