In the landscape of machine learning failures, a striking pattern emerges: the majority of ML projects fail not because of algorithmic limitations or data quality issues, but because the problem was poorly defined from the outset. Industry surveys consistently report that 60-80% of ML projects never reach production, and the root cause often traces back to the earliest phase—problem scoping.
Problem scoping is the disciplined practice of translating ambiguous business needs into precisely formulated machine learning problems. It encompasses understanding stakeholder objectives, assessing technical feasibility, defining success metrics, and establishing the boundaries of what ML can and cannot deliver. It is simultaneously the most underrated and most consequential phase of any ML project.
This page provides a comprehensive framework for problem scoping that distinguishes successful ML practitioners from those who build sophisticated solutions to the wrong problems.
By completing this page, you will be able to: (1) Translate vague business requests into precisely formulated ML problems, (2) Assess whether ML is the appropriate solution for a given problem, (3) Define measurable success criteria that align with business objectives, (4) Identify and mitigate common scoping pitfalls, and (5) Communicate technical feasibility to non-technical stakeholders with clarity and precision.
Every ML project begins with a business need, but stakeholders rarely articulate that need in ML-compatible terms. A product manager might say, "We want to reduce customer churn," while a finance director requests "better fraud detection." These statements describe outcomes, not problems. The ML practitioner's first task is to excavate the underlying structure.
The Translation Problem:
Consider the gap between business language and ML formulation:
| Business Ask | Hidden Complexity | ML Formulation Required |
|---|---|---|
| "Reduce churn" | When is churn measured? What actions are possible? | Binary classification: Will user churn in next X days? |
| "Better recommendations" | Better how? Engagement? Revenue? Diversity? | Multi-objective ranking with explicit tradeoffs |
| "Automate support" | What fraction? Edge case handling? | Multi-class classification + confidence thresholding |
| "Detect anomalies" | What constitutes an anomaly? False positive tolerance? | Unsupervised detection with domain-specific thresholds |
The translation process requires extensive stakeholder dialogue, not just surface-level requirements gathering.
Different stakeholders often have conflicting definitions of success. Marketing wants engagement; Finance wants revenue; Trust & Safety wants risk reduction. Failing to surface and resolve these conflicts early leads to models that satisfy no one. Explicit prioritization conversations are mandatory.
Not every problem should be solved with machine learning. Before committing to an ML approach, rigorous feasibility assessment prevents wasted effort and misallocated resources. The assessment examines four dimensions: data feasibility, technical feasibility, organizational feasibility, and economic feasibility.
Data Feasibility:
The fundamental question: Does the signal necessary to solve this problem exist in available data?
This assessment examines whether the necessary signal actually exists, whether labels can be obtained, and whether data volume and history are sufficient.
Technical Feasibility:
Even with perfect data, technical constraints such as latency, throughput, cost, or explainability requirements may make the problem intractable.
Organizational Feasibility:
Technical solutions require organizational support: teams willing to act on model outputs, owners for the data pipelines, and processes for handling model errors.
Economic Feasibility:
The final arbiter—does the ROI justify the investment?
A 2% accuracy improvement that costs $500K to achieve and delivers $50K annual value is a poor investment. Always quantify the expected impact.
For an ML project to justify its complexity, it should deliver at least 10x the value of a simpler alternative. If a rule-based system achieves 85% of the benefit at 10% of the cost, that's often the right choice. ML should be reserved for problems where the performance gap is substantial and economically meaningful.
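To make the economic screen concrete, here is a back-of-envelope comparison in the spirit of the numbers above; the `simple_roi` helper and all dollar figures, benefit fractions, and the three-year horizon are illustrative assumptions, not values from a real project.

```python
# Back-of-envelope ROI comparison for an ML project vs. a simpler baseline.
# All numbers are illustrative assumptions.

def simple_roi(annual_value: float, build_cost: float, annual_run_cost: float,
               horizon_years: int = 3) -> float:
    """Net value over the horizon divided by total cost (no discounting)."""
    total_value = annual_value * horizon_years
    total_cost = build_cost + annual_run_cost * horizon_years
    return (total_value - total_cost) / total_cost

# Hypothetical rules-based baseline: 85% of the achievable benefit at 10% of the cost
rules_roi = simple_roi(annual_value=0.85 * 400_000, build_cost=50_000, annual_run_cost=10_000)

# Hypothetical ML system: full benefit, much higher build and operating cost
ml_roi = simple_roi(annual_value=400_000, build_cost=500_000, annual_run_cost=80_000)

print(f"Rules baseline ROI over 3 years: {rules_roi:.1f}x")
print(f"ML system ROI over 3 years:      {ml_roi:.1f}x")
# If the rules baseline dominates, the ML project needs a much stronger value case.
```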
Once feasibility is established, the next challenge is precise problem formulation. This is where business objectives become mathematical specifications. The formulation defines the task type, the inputs available at inference time, the exact output and label, and the constraints the system must satisfy.
Task Type Selection:
The same business problem can often be formulated as different ML tasks. Consider "predicting customer lifetime value":
| Formulation | Task Type | Output | Tradeoffs |
|---|---|---|---|
| Exact LTV prediction | Regression | Dollar amount | High variance, requires revenue history |
| LTV tier classification | Multi-class | High/Medium/Low | Lower granularity, more robust |
| High-value customer detection | Binary | Yes/No | Simplest, loses LTV ordering |
| LTV percentile ranking | Learning to Rank | Relative ordering | No absolute values, easier calibration |
Each formulation has different data requirements, modeling approaches, and business implications. The choice should be driven by how the output will be used, not modeling convenience.
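As a minimal sketch of how one raw outcome can back each row of the table, the snippet below derives four alternative training targets from a hypothetical `revenue_12m` column; the column name, tier cut-offs, and high-value threshold are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer observed revenue over a fixed horizon
df = pd.DataFrame({"customer_id": range(6),
                   "revenue_12m": [0.0, 35.0, 120.0, 480.0, 15.0, 950.0]})

# Regression: predict the dollar amount directly
df["target_regression"] = df["revenue_12m"]

# Multi-class: Low / Medium / High tiers (cut-offs are illustrative)
df["target_tier"] = pd.cut(df["revenue_12m"], bins=[-np.inf, 50, 300, np.inf],
                           labels=["low", "medium", "high"])

# Binary: "high-value" detection above an assumed threshold
df["target_high_value"] = (df["revenue_12m"] >= 300).astype(int)

# Ranking: revenue percentile, used as a relative-ordering target
df["target_percentile"] = df["revenue_12m"].rank(pct=True)

print(df)
```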
```python
# Problem Formulation Template
# A rigorous specification that forces clarity

from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum


class TaskType(Enum):
    BINARY_CLASSIFICATION = "binary_classification"
    MULTICLASS_CLASSIFICATION = "multiclass_classification"
    REGRESSION = "regression"
    RANKING = "ranking"
    SEQUENCE_LABELING = "sequence_labeling"
    GENERATION = "generation"
    CLUSTERING = "clustering"
    ANOMALY_DETECTION = "anomaly_detection"


@dataclass
class MLProblemFormulation:
    """
    Comprehensive problem formulation document.
    Forces explicit consideration of all critical dimensions.
    """
    # Business Context
    business_objective: str              # What business outcome is targeted?
    success_definition: str              # How will success be measured in business terms?
    stakeholders: List[str]              # Who are the decision-makers?

    # Task Specification
    task_type: TaskType
    input_features: List[str]            # Features available at inference
    inference_time_constraints: List[str]  # What cannot be computed in real-time
    output_specification: str            # Exact output format and semantics

    # Data Specification
    training_data_source: str
    label_definition: str                # Precise definition of positive/negative or target
    label_acquisition_method: str        # How are labels obtained?
    expected_data_volume: int
    data_freshness_requirement: str      # How recent must training data be?

    # Constraints
    latency_requirement_ms: Optional[int]
    minimum_acceptable_performance: Dict[str, float]  # metric: threshold
    explainability_requirement: str      # None, global, local, regulatory
    privacy_constraints: List[str]

    # Evaluation
    primary_metric: str
    secondary_metrics: List[str]
    evaluation_methodology: str          # Cross-validation, temporal split, etc.
    offline_online_gap_risk: str         # Known risks of offline-online mismatch

    def validate(self) -> List[str]:
        """
        Validates completeness of problem formulation.
        Returns list of issues that require resolution.
        """
        issues = []

        if not self.label_definition:
            issues.append("Label definition is missing - critical for alignment")

        if not self.minimum_acceptable_performance:
            issues.append("No minimum performance threshold defined")

        if self.task_type == TaskType.BINARY_CLASSIFICATION:
            if "precision" not in self.primary_metric.lower() and \
               "recall" not in self.primary_metric.lower():
                issues.append(
                    "Binary classification without precision/recall "
                    "consideration - error costs may be unbalanced"
                )

        if not self.offline_online_gap_risk:
            issues.append(
                "Offline-online gap risk not assessed - "
                "distribution shift may invalidate offline metrics"
            )

        return issues


# Example: Churn Prediction Problem Formulation
churn_formulation = MLProblemFormulation(
    # Business Context
    business_objective="Reduce voluntary subscription cancellations by identifying at-risk users for proactive intervention",
    success_definition="20% reduction in monthly churn rate among users targeted by the retention campaign",
    stakeholders=["Product (user experience)", "Marketing (retention campaigns)", "Finance (revenue impact)"],

    # Task Specification
    task_type=TaskType.BINARY_CLASSIFICATION,
    input_features=[
        "days_since_last_active",
        "weekly_session_count_4w",
        "feature_adoption_score",
        "support_ticket_count_30d",
        "subscription_tenure_months",
        "payment_method_risk_score"
    ],
    inference_time_constraints=["LTV prediction (requires batch computation)"],
    output_specification="Probability of churn within next 30 days, calibrated [0,1]",

    # Data Specification
    training_data_source="User activity logs + subscription events",
    label_definition="User initiated cancellation or failed renewal within 30-day window after prediction point",
    label_acquisition_method="Historical observation with 30-day wait period",
    expected_data_volume=500000,  # labeled examples
    data_freshness_requirement="Training data no older than 90 days due to product changes",

    # Constraints
    latency_requirement_ms=50,  # Batch scoring acceptable
    minimum_acceptable_performance={
        "precision_at_10pct_recall": 0.30,  # Among users we target, 30%+ actually would churn
        "auc_roc": 0.75
    },
    explainability_requirement="Local explanations for customer support use case",
    privacy_constraints=["No direct access to message content", "GDPR right-to-explanation"],

    # Evaluation
    primary_metric="Precision at 10% recall (matches intervention capacity)",
    secondary_metrics=["AUC-ROC", "Calibration error", "Lift at top decile"],
    evaluation_methodology="Temporal train-test split (train < July, test = July-August)",
    offline_online_gap_risk="Users receiving retention offers may behave differently than historical patterns"
)

# Validate the formulation
issues = churn_formulation.validate()
if issues:
    print("Formulation issues to resolve:")
    for issue in issues:
        print(f"  - {issue}")
else:
    print("Problem formulation complete and validated")
```

The Critical Role of Label Definition:
The single most consequential decision in problem formulation is the label definition. Ambiguity here propagates through every subsequent step:
Consider "fraud detection":
Each definition produces a different training set, optimizes for different behavior, and has different performance characteristics. There is no objectively correct definition—only business-appropriate definitions. The choice must be explicit and documented.
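To make that concrete, here is a small, hypothetical sketch in which two candidate label definitions (confirmed chargeback versus manual reviewer decision) are applied to the same transactions and yield different training sets; the field names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical transaction outcomes; fields and values are illustrative only.
tx = pd.DataFrame({
    "tx_id": [1, 2, 3, 4, 5],
    "chargeback_within_90d": [True, False, False, True, False],
    "reviewer_marked_fraud":  [True, True,  False, False, False],
})

# Definition A: fraud = confirmed chargeback (late-arriving but low-noise label)
tx["label_chargeback"] = tx["chargeback_within_90d"].astype(int)

# Definition B: fraud = manual reviewer decision (fast but reflects reviewer policy and bias)
tx["label_reviewer"] = tx["reviewer_marked_fraud"].astype(int)

# The two definitions disagree on some transactions -> different training sets
disagreement = (tx["label_chargeback"] != tx["label_reviewer"]).mean()
print(f"Positive rate (chargeback): {tx['label_chargeback'].mean():.0%}")
print(f"Positive rate (reviewer):   {tx['label_reviewer'].mean():.0%}")
print(f"Disagreement rate:          {disagreement:.0%}")
```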
When the true objective cannot be directly measured, we use proxies. But proxies create gaps. Optimizing for click-through rate as a proxy for user satisfaction produces clickbait. Optimizing for time-on-site as a proxy for engagement produces addiction loops. Always document the gap between your measurable proxy and the true business objective, and implement guardrails against proxy gaming.
A well-scoped ML problem has explicit boundaries—what is included and what is deliberately excluded. Without boundaries, scope creep transforms a deliverable project into an endless research odyssey.
Defining What's In Scope:
Defining What's Out of Scope:
Explicitly listing exclusions prevents misunderstandings about what the first version will and will not deliver.
Constraint Documentation:
Every ML system operates within constraints that shape the solution space:
| Constraint Type | Examples | Impact on Solution |
|---|---|---|
| Latency | < 50ms P99 | Limits model complexity, may require distillation |
| Throughput | 10K inferences/second | Batch processing, GPU requirements |
| Memory | < 500MB model size | Quantization, pruning, smaller architectures |
| Cost | $X per 1000 inferences | Simpler models, caching strategies |
| Privacy | No PII in features | Feature engineering constraints, federated approaches |
| Explainability | Regulatory explanations required | Interpretable models or post-hoc explanations |
| Availability | 99.9% uptime | Fallback mechanisms, graceful degradation |
| Freshness | Updates within 1 hour | Streaming pipelines, incremental learning |
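One lightweight way to treat these constraints as a gate rather than an afterthought is to measure a candidate model against an explicit budget before comparing accuracy. The sketch below does this for size and latency; the budget values are placeholders and the model is a synthetic stand-in for a real candidate.

```python
import pickle
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative constraint budget (placeholder values, not recommendations)
MAX_MODEL_SIZE_MB = 500
MAX_P99_LATENCY_MS = 50

# Candidate model trained on synthetic data, standing in for a real candidate
X = np.random.rand(10_000, 20)
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Size check: serialized artifact size
size_mb = len(pickle.dumps(model)) / 1e6

# Latency check: P99 over repeated single-row predictions
latencies = []
row = X[:1]
for _ in range(500):
    start = time.perf_counter()
    model.predict_proba(row)
    latencies.append((time.perf_counter() - start) * 1000)
p99_ms = float(np.percentile(latencies, 99))

print(f"Model size: {size_mb:.2f} MB (budget {MAX_MODEL_SIZE_MB} MB)")
print(f"P99 latency: {p99_ms:.2f} ms (budget {MAX_P99_LATENCY_MS} ms)")
print("PASS" if size_mb <= MAX_MODEL_SIZE_MB and p99_ms <= MAX_P99_LATENCY_MS else "FAIL")
```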
Minimum Viable Model (MVM):
Just as products have MVPs, ML projects should define an MVM—the simplest model that delivers value. The MVM establishes feasibility, produces early baseline metrics, and exposes data and pipeline issues before heavy investment.
The MVM should be simple to build, fast to evaluate, and measured against the same metrics planned for later versions.
Example MVM progression:
MVM: Logistic regression on 5 hand-crafted features
→ Establishes feasibility and baseline metrics
V1: Gradient boosted trees on 50 engineered features
→ Improves performance, validates feature value
V2: Neural network with embeddings
→ Pushes performance ceiling, requires more infrastructure
V3: Ensemble with online learning
→ Addresses distribution shift, production-grade
Each increment should have explicit performance targets that justify the added complexity.
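As a rough illustration of what an MVM baseline looks like in code, the sketch below trains a logistic regression on a few synthetic stand-in features and reports AUC-ROC; the data, feature semantics, and split are assumptions made only to show the shape of the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic stand-in for five hand-crafted features (semantics are illustrative)
n = 5_000
X = rng.normal(size=(n, 5))  # e.g. tenure, recent sessions, tickets, adoption, payment risk
logit = 0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.3 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Simple ordered split standing in for a temporal train/test split
split = int(0.8 * n)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

mvm = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, mvm.predict_proba(X_test)[:, 1])
print(f"MVM baseline AUC-ROC: {auc:.3f}")
# Any later, more complex version must beat this number by enough to justify its cost.
```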
A well-scoped ML problem can be completely described in a single page. If your problem statement requires extensive documentation to be understood, it's either too complex (break it down) or too vague (add specificity). The one-pager forces clarity and enables rapid alignment across stakeholders.
Before any modeling begins, success must be defined in measurable terms. This requires distinguishing between model metrics (how the model performs technically) and business metrics (impact on business outcomes).
Model Metrics (Offline Evaluation):
These are the traditional ML performance measures computed on held-out test sets:
| Task Type | Common Metrics | Considerations |
|---|---|---|
| Binary Classification | AUC-ROC, Precision@K, Recall@K, F1 | AUC insensitive to class imbalance; choose based on decision threshold |
| Multi-class | Macro/Micro F1, Confusion Matrix, Top-K Accuracy | Class weights matter for imbalanced classes |
| Regression | MAE, RMSE, MAPE, R² | RMSE penalizes large errors; MAE more robust to outliers |
| Ranking | NDCG, MRR, MAP | Position-weighted; critical for recommendation systems |
| Calibration | Brier Score, ECE | Essential when probabilities drive decisions |
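Because calibration metrics are reported less uniformly than AUC or F1, a minimal expected calibration error (ECE) implementation is sketched below; the equal-width binning shown is one common convention, not the only one.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """ECE with equal-width probability bins: weighted mean |accuracy - confidence| per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0.0:
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob > lo) & (y_prob <= hi)
        if mask.sum() == 0:
            continue
        avg_confidence = y_prob[mask].mean()
        avg_accuracy = y_true[mask].mean()  # fraction of positives in the bin
        ece += (mask.sum() / len(y_prob)) * abs(avg_accuracy - avg_confidence)
    return ece

# Toy example: probabilities that are systematically overconfident
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.3, 0.4, 0.9, 0.2, 0.95, 0.85, 0.35, 0.9]
print(f"ECE: {expected_calibration_error(y_true, y_prob):.3f}")
```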
Business Metrics (Online Evaluation):
These connect model performance to business value:
| Business Metric | Definition | Connection to Model |
|---|---|---|
| Revenue impact | Incremental revenue from model decisions | Model precision → Correct targeting → Revenue |
| Cost reduction | Savings from automation vs. manual | Model coverage × Accuracy → Reduced manual review |
| User engagement | Downstream user behavior changes | Recommendation quality → User satisfaction → Engagement |
| Risk reduction | Prevented losses from adverse events | Detection rate × True precision → Prevented fraud |
The Metric Selection Process:
Choosing the right primary metric requires understanding the decision-making context:
Identify the operating point — Where on the precision-recall curve will the system operate?
Match metric to business cost — Translate error types to dollar values
Consider calibration requirements — If probabilities are used directly in downstream calculations (pricing, risk scoring), calibration metrics are essential
Account for distribution shift — Metrics that are robust to distribution changes (normalized metrics, rank-based metrics) may be preferable in volatile domains
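The sketch below ties the first two steps together: it sweeps candidate decision thresholds and selects the one with the lowest expected error cost, using assumed per-error dollar costs that would in practice come from the business owners.

```python
import numpy as np

def pick_threshold_by_cost(y_true, y_prob, cost_fp: float, cost_fn: float):
    """Choose the decision threshold that minimizes expected error cost per example."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best = (None, np.inf)
    for t in np.linspace(0.05, 0.95, 19):
        pred = (y_prob >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        cost = (fp * cost_fp + fn * cost_fn) / len(y_true)
        if cost < best[1]:
            best = (t, cost)
    return best

# Toy scores; the $5 false-positive and $100 false-negative costs are assumptions
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.4, 0.05, 0.8, 0.15])
threshold, cost = pick_threshold_by_cost(y_true, y_prob, cost_fp=5.0, cost_fn=100.0)
print(f"Chosen threshold: {threshold:.2f}, expected cost per example: ${cost:.2f}")
```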
```python
from dataclasses import dataclass
from typing import Dict, List, Callable
import numpy as np


@dataclass
class MetricDefinition:
    """Complete specification of an evaluation metric."""
    name: str
    description: str
    formula: str                  # LaTeX or prose explanation
    compute_fn: Callable          # Actual computation
    higher_is_better: bool
    requires_probabilities: bool
    threshold_dependent: bool
    business_interpretation: str


@dataclass
class SuccessCriteria:
    """
    Comprehensive success criteria for an ML project.
    Separates model-level from business-level metrics.
    """
    # Model-level success (offline metrics)
    primary_model_metric: MetricDefinition
    primary_threshold: float
    secondary_model_metrics: Dict[str, float]  # metric_name: minimum_threshold

    # Business-level success (online metrics)
    primary_business_metric: str
    business_calculation: str     # How model performance translates to business value
    expected_business_impact: str

    # Guardrail constraints (must not violate)
    guardrails: Dict[str, float]  # metric: maximum_acceptable_violation

    # Statistical significance requirements
    minimum_sample_size: int      # For A/B testing
    confidence_level: float
    minimum_detectable_effect: float

    def generate_success_document(self) -> str:
        """Generates a stakeholder-readable success criteria document."""
        doc = f"""MODEL SUCCESS CRITERIA
======================

PRIMARY METRIC: {self.primary_model_metric.name}
Threshold for success: {self.primary_model_metric.name} >= {self.primary_threshold}

Business interpretation: {self.primary_model_metric.business_interpretation}

SECONDARY METRICS (must also meet):
"""
        for metric, threshold in self.secondary_model_metrics.items():
            doc += f"  - {metric} >= {threshold}\n"

        doc += "\nGUARDRAILS (must not violate):\n"
        for metric, max_violation in self.guardrails.items():
            doc += f"  - {metric}: maximum acceptable = {max_violation}\n"

        doc += f"""
BUSINESS IMPACT:
Primary business metric: {self.primary_business_metric}
Calculation: {self.business_calculation}
Expected impact: {self.expected_business_impact}

A/B TEST REQUIREMENTS:
- Minimum sample size: {self.minimum_sample_size:,}
- Confidence level: {self.confidence_level * 100}%
- Minimum detectable effect: {self.minimum_detectable_effect * 100}%
"""
        return doc


# Example: Fraud Detection Success Criteria
def compute_precision_at_recall(y_true, y_prob, target_recall=0.80):
    """Compute precision at a fixed recall threshold."""
    from sklearn.metrics import precision_recall_curve
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # Find threshold that achieves target recall
    valid_idx = np.where(recall >= target_recall)[0]
    if len(valid_idx) == 0:
        return 0.0
    best_idx = valid_idx[np.argmax(precision[valid_idx])]
    return precision[best_idx]


fraud_success_criteria = SuccessCriteria(
    # Model metrics
    primary_model_metric=MetricDefinition(
        name="Precision at 80% Recall",
        description="Among all predicted fraud cases at the threshold that catches 80% of actual fraud, what fraction are truly fraudulent?",
        formula="P(fraud | predicted_fraud) when threshold set to achieve R(fraud) = 0.80",
        compute_fn=compute_precision_at_recall,
        higher_is_better=True,
        requires_probabilities=True,
        threshold_dependent=True,
        business_interpretation="At our manual review capacity, 50%+ of flagged transactions should be actual fraud to justify investigation cost"
    ),
    primary_threshold=0.50,  # 50% precision at 80% recall
    secondary_model_metrics={
        "auc_roc": 0.90,
        "calibration_error": 0.05,  # Maximum expected calibration error
    },

    # Business metrics
    primary_business_metric="Net Fraud Loss Reduction",
    business_calculation="(Caught fraud - False positive cost) - (Model operation cost)",
    expected_business_impact="$2M annual reduction in fraud losses; $200K review efficiency gain",

    # Guardrails
    guardrails={
        "customer_friction_rate": 0.02,  # No more than 2% of legitimate transactions flagged
        "review_queue_time_hours": 4,    # Fraud queue must clear within 4 hours
    },

    # A/B test requirements
    minimum_sample_size=50000,           # transactions per variant
    confidence_level=0.95,
    minimum_detectable_effect=0.10       # 10% relative improvement
)

print(fraud_success_criteria.generate_success_document())
```

Offline model metrics often don't translate directly to online business impact. A model with 5% higher AUC may produce only 1% business improvement—or 20% improvement—depending on the decision-making context. Always plan for online experimentation (A/B tests) to measure true business impact, and treat offline metrics as necessary but not sufficient conditions for success.
Experience reveals recurring patterns of scoping failures. Understanding these anti-patterns protects against repeating common mistakes.
Pitfall 1: Solution Searching for a Problem
"We want to use deep learning somewhere. Find a use case."
Technology-first thinking inverts the correct order. When the goal is to deploy a particular technique rather than solve a business problem, the result is unnecessary complexity, a poor fit between tool and need, and little measurable business impact.
Correction: Always start with the business problem. The technique should be selected to serve the problem, not the reverse.
Pitfall 2: Scope Creep by Committee
"Can we also add X? What about Y? Everyone wants Z."
Well-intentioned stakeholders continuously add requirements until the project becomes impossible. Each addition seems minor but the total becomes undeliverable.
Correction: Freeze scope after the scoping phase. New requirements go to V2. If something truly must be added, something else must be removed. Zero-sum requirement changes enforce discipline.
Pitfall 3: The Single Metric Obsession
"Just optimize for accuracy. That's all that matters."
Real systems have multiple objectives that trade off against each other. A spam filter optimized purely for precision catches no spam; optimized purely for recall, it blocks legitimate email. Single-metric optimization ignores constraints and produces unusable systems.
Correction: Define a primary metric for optimization but add guardrail constraints that cannot be violated. The objective becomes: maximize primary metric subject to constraint satisfaction.
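A minimal way to encode "maximize the primary metric subject to guardrails" is to filter candidate models by their guardrail metrics before ranking on the primary one; the candidate numbers, metric names, and thresholds below are placeholders for illustration.

```python
# Candidate models evaluated offline; all numbers are placeholders.
candidates = [
    {"name": "model_a", "primary_auc": 0.86, "false_block_rate": 0.030, "p99_latency_ms": 40},
    {"name": "model_b", "primary_auc": 0.84, "false_block_rate": 0.015, "p99_latency_ms": 35},
    {"name": "model_c", "primary_auc": 0.88, "false_block_rate": 0.012, "p99_latency_ms": 120},
]

# Guardrails that must not be violated (assumed limits)
guardrails = {"false_block_rate": 0.02, "p99_latency_ms": 50}

feasible = [c for c in candidates
            if all(c[metric] <= limit for metric, limit in guardrails.items())]

if feasible:
    best = max(feasible, key=lambda c: c["primary_auc"])
    print(f"Selected {best['name']} with AUC {best['primary_auc']}")
else:
    print("No candidate satisfies the guardrails; revisit the scoping constraints")
```

Note that the highest-AUC candidate is not selected here: it violates a guardrail, so a slightly weaker but constraint-satisfying model wins.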
Pitfall 4: Ignoring the Human Loop
ML systems rarely operate autonomously. They typically augment human decision-making. Scoping that ignores the human interface produces outputs nobody acts on, formats that don't fit the existing workflow, and undefined behavior when model confidence is low.
Correction: Map the human workflow explicitly. Where does the model output enter? Who acts on it? What format do they need? What happens when confidence is low? Design the model to serve the human, not replace them.
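One concrete pattern for the low-confidence question is a routing rule: act automatically only when the model is confident, and send ambiguous cases to a human queue. The thresholds in this sketch are assumptions that would be set jointly with the reviewing team.

```python
def route_prediction(prob: float, auto_low: float = 0.05, auto_high: float = 0.90) -> str:
    """Route a calibrated probability to an action, using illustrative thresholds."""
    if prob >= auto_high:
        return "auto_action"    # e.g. block transaction / trigger retention offer
    if prob <= auto_low:
        return "no_action"      # confidently negative, no human time spent
    return "human_review"       # ambiguous cases go to the human queue

for p in [0.02, 0.40, 0.97]:
    print(f"p={p:.2f} -> {route_prediction(p)}")
```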
Scoping typically focuses on building the model, but 80% of ML cost is in maintenance. Distribution shift, data pipeline failures, model degradation, and retraining cycles are predictable. A scoping document that doesn't address ongoing maintenance is incomplete. Include: Who monitors the system? How often is the model retrained? What triggers retraining? What are the fallback mechanisms?
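As one example of making "what triggers retraining" concrete, the sketch below computes a population stability index (PSI) for a single feature and flags retraining when it exceeds a commonly cited, but still heuristic, alarm level.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins: int = 10) -> float:
    """PSI between a training-time feature distribution and a recent production sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full range
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    exp_pct = np.clip(exp_counts / exp_counts.sum(), 1e-6, None)
    act_pct = np.clip(act_counts / act_counts.sum(), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 10_000)
production_feature = rng.normal(0.4, 1.1, 10_000)  # shifted distribution

psi = population_stability_index(training_feature, production_feature)
print(f"PSI = {psi:.3f}")
if psi > 0.2:  # 0.2 is a common rule-of-thumb alarm level, not a universal standard
    print("Significant drift detected: trigger retraining / investigation")
```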
A formal scoping document serves as the contract between technical teams and stakeholders. It captures decisions, assumptions, and commitments that govern the project. The following template synthesizes best practices from production ML teams:
Document Structure:
Executive Summary (1/2 page)
Business Context (1 page)
Technical Specification (2-3 pages)
Scope Definition (1 page)
Risk Assessment (1 page)
Implementation Plan (1 page)
```markdown
# ML Project Scoping Document

## Project: Customer Churn Prediction v1.0
**Date:** 2024-01-15
**Author:** ML Team Lead
**Stakeholders:** Product, Marketing, Finance
**Status:** APPROVED

---

## 1. Executive Summary

We propose building a machine learning model to predict customer churn 30 days in advance, enabling proactive retention interventions.

**Expected Impact:** 15-25% reduction in monthly churn among targeted users
**Investment:** 3 months engineering, $50K infrastructure
**Key Risk:** Label definition depends on unreliable cancellation tracking

---

## 2. Business Context

### Problem Statement
Monthly churn rate of 5.2% costs approximately $12M annually in lost recurring revenue. Current retention efforts are reactive (after cancellation) rather than proactive (preventing cancellation).

### Current Solution
Manual identification based on support tickets and engagement drops.
Coverage: ~10% of churning users. Success rate: ~15% prevention.

### Stakeholder Requirements

| Stakeholder | Requirement | Priority |
|-------------|-------------|----------|
| Product | Minimize false alerts to users | P0 |
| Marketing | Actionable segments for campaigns | P1 |
| Finance | ROI-positive intervention program | P0 |

### Business Success Criteria
- 20% relative reduction in monthly churn rate
- < 2% of retained users report negative survey response to outreach
- Positive ROI within 6 months of deployment

---

## 3. Technical Specification

### ML Task Formulation
**Type:** Binary Classification
**Input:** User activity features over trailing 28-day window
**Output:** Calibrated probability of churn in next 30 days
**Threshold:** Set to achieve 15% recall (matching intervention capacity)

### Feature Specification

| Feature Category | Examples | Availability |
|------------------|----------|--------------|
| Engagement | Sessions, time spent, features used | Real-time |
| Subscription | Tenure, plan type, payment history | Real-time |
| Support | Tickets, satisfaction scores | 24-hour lag |
| Social | Team size, invitations sent | Real-time |

### Label Definition
**Positive (Churn):** User-initiated cancellation OR failed renewal not recovered within 7 days, occurring 1-30 days after prediction point.
**Negative (Retained):** Active subscription 30+ days after prediction.
**Exclusions:** Enterprise customers, trial users, accounts < 14 days old.

### Performance Requirements

| Metric | Minimum | Target |
|--------|---------|--------|
| Precision @ 15% Recall | 30% | 50% |
| AUC-ROC | 0.75 | 0.85 |
| Calibration Error (ECE) | < 0.10 | < 0.05 |

### Infrastructure Constraints
- Inference latency: < 100ms (batch acceptable)
- Model size: < 500MB (MLflow deployment)
- Retraining: Weekly automated pipeline

---

## 4. Scope Definition

### In Scope (v1.0)
- Self-serve subscription customers
- US and EU regions
- Voluntary churn only (excludes payment failures)

### Out of Scope (v1.0)
- Enterprise/custom pricing customers
- Free tier users
- Involuntary churn (payment issues handled separately)
- New users < 14 days (cold start problem deferred to v2)

### Assumptions
1. Marketing team has capacity to act on 15% of user base weekly
2. Historical cancellation data accurately reflects churn patterns
3. Product will not undergo major changes during development

### Known Limitations
- No visibility into external factors (competitor offers, life changes)
- Cannot predict enterprise churn (insufficient volume for training)

---

## 5. Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Insufficient signal in features | Medium | High | MVM phase to validate feasibility |
| Label noise from data quality | High | Medium | Manual audit of label sample |
| Distribution shift post-launch | High | Medium | Weekly retraining, drift monitoring |
| Low adoption by marketing | Medium | High | Co-design intervention workflow |
| Privacy/GDPR challenges | Low | High | Legal review before feature selection |

### Ethical Considerations
- Model must not discriminate based on protected attributes
- Fairness audit across user segments before deployment
- Transparency: users may request explanation for targeting

---

## 6. Implementation Plan

### Phase 1: Feasibility (Weeks 1-3)
- Data extraction and label audit
- MVM baseline (logistic regression)
- **Gate:** AUC > 0.70 OR project descoped

### Phase 2: Development (Weeks 4-8)
- Feature engineering
- Model experimentation
- Offline evaluation
- **Gate:** Meets minimum performance thresholds

### Phase 3: Integration (Weeks 9-10)
- Scoring pipeline deployment
- Integration with marketing tools
- Shadow mode operation

### Phase 4: Launch (Weeks 11-12)
- A/B test design and execution
- Monitor business metrics
- **Gate:** Positive A/B test OR iteration

---

## Approvals

| Role | Name | Date | Signature |
|------|------|------|-----------|
| ML Lead | | | |
| Product Owner | | | |
| Marketing Lead | | | |
| Engineering Manager | | | |
```

The scoping document should be a living document, updated as new information emerges. However, changes after the scoping phase require explicit stakeholder agreement. The document serves as a change control mechanism—preventing silent scope expansion while allowing deliberate scope evolution.
Problem scoping is the foundation upon which successful ML projects are built. The time invested in rigorous scoping pays dividends throughout the project lifecycle—reducing wasted effort, aligning stakeholders, and ensuring that the model you build is the model the business needs.
What's Next:
With a well-scoped problem in hand, the next critical phase is data collection planning. The next page examines how to systematically plan data acquisition, ensure data quality, and build data pipelines that support both model development and production operation.
You now understand the systematic discipline of ML problem scoping. This capability separates ML practitioners who ship meaningful products from those who build impressive models that never reach production. Next, we'll explore data collection planning—the fuel that powers every ML system.