In the landscape of machine learning failures, a striking pattern emerges: the majority of ML projects fail not because of algorithmic limitations or data quality issues, but because the problem was poorly defined from the outset. Industry surveys consistently report that 60-80% of ML projects never reach production, and the root cause often traces back to the earliest phase—problem scoping.
Problem scoping is the disciplined practice of translating ambiguous business needs into precisely formulated machine learning problems. It encompasses understanding stakeholder objectives, assessing technical feasibility, defining success metrics, and establishing the boundaries of what ML can and cannot deliver. It is simultaneously the most underrated and most consequential phase of any ML project.
This page provides a comprehensive framework for problem scoping that distinguishes successful ML practitioners from those who build sophisticated solutions to the wrong problems.
By completing this page, you will be able to: (1) Translate vague business requests into precisely formulated ML problems, (2) Assess whether ML is the appropriate solution for a given problem, (3) Define measurable success criteria that align with business objectives, (4) Identify and mitigate common scoping pitfalls, and (5) Communicate technical feasibility to non-technical stakeholders with clarity and precision.
Every ML project begins with a business need, but stakeholders rarely articulate that need in ML-compatible terms. A product manager might say, "We want to reduce customer churn," while a finance director requests "better fraud detection." These statements describe outcomes, not problems. The ML practitioner's first task is to excavate the underlying structure.
The Translation Problem:
Consider the gap between business language and ML formulation:
| Business Ask | Hidden Complexity | ML Formulation Required |
|---|---|---|
| "Reduce churn" | When is churn measured? What actions are possible? | Binary classification: Will user churn in next X days? |
| "Better recommendations" | Better how? Engagement? Revenue? Diversity? | Multi-objective ranking with explicit tradeoffs |
| "Automate support" | What fraction? Edge case handling? | Multi-class classification + confidence thresholding |
| "Detect anomalies" | What constitutes an anomaly? False positive tolerance? | Unsupervised detection with domain-specific thresholds |
The translation process requires extensive stakeholder dialogue, not just surface-level requirements gathering.
Different stakeholders often have conflicting definitions of success. Marketing wants engagement; Finance wants revenue; Trust & Safety wants risk reduction. Failing to surface and resolve these conflicts early leads to models that satisfy no one. Explicit prioritization conversations are mandatory.
Not every problem should be solved with machine learning. Before committing to an ML approach, rigorous feasibility assessment prevents wasted effort and misallocated resources. The assessment examines four dimensions: data feasibility, technical feasibility, organizational feasibility, and economic feasibility.
Data Feasibility:
The fundamental question: Does the signal necessary to solve this problem exist in available data?
This assessment examines whether the necessary signal actually exists, whether labels can be obtained, and whether data volume and history are sufficient.
Technical Feasibility:
Even with perfect data, technical constraints such as latency, throughput, cost, or explainability requirements may make the problem intractable.
Organizational Feasibility:
Technical solutions require organizational support: teams willing to act on model outputs, owners for the data pipelines, and processes for handling model errors.
Economic Feasibility:
The final arbiter—does the ROI justify the investment?
A 2% accuracy improvement that costs $500K to achieve and delivers $50K annual value is a poor investment. Always quantify the expected impact.
For an ML project to justify its complexity, it should deliver at least 10x the value of a simpler alternative. If a rule-based system achieves 85% of the benefit at 10% of the cost, that's often the right choice. ML should be reserved for problems where the performance gap is substantial and economically meaningful.
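To make the economic screen concrete, here is a back-of-envelope comparison in the spirit of the numbers above; the `simple_roi` helper and all dollar figures, benefit fractions, and the three-year horizon are illustrative assumptions, not values from a real project.

```python
# Back-of-envelope ROI comparison for an ML project vs. a simpler baseline.
# All numbers are illustrative assumptions.

def simple_roi(annual_value: float, build_cost: float, annual_run_cost: float,
               horizon_years: int = 3) -> float:
    """Net value over the horizon divided by total cost (no discounting)."""
    total_value = annual_value * horizon_years
    total_cost = build_cost + annual_run_cost * horizon_years
    return (total_value - total_cost) / total_cost

# Hypothetical rules-based baseline: 85% of the achievable benefit at 10% of the cost
rules_roi = simple_roi(annual_value=0.85 * 400_000, build_cost=50_000, annual_run_cost=10_000)

# Hypothetical ML system: full benefit, much higher build and operating cost
ml_roi = simple_roi(annual_value=400_000, build_cost=500_000, annual_run_cost=80_000)

print(f"Rules baseline ROI over 3 years: {rules_roi:.1f}x")
print(f"ML system ROI over 3 years:      {ml_roi:.1f}x")
# If the rules baseline dominates, the ML project needs a much stronger value case.
```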
Once feasibility is established, the next challenge is precise problem formulation. This is where business objectives become mathematical specifications. The formulation defines the task type, the inputs available at inference time, the exact output and label, and the constraints the system must satisfy.
Task Type Selection:
The same business problem can often be formulated as different ML tasks. Consider "predicting customer lifetime value":
| Formulation | Task Type | Output | Tradeoffs |
|---|---|---|---|
| Exact LTV prediction | Regression | Dollar amount | High variance, requires revenue history |
| LTV tier classification | Multi-class | High/Medium/Low | Lower granularity, more robust |
| High-value customer detection | Binary | Yes/No | Simplest, loses LTV ordering |
| LTV percentile ranking | Learning to Rank | Relative ordering | No absolute values, easier calibration |
Each formulation has different data requirements, modeling approaches, and business implications. The choice should be driven by how the output will be used, not modeling convenience.
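As a minimal sketch of how one raw outcome can back each row of the table, the snippet below derives four alternative training targets from a hypothetical `revenue_12m` column; the column name, tier cut-offs, and high-value threshold are assumptions chosen purely for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical per-customer observed revenue over a fixed horizon
df = pd.DataFrame({"customer_id": range(6),
                   "revenue_12m": [0.0, 35.0, 120.0, 480.0, 15.0, 950.0]})

# Regression: predict the dollar amount directly
df["target_regression"] = df["revenue_12m"]

# Multi-class: Low / Medium / High tiers (cut-offs are illustrative)
df["target_tier"] = pd.cut(df["revenue_12m"], bins=[-np.inf, 50, 300, np.inf],
                           labels=["low", "medium", "high"])

# Binary: "high-value" detection above an assumed threshold
df["target_high_value"] = (df["revenue_12m"] >= 300).astype(int)

# Ranking: revenue percentile, used as a relative-ordering target
df["target_percentile"] = df["revenue_12m"].rank(pct=True)

print(df)
```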
```python
# Problem Formulation Template
# A rigorous specification that forces clarity

from dataclasses import dataclass
from typing import List, Dict, Optional
from enum import Enum


class TaskType(Enum):
    BINARY_CLASSIFICATION = "binary_classification"
    MULTICLASS_CLASSIFICATION = "multiclass_classification"
    REGRESSION = "regression"
    RANKING = "ranking"
    SEQUENCE_LABELING = "sequence_labeling"
    GENERATION = "generation"
    CLUSTERING = "clustering"
    ANOMALY_DETECTION = "anomaly_detection"


@dataclass
class MLProblemFormulation:
    """
    Comprehensive problem formulation document.
    Forces explicit consideration of all critical dimensions.
    """
    # Business Context
    business_objective: str              # What business outcome is targeted?
    success_definition: str              # How will success be measured in business terms?
    stakeholders: List[str]              # Who are the decision-makers?

    # Task Specification
    task_type: TaskType
    input_features: List[str]            # Features available at inference
    inference_time_constraints: List[str]  # What cannot be computed in real-time
    output_specification: str            # Exact output format and semantics

    # Data Specification
    training_data_source: str
    label_definition: str                # Precise definition of positive/negative or target
    label_acquisition_method: str        # How are labels obtained?
    expected_data_volume: int
    data_freshness_requirement: str      # How recent must training data be?

    # Constraints
    latency_requirement_ms: Optional[int]
    minimum_acceptable_performance: Dict[str, float]  # metric: threshold
    explainability_requirement: str      # None, global, local, regulatory
    privacy_constraints: List[str]

    # Evaluation
    primary_metric: str
    secondary_metrics: List[str]
    evaluation_methodology: str          # Cross-validation, temporal split, etc.
    offline_online_gap_risk: str         # Known risks of offline-online mismatch

    def validate(self) -> List[str]:
        """
        Validates completeness of problem formulation.
        Returns list of issues that require resolution.
        """
        issues = []

        if not self.label_definition:
            issues.append("Label definition is missing - critical for alignment")

        if not self.minimum_acceptable_performance:
            issues.append("No minimum performance threshold defined")

        if self.task_type == TaskType.BINARY_CLASSIFICATION:
            if "precision" not in self.primary_metric.lower() and \
               "recall" not in self.primary_metric.lower():
                issues.append(
                    "Binary classification without precision/recall "
                    "consideration - error costs may be unbalanced"
                )

        if not self.offline_online_gap_risk:
            issues.append(
                "Offline-online gap risk not assessed - "
                "distribution shift may invalidate offline metrics"
            )

        return issues


# Example: Churn Prediction Problem Formulation
churn_formulation = MLProblemFormulation(
    # Business Context
    business_objective="Reduce voluntary subscription cancellations by identifying at-risk users for proactive intervention",
    success_definition="20% reduction in monthly churn rate among users targeted by the retention campaign",
    stakeholders=["Product (user experience)", "Marketing (retention campaigns)", "Finance (revenue impact)"],

    # Task Specification
    task_type=TaskType.BINARY_CLASSIFICATION,
    input_features=[
        "days_since_last_active",
        "weekly_session_count_4w",
        "feature_adoption_score",
        "support_ticket_count_30d",
        "subscription_tenure_months",
        "payment_method_risk_score"
    ],
    inference_time_constraints=["LTV prediction (requires batch computation)"],
    output_specification="Probability of churn within next 30 days, calibrated [0,1]",

    # Data Specification
    training_data_source="User activity logs + subscription events",
    label_definition="User initiated cancellation or failed renewal within 30-day window after prediction point",
    label_acquisition_method="Historical observation with 30-day wait period",
    expected_data_volume=500000,  # labeled examples
    data_freshness_requirement="Training data no older than 90 days due to product changes",

    # Constraints
    latency_requirement_ms=50,  # Batch scoring acceptable
    minimum_acceptable_performance={
        "precision_at_10pct_recall": 0.30,  # Among users we target, 30%+ actually would churn
        "auc_roc": 0.75
    },
    explainability_requirement="Local explanations for customer support use case",
    privacy_constraints=["No direct access to message content", "GDPR right-to-explanation"],

    # Evaluation
    primary_metric="Precision at 10% recall (matches intervention capacity)",
    secondary_metrics=["AUC-ROC", "Calibration error", "Lift at top decile"],
    evaluation_methodology="Temporal train-test split (train < July, test = July-August)",
    offline_online_gap_risk="Users receiving retention offers may behave differently than historical patterns"
)

# Validate the formulation
issues = churn_formulation.validate()
if issues:
    print("Formulation issues to resolve:")
    for issue in issues:
        print(f"  - {issue}")
else:
    print("Problem formulation complete and validated")
```

The Critical Role of Label Definition:
The single most consequential decision in problem formulation is the label definition. Ambiguity here propagates through every subsequent step:
Consider "fraud detection":
Each definition produces a different training set, optimizes for different behavior, and has different performance characteristics. There is no objectively correct definition—only business-appropriate definitions. The choice must be explicit and documented.
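To make that concrete, here is a small, hypothetical sketch in which two candidate label definitions (confirmed chargeback versus manual reviewer decision) are applied to the same transactions and yield different training sets; the field names and values are invented for illustration.

```python
import pandas as pd

# Hypothetical transaction outcomes; fields and values are illustrative only.
tx = pd.DataFrame({
    "tx_id": [1, 2, 3, 4, 5],
    "chargeback_within_90d": [True, False, False, True, False],
    "reviewer_marked_fraud":  [True, True,  False, False, False],
})

# Definition A: fraud = confirmed chargeback (late-arriving but low-noise label)
tx["label_chargeback"] = tx["chargeback_within_90d"].astype(int)

# Definition B: fraud = manual reviewer decision (fast but reflects reviewer policy and bias)
tx["label_reviewer"] = tx["reviewer_marked_fraud"].astype(int)

# The two definitions disagree on some transactions -> different training sets
disagreement = (tx["label_chargeback"] != tx["label_reviewer"]).mean()
print(f"Positive rate (chargeback): {tx['label_chargeback'].mean():.0%}")
print(f"Positive rate (reviewer):   {tx['label_reviewer'].mean():.0%}")
print(f"Disagreement rate:          {disagreement:.0%}")
```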
When the true objective cannot be directly measured, we use proxies. But proxies create gaps. Optimizing for click-through rate as a proxy for user satisfaction produces clickbait. Optimizing for time-on-site as a proxy for engagement produces addiction loops. Always document the gap between your measurable proxy and the true business objective, and implement guardrails against proxy gaming.
A well-scoped ML problem has explicit boundaries—what is included and what is deliberately excluded. Without boundaries, scope creep transforms a deliverable project into an endless research odyssey.
Defining What's In Scope:
Defining What's Out of Scope:
Explicitly listing exclusions prevents misunderstandings about what the first version will and will not deliver.
Constraint Documentation:
Every ML system operates within constraints that shape the solution space:
| Constraint Type | Examples | Impact on Solution |
|---|---|---|
| Latency | < 50ms P99 | Limits model complexity, may require distillation |
| Throughput | 10K inferences/second | Batch processing, GPU requirements |
| Memory | < 500MB model size | Quantization, pruning, smaller architectures |
| Cost | $X per 1000 inferences | Simpler models, caching strategies |
| Privacy | No PII in features | Feature engineering constraints, federated approaches |
| Explainability | Regulatory explanations required | Interpretable models or post-hoc explanations |
| Availability | 99.9% uptime | Fallback mechanisms, graceful degradation |
| Freshness | Updates within 1 hour | Streaming pipelines, incremental learning |
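One lightweight way to treat these constraints as a gate rather than an afterthought is to measure a candidate model against an explicit budget before comparing accuracy. The sketch below does this for size and latency; the budget values are placeholders and the model is a synthetic stand-in for a real candidate.

```python
import pickle
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative constraint budget (placeholder values, not recommendations)
MAX_MODEL_SIZE_MB = 500
MAX_P99_LATENCY_MS = 50

# Candidate model trained on synthetic data, standing in for a real candidate
X = np.random.rand(10_000, 20)
y = (X[:, 0] > 0.5).astype(int)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Size check: serialized artifact size
size_mb = len(pickle.dumps(model)) / 1e6

# Latency check: P99 over repeated single-row predictions
latencies = []
row = X[:1]
for _ in range(500):
    start = time.perf_counter()
    model.predict_proba(row)
    latencies.append((time.perf_counter() - start) * 1000)
p99_ms = float(np.percentile(latencies, 99))

print(f"Model size: {size_mb:.2f} MB (budget {MAX_MODEL_SIZE_MB} MB)")
print(f"P99 latency: {p99_ms:.2f} ms (budget {MAX_P99_LATENCY_MS} ms)")
print("PASS" if size_mb <= MAX_MODEL_SIZE_MB and p99_ms <= MAX_P99_LATENCY_MS else "FAIL")
```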
Minimum Viable Model (MVM):
Just as products have MVPs, ML projects should define an MVM—the simplest model that delivers value. The MVM establishes feasibility, produces early baseline metrics, and exposes data and pipeline issues before heavy investment.
The MVM should be simple to build, fast to evaluate, and measured against the same metrics planned for later versions.
Example MVM progression:
MVM: Logistic regression on 5 hand-crafted features
→ Establishes feasibility and baseline metrics
V1: Gradient boosted trees on 50 engineered features
→ Improves performance, validates feature value
V2: Neural network with embeddings
→ Pushes performance ceiling, requires more infrastructure
V3: Ensemble with online learning
→ Addresses distribution shift, production-grade
Each increment should have explicit performance targets that justify the added complexity.
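As a rough illustration of what an MVM baseline looks like in code, the sketch below trains a logistic regression on a few synthetic stand-in features and reports AUC-ROC; the data, feature semantics, and split are assumptions made only to show the shape of the exercise.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(42)

# Synthetic stand-in for five hand-crafted features (semantics are illustrative)
n = 5_000
X = rng.normal(size=(n, 5))  # e.g. tenure, recent sessions, tickets, adoption, payment risk
logit = 0.8 * X[:, 0] - 0.6 * X[:, 1] + 0.3 * X[:, 2]
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# Simple ordered split standing in for a temporal train/test split
split = int(0.8 * n)
X_train, X_test, y_train, y_test = X[:split], X[split:], y[:split], y[split:]

mvm = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, mvm.predict_proba(X_test)[:, 1])
print(f"MVM baseline AUC-ROC: {auc:.3f}")
# Any later, more complex version must beat this number by enough to justify its cost.
```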
A well-scoped ML problem can be completely described in a single page. If your problem statement requires extensive documentation to be understood, it's either too complex (break it down) or too vague (add specificity). The one-pager forces clarity and enables rapid alignment across stakeholders.
Before any modeling begins, success must be defined in measurable terms. This requires distinguishing between model metrics (how the model performs technically) and business metrics (impact on business outcomes).
Model Metrics (Offline Evaluation):
These are the traditional ML performance measures computed on held-out test sets:
| Task Type | Common Metrics | Considerations |
|---|---|---|
| Binary Classification | AUC-ROC, Precision@K, Recall@K, F1 | AUC insensitive to class imbalance; choose based on decision threshold |
| Multi-class | Macro/Micro F1, Confusion Matrix, Top-K Accuracy | Class weights matter for imbalanced classes |
| Regression | MAE, RMSE, MAPE, R² | RMSE penalizes large errors; MAE more robust to outliers |
| Ranking | NDCG, MRR, MAP | Position-weighted; critical for recommendation systems |
| Calibration | Brier Score, ECE | Essential when probabilities drive decisions |
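Because calibration metrics are reported less uniformly than AUC or F1, a minimal expected calibration error (ECE) implementation is sketched below; the equal-width binning shown is one common convention, not the only one.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins: int = 10) -> float:
    """ECE with equal-width probability bins: weighted mean |accuracy - confidence| per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0.0:
            mask = (y_prob >= lo) & (y_prob <= hi)
        else:
            mask = (y_prob > lo) & (y_prob <= hi)
        if mask.sum() == 0:
            continue
        avg_confidence = y_prob[mask].mean()
        avg_accuracy = y_true[mask].mean()  # fraction of positives in the bin
        ece += (mask.sum() / len(y_prob)) * abs(avg_accuracy - avg_confidence)
    return ece

# Toy example: probabilities that are systematically overconfident
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.3, 0.4, 0.9, 0.2, 0.95, 0.85, 0.35, 0.9]
print(f"ECE: {expected_calibration_error(y_true, y_prob):.3f}")
```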
Business Metrics (Online Evaluation):
These connect model performance to business value:
| Business Metric | Definition | Connection to Model |
|---|---|---|
| Revenue impact | Incremental revenue from model decisions | Model precision → Correct targeting → Revenue |
| Cost reduction | Savings from automation vs. manual | Model coverage × Accuracy → Reduced manual review |
| User engagement | Downstream user behavior changes | Recommendation quality → User satisfaction → Engagement |
| Risk reduction | Prevented losses from adverse events | Detection rate × True precision → Prevented fraud |
The Metric Selection Process:
Choosing the right primary metric requires understanding the decision-making context:
Identify the operating point — Where on the precision-recall curve will the system operate?
Match metric to business cost — Translate error types to dollar values
Consider calibration requirements — If probabilities are used directly in downstream calculations (pricing, risk scoring), calibration metrics are essential
Account for distribution shift — Metrics that are robust to distribution changes (normalized metrics, rank-based metrics) may be preferable in volatile domains
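The sketch below ties the first two steps together: it sweeps candidate decision thresholds and selects the one with the lowest expected error cost, using assumed per-error dollar costs that would in practice come from the business owners.

```python
import numpy as np

def pick_threshold_by_cost(y_true, y_prob, cost_fp: float, cost_fn: float):
    """Choose the decision threshold that minimizes expected error cost per example."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob)
    best = (None, np.inf)
    for t in np.linspace(0.05, 0.95, 19):
        pred = (y_prob >= t).astype(int)
        fp = np.sum((pred == 1) & (y_true == 0))
        fn = np.sum((pred == 0) & (y_true == 1))
        cost = (fp * cost_fp + fn * cost_fn) / len(y_true)
        if cost < best[1]:
            best = (t, cost)
    return best

# Toy scores; the $5 false-positive and $100 false-negative costs are assumptions
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 0, 1, 0])
y_prob = np.array([0.1, 0.3, 0.7, 0.2, 0.9, 0.6, 0.4, 0.05, 0.8, 0.15])
threshold, cost = pick_threshold_by_cost(y_true, y_prob, cost_fp=5.0, cost_fn=100.0)
print(f"Chosen threshold: {threshold:.2f}, expected cost per example: ${cost:.2f}")
```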
```python
from dataclasses import dataclass
from typing import Dict, List, Callable
import numpy as np


@dataclass
class MetricDefinition:
    """Complete specification of an evaluation metric."""
    name: str
    description: str
    formula: str                  # LaTeX or prose explanation
    compute_fn: Callable          # Actual computation
    higher_is_better: bool
    requires_probabilities: bool
    threshold_dependent: bool
    business_interpretation: str


@dataclass
class SuccessCriteria:
    """
    Comprehensive success criteria for an ML project.
    Separates model-level from business-level metrics.
    """
    # Model-level success (offline metrics)
    primary_model_metric: MetricDefinition
    primary_threshold: float
    secondary_model_metrics: Dict[str, float]  # metric_name: minimum_threshold

    # Business-level success (online metrics)
    primary_business_metric: str
    business_calculation: str     # How model performance translates to business value
    expected_business_impact: str

    # Guardrail constraints (must not violate)
    guardrails: Dict[str, float]  # metric: maximum_acceptable_violation

    # Statistical significance requirements
    minimum_sample_size: int      # For A/B testing
    confidence_level: float
    minimum_detectable_effect: float

    def generate_success_document(self) -> str:
        """Generates a stakeholder-readable success criteria document."""
        doc = f"""MODEL SUCCESS CRITERIA
======================

PRIMARY METRIC: {self.primary_model_metric.name}
Threshold for success: {self.primary_model_metric.name} >= {self.primary_threshold}

Business interpretation: {self.primary_model_metric.business_interpretation}

SECONDARY METRICS (must also meet):
"""
        for metric, threshold in self.secondary_model_metrics.items():
            doc += f"  - {metric} >= {threshold}\n"

        doc += "\nGUARDRAILS (must not violate):\n"
        for metric, max_violation in self.guardrails.items():
            doc += f"  - {metric}: maximum acceptable = {max_violation}\n"

        doc += f"""
BUSINESS IMPACT:
Primary business metric: {self.primary_business_metric}
Calculation: {self.business_calculation}
Expected impact: {self.expected_business_impact}

A/B TEST REQUIREMENTS:
- Minimum sample size: {self.minimum_sample_size:,}
- Confidence level: {self.confidence_level * 100}%
- Minimum detectable effect: {self.minimum_detectable_effect * 100}%
"""
        return doc


# Example: Fraud Detection Success Criteria
def compute_precision_at_recall(y_true, y_prob, target_recall=0.80):
    """Compute precision at a fixed recall threshold."""
    from sklearn.metrics import precision_recall_curve
    precision, recall, thresholds = precision_recall_curve(y_true, y_prob)
    # Find threshold that achieves target recall
    valid_idx = np.where(recall >= target_recall)[0]
    if len(valid_idx) == 0:
        return 0.0
    best_idx = valid_idx[np.argmax(precision[valid_idx])]
    return precision[best_idx]


fraud_success_criteria = SuccessCriteria(
    # Model metrics
    primary_model_metric=MetricDefinition(
        name="Precision at 80% Recall",
        description="Among all predicted fraud cases at the threshold that catches 80% of actual fraud, what fraction are truly fraudulent?",
        formula="P(fraud | predicted_fraud) when threshold set to achieve R(fraud) = 0.80",
        compute_fn=compute_precision_at_recall,
        higher_is_better=True,
        requires_probabilities=True,
        threshold_dependent=True,
        business_interpretation="At our manual review capacity, 50%+ of flagged transactions should be actual fraud to justify investigation cost"
    ),
    primary_threshold=0.50,  # 50% precision at 80% recall
    secondary_model_metrics={
        "auc_roc": 0.90,
        "calibration_error": 0.05,  # Maximum expected calibration error
    },

    # Business metrics
    primary_business_metric="Net Fraud Loss Reduction",
    business_calculation="(Caught fraud - False positive cost) - (Model operation cost)",
    expected_business_impact="$2M annual reduction in fraud losses; $200K review efficiency gain",

    # Guardrails
    guardrails={
        "customer_friction_rate": 0.02,  # No more than 2% of legitimate transactions flagged
        "review_queue_time_hours": 4,    # Fraud queue must clear within 4 hours
    },

    # A/B test requirements
    minimum_sample_size=50000,           # transactions per variant
    confidence_level=0.95,
    minimum_detectable_effect=0.10       # 10% relative improvement
)

print(fraud_success_criteria.generate_success_document())
```

Offline model metrics often don't translate directly to online business impact. A model with 5% higher AUC may produce only 1% business improvement—or 20% improvement—depending on the decision-making context. Always plan for online experimentation (A/B tests) to measure true business impact, and treat offline metrics as necessary but not sufficient conditions for success.
Experience reveals recurring patterns of scoping failures. Understanding these anti-patterns protects against repeating common mistakes.
Pitfall 1: Solution Searching for a Problem
"We want to use deep learning somewhere. Find a use case."
Technology-first thinking inverts the correct order. When the goal is to deploy a particular technique rather than solve a business problem, the result is unnecessary complexity, a poor fit between tool and need, and little measurable business impact.
Correction: Always start with the business problem. The technique should be selected to serve the problem, not the reverse.
Pitfall 2: Scope Creep by Committee
"Can we also add X? What about Y? Everyone wants Z."
Well-intentioned stakeholders continuously add requirements until the project becomes impossible. Each addition seems minor but the total becomes undeliverable.
Correction: Freeze scope after the scoping phase. New requirements go to V2. If something truly must be added, something else must be removed. Zero-sum requirement changes enforce discipline.
Pitfall 3: The Single Metric Obsession
"Just optimize for accuracy. That's all that matters."
Real systems have multiple objectives that trade off against each other. A spam filter optimized purely for precision catches no spam; optimized purely for recall, it blocks legitimate email. Single-metric optimization ignores constraints and produces unusable systems.
Correction: Define a primary metric for optimization but add guardrail constraints that cannot be violated. The objective becomes: maximize primary metric subject to constraint satisfaction.
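A minimal way to encode "maximize the primary metric subject to guardrails" is to filter candidate models by their guardrail metrics before ranking on the primary one; the candidate numbers, metric names, and thresholds below are placeholders for illustration.

```python
# Candidate models evaluated offline; all numbers are placeholders.
candidates = [
    {"name": "model_a", "primary_auc": 0.86, "false_block_rate": 0.030, "p99_latency_ms": 40},
    {"name": "model_b", "primary_auc": 0.84, "false_block_rate": 0.015, "p99_latency_ms": 35},
    {"name": "model_c", "primary_auc": 0.88, "false_block_rate": 0.012, "p99_latency_ms": 120},
]

# Guardrails that must not be violated (assumed limits)
guardrails = {"false_block_rate": 0.02, "p99_latency_ms": 50}

feasible = [c for c in candidates
            if all(c[metric] <= limit for metric, limit in guardrails.items())]

if feasible:
    best = max(feasible, key=lambda c: c["primary_auc"])
    print(f"Selected {best['name']} with AUC {best['primary_auc']}")
else:
    print("No candidate satisfies the guardrails; revisit the scoping constraints")
```

Note that the highest-AUC candidate is not selected here: it violates a guardrail, so a slightly weaker but constraint-satisfying model wins.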
Pitfall 4: Ignoring the Human Loop
ML systems rarely operate autonomously. They typically augment human decision-making. Scoping that ignores the human interface produces outputs nobody acts on, formats that don't fit the existing workflow, and undefined behavior when model confidence is low.
Correction: Map the human workflow explicitly. Where does the model output enter? Who acts on it? What format do they need? What happens when confidence is low? Design the model to serve the human, not replace them.
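One concrete pattern for the low-confidence question is a routing rule: act automatically only when the model is confident, and send ambiguous cases to a human queue. The thresholds in this sketch are assumptions that would be set jointly with the reviewing team.

```python
def route_prediction(prob: float, auto_low: float = 0.05, auto_high: float = 0.90) -> str:
    """Route a calibrated probability to an action, using illustrative thresholds."""
    if prob >= auto_high:
        return "auto_action"    # e.g. block transaction / trigger retention offer
    if prob <= auto_low:
        return "no_action"      # confidently negative, no human time spent
    return "human_review"       # ambiguous cases go to the human queue

for p in [0.02, 0.40, 0.97]:
    print(f"p={p:.2f} -> {route_prediction(p)}")
```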
Scoping typically focuses on building the model, but 80% of ML cost is in maintenance. Distribution shift, data pipeline failures, model degradation, and retraining cycles are predictable. A scoping document that doesn't address ongoing maintenance is incomplete. Include: Who monitors the system? How often is the model retrained? What triggers retraining? What are the fallback mechanisms?
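As one example of making "what triggers retraining" concrete, the sketch below computes a population stability index (PSI) for a single feature and flags retraining when it exceeds a commonly cited, but still heuristic, alarm level.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins: int = 10) -> float:
    """PSI between a training-time feature distribution and a recent production sample."""
    edges = np.quantile(expected, np.linspace(0, 1, n_bins + 1))
    edges[0], edges[-1] = -np.inf, np.inf  # cover the full range
    exp_counts, _ = np.histogram(expected, bins=edges)
    act_counts, _ = np.histogram(actual, bins=edges)
    exp_pct = np.clip(exp_counts / exp_counts.sum(), 1e-6, None)
    act_pct = np.clip(act_counts / act_counts.sum(), 1e-6, None)
    return float(np.sum((act_pct - exp_pct) * np.log(act_pct / exp_pct)))

rng = np.random.default_rng(0)
training_feature = rng.normal(0.0, 1.0, 10_000)
production_feature = rng.normal(0.4, 1.1, 10_000)  # shifted distribution

psi = population_stability_index(training_feature, production_feature)
print(f"PSI = {psi:.3f}")
if psi > 0.2:  # 0.2 is a common rule-of-thumb alarm level, not a universal standard
    print("Significant drift detected: trigger retraining / investigation")
```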
A formal scoping document serves as the contract between technical teams and stakeholders. It captures decisions, assumptions, and commitments that govern the project. The following template synthesizes best practices from production ML teams:
Document Structure:
Executive Summary (1/2 page)
Business Context (1 page)
Technical Specification (2-3 pages)
Scope Definition (1 page)
Risk Assessment (1 page)
Implementation Plan (1 page)
```markdown
# ML Project Scoping Document

## Project: Customer Churn Prediction v1.0
**Date:** 2024-01-15
**Author:** ML Team Lead
**Stakeholders:** Product, Marketing, Finance
**Status:** APPROVED

---

## 1. Executive Summary

We propose building a machine learning model to predict customer churn 30 days in advance, enabling proactive retention interventions.

**Expected Impact:** 15-25% reduction in monthly churn among targeted users
**Investment:** 3 months engineering, $50K infrastructure
**Key Risk:** Label definition depends on unreliable cancellation tracking

---

## 2. Business Context

### Problem Statement
Monthly churn rate of 5.2% costs approximately $12M annually in lost recurring revenue. Current retention efforts are reactive (after cancellation) rather than proactive (preventing cancellation).

### Current Solution
Manual identification based on support tickets and engagement drops.
Coverage: ~10% of churning users. Success rate: ~15% prevention.

### Stakeholder Requirements

| Stakeholder | Requirement | Priority |
|-------------|-------------|----------|
| Product | Minimize false alerts to users | P0 |
| Marketing | Actionable segments for campaigns | P1 |
| Finance | ROI-positive intervention program | P0 |

### Business Success Criteria
- 20% relative reduction in monthly churn rate
- < 2% of retained users report negative survey response to outreach
- Positive ROI within 6 months of deployment

---

## 3. Technical Specification

### ML Task Formulation
**Type:** Binary Classification
**Input:** User activity features over trailing 28-day window
**Output:** Calibrated probability of churn in next 30 days
**Threshold:** Set to achieve 15% recall (matching intervention capacity)

### Feature Specification

| Feature Category | Examples | Availability |
|------------------|----------|--------------|
| Engagement | Sessions, time spent, features used | Real-time |
| Subscription | Tenure, plan type, payment history | Real-time |
| Support | Tickets, satisfaction scores | 24-hour lag |
| Social | Team size, invitations sent | Real-time |

### Label Definition
**Positive (Churn):** User-initiated cancellation OR failed renewal not recovered within 7 days, occurring 1-30 days after prediction point.
**Negative (Retained):** Active subscription 30+ days after prediction.
**Exclusions:** Enterprise customers, trial users, accounts < 14 days old.

### Performance Requirements

| Metric | Minimum | Target |
|--------|---------|--------|
| Precision @ 15% Recall | 30% | 50% |
| AUC-ROC | 0.75 | 0.85 |
| Calibration Error (ECE) | < 0.10 | < 0.05 |

### Infrastructure Constraints
- Inference latency: < 100ms (batch acceptable)
- Model size: < 500MB (MLflow deployment)
- Retraining: Weekly automated pipeline

---

## 4. Scope Definition

### In Scope (v1.0)
- Self-serve subscription customers
- US and EU regions
- Voluntary churn only (excludes payment failures)

### Out of Scope (v1.0)
- Enterprise/custom pricing customers
- Free tier users
- Involuntary churn (payment issues handled separately)
- New users < 14 days (cold start problem deferred to v2)

### Assumptions
1. Marketing team has capacity to act on 15% of user base weekly
2. Historical cancellation data accurately reflects churn patterns
3. Product will not undergo major changes during development

### Known Limitations
- No visibility into external factors (competitor offers, life changes)
- Cannot predict enterprise churn (insufficient volume for training)

---

## 5. Risk Assessment

| Risk | Probability | Impact | Mitigation |
|------|-------------|--------|------------|
| Insufficient signal in features | Medium | High | MVM phase to validate feasibility |
| Label noise from data quality | High | Medium | Manual audit of label sample |
| Distribution shift post-launch | High | Medium | Weekly retraining, drift monitoring |
| Low adoption by marketing | Medium | High | Co-design intervention workflow |
| Privacy/GDPR challenges | Low | High | Legal review before feature selection |

### Ethical Considerations
- Model must not discriminate based on protected attributes
- Fairness audit across user segments before deployment
- Transparency: users may request explanation for targeting

---

## 6. Implementation Plan

### Phase 1: Feasibility (Weeks 1-3)
- Data extraction and label audit
- MVM baseline (logistic regression)
- **Gate:** AUC > 0.70 OR project descoped

### Phase 2: Development (Weeks 4-8)
- Feature engineering
- Model experimentation
- Offline evaluation
- **Gate:** Meets minimum performance thresholds

### Phase 3: Integration (Weeks 9-10)
- Scoring pipeline deployment
- Integration with marketing tools
- Shadow mode operation

### Phase 4: Launch (Weeks 11-12)
- A/B test design and execution
- Monitor business metrics
- **Gate:** Positive A/B test OR iteration

---

## Approvals

| Role | Name | Date | Signature |
|------|------|------|-----------|
| ML Lead | | | |
| Product Owner | | | |
| Marketing Lead | | | |
| Engineering Manager | | | |
```

The scoping document should be a living document, updated as new information emerges. However, changes after the scoping phase require explicit stakeholder agreement. The document serves as a change control mechanism—preventing silent scope expansion while allowing deliberate scope evolution.
Problem scoping is the foundation upon which successful ML projects are built. The time invested in rigorous scoping pays dividends throughout the project lifecycle—reducing wasted effort, aligning stakeholders, and ensuring that the model you build is the model the business needs.
What's Next:
With a well-scoped problem in hand, the next critical phase is data collection planning. The next page examines how to systematically plan data acquisition, ensure data quality, and build data pipelines that support both model development and production operation.
You now understand the systematic discipline of ML problem scoping. This capability separates ML practitioners who ship meaningful products from those who build impressive models that never reach production. Next, we'll explore data collection planning—the fuel that powers every ML system.