Every failed ML project shares a common origin: insufficient understanding of requirements. Not bad algorithms. Not inadequate data. Not poor engineering. The root cause is almost always a gap between what was built and what was actually needed.
This isn't hyperbole. Industry surveys consistently show that 85-90% of ML projects fail to make it into production, and the primary reason isn't technical—it's organizational. The model works on test data but doesn't solve the actual business problem. The system achieves impressive metrics but doesn't integrate with existing workflows. The solution is technically elegant but answers the wrong question.
Requirements gathering for ML systems is fundamentally different from traditional software development. In conventional software, you can often specify exact inputs, outputs, and behaviors. In ML, you're building systems that operate under uncertainty, whose behavior emerges from data, and whose success depends on probabilistic guarantees that stakeholders may not intuitively understand.
By the end of this page, you will understand how to systematically gather requirements for ML systems—translating ambiguous business objectives into precise technical specifications, aligning stakeholders with realistic expectations, defining success metrics that matter, and identifying constraints before they derail your project.
Effective ML requirements gathering operates across multiple dimensions simultaneously. Unlike traditional software where functional requirements dominate, ML systems require careful consideration of data, model behavior, operational constraints, and business objectives—all intertwined and mutually constraining.
The ML Requirements Framework organizes these concerns into a structured approach that ensures nothing critical is overlooked. Each dimension informs the others, creating a coherent picture of what success looks like and how to achieve it.
These dimensions must be addressed roughly in order. Jumping to operational constraints before defining the problem leads to solutions looking for problems. Discussing data before establishing success metrics leads to 'we have this data, what can we do with it?' thinking—which rarely produces valuable outcomes.
The most critical and most frequently botched aspect of ML requirements is problem definition. Stakeholders often come with solutions in mind ("we need a recommendation engine" or "let's use deep learning for this") rather than clearly articulated problems. Your first task is to work backward from the proposed solution to the underlying problem, then forward again to potentially different—and often simpler—solutions.
The Problem Definition Canvas captures the essential questions that must be answered before any ML work begins:
| Question | Why It Matters | Red Flags If Missing |
|---|---|---|
| What decision will change? | ML systems are decision-support tools. If no decision changes, no value is created. | Answers like 'we'll have better insights' or 'more data-driven culture' |
| Who makes this decision today? | Understanding current process reveals opportunities and constraints. | Nobody knows or 'it's handled manually somehow' |
| How is this decision made now? | Baseline for comparison. Often reveals simpler alternatives. | Vague answers or inability to describe current state |
| What's the cost of a wrong prediction? | Determines acceptable error rates and need for human-in-loop. | All predictions treated as equally important |
| What's the expected impact of improvement? | Quantifies ROI and sets realistic expectations. | Vague claims like 'significant improvement' without numbers |
| Is this actually an ML problem? | Many 'ML projects' are better solved with rules or simple heuristics. | Assumption that ML is always the answer |
### The Translation Challenge
Business stakeholders speak in terms of outcomes: "We want to reduce customer churn" or "We need to optimize pricing." These high-level objectives must be translated into precise ML problem formulations.
Each translation involves choices that significantly impact the solution approach. Framing churn as binary classification is simpler than predicting days-until-churn (regression) or optimal intervention timing (reinforcement learning). The right framing depends on what decisions will actually be made with the predictions.
When stakeholders request an ML system, ask 'why' five times. 'We need a churn prediction model.' Why? 'To identify at-risk customers.' Why? 'So we can intervene before they leave.' Why? 'Because retention is cheaper than acquisition.' Why? 'Because our CAC is $200 but LTV is $2000.' Now you understand the economics that define success: if intervention costs $20 and saves 10% of addressed users, you need to target customers with >10% churn probability for ROI.
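A quick back-of-the-envelope check makes this concrete. The sketch below (plain Python, using the numbers from the example above) computes the break-even churn probability for targeting:

```python
# Break-even targeting threshold for a churn intervention, using the
# illustrative numbers from the five-whys example above.

ltv = 2000.0              # lifetime value of a retained customer ($)
intervention_cost = 20.0  # cost of one retention intervention ($)
save_rate = 0.10          # fraction of addressed at-risk customers actually retained

# Expected benefit of intervening on a customer with churn probability p:
#   p * save_rate * ltv. Intervene only when that exceeds the cost.
breakeven_p = intervention_cost / (save_rate * ltv)
print(f"Target customers with churn probability > {breakeven_p:.0%}")  # > 10%
```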
### Problem Decomposition
Complex business problems often require decomposition into multiple ML tasks. Consider a content moderation system:
High-level need: Remove harmful content from the platform.
Decomposed ML tasks:
- Detection: classify content against specific policy categories (hate speech, spam, graphic violence), each with its own precision/recall target
- Severity estimation: score how harmful a flagged item is, since not every violation warrants the same response
- Action routing: decide whether to auto-remove, auto-approve, or escalate to human review based on confidence and severity
Each sub-problem has different requirements, evaluation criteria, and acceptable error rates. Treating this as a single classification task would miss the nuanced decision-making that effective moderation requires.
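As a sketch of what this decomposition buys you, the routing logic below applies different thresholds per decision and escalates uncertain cases. The thresholds and severity labels are placeholders, not values from a real system; real values would come from the cost analysis described earlier:

```python
# Sketch: routing logic enabled by decomposing moderation into
# detection, severity estimation, and action routing.
# All thresholds below are illustrative placeholders.

def moderation_decision(p_violation: float, severity: str) -> str:
    if severity == "severe" and p_violation > 0.5:
        return "remove"        # low tolerance for severe harm
    if p_violation > 0.95:
        return "remove"        # high-confidence violations auto-removed
    if p_violation > 0.60:
        return "human_review"  # uncertain cases escalated to humans
    return "approve"

print(moderation_decision(0.7, "mild"))    # human_review
print(moderation_decision(0.6, "severe"))  # remove
```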
Defining success metrics for ML systems is deceptively complex. The challenge is bridging three distinct worlds: the business metrics that stakeholders care about, the model metrics that ML engineers optimize, and the system metrics that ensure reliable operation. These three types of metrics are interconnected but distinct, and confusion between them is a leading cause of ML project failure.
### The Metrics Alignment Problem
The fundamental challenge is that improving model metrics doesn't guarantee improving business metrics. This happens for several reasons:
Offline-Online Gap: A model trained to minimize log loss on historical data might not perform well on live traffic where data distributions shift.
Proxy Metric Divergence: Click-through rate is easy to measure but may not correlate with user satisfaction or purchase intent.
Goodhart's Law: Once a metric becomes a target, it ceases to be a good metric. Optimizing for engagement can lead to addictive design patterns that hurt long-term retention.
Simpson's Paradox: Model improvements within segments can disappear or reverse when aggregated, especially if the segment distribution shifts (a toy numeric example follows this list).
Confounding Variables: Business metrics are affected by seasonality, marketing campaigns, product changes, and competitor actions—making causal attribution difficult.
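To see Simpson's paradox in action, here is a toy comparison (invented numbers) in which model B wins in every segment yet loses in aggregate, because its traffic skews toward the harder segment:

```python
# Simpson's paradox with toy numbers: model B beats model A in each
# segment, yet loses in aggregate because of a shifted traffic mix.

def aggregate_accuracy(segments):  # segments: [(n_requests, accuracy), ...]
    total = sum(n for n, _ in segments)
    return sum(n * acc for n, acc in segments) / total

model_a = [(1000, 0.70), (9000, 0.90)]  # (mobile, desktop) traffic
model_b = [(9000, 0.72), (1000, 0.92)]  # better per segment, mostly mobile

print(aggregate_accuracy(model_a))  # 0.88
print(aggregate_accuracy(model_b))  # 0.74: worse overall despite per-segment wins
```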
| Business Metric | Model Metric | System Metric | Connection Challenge |
|---|---|---|---|
| Revenue per session | NDCG@10, Hit Rate@K | P50 latency < 100ms | Higher ranking quality should improve revenue, but depends on pricing and inventory |
| Conversion rate | Precision@recommendation | Availability > 99.9% | Good recommendations need to be shown consistently to impact conversion |
| Average order value | Cross-category exposure | Cold-boot time < 5s | Diverse recommendations may increase AOV but decrease immediate clicks |
| Customer retention | Long-term engagement score | Error rate < 0.1% | Short-term optimization can hurt long-term loyalty (e.g., recommending only deals) |
Be wary of vanity metrics that look impressive but don't connect to business value. '95% accuracy' is meaningless without context: on a dataset where 95% of examples belong to one class, a model that always predicts the majority class scores 95% while providing no value. Always ask: 'If this metric improves by X%, what changes in the real world? Can we quantify the dollar impact?'
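A minimal demonstration, using a hypothetical 5%-positive fraud dataset: the do-nothing baseline already "achieves" 95% accuracy while catching no fraud at all.

```python
# Why raw accuracy needs context: on an imbalanced dataset, a model that
# always predicts the majority class already scores high accuracy.
# The 5%-positive fraud label distribution here is hypothetical.

import numpy as np
from sklearn.metrics import accuracy_score, recall_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.05).astype(int)  # ~5% positive (fraud)
y_majority = np.zeros_like(y_true)                # "always legitimate" baseline

print(accuracy_score(y_true, y_majority))  # ~0.95, with zero business value
print(recall_score(y_true, y_majority))    # 0.0: catches no fraud at all
```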
### Establishing Metric Baselines
Before any ML work begins, establish baselines for all relevant metrics:
Simple Baselines: majority-class prediction, random guessing, or a hand-written heuristic. This is the floor any model must beat to justify its complexity.
Current State Baseline: the performance of today's process, whether human judgment, a rules engine, or a legacy model, measured on the same metric planned for the ML system.
Theoretical Ceiling: the best performance achievable given label noise and irreducible uncertainty, often estimated from inter-annotator agreement or expert human performance.
The Improvement Target: With baselines established, define the minimum viable improvement. A 2% lift in AUC might be meaningless, or it might translate to $10M in annual revenue. The business translation determines whether the project is worth pursuing.
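The translation itself is simple arithmetic once the funnel numbers are known. The sketch below uses hypothetical values; the real mapping must come from your own measurements, ideally an A/B test:

```python
# Translating a relative model-metric lift into business terms.
# All inputs are hypothetical and must be replaced with measured values.

sessions_per_year = 50_000_000
revenue_per_session = 0.80   # current average ($)
relative_lift = 0.02         # e.g., measured via A/B test

incremental_revenue = sessions_per_year * revenue_per_session * relative_lift
print(f"${incremental_revenue:,.0f} incremental revenue/year")  # $800,000
```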
ML projects typically involve stakeholders with fundamentally different perspectives, incentives, and risk tolerances. Aligning these stakeholders is not a soft skill afterthought—it's a core technical requirement. Misalignment results in projects that are technically successful but organizationally rejected.
Understanding the different stakeholder archetypes and their concerns enables proactive alignment:
| Stakeholder | Primary Concern | Common Objections | Alignment Strategy |
|---|---|---|---|
| Business Sponsor | ROI, timeline, business impact | "When will we see results? What's the cost?" | Quantify expected impact, establish milestones, define success criteria upfront |
| End Users | Usability, trust, workflow integration | "I don't trust black boxes. This will automate my job." | Involve early, demonstrate explainability, position as augmentation not replacement |
| Data Teams | Data quality, access, governance | "The data isn't ready. Privacy constraints apply." | Data audit early, define minimum viable data, address compliance proactively |
| Engineering | Integration, maintenance, reliability | "How does this fit our stack? Who maintains it?" | Design for operations from day one, involve in architecture decisions |
| Legal/Compliance | Risk, liability, regulatory adherence | "What if the model is biased? Who's liable?" | Document decision processes, plan for audits, address fairness explicitly |
| Executive Leadership | Strategic alignment, competitive position | "Why this project over others? What's the risk?" | Connect to strategic goals, present honest risk assessment, define abort criteria |
### Managing Expectations with the ML Uncertainty Principle
Unlike traditional software where you can often guarantee specific behaviors, ML systems have inherent uncertainty. Stakeholders accustomed to deterministic software need to understand this fundamental difference:
What ML can provide: calibrated probability estimates, performance guarantees in aggregate (for example, roughly 90% precision measured across thousands of predictions), and systematic improvement as more data accumulates.
What ML cannot provide: guarantees about any individual prediction, fully deterministic behavior, or 100% accuracy on real-world data.
This isn't a limitation to hide—it's a fundamental characteristic to communicate early and often. Stakeholders who expect 100% accuracy will be disappointed; stakeholders who understand they're getting probabilistic decision support will be satisfied.
Abstract discussions about precision and recall rarely resonate. Instead, prepare concrete examples: 'With 90% precision, for every 100 alerts your team investigates, 90 will be real issues and 10 will be false alarms. Is that acceptable workload?' This makes abstract metrics tangible and enables informed trade-off discussions.
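A tiny helper like the one below (a sketch, with hypothetical alert volumes) turns a precision number into the workload statement stakeholders actually need:

```python
# Turn an abstract precision metric into a stakeholder-facing workload
# statement. The daily alert volume is a hypothetical input.

def alert_workload(daily_alerts: int, precision: float) -> str:
    true_issues = round(daily_alerts * precision)
    false_alarms = daily_alerts - true_issues
    return (f"Of {daily_alerts} alerts/day, ~{true_issues} are real issues "
            f"and ~{false_alarms} are false alarms.")

print(alert_workload(100, 0.90))
```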
### The Requirements Sign-Off Process
Formalize stakeholder alignment with a requirements sign-off that includes:
- The agreed success metrics, with target values and how they will be measured
- Documented constraints and their relative priorities
- Known risks, open questions, and explicit abort criteria
- Named owners for data access, integration, and evaluation
- Written acknowledgment from each stakeholder group
This document serves as a reference point throughout the project, preventing scope creep and misunderstandings.
Data is the fuel for ML systems, and data requirements gathering is often where projects discover they can't proceed. Better to discover data gaps early—before investing in model development—than to build a sophisticated model that can't be trained or deployed due to data limitations.
The Data Requirements Assessment covers seven critical dimensions: source identity, access method, update frequency, historical depth, schema stability, quality issues, and legal status. The audit checklist below walks through each.
### The Data Audit Checklist
For each data source identified as relevant, conduct a systematic audit:
| Dimension | Questions to Answer | Documentation Required |
|---|---|---|
| Source Identity | What systems generate this data? Who owns it? | Data lineage, ownership documentation |
| Access Method | API, database query, file transfer? | Access procedures, credentials management |
| Update Frequency | Real-time, hourly, daily, weekly? | SLAs, freshness guarantees |
| Historical Depth | How far back does data exist? | Retention policies, archival status |
| Schema Stability | Does the schema change? How often? | Schema versioning, change notification |
| Quality Issues | Known problems, biases, gaps? | Data quality reports, known limitations |
| Legal Status | Consent basis, permitted uses, restrictions? | Legal review, DPA agreements |
Supervised learning requires labels, and labels are expensive. A project needing 100,000 labeled examples at $0.50/label has a $50,000 data cost before any model development. Include labeling costs, timelines, and quality assurance in your requirements. Consider: Can you use active learning to reduce labeling needs? Can you leverage weak supervision? Is unsupervised or self-supervised learning viable?
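Labeling budgets are worth computing explicitly. The sketch below extends the example above with an assumed redundancy factor for quality assurance and an assumed annotator throughput; all three inputs should be adjusted to your project:

```python
# Rough labeling budget estimate. Every figure here is an assumption:
# redundancy and throughput in particular vary widely by task.

n_examples = 100_000
cost_per_label = 0.50    # $ per label, from the example above
labels_per_example = 3   # assumed redundant labels for quality assurance
labels_per_day = 300     # assumed throughput per annotator per day

total_cost = n_examples * labels_per_example * cost_per_label
person_days = n_examples * labels_per_example / labels_per_day
print(f"Labeling cost: ${total_cost:,.0f}, effort: {person_days:,.0f} person-days")
```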
### Feature Availability Analysis
Beyond raw data, assess whether the features needed for prediction will be available at inference time:
Training-Serving Skew Risks:
- Features computed from batch warehouses during training may be unavailable, stale, or expensive to compute at inference time
- Label leakage: training features accidentally computed using information from after the prediction point
- Divergent code paths: training and serving implement "the same" feature differently, producing silently inconsistent values
Example: A churn prediction model uses 'last 30 days of activity' as a feature. During training, this is computed accurately from historical data. In production, for real-time serving, you need streaming infrastructure to maintain rolling 30-day aggregations—fundamentally different architecture.
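The training side of that feature can be made point-in-time correct with a windowed lookup like the sketch below (the event schema is hypothetical); the production side needs a streaming system that maintains the same 30-day window continuously:

```python
# A point-in-time-correct training feature: "activity in the 30 days
# before each prediction date". Computing it over the full history
# instead would leak future information into training.

import pandas as pd

events = pd.DataFrame({
    "user_id": [1, 1, 1, 2],
    "ts": pd.to_datetime(["2024-01-02", "2024-01-20", "2024-03-01", "2024-02-10"]),
})

def activity_last_30d(events: pd.DataFrame, user_id: int, as_of: pd.Timestamp) -> int:
    # Count this user's events strictly before as_of, within a 30-day window.
    window = events[(events.user_id == user_id)
                    & (events.ts < as_of)
                    & (events.ts >= as_of - pd.Timedelta(days=30))]
    return len(window)

print(activity_last_30d(events, 1, pd.Timestamp("2024-01-25")))  # 2
```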
The Feature Feasibility Matrix: For each candidate feature, assess:
- Availability: can it be computed before the prediction is needed?
- Cost: what does computing and storing it require at serving scale?
- Freshness: how stale can it be before it loses predictive value?
- Legal status: is it cleared for this use?
- Expected signal: how much predictive value does it plausibly add relative to the engineering effort?
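One lightweight way to record these assessments is a small structured record per feature, as in the sketch below; the field names are one possible schema, not a standard:

```python
# A minimal record of the feature-feasibility assessment described above.
# Field names and example values are illustrative assumptions.

from dataclasses import dataclass

@dataclass
class FeatureFeasibility:
    name: str
    available_at_inference: bool  # computable before the prediction is needed?
    freshness: str                # e.g., "real-time", "daily batch"
    compute_cost: str             # e.g., "cheap lookup", "streaming aggregate"
    legal_ok: bool                # cleared for this use by legal/compliance
    expected_signal: str          # e.g., "high (top feature in prior model)"

features = [
    FeatureFeasibility("last_30d_activity", False, "real-time needed",
                       "streaming aggregate", True, "high"),
    FeatureFeasibility("account_age_days", True, "daily batch",
                       "cheap lookup", True, "medium"),
]
blocked = [f.name for f in features if not f.available_at_inference or not f.legal_ok]
print("Needs infrastructure or legal work:", blocked)
```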
Every ML system operates within constraints—computational resources, latency requirements, budget limitations, regulatory mandates, organizational capabilities. Identifying these constraints early prevents building systems that can't be deployed or sustained.
The Constraint Categories: technical (infrastructure, integration, performance budgets), business (budget, timeline, team capabilities), and regulatory (explainability, fairness, audit requirements).
### The Iron Triangle of ML
ML systems face fundamental trade-offs that cannot be escaped—only navigated. Understanding these trade-offs enables explicit prioritization:
Accuracy vs. Latency: More complex models are often more accurate but slower. A 100-layer transformer beats a logistic regression on most tasks—but not if you need sub-millisecond response times.
Precision vs. Recall: You cannot maximize both simultaneously. A spam filter tuned for high precision rarely flags legitimate email, but it also misses more spam. Tuned for high recall, it catches nearly all spam but also flags many legitimate messages.
Freshness vs. Cost: Real-time model updates enable rapid adaptation but require expensive streaming infrastructure. Daily retraining is cheaper but responds slowly to distribution shifts.
Generalization vs. Personalization: Global models are simpler to train and deploy. Per-user models capture individual preferences but create cold-start problems and scalability challenges.
Interpretability vs. Performance: Linear models are easy to explain but limited in what they can learn. Deep networks capture complex patterns but act as black boxes.
Force stakeholders to rank constraints. Give them 100 points to distribute across: accuracy, latency, explainability, cost, time-to-market. This exercise reveals true priorities and prevents the 'everything is critical' trap that leads to impossible requirements.
A comprehensive ML requirements document synthesizes all gathered information into a single reference that guides development and enables accountability. This document should be living—updated as understanding evolves—but versioned to track how requirements changed over time.
### Standard ML Requirements Document Structure
```markdown
# ML System Requirements Document

## 1. Executive Summary
- One-paragraph description of system purpose
- Expected business impact (quantified)
- Key success metrics
- High-level timeline

## 2. Problem Definition

### 2.1 Business Context
- Current state and pain points
- Why ML is the right approach
- What alternatives were considered

### 2.2 ML Problem Formulation
- Task type (classification, regression, ranking, etc.)
- Input specification
- Output specification
- Decision that will be made with predictions

## 3. Success Criteria

### 3.1 Business Metrics
- Primary metric with target value
- Secondary metrics with targets
- Measurement methodology

### 3.2 Model Metrics
- Offline evaluation metrics
- Online evaluation metrics
- Baseline performance to beat

### 3.3 System Metrics
- Latency requirements (P50, P95, P99)
- Throughput requirements
- Availability requirements

## 4. Data Requirements

### 4.1 Training Data
- Data sources with access method
- Volume and time range
- Labeling strategy and cost

### 4.2 Serving Data
- Real-time data requirements
- Feature computation strategy
- Freshness requirements

### 4.3 Data Governance
- Privacy compliance (GDPR, CCPA, etc.)
- Retention policies
- Access controls

## 5. Constraints

### 5.1 Technical Constraints
- Infrastructure limitations
- Integration requirements
- Performance budgets

### 5.2 Business Constraints
- Budget (development + operational)
- Timeline with milestones
- Team capabilities

### 5.3 Regulatory Constraints
- Explainability requirements
- Fairness requirements
- Audit requirements

## 6. Risks and Mitigations
| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| ... | ... | ... | ... |

## 7. Stakeholder Sign-off
| Stakeholder | Role | Date | Signature |
|-------------|------|------|-----------|
| ... | ... | ... | ... |

## Appendix: Glossary
- Domain-specific terms defined
```

The value of this document isn't just in creating it—it's in sharing it. All stakeholders should have access, and the document should be referenced in design reviews, stand-ups, and retrospectives. When trade-offs are debated, point to the documented priorities. When scope creeps, reference the agreed constraints.
Experience reveals recurring patterns of requirements failure. Recognizing these anti-patterns helps you avoid them in your own projects:
- Solution-first framing: starting from "we need a recommendation engine" rather than the decision that should change
- Vanity metrics: optimizing numbers that never connect to business value
- Data optimism: assuming data is available, clean, labeled, and legally usable without auditing it
- "Everything is critical": refusing to rank constraints, producing requirements no system can satisfy
- Set-and-forget requirements: treating the document as static while the business and data evolve
The most dangerous assumption in ML requirements is that offline model performance predicts online system value. Many metrics that look great in development provide no business value in production—or even negative value. Build feedback loops from production outcomes back to requirements early.
Requirements gathering for ML systems is a discipline unto itself—distinct from traditional software requirements and deserving of serious investment. The time spent here pays dividends throughout the project lifecycle.
### What's next
With requirements gathered and documented, the next step is designing the data infrastructure that will feed your ML system. The next page explores Data Pipeline Design—how to build the data infrastructure that transforms raw data into training sets and features, enabling reliable model development and production serving.
You now understand how to systematically gather requirements for ML systems. This foundation—clear problem definition, aligned stakeholders, defined metrics, audited data, and documented constraints—enables everything that follows. Next, we design the data pipelines that make ML systems possible.