Every successful machine learning project begins not with data, not with algorithms, and not with code—but with a precisely formulated problem. Problem formulation is the art and science of translating vague business objectives into concrete, measurable machine learning tasks. It is, without exaggeration, the single most consequential decision in any ML endeavor.
Yet this critical step is routinely overlooked. Teams dive into data exploration, debate model architectures, and tune hyperparameters—only to discover months later that they've been solving the wrong problem entirely. The algorithm performs brilliantly on the metrics they chose, but the business sees no value. The model achieves state-of-the-art accuracy, but users don't adopt the feature.
Problem formulation failure is the leading cause of ML project failure. Not bad data. Not algorithmic complexity. Not infrastructure challenges. Simply: solving the wrong problem.
By the end of this page, you will understand how to systematically translate business problems into ML problems, define appropriate target variables and evaluation metrics, identify the type of ML problem you're facing, and avoid the most common formulation pitfalls that derail projects before they even begin.
The journey from a business problem to an ML problem involves several translation layers. Consider a simple-sounding request:
"We want to reduce customer churn."
This is a business objective—a desired outcome expressed in business terms. It cannot be directly optimized by a machine learning algorithm. To make it actionable, we must perform a series of translations:
Step 1: Define what 'churn' means precisely
Churn seems obvious, but is a customer churned when they:
- Explicitly cancel their subscription?
- Haven't logged in for 30 days?
- Downgrade to a free tier?
- Let an annual renewal lapse?
Each definition leads to different datasets, different models, and different interventions. A customer who hasn't logged in for 30 days but renews their annual subscription isn't churned—but a 30-day inactivity definition would label them as such.
Step 2: Determine what can be predicted
ML models predict outcomes. For churn, we must decide what form the prediction takes:
- A binary outcome: will this customer churn within the next 60 days?
- A probability: how likely is this customer to churn?
- A time-to-event estimate: how long until this customer churns?
Step 3: Define the prediction window
When do we make predictions, and when must the outcome occur to count? For example, we might score every active customer on the first of each month and count a cancellation only if it occurs within the following 60 days; training labels must be constructed the same way.
Step 4: Determine actionability
A prediction is only valuable if it enables action: Is there enough lead time to intervene before the outcome occurs? Do effective interventions exist (discounts, outreach, product fixes)? Can the team consuming the predictions actually reach the flagged customers?
Business Objective → Operational Definition → ML Task Type → Prediction Target → Prediction Window → Actionability Check. Skip any step, and the project drifts from business value.
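To make these steps concrete, here is a minimal sketch in pandas of turning the 60-day churn definition (the first row of the table below) into training labels. The `subs` frame and its column names are hypothetical; the point is that the operational definition, the prediction date, and the window all appear explicitly in code.

```python
import pandas as pd

# Hypothetical subscription records: one row per customer, with their
# cancellation date (NaT if still active).
subs = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "cancel_date": pd.to_datetime(["2024-03-10", None, "2024-06-01"]),
})

prediction_date = pd.Timestamp("2024-02-01")  # when we score customers
window = pd.Timedelta(days=60)                # outcome must occur in this window

# Operational definition: churned = cancelled within 60 days of prediction.
subs["churned_60d"] = (
    (subs["cancel_date"] > prediction_date)
    & (subs["cancel_date"] <= prediction_date + window)
).astype(int)

print(subs[["customer_id", "churned_60d"]])
# customer 1 cancels on 2024-03-10 (inside the window)  -> label 1
# customer 2 never cancels                              -> label 0
# customer 3 cancels on 2024-06-01 (outside the window) -> label 0
```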
| Business Objective | Operational Definition | ML Task | Key Considerations |
|---|---|---|---|
| Reduce customer churn | Predict if subscription cancelled within 60 days | Binary classification | When to predict, intervention window, what actions decrease churn |
| Increase sales revenue | Predict purchase probability for each user/product pair | Recommendation/Ranking | Which products to recommend, inventory constraints, margin optimization |
| Improve content quality | Classify content as benign or into one of several harm categories | Multi-class classification | Definition of 'harmful', false positive tolerance, appeals process |
| Reduce fraud losses | Predict if transaction is fraudulent | Anomaly detection / Classification | Cost asymmetry (false positives vs false negatives), real-time requirements |
| Optimize pricing | Predict demand at various price points | Regression / Causal inference | Price sensitivity, competition, elasticity modeling |
The target variable (also called the label, response, or dependent variable) is what your model learns to predict. Defining it correctly is perhaps the most consequential technical decision in an ML project.
The target variable must satisfy several properties:
1. Observable at Training Time
You can only train a model to predict something you can measure in your historical data. If you want to predict 'customer satisfaction,' you need historical satisfaction measurements (surveys, NPS scores, support tickets). If those don't exist, you must use a proxy—which introduces potential mismatch between what you optimize and what you actually care about.
2. Observable Within a Useful Timeframe After Prediction
For supervised learning, you need ground truth labels to evaluate your model. If your target takes 2 years to observe (e.g., 'will this customer remain active for 24 months?'), you may need to redefine the problem or accept that model evaluation will be delayed.
3. Aligned with Business Value
Optimizing for click-through rate when you care about conversions leads to models that generate clicks but not revenue. Optimizing for accuracy when false positives and false negatives have vastly different costs leads to models that are mathematically impressive but operationally useless.
4. Stable and Consistent
If your definition of the target changes over time (e.g., the company redefines 'active user'), historical labels become inconsistent with future labels, degrading model performance.
Often, the quantity you truly care about cannot be directly measured. You must use a proxy—a measurable quantity that correlates with your true objective. But proxies are imperfect. Optimizing engagement (measurable) as a proxy for user satisfaction (unmeasurable) can lead to addictive features that maximize time-on-site while decreasing long-term user wellbeing.
Once you've defined your target variable, you must identify which type of machine learning problem you're solving. This classification determines which algorithms to consider, how to evaluate performance, and what training procedures to apply.
The ML Task Taxonomy:
Machine learning problems generally fall into one of several categories, each with distinct characteristics:
| Task Type | Target Variable | Output | Example Applications |
|---|---|---|---|
| Binary Classification | One of two categories | Class label or probability | Spam detection, fraud detection, churn prediction, medical diagnosis |
| Multi-class Classification | One of K categories (K > 2) | Class label or probability distribution | Document categorization, image recognition, intent classification |
| Multi-label Classification | Zero or more of K categories | Set of labels or probabilities | Tag prediction, document topics, medical conditions (patient can have multiple) |
| Regression | Continuous numeric value | Numeric prediction | Price prediction, demand forecasting, age estimation, stock returns |
| Ordinal Regression | Ordered categories | Ordinal class | Rating prediction (1-5 stars), education level, disease severity |
| Ranking | Relative ordering of items | Ranked list or scores | Search results, recommendation systems, ad placement |
| Clustering | None (unsupervised) | Cluster assignments | Customer segmentation, anomaly detection, topic discovery |
| Sequence Prediction | Sequence of values/tokens | Next token(s) or full sequence | Language modeling, time series forecasting, machine translation |
| Structured Prediction | Complex structured output | Trees, graphs, sequences | Named entity recognition, parsing, image segmentation |
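One distinction from the table that often trips people up is multi-class versus multi-label. A small sketch using scikit-learn encoders (the category and tag values are invented for illustration):

```python
from sklearn.preprocessing import LabelEncoder, MultiLabelBinarizer

# Multi-class: each document gets exactly one category.
multiclass_labels = ["sports", "politics", "sports", "tech"]
print(LabelEncoder().fit_transform(multiclass_labels))  # [1 0 1 2]

# Multi-label: each document gets zero or more tags.
multilabel_tags = [["ml", "python"], ["finance"], [], ["ml"]]
print(MultiLabelBinarizer().fit_transform(multilabel_tags))
# One column per tag; a row can contain several 1s, or none at all.
```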
Choosing the Right Task Type:
The same business problem can often be formulated as different ML tasks. Consider predicting house prices:
- Regression: predict the exact sale price as a continuous number.
- Classification: predict a price bucket (e.g., under $300k, $300k to $500k, over $500k).
- Ranking: order listings by predicted value relative to one another.
Each formulation has tradeoffs:
| Formulation | Advantages | Disadvantages |
|---|---|---|
| Regression | Precise predictions, directly usable | Harder to optimize, sensitive to outliers |
| Classification | Simpler problem, robust to label noise | Loses precision, bucket boundaries are arbitrary |
| Ranking | Matches user behavior (comparing options) | Doesn't provide absolute price, harder to evaluate |
The right choice depends on how predictions will be used. If users need exact prices for budgeting, regression is essential. If the goal is to surface relevant properties, ranking may suffice.
If a problem can be formulated as classification or regression, and you're uncertain which is better, start with classification. Classification problems are generally easier to optimize, easier to evaluate, and easier to interpret. You can always add complexity later.
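To see the regression and classification formulations side by side, here is a sketch that derives both targets from the same hypothetical price data; the bucket edges are arbitrary placeholders, which is exactly the disadvantage noted in the table above.

```python
import pandas as pd

prices = pd.Series([215_000, 340_000, 480_000, 750_000], name="price")

# Regression formulation: the target is the price itself.
y_regression = prices

# Classification formulation: the target is a price bucket.
# The bucket boundaries are arbitrary, a disadvantage noted above.
y_classification = pd.cut(
    prices,
    bins=[0, 300_000, 500_000, float("inf")],
    labels=["under_300k", "300k_to_500k", "over_500k"],
)

print(pd.concat([y_regression, y_classification.rename("bucket")], axis=1))
```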
A metric is how you measure success. In ML, we distinguish between optimization metrics (what the algorithm directly optimizes) and evaluation metrics (what we use to assess model quality).
The Metric Hierarchy:
Business Metrics — Revenue, conversion rate, customer lifetime value, NPS. These are what the business ultimately cares about but often can't be directly optimized (too slow to observe, too noisy, confounded by other factors).
Evaluation Metrics — Metrics used to assess model quality and select between models. Should be as close to business metrics as possible. Examples: precision, recall, RMSE, AUC-ROC.
Optimization Metrics — The loss function that the model directly minimizes during training. Examples: cross-entropy, mean squared error, hinge loss.
The goal is to choose optimization and evaluation metrics that are aligned with business metrics—when the model improves on evaluation metrics, business outcomes improve too.
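A minimal sketch of this hierarchy in scikit-learn terms, using synthetic data: logistic regression minimizes log loss internally (the optimization metric), while we compare candidate models on AUC-ROC (an evaluation metric). The business metric lives outside the code entirely.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2_000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Optimization metric: LogisticRegression minimizes log loss during training.
model = LogisticRegression(max_iter=1_000).fit(X_tr, y_tr)

proba = model.predict_proba(X_te)[:, 1]
print("log loss (optimization):", log_loss(y_te, proba))
print("AUC-ROC  (evaluation): ", roc_auc_score(y_te, proba))
# The business metric (e.g., revenue retained) is observed later, outside
# this code, and is what both numbers above should ultimately track.
```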
Choosing Evaluation Metrics:
| Task Type | Common Metrics | When to Use |
|---|---|---|
| Binary Classification | Accuracy, Precision, Recall, F1, AUC-ROC, AUC-PR, Log Loss | AUC-ROC for balanced data; AUC-PR for imbalanced; Precision/Recall when costs are asymmetric |
| Multi-class Classification | Accuracy, Macro/Micro F1, Top-K Accuracy, Confusion Matrix | Top-K when near-misses are acceptable; Macro F1 for balanced evaluation across classes |
| Regression | MSE, RMSE, MAE, MAPE, R² | MAE for interpretability; RMSE when large errors are costly; MAPE for percentage-based assessment |
| Ranking | NDCG, MAP, MRR, Precision@K | NDCG when position matters; MRR for single-item retrieval; Precision@K for top-K only |
| Clustering | Silhouette Score, Adjusted Rand Index, NMI | Silhouette when no labels; ARI/NMI when ground truth available |
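The table's warning about imbalanced data is easy to demonstrate. In this sketch with synthetic labels, a degenerate model that always predicts the negative class scores roughly 99% accuracy while AUC-PR collapses to the base rate:

```python
import numpy as np
from sklearn.metrics import accuracy_score, average_precision_score

rng = np.random.default_rng(0)
y_true = (rng.random(10_000) < 0.01).astype(int)  # ~1% positive class

# A degenerate "model" that always predicts the negative class.
y_pred = np.zeros_like(y_true)
scores = np.zeros(len(y_true), dtype=float)

print("accuracy:", accuracy_score(y_true, y_pred))           # ~0.99
print("AUC-PR:  ", average_precision_score(y_true, scores))  # ~0.01
```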
"When a measure becomes a target, it ceases to be a good measure." If you optimize aggressively for a proxy metric, the model may 'game' the metric while failing to deliver business value. Always validate that improvements in your chosen metric correspond to real-world improvements.
Cost-Sensitive Metrics:
For many problems, not all errors are equal. In fraud detection:
- A false negative (missed fraud) can cost the full transaction amount plus chargeback fees.
- A false positive (blocked legitimate transaction) costs customer friction and, potentially, a lost customer.

For medical diagnosis:
- A false negative (missed disease) can delay treatment, with potentially life-threatening consequences.
- A false positive triggers unnecessary follow-up testing, anxiety, and expense.
When errors have asymmetric costs, standard metrics like accuracy are misleading. You need cost-sensitive evaluation:
Expected Cost = Σ (Cost of Error Type × Probability of Error Type)
This requires estimating the actual business costs of each error type, which often involves conversations with business stakeholders who may have never quantified these costs before.
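A sketch of that calculation for a fraud-style problem. The dollar costs and the toy data are placeholders for the numbers stakeholders would supply:

```python
import numpy as np

# Placeholder business costs (per transaction).
COST_FN = 500.0  # missed fraud: lose the transaction amount
COST_FP = 10.0   # blocked legitimate transaction: support cost + friction

def expected_cost(y_true, proba, threshold):
    """Expected Cost = sum over error types of (cost x error probability)."""
    y_pred = (proba >= threshold).astype(int)
    fn_rate = np.mean((y_true == 1) & (y_pred == 0))
    fp_rate = np.mean((y_true == 0) & (y_pred == 1))
    return COST_FN * fn_rate + COST_FP * fp_rate

# With asymmetric costs, the best threshold is rarely 0.5.
y_true = np.array([0, 0, 0, 0, 1])
proba = np.array([0.1, 0.2, 0.3, 0.6, 0.4])
for t in (0.3, 0.5):
    print(f"threshold={t}: expected cost per txn = {expected_cost(y_true, proba, t):.2f}")
# threshold=0.3 trades two cheap false positives for catching the fraud,
# so it beats threshold=0.5 despite lower "accuracy".
```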
Problem formulation isn't complete until you've documented the constraints under which the solution must operate. These constraints often eliminate entire classes of solutions and fundamentally shape which approaches are viable.
Categories of Constraints:
- Latency and throughput: how quickly, and at what volume, must predictions be served?
- Interpretability: must predictions be explainable to users, auditors, or regulators?
- Deployment environment: memory, compute, and connectivity limits (e.g., mobile or on-device inference).
- Data and privacy: which attributes are off-limits, and which regulations apply?
- Label availability: how long after prediction does ground truth arrive?
Document constraints at the beginning of a project, not the end. A model that can't meet latency requirements is useless regardless of its accuracy. Knowing constraints early prevents wasted effort on infeasible approaches.
| Constraint | Impact on Model Selection | Typical Tradeoffs |
|---|---|---|
| Latency < 10ms | Must use simple models, optimized inference, caching | May sacrifice accuracy for speed |
| Must be interpretable | Limited to linear models, decision trees, rule-based systems | May sacrifice accuracy for explainability |
| Must run on mobile | Must use quantized, small models | May sacrifice accuracy for size |
| No demographic data allowed | Cannot use protected attributes directly; must audit for indirect usage | May sacrifice some predictive power for fairness |
| Labels available only after 60 days | Model retraining is delayed; must monitor for drift | May need to use proxy labels or shorten observation window |
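Constraints like the latency budget in the first row are cheap to verify early. A minimal sketch that measures per-request inference latency for a stand-in model (the model and feature shape are placeholders):

```python
import time
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in model and a single-request payload.
model = LogisticRegression(max_iter=1_000).fit(
    rng.random((500, 20)), rng.integers(0, 2, 500)
)
x = rng.random((1, 20))

latencies_ms = []
for _ in range(1_000):
    start = time.perf_counter()
    model.predict_proba(x)
    latencies_ms.append((time.perf_counter() - start) * 1_000)

print(f"p99 latency: {np.percentile(latencies_ms, 99):.3f} ms")
# If p99 already exceeds the budget, rule the approach out now,
# before any accuracy tuning.
```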
Edge Cases and Failure Modes:
Problem formulation should also anticipate edge cases and failure modes:
- What should the system do when features are missing or malformed?
- How are brand-new users or items handled (the cold-start problem)?
- What happens on inputs unlike anything in the training data?
- Is there a safe fallback when the model itself is unavailable?
Documenting these upfront ensures the system is designed for robustness, not just average-case performance.
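One lightweight way to bake that robustness into the design is an explicit guard with a documented fallback. A sketch, where the required feature names and the fallback score are hypothetical:

```python
# Hypothetical required features and fallback for a churn scorer.
REQUIRED_FEATURES = ("tenure_days", "logins_last_30d", "support_tickets")
FALLBACK_SCORE = 0.5  # neutral score when we cannot predict responsibly

def score_customer(features: dict, model) -> float:
    """Return a churn score, with an explicit fallback on bad input."""
    # Edge case: missing features (e.g., a brand-new customer).
    if any(f not in features for f in REQUIRED_FEATURES):
        return FALLBACK_SCORE
    row = [[features[f] for f in REQUIRED_FEATURES]]
    return float(model.predict_proba(row)[0][1])
```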
To systematize problem formulation, consider using a Problem Formulation Canvas—a structured template that captures all critical decisions before any modeling begins.
The Canvas Elements:
- Business objective: the outcome the business wants, in business terms.
- Operational definition: the precise, measurable version of that outcome.
- ML task type: classification, regression, ranking, and so on.
- Target variable: exactly what the model predicts, and how labels are obtained.
- Prediction window: when predictions are made and when outcomes must occur to count.
- Evaluation metrics: how model quality will be measured, and how that maps to business value.
- Constraints: latency, interpretability, regulatory, and data limits.
- Actionability: what decisions or interventions the predictions drive.
- Edge cases and failure modes: known gaps and fallback behavior.
The problem formulation canvas isn't filled out once and forgotten. It's a living document that evolves as you learn more about the problem, data, and business context. Revisit and refine it throughout the project.
Even experienced ML practitioners fall into formulation traps. Here are the most common, and how to avoid them:
- Optimizing a proxy that diverges from business value: validate that metric gains translate into real-world gains.
- Choosing a target that isn't observable in historical data: confirm labels exist, or can be collected, before modeling.
- Ignoring the prediction window: make explicit when predictions happen and when outcomes count.
- Skipping the actionability check: a perfect prediction nobody can act on is worthless.
- Discovering constraints at the end: document latency, interpretability, and regulatory limits up front.
- Letting the target definition drift: redefining the label mid-project makes historical labels inconsistent with future ones.
The most expensive pitfall isn't getting the wrong answer—it's answering the wrong question. Months of effort can be wasted building a technically excellent model for a poorly formulated problem. Prevention is the only cure: invest heavily in problem formulation before writing a single line of model code.
Problem formulation is where ML projects are won or lost. It is not a mechanical process but a thoughtful translation between business needs and technical capabilities. Let's consolidate the key lessons:
- Translate business objectives through every layer: operational definition, task type, target, window, actionability.
- Define the target variable precisely; it must be observable, stable, and aligned with business value.
- Choose evaluation metrics that track business metrics, and beware proxy gaming (Goodhart's law).
- Account for asymmetric error costs explicitly rather than defaulting to accuracy.
- Document constraints and edge cases at the start of the project, not the end.
What's Next:
With a well-formulated problem, you're ready to tackle the next stage of the ML pipeline: Data Collection and Preparation. This is where the theoretical problem meets messy reality—where you'll discover what data you actually have, what shape it's in, and how much work is required to make it model-ready.
The quality of your problem formulation directly impacts data work: a precisely defined target variable makes labeling straightforward; clear feature requirements guide data collection; documented constraints inform data governance.
You now understand how to formulate ML problems rigorously. You can translate business objectives into ML tasks, define target variables with precision, select appropriate metrics, document constraints, and avoid common formulation pitfalls. This foundation will serve you throughout your ML career—every project begins here.