The adage "garbage in, garbage out" dramatically understates the relationship between data and ML success. In reality, data quality and availability are the primary determinants of ML project outcomes—more so than model architecture, hyperparameter tuning, or algorithmic sophistication.
Data collection planning is the systematic process of identifying, acquiring, validating, and organizing the data required for ML model development and operation. Poor planning leads to costly failures: models trained on unrepresentative data, projects stalled waiting for signals that were never captured, and quality problems discovered only after deployment.
This page provides a comprehensive framework for data collection planning that prevents these failures.
By completing this page, you will be able to: (1) Systematically analyze data requirements for ML projects, (2) Design data acquisition strategies that balance cost, quality, and timeline, (3) Implement data quality frameworks that catch issues early, (4) Build data pipelines that serve both development and production, and (5) Navigate legal and ethical considerations in data collection.
Before collecting data, you must precisely define what data is needed. This analysis bridges the problem formulation (what the model must do) and the data strategy (what information enables it).
The Data Requirements Specification:
A complete data requirements analysis answers how much data is needed, over what time range, at what quality, and from which sources. For the volume question, the table below gives rough guidance by task type:
| Task Type | Minimum Samples | Recommended Samples | Notes |
|---|---|---|---|
| Linear models | 10× features | 100× features | Simple relationships, low variance |
| Tree ensembles | 1,000+ | 10,000+ | Handles moderate complexity well |
| Deep learning (tabular) | 10,000+ | 100,000+ | Benefits from scale, needs regularization |
| Image classification | 1,000/class | 10,000/class | Transfer learning reduces requirements |
| NLP (fine-tuning) | 1,000+ | 10,000+ | Pretrained models reduce data needs |
| Rare event detection | 1,000+ positives | 10,000+ positives | Class imbalance handling critical |
Volume alone is insufficient. 1 million samples from a biased distribution are worth less than 10,000 representative samples. Always verify that training data covers the deployment distribution—including edge cases, minority segments, and temporal variations the model will encounter.
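To make that verification concrete, the following sketch compares the segment mix of a candidate training set against a recent production sample and flags large gaps. The column name, tolerance, and usage are illustrative assumptions, not a prescribed procedure.

```python
import pandas as pd

def coverage_report(train: pd.DataFrame, production: pd.DataFrame,
                    segment_col: str, tolerance: float = 0.05) -> pd.DataFrame:
    """Compare segment proportions between training data and a recent
    production sample, flagging segments whose share differs by more
    than `tolerance` (absolute difference in proportion)."""
    train_share = train[segment_col].value_counts(normalize=True)
    prod_share = production[segment_col].value_counts(normalize=True)
    report = pd.concat([train_share, prod_share], axis=1,
                       keys=["train_share", "prod_share"]).fillna(0.0)
    report["gap"] = (report["train_share"] - report["prod_share"]).abs()
    report["flagged"] = report["gap"] > tolerance
    return report.sort_values("gap", ascending=False)

# Hypothetical usage: 'customer_segment' is an illustrative column name.
# report = coverage_report(train_df, prod_df, segment_col="customer_segment")
# print(report[report["flagged"]])
```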
Data acquisition is rarely as simple as querying a database. Real-world ML projects employ multiple acquisition strategies, each with distinct tradeoffs:
Strategy 1: Historical Data Mining
Leveraging existing organizational data—logs, transactions, user interactions. This is the fastest and cheapest approach but limited by what was previously collected.
Advantages: Immediate availability, no marginal cost, natural user behavior
Challenges: Gaps in coverage, inconsistent formats, missing key signals
Strategy 2: Instrumentation and Logging
Adding new data collection to existing systems. Requires engineering investment but enables purpose-built datasets.
Advantages: Tailored to ML needs, controlled quality, forward-looking
Challenges: Delayed availability (must wait for data accumulation), engineering cost
Strategy 3: External Data Procurement
Purchasing or licensing third-party data. Provides coverage your organization lacks.
Advantages: Immediate access, broader coverage, specialized data
Challenges: Cost, licensing restrictions, integration complexity, quality uncertainty
Human labeling typically costs $0.05-$5 per example depending on complexity. For 50,000 labeled examples at $0.50 each, budget $25,000. Include quality assurance overhead (20-30% relabeling for validation). Always pilot with 500-1,000 examples before committing to full-scale labeling.
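A minimal sketch of that budget arithmetic, with the per-example cost and QA overhead treated as assumptions you would replace with actual vendor quotes:

```python
def labeling_budget(n_examples: int, cost_per_example: float,
                    qa_overhead: float = 0.25) -> float:
    """Estimate total labeling cost: base labeling plus a QA relabeling
    overhead (roughly 20-30% of examples relabeled for validation)."""
    base = n_examples * cost_per_example
    return base * (1 + qa_overhead)

# 50,000 examples at $0.50 each, plus 25% QA overhead -> $31,250
print(labeling_budget(50_000, 0.50, qa_overhead=0.25))
```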
Data quality issues are the leading cause of ML project delays and failures. A systematic quality framework catches problems before they corrupt model training.
The Six Dimensions of Data Quality:
| Dimension | Definition | Example Issues |
|---|---|---|
| Completeness | Required data is present | Missing values, dropped records |
| Accuracy | Values reflect reality | Incorrect entries, outdated info |
| Consistency | Values agree across sources | Conflicting records, format variations |
| Timeliness | Data is sufficiently current | Stale records, delayed updates |
| Validity | Values conform to rules | Out-of-range values, invalid formats |
| Uniqueness | No unintended duplicates | Duplicate records inflating counts |
```python
import pandas as pd
import numpy as np
from dataclasses import dataclass
from typing import List, Dict, Callable

@dataclass
class DataQualityReport:
    """Comprehensive data quality assessment."""
    total_records: int
    completeness_scores: Dict[str, float]   # column: % non-null
    validity_issues: Dict[str, List[str]]   # column: issue descriptions
    consistency_checks: Dict[str, bool]     # check_name: passed
    recommendations: List[str]

def run_quality_assessment(df: pd.DataFrame, config: dict) -> DataQualityReport:
    """Execute comprehensive data quality checks."""
    completeness = {}
    validity_issues = {}

    for col in df.columns:
        # Completeness
        completeness[col] = (df[col].notna().sum() / len(df)) * 100

        # Validity checks
        issues = []
        if col in config.get('numeric_columns', []):
            range_config = config.get('valid_ranges', {}).get(col)
            if range_config:
                out_of_range = df[
                    (df[col] < range_config['min']) |
                    (df[col] > range_config['max'])
                ]
                if len(out_of_range) > 0:
                    issues.append(
                        f"{len(out_of_range)} values outside valid range"
                    )
        validity_issues[col] = issues

    # Generate recommendations
    recommendations = []
    for col, score in completeness.items():
        if score < 95:
            recommendations.append(
                f"Column '{col}' has {100 - score:.1f}% missing values"
            )

    return DataQualityReport(
        total_records=len(df),
        completeness_scores=completeness,
        validity_issues=validity_issues,
        consistency_checks={},
        recommendations=recommendations
    )
```

Data quality issues discovered during modeling cost 10× more to fix than issues caught during collection. Issues discovered in production cost 100× more. Invest in quality checks upfront—automated validation, sampling audits, and schema enforcement at ingestion time.
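A hypothetical invocation of the assessment above, assuming a small config dict that names the numeric columns and their valid ranges (column names and file path are illustrative):

```python
# Hypothetical config; adapt column names and ranges to your own schema.
config = {
    "numeric_columns": ["age", "order_amount"],
    "valid_ranges": {
        "age": {"min": 0, "max": 120},
        "order_amount": {"min": 0, "max": 100_000},
    },
}

df = pd.read_csv("transactions.csv")  # illustrative file name
report = run_quality_assessment(df, config)

for col, score in report.completeness_scores.items():
    print(f"{col}: {score:.1f}% complete")
for rec in report.recommendations:
    print("RECOMMENDATION:", rec)
```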
A well-designed data pipeline serves both model development (experimentation, training, evaluation) and production operation (inference, monitoring, retraining). The pipeline must be:
- Reproducible — Given the same inputs and configuration, produce identical outputs
- Scalable — Handle growing data volumes without redesign
- Maintainable — Easy to understand, modify, and debug
- Observable — Provide visibility into data flow, quality, and freshness
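One way to make reproducibility and observability concrete is to derive a deterministic run ID from the pipeline configuration and the immutable input snapshot, so every output can be traced back to exactly what produced it. This is a sketch under those assumptions, not a prescribed implementation:

```python
import hashlib
import json

def pipeline_run_id(config: dict, input_snapshot_uri: str) -> str:
    """Deterministic run ID: hash of the sorted config plus the URI of the
    immutable input snapshot. Same inputs and config -> same ID."""
    payload = json.dumps({"config": config, "input": input_snapshot_uri},
                         sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]

# Hypothetical usage: store the ID alongside pipeline outputs and logs.
run_id = pipeline_run_id({"window_days": 30, "min_rows": 10_000},
                         "s3://bucket/snapshots/2024-01-01/")  # illustrative URI
```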
Pipeline Architecture Patterns:
| Pattern | Use Case | Latency | Complexity |
|---|---|---|---|
| Batch ETL | Historical training data | Hours | Low |
| Micro-batch | Near-real-time features | Minutes | Medium |
| Streaming | Real-time features | Seconds | High |
| Lambda | Combined batch + stream | Mixed | Very High |
| Feature Store | Unified feature serving | Mixed | Medium-High |
Among the critical pipeline components, feature computation deserves special attention:
The #1 cause of ML production failures is train-serve skew—when features computed differently in training vs. serving produce inconsistent model behavior. Use the same code path for training and serving feature computation. Feature stores enforce this consistency by design.
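A minimal sketch of the shared-code-path idea: one feature function imported by both the training job and the serving endpoint, so the computation cannot drift between environments. Module and feature names are illustrative.

```python
# features.py -- a single source of truth for feature computation (illustrative).
import math

def compute_features(raw: dict) -> dict:
    """Turn one raw record into model features. Imported by BOTH the
    training pipeline and the serving endpoint, so each feature has
    exactly one definition and cannot drift between environments."""
    return {
        "order_amount_log": math.log1p(raw["order_amount"]),
        "is_weekend": raw["day_of_week"] in (5, 6),
    }

# Training job:  features = [compute_features(r) for r in historical_records]
# Serving code:  features = compute_features(incoming_request_payload)
```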
Data collection for ML operates within a complex web of legal requirements and ethical obligations. Violations can result in regulatory penalties, reputational damage, and model invalidation.
Key Regulatory Frameworks:
| Regulation | Jurisdiction | Key Requirements |
|---|---|---|
| GDPR | European Union | Consent, right to erasure, data minimization |
| CCPA/CPRA | California | Disclosure, opt-out rights, purpose limitation |
| HIPAA | US Healthcare | Protected health information handling |
| FCRA | US Credit | Permissible purpose, adverse action notices |
Data Governance Checklist:
✓ Purpose limitation — Data used only for stated purposes
✓ Data minimization — Collect only what's necessary
✓ Consent management — Proper consent obtained and tracked
✓ Access controls — Data accessible only to authorized personnel
✓ Retention policies — Data deleted when no longer needed
✓ Audit trails — Data access and modifications logged
Beyond legal compliance, ethical data collection requires: (1) No collection of sensitive attributes without explicit justification, (2) Transparency about data use with affected parties, (3) Consideration of downstream harms from model predictions, (4) Fairness audits for demographic imbalances in training data. Legal ≠ ethical. Both are required.
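As an illustration of the fairness-audit point, the sketch below reports each group's share of the training data and flags groups below a chosen floor; the column name and threshold are assumptions to adapt to your context.

```python
import pandas as pd

def demographic_audit(df: pd.DataFrame, group_col: str,
                      min_share: float = 0.05) -> pd.DataFrame:
    """Report each group's share of the training data and flag groups
    below `min_share`, which may need targeted collection or reweighting."""
    shares = df[group_col].value_counts(normalize=True).rename("share").to_frame()
    shares["under_represented"] = shares["share"] < min_share
    return shares

# Hypothetical usage: 'region' is an illustrative demographic column.
# print(demographic_audit(train_df, group_col="region"))
```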
You now understand systematic data collection planning for ML projects. Next, we'll explore experiment management—the disciplined practice of organizing, tracking, and learning from the many experiments that ML development requires.