The oldest adage in computing is brutally accurate for machine learning: garbage in, garbage out. No algorithm, no matter how sophisticated, can extract valid patterns from corrupted, inconsistent, or erroneous data. Data scientists routinely spend 60-80% of their time on data cleaning—detecting issues, understanding their causes, and implementing fixes.
The automation imperative: If data cleaning consumes most of the work, automating it offers enormous leverage. But the challenge is immense: data quality issues are numerous, subtle, and domain-specific. AutoML systems must detect problems they cannot anticipate, fix issues without human guidance, and validate that fixes don't introduce new problems. This page explores how modern AutoML approaches this fundamental challenge.
By the end of this page, you will understand: the taxonomy of data quality issues, automated detection strategies for each issue type, remediation approaches and their tradeoffs, validation strategies to ensure cleaning improves data, and how AutoML systems orchestrate data cleaning pipelines.
Data quality issues fall into several distinct categories, each requiring different detection and remediation strategies. Understanding this taxonomy is the first step toward automated cleaning.
The Six Dimensions of Data Quality:
| Dimension | Description | Common Issues | Detection Approach |
|---|---|---|---|
| Completeness | All required data is present | Missing values, incomplete records | Null counts, coverage analysis |
| Accuracy | Data correctly represents reality | Typos, measurement errors, outdated values | Range checks, cross-validation |
| Consistency | Data doesn't contradict itself | Conflicting values, format variations | Constraint checking, deduplication |
| Validity | Data conforms to expected formats | Invalid emails, impossible dates | Regex, schema validation |
| Uniqueness | No unintended duplicates | Duplicate records, ID collisions | Exact/fuzzy matching |
| Timeliness | Data is sufficiently current | Stale data, outdated references | Timestamp analysis, freshness checks |
What constitutes 'bad' data is domain-specific. An age of 150 is clearly wrong for humans but reasonable for tortoises. A temperature of -40°C is normal in Siberia but suspicious in Singapore. AutoML data cleaning must incorporate domain constraints, either learned from data distributions or provided as external knowledge.
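One way to operationalize externally supplied domain knowledge is a small range validator driven by per-column bounds. The `validate_ranges` helper and the bounds themselves are illustrative assumptions for this sketch, not a standard AutoML API:

```python
import pandas as pd

def validate_ranges(df: pd.DataFrame, constraints: dict) -> pd.DataFrame:
    """Flag values outside domain-supplied (min, max) bounds per column."""
    violations = {}
    for col, (lo, hi) in constraints.items():
        if col in df.columns:
            # True wherever the value falls outside the allowed range
            violations[col] = ~df[col].between(lo, hi)
    return pd.DataFrame(violations)

# Domain knowledge: human ages and body temperatures have hard limits
patients = pd.DataFrame({'age': [34, 150, 61], 'temp_c': [36.8, 37.1, 98.6]})
flags = validate_ranges(patients, {'age': (0, 120), 'temp_c': (30, 45)})
print(flags.sum())  # one violation per column: age 150, temp_c 98.6
```

The same mechanism extends naturally to learned bounds: replace the hand-written constraints with percentile-based limits fitted on trusted historical data.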
Outliers are observations that deviate significantly from the majority. They may be errors (measurement malfunction, data entry mistake) or genuine extremes (rare but real events). The challenge: distinguishing these cases automatically.
Detection Approaches:
- **Z-Score Method**: Flag values more than k standard deviations from the mean.
- **IQR Method**: Flag values beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR.
- **Modified Z-Score**: Use the median and MAD (median absolute deviation) instead of the mean and standard deviation.
```python
import numpy as np
import pandas as pd
from scipy import stats


class AutoMLOutlierDetector:
    """
    Automated outlier detection with multiple strategies.
    """

    def __init__(
        self,
        method: str = 'auto',
        threshold: float = 3.0,
        iqr_multiplier: float = 1.5
    ):
        self.method = method
        self.threshold = threshold
        self.iqr_multiplier = iqr_multiplier
        self.outlier_masks_ = {}
        self.bounds_ = {}

    def _detect_zscore(self, x: np.ndarray) -> np.ndarray:
        """Z-score based outlier detection."""
        z = np.abs(stats.zscore(x, nan_policy='omit'))
        return z > self.threshold

    def _detect_iqr(self, x: np.ndarray) -> np.ndarray:
        """IQR-based outlier detection."""
        q1, q3 = np.nanpercentile(x, [25, 75])
        iqr = q3 - q1
        lower = q1 - self.iqr_multiplier * iqr
        upper = q3 + self.iqr_multiplier * iqr
        return (x < lower) | (x > upper)

    def _detect_modified_zscore(self, x: np.ndarray) -> np.ndarray:
        """Modified z-score using median and MAD."""
        median = np.nanmedian(x)
        mad = np.nanmedian(np.abs(x - median))
        # Avoid division by zero
        mad = mad if mad > 0 else 1e-10
        modified_z = 0.6745 * (x - median) / mad
        return np.abs(modified_z) > self.threshold

    def _select_method(self, x: np.ndarray) -> str:
        """Auto-select detection method based on distribution."""
        if self.method != 'auto':
            return self.method
        # Check for normality
        _, p_value = stats.normaltest(x[~np.isnan(x)])
        if p_value > 0.05:
            return 'zscore'  # Approximately normal
        else:
            skewness = stats.skew(x, nan_policy='omit')
            if abs(skewness) > 1:
                return 'modified_zscore'  # Highly skewed
            else:
                return 'iqr'  # Non-normal but not too skewed

    def fit(self, X: pd.DataFrame) -> 'AutoMLOutlierDetector':
        """Detect outliers in each numeric column."""
        for col in X.select_dtypes(include=[np.number]).columns:
            values = X[col].values
            method = self._select_method(values)

            if method == 'zscore':
                mask = self._detect_zscore(values)
            elif method == 'iqr':
                mask = self._detect_iqr(values)
            else:
                mask = self._detect_modified_zscore(values)

            self.outlier_masks_[col] = mask

            # Store bounds for later use
            if method == 'iqr':
                q1, q3 = np.nanpercentile(values, [25, 75])
                iqr = q3 - q1
                self.bounds_[col] = (
                    q1 - self.iqr_multiplier * iqr,
                    q3 + self.iqr_multiplier * iqr
                )
        return self

    def get_outlier_report(self) -> pd.DataFrame:
        """Summary of outliers detected per column."""
        return pd.DataFrame({
            'column': list(self.outlier_masks_.keys()),
            'n_outliers': [m.sum() for m in self.outlier_masks_.values()],
            'pct_outliers': [
                m.sum() / len(m) * 100
                for m in self.outlier_masks_.values()
            ]
        }).sort_values('n_outliers', ascending=False)
```

The right outlier strategy depends on context. For tree-based models (random forests, XGBoost), outliers often don't matter: splits are rank-based, so trees are robust. For linear models or neural networks, outliers can dominate the loss and need handling. AutoML should consider the downstream model when deciding how aggressively to treat outliers.
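To see why the choice of detector matters, compare z-score and IQR flags on a small skewed sample. This standalone sketch uses the same formulas described above:

```python
import numpy as np

# A small sample with one extreme value
x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])

# Z-score: the extreme value inflates the mean and std it is measured
# against, so at threshold 3 it escapes detection (the "masking" effect)
z = np.abs((x - x.mean()) / x.std())
z_flags = z > 3.0

# IQR: the quartiles ignore the extreme value, so it is flagged
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print(z_flags.sum(), iqr_flags.sum())  # 0 1
```

The z-score method misses the obvious outlier entirely because the outlier itself distorts the statistics used to judge it, which is exactly why the auto-selection logic above prefers robust methods for non-normal data.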
Duplicates inflate metrics, bias models, and waste computation. They range from exact copies to near-duplicates that require fuzzy matching to detect.
Types of Duplicates:
```python
import pandas as pd
import numpy as np
from typing import List, Tuple


class AutoMLDuplicateDetector:
    """
    Automated duplicate detection with exact and fuzzy matching.
    """

    def __init__(
        self,
        key_columns: List[str] = None,
        similarity_threshold: float = 0.9
    ):
        self.key_columns = key_columns
        self.similarity_threshold = similarity_threshold
        self.duplicate_groups_ = []

    def find_exact_duplicates(self, df: pd.DataFrame) -> pd.Series:
        """Boolean mask marking exact duplicate rows (first occurrence kept)."""
        return df.duplicated(keep='first')

    def find_key_duplicates(
        self,
        df: pd.DataFrame,
        key_columns: List[str] = None
    ) -> pd.Series:
        """Find rows with duplicate keys but different values."""
        keys = key_columns or self.key_columns
        if keys is None:
            return pd.Series([False] * len(df), index=df.index)

        # Find duplicate keys
        key_duplicates = df.duplicated(subset=keys, keep=False)

        # Among key duplicates, find those with differing non-key values
        key_dup_df = df[key_duplicates]
        non_key_cols = [c for c in df.columns if c not in keys]
        conflicts = []
        for _, group in key_dup_df.groupby(keys):
            if len(group) > 1:
                if group[non_key_cols].drop_duplicates().shape[0] != 1:
                    conflicts.extend(group.index.tolist())

        return pd.Series(df.index.isin(conflicts), index=df.index)

    def find_near_duplicates_text(
        self,
        df: pd.DataFrame,
        text_column: str,
        threshold: float = None
    ) -> List[Tuple[int, int, float]]:
        """
        Find near-duplicate text entries using similarity.
        Returns list of (idx1, idx2, similarity) tuples.
        Note: pairwise comparison is O(n^2); use blocking for large data.
        """
        from difflib import SequenceMatcher

        threshold = threshold or self.similarity_threshold
        texts = df[text_column].astype(str).tolist()
        near_dupes = []
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                similarity = SequenceMatcher(
                    None, texts[i], texts[j]
                ).ratio()
                if similarity >= threshold:
                    near_dupes.append((i, j, similarity))
        return near_dupes

    def deduplicate(
        self,
        df: pd.DataFrame,
        strategy: str = 'first',
        key_columns: List[str] = None
    ) -> pd.DataFrame:
        """
        Remove duplicates with specified strategy.

        Strategies:
        - 'first': Keep first occurrence
        - 'last': Keep last occurrence
        - 'most_complete': Keep row with fewest missing values
        """
        keys = key_columns or self.key_columns

        if strategy in ['first', 'last']:
            return df.drop_duplicates(subset=keys, keep=strategy)
        elif strategy == 'most_complete':
            # Sort so the most complete row in each group comes last, keep it
            df = df.copy()
            df['_missing_count'] = df.isnull().sum(axis=1)
            df = df.sort_values('_missing_count', ascending=False)
            df = df.drop_duplicates(subset=keys, keep='last')
            df = df.drop('_missing_count', axis=1)
            return df
        return df


def remove_duplicates_for_ml(
    X: pd.DataFrame,
    y: pd.Series = None
) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Remove duplicates considering both features and target.

    For ML, we typically want to remove exact duplicates but
    keep near-duplicates (different training examples).
    """
    if y is not None:
        combined = X.copy()
        combined['_target'] = y
        combined = combined.drop_duplicates()
        y_clean = combined['_target']
        X_clean = combined.drop('_target', axis=1)
        return X_clean, y_clean
    else:
        return X.drop_duplicates(), None
```

If the same or very similar examples appear in both training and test sets, you have data leakage: the model memorizes training examples and 'recognizes' them at test time, inflating metrics. Always deduplicate before splitting, or check for cross-set duplicates after splitting.
Raw data often arrives with incorrect types: numbers as strings, dates as integers, booleans as 0/1. Additionally, the same information may be represented in multiple formats. Automated cleaning must detect and correct these issues.
Common Type/Format Issues:
| Issue | Example | Detection | Correction |
|---|---|---|---|
| Numeric as string | '123', '45.6' | Regex pattern match | pd.to_numeric() |
| Date as string | '2024-01-15', '01/15/24' | Date pattern match | pd.to_datetime() |
| Boolean as int | 0, 1 | Only 0/1 values | astype(bool) |
| Category as object | 'red', 'blue', 'green' | Low cardinality | astype('category') |
| Mixed types | ['123', 456, 'N/A'] | Type diversity check | Coerce with errors='coerce' |
```python
import pandas as pd
import numpy as np
import re
from typing import Dict, Any


class AutoMLTypeCorrector:
    """
    Automated data type detection and correction.
    """

    # Date patterns to try
    DATE_PATTERNS = [
        r'\d{4}-\d{2}-\d{2}',  # YYYY-MM-DD
        r'\d{2}/\d{2}/\d{4}',  # MM/DD/YYYY
        r'\d{2}-\d{2}-\d{4}',  # DD-MM-YYYY
        r'\d{4}/\d{2}/\d{2}',  # YYYY/MM/DD
    ]

    def __init__(
        self,
        categorical_threshold: int = 20,
        numeric_threshold: float = 0.9
    ):
        self.categorical_threshold = categorical_threshold
        self.numeric_threshold = numeric_threshold
        self.type_corrections_ = {}

    def _is_numeric_string(self, series: pd.Series) -> bool:
        """Check if string column can be converted to numeric."""
        if series.dtype != 'object':
            return False
        # Try conversion
        converted = pd.to_numeric(series, errors='coerce')
        valid_ratio = converted.notna().sum() / len(series)
        return valid_ratio >= self.numeric_threshold

    def _is_date_string(self, series: pd.Series) -> bool:
        """Check if string column contains dates."""
        if series.dtype != 'object':
            return False
        sample = series.dropna().head(100)

        # Check against date patterns
        for pattern in self.DATE_PATTERNS:
            matches = sample.str.match(pattern, na=False).mean()
            if matches >= 0.9:
                return True

        # Try pandas date parsing
        try:
            converted = pd.to_datetime(sample, errors='coerce')
            return converted.notna().mean() >= 0.9
        except Exception:
            return False

    def _should_be_categorical(self, series: pd.Series) -> bool:
        """Check if column should be categorical."""
        nunique = series.nunique()
        return (
            nunique <= self.categorical_threshold
            and nunique < len(series) * 0.5
        )

    def analyze(self, df: pd.DataFrame) -> Dict[str, Dict[str, Any]]:
        """
        Analyze columns and suggest type corrections.
        """
        suggestions = {}
        for col in df.columns:
            series = df[col]
            current_type = str(series.dtype)
            suggested_type = current_type
            reason = "No change needed"

            if self._is_numeric_string(series):
                suggested_type = 'numeric'
                reason = "String column contains numeric values"
            elif self._is_date_string(series):
                suggested_type = 'datetime'
                reason = "String column contains date values"
            elif series.dtype == 'object' and self._should_be_categorical(series):
                suggested_type = 'category'
                reason = f"Low cardinality ({series.nunique()} unique values)"
            elif series.dtype in ['int64', 'float64']:
                if set(series.dropna().unique()).issubset({0, 1}):
                    suggested_type = 'boolean'
                    reason = "Only contains 0 and 1"

            if suggested_type != current_type:
                suggestions[col] = {
                    'current': current_type,
                    'suggested': suggested_type,
                    'reason': reason
                }

        self.type_corrections_ = suggestions
        return suggestions

    def correct(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply type corrections."""
        df = df.copy()
        for col, correction in self.type_corrections_.items():
            suggested = correction['suggested']
            try:
                if suggested == 'numeric':
                    df[col] = pd.to_numeric(df[col], errors='coerce')
                elif suggested == 'datetime':
                    df[col] = pd.to_datetime(df[col], errors='coerce')
                elif suggested == 'category':
                    df[col] = df[col].astype('category')
                elif suggested == 'boolean':
                    df[col] = df[col].astype(bool)
            except Exception as e:
                print(f"Warning: Could not convert {col} to {suggested}: {e}")
        return df
```

Converting low-cardinality string columns to the categorical dtype reduces memory significantly (integer codes are stored instead of repeated strings) and speeds up groupby operations. AutoML systems should identify and convert categorical columns automatically.
AutoML systems combine individual cleaning operations into orchestrated pipelines. The order matters: some operations depend on others, and some should be skipped based on earlier findings.
Pipeline Order Matters:
```python
import pandas as pd
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict
from enum import Enum


class CleaningStep(Enum):
    """The cleaning operations the pipeline can perform, in order."""
    TYPE_CORRECTION = "type_correction"
    CONSTANT_REMOVAL = "constant_removal"
    DUPLICATE_REMOVAL = "duplicate_removal"
    OUTLIER_HANDLING = "outlier_handling"
    MISSING_HANDLING = "missing_handling"


@dataclass
class CleaningReport:
    """Report of cleaning operations performed."""
    steps_performed: List[str] = field(default_factory=list)
    rows_removed: int = 0
    columns_removed: int = 0
    values_imputed: int = 0
    types_corrected: Dict[str, str] = field(default_factory=dict)
    outliers_handled: int = 0


class AutoMLDataCleaner:
    """
    Comprehensive automated data cleaning pipeline.

    Orchestrates multiple cleaning operations in the correct
    order with configurable strategies for each step.
    """

    def __init__(
        self,
        remove_duplicates: bool = True,
        handle_outliers: str = 'clip',    # 'clip', 'flag', 'none'
        handle_missing: str = 'median',   # 'mean', 'median', 'none'
        remove_constant: bool = True,
        correct_types: bool = True,
        outlier_threshold: float = 3.0,
        missing_threshold: float = 0.5    # Drop columns with > 50% missing
    ):
        self.remove_duplicates = remove_duplicates
        self.handle_outliers = handle_outliers
        self.handle_missing = handle_missing
        self.remove_constant = remove_constant
        self.correct_types = correct_types
        self.outlier_threshold = outlier_threshold
        self.missing_threshold = missing_threshold
        self.report_ = None
        self.fitted_params_ = {}

    def fit_transform(
        self,
        X: pd.DataFrame,
        y: pd.Series = None
    ) -> tuple:
        """
        Fit cleaning parameters and transform data.
        Returns cleaned X and y (if provided).
        """
        self.report_ = CleaningReport()
        X = X.copy()
        if y is not None:
            y = y.copy()
        original_shape = X.shape

        # Step 1: Type Correction
        if self.correct_types:
            X = self._correct_types(X)
        # Step 2: Remove Constant Columns
        if self.remove_constant:
            X = self._remove_constant_columns(X)
        # Step 3: Remove Duplicates
        if self.remove_duplicates:
            X, y = self._remove_duplicates(X, y)
        # Step 4: Handle Outliers (fit and transform)
        if self.handle_outliers != 'none':
            X = self._handle_outliers(X, fit=True)
        # Step 5: Handle Missing Values (fit and transform)
        if self.handle_missing != 'none':
            X = self._handle_missing(X, fit=True)

        # Update report
        self.report_.rows_removed = original_shape[0] - len(X)
        self.report_.columns_removed = original_shape[1] - X.shape[1]

        if y is not None:
            return X, y
        return X

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Apply fitted cleaning to new data."""
        X = X.copy()
        # Apply in same order
        if self.correct_types:
            X = self._apply_type_corrections(X)
        if self.remove_constant:
            X = self._apply_column_filter(X)
        # Note: Don't remove duplicates on test data
        if self.handle_outliers != 'none':
            X = self._handle_outliers(X, fit=False)
        if self.handle_missing != 'none':
            X = self._handle_missing(X, fit=False)
        return X

    def _correct_types(self, X: pd.DataFrame) -> pd.DataFrame:
        """Detect and correct data types (uses AutoMLTypeCorrector above)."""
        type_corrector = AutoMLTypeCorrector()
        corrections = type_corrector.analyze(X)
        self.fitted_params_['type_corrections'] = corrections
        self.report_.types_corrected = {
            col: c['suggested'] for col, c in corrections.items()
        }
        return type_corrector.correct(X)

    def _apply_type_corrections(self, X: pd.DataFrame) -> pd.DataFrame:
        """Re-apply the type corrections recorded during fit."""
        for col, c in self.fitted_params_.get('type_corrections', {}).items():
            if col not in X.columns:
                continue
            suggested = c['suggested']
            if suggested == 'numeric':
                X[col] = pd.to_numeric(X[col], errors='coerce')
            elif suggested == 'datetime':
                X[col] = pd.to_datetime(X[col], errors='coerce')
            elif suggested == 'category':
                X[col] = X[col].astype('category')
            elif suggested == 'boolean':
                X[col] = X[col].astype(bool)
        return X

    def _remove_constant_columns(self, X: pd.DataFrame) -> pd.DataFrame:
        """Remove columns with no variance."""
        non_constant = X.columns[X.nunique() > 1]
        removed = set(X.columns) - set(non_constant)
        self.fitted_params_['kept_columns'] = list(non_constant)
        self.report_.steps_performed.append(
            f"Removed {len(removed)} constant columns"
        )
        return X[non_constant]

    def _apply_column_filter(self, X: pd.DataFrame) -> pd.DataFrame:
        """Keep only the columns retained during fit."""
        kept = self.fitted_params_.get('kept_columns')
        return X[kept] if kept is not None else X

    def _remove_duplicates(
        self,
        X: pd.DataFrame,
        y: pd.Series
    ) -> tuple:
        """Remove duplicate rows."""
        n_before = len(X)
        if y is not None:
            combined = X.copy()
            combined['_target'] = y
            combined = combined.drop_duplicates()
            y = combined['_target']
            X = combined.drop('_target', axis=1)
        else:
            X = X.drop_duplicates()
        n_removed = n_before - len(X)
        self.report_.steps_performed.append(
            f"Removed {n_removed} duplicate rows"
        )
        return X, y

    def _handle_outliers(self, X: pd.DataFrame, fit: bool) -> pd.DataFrame:
        """Handle outliers in numeric columns."""
        numeric_cols = X.select_dtypes(include=[np.number]).columns
        if fit:
            self.fitted_params_['outlier_bounds'] = {}
            for col in numeric_cols:
                # Use the 1st/99th percentiles as clipping bounds
                lower, upper = X[col].quantile([0.01, 0.99])
                self.fitted_params_['outlier_bounds'][col] = (lower, upper)

        outlier_count = 0
        for col in numeric_cols:
            if col in self.fitted_params_.get('outlier_bounds', {}):
                lower, upper = self.fitted_params_['outlier_bounds'][col]
                if self.handle_outliers == 'clip':
                    outliers = (X[col] < lower) | (X[col] > upper)
                    outlier_count += outliers.sum()
                    X[col] = X[col].clip(lower, upper)
                elif self.handle_outliers == 'flag':
                    X[f'{col}_outlier'] = (
                        (X[col] < lower) | (X[col] > upper)
                    ).astype(int)
        self.report_.outliers_handled = outlier_count
        return X

    def _handle_missing(self, X: pd.DataFrame, fit: bool) -> pd.DataFrame:
        """Handle missing values."""
        if fit:
            # Drop columns exceeding the missing-value threshold
            too_missing = [
                c for c in X.columns
                if X[c].isna().mean() > self.missing_threshold
            ]
            self.fitted_params_['dropped_missing_cols'] = too_missing

            # Store imputation values for the remaining columns
            self.fitted_params_['impute_values'] = {}
            for col in X.columns:
                if col in too_missing:
                    continue
                if X[col].isna().any():
                    if X[col].dtype in ['object', 'category']:
                        mode = X[col].mode()
                        value = mode.iloc[0] if not mode.empty else 'MISSING'
                    elif self.handle_missing == 'mean':
                        value = X[col].mean()
                    else:
                        value = X[col].median()
                    self.fitted_params_['impute_values'][col] = value

        X = X.drop(
            columns=self.fitted_params_.get('dropped_missing_cols', []),
            errors='ignore'
        )
        # Apply imputation
        impute_count = 0
        for col, value in self.fitted_params_.get('impute_values', {}).items():
            if col in X.columns:
                missing = X[col].isna()
                impute_count += missing.sum()
                X[col] = X[col].fillna(value)
        self.report_.values_imputed = impute_count
        return X

    def get_report(self) -> dict:
        """Return cleaning report as a dictionary."""
        return {
            'steps': self.report_.steps_performed,
            'rows_removed': self.report_.rows_removed,
            'columns_removed': self.report_.columns_removed,
            'values_imputed': self.report_.values_imputed,
            'outliers_handled': self.report_.outliers_handled,
            'types_corrected': self.report_.types_corrected
        }
```

Like scalers and encoders, data cleaning parameters (imputation values, outlier bounds, type mappings) must be fitted on training data and applied consistently to all new data. AutoML cleaning pipelines store these parameters for reuse at inference time.
Data cleaning is the foundation of reliable machine learning. AutoML systems automate detection and remediation of data quality issues, freeing practitioners to focus on modeling while ensuring consistent, reproducible preprocessing.
Congratulations! You have completed Module 2: Automated Preprocessing. You now understand how AutoML systems handle the full preprocessing pipeline—from missing values and encoding through scaling, feature selection, and data cleaning. These capabilities form the foundation of end-to-end AutoML systems.