The oldest adage in computing is brutally accurate for machine learning: garbage in, garbage out. No algorithm, no matter how sophisticated, can extract valid patterns from corrupted, inconsistent, or erroneous data. Data scientists routinely spend 60-80% of their time on data cleaning—detecting issues, understanding their causes, and implementing fixes.
The automation imperative: If data cleaning consumes most of the work, automating it offers enormous leverage. But the challenge is immense: data quality issues are numerous, subtle, and domain-specific. AutoML systems must detect problems they cannot anticipate, fix issues without human guidance, and validate that fixes don't introduce new problems. This page explores how modern AutoML approaches this fundamental challenge.
By the end of this page, you will understand: the taxonomy of data quality issues, automated detection strategies for each issue type, remediation approaches and their tradeoffs, validation strategies to ensure cleaning improves data, and how AutoML systems orchestrate data cleaning pipelines.
Data quality issues fall into several distinct categories, each requiring different detection and remediation strategies. Understanding this taxonomy is the first step toward automated cleaning.
The Six Dimensions of Data Quality:
| Dimension | Description | Common Issues | Detection Approach |
|---|---|---|---|
| Completeness | All required data is present | Missing values, incomplete records | Null counts, coverage analysis |
| Accuracy | Data correctly represents reality | Typos, measurement errors, outdated values | Range checks, cross-validation |
| Consistency | Data doesn't contradict itself | Conflicting values, format variations | Constraint checking, deduplication |
| Validity | Data conforms to expected formats | Invalid emails, impossible dates | Regex, schema validation |
| Uniqueness | No unintended duplicates | Duplicate records, ID collisions | Exact/fuzzy matching |
| Timeliness | Data is sufficiently current | Stale data, outdated references | Timestamp analysis, freshness checks |
What constitutes 'bad' data is domain-specific. An age of 150 is clearly wrong for humans but reasonable for tortoises. A temperature of -40°C is normal in Siberia but suspicious in Singapore. AutoML data cleaning must incorporate domain constraints, either learned from data distributions or provided as external knowledge.
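One way to operationalize externally supplied domain knowledge is a small range validator driven by per-column bounds. The `validate_ranges` helper and the bounds themselves are illustrative assumptions for this sketch, not a standard AutoML API:

```python
import pandas as pd

def validate_ranges(df: pd.DataFrame, constraints: dict) -> pd.DataFrame:
    """Flag values outside domain-supplied (min, max) bounds per column."""
    violations = {}
    for col, (lo, hi) in constraints.items():
        if col in df.columns:
            # True wherever the value falls outside the allowed range
            violations[col] = ~df[col].between(lo, hi)
    return pd.DataFrame(violations)

# Domain knowledge: human ages and body temperatures have hard limits
patients = pd.DataFrame({'age': [34, 150, 61], 'temp_c': [36.8, 37.1, 98.6]})
flags = validate_ranges(patients, {'age': (0, 120), 'temp_c': (30, 45)})
print(flags.sum())  # one violation per column: age 150, temp_c 98.6
```

The same mechanism extends naturally to learned bounds: replace the hand-written constraints with percentile-based limits fitted on trusted historical data.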
Outliers are observations that deviate significantly from the majority. They may be errors (measurement malfunction, data entry mistake) or genuine extremes (rare but real events). The challenge: distinguishing these cases automatically.
Detection Approaches:
- **Z-Score Method**: Flag values more than k standard deviations from the mean.
- **IQR Method**: Flag values beyond Q1 − 1.5×IQR or Q3 + 1.5×IQR.
- **Modified Z-Score**: Use the median and MAD (median absolute deviation) instead of the mean and standard deviation.
```python
import numpy as np
import pandas as pd
from scipy import stats


class AutoMLOutlierDetector:
    """
    Automated outlier detection with multiple strategies.
    """

    def __init__(
        self,
        method: str = 'auto',
        threshold: float = 3.0,
        iqr_multiplier: float = 1.5
    ):
        self.method = method
        self.threshold = threshold
        self.iqr_multiplier = iqr_multiplier
        self.outlier_masks_ = {}
        self.bounds_ = {}

    def _detect_zscore(self, x: np.ndarray) -> np.ndarray:
        """Z-score based outlier detection."""
        z = np.abs(stats.zscore(x, nan_policy='omit'))
        return z > self.threshold

    def _detect_iqr(self, x: np.ndarray) -> np.ndarray:
        """IQR-based outlier detection."""
        q1, q3 = np.nanpercentile(x, [25, 75])
        iqr = q3 - q1
        lower = q1 - self.iqr_multiplier * iqr
        upper = q3 + self.iqr_multiplier * iqr
        return (x < lower) | (x > upper)

    def _detect_modified_zscore(self, x: np.ndarray) -> np.ndarray:
        """Modified z-score using median and MAD."""
        median = np.nanmedian(x)
        mad = np.nanmedian(np.abs(x - median))
        # Avoid division by zero
        mad = mad if mad > 0 else 1e-10
        modified_z = 0.6745 * (x - median) / mad
        return np.abs(modified_z) > self.threshold

    def _select_method(self, x: np.ndarray) -> str:
        """Auto-select detection method based on distribution."""
        if self.method != 'auto':
            return self.method
        # Check for normality
        _, p_value = stats.normaltest(x[~np.isnan(x)])
        if p_value > 0.05:
            return 'zscore'  # Approximately normal
        else:
            skewness = stats.skew(x, nan_policy='omit')
            if abs(skewness) > 1:
                return 'modified_zscore'  # Highly skewed
            else:
                return 'iqr'  # Non-normal but not too skewed

    def fit(self, X: pd.DataFrame) -> 'AutoMLOutlierDetector':
        """Detect outliers in each numeric column."""
        for col in X.select_dtypes(include=[np.number]).columns:
            values = X[col].values
            method = self._select_method(values)

            if method == 'zscore':
                mask = self._detect_zscore(values)
            elif method == 'iqr':
                mask = self._detect_iqr(values)
            else:
                mask = self._detect_modified_zscore(values)

            self.outlier_masks_[col] = mask

            # Store bounds for later use
            if method == 'iqr':
                q1, q3 = np.nanpercentile(values, [25, 75])
                iqr = q3 - q1
                self.bounds_[col] = (
                    q1 - self.iqr_multiplier * iqr,
                    q3 + self.iqr_multiplier * iqr
                )
        return self

    def get_outlier_report(self) -> pd.DataFrame:
        """Summary of outliers detected per column."""
        return pd.DataFrame({
            'column': list(self.outlier_masks_.keys()),
            'n_outliers': [m.sum() for m in self.outlier_masks_.values()],
            'pct_outliers': [
                m.sum() / len(m) * 100
                for m in self.outlier_masks_.values()
            ]
        }).sort_values('n_outliers', ascending=False)
```

The right outlier strategy depends on context. For tree-based models (random forests, XGBoost), outliers often don't matter: splits are rank-based, so trees are robust. For linear models or neural networks, outliers can dominate the loss and need handling. AutoML should consider the downstream model when deciding how aggressively to treat outliers.
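To see why the choice of detector matters, compare z-score and IQR flags on a small skewed sample. This standalone sketch uses the same formulas described above:

```python
import numpy as np

# A small sample with one extreme value
x = np.array([1.0, 2.0, 2.5, 3.0, 3.5, 4.0, 100.0])

# Z-score: the extreme value inflates the mean and std it is measured
# against, so at threshold 3 it escapes detection (the "masking" effect)
z = np.abs((x - x.mean()) / x.std())
z_flags = z > 3.0

# IQR: the quartiles ignore the extreme value, so it is flagged
q1, q3 = np.percentile(x, [25, 75])
iqr = q3 - q1
iqr_flags = (x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)

print(z_flags.sum(), iqr_flags.sum())  # 0 1
```

The z-score method misses the obvious outlier entirely because the outlier itself distorts the statistics used to judge it, which is exactly why the auto-selection logic above prefers robust methods for non-normal data.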
Duplicates inflate metrics, bias models, and waste computation. They range from exact copies to near-duplicates that require fuzzy matching to detect.
Types of Duplicates:
```python
import pandas as pd
import numpy as np
from typing import List, Tuple


class AutoMLDuplicateDetector:
    """
    Automated duplicate detection with exact and fuzzy matching.
    """

    def __init__(
        self,
        key_columns: List[str] = None,
        similarity_threshold: float = 0.9
    ):
        self.key_columns = key_columns
        self.similarity_threshold = similarity_threshold
        self.duplicate_groups_ = []

    def find_exact_duplicates(self, df: pd.DataFrame) -> pd.Series:
        """Boolean mask marking exact duplicate rows (first occurrence kept)."""
        return df.duplicated(keep='first')

    def find_key_duplicates(
        self,
        df: pd.DataFrame,
        key_columns: List[str] = None
    ) -> pd.Series:
        """Find rows with duplicate keys but different values."""
        keys = key_columns or self.key_columns
        if keys is None:
            return pd.Series([False] * len(df), index=df.index)

        # Find duplicate keys
        key_duplicates = df.duplicated(subset=keys, keep=False)

        # Among key duplicates, find those with differing non-key values
        key_dup_df = df[key_duplicates]
        non_key_cols = [c for c in df.columns if c not in keys]
        conflicts = []
        for _, group in key_dup_df.groupby(keys):
            if len(group) > 1:
                if group[non_key_cols].drop_duplicates().shape[0] != 1:
                    conflicts.extend(group.index.tolist())

        return pd.Series(df.index.isin(conflicts), index=df.index)

    def find_near_duplicates_text(
        self,
        df: pd.DataFrame,
        text_column: str,
        threshold: float = None
    ) -> List[Tuple[int, int, float]]:
        """
        Find near-duplicate text entries using similarity.
        Returns list of (idx1, idx2, similarity) tuples.
        Note: pairwise comparison is O(n^2); use blocking for large data.
        """
        from difflib import SequenceMatcher

        threshold = threshold or self.similarity_threshold
        texts = df[text_column].astype(str).tolist()
        near_dupes = []
        for i in range(len(texts)):
            for j in range(i + 1, len(texts)):
                similarity = SequenceMatcher(
                    None, texts[i], texts[j]
                ).ratio()
                if similarity >= threshold:
                    near_dupes.append((i, j, similarity))
        return near_dupes

    def deduplicate(
        self,
        df: pd.DataFrame,
        strategy: str = 'first',
        key_columns: List[str] = None
    ) -> pd.DataFrame:
        """
        Remove duplicates with specified strategy.

        Strategies:
        - 'first': Keep first occurrence
        - 'last': Keep last occurrence
        - 'most_complete': Keep row with fewest missing values
        """
        keys = key_columns or self.key_columns

        if strategy in ['first', 'last']:
            return df.drop_duplicates(subset=keys, keep=strategy)
        elif strategy == 'most_complete':
            # Sort so the most complete row in each group comes last, keep it
            df = df.copy()
            df['_missing_count'] = df.isnull().sum(axis=1)
            df = df.sort_values('_missing_count', ascending=False)
            df = df.drop_duplicates(subset=keys, keep='last')
            df = df.drop('_missing_count', axis=1)
            return df
        return df


def remove_duplicates_for_ml(
    X: pd.DataFrame,
    y: pd.Series = None
) -> Tuple[pd.DataFrame, pd.Series]:
    """
    Remove duplicates considering both features and target.

    For ML, we typically want to remove exact duplicates but
    keep near-duplicates (different training examples).
    """
    if y is not None:
        combined = X.copy()
        combined['_target'] = y
        combined = combined.drop_duplicates()
        y_clean = combined['_target']
        X_clean = combined.drop('_target', axis=1)
        return X_clean, y_clean
    else:
        return X.drop_duplicates(), None
```

If the same or very similar examples appear in both training and test sets, you have data leakage: the model memorizes training examples and 'recognizes' them at test time, inflating metrics. Always deduplicate before splitting, or check for cross-set duplicates after splitting.
Raw data often arrives with incorrect types: numbers as strings, dates as integers, booleans as 0/1. Additionally, the same information may be represented in multiple formats. Automated cleaning must detect and correct these issues.
Common Type/Format Issues:
| Issue | Example | Detection | Correction |
|---|---|---|---|
| Numeric as string | '123', '45.6' | Regex pattern match | pd.to_numeric() |
| Date as string | '2024-01-15', '01/15/24' | Date pattern match | pd.to_datetime() |
| Boolean as int | 0, 1 | Only 0/1 values | astype(bool) |
| Category as object | 'red', 'blue', 'green' | Low cardinality | astype('category') |
| Mixed types | ['123', 456, 'N/A'] | Type diversity check | Coerce with errors='coerce' |
```python
import pandas as pd
import numpy as np
import re
from typing import Dict, Any


class AutoMLTypeCorrector:
    """
    Automated data type detection and correction.
    """

    # Date patterns to try
    DATE_PATTERNS = [
        r'\d{4}-\d{2}-\d{2}',  # YYYY-MM-DD
        r'\d{2}/\d{2}/\d{4}',  # MM/DD/YYYY
        r'\d{2}-\d{2}-\d{4}',  # DD-MM-YYYY
        r'\d{4}/\d{2}/\d{2}',  # YYYY/MM/DD
    ]

    def __init__(
        self,
        categorical_threshold: int = 20,
        numeric_threshold: float = 0.9
    ):
        self.categorical_threshold = categorical_threshold
        self.numeric_threshold = numeric_threshold
        self.type_corrections_ = {}

    def _is_numeric_string(self, series: pd.Series) -> bool:
        """Check if string column can be converted to numeric."""
        if series.dtype != 'object':
            return False
        # Try conversion
        converted = pd.to_numeric(series, errors='coerce')
        valid_ratio = converted.notna().sum() / len(series)
        return valid_ratio >= self.numeric_threshold

    def _is_date_string(self, series: pd.Series) -> bool:
        """Check if string column contains dates."""
        if series.dtype != 'object':
            return False
        sample = series.dropna().head(100)

        # Check against date patterns
        for pattern in self.DATE_PATTERNS:
            matches = sample.str.match(pattern, na=False).mean()
            if matches >= 0.9:
                return True

        # Try pandas date parsing
        try:
            converted = pd.to_datetime(sample, errors='coerce')
            return converted.notna().mean() >= 0.9
        except Exception:
            return False

    def _should_be_categorical(self, series: pd.Series) -> bool:
        """Check if column should be categorical."""
        nunique = series.nunique()
        return (
            nunique <= self.categorical_threshold
            and nunique < len(series) * 0.5
        )

    def analyze(self, df: pd.DataFrame) -> Dict[str, Dict[str, Any]]:
        """
        Analyze columns and suggest type corrections.
        """
        suggestions = {}
        for col in df.columns:
            series = df[col]
            current_type = str(series.dtype)
            suggested_type = current_type
            reason = "No change needed"

            if self._is_numeric_string(series):
                suggested_type = 'numeric'
                reason = "String column contains numeric values"
            elif self._is_date_string(series):
                suggested_type = 'datetime'
                reason = "String column contains date values"
            elif series.dtype == 'object' and self._should_be_categorical(series):
                suggested_type = 'category'
                reason = f"Low cardinality ({series.nunique()} unique values)"
            elif series.dtype in ['int64', 'float64']:
                if set(series.dropna().unique()).issubset({0, 1}):
                    suggested_type = 'boolean'
                    reason = "Only contains 0 and 1"

            if suggested_type != current_type:
                suggestions[col] = {
                    'current': current_type,
                    'suggested': suggested_type,
                    'reason': reason
                }

        self.type_corrections_ = suggestions
        return suggestions

    def correct(self, df: pd.DataFrame) -> pd.DataFrame:
        """Apply type corrections."""
        df = df.copy()
        for col, correction in self.type_corrections_.items():
            suggested = correction['suggested']
            try:
                if suggested == 'numeric':
                    df[col] = pd.to_numeric(df[col], errors='coerce')
                elif suggested == 'datetime':
                    df[col] = pd.to_datetime(df[col], errors='coerce')
                elif suggested == 'category':
                    df[col] = df[col].astype('category')
                elif suggested == 'boolean':
                    df[col] = df[col].astype(bool)
            except Exception as e:
                print(f"Warning: Could not convert {col} to {suggested}: {e}")
        return df
```

Converting low-cardinality string columns to the categorical dtype reduces memory significantly (integer codes are stored instead of repeated strings) and speeds up groupby operations. AutoML systems should identify and convert categorical columns automatically.
AutoML systems combine individual cleaning operations into orchestrated pipelines. The order matters: some operations depend on others, and some should be skipped based on earlier findings.
Pipeline Order Matters:
```python
import pandas as pd
import numpy as np
from dataclasses import dataclass, field
from typing import List, Dict
from enum import Enum


class CleaningStep(Enum):
    """The cleaning operations the pipeline can perform, in order."""
    TYPE_CORRECTION = "type_correction"
    CONSTANT_REMOVAL = "constant_removal"
    DUPLICATE_REMOVAL = "duplicate_removal"
    OUTLIER_HANDLING = "outlier_handling"
    MISSING_HANDLING = "missing_handling"


@dataclass
class CleaningReport:
    """Report of cleaning operations performed."""
    steps_performed: List[str] = field(default_factory=list)
    rows_removed: int = 0
    columns_removed: int = 0
    values_imputed: int = 0
    types_corrected: Dict[str, str] = field(default_factory=dict)
    outliers_handled: int = 0


class AutoMLDataCleaner:
    """
    Comprehensive automated data cleaning pipeline.

    Orchestrates multiple cleaning operations in the correct
    order with configurable strategies for each step.
    """

    def __init__(
        self,
        remove_duplicates: bool = True,
        handle_outliers: str = 'clip',    # 'clip', 'flag', 'none'
        handle_missing: str = 'median',   # 'mean', 'median', 'none'
        remove_constant: bool = True,
        correct_types: bool = True,
        outlier_threshold: float = 3.0,
        missing_threshold: float = 0.5    # Drop columns with > 50% missing
    ):
        self.remove_duplicates = remove_duplicates
        self.handle_outliers = handle_outliers
        self.handle_missing = handle_missing
        self.remove_constant = remove_constant
        self.correct_types = correct_types
        self.outlier_threshold = outlier_threshold
        self.missing_threshold = missing_threshold
        self.report_ = None
        self.fitted_params_ = {}

    def fit_transform(
        self,
        X: pd.DataFrame,
        y: pd.Series = None
    ) -> tuple:
        """
        Fit cleaning parameters and transform data.
        Returns cleaned X and y (if provided).
        """
        self.report_ = CleaningReport()
        X = X.copy()
        if y is not None:
            y = y.copy()
        original_shape = X.shape

        # Step 1: Type Correction
        if self.correct_types:
            X = self._correct_types(X)
        # Step 2: Remove Constant Columns
        if self.remove_constant:
            X = self._remove_constant_columns(X)
        # Step 3: Remove Duplicates
        if self.remove_duplicates:
            X, y = self._remove_duplicates(X, y)
        # Step 4: Handle Outliers (fit and transform)
        if self.handle_outliers != 'none':
            X = self._handle_outliers(X, fit=True)
        # Step 5: Handle Missing Values (fit and transform)
        if self.handle_missing != 'none':
            X = self._handle_missing(X, fit=True)

        # Update report
        self.report_.rows_removed = original_shape[0] - len(X)
        self.report_.columns_removed = original_shape[1] - X.shape[1]

        if y is not None:
            return X, y
        return X

    def transform(self, X: pd.DataFrame) -> pd.DataFrame:
        """Apply fitted cleaning to new data."""
        X = X.copy()
        # Apply in same order
        if self.correct_types:
            X = self._apply_type_corrections(X)
        if self.remove_constant:
            X = self._apply_column_filter(X)
        # Note: Don't remove duplicates on test data
        if self.handle_outliers != 'none':
            X = self._handle_outliers(X, fit=False)
        if self.handle_missing != 'none':
            X = self._handle_missing(X, fit=False)
        return X

    def _correct_types(self, X: pd.DataFrame) -> pd.DataFrame:
        """Detect and correct data types (uses AutoMLTypeCorrector above)."""
        type_corrector = AutoMLTypeCorrector()
        corrections = type_corrector.analyze(X)
        self.fitted_params_['type_corrections'] = corrections
        self.report_.types_corrected = {
            col: c['suggested'] for col, c in corrections.items()
        }
        return type_corrector.correct(X)

    def _apply_type_corrections(self, X: pd.DataFrame) -> pd.DataFrame:
        """Re-apply the type corrections recorded during fit."""
        for col, c in self.fitted_params_.get('type_corrections', {}).items():
            if col not in X.columns:
                continue
            suggested = c['suggested']
            if suggested == 'numeric':
                X[col] = pd.to_numeric(X[col], errors='coerce')
            elif suggested == 'datetime':
                X[col] = pd.to_datetime(X[col], errors='coerce')
            elif suggested == 'category':
                X[col] = X[col].astype('category')
            elif suggested == 'boolean':
                X[col] = X[col].astype(bool)
        return X

    def _remove_constant_columns(self, X: pd.DataFrame) -> pd.DataFrame:
        """Remove columns with no variance."""
        non_constant = X.columns[X.nunique() > 1]
        removed = set(X.columns) - set(non_constant)
        self.fitted_params_['kept_columns'] = list(non_constant)
        self.report_.steps_performed.append(
            f"Removed {len(removed)} constant columns"
        )
        return X[non_constant]

    def _apply_column_filter(self, X: pd.DataFrame) -> pd.DataFrame:
        """Keep only the columns retained during fit."""
        kept = self.fitted_params_.get('kept_columns')
        return X[kept] if kept is not None else X

    def _remove_duplicates(
        self,
        X: pd.DataFrame,
        y: pd.Series
    ) -> tuple:
        """Remove duplicate rows."""
        n_before = len(X)
        if y is not None:
            combined = X.copy()
            combined['_target'] = y
            combined = combined.drop_duplicates()
            y = combined['_target']
            X = combined.drop('_target', axis=1)
        else:
            X = X.drop_duplicates()
        n_removed = n_before - len(X)
        self.report_.steps_performed.append(
            f"Removed {n_removed} duplicate rows"
        )
        return X, y

    def _handle_outliers(self, X: pd.DataFrame, fit: bool) -> pd.DataFrame:
        """Handle outliers in numeric columns."""
        numeric_cols = X.select_dtypes(include=[np.number]).columns
        if fit:
            self.fitted_params_['outlier_bounds'] = {}
            for col in numeric_cols:
                # Use the 1st/99th percentiles as clipping bounds
                lower, upper = X[col].quantile([0.01, 0.99])
                self.fitted_params_['outlier_bounds'][col] = (lower, upper)

        outlier_count = 0
        for col in numeric_cols:
            if col in self.fitted_params_.get('outlier_bounds', {}):
                lower, upper = self.fitted_params_['outlier_bounds'][col]
                if self.handle_outliers == 'clip':
                    outliers = (X[col] < lower) | (X[col] > upper)
                    outlier_count += outliers.sum()
                    X[col] = X[col].clip(lower, upper)
                elif self.handle_outliers == 'flag':
                    X[f'{col}_outlier'] = (
                        (X[col] < lower) | (X[col] > upper)
                    ).astype(int)
        self.report_.outliers_handled = outlier_count
        return X

    def _handle_missing(self, X: pd.DataFrame, fit: bool) -> pd.DataFrame:
        """Handle missing values."""
        if fit:
            # Drop columns exceeding the missing-value threshold
            too_missing = [
                c for c in X.columns
                if X[c].isna().mean() > self.missing_threshold
            ]
            self.fitted_params_['dropped_missing_cols'] = too_missing

            # Store imputation values for the remaining columns
            self.fitted_params_['impute_values'] = {}
            for col in X.columns:
                if col in too_missing:
                    continue
                if X[col].isna().any():
                    if X[col].dtype in ['object', 'category']:
                        mode = X[col].mode()
                        value = mode.iloc[0] if not mode.empty else 'MISSING'
                    elif self.handle_missing == 'mean':
                        value = X[col].mean()
                    else:
                        value = X[col].median()
                    self.fitted_params_['impute_values'][col] = value

        X = X.drop(
            columns=self.fitted_params_.get('dropped_missing_cols', []),
            errors='ignore'
        )
        # Apply imputation
        impute_count = 0
        for col, value in self.fitted_params_.get('impute_values', {}).items():
            if col in X.columns:
                missing = X[col].isna()
                impute_count += missing.sum()
                X[col] = X[col].fillna(value)
        self.report_.values_imputed = impute_count
        return X

    def get_report(self) -> dict:
        """Return cleaning report as a dictionary."""
        return {
            'steps': self.report_.steps_performed,
            'rows_removed': self.report_.rows_removed,
            'columns_removed': self.report_.columns_removed,
            'values_imputed': self.report_.values_imputed,
            'outliers_handled': self.report_.outliers_handled,
            'types_corrected': self.report_.types_corrected
        }
```

Like scalers and encoders, data cleaning parameters (imputation values, outlier bounds, type mappings) must be fitted on training data and applied consistently to all new data. AutoML cleaning pipelines store these parameters for reuse at inference time.
Data cleaning is the foundation of reliable machine learning. AutoML systems automate detection and remediation of data quality issues, freeing practitioners to focus on modeling while ensuring consistent, reproducible preprocessing.
Congratulations! You have completed Module 2: Automated Preprocessing. You now understand how AutoML systems handle the full preprocessing pipeline—from missing values and encoding through scaling, feature selection, and data cleaning. These capabilities form the foundation of end-to-end AutoML systems.