In 2009, a team of statisticians won Netflix's million-dollar recommendation challenge by improving predictions by just 10%. Their secret wasn't exotic algorithms—it was understanding how people watch movies. They engineered features capturing viewing patterns, time-of-day effects, and 'first rating' bias that pure data mining would never discover.
This principle holds universally: domain knowledge is often worth more than algorithmic sophistication. A fraud analyst who knows that criminals test stolen cards with small purchases first will engineer a 'rapid small transaction sequence' feature that outperforms any automatically discovered pattern. A medical researcher who understands that drug interactions depend on metabolic pathways will create biochemical ratio features that no autoML system would propose.
This page teaches you to systematically extract and encode domain expertise into features. You'll learn techniques for eliciting knowledge from experts, translating business rules into computable features, and validating that your domain hypotheses actually improve predictions. This is where feature engineering becomes an art informed by science.
Machine learning models are pattern recognition engines. They find correlations in data—but they don't understand causation, context, or constraint. Domain knowledge fills this gap in several critical ways:
1. It reduces the search space
Without guidance, an ML algorithm explores all possible feature interactions. With 100 raw features, there are 4,950 pairwise interactions, 161,700 three-way interactions, and billions of higher-order combinations. Domain knowledge tells you which interactions are meaningful:
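As a rough illustration, the combinatorics can be computed directly, and domain knowledge collapses the search to a short candidate list. The interaction pairs below use hypothetical lending columns purely as an example of expert-guided targeting:

```python
from math import comb

# Exhaustive search: every pairwise and three-way interaction of 100 raw features
n_features = 100
print(comb(n_features, 2))   # 4,950 pairwise interactions
print(comb(n_features, 3))   # 161,700 three-way interactions

# Domain-guided search: the handful of interactions an expert already believes matter
# (hypothetical column names for a lending problem)
candidate_interactions = [
    ("total_debt", "annual_income"),         # leverage
    ("utilization_rate", "credit_limit"),    # borrowing pressure
    ("late_payments_12m", "tenure_months"),  # recent vs. lifetime behavior
]
print(len(candidate_interactions))           # 3 targeted candidates instead of thousands
```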
2. It injects causal structure
Data shows correlation; domain knowledge provides causation. A model might learn that ice cream sales predict drowning deaths—both correlate with summer heat. A domain expert encodes the causal features (temperature, beach attendance) rather than spurious proxies.
3. It handles distributional shift
Patterns in historical data may not persist. Domain knowledge identifies robust features based on fundamental mechanisms rather than statistical artifacts that disappear when distributions change.
| Aspect | Pure Data Mining | Domain-Informed Approach |
|---|---|---|
| Feature discovery | Exhaustive search over combinations | Targeted features based on known mechanisms |
| Interpretability | Black-box feature importance | Features map to understood concepts |
| Robustness | May capture spurious correlations | More likely to generalize across distributions |
| Debugging | Why does the model fail on this case? | Missing or miscalculated domain feature |
| Regulatory compliance | Hard to explain to regulators | Domain features provide natural explanations |
| Speed to value | Requires extensive experimentation | Expert hypotheses accelerate iteration |
In most real-world ML projects, 80% of predictive power comes from 20% of features—and those top features are almost always domain-informed. The first 5 features a domain expert suggests often outperform the next 50 that automated feature generation discovers.
Domain experts often struggle to articulate what they know implicitly. A loan officer can sense when an application is risky but can't always specify the exact signals. Effective knowledge elicitation is a structured skill.
Techniques for Knowledge Elicitation:
- Think-aloud case walkthroughs: have the expert narrate their reasoning on real successes, failures, and edge cases
- Extreme and boundary case probing: ask what makes a case obviously good, obviously bad, or borderline
- Counterfactual questioning: ask what would have to change for the expert's judgment to flip
- Rule verbalization: ask the expert to state their rules of thumb explicitly, even imperfectly
Structuring Elicitation Sessions:
1. PREPARATION (before meeting)
- Gather sample cases: successes, failures, edge cases
- Identify data fields available for feature construction
- Review existing model features and their importance
2. SESSION STRUCTURE (60-90 minutes)
- 10 min: Explain ML problem and current approach
- 20 min: Walk through 3-5 cases using think-aloud
- 20 min: Explore extreme cases and boundaries
- 20 min: Counterfactual and rule verbalization
- 10 min: Review and prioritize emerging feature ideas
3. FOLLOW-UP
- Summarize extracted rules and features
- Validate understanding with expert
- Prototype features and measure predictive lift
- Share results and iterate
Experts often report what they should consider rather than what they actually use. They may be unaware of their own biases or mental shortcuts. Cross-validate elicited rules against data—sometimes experts are confidently wrong. Combine multiple experts to surface disagreements that reveal uncertainty.
Once domain rules are articulated, they must be translated into computable features. This requires bridging qualitative understanding with quantitative representation.
Pattern: Business Rule → Feature Formula
| Expert Statement | Feature Translation | Computation |
|---|---|---|
| 'High-value customers buy frequently' | purchase_frequency | orders_last_90_days / 90 |
| 'Risk increases with leverage' | debt_to_income_ratio | total_debt / annual_income |
| 'Churn spikes after price increases' | price_sensitivity | pct_change_after_price_hike |
| 'Fraud happens in bursts' | transaction_velocity | transactions_last_hour / avg_hourly_rate |
| 'Quality depends on experience' | tenure_quality_interaction | years_employed × quality_score |
```python
import pandas as pd
import numpy as np

def create_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    """
    Translate domain knowledge into computable features.

    Domain context: E-commerce customer churn prediction
    Expert insights encoded as features.
    """
    features = df.copy()

    # INSIGHT 1: "Customers who haven't purchased recently are at risk"
    # Translation: Recency score (days since last purchase)
    features['days_since_last_purchase'] = (
        pd.Timestamp.now() - pd.to_datetime(df['last_purchase_date'])
    ).dt.days

    # INSIGHT 2: "Declining purchase frequency signals disengagement"
    # Translation: Trend in purchase rate over time
    features['purchase_frequency_trend'] = (
        df['purchases_last_30d'] / df['purchases_days_31_60'].replace(0, 0.1)
    ) - 1  # Negative = declining

    # INSIGHT 3: "Customers who complain but don't churn are actually loyal"
    # Translation: Complaint-to-engagement ratio
    features['complaint_engagement_ratio'] = (
        df['complaints_last_90d'] / (df['support_contacts_last_90d'] + 1)
    )

    # INSIGHT 4: "Product diversity indicates stickiness"
    # Translation: Category spread (how many product categories purchased)
    features['category_diversity'] = df['unique_categories_purchased']

    # INSIGHT 5: "Price-sensitive customers churn when we raise prices"
    # Translation: Discount dependency score
    features['discount_dependency'] = (
        df['discounted_purchases'] / (df['total_purchases'] + 1)
    )

    # INSIGHT 6: "Long-tenure customers with sudden behavior change are concerning"
    # Translation: Behavioral anomaly score
    avg_monthly = df['lifetime_purchases'] / (df['tenure_months'] + 1)
    features['behavioral_anomaly'] = (
        df['purchases_last_30d'] - avg_monthly
    ) / (avg_monthly + 0.1)  # Negative = below average

    # INSIGHT 7: "Mobile-first customers have different patterns"
    # Translation: Channel concentration
    features['mobile_channel_pct'] = (
        df['mobile_purchases'] / (df['total_purchases'] + 1)
    )

    return features

# Example of documenting domain features
DOMAIN_FEATURE_CATALOG = {
    'days_since_last_purchase': {
        'source_insight': 'Customers who have not purchased recently are at risk',
        'expert': 'Customer Success Team',
        'expected_direction': 'Higher values increase churn probability',
        'computation': '(current_date - last_purchase_date).days',
        'validation_check': 'Should be 0+ integers; check for future dates',
    },
    'behavioral_anomaly': {
        'source_insight': 'Sudden behavior changes signal potential churn',
        'expert': 'Data Science Lead + Product Team',
        'expected_direction': 'Negative values (decline) increase churn risk',
        'computation': '(recent_monthly - avg_monthly) / avg_monthly',
        'validation_check': 'Should cluster around 0; outliers need review',
    },
    # ... continue for all domain features
}
```

Every domain feature should document: (1) the expert insight it encodes, (2) who provided the insight, (3) expected direction of effect, (4) exact computation logic, and (5) validation checks. This documentation enables debugging, onboarding, and regulatory review. Undocumented features become technical debt within months.
Each industry has accumulated decades of domain knowledge that translates into powerful features. Let's examine how experts in different fields encode their understanding.
Financial Services: Credit Risk
Healthcare: Patient Risk Stratification
E-commerce: Customer Lifetime Value
Many domain features transfer across companies within an industry. RFM is universal in retail. DTI is standard in lending. When entering a new domain, start by researching industry-standard features and metrics—don't reinvent what experts have already validated over decades.
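As a minimal sketch of reusing such industry staples, here is one way the two metrics mentioned above (RFM in retail, DTI in lending) might be computed. The table layouts and column names (customer_id, order_value, total_monthly_debt, and so on) are assumptions, not a fixed schema:

```python
import pandas as pd

def add_rfm_features(orders: pd.DataFrame, as_of: pd.Timestamp) -> pd.DataFrame:
    """Recency/Frequency/Monetary features per customer (standard retail metrics).

    Assumes an order-level table with customer_id, order_date, and order_value columns.
    """
    rfm = orders.groupby("customer_id").agg(
        last_order=("order_date", "max"),
        frequency=("order_date", "count"),
        monetary=("order_value", "sum"),
    )
    rfm["recency_days"] = (as_of - rfm["last_order"]).dt.days
    return rfm[["recency_days", "frequency", "monetary"]]

def add_dti_feature(applicants: pd.DataFrame) -> pd.Series:
    """Debt-to-income ratio, the standard leverage feature in lending.

    Assumes total_monthly_debt and monthly_income columns; guards against zero income.
    """
    return applicants["total_monthly_debt"] / applicants["monthly_income"].clip(lower=1)
```

Both functions return per-entity features that can be joined back onto the modeling table, which is typically how these standard metrics enter a feature pipeline.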
Domain experts are often right—but not always. Their intuitions may be based on memorable outliers, outdated patterns, or cognitive biases. Every domain hypothesis must be validated against data.
Validation Framework:
```python
import pandas as pd
import numpy as np
from scipy import stats
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingClassifier

def validate_domain_feature(
    df: pd.DataFrame,
    feature_name: str,
    target_name: str,
    expected_direction: str = 'positive',  # or 'negative'
    feature_type: str = 'numerical'        # or 'categorical'
) -> dict:
    """
    Comprehensive validation of a domain-hypothesized feature.

    Returns validation metrics and diagnostic information.
    """
    feature = df[feature_name].dropna()
    target = df.loc[feature.index, target_name]

    validation = {
        'feature_name': feature_name,
        'n_samples': len(feature),
        'missing_pct': (df[feature_name].isna().sum() / len(df)) * 100,
    }

    if feature_type == 'numerical':
        # 1. Correlation analysis
        correlation = feature.corr(target)
        validation['correlation'] = correlation

        # Check if direction matches hypothesis
        if expected_direction == 'positive':
            validation['direction_match'] = correlation > 0
        else:
            validation['direction_match'] = correlation < 0

        # 2. Statistical significance
        _, p_value = stats.pearsonr(feature, target)
        validation['p_value'] = p_value
        validation['significant'] = p_value < 0.05

        # 3. Mutual information (non-linear relationship strength)
        mi = mutual_info_classif(
            feature.values.reshape(-1, 1), target, random_state=42
        )[0]
        validation['mutual_information'] = mi

        # 4. Binned analysis (check monotonicity)
        feature_binned = pd.qcut(feature, q=5, duplicates='drop')
        bin_means = target.groupby(feature_binned).mean()
        validation['monotonic'] = (
            bin_means.is_monotonic_increasing or bin_means.is_monotonic_decreasing
        )
        validation['bin_target_rates'] = bin_means.to_dict()

    elif feature_type == 'categorical':
        # Chi-square test for independence
        contingency = pd.crosstab(feature, target)
        chi2, p_value, dof, expected = stats.chi2_contingency(contingency)
        validation['chi2_statistic'] = chi2
        validation['p_value'] = p_value
        validation['significant'] = p_value < 0.05

        # Category-wise target rates
        cat_rates = target.groupby(feature).mean()
        validation['category_target_rates'] = cat_rates.to_dict()

        # Mutual information
        mi = mutual_info_classif(
            pd.factorize(feature)[0].reshape(-1, 1), target,
            discrete_features=True, random_state=42
        )[0]
        validation['mutual_information'] = mi

    # 5. Incremental predictive value
    # Compare model with and without this feature
    # (rows aligned to non-missing feature values so X and target match)
    rows = feature.index
    X_base = (
        df.loc[rows]
        .drop(columns=[feature_name, target_name])
        .select_dtypes(include=[np.number])
        .fillna(0)
    )
    X_with_feature = X_base.copy()
    X_with_feature[feature_name] = feature

    model = GradientBoostingClassifier(n_estimators=50, max_depth=3, random_state=42)
    score_base = cross_val_score(model, X_base, target, cv=3, scoring='roc_auc').mean()
    score_with = cross_val_score(model, X_with_feature, target, cv=3, scoring='roc_auc').mean()

    validation['baseline_auc'] = score_base
    validation['with_feature_auc'] = score_with
    validation['auc_lift'] = score_with - score_base
    validation['provides_lift'] = (score_with - score_base) > 0.001

    # Final verdict
    validation['validated'] = (
        validation['significant']
        and validation.get('direction_match', True)
        and validation['provides_lift']
    )

    return validation

# Example usage
validation_result = validate_domain_feature(
    df=customer_data,
    feature_name='days_since_last_purchase',
    target_name='churned',
    expected_direction='positive',  # Longer recency = higher churn
    feature_type='numerical'
)

print(f"Feature: {validation_result['feature_name']}")
print(f"Validated: {validation_result['validated']}")
print(f"Correlation: {validation_result['correlation']:.3f}")
print(f"Direction matches hypothesis: {validation_result['direction_match']}")
print(f"AUC lift: {validation_result['auc_lift']:.4f}")
```

If a domain feature fails validation, don't immediately discard it. First check: (1) Is the feature computed correctly? (2) Is the validation sample representative? (3) Is the relationship non-linear? (4) Does it interact with other features? Often, the underlying insight is valuable but the initial operationalization needs refinement.
What if you are the domain expert for your ML problem? Or what if you're entering a new domain with no experts available? You can systematically build domain intuition through structured exploration.
Techniques for Self-Development of Domain Knowledge:
The Feature Engineering Journal:
Maintain a running document of feature hypotheses:
## Feature Hypothesis Log
### 2024-01-15: Session Depth Feature
- **Hypothesis**: Users who go deep in a session (many pages) but don't convert are comparison shopping
- **Proposed feature**: pages_viewed / time_on_site (browsing velocity)
- **Expected effect**: High velocity = comparison shopping = lower conversion
- **Status**: To be validated
- **Result**: [pending]
### 2024-01-12: Review Sentiment Mismatch
- **Hypothesis**: Products with mismatched review sentiment (some 5-star, some 1-star) are polarizing
- **Proposed feature**: std(review_scores) for each product
- **Expected effect**: High variance = polarizing = higher return rate
- **Status**: Validated
- **Result**: AUC +0.02, now in production model
This journal creates an institutional memory of feature engineering attempts, preventing repeated failed experiments and documenting successful patterns.
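To make the journal actionable, each hypothesis can be prototyped as soon as it is logged. Below is a rough sketch of the two entries above, assuming hypothetical session-level and review-level columns:

```python
import pandas as pd

def browsing_velocity(sessions: pd.DataFrame) -> pd.Series:
    """2024-01-15 hypothesis: pages viewed per minute on site (comparison-shopping signal).

    Assumes pages_viewed and time_on_site_minutes columns; clips time to avoid division by zero.
    """
    return sessions["pages_viewed"] / sessions["time_on_site_minutes"].clip(lower=0.1)

def review_polarization(reviews: pd.DataFrame) -> pd.Series:
    """2024-01-12 hypothesis: spread of review scores per product (polarization signal).

    Assumes product_id and review_score columns; returns one value per product.
    """
    return reviews.groupby("product_id")["review_score"].std().fillna(0)
```

Each prototype feeds directly into the validation routine shown earlier, and the result column of the journal records whether the hypothesis survived.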
Domain knowledge is the moat around effective ML systems. Anyone can apply algorithms; encoding decades of industry wisdom into features is rare and valuable. Let's consolidate the key insights:
You now understand how to extract, encode, and validate domain expertise as features. This knowledge transforms you from a model operator into a domain-aware ML engineer. Next, we'll explore interaction features—how combining features multiplicatively captures relationships that neither feature expresses alone.