Feature engineering is the process of using domain knowledge to create, select, and transform variables that make machine learning algorithms work effectively. It is often described as more art than science—and like any art, it distinguishes masters from beginners.
Here's a secret that practitioners know but textbooks understate: feature engineering typically has a larger impact on model performance than algorithm selection. Given the same data, the difference between a random forest and gradient boosting might be a few percentage points. But thoughtful feature engineering can double or triple model accuracy.
Consider fraud detection. Raw transaction data includes: amount, time, merchant, card number. That's what you have. But what predicts fraud? Unusual spending patterns, velocity of transactions, geographic anomalies, deviations from past behavior. These aren't in the raw data—they must be engineered from it. The model doesn't know what's unusual until you tell it.
By the end of this page, you will understand the principles of effective feature engineering, techniques for numeric, categorical, temporal, and text features, feature interactions and aggregations, automated feature generation, and feature selection to focus on what matters most.
Not all features are created equal. A good feature has several properties that make it useful for prediction:
Properties of Effective Features:
Predictive Power: The feature must carry information about the target. If it doesn't correlate (linearly or nonlinearly) with what you're predicting, it's noise.
Available at Prediction Time: A feature only works if you can compute it when you need to make predictions. Training on "user's 7-day purchase history" but predicting for a new user without 7 days of data creates a feature you can't populate at serving time.
Generalizable: The feature should capture stable relationships, not just quirks of the training data. Features that work because of a specific past event won't generalize to the future.
Interpretable (often desirable): Features that make sense to domain experts can be validated, debugged, and explained. 'Transaction amount / average amount for this category' is interpretable; 'seventh principal component of user behavior matrix' is not.
Computationally Tractable: Features that require minutes to compute per example become bottlenecks in real-time systems.
| Feature | Predictive? | Available? | Generalizes? | Interpretable? | Verdict |
|---|---|---|---|---|---|
| User's age | Likely yes | Yes | Yes | Yes | ✅ Good |
| Days since last purchase | Likely yes | Yes | Yes | Yes | ✅ Good |
| Whether user purchased during Black Friday 2022 | Maybe | Yes | No (one-time event) | Yes | ⚠️ Questionable |
| 35th coefficient of SVD on behavioral matrix | Perhaps | Yes | Maybe | No | ⚠️ Hard to validate |
| User's actual purchase next week | Perfect! | No (future) | N/A | N/A | ❌ Leakage |
| Random noise column | No | Yes | No | N/A | ❌ Useless |
Feature leakage occurs when information from the future or from the target itself sneaks into features. Examples: using 'treatment outcome' to predict 'treatment decision' (outcome is in the future), or using 'account status = closed' to predict churn (closing IS churning). Leakage creates models that look perfect in training but fail completely in deployment.
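A toy demonstration makes the danger concrete. In this sketch (all data synthetic), the `leak` column is just a noisy copy of the target—exactly the kind of feature leakage described above—and cross-validation alone does not catch it, because the leak is baked into the feature matrix before splitting:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X_honest = rng.normal(size=(n, 3))  # legitimate features with weak signal
y = (X_honest[:, 0] + rng.normal(scale=2, size=n) > 0).astype(int)

# Leaky feature: a noisy copy of the target itself
leak = y + rng.normal(scale=0.1, size=n)
X_leaky = np.column_stack([X_honest, leak])

honest_acc = cross_val_score(LogisticRegression(max_iter=1000), X_honest, y, cv=5).mean()
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()
print(f"honest: {honest_acc:.2f}, leaky: {leaky_acc:.2f}")
```

The leaky model scores near-perfectly in validation yet would collapse in deployment, where the future target is unavailable.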
The Feature Engineering Loop:
Feature engineering is iterative: hypothesize a feature from domain knowledge or error analysis, implement it, measure its impact on validation performance, then keep it or discard it and form the next hypothesis.
This process repeats dozens or hundreds of times in serious ML projects.
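One iteration of that loop can be run as a measurable experiment: create a candidate feature and keep it only if cross-validated performance improves. A minimal sketch on synthetic data (column names and the ratio feature are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: the target depends on a ratio the raw columns don't expose directly
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'debt': rng.uniform(0, 50_000, 2000),
    'income': rng.uniform(20_000, 120_000, 2000),
})
y = (df['debt'] / df['income']) + rng.normal(scale=0.05, size=len(df))

def cv_score(features):
    model = GradientBoostingRegressor(random_state=0)
    return cross_val_score(model, df[features], y, cv=5, scoring='r2').mean()

baseline = cv_score(['debt', 'income'])

# Hypothesis: the debt-to-income ratio is what actually matters
df['debt_to_income'] = df['debt'] / df['income']
with_feature = cv_score(['debt', 'income', 'debt_to_income'])
print(f"baseline R2: {baseline:.3f}, with ratio feature: {with_feature:.3f}")
```

If the candidate does not move the validation metric, discard it and move to the next hypothesis—accumulating features that "might help" only adds noise.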
Numeric features seem straightforward—they're already numbers! But raw numeric values often aren't the most useful form. Engineering transforms them into more predictive representations.
Key Numeric Engineering Techniques:
```python
import numpy as np
import pandas as pd

# ===== RATIO FEATURES =====
df['debt_to_income'] = df['debt'] / (df['income'] + 1)  # +1 to avoid division by zero
df['price_per_sqft'] = df['price'] / df['sqft']
df['click_through_rate'] = df['clicks'] / (df['impressions'] + 1)

# ===== DIFFERENCE FEATURES =====
df['price_change'] = df['current_price'] - df['previous_price']
df['price_change_pct'] = (df['current_price'] - df['previous_price']) / (df['previous_price'] + 1)

# ===== BINNING =====
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 25, 35, 50, 65, 100],
                         labels=['<18', '18-24', '25-34', '35-49', '50-64', '65+'])

# Quantile binning (equal-frequency)
df['income_quintile'] = pd.qcut(df['income'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

# ===== RELATIVE TO GROUP =====
# Compare individual value to group average
user_avg_amount = df.groupby('user_id')['amount'].transform('mean')
df['amount_vs_user_avg'] = df['amount'] / (user_avg_amount + 1)

# Express as percentile within group
df['amount_percentile'] = df.groupby('category')['amount'].rank(pct=True)

# ===== POLYNOMIAL FEATURES =====
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['x1', 'x2']])  # Creates: x1, x2, x1², x1*x2, x2²

# ===== LOG TRANSFORMATION =====
df['income_log'] = np.log1p(df['income'])  # log(1 + x), handles zeros gracefully
```

The best numeric features come from understanding the domain. In fraud detection, 'transaction amount' is less predictive than 'transaction amount relative to user's usual spending'. In real estate, 'price per square foot' is more useful than price or square footage alone. Always ask: what comparison would a domain expert make?
Categorical features represent discrete values without inherent numeric meaning: country, product type, user segment, etc. Most algorithms require numeric inputs, so categorical features need careful engineering.
Encoding Strategies:
| Encoding | Description | Best For | Watch Out For |
|---|---|---|---|
| One-Hot | Binary column per category | Low cardinality (<20), nominal | Explosion with many categories |
| Label/Ordinal | Integer per category | Tree-based models, ordinal categories | Implies false ordering for nominal |
| Target Encoding | Replace with target mean for that category | High cardinality, any model | Leakage without careful regularization |
| Frequency Encoding | Replace with category frequency | When popularity matters | May conflate distinct categories |
| Binary Encoding | Binary representation of label encoding | Medium-high cardinality | Harder to interpret |
| Hash Encoding | Hash to fixed-size vector | Very high cardinality, streaming | Collisions conflate categories |
| Learned Embedding | Neural network learns dense vector | Deep learning, rich category structure | Requires enough data, training overhead |
High-Cardinality Categoricals:
Variables like user IDs, product IDs, or zip codes can have thousands or millions of unique values. One-hot encoding is impractical. Solutions include grouping rare categories, regularized target encoding, frequency encoding, hashing, and learned embeddings:
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder

# ===== ONE-HOT ENCODING =====
# For low cardinality categoricals
df_encoded = pd.get_dummies(df, columns=['country', 'device_type'])

# ===== HANDLING RARE CATEGORIES =====
# Group infrequent categories into 'OTHER'
threshold = 100  # minimum count to keep
value_counts = df['product_id'].value_counts()
rare_categories = value_counts[value_counts < threshold].index
df['product_id_grouped'] = df['product_id'].replace(rare_categories, 'OTHER')

# ===== TARGET ENCODING WITH REGULARIZATION =====
# Smoothed target encoding: blend with global mean
def target_encode_smoothed(df, cat_col, target_col, min_samples=30, smoothing=10):
    """
    Regularized target encoding to prevent overfitting.
    Categories with few samples regress toward the global mean.
    Returns a mapping from category value to encoded value.
    """
    global_mean = df[target_col].mean()
    agg = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    # Smoothing factor: weight toward global mean for small samples
    smooth = 1 / (1 + np.exp(-(agg['count'] - min_samples) / smoothing))
    agg['encoded'] = smooth * agg['mean'] + (1 - smooth) * global_mean
    return agg['encoded']

# Apply with cross-validation folds to prevent leakage
from sklearn.model_selection import KFold

df['category_target_enc'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Fit the encoding on the training folds only
    mapping = target_encode_smoothed(df.iloc[train_idx], 'category', 'target')
    val_rows = df.index[val_idx]
    df.loc[val_rows, 'category_target_enc'] = df.loc[val_rows, 'category'].map(mapping)

# ===== FREQUENCY ENCODING =====
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)

# ===== HIERARCHICAL FEATURES =====
# Extract information from hierarchical categories
df['zip_prefix'] = df['zip_code'].str[:3]  # Zip to region
df['email_domain'] = df['email'].str.split('@').str[1]
```

Target encoding uses the target variable to create features, which risks leakage. Always use cross-validation: encode the validation fold using statistics from only the training folds. Never compute target statistics on the same data you're encoding.
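For the very highest cardinalities, the hash-encoding strategy from the table above avoids storing any per-category state at all. A minimal sketch using scikit-learn's `FeatureHasher` (the `user_id` values are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'user_id': ['u1', 'u2', 'u3', 'u1', 'u9999']})

# Hash each ID into a fixed-size vector; collisions can conflate categories,
# but memory stays constant no matter how many unique IDs appear
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[uid] for uid in df['user_id']]).toarray()

hashed_df = pd.DataFrame(hashed, columns=[f'user_hash_{i}' for i in range(8)])
print(hashed_df.shape)  # (5, 8) regardless of cardinality
```

Because the mapping is a pure function of the string, the same ID always hashes to the same vector, which also makes this encoding suitable for streaming data with previously unseen categories.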
Temporal features—data with timestamps—require specialized engineering. Raw timestamps are rarely useful, but the patterns they encode are invaluable.
Types of Temporal Features:
```python
import pandas as pd
import numpy as np

# Convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# ===== DATE/TIME COMPONENTS =====
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['timestamp'].dt.quarter

# ===== CYCLIC ENCODING =====
# Encode cyclical features to preserve circularity
# Hour 23 should be close to hour 0
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['day_of_week_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)
df['month_sin'] = np.sin(2 * np.pi * (df['month'] - 1) / 12)
df['month_cos'] = np.cos(2 * np.pi * (df['month'] - 1) / 12)

# ===== TIME SINCE EVENTS =====
df['days_since_signup'] = (df['timestamp'] - df['signup_date']).dt.days
df['hours_since_last_activity'] = (
    df['timestamp'] - df['last_activity']
).dt.total_seconds() / 3600

# ===== ROLLING AGGREGATIONS =====
# Sort by user and time first
df = df.sort_values(['user_id', 'timestamp'])

# Rolling mean of amount over past 7 transactions (per user)
df['amount_rolling_7_mean'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling(7, min_periods=1).mean())

# Rolling count of transactions in past 24 hours
# (time-based windows require a datetime index)
df = df.set_index('timestamp')
df['transactions_past_24h'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling('24h', min_periods=1).count())
df = df.reset_index()

# ===== LAG FEATURES =====
df['amount_lag_1'] = df.groupby('user_id')['amount'].shift(1)  # Previous transaction
df['amount_lag_7'] = df.groupby('user_id')['amount'].shift(7)  # 7 transactions ago

# Difference from lag
df['amount_change'] = df['amount'] - df['amount_lag_1']

# ===== HOLIDAY FLAGS =====
import holidays
us_holidays = holidays.US(years=df['timestamp'].dt.year.unique())
df['is_holiday'] = df['timestamp'].dt.date.isin(us_holidays).astype(int)
```

Temporal features are especially prone to leakage. Rolling aggregations must only include PAST data. If computing 'average of last 7 days' for a prediction on Monday, include only Sunday through the previous Monday—not Monday itself. Always verify that temporal features respect the prediction timestamp.
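One leak-safe pattern is to `shift(1)` before rolling, so the current row never contributes to its own window. A minimal sketch with toy single-user data (column names illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['a'] * 5,
    'amount': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# shift(1) pushes the window one row into the past, so the current
# transaction is excluded from its own rolling statistic
df['amount_past_3_mean'] = (
    df.groupby('user_id')['amount']
      .transform(lambda x: x.shift(1).rolling(3, min_periods=1).mean())
)
print(df['amount_past_3_mean'].tolist())  # [nan, 10.0, 15.0, 20.0, 30.0]
```

The first row is NaN because there is no past at all—exactly the behavior you want, since a past-looking feature has nothing legitimate to report there.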
Text data is ubiquitous: product descriptions, customer reviews, emails, chat logs. Converting unstructured text into numeric features is a specialized skill.
Text Feature Engineering Pipeline:
| Method | Description | Pros | Cons |
|---|---|---|---|
| Bag of Words | Count occurrences of each word | Simple, interpretable | Ignores word order, high dimensional |
| TF-IDF | Term frequency weighted by inverse document frequency | Down-weights common words | Still ignores order, high dimensional |
| N-grams | Sequences of N consecutive tokens | Captures some word order | Explodes dimensionality |
| Word Embeddings | Dense vectors from Word2Vec, GloVe, FastText | Captures semantic similarity | Context-independent, needs aggregation |
| Sentence Embeddings | Average word vectors or models like Sentence-BERT | Fixed-size for any text length | May lose nuance |
| Transformer Embeddings | BERT, GPT, etc. contextual representations | Contextualized, state-of-the-art | Computationally expensive |
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re

# ===== TEXT PREPROCESSING =====
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

df['text_clean'] = df['text'].apply(preprocess_text)

# ===== BASIC TEXT FEATURES =====
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text_length'] / (df['word_count'] + 1)
df['uppercase_ratio'] = df['text'].apply(
    lambda x: sum(1 for c in x if c.isupper()) / (len(x) + 1))
df['punctuation_count'] = df['text'].apply(
    lambda x: sum(1 for c in x if c in '.,!?;:'))

# ===== TF-IDF =====
tfidf = TfidfVectorizer(
    max_features=1000,   # Limit dimensionality
    min_df=5,            # Ignore rare words
    max_df=0.95,         # Ignore very common words
    ngram_range=(1, 2)   # Unigrams and bigrams
)
tfidf_matrix = tfidf.fit_transform(df['text_clean'])

# Convert to dataframe and join
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=[f'tfidf_{w}' for w in tfidf.get_feature_names_out()]
)
df = pd.concat([df.reset_index(drop=True), tfidf_df], axis=1)

# ===== WORD EMBEDDINGS (using pre-trained) =====
# With sentence-transformers library
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, good quality
embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)

# Add embedding dimensions as features
embedding_df = pd.DataFrame(
    embeddings,
    columns=[f'embed_{i}' for i in range(embeddings.shape[1])]
)
df = pd.concat([df.reset_index(drop=True), embedding_df], axis=1)
```

TF-IDF with a linear model is a surprisingly strong baseline for text classification. Before reaching for deep learning, verify that the simple approach isn't sufficient. You might be surprised.
Often, the relationship between features and the target isn't captured by features in isolation—it emerges from combinations of features.
Feature Interactions:
A feature interaction exists when the effect of one feature depends on the value of another. For example, the effect of a discount on conversion may depend on the product's price, and the fraud risk implied by a transaction amount may depend on whether it occurs at an unusual hour.
Aggregation Features:
When data has a hierarchical structure (multiple transactions per user, multiple items per order), aggregations summarize lower-level data for higher-level predictions:
| Aggregation Level | Example Aggregations |
|---|---|
| User-level from transactions | Total spend, transaction count, average amount, time since last, variety of merchants |
| Order-level from items | Order total, item count, average item price, category diversity |
| Session-level from clicks | Pages viewed, time on site, search count, depth in funnel |
Group-by Aggregations:
Beyond per-entity aggregations, comparing an entity to its group adds context:
```python
# User's spending relative to their cohort
df['spend_vs_cohort'] = df['spend'] / df.groupby('signup_cohort')['spend'].transform('mean')
```
```python
import pandas as pd
import numpy as np

# ===== FEATURE INTERACTIONS =====
# Multiplicative interaction
df['price_x_quantity'] = df['price'] * df['quantity']
df['income_x_age'] = df['income'] * df['age']

# Conditional features (interaction with indicator)
df['price_if_premium'] = df['price'] * df['is_premium']
df['amount_on_weekend'] = df['amount'] * df['is_weekend']

# Ratio interactions
df['spend_per_visit'] = df['total_spend'] / (df['visit_count'] + 1)
df['orders_per_month'] = df['order_count'] / (df['months_active'] + 1)

# Categorical combination
df['location_device'] = df['country'] + '_' + df['device_type']
df['category_daypart'] = df['product_category'] + '_' + df['daypart']

# ===== AGGREGATION FEATURES =====
# Per-user aggregations from transaction data
user_agg = transactions.groupby('user_id').agg({
    'amount': ['sum', 'mean', 'std', 'max', 'count'],
    'timestamp': ['min', 'max'],
    'merchant_id': 'nunique',  # Number of unique merchants
    'category': lambda x: x.mode()[0] if len(x.mode()) > 0 else 'unknown'  # Most common category
}).reset_index()

# Flatten column names
user_agg.columns = ['user_id', 'total_spend', 'avg_amount', 'std_amount',
                    'max_amount', 'transaction_count', 'first_transaction',
                    'last_transaction', 'merchant_diversity', 'primary_category']

# Derived features from aggregations
user_agg['days_active'] = (user_agg['last_transaction'] -
                           user_agg['first_transaction']).dt.days
user_agg['frequency'] = user_agg['transaction_count'] / (user_agg['days_active'] + 1)

# ===== RELATIVE TO GROUP =====
# Compare user to their cohort
df = df.merge(user_agg, on='user_id')
cohort_stats = df.groupby('signup_cohort')['total_spend'].agg(['mean', 'std']).reset_index()
cohort_stats.columns = ['signup_cohort', 'cohort_mean', 'cohort_std']
df = df.merge(cohort_stats, on='signup_cohort')
df['spend_z_score'] = (df['total_spend'] - df['cohort_mean']) / (df['cohort_std'] + 1)
df['above_cohort_avg'] = (df['total_spend'] > df['cohort_mean']).astype(int)
```

With N features, you have O(N²) pairwise interactions and O(N³) triple interactions. Exhaustive interaction creation leads to overfitting and computational explosion. Use domain knowledge to select meaningful interactions, or use algorithms that discover interactions automatically (tree-based models, neural networks).
After engineering many features, you often have more than your model needs—or can handle. Feature selection reduces dimensionality, improves interpretability, speeds up training, and can prevent overfitting.
Why Select Features?
- Fewer features mean faster training and cheaper, lower-latency serving.
- Removing noisy and redundant features reduces the chance of overfitting.
- A smaller feature set is easier to interpret, monitor, and debug in production.
Feature Selection Methods:
| Approach | Methods | Pros | Cons |
|---|---|---|---|
| Filter Methods | Correlation, mutual information, chi-squared, variance threshold | Fast, model-independent | Ignores feature interactions |
| Wrapper Methods | Recursive feature elimination (RFE), forward/backward selection | Considers model performance | Slow, prone to overfitting |
| Embedded Methods | L1 regularization (Lasso), tree feature importance | Built into training, considers interactions | Model-specific |
| Dimensionality Reduction | PCA, autoencoders | Creates informative combinations | Loses interpretability |
```python
import pandas as pd
import numpy as np
from sklearn.feature_selection import (
    SelectKBest, mutual_info_classif, chi2,
    RFE, SelectFromModel, VarianceThreshold
)
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier

# ===== FILTER: VARIANCE THRESHOLD =====
# Remove features with very low variance
selector = VarianceThreshold(threshold=0.01)
X_high_var = selector.fit_transform(X)

# ===== FILTER: CORRELATION =====
# Remove highly correlated features (keep one)
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
X_decorrelated = X.drop(columns=to_drop)

# ===== FILTER: MUTUAL INFORMATION =====
# Rank features by their information about the target
selector = SelectKBest(score_func=mutual_info_classif, k=50)
X_top_50 = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]

# ===== EMBEDDED: LASSO (L1) REGULARIZATION =====
# Features with non-zero coefficients are selected
# (LassoCV assumes a continuous target; for classification,
# use LogisticRegression with penalty='l1' instead)
lasso = LassoCV(cv=5)
lasso.fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]
print(f"Lasso selected {len(selected_features)} features")

# ===== EMBEDDED: TREE IMPORTANCE =====
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Select top N by importance
top_n = 30
selected_features = importance_df.head(top_n)['feature'].tolist()

# ===== WRAPPER: RECURSIVE FEATURE ELIMINATION =====
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=20)
X_rfe = rfe.fit_transform(X, y)
selected_features = X.columns[rfe.support_]
```

Feature importance can be unstable—small changes in data lead to different selected features. For robustness, run selection on multiple bootstrap samples or cross-validation folds, and keep features that are consistently selected. This is especially important for interpretability.
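That bootstrap approach can be sketched in a few lines. This toy example (synthetic data, illustrative thresholds) counts how often each feature lands in the importance top 10 across resamples and keeps only the consistently selected ones:

```python
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 20 features, only 5 carry signal
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])

# Count how often each feature makes the top 10 across bootstrap samples
counts = Counter()
rng = np.random.default_rng(0)
n_rounds = 20
for _ in range(n_rounds):
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample with replacement
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X.iloc[idx], y[idx])
    top = X.columns[np.argsort(rf.feature_importances_)[-10:]]
    counts.update(top)

# Keep features selected in at least 80% of rounds
stable = [f for f, c in counts.items() if c >= 0.8 * n_rounds]
print(sorted(stable))
```

Features that survive this filter are far more likely to reflect real signal than a single importance ranking, which can shuffle substantially between refits.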
Feature engineering is where domain knowledge meets machine learning. It transforms raw data into information-rich representations that enable models to learn effectively. Let's consolidate the key insights:
- Good features are predictive, available at prediction time, generalizable, and free of leakage.
- The strongest features usually encode domain comparisons: ratios, differences, and values relative to a group.
- Match categorical encodings to cardinality, and regularize target encoding with cross-validation folds.
- Temporal features need cyclic encodings and strictly past-only windows; text features range from TF-IDF baselines to transformer embeddings.
- Interactions and aggregations capture relationships single features miss; select features deliberately and check that the selection is stable.
What's Next:
With well-engineered features, you're ready for Model Selection and Training—choosing from the vast landscape of machine learning algorithms, configuring them appropriately, and training them to achieve optimal performance on your prepared dataset.
You now understand feature engineering comprehensively: what makes good features, how to engineer numeric, categorical, temporal, and text features, creating interactions and aggregations, and selecting the most valuable features. This skill will be one of your most powerful tools as an ML practitioner.