Feature engineering is the process of using domain knowledge to create, select, and transform variables that make machine learning algorithms work effectively. It is often described as more art than science—and like any art, it distinguishes masters from beginners.
Here's a secret that practitioners know but textbooks understate: feature engineering typically has a larger impact on model performance than algorithm selection. Given the same data, the difference between a random forest and gradient boosting might be a few percentage points. But thoughtful feature engineering can double or triple model accuracy.
Consider fraud detection. Raw transaction data includes: amount, time, merchant, card number. That's what you have. But what predicts fraud? Unusual spending patterns, velocity of transactions, geographic anomalies, deviations from past behavior. These aren't in the raw data—they must be engineered from it. The model doesn't know what's unusual until you tell it.
By the end of this page, you will understand the principles of effective feature engineering, techniques for numeric, categorical, temporal, and text features, feature interactions and aggregations, automated feature generation, and feature selection to focus on what matters most.
Not all features are created equal. A good feature has several properties that make it useful for prediction:
Properties of Effective Features:
Predictive Power: The feature must carry information about the target. If it doesn't correlate (linearly or nonlinearly) with what you're predicting, it's noise.
Available at Prediction Time: A feature only works if you can compute it when you need to make predictions. Training on "user's 7-day purchase history" but predicting for a new user without 7 days of data creates a feature you can't populate at serving time.
Generalizable: The feature should capture stable relationships, not just quirks of the training data. Features that work because of a specific past event won't generalize to the future.
Interpretable (often desirable): Features that make sense to domain experts can be validated, debugged, and explained. 'Transaction amount / average amount for this category' is interpretable; 'seventh principal component of user behavior matrix' is not.
Computationally Tractable: Features that require minutes to compute per example become bottlenecks in real-time systems.
| Feature | Predictive? | Available? | Generalizes? | Interpretable? | Verdict |
|---|---|---|---|---|---|
| User's age | Likely yes | Yes | Yes | Yes | ✅ Good |
| Days since last purchase | Likely yes | Yes | Yes | Yes | ✅ Good |
| Whether user purchased during Black Friday 2022 | Maybe | Yes | No (one-time event) | Yes | ⚠️ Questionable |
| 35th coefficient of SVD on behavioral matrix | Perhaps | Yes | Maybe | No | ⚠️ Hard to validate |
| User's actual purchase next week | Perfect! | No (future) | N/A | N/A | ❌ Leakage |
| Random noise column | No | Yes | No | N/A | ❌ Useless |
Feature leakage occurs when information from the future or from the target itself sneaks into features. Examples: using 'treatment outcome' to predict 'treatment decision' (outcome is in the future), or using 'account status = closed' to predict churn (closing IS churning). Leakage creates models that look perfect in training but fail completely in deployment.
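A toy demonstration makes the danger concrete. In this sketch (all data synthetic), the `leak` column is just a noisy copy of the target—exactly the kind of feature leakage described above—and cross-validation alone does not catch it, because the leak is baked into the feature matrix before splitting:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1000
X_honest = rng.normal(size=(n, 3))  # legitimate features with weak signal
y = (X_honest[:, 0] + rng.normal(scale=2, size=n) > 0).astype(int)

# Leaky feature: a noisy copy of the target itself
leak = y + rng.normal(scale=0.1, size=n)
X_leaky = np.column_stack([X_honest, leak])

honest_acc = cross_val_score(LogisticRegression(max_iter=1000), X_honest, y, cv=5).mean()
leaky_acc = cross_val_score(LogisticRegression(max_iter=1000), X_leaky, y, cv=5).mean()
print(f"honest: {honest_acc:.2f}, leaky: {leaky_acc:.2f}")
```

The leaky model scores near-perfectly in validation yet would collapse in deployment, where the future target is unavailable.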
The Feature Engineering Loop:
Feature engineering is iterative: hypothesize a feature from domain knowledge or error analysis, implement it, measure its impact on validation performance, then keep it or discard it and form the next hypothesis.
This process repeats dozens or hundreds of times in serious ML projects.
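One iteration of that loop can be run as a measurable experiment: create a candidate feature and keep it only if cross-validated performance improves. A minimal sketch on synthetic data (column names and the ratio feature are illustrative):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import GradientBoostingRegressor

# Toy data: the target depends on a ratio the raw columns don't expose directly
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'debt': rng.uniform(0, 50_000, 2000),
    'income': rng.uniform(20_000, 120_000, 2000),
})
y = (df['debt'] / df['income']) + rng.normal(scale=0.05, size=len(df))

def cv_score(features):
    model = GradientBoostingRegressor(random_state=0)
    return cross_val_score(model, df[features], y, cv=5, scoring='r2').mean()

baseline = cv_score(['debt', 'income'])

# Hypothesis: the debt-to-income ratio is what actually matters
df['debt_to_income'] = df['debt'] / df['income']
with_feature = cv_score(['debt', 'income', 'debt_to_income'])
print(f"baseline R2: {baseline:.3f}, with ratio feature: {with_feature:.3f}")
```

If the candidate does not move the validation metric, discard it and move to the next hypothesis—accumulating features that "might help" only adds noise.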
Numeric features seem straightforward—they're already numbers! But raw numeric values often aren't the most useful form. Engineering transforms them into more predictive representations.
Key Numeric Engineering Techniques:
```python
import numpy as np
import pandas as pd

# ===== RATIO FEATURES =====
df['debt_to_income'] = df['debt'] / (df['income'] + 1)  # +1 to avoid division by zero
df['price_per_sqft'] = df['price'] / df['sqft']
df['click_through_rate'] = df['clicks'] / (df['impressions'] + 1)

# ===== DIFFERENCE FEATURES =====
df['price_change'] = df['current_price'] - df['previous_price']
df['price_change_pct'] = (df['current_price'] - df['previous_price']) / (df['previous_price'] + 1)

# ===== BINNING =====
df['age_group'] = pd.cut(df['age'],
                         bins=[0, 18, 25, 35, 50, 65, 100],
                         labels=['<18', '18-24', '25-34', '35-49', '50-64', '65+'])

# Quantile binning (equal-frequency)
df['income_quintile'] = pd.qcut(df['income'], q=5, labels=['Q1', 'Q2', 'Q3', 'Q4', 'Q5'])

# ===== RELATIVE TO GROUP =====
# Compare individual value to group average
user_avg_amount = df.groupby('user_id')['amount'].transform('mean')
df['amount_vs_user_avg'] = df['amount'] / (user_avg_amount + 1)

# Express as percentile within group
df['amount_percentile'] = df.groupby('category')['amount'].rank(pct=True)

# ===== POLYNOMIAL FEATURES =====
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, include_bias=False)
X_poly = poly.fit_transform(df[['x1', 'x2']])  # Creates: x1, x2, x1², x1*x2, x2²

# ===== LOG TRANSFORMATION =====
df['income_log'] = np.log1p(df['income'])  # log(1 + x), handles zeros gracefully
```

The best numeric features come from understanding the domain. In fraud detection, 'transaction amount' is less predictive than 'transaction amount relative to user's usual spending'. In real estate, 'price per square foot' is more useful than price or square footage alone. Always ask: what comparison would a domain expert make?
Categorical features represent discrete values without inherent numeric meaning: country, product type, user segment, etc. Most algorithms require numeric inputs, so categorical features need careful engineering.
Encoding Strategies:
| Encoding | Description | Best For | Watch Out For |
|---|---|---|---|
| One-Hot | Binary column per category | Low cardinality (<20), nominal | Explosion with many categories |
| Label/Ordinal | Integer per category | Tree-based models, ordinal categories | Implies false ordering for nominal |
| Target Encoding | Replace with target mean for that category | High cardinality, any model | Leakage without careful regularization |
| Frequency Encoding | Replace with category frequency | When popularity matters | May conflate distinct categories |
| Binary Encoding | Binary representation of label encoding | Medium-high cardinality | Harder to interpret |
| Hash Encoding | Hash to fixed-size vector | Very high cardinality, streaming | Collisions conflate categories |
| Learned Embedding | Neural network learns dense vector | Deep learning, rich category structure | Requires enough data, training overhead |
High-Cardinality Categoricals:
Variables like user IDs, product IDs, or zip codes can have thousands or millions of unique values. One-hot encoding is impractical. Solutions include grouping rare categories, regularized target encoding, frequency encoding, hashing, and learned embeddings:
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from category_encoders import TargetEncoder

# ===== ONE-HOT ENCODING =====
# For low cardinality categoricals
df_encoded = pd.get_dummies(df, columns=['country', 'device_type'])

# ===== HANDLING RARE CATEGORIES =====
# Group infrequent categories into 'OTHER'
threshold = 100  # minimum count to keep
value_counts = df['product_id'].value_counts()
rare_categories = value_counts[value_counts < threshold].index
df['product_id_grouped'] = df['product_id'].replace(rare_categories, 'OTHER')

# ===== TARGET ENCODING WITH REGULARIZATION =====
# Smoothed target encoding: blend with global mean
def target_encode_smoothed(df, cat_col, target_col, min_samples=30, smoothing=10):
    """
    Regularized target encoding to prevent overfitting.
    Categories with few samples regress toward the global mean.
    Returns a mapping from category value to encoded value.
    """
    global_mean = df[target_col].mean()
    agg = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    # Smoothing factor: weight toward global mean for small samples
    smooth = 1 / (1 + np.exp(-(agg['count'] - min_samples) / smoothing))
    agg['encoded'] = smooth * agg['mean'] + (1 - smooth) * global_mean
    return agg['encoded']

# Apply with cross-validation folds to prevent leakage
from sklearn.model_selection import KFold

df['category_target_enc'] = np.nan
kf = KFold(n_splits=5, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Fit the encoding on the training folds only
    mapping = target_encode_smoothed(df.iloc[train_idx], 'category', 'target')
    val_rows = df.index[val_idx]
    df.loc[val_rows, 'category_target_enc'] = df.loc[val_rows, 'category'].map(mapping)

# ===== FREQUENCY ENCODING =====
freq = df['category'].value_counts(normalize=True)
df['category_freq'] = df['category'].map(freq)

# ===== HIERARCHICAL FEATURES =====
# Extract information from hierarchical categories
df['zip_prefix'] = df['zip_code'].str[:3]  # Zip to region
df['email_domain'] = df['email'].str.split('@').str[1]
```

Target encoding uses the target variable to create features, which risks leakage. Always use cross-validation: encode the validation fold using statistics from only the training folds. Never compute target statistics on the same data you're encoding.
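For the very highest cardinalities, the hash-encoding strategy from the table above avoids storing any per-category state at all. A minimal sketch using scikit-learn's `FeatureHasher` (the `user_id` values are illustrative):

```python
import pandas as pd
from sklearn.feature_extraction import FeatureHasher

df = pd.DataFrame({'user_id': ['u1', 'u2', 'u3', 'u1', 'u9999']})

# Hash each ID into a fixed-size vector; collisions can conflate categories,
# but memory stays constant no matter how many unique IDs appear
hasher = FeatureHasher(n_features=8, input_type='string')
hashed = hasher.transform([[uid] for uid in df['user_id']]).toarray()

hashed_df = pd.DataFrame(hashed, columns=[f'user_hash_{i}' for i in range(8)])
print(hashed_df.shape)  # (5, 8) regardless of cardinality
```

Because the mapping is a pure function of the string, the same ID always hashes to the same vector, which also makes this encoding suitable for streaming data with previously unseen categories.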
Temporal features—data with timestamps—require specialized engineering. Raw timestamps are rarely useful, but the patterns they encode are invaluable.
Types of Temporal Features:
```python
import pandas as pd
import numpy as np

# Convert to datetime
df['timestamp'] = pd.to_datetime(df['timestamp'])

# ===== DATE/TIME COMPONENTS =====
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['day'] = df['timestamp'].dt.day
df['hour'] = df['timestamp'].dt.hour
df['day_of_week'] = df['timestamp'].dt.dayofweek  # Monday=0, Sunday=6
df['is_weekend'] = df['day_of_week'].isin([5, 6]).astype(int)
df['quarter'] = df['timestamp'].dt.quarter

# ===== CYCLIC ENCODING =====
# Encode cyclical features to preserve circularity
# Hour 23 should be close to hour 0
df['hour_sin'] = np.sin(2 * np.pi * df['hour'] / 24)
df['hour_cos'] = np.cos(2 * np.pi * df['hour'] / 24)
df['day_of_week_sin'] = np.sin(2 * np.pi * df['day_of_week'] / 7)
df['day_of_week_cos'] = np.cos(2 * np.pi * df['day_of_week'] / 7)
df['month_sin'] = np.sin(2 * np.pi * (df['month'] - 1) / 12)
df['month_cos'] = np.cos(2 * np.pi * (df['month'] - 1) / 12)

# ===== TIME SINCE EVENTS =====
df['days_since_signup'] = (df['timestamp'] - df['signup_date']).dt.days
df['hours_since_last_activity'] = (
    df['timestamp'] - df['last_activity']
).dt.total_seconds() / 3600

# ===== ROLLING AGGREGATIONS =====
# Sort by user and time first
df = df.sort_values(['user_id', 'timestamp'])

# Rolling mean of amount over past 7 transactions (per user)
df['amount_rolling_7_mean'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling(7, min_periods=1).mean())

# Rolling count of transactions in past 24 hours
# (time-based windows require a datetime index)
df = df.set_index('timestamp')
df['transactions_past_24h'] = df.groupby('user_id')['amount'].transform(
    lambda x: x.rolling('24h', min_periods=1).count())
df = df.reset_index()

# ===== LAG FEATURES =====
df['amount_lag_1'] = df.groupby('user_id')['amount'].shift(1)  # Previous transaction
df['amount_lag_7'] = df.groupby('user_id')['amount'].shift(7)  # 7 transactions ago

# Difference from lag
df['amount_change'] = df['amount'] - df['amount_lag_1']

# ===== HOLIDAY FLAGS =====
import holidays
us_holidays = holidays.US(years=df['timestamp'].dt.year.unique())
df['is_holiday'] = df['timestamp'].dt.date.isin(us_holidays).astype(int)
```

Temporal features are especially prone to leakage. Rolling aggregations must only include PAST data. If computing 'average of last 7 days' for a prediction on Monday, include only Sunday through the previous Monday—not Monday itself. Always verify that temporal features respect the prediction timestamp.
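One leak-safe pattern is to `shift(1)` before rolling, so the current row never contributes to its own window. A minimal sketch with toy single-user data (column names illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    'user_id': ['a'] * 5,
    'amount': [10.0, 20.0, 30.0, 40.0, 50.0],
})

# shift(1) pushes the window one row into the past, so the current
# transaction is excluded from its own rolling statistic
df['amount_past_3_mean'] = (
    df.groupby('user_id')['amount']
      .transform(lambda x: x.shift(1).rolling(3, min_periods=1).mean())
)
print(df['amount_past_3_mean'].tolist())  # [nan, 10.0, 15.0, 20.0, 30.0]
```

The first row is NaN because there is no past at all—exactly the behavior you want, since a past-looking feature has nothing legitimate to report there.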
Text data is ubiquitous: product descriptions, customer reviews, emails, chat logs. Converting unstructured text into numeric features is a specialized skill.
Text Feature Engineering Pipeline:
| Method | Description | Pros | Cons |
|---|---|---|---|
| Bag of Words | Count occurrences of each word | Simple, interpretable | Ignores word order, high dimensional |
| TF-IDF | Term frequency weighted by inverse document frequency | Down-weights common words | Still ignores order, high dimensional |
| N-grams | Sequences of N consecutive tokens | Captures some word order | Explodes dimensionality |
| Word Embeddings | Dense vectors from Word2Vec, GloVe, FastText | Captures semantic similarity | Context-independent, needs aggregation |
| Sentence Embeddings | Average word vectors or models like Sentence-BERT | Fixed-size for any text length | May lose nuance |
| Transformer Embeddings | BERT, GPT, etc. contextual representations | Contextualized, state-of-the-art | Computationally expensive |
```python
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
import re

# ===== TEXT PREPROCESSING =====
def preprocess_text(text):
    # Lowercase
    text = text.lower()
    # Remove special characters and digits
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove extra whitespace
    text = ' '.join(text.split())
    return text

df['text_clean'] = df['text'].apply(preprocess_text)

# ===== BASIC TEXT FEATURES =====
df['text_length'] = df['text'].str.len()
df['word_count'] = df['text'].str.split().str.len()
df['avg_word_length'] = df['text_length'] / (df['word_count'] + 1)
df['uppercase_ratio'] = df['text'].apply(
    lambda x: sum(1 for c in x if c.isupper()) / (len(x) + 1))
df['punctuation_count'] = df['text'].apply(
    lambda x: sum(1 for c in x if c in '.,!?;:'))

# ===== TF-IDF =====
tfidf = TfidfVectorizer(
    max_features=1000,   # Limit dimensionality
    min_df=5,            # Ignore rare words
    max_df=0.95,         # Ignore very common words
    ngram_range=(1, 2)   # Unigrams and bigrams
)
tfidf_matrix = tfidf.fit_transform(df['text_clean'])

# Convert to dataframe and join
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=[f'tfidf_{w}' for w in tfidf.get_feature_names_out()]
)
df = pd.concat([df.reset_index(drop=True), tfidf_df], axis=1)

# ===== WORD EMBEDDINGS (using pre-trained) =====
# With sentence-transformers library
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('all-MiniLM-L6-v2')  # Fast, good quality
embeddings = model.encode(df['text'].tolist(), show_progress_bar=True)

# Add embedding dimensions as features
embedding_df = pd.DataFrame(
    embeddings,
    columns=[f'embed_{i}' for i in range(embeddings.shape[1])]
)
df = pd.concat([df.reset_index(drop=True), embedding_df], axis=1)
```

TF-IDF with a linear model is a surprisingly strong baseline for text classification. Before reaching for deep learning, verify that the simple approach isn't sufficient. You might be surprised.
Often, the relationship between features and the target isn't captured by features in isolation—it emerges from combinations of features.
Feature Interactions:
A feature interaction exists when the effect of one feature depends on the value of another. For example, the effect of a discount on conversion may depend on the product's price, and the fraud risk implied by a transaction amount may depend on whether it occurs at an unusual hour.
Aggregation Features:
When data has a hierarchical structure (multiple transactions per user, multiple items per order), aggregations summarize lower-level data for higher-level predictions:
| Aggregation Level | Example Aggregations |
|---|---|
| User-level from transactions | Total spend, transaction count, average amount, time since last, variety of merchants |
| Order-level from items | Order total, item count, average item price, category diversity |
| Session-level from clicks | Pages viewed, time on site, search count, depth in funnel |
Group-by Aggregations:
Beyond per-entity aggregations, comparing an entity to its group adds context:
```python
# User's spending relative to their cohort
df['spend_vs_cohort'] = df['spend'] / df.groupby('signup_cohort')['spend'].transform('mean')
```
```python
import pandas as pd
import numpy as np

# ===== FEATURE INTERACTIONS =====
# Multiplicative interaction
df['price_x_quantity'] = df['price'] * df['quantity']
df['income_x_age'] = df['income'] * df['age']

# Conditional features (interaction with indicator)
df['price_if_premium'] = df['price'] * df['is_premium']
df['amount_on_weekend'] = df['amount'] * df['is_weekend']

# Ratio interactions
df['spend_per_visit'] = df['total_spend'] / (df['visit_count'] + 1)
df['orders_per_month'] = df['order_count'] / (df['months_active'] + 1)

# Categorical combination
df['location_device'] = df['country'] + '_' + df['device_type']
df['category_daypart'] = df['product_category'] + '_' + df['daypart']

# ===== AGGREGATION FEATURES =====
# Per-user aggregations from transaction data
user_agg = transactions.groupby('user_id').agg({
    'amount': ['sum', 'mean', 'std', 'max', 'count'],
    'timestamp': ['min', 'max'],
    'merchant_id': 'nunique',  # Number of unique merchants
    'category': lambda x: x.mode()[0] if len(x.mode()) > 0 else 'unknown'  # Most common category
}).reset_index()

# Flatten column names
user_agg.columns = ['user_id', 'total_spend', 'avg_amount', 'std_amount',
                    'max_amount', 'transaction_count', 'first_transaction',
                    'last_transaction', 'merchant_diversity', 'primary_category']

# Derived features from aggregations
user_agg['days_active'] = (user_agg['last_transaction'] -
                           user_agg['first_transaction']).dt.days
user_agg['frequency'] = user_agg['transaction_count'] / (user_agg['days_active'] + 1)

# ===== RELATIVE TO GROUP =====
# Compare user to their cohort
df = df.merge(user_agg, on='user_id')
cohort_stats = df.groupby('signup_cohort')['total_spend'].agg(['mean', 'std']).reset_index()
cohort_stats.columns = ['signup_cohort', 'cohort_mean', 'cohort_std']
df = df.merge(cohort_stats, on='signup_cohort')
df['spend_z_score'] = (df['total_spend'] - df['cohort_mean']) / (df['cohort_std'] + 1)
df['above_cohort_avg'] = (df['total_spend'] > df['cohort_mean']).astype(int)
```

With N features, you have O(N²) pairwise interactions and O(N³) triple interactions. Exhaustive interaction creation leads to overfitting and computational explosion. Use domain knowledge to select meaningful interactions, or use algorithms that discover interactions automatically (tree-based models, neural networks).
After engineering many features, you often have more than your model needs—or can handle. Feature selection reduces dimensionality, improves interpretability, speeds up training, and can prevent overfitting.
Why Select Features?
- Fewer features mean faster training and cheaper, lower-latency serving.
- Removing noisy and redundant features reduces the chance of overfitting.
- A smaller feature set is easier to interpret, monitor, and debug in production.
Feature Selection Methods:
| Approach | Methods | Pros | Cons |
|---|---|---|---|
| Filter Methods | Correlation, mutual information, chi-squared, variance threshold | Fast, model-independent | Ignores feature interactions |
| Wrapper Methods | Recursive feature elimination (RFE), forward/backward selection | Considers model performance | Slow, prone to overfitting |
| Embedded Methods | L1 regularization (Lasso), tree feature importance | Built into training, considers interactions | Model-specific |
| Dimensionality Reduction | PCA, autoencoders | Creates informative combinations | Loses interpretability |
```python
import pandas as pd
import numpy as np
from sklearn.feature_selection import (
    SelectKBest, mutual_info_classif, chi2,
    RFE, SelectFromModel, VarianceThreshold
)
from sklearn.linear_model import LassoCV
from sklearn.ensemble import RandomForestClassifier

# ===== FILTER: VARIANCE THRESHOLD =====
# Remove features with very low variance
selector = VarianceThreshold(threshold=0.01)
X_high_var = selector.fit_transform(X)

# ===== FILTER: CORRELATION =====
# Remove highly correlated features (keep one)
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.95)]
X_decorrelated = X.drop(columns=to_drop)

# ===== FILTER: MUTUAL INFORMATION =====
# Rank features by their information about the target
selector = SelectKBest(score_func=mutual_info_classif, k=50)
X_top_50 = selector.fit_transform(X, y)
selected_features = X.columns[selector.get_support()]

# ===== EMBEDDED: LASSO (L1) REGULARIZATION =====
# Features with non-zero coefficients are selected
# (LassoCV assumes a continuous target; for classification,
# use LogisticRegression with penalty='l1' instead)
lasso = LassoCV(cv=5)
lasso.fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]
print(f"Lasso selected {len(selected_features)} features")

# ===== EMBEDDED: TREE IMPORTANCE =====
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X, y)

# Get feature importances
importance_df = pd.DataFrame({
    'feature': X.columns,
    'importance': rf.feature_importances_
}).sort_values('importance', ascending=False)

# Select top N by importance
top_n = 30
selected_features = importance_df.head(top_n)['feature'].tolist()

# ===== WRAPPER: RECURSIVE FEATURE ELIMINATION =====
from sklearn.linear_model import LogisticRegression

estimator = LogisticRegression(max_iter=1000)
rfe = RFE(estimator, n_features_to_select=20)
X_rfe = rfe.fit_transform(X, y)
selected_features = X.columns[rfe.support_]
```

Feature importance can be unstable—small changes in data lead to different selected features. For robustness, run selection on multiple bootstrap samples or cross-validation folds, and keep features that are consistently selected. This is especially important for interpretability.
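That bootstrap approach can be sketched in a few lines. This toy example (synthetic data, illustrative thresholds) counts how often each feature lands in the importance top 10 across resamples and keeps only the consistently selected ones:

```python
import numpy as np
import pandas as pd
from collections import Counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data: 20 features, only 5 carry signal
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)
X = pd.DataFrame(X, columns=[f'f{i}' for i in range(20)])

# Count how often each feature makes the top 10 across bootstrap samples
counts = Counter()
rng = np.random.default_rng(0)
n_rounds = 20
for _ in range(n_rounds):
    idx = rng.integers(0, len(X), len(X))  # bootstrap resample with replacement
    rf = RandomForestClassifier(n_estimators=50, random_state=0)
    rf.fit(X.iloc[idx], y[idx])
    top = X.columns[np.argsort(rf.feature_importances_)[-10:]]
    counts.update(top)

# Keep features selected in at least 80% of rounds
stable = [f for f, c in counts.items() if c >= 0.8 * n_rounds]
print(sorted(stable))
```

Features that survive this filter are far more likely to reflect real signal than a single importance ranking, which can shuffle substantially between refits.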
Feature engineering is where domain knowledge meets machine learning. It transforms raw data into information-rich representations that enable models to learn effectively. Let's consolidate the key insights:
- Good features are predictive, available at prediction time, generalizable, and free of leakage.
- The strongest features usually encode domain comparisons: ratios, differences, and values relative to a group.
- Match categorical encodings to cardinality, and regularize target encoding with cross-validation folds.
- Temporal features need cyclic encodings and strictly past-only windows; text features range from TF-IDF baselines to transformer embeddings.
- Interactions and aggregations capture relationships single features miss; select features deliberately and check that the selection is stable.
What's Next:
With well-engineered features, you're ready for Model Selection and Training—choosing from the vast landscape of machine learning algorithms, configuring them appropriately, and training them to achieve optimal performance on your prepared dataset.
You now understand feature engineering comprehensively: what makes good features, how to engineer numeric, categorical, temporal, and text features, creating interactions and aggregations, and selecting the most valuable features. This skill will be one of your most powerful tools as an ML practitioner.