When a data scientist manually engineers features, they draw from intuition, domain knowledge, and past experience. But human engineers face a fundamental limitation: cognitive bandwidth. We can hold perhaps 7-10 candidate features in working memory, reason about their interactions, and iteratively refine them. Meanwhile, the true space of potentially useful features is combinatorially vast.
Consider a modest e-commerce dataset with a handful of related tables: customers, orders, and order items. The number of possible features from simple aggregations and transforms alone exceeds 10,000. Add interaction features, rolling windows, and conditional aggregations, and you're looking at millions of candidates.
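To get a feel for how quickly the candidate space multiplies, here is a back-of-the-envelope count. The table, column, and primitive counts below are illustrative assumptions, not measurements of a specific dataset:

```python
# Back-of-the-envelope feature count under assumed inputs (illustrative only,
# not a Featuretools calculation): 3 child tables, 10 numeric columns each,
# 8 aggregation primitives, 10 transform primitives, one level of stacking.
tables, numeric_cols, agg_prims, trans_prims = 3, 10, 8, 10

depth_1 = tables * numeric_cols * (agg_prims + trans_prims)  # single-step features
depth_2 = depth_1 * agg_prims                                # depth-1 features re-aggregated
print(depth_1, depth_2, depth_1 + depth_2)                   # 540 4320 4860

# WHERE clauses, rolling windows, and pairwise interactions multiply this
# total further, which is how the count climbs into the millions.
```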
This is where automated feature generation shines—not by replacing human judgment, but by comprehensively exploring the feature space and surfacing candidates for evaluation.
This page provides a complete taxonomy of automatically generated features, organized by type, complexity, and use case.
By the end of this page, you will understand: the complete taxonomy of auto-generated features (direct, aggregated, transformed, and composed), how to control feature generation through primitive selection, strategies for handling different data types, and patterns for creating domain-specific features automatically.
Direct features are the simplest category—they are the raw columns of your target entity, included as-is without any transformation or aggregation. While seemingly trivial, direct features form the foundation upon which all other features are built.
| Data Type | Example Columns | Notes |
|---|---|---|
| Numeric | age, income, balance | Ready for modeling |
| Categorical | country, product_type | Require encoding |
| Boolean | is_premium, email_verified | Already binary |
| Datetime | signup_date, last_login | Usually need extraction |
| Text | product_description | Require NLP processing |
```python
import featuretools as ft

# Direct features are extracted from the target entity
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=[],    # No aggregations
    trans_primitives=[],  # No transforms
    max_depth=0           # Only direct features
)

# This gives us the raw customer columns
print("Direct Features:")
for feat in feature_defs:
    print(f"  {feat.get_name()}")

# Output example:
# Direct Features:
#   age
#   country
#   signup_date
#   email_verified
```

Setting max_depth=0 with empty primitive lists is a useful pattern for extracting just the raw columns as feature definitions. This serves as a baseline for comparing the lift from automated feature engineering.
Transform features apply operations to individual rows without requiring aggregation across multiple records. They convert raw columns into derived features that expose latent patterns.
Datetime columns are gold mines for transform features. A single timestamp can yield dozens of meaningful signals:
| Transform | Input | Output | Use Case |
|---|---|---|---|
Year | datetime | int | Long-term trends |
Month | datetime | int | Seasonality |
Weekday | datetime | int | Weekly patterns |
Hour | datetime | int | Time-of-day effects |
IsWeekend | datetime | bool | Work vs. rest |
Quarter | datetime | int | Business cycles |
DayOfYear | datetime | int | Annual patterns |
```python
import featuretools as ft

# Apply temporal transforms to extract patterns
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    trans_primitives=[
        "year", "month", "weekday", "day", "hour",
        "is_weekend", "quarter", "day_of_year"
    ],
    agg_primitives=[],
    max_depth=1
)

# Generated features from signup_date:
# - YEAR(signup_date) -> 2023
# - MONTH(signup_date) -> 1 (January)
# - WEEKDAY(signup_date) -> 6 (Sunday)
# - IS_WEEKEND(signup_date) -> True
# - QUARTER(signup_date) -> 1
# - DAY_OF_YEAR(signup_date) -> 15

print("Temporal Features Generated:")
for feat in features:
    if 'signup_date' in feat.get_name().lower():
        print(f"  {feat.get_name()}")
```

Numeric columns can be transformed to expose different aspects of their distribution:
| Transform | Formula | Use Case |
|---|---|---|
Absolute | \|x\| | Handle signed values |
Square | x² | Emphasize large values |
SquareRoot | √x | Compress large values |
Log | log(x) | Handle skewed distributions |
Percentile | rank(x)/n | Relative positioning |
ZScore | (x-μ)/σ | Standardization |
Indicator transforms create boolean flags from raw values, useful as marker features (a configured example follows the table):
| Transform | Description | Output |
|---|---|---|
IsNull | Missing value indicator | Boolean |
IsIn | Membership in set | Boolean |
GreaterThan | Threshold comparison | Boolean |
Equals | Exact value match | Boolean |
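Indicator primitives that need a parameter, such as a threshold, are passed to DFS as configured objects rather than strings. Below is a minimal sketch using GreaterThanScalar from featuretools.primitives; the threshold of 100 is an arbitrary example value, not something derived from this dataset:

```python
import featuretools as ft
from featuretools.primitives import GreaterThanScalar, IsNull

# A parameterized indicator primitive: flag orders above an arbitrary threshold
high_value_flag = GreaterThanScalar(value=100)

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="orders",
    trans_primitives=[IsNull, high_value_flag],  # mix of default and configured primitives
    agg_primitives=[],
    max_depth=1
)

# Expect feature names along the lines of IS_NULL(total_amount)
# and total_amount > 100 in the output.
```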
```python
import featuretools as ft

# Numeric and cumulative transforms
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="orders",
    trans_primitives=[
        "is_null",            # Missing value indicators
        "absolute",           # Absolute values
        "square_root",        # Compression
        "natural_logarithm",  # Log transform
        "cum_sum",            # Cumulative sum
        "cum_mean",           # Cumulative mean
        "diff",               # First difference
        "lag"                 # Previous value
    ],
    agg_primitives=[],
    max_depth=1
)

# Sample generated features for total_amount:
# - IS_NULL(total_amount) -> False
# - SQUARE_ROOT(total_amount) -> 12.247... (for 150.0)
# - NATURAL_LOGARITHM(total_amount) -> 5.01... (for 150.0)
# - CUM_SUM(total_amount) -> Running total
# - DIFF(total_amount) -> Change from previous order
# - LAG(total_amount, n=1) -> Previous order's amount
```

Aggregation features are where automated feature engineering truly shines. They summarize child entity data to create features for parent entities, capturing patterns that would require explicit GROUP BY operations in SQL.
Given a one-to-many relationship (e.g., Customer → Orders), aggregations answer questions like "How much has this customer spent in total?", "How many orders have they placed?", and "How consistent are their order values?"
The workhorse primitives for numeric summarization:
| Primitive | Description | Robustness | Use Case |
|---|---|---|---|
Sum | Total of all values | Sensitive to outliers | Totals, cumulative metrics |
Mean | Arithmetic average | Sensitive to outliers | Typical behavior |
Median | 50th percentile | Robust to outliers | Typical behavior (robust) |
Std | Standard deviation | Sensitive to outliers | Variability, consistency |
Min | Minimum value | Sensitive to extremes | Lower bounds, floors |
Max | Maximum value | Sensitive to extremes | Upper bounds, peaks |
Skew | Distribution asymmetry | Sample size dependent | Behavior patterns |
Kurtosis | Distribution tailedness | Sample size dependent | Extreme event frequency |
Percentile | Nth percentile value | Configurable | Distribution analysis |
```python
import featuretools as ft

# Comprehensive aggregation feature generation
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=[
        # Statistical
        "sum", "mean", "std", "median", "min", "max", "skew", "kurtosis",
        # Counting
        "count", "num_unique", "percent_true",
        # Mode and extremes
        "mode", "first", "last",
        # Temporal
        "time_since_last", "time_since_first", "trend"
    ],
    trans_primitives=[],
    max_depth=2
)

# Sample generated features:
# From orders (depth 1):
# - SUM(orders.total_amount) -> Total customer spend
# - MEAN(orders.total_amount) -> Average order value
# - STD(orders.total_amount) -> Order value consistency
# - COUNT(orders) -> Number of orders
# - NUM_UNIQUE(orders.payment_method) -> Payment diversity
# - TIME_SINCE_LAST(orders.order_date) -> Recency
# - TREND(orders.total_amount, orders.order_date) -> Spending trend

# From order_items via orders (depth 2):
# - MEAN(orders.SUM(order_items.quantity)) -> Avg items per order
# - SUM(orders.COUNT(order_items)) -> Total items ever
# - MAX(orders.MAX(order_items.unit_price)) -> Highest priced item

print(f"Total aggregation features: {len(features)}")
```

| Primitive | Description | Output Type |
|---|---|---|
Count | Number of child records | Integer |
NUnique | Count of unique values | Integer |
PercentTrue | Fraction of True values | Float [0,1] |
NumTrue | Count of True values | Integer |
All | Whether all values are True | Boolean |
Any | Whether any value is True | Boolean |
Time-aware aggregations are critical for sequential and event-based data:
| Primitive | Description | Insight |
|---|---|---|
TimeSinceLast | Duration since most recent | Recency |
TimeSinceFirst | Duration since earliest | Tenure |
Trend | Linear regression slope | Direction of change |
AvgTimeBetween | Mean inter-event duration | Engagement frequency |
Entropy | Shannon entropy of values | Behavioral diversity |
At depth=2, aggregations can be composed: MEAN(orders.SUM(order_items.quantity)) computes the mean across orders of the sum of quantities within each order. This captures 'average order size in items' without manual feature engineering. The combinatorial explosion at depth=2 is where most predictive signal is found.
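To make the stacked notation concrete, here is roughly what MEAN(orders.SUM(order_items.quantity)) computes, written as plain pandas. It assumes orders and order_items DataFrames with the customer_id, order_id, and quantity columns used throughout this page:

```python
import pandas as pd

# SUM(order_items.quantity) per order (the depth-1 aggregation)
items_per_order = order_items.groupby("order_id")["quantity"].sum()

# MEAN(...) of those per-order sums for each customer (the depth-2 aggregation)
avg_order_size = (
    orders.set_index("order_id")
          .join(items_per_order.rename("items_in_order"))
          .groupby("customer_id")["items_in_order"]
          .mean()
)
```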
One of Featuretools' most powerful capabilities is conditional aggregation—applying aggregation primitives only to subsets of child records that meet specific conditions. These correspond to SQL queries like:
```sql
SELECT customer_id, SUM(total_amount)
FROM orders
WHERE payment_method = 'credit'
GROUP BY customer_id
```
To generate conditional aggregations, you must:

1. Mark "interesting values" on the child dataframe's categorical columns with `es.add_interesting_values`
2. List the aggregation primitives that should receive WHERE clauses in `where_primitives` in your DFS call
```python
import featuretools as ft

# Step 1: Define interesting values
# These are categorical values we want to filter on
es.add_interesting_values(
    dataframe_name="orders",
    values={
        "payment_method": ["credit", "debit", "paypal"]
    }
)

# For boolean columns, True/False are automatically interesting
es.add_interesting_values(
    dataframe_name="order_items",
    values={}  # Boolean columns handled automatically
)

# Step 2: Generate features with WHERE clauses
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
    trans_primitives=[],
    where_primitives=["sum", "mean", "count"],  # Enable WHERE
    max_depth=2
)

# Generated WHERE features:
# - SUM(orders.total_amount WHERE payment_method = credit)
# - COUNT(orders WHERE payment_method = paypal)
# - MEAN(orders.total_amount WHERE payment_method = debit)

# Find WHERE features in the output
where_features = [f for f in features if "WHERE" in f.get_name()]
print(f"WHERE features generated: {len(where_features)}")
for feat in where_features[:10]:
    print(f"  {feat.get_name()}")
```

When child entities contain boolean columns, Featuretools automatically creates WHERE features:
```python
# If orders had a boolean column 'is_discounted'
# Generated features would include:
# - COUNT(orders WHERE is_discounted = True)
# - SUM(orders.total_amount WHERE is_discounted = True)
# - MEAN(orders.total_amount WHERE is_discounted = False)
```
Not all categorical values warrant WHERE features. Choose values that appear frequently enough to produce non-sparse features and that carry real meaning for the prediction task:
| Good Interesting Values | Poor Interesting Values |
|---|---|
country: ["US", "UK", "DE"] | country: [all 195 countries] |
status: ["active", "cancelled"] | user_id: [all user IDs] |
channel: ["web", "mobile", "api"] | timestamp: [all timestamps] |
WHERE features multiply your feature count by the number of interesting values. With 3 aggregation primitives, 5 numeric columns, and 10 interesting values, you generate 3 × 5 × 10 = 150 additional WHERE features PER relationship. Use sparingly and curate interesting values carefully.
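Two levers help act on that warning. The sketch below assumes the Featuretools 1.x API: max_values caps how many values per categorical column are auto-marked as interesting, and restricting where_primitives limits which aggregations get WHERE variants at all:

```python
import featuretools as ft

# Cap auto-detected interesting values per categorical column
es.add_interesting_values(dataframe_name="orders", max_values=3)

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
    where_primitives=["count"],  # WHERE clauses only for COUNT, not every aggregation
    trans_primitives=[],
    max_depth=2
)
```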
The true power of automated feature engineering emerges when primitives are composed—applied sequentially across relationship paths and transformation chains. This is where depth > 1 creates features that would be tedious to engineer manually.
Pattern 1: Transform → Aggregate. Apply a transform to a child entity, then aggregate to the parent:
```
MEAN(orders.MONTH(order_date))
→ "Average month of orders" (captures seasonality in purchasing)
```
Pattern 2: Aggregate → Aggregate (Stacking). Aggregate within the child entity, then aggregate those aggregates at the parent:
```
STD(orders.SUM(order_items.quantity))
→ "Standard deviation of per-order item counts" (captures order size consistency)
```
Pattern 3: Aggregate → Transform. Aggregate to create a numeric column, then transform it:
```
LOG(SUM(orders.total_amount))
→ "Log of total customer spend" (handles skewed spend distribution)
```
```python
import featuretools as ft

# Generate composed features at depth 2
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "std", "count", "max", "min"],
    trans_primitives=["month", "year", "is_weekend", "day"],
    max_depth=2
)

# Categorize features by composition pattern
direct_features = []
transform_only = []
agg_depth1 = []
agg_depth2 = []

for feat in features:
    name = feat.get_name()
    depth = feat.get_depth()

    if depth == 0:
        direct_features.append(feat)
    elif depth == 1:
        if any(agg in name for agg in ["SUM", "MEAN", "STD", "COUNT", "MAX", "MIN"]):
            agg_depth1.append(feat)
        else:
            transform_only.append(feat)
    else:  # depth >= 2
        agg_depth2.append(feat)

print("Feature Composition Breakdown:")
print(f"  Direct features: {len(direct_features)}")
print(f"  Transform-only (depth 1): {len(transform_only)}")
print(f"  Aggregation depth 1: {len(agg_depth1)}")
print(f"  Composed/stacked (depth 2+): {len(agg_depth2)}")

# Examples of composed features
print("\nSample Composed Features:")
for feat in agg_depth2[:15]:
    print(f"  {feat.get_name()}")
```

| Composed Feature | Decomposition | Interpretation |
|---|---|---|
MEAN(orders.MONTH(order_date)) | MONTH → MEAN | Average purchase month |
STD(orders.SUM(order_items.quantity)) | SUM → STD | Order size variability |
MAX(orders.COUNT(order_items)) | COUNT → MAX | Largest order by item count |
MEAN(orders.MAX(order_items.unit_price)) | MAX → MEAN | Avg most expensive item per order |
SUM(orders.NUM_UNIQUE(order_items.product_id)) | NUM_UNIQUE → SUM | Total distinct products purchased |
Composed features often capture signals that humans wouldn't think to engineer. 'Standard deviation of per-order item sums' might seem obscure, but it measures purchasing consistency—a strong churn predictor. Let the algorithm explore, then interpret the winners.
In relational databases, the most predictive signals often span multiple tables. Deep features traverse relationship paths to synthesize information from distant entities.
Consider a more complex schema:
```
Customers
└── Orders
    └── OrderItems
        └── Products
            └── Categories
```
Featuretools can generate features that span this entire chain:
```
NUM_UNIQUE(orders.order_items.products.categories.category_name)
→ "Number of distinct product categories this customer has purchased from"
```
In practice, most of the predictive signal comes from features at depths 1 and 2; greater depths add diminishing returns while the feature count and computation cost grow rapidly.
```python
import featuretools as ft
import pandas as pd

# Extended schema with products and categories
products = pd.DataFrame({
    "product_id": range(101, 120),
    "product_name": [f"Product_{i}" for i in range(101, 120)],
    "category_id": [1, 1, 2, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2],
    "base_price": [25, 50, 15, 80, 45, 50, 35, 60, 25, 15,
                   47, 25, 50, 45, 16, 47, 30, 35, 42]
})

categories = pd.DataFrame({
    "category_id": [1, 2, 3],
    "category_name": ["Electronics", "Clothing", "Home"],
    "margin_pct": [0.15, 0.40, 0.25]
})

# Add to EntitySet
es.add_dataframe(
    dataframe_name="products",
    dataframe=products,
    index="product_id"
)
es.add_dataframe(
    dataframe_name="categories",
    dataframe=categories,
    index="category_id"
)

# Add relationships
es.add_relationship("products", "product_id", "order_items", "product_id")
es.add_relationship("categories", "category_id", "products", "category_id")

# Generate deep features across the full schema
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count", "num_unique", "mode"],
    trans_primitives=[],
    max_depth=3  # Traverse up to 3 relationship hops
)

# Find deep features (depth 3)
deep_features = [f for f in features if f.get_depth() == 3]
print(f"Deep features (depth 3): {len(deep_features)}")

for feat in deep_features[:10]:
    print(f"  {feat.get_name()}")

# Example outputs:
# - MODE(orders.order_items.products.category_id)
# - NUM_UNIQUE(orders.order_items.products.categories.category_name)
# - MEAN(orders.SUM(order_items.products.base_price))
```

When an entity has multiple child relationships, Featuretools generates features from each path:
```
Customers
├── Orders → SUM(orders.total_amount)
├── Reviews → MEAN(reviews.rating)
└── SupportTickets → COUNT(support_tickets)
```
These parallel paths create complementary features that capture different aspects of customer behavior.
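As a sketch of a parallel path, the snippet below adds a hypothetical reviews table (mirroring the schematic) as a second child of customers. The column values are illustrative, and the customer_id values must match the index of your existing customers dataframe:

```python
import featuretools as ft
import pandas as pd

# Hypothetical reviews table as a second child of customers
reviews = pd.DataFrame({
    "review_id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 3],  # must reference existing customer ids
    "rating": [5, 4, 3, 5],
    "review_date": pd.to_datetime(
        ["2023-02-01", "2023-03-10", "2023-02-20", "2023-04-05"]
    )
})

es.add_dataframe(
    dataframe_name="reviews",
    dataframe=reviews,
    index="review_id",
    time_index="review_date"
)
es.add_relationship("customers", "customer_id", "reviews", "customer_id")

# DFS now generates features along both paths, e.g. SUM(orders.total_amount)
# alongside MEAN(reviews.rating) and COUNT(reviews)
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    max_depth=1
)
```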
To manage complexity at higher depths, apply one or more of these strategies (a combined sketch follows the table):
| Strategy | Implementation | Effect |
|---|---|---|
| Limit primitives | agg_primitives=["mean", "count"] | Fewer features per depth |
| Cap features | max_features=500 | Hard limit on output |
| Ignore columns | ignore_columns={"orders": ["id"]} | Exclude non-informative columns |
| Prune primitives | primitive_options={...} | Fine-grained control |
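A sketch combining several of these levers in one DFS call; the parameter values and the ignored column name are illustrative, not tuned for a particular dataset:

```python
import featuretools as ft

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "count"],    # limit primitives
    trans_primitives=["month"],
    max_depth=3,
    max_features=500,                    # hard cap on the number of features returned
    ignore_columns={"orders": ["id"]},   # exclude non-informative columns (example name)
    # primitive_options={...} adds per-primitive include/exclude rules for
    # finer-grained control; see the Featuretools documentation for its schema.
)

print(f"Features generated (capped at 500): {len(features)}")
```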
Featuretools uses Woodwork for semantic type inference and management. Properly annotating your data types is crucial for generating appropriate features.
Woodwork distinguishes between logical types, which describe how a column's data should be interpreted (Integer, Categorical, Datetime, and so on), and semantic tags, which layer additional meaning on top of the logical type (such as an index, a time index, or custom tags like "currency").
```python
import featuretools as ft
from woodwork.logical_types import (
    Categorical, Double, Integer, Datetime,
    NaturalLanguage, Boolean, Ordinal, EmailAddress
)

# Explicit type annotations
es = ft.EntitySet(id="typed_data")

es.add_dataframe(
    dataframe_name="customers",
    dataframe=customers_df,
    index="customer_id",
    time_index="signup_date",
    logical_types={
        # Numeric types
        "age": Integer,
        "lifetime_value": Double,
        "account_balance": Double,
        # Categorical types
        "country": Categorical,
        # Ordinal requires an explicit ordering (tier names here are examples)
        "subscription_tier": Ordinal(order=["basic", "plus", "premium"]),
        # Text types
        "bio": NaturalLanguage,  # Free-form text
        "email": EmailAddress,   # Special format
        # Boolean
        "is_verified": Boolean,
        "accepts_marketing": Boolean
    },
    semantic_tags={
        "account_balance": {"currency"},
        "age": {"person_attribute"}
    }
)

# Check the assigned types
print("Column Types:")
for col in es["customers"].columns:
    lt = es["customers"].ww.logical_types[col]
    tags = es["customers"].ww.semantic_tags[col]
    print(f"  {col}: {lt} | Tags: {tags}")
```

| Logical Type | Example Columns | Applicable Transforms | Applicable Aggregations |
|---|---|---|---|
Integer | age, quantity | Absolute, Negate | Sum, Mean, Std |
Double | price, rating | Log, SquareRoot | Sum, Mean, Median |
Categorical | country, status | IsIn | Mode, NUnique |
Datetime | created_at | Year, Month, Hour | TimeSinceLast, Trend |
Boolean | is_premium | Not | PercentTrue, NumTrue |
NaturalLanguage | description | NumWords, NumChars | N/A |
Use Ordinal for categories with natural ordering (e.g., education level, subscription tier). This enables additional primitives that respect the ordering, and some ML models can leverage ordinal encoding directly.
We've explored the complete taxonomy of automatically generated features, from simple direct columns to complex multi-table compositions.
What's Next:
Now that we understand what features can be generated, we'll dive into Deep Feature Synthesis (DFS)—the algorithm that systematically explores the feature space. We'll examine how DFS constructs features, how to manage the combinatorial explosion, and how to optimize the synthesis process.
You now have a comprehensive understanding of the feature generation taxonomy. From direct columns to deeply composed multi-table features, you can recognize and categorize any automatically generated feature. Next, we'll explore DFS—the engine that makes this generation possible.