When a data scientist manually engineers features, they draw from intuition, domain knowledge, and past experience. But human engineers face a fundamental limitation: cognitive bandwidth. We can hold perhaps 7-10 candidate features in working memory, reason about their interactions, and iteratively refine them. Meanwhile, the true space of potentially useful features is combinatorially vast.
Consider a modest e-commerce dataset with a handful of related tables: customers, orders, and order items. The number of possible features from simple aggregations and transforms alone exceeds 10,000. Add interaction features, rolling windows, and conditional aggregations, and you're looking at millions of candidates.
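To get a feel for how quickly the candidate space multiplies, here is a back-of-the-envelope count. The table, column, and primitive counts below are illustrative assumptions, not measurements of a specific dataset:

```python
# Back-of-the-envelope feature count under assumed inputs (illustrative only,
# not a Featuretools calculation): 3 child tables, 10 numeric columns each,
# 8 aggregation primitives, 10 transform primitives, one level of stacking.
tables, numeric_cols, agg_prims, trans_prims = 3, 10, 8, 10

depth_1 = tables * numeric_cols * (agg_prims + trans_prims)  # single-step features
depth_2 = depth_1 * agg_prims                                # depth-1 features re-aggregated
print(depth_1, depth_2, depth_1 + depth_2)                   # 540 4320 4860

# WHERE clauses, rolling windows, and pairwise interactions multiply this
# total further, which is how the count climbs into the millions.
```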
This is where automated feature generation shines—not by replacing human judgment, but by comprehensively exploring the feature space and surfacing candidates for evaluation.
This page provides a complete taxonomy of automatically generated features, organized by type, complexity, and use case.
By the end of this page, you will understand: the complete taxonomy of auto-generated features (direct, aggregated, transformed, and composed), how to control feature generation through primitive selection, strategies for handling different data types, and patterns for creating domain-specific features automatically.
Direct features are the simplest category—they are the raw columns of your target entity, included as-is without any transformation or aggregation. While seemingly trivial, direct features form the foundation upon which all other features are built.
| Data Type | Example Columns | Notes |
|---|---|---|
| Numeric | age, income, balance | Ready for modeling |
| Categorical | country, product_type | Require encoding |
| Boolean | is_premium, email_verified | Already binary |
| Datetime | signup_date, last_login | Usually need extraction |
| Text | product_description | Require NLP processing |
```python
import featuretools as ft

# Direct features are extracted from the target entity
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=[],    # No aggregations
    trans_primitives=[],  # No transforms
    max_depth=0           # Only direct features
)

# This gives us the raw customer columns
print("Direct Features:")
for feat in feature_defs:
    print(f"  {feat.get_name()}")

# Output example:
# Direct Features:
#   age
#   country
#   signup_date
#   email_verified
```

Setting max_depth=0 with empty primitive lists is a useful pattern for extracting just the raw columns as feature definitions. This serves as a baseline for comparing the lift from automated feature engineering.
Transform features apply operations to individual rows without requiring aggregation across multiple records. They convert raw columns into derived features that expose latent patterns.
Datetime columns are gold mines for transform features. A single timestamp can yield dozens of meaningful signals:
| Transform | Input | Output | Use Case |
|---|---|---|---|
Year | datetime | int | Long-term trends |
Month | datetime | int | Seasonality |
Weekday | datetime | int | Weekly patterns |
Hour | datetime | int | Time-of-day effects |
IsWeekend | datetime | bool | Work vs. rest |
Quarter | datetime | int | Business cycles |
DayOfYear | datetime | int | Annual patterns |
```python
import featuretools as ft

# Apply temporal transforms to extract patterns
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    trans_primitives=[
        "year", "month", "weekday", "day", "hour",
        "is_weekend", "quarter", "day_of_year"
    ],
    agg_primitives=[],
    max_depth=1
)

# Generated features from signup_date:
# - YEAR(signup_date) -> 2023
# - MONTH(signup_date) -> 1 (January)
# - WEEKDAY(signup_date) -> 6 (Sunday)
# - IS_WEEKEND(signup_date) -> True
# - QUARTER(signup_date) -> 1
# - DAY_OF_YEAR(signup_date) -> 15

print("Temporal Features Generated:")
for feat in features:
    if 'signup_date' in feat.get_name().lower():
        print(f"  {feat.get_name()}")
```

Numeric columns can be transformed to expose different aspects of their distribution:
| Transform | Formula | Use Case |
|---|---|---|
Absolute | \|x\| | Handle signed values |
Square | x² | Emphasize large values |
SquareRoot | √x | Compress large values |
Log | log(x) | Handle skewed distributions |
Percentile | rank(x)/n | Relative positioning |
ZScore | (x-μ)/σ | Standardization |
Indicator transforms create boolean flags from raw values, useful as marker features (a configured example follows the table):
| Transform | Description | Output |
|---|---|---|
IsNull | Missing value indicator | Boolean |
IsIn | Membership in set | Boolean |
GreaterThan | Threshold comparison | Boolean |
Equals | Exact value match | Boolean |
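Indicator primitives that need a parameter, such as a threshold, are passed to DFS as configured objects rather than strings. Below is a minimal sketch using GreaterThanScalar from featuretools.primitives; the threshold of 100 is an arbitrary example value, not something derived from this dataset:

```python
import featuretools as ft
from featuretools.primitives import GreaterThanScalar, IsNull

# A parameterized indicator primitive: flag orders above an arbitrary threshold
high_value_flag = GreaterThanScalar(value=100)

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="orders",
    trans_primitives=[IsNull, high_value_flag],  # mix of default and configured primitives
    agg_primitives=[],
    max_depth=1
)

# Expect feature names along the lines of IS_NULL(total_amount)
# and total_amount > 100 in the output.
```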
```python
import featuretools as ft

# Numeric and cumulative transforms
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="orders",
    trans_primitives=[
        "is_null",            # Missing value indicators
        "absolute",           # Absolute values
        "square_root",        # Compression
        "natural_logarithm",  # Log transform
        "cum_sum",            # Cumulative sum
        "cum_mean",           # Cumulative mean
        "diff",               # First difference
        "lag"                 # Previous value
    ],
    agg_primitives=[],
    max_depth=1
)

# Sample generated features for total_amount:
# - IS_NULL(total_amount) -> False
# - SQUARE_ROOT(total_amount) -> 12.247... (for 150.0)
# - NATURAL_LOGARITHM(total_amount) -> 5.01... (for 150.0)
# - CUM_SUM(total_amount) -> Running total
# - DIFF(total_amount) -> Change from previous order
# - LAG(total_amount, n=1) -> Previous order's amount
```

Aggregation features are where automated feature engineering truly shines. They summarize child entity data to create features for parent entities, capturing patterns that would require explicit GROUP BY operations in SQL.
Given a one-to-many relationship (e.g., Customer → Orders), aggregations answer questions like "How much has this customer spent in total?", "How many orders have they placed?", and "How consistent are their order values?"
The workhorse primitives for numeric summarization:
| Primitive | Description | Robustness | Use Case |
|---|---|---|---|
Sum | Total of all values | Sensitive to outliers | Totals, cumulative metrics |
Mean | Arithmetic average | Sensitive to outliers | Typical behavior |
Median | 50th percentile | Robust to outliers | Typical behavior (robust) |
Std | Standard deviation | Sensitive to outliers | Variability, consistency |
Min | Minimum value | Sensitive to extremes | Lower bounds, floors |
Max | Maximum value | Sensitive to extremes | Upper bounds, peaks |
Skew | Distribution asymmetry | Sample size dependent | Behavior patterns |
Kurtosis | Distribution tailedness | Sample size dependent | Extreme event frequency |
Percentile | Nth percentile value | Configurable | Distribution analysis |
```python
import featuretools as ft

# Comprehensive aggregation feature generation
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=[
        # Statistical
        "sum", "mean", "std", "median", "min", "max", "skew", "kurtosis",
        # Counting
        "count", "num_unique", "percent_true",
        # Mode and extremes
        "mode", "first", "last",
        # Temporal
        "time_since_last", "time_since_first", "trend"
    ],
    trans_primitives=[],
    max_depth=2
)

# Sample generated features:
# From orders (depth 1):
# - SUM(orders.total_amount) -> Total customer spend
# - MEAN(orders.total_amount) -> Average order value
# - STD(orders.total_amount) -> Order value consistency
# - COUNT(orders) -> Number of orders
# - NUM_UNIQUE(orders.payment_method) -> Payment diversity
# - TIME_SINCE_LAST(orders.order_date) -> Recency
# - TREND(orders.total_amount, orders.order_date) -> Spending trend

# From order_items via orders (depth 2):
# - MEAN(orders.SUM(order_items.quantity)) -> Avg items per order
# - SUM(orders.COUNT(order_items)) -> Total items ever
# - MAX(orders.MAX(order_items.unit_price)) -> Highest priced item

print(f"Total aggregation features: {len(features)}")
```

| Primitive | Description | Output Type |
|---|---|---|
Count | Number of child records | Integer |
NUnique | Count of unique values | Integer |
PercentTrue | Fraction of True values | Float [0,1] |
NumTrue | Count of True values | Integer |
All | Whether all values are True | Boolean |
Any | Whether any value is True | Boolean |
Time-aware aggregations are critical for sequential and event-based data:
| Primitive | Description | Insight |
|---|---|---|
TimeSinceLast | Duration since most recent | Recency |
TimeSinceFirst | Duration since earliest | Tenure |
Trend | Linear regression slope | Direction of change |
AvgTimeBetween | Mean inter-event duration | Engagement frequency |
Entropy | Shannon entropy of values | Behavioral diversity |
At depth=2, aggregations can be composed: MEAN(orders.SUM(order_items.quantity)) computes the mean across orders of the sum of quantities within each order. This captures 'average order size in items' without manual feature engineering. The combinatorial explosion at depth=2 is where most predictive signal is found.
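To make the stacked notation concrete, here is roughly what MEAN(orders.SUM(order_items.quantity)) computes, written as plain pandas. It assumes orders and order_items DataFrames with the customer_id, order_id, and quantity columns used throughout this page:

```python
import pandas as pd

# SUM(order_items.quantity) per order (the depth-1 aggregation)
items_per_order = order_items.groupby("order_id")["quantity"].sum()

# MEAN(...) of those per-order sums for each customer (the depth-2 aggregation)
avg_order_size = (
    orders.set_index("order_id")
          .join(items_per_order.rename("items_in_order"))
          .groupby("customer_id")["items_in_order"]
          .mean()
)
```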
One of Featuretools' most powerful capabilities is conditional aggregation—applying aggregation primitives only to subsets of child records that meet specific conditions. These correspond to SQL queries like:
```sql
SELECT customer_id, SUM(total_amount)
FROM orders
WHERE payment_method = 'credit'
GROUP BY customer_id
```
To generate conditional aggregations, you must:

1. Mark "interesting values" on the child dataframe's categorical columns with `es.add_interesting_values`
2. List the aggregation primitives that should receive WHERE clauses in `where_primitives` in your DFS call
```python
import featuretools as ft

# Step 1: Define interesting values
# These are categorical values we want to filter on
es.add_interesting_values(
    dataframe_name="orders",
    values={
        "payment_method": ["credit", "debit", "paypal"]
    }
)

# For boolean columns, True/False are automatically interesting
es.add_interesting_values(
    dataframe_name="order_items",
    values={}  # Boolean columns handled automatically
)

# Step 2: Generate features with WHERE clauses
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
    trans_primitives=[],
    where_primitives=["sum", "mean", "count"],  # Enable WHERE
    max_depth=2
)

# Generated WHERE features:
# - SUM(orders.total_amount WHERE payment_method = credit)
# - COUNT(orders WHERE payment_method = paypal)
# - MEAN(orders.total_amount WHERE payment_method = debit)

# Find WHERE features in the output
where_features = [f for f in features if "WHERE" in f.get_name()]
print(f"WHERE features generated: {len(where_features)}")
for feat in where_features[:10]:
    print(f"  {feat.get_name()}")
```

When child entities contain boolean columns, Featuretools automatically creates WHERE features:
```python
# If orders had a boolean column 'is_discounted'
# Generated features would include:
# - COUNT(orders WHERE is_discounted = True)
# - SUM(orders.total_amount WHERE is_discounted = True)
# - MEAN(orders.total_amount WHERE is_discounted = False)
```
Not all categorical values warrant WHERE features. Choose values that appear frequently enough to produce non-sparse features and that carry real meaning for the prediction task:
| Good Interesting Values | Poor Interesting Values |
|---|---|
country: ["US", "UK", "DE"] | country: [all 195 countries] |
status: ["active", "cancelled"] | user_id: [all user IDs] |
channel: ["web", "mobile", "api"] | timestamp: [all timestamps] |
WHERE features multiply your feature count by the number of interesting values. With 3 aggregation primitives, 5 numeric columns, and 10 interesting values, you generate 3 × 5 × 10 = 150 additional WHERE features PER relationship. Use sparingly and curate interesting values carefully.
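Two levers help act on that warning. The sketch below assumes the Featuretools 1.x API: max_values caps how many values per categorical column are auto-marked as interesting, and restricting where_primitives limits which aggregations get WHERE variants at all:

```python
import featuretools as ft

# Cap auto-detected interesting values per categorical column
es.add_interesting_values(dataframe_name="orders", max_values=3)

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
    where_primitives=["count"],  # WHERE clauses only for COUNT, not every aggregation
    trans_primitives=[],
    max_depth=2
)
```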
The true power of automated feature engineering emerges when primitives are composed—applied sequentially across relationship paths and transformation chains. This is where depth > 1 creates features that would be tedious to engineer manually.
Pattern 1: Transform → Aggregate. Apply a transform to a child entity, then aggregate to the parent:
```
MEAN(orders.MONTH(order_date))
→ "Average month of orders" (captures seasonality in purchasing)
```
Pattern 2: Aggregate → Aggregate (Stacking). Aggregate within the child entity, then aggregate those aggregates at the parent:
```
STD(orders.SUM(order_items.quantity))
→ "Standard deviation of per-order item counts" (captures order size consistency)
```
Pattern 3: Aggregate → Transform. Aggregate to create a numeric column, then transform it:
```
LOG(SUM(orders.total_amount))
→ "Log of total customer spend" (handles skewed spend distribution)
```
```python
import featuretools as ft

# Generate composed features at depth 2
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "std", "count", "max", "min"],
    trans_primitives=["month", "year", "is_weekend", "day"],
    max_depth=2
)

# Categorize features by composition pattern
direct_features = []
transform_only = []
agg_depth1 = []
agg_depth2 = []

for feat in features:
    name = feat.get_name()
    depth = feat.get_depth()

    if depth == 0:
        direct_features.append(feat)
    elif depth == 1:
        if any(agg in name for agg in ["SUM", "MEAN", "STD", "COUNT", "MAX", "MIN"]):
            agg_depth1.append(feat)
        else:
            transform_only.append(feat)
    else:  # depth >= 2
        agg_depth2.append(feat)

print("Feature Composition Breakdown:")
print(f"  Direct features: {len(direct_features)}")
print(f"  Transform-only (depth 1): {len(transform_only)}")
print(f"  Aggregation depth 1: {len(agg_depth1)}")
print(f"  Composed/stacked (depth 2+): {len(agg_depth2)}")

# Examples of composed features
print("\nSample Composed Features:")
for feat in agg_depth2[:15]:
    print(f"  {feat.get_name()}")
```

| Composed Feature | Decomposition | Interpretation |
|---|---|---|
MEAN(orders.MONTH(order_date)) | MONTH → MEAN | Average purchase month |
STD(orders.SUM(order_items.quantity)) | SUM → STD | Order size variability |
MAX(orders.COUNT(order_items)) | COUNT → MAX | Largest order by item count |
MEAN(orders.MAX(order_items.unit_price)) | MAX → MEAN | Avg most expensive item per order |
SUM(orders.NUM_UNIQUE(order_items.product_id)) | NUM_UNIQUE → SUM | Total distinct products purchased |
Composed features often capture signals that humans wouldn't think to engineer. 'Standard deviation of per-order item sums' might seem obscure, but it measures purchasing consistency—a strong churn predictor. Let the algorithm explore, then interpret the winners.
In relational databases, the most predictive signals often span multiple tables. Deep features traverse relationship paths to synthesize information from distant entities.
Consider a more complex schema:
```
Customers
└── Orders
    └── OrderItems
        └── Products
            └── Categories
```
Featuretools can generate features that span this entire chain:
```
NUM_UNIQUE(orders.order_items.products.categories.category_name)
→ "Number of distinct product categories this customer has purchased from"
```
In practice, most of the predictive signal comes from features at depths 1 and 2; greater depths add diminishing returns while the feature count and computation cost grow rapidly.
```python
import featuretools as ft
import pandas as pd

# Extended schema with products and categories
products = pd.DataFrame({
    "product_id": range(101, 120),
    "product_name": [f"Product_{i}" for i in range(101, 120)],
    "category_id": [1, 1, 2, 2, 3, 1, 2, 3, 1, 2, 3, 1, 2, 1, 3, 2, 1, 3, 2],
    "base_price": [25, 50, 15, 80, 45, 50, 35, 60, 25, 15,
                   47, 25, 50, 45, 16, 47, 30, 35, 42]
})

categories = pd.DataFrame({
    "category_id": [1, 2, 3],
    "category_name": ["Electronics", "Clothing", "Home"],
    "margin_pct": [0.15, 0.40, 0.25]
})

# Add to EntitySet
es.add_dataframe(
    dataframe_name="products",
    dataframe=products,
    index="product_id"
)
es.add_dataframe(
    dataframe_name="categories",
    dataframe=categories,
    index="category_id"
)

# Add relationships
es.add_relationship("products", "product_id", "order_items", "product_id")
es.add_relationship("categories", "category_id", "products", "category_id")

# Generate deep features across the full schema
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count", "num_unique", "mode"],
    trans_primitives=[],
    max_depth=3  # Traverse up to 3 relationship hops
)

# Find deep features (depth 3)
deep_features = [f for f in features if f.get_depth() == 3]
print(f"Deep features (depth 3): {len(deep_features)}")

for feat in deep_features[:10]:
    print(f"  {feat.get_name()}")

# Example outputs:
# - MODE(orders.order_items.products.category_id)
# - NUM_UNIQUE(orders.order_items.products.categories.category_name)
# - MEAN(orders.SUM(order_items.products.base_price))
```

When an entity has multiple child relationships, Featuretools generates features from each path:
```
Customers
├── Orders → SUM(orders.total_amount)
├── Reviews → MEAN(reviews.rating)
└── SupportTickets → COUNT(support_tickets)
```
These parallel paths create complementary features that capture different aspects of customer behavior.
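As a sketch of a parallel path, the snippet below adds a hypothetical reviews table (mirroring the schematic) as a second child of customers. The column values are illustrative, and the customer_id values must match the index of your existing customers dataframe:

```python
import featuretools as ft
import pandas as pd

# Hypothetical reviews table as a second child of customers
reviews = pd.DataFrame({
    "review_id": [1, 2, 3, 4],
    "customer_id": [1, 1, 2, 3],  # must reference existing customer ids
    "rating": [5, 4, 3, 5],
    "review_date": pd.to_datetime(
        ["2023-02-01", "2023-03-10", "2023-02-20", "2023-04-05"]
    )
})

es.add_dataframe(
    dataframe_name="reviews",
    dataframe=reviews,
    index="review_id",
    time_index="review_date"
)
es.add_relationship("customers", "customer_id", "reviews", "customer_id")

# DFS now generates features along both paths, e.g. SUM(orders.total_amount)
# alongside MEAN(reviews.rating) and COUNT(reviews)
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    max_depth=1
)
```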
To manage complexity at higher depths, apply one or more of these strategies (a combined sketch follows the table):
| Strategy | Implementation | Effect |
|---|---|---|
| Limit primitives | agg_primitives=["mean", "count"] | Fewer features per depth |
| Cap features | max_features=500 | Hard limit on output |
| Ignore columns | ignore_columns={"orders": ["id"]} | Exclude non-informative columns |
| Prune primitives | primitive_options={...} | Fine-grained control |
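A sketch combining several of these levers in one DFS call; the parameter values and the ignored column name are illustrative, not tuned for a particular dataset:

```python
import featuretools as ft

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "count"],    # limit primitives
    trans_primitives=["month"],
    max_depth=3,
    max_features=500,                    # hard cap on the number of features returned
    ignore_columns={"orders": ["id"]},   # exclude non-informative columns (example name)
    # primitive_options={...} adds per-primitive include/exclude rules for
    # finer-grained control; see the Featuretools documentation for its schema.
)

print(f"Features generated (capped at 500): {len(features)}")
```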
Featuretools uses Woodwork for semantic type inference and management. Properly annotating your data types is crucial for generating appropriate features.
Woodwork distinguishes between logical types, which describe how a column's data should be interpreted (Integer, Categorical, Datetime, and so on), and semantic tags, which layer additional meaning on top of the logical type (such as an index, a time index, or custom tags like "currency").
```python
import featuretools as ft
from woodwork.logical_types import (
    Categorical, Double, Integer, Datetime,
    NaturalLanguage, Boolean, Ordinal, EmailAddress
)

# Explicit type annotations
es = ft.EntitySet(id="typed_data")

es.add_dataframe(
    dataframe_name="customers",
    dataframe=customers_df,
    index="customer_id",
    time_index="signup_date",
    logical_types={
        # Numeric types
        "age": Integer,
        "lifetime_value": Double,
        "account_balance": Double,
        # Categorical types
        "country": Categorical,
        # Ordinal requires an explicit ordering (tier names here are examples)
        "subscription_tier": Ordinal(order=["basic", "plus", "premium"]),
        # Text types
        "bio": NaturalLanguage,  # Free-form text
        "email": EmailAddress,   # Special format
        # Boolean
        "is_verified": Boolean,
        "accepts_marketing": Boolean
    },
    semantic_tags={
        "account_balance": {"currency"},
        "age": {"person_attribute"}
    }
)

# Check the assigned types
print("Column Types:")
for col in es["customers"].columns:
    lt = es["customers"].ww.logical_types[col]
    tags = es["customers"].ww.semantic_tags[col]
    print(f"  {col}: {lt} | Tags: {tags}")
```

| Logical Type | Example Columns | Applicable Transforms | Applicable Aggregations |
|---|---|---|---|
Integer | age, quantity | Absolute, Negate | Sum, Mean, Std |
Double | price, rating | Log, SquareRoot | Sum, Mean, Median |
Categorical | country, status | IsIn | Mode, NUnique |
Datetime | created_at | Year, Month, Hour | TimeSinceLast, Trend |
Boolean | is_premium | Not | PercentTrue, NumTrue |
NaturalLanguage | description | NumWords, NumChars | N/A |
Use Ordinal for categories with natural ordering (e.g., education level, subscription tier). This enables additional primitives that respect the ordering, and some ML models can leverage ordinal encoding directly.
We've explored the complete taxonomy of automatically generated features, from simple direct columns to complex multi-table compositions.
What's Next:
Now that we understand what features can be generated, we'll dive into Deep Feature Synthesis (DFS)—the algorithm that systematically explores the feature space. We'll examine how DFS constructs features, how to manage the combinatorial explosion, and how to optimize the synthesis process.
You now have a comprehensive understanding of the feature generation taxonomy. From direct columns to deeply composed multi-table features, you can recognize and categorize any automatically generated feature. Next, we'll explore DFS—the engine that makes this generation possible.