At the heart of Featuretools lies Deep Feature Synthesis (DFS)—a deterministic algorithm that systematically generates features by traversing relational structures and applying composable primitives. Unlike black-box AutoML systems, DFS is fully transparent: every generated feature has an explicit, traceable derivation.
The term "deep" in DFS refers not to deep learning, but to the depth of relationship traversal. Just as deep neural networks compose simple functions across many layers, DFS composes simple primitives across relationship paths.
DFS is built on a profound observation:
Every feature that a human engineer would manually create can be expressed as a composition of primitive operations applied along a path through the relational schema.
This means that instead of writing custom code for each feature, we can define a small set of reusable primitives and let DFS compose them along every valid relationship path.
The result is exhaustive coverage of the feature space bounded only by computational constraints.
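To make the observation concrete, consider a feature a human would describe as "the customer's average order value": it is simply MEAN applied to `orders.total_amount` along the `customers → orders` relationship. The sketch below contrasts a hand-written pandas version with the DFS-generated equivalent; the `es` EntitySet, the `orders_df` dataframe, and the `total_amount` column are assumed to match the running example used later on this page.

```python
import featuretools as ft

# Manual version: a hand-written groupby for one specific feature
# (assumes an `orders_df` dataframe with customer_id and total_amount columns)
avg_order_value = orders_df.groupby("customer_id")["total_amount"].mean()

# DFS version: the same feature falls out of primitive composition,
# along with every other valid MEAN/SUM aggregation over the schema
feature_matrix, features = ft.dfs(
    entityset=es,                        # assumed EntitySet with a customers -> orders relationship
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum"],
    trans_primitives=[],
    max_depth=1,
)
print("MEAN(orders.total_amount)" in [f.get_name() for f in features])  # expected: True
```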
By the end of this page, you will understand: the algorithmic foundations of DFS, how the algorithm traverses EntitySets and constructs feature trees, primitive composition rules and type constraints, stacking versus non-stacking behavior, and mathematical properties that guarantee feature validity.
Deep Feature Synthesis can be understood as a graph traversal algorithm that operates on the EntitySet schema. Here's the formal structure:
Inputs:

- `E`: A collection of dataframes D₁, D₂, ..., Dₙ with defined relationships
- `D_target`: The entity for which we want features
- `T`: Set of single-row (transform) operations
- `A`: Set of aggregation operations
- `d`: Maximum relationship hops to traverse

Outputs:

- `F`: Set of symbolic feature definitions
- `M`: Computed values for each feature across target instances
```
ALGORITHM: Deep Feature Synthesis
INPUT:  EntitySet E, Target D_target, Transforms T, Aggregations A, MaxDepth d
OUTPUT: Features F, FeatureMatrix M

PROCEDURE DFS(E, D_target, T, A, d):
    # Initialize with direct features from target entity
    F ← DirectFeatures(D_target)

    # Generate transform features for target entity
    FOR EACH column c IN D_target.columns:
        FOR EACH transform t IN T:
            IF t.input_type MATCHES c.type:
                f ← TransformFeature(t, c)
                F ← F ∪ {f}

    # Recursively generate aggregation features
    F ← F ∪ GenerateAggregationFeatures(E, D_target, A, T, d, depth=1)

    # Compute feature values
    M ← ComputeFeatureMatrix(F, E)

    RETURN F, M

PROCEDURE GenerateAggregationFeatures(E, D_parent, A, T, max_d, depth):
    IF depth > max_d:
        RETURN ∅

    agg_features ← ∅

    FOR EACH child D_child IN Children(E, D_parent):
        # Get features from child (recursive call for stacking)
        child_features ← DirectFeatures(D_child)
        child_features ← child_features ∪ TransformFeatures(D_child, T)

        IF depth < max_d:
            # Stack: aggregate over aggregated features
            child_features ← child_features ∪
                GenerateAggregationFeatures(E, D_child, A, T, max_d, depth+1)

        # Apply aggregations to child features
        FOR EACH cf IN child_features:
            FOR EACH agg IN A:
                IF agg.input_type MATCHES cf.output_type:
                    f ← AggregationFeature(agg, cf, relationship)
                    agg_features ← agg_features ∪ {f}

    RETURN agg_features
```

The `max_depth` parameter prevents infinite recursion and controls complexity.

Given the same EntitySet, primitives, and parameters, DFS will always generate the identical set of features. This determinism is crucial for reproducibility and debugging: there is no random sampling or heuristic pruning in the core algorithm.
Every feature generated by DFS can be represented as a feature tree—a hierarchical structure that encodes the complete derivation of the feature from raw data.
Consider the feature: MEAN(orders.SUM(order_items.quantity))
This feature has the following tree structure:
```
MEAN (aggregation)
    │
    ▼
orders
    │
SUM (aggregation)
    │
    ▼
order_items
    │
quantity (direct)
```
| Property | Description | Value |
|---|---|---|
| Depth | Number of relationship hops | 2 |
| Root | The primitive at the top level | MEAN |
| Base | The leaf column | order_items.quantity |
| Path | Sequence of entities | customers → orders → order_items |
| Type | Output data type | Double |
```python
import featuretools as ft

# Generate features
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum"],
    trans_primitives=["month"],
    max_depth=2
)

# Find a specific composed feature
for feat in features:
    if "MEAN(orders.SUM" in feat.get_name():
        composed_feature = feat
        break

# Inspect the feature tree
print(f"Feature: {composed_feature.get_name()}")
print(f"Depth: {composed_feature.get_depth()}")
print(f"Output type: {composed_feature.column_schema}")

# Get the base features (building blocks)
print("\nBase features used:")
for base in composed_feature.get_deep_dependencies():
    print(f"  - {base.get_name()}")

# Examine the primitive used
print(f"\nTop-level primitive: {composed_feature.primitive.name}")

# For aggregation features, see the relationship used
if hasattr(composed_feature, 'relationship_path'):
    print(f"Relationship path: {composed_feature.relationship_path}")

# Serialize the feature definition (for storage/versioning)
feature_json = composed_feature.to_dictionary()
print("\nSerialized feature definition:")
print(feature_json)
```

Featuretools can serialize feature definitions to JSON, enabling feature sets to be stored, versioned, and recomputed on new data:
```python
import featuretools as ft

# Save features to file
ft.save_features(features, "features.json")

# Load features later
loaded_features = ft.load_features("features.json")

# Compute new data using loaded definitions
new_feature_matrix = ft.calculate_feature_matrix(
    features=loaded_features,
    entityset=new_es
)
```
Not all primitive combinations are valid. DFS enforces strict composition rules based on input/output type compatibility and semantic constraints.
Primitives declare their input and output types. A primitive can only be applied if its input type matches the column's type:
| Operation | Input Type | Output Type | Can Stack Over |
|---|---|---|---|
| `MEAN` | Numeric | Double | Any Numeric |
| `SUM` | Numeric | Double | Any Numeric |
| `COUNT` | Any | Integer | N/A (no input) |
| `MODE` | Categorical | Categorical | N/A |
| `YEAR` | Datetime | Integer | N/A |
| `IS_WEEKEND` | Datetime | Boolean | N/A |
```python
from featuretools.primitives import (
    Mean, Sum, Count, Mode, Year, IsWeekend, Std
)

# Examine primitive type signatures
print("=== Primitive Type Signatures ===")

for prim_class in [Mean, Sum, Count, Mode, Year, IsWeekend, Std]:
    prim = prim_class()
    print(f"\n{prim.name}:")
    print(f"  Input types: {prim.input_types}")
    print(f"  Return type: {prim.return_type}")
    print(f"  Commutative: {getattr(prim, 'commutative', 'N/A')}")
    print(f"  Stacks on: {getattr(prim, 'stack_on', None)}")

# Example valid compositions:
# MEAN(orders.total_amount)           ✓ Numeric → Double
# SUM(orders.COUNT(order_items))      ✓ Integer → Double
# STD(orders.MEAN(order_items.qty))   ✓ Double → Double

# Example invalid compositions:
# MEAN(orders.payment_method)         ✗ Categorical → ? (type mismatch)
# YEAR(orders.total_amount)           ✗ Numeric → ? (expects Datetime)
# SUM(orders.MODE(order_items.cat))   ✗ Categorical → ? (can't sum categories)
```

When `max_depth > 1`, aggregations can be stacked, that is, applied over the results of other aggregations. This stacking creates the composed features that capture complex patterns.
Valid Stacking Patterns:

- **Aggregate → Aggregate**: Most common at depth 2
  - `MEAN(orders.SUM(order_items.quantity))` ✓
  - `MAX(orders.COUNT(order_items))` ✓
  - `STD(orders.MEAN(order_items.unit_price))` ✓
- **Transform → Aggregate**: Apply a transform, then aggregate the result
  - `SUM(orders.YEAR(order_date))` (sum of years; rarely useful)
  - `MEAN(orders.IS_WEEKEND(order_date))` (fraction of weekend orders) ✓
- **Aggregate → Transform**: Aggregate, then transform (only at the target level)
  - `LOG(SUM(orders.total_amount))` (log of total spend) ✓
  - `YEAR(MAX(orders.order_date))` (year of the most recent order) ✓

Invalid Stacking Patterns:

- `MEAN(MODE(...))`: MODE returns a categorical value, which cannot be averaged

Each level of stacking multiplies the feature count. If you have 5 aggregation primitives and 10 numeric columns, depth 1 gives 50 features, and depth 2 can give 50 × 5 = 250 stacked features per child entity, as the quick calculation below shows. Control this with primitive selection and `max_features` limits.
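The multiplier is easy to verify with plain arithmetic (no EntitySet needed); the counts below simply restate the example numbers from the warning above.

```python
# Rough feature-count arithmetic for stacked aggregations
n_agg_primitives = 5      # e.g. mean, sum, count, std, max
n_numeric_columns = 10    # numeric columns on the child entity

depth_1 = n_agg_primitives * n_numeric_columns   # direct aggregations
depth_2_per_child = depth_1 * n_agg_primitives   # each depth-1 feature re-aggregated
print(depth_1, depth_2_per_child)                # 50 250
```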
DFS navigates the EntitySet by following relationship paths—sequences of relationships that connect entities in the schema graph.
- **Forward Paths (Transform Direction):** Traverse from parent to child, bringing parent attributes to child rows. Path: `customers → orders`. Result: each order gets the customer's country, age, etc.
- **Backward Paths (Aggregation Direction):** Traverse from child to parent, aggregating child data. Path: `orders → customers`. Result: each customer gets SUM, MEAN, COUNT of their orders.
- **Deep Paths (Multi-Hop):** Traverse multiple relationships. Path: `customers → orders → order_items`. Result: aggregations span two relationship hops.
```python
import featuretools as ft
from collections import defaultdict

# Examine the relationship graph
print("EntitySet Relationships:")
for rel in es.relationships:
    parent = rel.parent_dataframe.ww.name
    child = rel.child_dataframe.ww.name
    print(f"  {parent} --[{rel.parent_column.name}]--→ {child}")

# Generate features and analyze paths
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    max_depth=3
)

# Group features by depth (path length)
features_by_depth = defaultdict(list)
for feat in features:
    depth = feat.get_depth()
    features_by_depth[depth].append(feat)

print("\nFeatures by Relationship Path Length:")
for depth, feats in sorted(features_by_depth.items()):
    print(f"  Depth {depth}: {len(feats)} features")
    if depth > 0:
        # Show a sample path
        sample = feats[0]
        print(f"    Sample: {sample.get_name()}")

# Analyze multi-hop paths
print("\n=== Multi-hop Feature Examples ===")
for feat in features:
    if feat.get_depth() >= 2:
        print(f"Feature: {feat.get_name()}")
        print(f"  Entities in path: {[p.ww.name for p in feat.entity_path]}")
        break
```

Not all paths are equally valuable. DFS provides several mechanisms to control traversal (a usage sketch follows the table):
| Strategy | Implementation | Use Case |
|---|---|---|
| Depth limit | max_depth=2 | Primary complexity control |
| Primitive selection | agg_primitives=["mean"] | Reduce features per path |
| Entity exclusion | ignore_dataframes=["logs"] | Skip irrelevant entities |
| Column exclusion | ignore_columns={"orders": ["id"]} | Skip non-informative columns |
| Relationship pruning | Manual EntitySet curation | Remove paths before DFS |
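These controls map directly onto `ft.dfs` arguments and combine freely. A minimal sketch, reusing the example values from the table (it assumes `es` contains a `logs` dataframe and that `orders` has a non-informative `id` column):

```python
import featuretools as ft

# Combine several traversal controls in one call
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean"],             # fewer primitives per path
    trans_primitives=[],
    max_depth=2,                         # primary complexity control
    ignore_dataframes=["logs"],          # skip irrelevant entities
    ignore_columns={"orders": ["id"]},   # skip non-informative columns
)
```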
Feature names in Featuretools encode the full path: MEAN(orders.order_items.quantity) shows that we aggregate quantity from order_items through the orders relationship. This self-documenting naming is invaluable for feature interpretation and debugging.
DFS rests on solid mathematical foundations that guarantee valid, interpretable features. Understanding these foundations helps in debugging and extending the algorithm.
The set of all possible features forms a lattice structure under the composition operation:
- **Base elements**: the raw columns {x₁, x₂, ..., xₙ}
- **Operations**: transform primitives T = {t₁, t₂, ...} and aggregation primitives A = {a₁, a₂, ...}
- **Composition**: f ∘ g applies f to the output of g

The lattice is bounded: the raw columns form the bottom (depth 0), and the depth limit d caps how far compositions can extend.
Let:
- n = number of columns in the target entity
- m = number of child relationships
- |T| = number of transform primitives
- |A| = number of aggregation primitives
- d = max_depth
|F| ≈ n + n·|T| + m·c·|A| · (1 + |A|)^(d-1)
where c is the average number of columns per child entity.
This exponential growth in depth explains why depth control is critical.
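Plugging illustrative numbers into the approximation makes the pattern visible; the values below are arbitrary examples, not measurements from any dataset.

```python
# Evaluate |F| ≈ n + n·|T| + m·c·|A| · (1 + |A|)^(d-1) for a few depths
n, m, c = 8, 2, 10   # columns in target, child relationships, avg columns per child (example values)
T, A = 4, 5          # number of transform / aggregation primitives

for d in range(1, 4):
    estimate = n + n * T + m * c * A * (1 + A) ** (d - 1)
    print(f"max_depth={d}: ~{estimate} features")
# max_depth=1: ~140, max_depth=2: ~640, max_depth=3: ~3640
```

The empirical count in the next code block measures the same growth on an actual EntitySet.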
```python
import featuretools as ft

def count_features_by_depth(es, target, max_d):
    """Count features generated at each max_depth level."""
    counts = []
    for d in range(max_d + 1):
        _, features = ft.dfs(
            entityset=es,
            target_dataframe_name=target,
            agg_primitives=["mean", "sum", "count", "std", "max"],
            trans_primitives=["month", "year", "day", "is_weekend"],
            max_depth=d,
            verbose=False
        )
        counts.append(len(features))
    return counts

# Analyze feature count growth
depths = [0, 1, 2, 3]
feature_counts = count_features_by_depth(es, "customers", max_d=3)

print("Feature Count by Depth:")
for d, count in zip(depths, feature_counts):
    print(f"  Depth {d}: {count:,} features")

# Calculate growth rate
for i in range(1, len(feature_counts)):
    if feature_counts[i-1] > 0:
        growth = feature_counts[i] / feature_counts[i-1]
        print(f"  Growth {i-1}→{i}: {growth:.1f}x")

# The exponential pattern becomes clear:
# Depth 0: ~10 features (direct columns)
# Depth 1: ~100 features (10x growth)
# Depth 2: ~1,000 features (10x growth)
# Depth 3: ~10,000+ features (10x growth)
```

DFS-generated features have important mathematical properties:
- **Determinism**: Same inputs → same outputs, always.
- **Type Safety**: The output type is fully determined by the composition.
- **Temporal Consistency**: With cutoff times, features only use valid historical data.
- **Compositional Semantics**: The meaning of f(g(x)) is the composition of the meanings of f and g.
Featuretools handles nulls consistently: aggregation primitives follow pandas semantics, skipping missing values rather than propagating them through the computation.
This preserves statistical validity and prevents silent corruption.
When entities have time indices, DFS becomes time-aware, ensuring that features only use data available at the prediction time. This is crucial for preventing data leakage.
When you provide a cutoff_time parameter, DFS guarantees:
For each row in the feature matrix, all aggregated data has a timestamp before that row's cutoff time.
This is enforced at the database/dataframe level before aggregation, ensuring temporal validity by construction.
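Conceptually, the cutoff filter is equivalent to restricting every aggregation to rows timestamped before the cutoff. A hedged pandas sketch of that idea, with the `orders_df` dataframe and `order_date` column assumed for illustration:

```python
import pandas as pd

def count_orders_before(orders_df: pd.DataFrame, customer_id, cutoff: pd.Timestamp) -> int:
    """Manual equivalent of COUNT(orders) at a cutoff time: filter first, then aggregate."""
    valid = orders_df[
        (orders_df["customer_id"] == customer_id)
        & (orders_df["order_date"] < cutoff)   # only data strictly before the cutoff
    ]
    return len(valid)
```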
Often you need features at different points in time for the same entity:
```python
import featuretools as ft
import pandas as pd

# Scenario: Churn prediction
# We need features at multiple time points for training

# Define cutoff times for training data
# Each row represents: (customer_id, prediction_time, label)
cutoff_df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "time": pd.to_datetime([
        "2023-02-01", "2023-03-01", "2023-04-01",  # Customer 1's windows
        "2023-02-15", "2023-03-15",                # Customer 2's windows
        "2023-02-01", "2023-03-01", "2023-04-01"   # Customer 3's windows
    ]),
    "churned_30d": [False, False, True,   # Customer 1 churned after April
                    False, False,         # Customer 2 didn't churn
                    False, True, True]    # Customer 3 churned after March
})

# Run DFS with multiple cutoff times
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_df[["customer_id", "time"]],
    cutoff_time_in_index=True,
    agg_primitives=["mean", "sum", "count", "time_since_last"],
    trans_primitives=[],
    max_depth=2
)

print("Feature matrix shape:", feature_matrix.shape)
print("Index levels:", feature_matrix.index.names)

# The feature matrix now has 8 rows (one per cutoff time)
# Each row contains features computed using only data BEFORE its cutoff

# View features for customer 1 at different times
cust_1_features = feature_matrix.loc[1]
print("\nCustomer 1 features over time:")
print(cust_1_features[["COUNT(orders)", "SUM(orders.total_amount)"]])

# Notice: COUNT(orders) increases over time as more orders occur!
```

Beyond cutoff times, you can limit the lookback window for aggregations:
```python
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_df,
    training_window="30 days"  # Only use the last 30 days of history
)
```
Use Cases for Training Windows:
| Window | Interpretation | Use Case |
|---|---|---|
"7 days" | Very recent behavior | Real-time scoring |
"30 days" | Monthly behavior | Monthly engagement |
"90 days" | Quarterly patterns | Seasonal products |
"1 year" | Annual cycles | Year-over-year |
| No window | All history | Lifetime value |
Time-aware DFS is one of the most valuable features of Featuretools. Data leakage is a leading cause of ML model failures in production—models that appear to work brilliantly in backtesting but fail when deployed because they were trained on future data. DFS eliminates this class of bugs by construction.
Beyond the basic parameters, DFS offers fine-grained control over the feature generation process. Mastering these options enables efficient, targeted feature synthesis.
Customize primitive behavior per-column or per-entity:
```python
import featuretools as ft
from featuretools.primitives import Mean, Sum

# Customize which primitives apply to which columns
primitive_options = {
    # Only apply sum/mean to monetary columns
    "sum": {
        "include_columns": {
            "orders": ["total_amount"],
            "order_items": ["unit_price"]
        }
    },
    "mean": {
        "include_columns": {
            "orders": ["total_amount"],
            "order_items": ["unit_price", "quantity"]
        }
    },
    # Exclude IDs from count unique
    "num_unique": {
        "exclude_columns": {
            "orders": ["order_id"],
            "order_items": ["item_id"]
        }
    }
}

# Use seed features to start from specific features
base_revenue = ft.Feature(
    es["order_items"].ww["unit_price"]
) * ft.Feature(
    es["order_items"].ww["quantity"]
)
base_revenue.set_name("revenue")

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    primitive_options=primitive_options,
    seed_features=[base_revenue],  # Include custom features
    max_depth=2,
    max_features=100               # Limit total features
)

print(f"Features generated: {len(features)}")
```

| Parameter | Type | Description |
|---|---|---|
| `primitive_options` | dict | Per-primitive include/exclude column lists |
| `seed_features` | list | Pre-defined features to include and build upon |
| `drop_contains` | list | Feature name patterns to exclude (see the sketch after this table) |
| `drop_exact` | list | Exact feature names to exclude |
| `max_features` | int | Maximum number of features to generate |
| `ignore_dataframes` | list | Entities to skip during traversal |
| `ignore_columns` | dict | Columns to exclude per entity |
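The `drop_contains` and `drop_exact` parameters filter generated features by name. A brief sketch, with the patterns chosen purely for illustration:

```python
import featuretools as ft

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    max_depth=2,
    drop_contains=["MODE("],        # drop any feature whose name contains this substring
    drop_exact=["COUNT(orders)"],   # drop this exact feature name
    max_features=100,
)
```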
When built-in primitives don't suffice, you can define custom primitives:
```python
import featuretools as ft
import numpy as np
from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double


class GeometricMean(AggregationPrimitive):
    """Computes the geometric mean of a numeric column."""

    name = "geometric_mean"
    input_types = [ColumnSchema(logical_type=Double)]
    return_type = ColumnSchema(logical_type=Double)

    def get_function(self):
        def geometric_mean(x):
            x = x[x > 0]  # Only positive values
            if len(x) == 0:
                return np.nan
            return np.exp(np.log(x).mean())
        return geometric_mean


# Use the custom primitive
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", GeometricMean],
    max_depth=2
)
```
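Custom transform primitives follow the same pattern via `TransformPrimitive`. A minimal sketch is shown below; the `IsLargeOrder` name and the 100-unit threshold are made up for illustration.

```python
import featuretools as ft
from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, Double


class IsLargeOrder(TransformPrimitive):
    """Flags rows whose numeric value exceeds a (hypothetical) threshold."""

    name = "is_large_order"
    input_types = [ColumnSchema(logical_type=Double)]
    return_type = ColumnSchema(logical_type=Boolean)

    def get_function(self):
        def is_large(values):
            return values > 100  # element-wise comparison, returns a boolean Series
        return is_large


# Custom transform primitives are passed via trans_primitives
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    trans_primitives=[IsLargeOrder],
    agg_primitives=["mean"],
    max_depth=2
)
```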
Deep Feature Synthesis is a powerful algorithm that transforms the tedious, error-prone process of manual feature engineering into a systematic, reproducible operation. The key concepts to carry forward: DFS is a deterministic traversal of the EntitySet schema; every generated feature is a typed composition of primitives with an explicit feature tree; stacking and max_depth govern the exponential growth of the feature space; and cutoff times with training windows make generation time-aware, preventing leakage by construction.
What's Next:
Now that we understand how DFS generates features, we turn to the critical question: Which features are actually useful? The next page covers Feature Evaluation—methods for assessing feature quality, selecting the most predictive features, and managing the feature explosion that DFS can produce.
You now have a deep understanding of the DFS algorithm—from its recursive structure and composition rules to time-aware generation and advanced configuration. You're equipped to leverage DFS effectively while understanding its mathematical foundations.