At the heart of Featuretools lies Deep Feature Synthesis (DFS)—a deterministic algorithm that systematically generates features by traversing relational structures and applying composable primitives. Unlike black-box AutoML systems, DFS is fully transparent: every generated feature has an explicit, traceable derivation.
The term "deep" in DFS refers not to deep learning, but to the depth of relationship traversal. Just as deep neural networks compose simple functions across many layers, DFS composes simple primitives across relationship paths.
DFS is built on a profound observation:
Every feature that a human engineer would manually create can be expressed as a composition of primitive operations applied along a path through the relational schema.
This means that instead of writing custom code for each feature, we can define a small set of reusable primitives and let DFS compose them along every valid relationship path.
The result is exhaustive coverage of the feature space bounded only by computational constraints.
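To make the observation concrete, consider a feature a human would describe as "the customer's average order value": it is simply MEAN applied to `orders.total_amount` along the `customers → orders` relationship. The sketch below contrasts a hand-written pandas version with the DFS-generated equivalent; the `es` EntitySet, the `orders_df` dataframe, and the `total_amount` column are assumed to match the running example used later on this page.

```python
import featuretools as ft

# Manual version: a hand-written groupby for one specific feature
# (assumes an `orders_df` dataframe with customer_id and total_amount columns)
avg_order_value = orders_df.groupby("customer_id")["total_amount"].mean()

# DFS version: the same feature falls out of primitive composition,
# along with every other valid MEAN/SUM aggregation over the schema
feature_matrix, features = ft.dfs(
    entityset=es,                        # assumed EntitySet with a customers -> orders relationship
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum"],
    trans_primitives=[],
    max_depth=1,
)
print("MEAN(orders.total_amount)" in [f.get_name() for f in features])  # expected: True
```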
By the end of this page, you will understand: the algorithmic foundations of DFS, how the algorithm traverses EntitySets and constructs feature trees, primitive composition rules and type constraints, stacking versus non-stacking behavior, and mathematical properties that guarantee feature validity.
Deep Feature Synthesis can be understood as a graph traversal algorithm that operates on the EntitySet schema. Here's the formal structure:
Inputs:

- `E`: A collection of dataframes D₁, D₂, ..., Dₙ with defined relationships
- `D_target`: The entity for which we want features
- `T`: Set of single-row (transform) operations
- `A`: Set of aggregation operations
- `d`: Maximum relationship hops to traverse

Outputs:

- `F`: Set of symbolic feature definitions
- `M`: Computed values for each feature across target instances
```
ALGORITHM: Deep Feature Synthesis
INPUT:  EntitySet E, Target D_target, Transforms T, Aggregations A, MaxDepth d
OUTPUT: Features F, FeatureMatrix M

PROCEDURE DFS(E, D_target, T, A, d):
    # Initialize with direct features from target entity
    F ← DirectFeatures(D_target)

    # Generate transform features for target entity
    FOR EACH column c IN D_target.columns:
        FOR EACH transform t IN T:
            IF t.input_type MATCHES c.type:
                f ← TransformFeature(t, c)
                F ← F ∪ {f}

    # Recursively generate aggregation features
    F ← F ∪ GenerateAggregationFeatures(E, D_target, A, T, d, depth=1)

    # Compute feature values
    M ← ComputeFeatureMatrix(F, E)

    RETURN F, M

PROCEDURE GenerateAggregationFeatures(E, D_parent, A, T, max_d, depth):
    IF depth > max_d:
        RETURN ∅

    agg_features ← ∅

    FOR EACH child D_child IN Children(E, D_parent):
        # Get features from child (recursive call for stacking)
        child_features ← DirectFeatures(D_child)
        child_features ← child_features ∪ TransformFeatures(D_child, T)

        IF depth < max_d:
            # Stack: aggregate over aggregated features
            child_features ← child_features ∪
                GenerateAggregationFeatures(E, D_child, A, T, max_d, depth+1)

        # Apply aggregations to child features
        FOR EACH cf IN child_features:
            FOR EACH agg IN A:
                IF agg.input_type MATCHES cf.output_type:
                    f ← AggregationFeature(agg, cf, relationship)
                    agg_features ← agg_features ∪ {f}

    RETURN agg_features
```

The `max_depth` parameter prevents infinite recursion and controls complexity.

Given the same EntitySet, primitives, and parameters, DFS will always generate the identical set of features. This determinism is crucial for reproducibility and debugging: there is no random sampling or heuristic pruning in the core algorithm.
Every feature generated by DFS can be represented as a feature tree—a hierarchical structure that encodes the complete derivation of the feature from raw data.
Consider the feature: MEAN(orders.SUM(order_items.quantity))
This feature has the following tree structure:
```
MEAN (aggregation)
    │
    ▼
orders
    │
SUM (aggregation)
    │
    ▼
order_items
    │
quantity (direct)
```
| Property | Description | Value |
|---|---|---|
| Depth | Number of relationship hops | 2 |
| Root | The primitive at the top level | MEAN |
| Base | The leaf column | order_items.quantity |
| Path | Sequence of entities | customers → orders → order_items |
| Type | Output data type | Double |
```python
import featuretools as ft

# Generate features
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum"],
    trans_primitives=["month"],
    max_depth=2
)

# Find a specific composed feature
for feat in features:
    if "MEAN(orders.SUM" in feat.get_name():
        composed_feature = feat
        break

# Inspect the feature tree
print(f"Feature: {composed_feature.get_name()}")
print(f"Depth: {composed_feature.get_depth()}")
print(f"Output type: {composed_feature.column_schema}")

# Get the base features (building blocks)
print("\nBase features used:")
for base in composed_feature.get_deep_dependencies():
    print(f"  - {base.get_name()}")

# Examine the primitive used
print(f"\nTop-level primitive: {composed_feature.primitive.name}")

# For aggregation features, see the relationship used
if hasattr(composed_feature, 'relationship_path'):
    print(f"Relationship path: {composed_feature.relationship_path}")

# Serialize the feature definition (for storage/versioning)
feature_json = composed_feature.to_dictionary()
print("\nSerialized feature definition:")
print(feature_json)
```

Featuretools can serialize feature definitions to JSON, enabling feature sets to be stored, versioned, and recomputed on new data:
```python
import featuretools as ft

# Save features to file
ft.save_features(features, "features.json")

# Load features later
loaded_features = ft.load_features("features.json")

# Compute new data using loaded definitions
new_feature_matrix = ft.calculate_feature_matrix(
    features=loaded_features,
    entityset=new_es
)
```
Not all primitive combinations are valid. DFS enforces strict composition rules based on input/output type compatibility and semantic constraints.
Primitives declare their input and output types. A primitive can only be applied if its input type matches the column's type:
| Operation | Input Type | Output Type | Can Stack Over |
|---|---|---|---|
| `MEAN` | Numeric | Double | Any Numeric |
| `SUM` | Numeric | Double | Any Numeric |
| `COUNT` | Any | Integer | N/A (no input) |
| `MODE` | Categorical | Categorical | N/A |
| `YEAR` | Datetime | Integer | N/A |
| `IS_WEEKEND` | Datetime | Boolean | N/A |
```python
from featuretools.primitives import (
    Mean, Sum, Count, Mode, Year, IsWeekend, Std
)

# Examine primitive type signatures
print("=== Primitive Type Signatures ===")

for prim_class in [Mean, Sum, Count, Mode, Year, IsWeekend, Std]:
    prim = prim_class()
    print(f"\n{prim.name}:")
    print(f"  Input types: {prim.input_types}")
    print(f"  Return type: {prim.return_type}")
    print(f"  Commutative: {getattr(prim, 'commutative', 'N/A')}")
    print(f"  Stacks on: {getattr(prim, 'stack_on', None)}")

# Example valid compositions:
# MEAN(orders.total_amount)           ✓ Numeric → Double
# SUM(orders.COUNT(order_items))      ✓ Integer → Double
# STD(orders.MEAN(order_items.qty))   ✓ Double → Double

# Example invalid compositions:
# MEAN(orders.payment_method)         ✗ Categorical → ? (type mismatch)
# YEAR(orders.total_amount)           ✗ Numeric → ? (expects Datetime)
# SUM(orders.MODE(order_items.cat))   ✗ Categorical → ? (can't sum categories)
```

When `max_depth > 1`, aggregations can be stacked, that is, applied over the results of other aggregations. This stacking creates the composed features that capture complex patterns.
Valid Stacking Patterns:

- **Aggregate → Aggregate**: Most common at depth 2
  - `MEAN(orders.SUM(order_items.quantity))` ✓
  - `MAX(orders.COUNT(order_items))` ✓
  - `STD(orders.MEAN(order_items.unit_price))` ✓
- **Transform → Aggregate**: Apply a transform, then aggregate the result
  - `SUM(orders.YEAR(order_date))` (sum of years; rarely useful)
  - `MEAN(orders.IS_WEEKEND(order_date))` (fraction of weekend orders) ✓
- **Aggregate → Transform**: Aggregate, then transform (only at the target level)
  - `LOG(SUM(orders.total_amount))` (log of total spend) ✓
  - `YEAR(MAX(orders.order_date))` (year of the most recent order) ✓

Invalid Stacking Patterns:

- `MEAN(MODE(...))`: MODE returns a categorical value, which cannot be averaged

Each level of stacking multiplies the feature count. If you have 5 aggregation primitives and 10 numeric columns, depth 1 gives 50 features, and depth 2 can give 50 × 5 = 250 stacked features per child entity, as the quick calculation below shows. Control this with primitive selection and `max_features` limits.
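The multiplier is easy to verify with plain arithmetic (no EntitySet needed); the counts below simply restate the example numbers from the warning above.

```python
# Rough feature-count arithmetic for stacked aggregations
n_agg_primitives = 5      # e.g. mean, sum, count, std, max
n_numeric_columns = 10    # numeric columns on the child entity

depth_1 = n_agg_primitives * n_numeric_columns   # direct aggregations
depth_2_per_child = depth_1 * n_agg_primitives   # each depth-1 feature re-aggregated
print(depth_1, depth_2_per_child)                # 50 250
```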
DFS navigates the EntitySet by following relationship paths—sequences of relationships that connect entities in the schema graph.
- **Forward Paths (Transform Direction):** Traverse from parent to child, bringing parent attributes to child rows. Path: `customers → orders`. Result: each order gets the customer's country, age, etc.
- **Backward Paths (Aggregation Direction):** Traverse from child to parent, aggregating child data. Path: `orders → customers`. Result: each customer gets SUM, MEAN, COUNT of their orders.
- **Deep Paths (Multi-Hop):** Traverse multiple relationships. Path: `customers → orders → order_items`. Result: aggregations span two relationship hops.
```python
import featuretools as ft
from collections import defaultdict

# Examine the relationship graph
print("EntitySet Relationships:")
for rel in es.relationships:
    parent = rel.parent_dataframe.ww.name
    child = rel.child_dataframe.ww.name
    print(f"  {parent} --[{rel.parent_column.name}]--→ {child}")

# Generate features and analyze paths
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    max_depth=3
)

# Group features by depth (path length)
features_by_depth = defaultdict(list)
for feat in features:
    depth = feat.get_depth()
    features_by_depth[depth].append(feat)

print("\nFeatures by Relationship Path Length:")
for depth, feats in sorted(features_by_depth.items()):
    print(f"  Depth {depth}: {len(feats)} features")
    if depth > 0:
        # Show a sample path
        sample = feats[0]
        print(f"    Sample: {sample.get_name()}")

# Analyze multi-hop paths
print("\n=== Multi-hop Feature Examples ===")
for feat in features:
    if feat.get_depth() >= 2:
        print(f"Feature: {feat.get_name()}")
        print(f"  Entities in path: {[p.ww.name for p in feat.entity_path]}")
        break
```

Not all paths are equally valuable. DFS provides several mechanisms to control traversal (a usage sketch follows the table):
| Strategy | Implementation | Use Case |
|---|---|---|
| Depth limit | max_depth=2 | Primary complexity control |
| Primitive selection | agg_primitives=["mean"] | Reduce features per path |
| Entity exclusion | ignore_dataframes=["logs"] | Skip irrelevant entities |
| Column exclusion | ignore_columns={"orders": ["id"]} | Skip non-informative columns |
| Relationship pruning | Manual EntitySet curation | Remove paths before DFS |
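These controls map directly onto `ft.dfs` arguments and combine freely. A minimal sketch, reusing the example values from the table (it assumes `es` contains a `logs` dataframe and that `orders` has a non-informative `id` column):

```python
import featuretools as ft

# Combine several traversal controls in one call
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean"],             # fewer primitives per path
    trans_primitives=[],
    max_depth=2,                         # primary complexity control
    ignore_dataframes=["logs"],          # skip irrelevant entities
    ignore_columns={"orders": ["id"]},   # skip non-informative columns
)
```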
Feature names in Featuretools encode the full path: MEAN(orders.order_items.quantity) shows that we aggregate quantity from order_items through the orders relationship. This self-documenting naming is invaluable for feature interpretation and debugging.
DFS rests on solid mathematical foundations that guarantee valid, interpretable features. Understanding these foundations helps in debugging and extending the algorithm.
The set of all possible features forms a lattice structure under the composition operation:
- **Base elements**: the raw columns {x₁, x₂, ..., xₙ}
- **Operations**: transform primitives T = {t₁, t₂, ...} and aggregation primitives A = {a₁, a₂, ...}
- **Composition**: f ∘ g applies f to the output of g

The lattice is bounded: the raw columns form the bottom (depth 0), and the depth limit d caps how far compositions can extend.
Let:
- n = number of columns in the target entity
- m = number of child relationships
- |T| = number of transform primitives
- |A| = number of aggregation primitives
- d = max_depth
|F| ≈ n + n·|T| + m·c·|A| · (1 + |A|)^(d-1)
where c is the average number of columns per child entity.
This exponential growth in depth explains why depth control is critical.
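Plugging illustrative numbers into the approximation makes the pattern visible; the values below are arbitrary examples, not measurements from any dataset.

```python
# Evaluate |F| ≈ n + n·|T| + m·c·|A| · (1 + |A|)^(d-1) for a few depths
n, m, c = 8, 2, 10   # columns in target, child relationships, avg columns per child (example values)
T, A = 4, 5          # number of transform / aggregation primitives

for d in range(1, 4):
    estimate = n + n * T + m * c * A * (1 + A) ** (d - 1)
    print(f"max_depth={d}: ~{estimate} features")
# max_depth=1: ~140, max_depth=2: ~640, max_depth=3: ~3640
```

The empirical count in the next code block measures the same growth on an actual EntitySet.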
```python
import featuretools as ft

def count_features_by_depth(es, target, max_d):
    """Count features generated at each max_depth level."""
    counts = []
    for d in range(max_d + 1):
        _, features = ft.dfs(
            entityset=es,
            target_dataframe_name=target,
            agg_primitives=["mean", "sum", "count", "std", "max"],
            trans_primitives=["month", "year", "day", "is_weekend"],
            max_depth=d,
            verbose=False
        )
        counts.append(len(features))
    return counts

# Analyze feature count growth
depths = [0, 1, 2, 3]
feature_counts = count_features_by_depth(es, "customers", max_d=3)

print("Feature Count by Depth:")
for d, count in zip(depths, feature_counts):
    print(f"  Depth {d}: {count:,} features")

# Calculate growth rate
for i in range(1, len(feature_counts)):
    if feature_counts[i-1] > 0:
        growth = feature_counts[i] / feature_counts[i-1]
        print(f"  Growth {i-1}→{i}: {growth:.1f}x")

# The exponential pattern becomes clear:
# Depth 0: ~10 features (direct columns)
# Depth 1: ~100 features (10x growth)
# Depth 2: ~1,000 features (10x growth)
# Depth 3: ~10,000+ features (10x growth)
```

DFS-generated features have important mathematical properties:
- **Determinism**: Same inputs → same outputs, always.
- **Type Safety**: The output type is fully determined by the composition.
- **Temporal Consistency**: With cutoff times, features only use valid historical data.
- **Compositional Semantics**: The meaning of f(g(x)) is the composition of the meanings of f and g.
Featuretools handles nulls consistently: aggregation primitives follow pandas semantics, skipping missing values rather than propagating them through the computation.
This preserves statistical validity and prevents silent corruption.
When entities have time indices, DFS becomes time-aware, ensuring that features only use data available at the prediction time. This is crucial for preventing data leakage.
When you provide a cutoff_time parameter, DFS guarantees:
For each row in the feature matrix, all aggregated data has a timestamp before that row's cutoff time.
This is enforced at the database/dataframe level before aggregation, ensuring temporal validity by construction.
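Conceptually, the cutoff filter is equivalent to restricting every aggregation to rows timestamped before the cutoff. A hedged pandas sketch of that idea, with the `orders_df` dataframe and `order_date` column assumed for illustration:

```python
import pandas as pd

def count_orders_before(orders_df: pd.DataFrame, customer_id, cutoff: pd.Timestamp) -> int:
    """Manual equivalent of COUNT(orders) at a cutoff time: filter first, then aggregate."""
    valid = orders_df[
        (orders_df["customer_id"] == customer_id)
        & (orders_df["order_date"] < cutoff)   # only data strictly before the cutoff
    ]
    return len(valid)
```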
Often you need features at different points in time for the same entity:
```python
import featuretools as ft
import pandas as pd

# Scenario: Churn prediction
# We need features at multiple time points for training

# Define cutoff times for training data
# Each row represents: (customer_id, prediction_time, label)
cutoff_df = pd.DataFrame({
    "customer_id": [1, 1, 1, 2, 2, 3, 3, 3],
    "time": pd.to_datetime([
        "2023-02-01", "2023-03-01", "2023-04-01",  # Customer 1's windows
        "2023-02-15", "2023-03-15",                # Customer 2's windows
        "2023-02-01", "2023-03-01", "2023-04-01"   # Customer 3's windows
    ]),
    "churned_30d": [False, False, True,   # Customer 1 churned after April
                    False, False,         # Customer 2 didn't churn
                    False, True, True]    # Customer 3 churned after March
})

# Run DFS with multiple cutoff times
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_df[["customer_id", "time"]],
    cutoff_time_in_index=True,
    agg_primitives=["mean", "sum", "count", "time_since_last"],
    trans_primitives=[],
    max_depth=2
)

print("Feature matrix shape:", feature_matrix.shape)
print("Index levels:", feature_matrix.index.names)

# The feature matrix now has 8 rows (one per cutoff time)
# Each row contains features computed using only data BEFORE its cutoff

# View features for customer 1 at different times
cust_1_features = feature_matrix.loc[1]
print("\nCustomer 1 features over time:")
print(cust_1_features[["COUNT(orders)", "SUM(orders.total_amount)"]])

# Notice: COUNT(orders) increases over time as more orders occur!
```

Beyond cutoff times, you can limit the lookback window for aggregations:
```python
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_df,
    training_window="30 days"  # Only use the last 30 days of history
)
```
Use Cases for Training Windows:
| Window | Interpretation | Use Case |
|---|---|---|
"7 days" | Very recent behavior | Real-time scoring |
"30 days" | Monthly behavior | Monthly engagement |
"90 days" | Quarterly patterns | Seasonal products |
"1 year" | Annual cycles | Year-over-year |
| No window | All history | Lifetime value |
Time-aware DFS is one of the most valuable features of Featuretools. Data leakage is a leading cause of ML model failures in production—models that appear to work brilliantly in backtesting but fail when deployed because they were trained on future data. DFS eliminates this class of bugs by construction.
Beyond the basic parameters, DFS offers fine-grained control over the feature generation process. Mastering these options enables efficient, targeted feature synthesis.
Customize primitive behavior per-column or per-entity:
```python
import featuretools as ft
from featuretools.primitives import Mean, Sum

# Customize which primitives apply to which columns
primitive_options = {
    # Only apply sum/mean to monetary columns
    "sum": {
        "include_columns": {
            "orders": ["total_amount"],
            "order_items": ["unit_price"]
        }
    },
    "mean": {
        "include_columns": {
            "orders": ["total_amount"],
            "order_items": ["unit_price", "quantity"]
        }
    },
    # Exclude IDs from count unique
    "num_unique": {
        "exclude_columns": {
            "orders": ["order_id"],
            "order_items": ["item_id"]
        }
    }
}

# Use seed features to start from specific features
base_revenue = ft.Feature(
    es["order_items"].ww["unit_price"]
) * ft.Feature(
    es["order_items"].ww["quantity"]
)
base_revenue.set_name("revenue")

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    primitive_options=primitive_options,
    seed_features=[base_revenue],  # Include custom features
    max_depth=2,
    max_features=100               # Limit total features
)

print(f"Features generated: {len(features)}")
```

| Parameter | Type | Description |
|---|---|---|
| `primitive_options` | dict | Per-primitive include/exclude column lists |
| `seed_features` | list | Pre-defined features to include and build upon |
| `drop_contains` | list | Feature name patterns to exclude (see the sketch after this table) |
| `drop_exact` | list | Exact feature names to exclude |
| `max_features` | int | Maximum number of features to generate |
| `ignore_dataframes` | list | Entities to skip during traversal |
| `ignore_columns` | dict | Columns to exclude per entity |
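The `drop_contains` and `drop_exact` parameters filter generated features by name. A brief sketch, with the patterns chosen purely for illustration:

```python
import featuretools as ft

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count"],
    trans_primitives=[],
    max_depth=2,
    drop_contains=["MODE("],        # drop any feature whose name contains this substring
    drop_exact=["COUNT(orders)"],   # drop this exact feature name
    max_features=100,
)
```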
When built-in primitives don't suffice, you can define custom primitives:
```python
import featuretools as ft
import numpy as np
from featuretools.primitives import AggregationPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Double


class GeometricMean(AggregationPrimitive):
    """Computes the geometric mean of a numeric column."""

    name = "geometric_mean"
    input_types = [ColumnSchema(logical_type=Double)]
    return_type = ColumnSchema(logical_type=Double)

    def get_function(self):
        def geometric_mean(x):
            x = x[x > 0]  # Only positive values
            if len(x) == 0:
                return np.nan
            return np.exp(np.log(x).mean())
        return geometric_mean


# Use the custom primitive
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", GeometricMean],
    max_depth=2
)
```
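Custom transform primitives follow the same pattern via `TransformPrimitive`. A minimal sketch is shown below; the `IsLargeOrder` name and the 100-unit threshold are made up for illustration.

```python
import featuretools as ft
from featuretools.primitives import TransformPrimitive
from woodwork.column_schema import ColumnSchema
from woodwork.logical_types import Boolean, Double


class IsLargeOrder(TransformPrimitive):
    """Flags rows whose numeric value exceeds a (hypothetical) threshold."""

    name = "is_large_order"
    input_types = [ColumnSchema(logical_type=Double)]
    return_type = ColumnSchema(logical_type=Boolean)

    def get_function(self):
        def is_large(values):
            return values > 100  # element-wise comparison, returns a boolean Series
        return is_large


# Custom transform primitives are passed via trans_primitives
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    trans_primitives=[IsLargeOrder],
    agg_primitives=["mean"],
    max_depth=2
)
```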
Deep Feature Synthesis is a powerful algorithm that transforms the tedious, error-prone process of manual feature engineering into a systematic, reproducible operation. The key concepts to carry forward: DFS is a deterministic traversal of the EntitySet schema; every generated feature is a typed composition of primitives with an explicit feature tree; stacking and max_depth govern the exponential growth of the feature space; and cutoff times with training windows make generation time-aware, preventing leakage by construction.
What's Next:
Now that we understand how DFS generates features, we turn to the critical question: Which features are actually useful? The next page covers Feature Evaluation—methods for assessing feature quality, selecting the most predictive features, and managing the feature explosion that DFS can produce.
You now have a deep understanding of the DFS algorithm—from its recursive structure and composition rules to time-aware generation and advanced configuration. You're equipped to leverage DFS effectively while understanding its mathematical foundations.