In the modern machine learning pipeline, feature engineering consistently emerges as the most time-consuming and expertise-dependent phase. According to surveys from Kaggle and industry practitioners, data scientists spend approximately 60-80% of their time on data preparation and feature engineering, yet this critical work remains largely manual and artisanal.
The consequences of this bottleneck are profound: iteration slows to a crawl, feature logic is difficult to reproduce, and model quality ends up depending on the expertise of whoever hand-crafted the features.
This is where Featuretools enters the picture—an open-source Python library that fundamentally reimagines how we approach feature engineering.
By the end of this page, you will understand: the architectural foundations of Featuretools, how to model relational data as EntitySets, the primitive-based approach to feature generation, and how Featuretools achieves automated feature engineering through composition. You'll gain the conceptual framework necessary to apply automated feature engineering in production environments.
Featuretools was developed by Feature Labs (later acquired by Alteryx) as the first comprehensive framework for automated feature engineering. Its creation was motivated by a fundamental observation: while machine learning algorithms had become increasingly sophisticated and automated (through AutoML), the feature engineering step remained stubbornly manual.
To appreciate Featuretools, we must understand the evolution of feature engineering:
1990s-2000s: Domain Expert Era Feature engineering was synonymous with domain expertise. Financial engineers crafted technical indicators, NLP researchers designed linguistic features, and computer vision experts engineered edge detectors. Each domain developed its own feature vocabularies, passed down through apprenticeship.
2010s: The Deep Learning Disruption Deep learning promised automatic feature learning—neural networks would discover relevant representations from raw data. While revolutionary for images and sequences, this approach struggled with structured/tabular data, which remains the lifeblood of enterprise ML.
2015+: The Rise of Automated Feature Engineering Researchers recognized that for relational and tabular data, a systematic approach to feature generation could complement both traditional ML and deep learning. Featuretools emerged as the defining implementation of this vision.
Unlike images (where convolutions exploit local spatial structure) or text (where attention exploits sequential dependencies), tabular data has no inherent structure that deep learning can automatically exploit. Each column may have different semantics, scales, and relationships. This is why feature engineering remains crucial for tabular ML, and why automation in this domain requires a fundamentally different approach.
Featuretools is built on three core principles:
Declarative Relationships: Rather than imperatively coding feature transformations, you declare the relationships between entities, and Featuretools reasons about valid feature paths.
Composable Primitives: Complex features are built by composing simple, well-defined operations (primitives). This enables systematic exploration of the feature space.
Reproducibility by Design: Every generated feature has a traceable lineage—you can inspect exactly how any feature was computed, ensuring auditability and debugging capability.
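Because every feature definition carries its lineage, you can inspect it programmatically. A minimal sketch, assuming `feature_defs` is the list of feature definitions returned by the `ft.dfs()` call shown later on this page:

```python
import featuretools as ft

# Assumes feature_defs came from an earlier DFS run, e.g.
#   feature_matrix, feature_defs = ft.dfs(entityset=es, ...)
for feature in feature_defs[:5]:
    # The generated name encodes the full lineage,
    # e.g. "MEAN(orders.total_amount)"
    print(feature.get_name())

    # describe_feature renders the lineage as an English sentence,
    # useful for audits and model documentation
    print(ft.describe_feature(feature))
```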
At the heart of Featuretools lies the EntitySet, a data structure that captures the relational schema of your data. An EntitySet is not merely a collection of DataFrames—it's a semantic model that encodes entities, their attributes, and the relationships between them.
An EntitySet consists of:

- Dataframes (entities): the individual tables in your data, each identified by a unique index column
- Logical types: semantic column annotations (Categorical, Integer, Double, Datetime, and so on) managed through Woodwork
- Time indexes: columns that record when each row's information became known
- Relationships: directional parent-child links between dataframes
This rich representation enables Featuretools to reason about your data in sophisticated ways.
```python
import featuretools as ft
import pandas as pd

# Sample e-commerce data
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "signup_date": pd.to_datetime([
        "2023-01-15", "2023-02-20", "2023-01-10",
        "2023-03-05", "2023-02-28"
    ]),
    "country": ["US", "UK", "US", "DE", "FR"],
    "age": [28, 35, 42, 31, 25]
})

orders = pd.DataFrame({
    "order_id": range(1, 11),
    "customer_id": [1, 1, 2, 3, 3, 3, 4, 4, 5, 1],
    "order_date": pd.to_datetime([
        "2023-02-01", "2023-03-15", "2023-03-01", "2023-02-10",
        "2023-03-20", "2023-04-01", "2023-03-25", "2023-04-10",
        "2023-03-15", "2023-04-20"
    ]),
    "total_amount": [150.0, 200.0, 75.0, 300.0, 125.0,
                     180.0, 95.0, 220.0, 50.0, 175.0],
    "payment_method": ["credit", "debit", "credit", "paypal", "credit",
                       "credit", "debit", "paypal", "credit", "debit"]
})

order_items = pd.DataFrame({
    "item_id": range(1, 26),
    "order_id": [1, 1, 2, 2, 2, 3, 4, 4, 5, 5, 6, 6, 7,
                 8, 8, 8, 9, 10, 10, 10, 4, 5, 6, 7, 8],
    "product_id": [101, 102, 101, 103, 104, 102, 105, 106, 101, 107,
                   103, 108, 109, 101, 102, 110, 111, 105, 112, 101,
                   113, 114, 115, 116, 117],
    "quantity": [2, 1, 1, 3, 1, 2, 1, 1, 4, 2, 1, 1, 2,
                 1, 1, 1, 3, 2, 1, 1, 1, 2, 1, 1, 2],
    "unit_price": [25.0, 100.0, 25.0, 15.0, 80.0, 50.0, 45.0, 50.0,
                   25.0, 35.0, 15.0, 60.0, 47.5, 25.0, 50.0, 45.0,
                   16.67, 47.5, 30.0, 25.0, 35.0, 12.5, 30.0, 42.5, 15.0]
})

# Create the EntitySet
es = ft.EntitySet(id="ecommerce")

# Add dataframes with proper type annotations
es = es.add_dataframe(
    dataframe_name="customers",
    dataframe=customers,
    index="customer_id",
    time_index="signup_date",
    logical_types={
        "country": "Categorical",
        "age": "Integer"
    }
)

es = es.add_dataframe(
    dataframe_name="orders",
    dataframe=orders,
    index="order_id",
    time_index="order_date",
    logical_types={
        "payment_method": "Categorical",
        "total_amount": "Double"
    }
)

es = es.add_dataframe(
    dataframe_name="order_items",
    dataframe=order_items,
    index="item_id",
    logical_types={
        "quantity": "Integer",
        "unit_price": "Double"
    }
)

# Define relationships (parent -> child)
es = es.add_relationship("customers", "customer_id", "orders", "customer_id")
es = es.add_relationship("orders", "order_id", "order_items", "order_id")

print(es)
```

Relationships in Featuretools are directional and follow a parent-child hierarchy:
In our e-commerce example:
- `customers` is the parent of `orders` (one customer has many orders)
- `orders` is the parent of `order_items` (one order has many items)

This hierarchy is crucial because it determines the direction of aggregations and transformations during feature synthesis.
| Relationship Type | Direction | Feature Operations | Example |
|---|---|---|---|
| Parent → Child (Forward) | Downward traversal | Transform primitives | Customer's country attached to each order |
| Child → Parent (Backward) | Upward aggregation | Aggregation primitives | SUM of order totals per customer |
| Multi-hop | Chained traversal | Composed operations | MEAN of item quantities per customer |
Feature primitives are the atomic operations that Featuretools uses to generate features. They represent the fundamental transformations and aggregations that can be applied to data. Understanding primitives is essential because all automatically generated features are compositions of these basic operations.
Featuretools distinguishes between two fundamental categories of primitives:
**1. Transform Primitives**
These operate on a single row and produce a single output value. They don't require any aggregation across multiple rows.
Examples:
- `Year`, `Month`, `Day`, `Hour` — Extract components from datetimes
- `IsNull` — Binary indicator for missing values
- `Absolute` — Absolute value of a numeric column
- `CumSum`, `CumMean` — Cumulative statistics
- `Diff` — Difference from the previous row

**2. Aggregation Primitives**
These operate on a collection of rows (typically grouped by a parent entity) and produce a single aggregate value.
Examples:
- `Sum`, `Mean`, `Std`, `Median` — Statistical aggregations
- `Count`, `NUnique` — Counting operations
- `Min`, `Max`, `Mode` — Extreme values
- `Trend` — Linear regression slope over time
- `TimeSinceLast`, `TimeSinceFirst` — Temporal aggregations
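To make the idea of a primitive as an atomic, reusable operation concrete, here is a minimal sketch that applies primitives by hand; it assumes a recent Featuretools release in which primitives expose their underlying computation via `get_function()`:

```python
import pandas as pd
from featuretools.primitives import Absolute, Mean

# An aggregation primitive wraps a function that maps many rows to one value
mean_fn = Mean().get_function()
print(mean_fn(pd.Series([150.0, 200.0, 75.0])))  # ~141.67

# A transform primitive wraps a function that maps each row to a new value
abs_fn = Absolute().get_function()
print(abs_fn(pd.Series([-3, 5, -7])))  # 3, 5, 7
```

In normal use you never call primitives directly; DFS composes them for you. The catalogue of built-in primitives can also be explored programmatically, as the next listing shows.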
```python
import featuretools as ft

# List all available primitives
all_primitives = ft.primitives.list_primitives()
print(f"Total primitives available: {len(all_primitives)}")

# Filter by type
transform_primitives = all_primitives[
    all_primitives["type"] == "transform"
]
aggregation_primitives = all_primitives[
    all_primitives["type"] == "aggregation"
]

print(f"Transform primitives: {len(transform_primitives)}")
print(f"Aggregation primitives: {len(aggregation_primitives)}")

# Explore specific primitives
print("\n--- Sample Transform Primitives ---")
print(transform_primitives[["name", "description"]].head(10))

print("\n--- Sample Aggregation Primitives ---")
print(aggregation_primitives[["name", "description"]].head(10))

# Get detailed info about a specific primitive
from featuretools.primitives import Mean, Mode, Trend

print("\n--- Mean Primitive Details ---")
print(f"Name: {Mean.name}")
print(f"Input types: {Mean.input_types}")
print(f"Return type: {Mean.return_type}")

print("\n--- Trend Primitive Details ---")
print(f"Name: {Trend.name}")
print(f"Input types: {Trend.input_types}")
print(f"Return type: {Trend.return_type}")
```

Each primitive has defined input and output types. Featuretools automatically matches primitives to compatible columns based on their semantic types. For example, the `Day` transform only applies to datetime columns, while `Mean` only applies to numeric columns. This type-awareness prevents invalid feature combinations and reduces noise in the generated feature set.
Featuretools organizes primitives into a rich taxonomy that reflects the diversity of feature engineering operations encountered in practice:
| Category | Examples | Use Case |
|---|---|---|
| Temporal | Year, Weekday, TimeSince | Extracting temporal patterns |
| Statistical | Mean, Std, Skew, Kurtosis | Summarizing distributions |
| Counting | Count, PercentTrue, NUnique | Quantifying occurrences |
| Text | NumWords, NumCharacters | Basic NLP features |
| Cumulative | CumSum, CumMax, CumCount | Rolling computations |
| Binary | IsNull, IsWeekend, IsIn | Indicator variables |
| Comparison | GreaterThan, Equal | Threshold-based flags |
The power of Featuretools lies in composing these primitives across relationship paths to generate complex, meaningful features automatically.
With an EntitySet defined and primitives understood, we can now run Deep Feature Synthesis (DFS)—Featuretools' core algorithm for automated feature generation. DFS systematically traverses relationships and applies primitives to generate a comprehensive feature matrix.
The ft.dfs() function is the primary interface for feature generation:
```python
import featuretools as ft

# Assuming 'es' is our EntitySet from earlier

# Run DFS to generate features for customers
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",  # Entity we want features for
    agg_primitives=[
        "sum", "mean", "std", "max", "min",
        "count", "num_unique", "mode"
    ],
    trans_primitives=[
        "year", "month", "weekday", "day"
    ],
    max_depth=2,  # How deep to traverse relationships
    verbose=True
)

print(f"Generated {len(feature_defs)} features")
print(f"Feature matrix shape: {feature_matrix.shape}")

# Examine generated features
print("\n--- Sample Generated Features ---")
for feat in feature_defs[:20]:
    print(f"  {feat.get_name()}")

# View the feature matrix
print("\n--- Feature Matrix Head ---")
print(feature_matrix.head())
```

The `dfs()` function accepts numerous parameters that control the feature generation process:
| Parameter | Description | Impact |
|---|---|---|
| `target_dataframe_name` | Entity for which to generate features | Determines the granularity of the output |
| `agg_primitives` | List of aggregation primitives to use | Controls child→parent aggregations |
| `trans_primitives` | List of transform primitives to use | Controls single-row transformations |
| `max_depth` | Maximum relationship hops to traverse | Exponentially affects feature count |
| `max_features` | Cap on total features generated | Prevents explosion; useful for exploration |
| `cutoff_time` | Time-based filtering for temporal features | Critical for preventing data leakage |
| `ignore_columns` | Columns to exclude from feature generation | Removes irrelevant or leaky columns |
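As an illustration of how several of these parameters combine, here is a hedged sketch; the specific values are arbitrary and `es` is the EntitySet built earlier:

```python
import featuretools as ft

# Constrain DFS output using parameters from the table above
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["sum", "mean", "count"],
    trans_primitives=["month"],
    max_depth=2,
    max_features=100,                                # cap total feature count
    ignore_columns={"orders": ["payment_method"]},   # exclude a column entirely
)
print(f"Kept {len(feature_defs)} features")
```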
The max_depth parameter deserves special attention. It controls how many relationship hops DFS will traverse when building features:
Depth 1: Only direct aggregations from immediate children
- `MEAN(orders.total_amount)` — average order value per customer

Depth 2: Aggregations can be stacked or traverse two hops
- `MEAN(orders.SUM(order_items.quantity))` — average of per-order item counts
- `SUM(orders.order_items.unit_price)` — total across all items across all orders

Depth 3+: Further nesting; feature counts grow exponentially
Increasing max_depth by 1 can multiply your feature count by 10x or more. A depth of 2 might generate 500 features; depth 3 could produce 50,000. Always start with depth=2 and increase only if model performance plateaus and you have computational budget to spare.
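One inexpensive way to gauge that explosion before paying the computation cost is to ask DFS for feature definitions only. A minimal sketch, assuming the EntitySet `es` from earlier; exact counts depend on your data and primitive selection:

```python
import featuretools as ft

# features_only=True returns definitions without computing the matrix,
# so comparing depths is cheap
for depth in (1, 2, 3):
    feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name="customers",
        agg_primitives=["sum", "mean", "count"],
        trans_primitives=["month", "weekday"],
        max_depth=depth,
        features_only=True
    )
    print(f"max_depth={depth}: {len(feature_defs)} features")
```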
One of Featuretools' most sophisticated capabilities is its handling of temporal data. In real-world ML applications, we often need to predict future events based on historical data. This requires strict discipline to avoid data leakage—accidentally using future information when making predictions.
A cutoff time defines the moment at which we're making a prediction. Any feature we generate must use only data available before that cutoff:
Featuretools enforces temporal validity automatically when you provide cutoff times.
```python
import featuretools as ft
import pandas as pd

# Define cutoff times for each customer
# This represents "when we're making the prediction"
cutoff_times = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "time": pd.to_datetime([
        "2023-04-01",  # Predict customer 1's behavior as of April 1
        "2023-04-01",  # Same for customer 2
        "2023-03-15",  # Customer 3's cutoff is earlier
        "2023-04-15",  # Customer 4's cutoff is later
        "2023-04-01"   # Customer 5
    ])
})

# Run DFS with cutoff times
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    cutoff_time=cutoff_times,
    cutoff_time_in_index=True,  # Include cutoff time in output index
    agg_primitives=["sum", "mean", "count"],
    trans_primitives=["month", "year"],
    max_depth=2,
    training_window="90 days"   # Only use last 90 days before cutoff
)

print("Feature matrix with temporal filtering:")
print(feature_matrix)

# The beauty: features for customer 3 (cutoff March 15)
# won't include their March 20 or April 1 orders!
```

Beyond cutoff times, Featuretools supports training windows—limiting how far back in time to look when computing features:
- `training_window="30 days"` — Only aggregate data from the last 30 days before the cutoff
- `training_window="6 months"` — Use up to 6 months of history

Training windows are essential for keeping features focused on recent, relevant behavior and for bounding the amount of history that must be scanned when features are computed.
Featuretools' temporal handling is one of its most valuable features for production ML. Many data leakage bugs occur when engineers manually compute features without rigorously respecting temporal constraints. By declaring cutoff times and letting Featuretools handle the filtering, you get leakage prevention by design rather than by vigilance.
Featuretools is designed to scale beyond single-machine pandas DataFrames. For production workloads with millions of rows or terabytes of data, Featuretools integrates with distributed computing frameworks.
**1. Dask Integration**
For datasets that exceed memory but can still be processed on a single machine with out-of-core computation:
```python
import dask.dataframe as dd
import featuretools as ft

# Convert pandas to Dask (large_df is a placeholder for your large DataFrame)
dask_df = dd.from_pandas(large_df, npartitions=10)

# Create EntitySet with Dask dataframe
es = ft.EntitySet(id="large_data")
es.add_dataframe(
    dataframe=dask_df,
    dataframe_name="transactions",
    index="transaction_id"
)

# DFS works the same way
# (assumes a "customers" dataframe and its relationship to "transactions"
#  have also been added to the EntitySet)
feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers"
)
```
**2. Spark Integration (via Woodwork)**
For truly distributed computation across clusters:
```python
from pyspark.sql import SparkSession
import featuretools as ft

spark = SparkSession.builder.getOrCreate()
spark_df = spark.read.parquet("s3://bucket/transactions/")

# Featuretools can work with Spark via Koalas / pandas API on Spark
# Best practice: use Spark for data prep, convert to pandas for DFS
```
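A hedged sketch of that best practice, continuing from `spark_df` above; the filter condition and the `order_date` / `transaction_id` columns are illustrative assumptions about the dataset:

```python
import featuretools as ft

# Heavy lifting in Spark: filter/aggregate down to a manageable size first
recent = spark_df.filter(spark_df.order_date >= "2023-01-01")

# Convert the reduced result to pandas for Featuretools
transactions_pdf = recent.toPandas()

es = ft.EntitySet(id="prepared_data")
es.add_dataframe(
    dataframe_name="transactions",
    dataframe=transactions_pdf,
    index="transaction_id"
)

# Run DFS as usual on the in-memory EntitySet
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="transactions",
    trans_primitives=["month", "weekday"],
    max_depth=1
)
```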
**3. Incremental Computation**
For streaming or frequently updated data, compute features incrementally:
```python
import featuretools as ft
import pandas as pd

# Suppose we already computed features for an earlier cutoff time.
# Now we have new data (in es_updated) and new instances to score.
new_cutoff_times = pd.DataFrame({
    "customer_id": [6, 7],  # New customers
    "time": pd.to_datetime(["2023-05-01", "2023-05-01"])
})

# Method 1: Run DFS only for the new instances
new_feature_matrix, new_feature_defs = ft.dfs(
    entityset=es_updated,               # EntitySet with new data
    target_dataframe_name="customers",
    cutoff_time=new_cutoff_times,
    agg_primitives=["sum", "mean", "count"],
    max_depth=2
)

# Method 2: Reuse existing feature definitions with calculate_feature_matrix
# (more efficient when feature_defs already exist from a previous DFS run)
from featuretools import calculate_feature_matrix

feature_matrix = calculate_feature_matrix(
    features=feature_defs,
    entityset=es_updated,
    cutoff_time=new_cutoff_times
)
```

| Data Size | Recommended Approach | Key Consideration |
|---|---|---|
| < 1 GB | Standard pandas | Simplicity; no setup overhead |
| 1-50 GB | Dask DataFrames | Out-of-core computation on single machine |
| 50 GB - 1 TB | Spark + sampling | Compute on sample, apply to full data |
| > 1 TB | Custom feature pipelines | Use DFS for prototyping, SQL/Spark for production |
After years of adoption across industry and research, a set of best practices has emerged for using Featuretools effectively:
Don't begin with `max_depth=3` and every primitive enabled. Start with the following (a minimal sketch follows the list):
- `max_depth=1`
- A small set of core aggregation primitives (`mean`, `sum`, `count`, `max`, `min`)
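A minimal first pass might look like the sketch below; the primitive choices are a sensible default, not a prescription, and `es` is the EntitySet built earlier:

```python
import featuretools as ft

# First pass: shallow depth, a handful of core aggregations, no transforms yet
feature_matrix, feature_defs = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["mean", "sum", "count", "max", "min"],
    trans_primitives=[],
    max_depth=1
)
print(f"Baseline feature count: {len(feature_defs)}")
```

From there, increase depth or add primitives only when the baseline model's performance justifies the extra feature volume.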
Not all primitives are equally valuable for every problem. Domain knowledge still matters:

- Temporal behavior is often captured well by `Trend`, `Percentile`, `TimeSinceLast`
- Categorical-heavy data benefits from `Mode`, `NUnique`, `PercentTrue`
- Sequential or cumulative behavior calls for `CumSum`, `CumMean`, `Lag`

Featuretools can create features based on specific categorical values:
```python
import featuretools as ft

# Ensure payment_method is typed as Categorical
es["orders"].ww.set_types(
    logical_types={"payment_method": "Categorical"}
)

# Tell Featuretools which values to create features for
es.add_interesting_values(
    dataframe_name="orders",
    values={"payment_method": ["credit", "paypal"]}
)

# Now DFS will generate features like:
# - COUNT(orders WHERE payment_method = credit)
# - SUM(orders.total_amount WHERE payment_method = paypal)

feature_matrix, features = ft.dfs(
    entityset=es,
    target_dataframe_name="customers",
    agg_primitives=["count", "sum", "mean"],
    where_primitives=["count", "sum", "mean"],  # Enable WHERE clauses
    max_depth=2
)
```

Featuretools represents a paradigm shift in how we approach feature engineering—from artisanal, manual craft to systematic, automated synthesis. Let's consolidate the key concepts:

- EntitySets model relational data as dataframes, logical types, time indexes, and parent-child relationships
- Primitives are composable building blocks: transforms operate row by row, aggregations summarize child rows into parent-level values
- Deep Feature Synthesis traverses relationships and stacks primitives, with `max_depth` governing how large the feature space grows
- Cutoff times and training windows provide leakage prevention by design for temporal prediction problems
What's Next:
Now that we understand the Featuretools framework, we'll dive deeper into Feature Generation—exploring the full taxonomy of features that can be automatically created, from simple aggregations to complex temporal patterns and interaction features.
You now have a solid foundation in Featuretools—the architecture, primitives, and practices that enable automated feature engineering. Next, we'll explore the full spectrum of features that DFS can generate and how to guide the generation process for your specific domain.