In chemistry, understanding atomic properties determines how elements combine into molecules. In machine learning, understanding feature types determines how data transforms into predictions. Features are the atomic units of machine learning—the individual measurements, attributes, and characteristics that models consume to make decisions.
Yet many practitioners treat features as interchangeable inputs, feeding whatever data they have into models without considering the fundamental nature of each variable. This approach leaves enormous predictive power on the table and often leads to subtle bugs that silently degrade model performance.
By the end of this page, you will have a rigorous understanding of feature type taxonomy—numerical, categorical, ordinal, binary, and derived features. You'll know how each type behaves mathematically, what encoding strategies they require, and which modeling assumptions they satisfy or violate. This foundation is essential for everything that follows in feature engineering.
Before diving into individual feature types, we need to understand why this taxonomy matters. The type of a feature determines which encodings are valid, which mathematical operations are meaningful, which modeling assumptions hold, and what preprocessing each variable requires.
Getting feature types wrong doesn't cause obvious errors—models still train and produce predictions. But those predictions are based on mathematically incoherent operations, leading to degraded performance that's difficult to diagnose.
| Feature Type | Mathematical Properties | Examples | Key Considerations |
|---|---|---|---|
| Numerical (Continuous) | Ordered, arithmetic meaningful, infinite precision theoretically | Temperature, price, height, duration | Scale-sensitive; requires normalization for many algorithms |
| Numerical (Discrete) | Ordered, countable, integer-valued | Age in years, count of items, number of children | May need binning; consider as continuous or categorical depending on cardinality |
| Categorical (Nominal) | Unordered classes, no arithmetic meaning | Country, color, product category, user ID | Requires encoding; high cardinality is challenging |
| Categorical (Ordinal) | Ordered classes, intervals not equal | Education level, satisfaction rating, size (S/M/L) | Order matters but magnitude doesn't; encoding must preserve order |
| Binary | Two mutually exclusive states | Yes/No, True/False, Male/Female, Purchased/Not | Special case of categorical; simple 0/1 encoding usually sufficient |
| Derived/Composite | Created from other features | Ratios, interactions, aggregations, embeddings | Type depends on construction; often where domain expertise shines |
Real-world features often don't fit cleanly into one category. A 'star rating' (1-5) could be treated as ordinal, discrete numerical, or even continuous after averaging. Zip codes are numerical in representation but categorical in meaning. Always consider the semantic meaning, not just the data format.
Numerical features represent quantities where arithmetic operations are meaningful. The distinction between continuous and discrete is theoretically important but practically blurred—most algorithms treat them identically after preprocessing.
Continuous Numerical Features
Continuous features can theoretically take any value within a range; typical examples include temperature, price, height, and duration. These features have several key properties: they are ordered, arithmetic operations such as differences and averages are meaningful, and their scale matters to many algorithms.
Discrete Numerical Features
Discrete features take countable values, typically integers: age in years, counts of items, number of children.
The key question with discrete numerics is cardinality:
| Cardinality | Treatment Strategy | Example |
|---|---|---|
| Very low (2-5) | Often better as categorical | Number of bedrooms |
| Low (5-20) | Could go either way | Age in decades |
| Medium (20-100) | Usually treat as numeric | Age in years |
| High (100+) | Almost always numeric | Days since signup |
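The cardinality bands in the table above can be turned into a rough triage helper. This is a sketch only; the thresholds are guidance from the table, not hard rules, and the function name is invented for illustration:

```python
import pandas as pd

def suggest_discrete_treatment(series: pd.Series) -> str:
    """Rough treatment suggestion for an integer-valued feature,
    using the cardinality bands from the table above."""
    n = series.nunique()
    if n <= 5:
        return "categorical"       # very low cardinality
    elif n <= 20:
        return "either"            # could go either way
    elif n <= 100:
        return "numeric"           # usually treat as numeric
    return "numeric (high cardinality)"

bedrooms = pd.Series([1, 2, 2, 3, 3, 3, 4, 5])
age_years = pd.Series(range(18, 80))
print(suggest_discrete_treatment(bedrooms))   # low cardinality
print(suggest_discrete_treatment(age_years))  # medium cardinality
```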
```python
import numpy as np
import pandas as pd
from scipy import stats

def analyze_numerical_feature(series: pd.Series, name: str) -> dict:
    """
    Comprehensive analysis of a numerical feature.
    Returns statistics and recommendations for preprocessing.
    """
    analysis = {
        "name": name,
        "dtype": str(series.dtype),
        "count": len(series),
        "missing": series.isna().sum(),
        "missing_pct": series.isna().mean() * 100,
        "unique_values": series.nunique(),
        "is_likely_discrete": (
            series.nunique() < 50
            and series.dropna().apply(lambda x: float(x).is_integer()).all()
        ),
    }

    # Basic statistics (for non-missing values)
    clean = series.dropna()
    if len(clean) > 0:
        analysis.update({
            "min": clean.min(),
            "max": clean.max(),
            "mean": clean.mean(),
            "median": clean.median(),
            "std": clean.std(),
            "skewness": stats.skew(clean),
            "kurtosis": stats.kurtosis(clean),
        })

        # Distribution shape recommendations
        skew = abs(analysis["skewness"])
        if skew > 2:
            analysis["transform_recommendation"] = "log or sqrt transform (highly skewed)"
        elif skew > 1:
            analysis["transform_recommendation"] = "consider log transform (moderately skewed)"
        else:
            analysis["transform_recommendation"] = "standard scaling likely sufficient"

        # Outlier detection using IQR
        q1, q3 = clean.quantile([0.25, 0.75])
        iqr = q3 - q1
        outlier_mask = (clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)
        analysis["outlier_count"] = outlier_mask.sum()
        analysis["outlier_pct"] = outlier_mask.mean() * 100

    return analysis

# Example usage
np.random.seed(42)
income = pd.Series(np.random.lognormal(10, 1, 10000))  # Log-normal income data
analysis = analyze_numerical_feature(income, "household_income")

print(f"Feature: {analysis['name']}")
print(f"Skewness: {analysis['skewness']:.2f}")
print(f"Recommendation: {analysis['transform_recommendation']}")
print(f"Outliers: {analysis['outlier_pct']:.1f}%")
```

For tree-based models (Random Forest, XGBoost), scaling is unnecessary: trees split on thresholds regardless of scale.
For linear models, SVMs, neural networks, and distance-based methods (k-NN, k-means), scaling is critical. When in doubt, standardize (z-score) or min-max normalize, but be aware that both approaches handle outliers differently.
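The difference between the two scalers is easiest to see on data with an outlier. A minimal sketch with invented values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Mostly small values plus one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = StandardScaler().fit_transform(x)    # z-scores: the outlier inflates mean and std
mm = MinMaxScaler().fit_transform(x)     # maps to [0, 1]: inliers crushed near 0

print("standardized:", z.ravel().round(2))
print("min-max:     ", mm.ravel().round(2))
```

Standardization keeps the outlier's distance visible but shifts every z-score; min-max pins the range to [0, 1] but squeezes the four inliers into a tiny sliver near 0. Neither is "correct" in general, which is why outlier handling should be decided before scaling.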
Categorical features represent discrete classes or groups. Unlike numerical features, arithmetic operations are meaningless—you cannot average 'red' and 'blue.' The fundamental distinction within categoricals is whether an order exists.
Nominal (Unordered) Categorical Features
Nominal features have no inherent ordering: country, color, product category, and user ID are typical examples. Any ordering you impose is arbitrary.
The critical insight: assigning integers to categories implies a false ordering. If you encode {red: 1, blue: 2, green: 3}, a linear model will treat green as 'more than' blue, which is semantically meaningless.
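To make the false-ordering problem concrete, here is a small sketch. The data and the color codes are invented for illustration: 'blue' has a low target while 'red' and 'green' are high, so no ordering of the colors yields a straight line.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (invented): per-color target means are 10, 1, 9
df = pd.DataFrame({
    "color": ["red", "blue", "green"] * 20,
    "y": [10.0, 1.0, 9.0] * 20,
})

# Integer encoding imposes red < blue < green
X_int = df["color"].map({"red": 1, "blue": 2, "green": 3}).to_frame()
pred_int = LinearRegression().fit(X_int, df["y"]).predict(X_int.iloc[:3])

# One-hot encoding gives each category its own coefficient
X_oh = pd.get_dummies(df["color"])
pred_oh = LinearRegression().fit(X_oh, df["y"]).predict(X_oh.iloc[:3])

print("integer-coded:", pred_int.round(2))  # forced onto a line, far from the true means
print("one-hot:      ", pred_oh.round(2))   # recovers the per-color means
```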
Ordinal Categorical Features
Ordinal features have meaningful order but non-uniform intervals: education level, satisfaction rating, and clothing size (S/M/L) are common examples.
The challenge: the gaps between categories are semantically unequal. The jump from 'unsatisfied' to 'neutral' might be psychologically larger than 'satisfied' to 'very satisfied.'
| Strategy | Description | Preserves Order | Preserves Intervals | Use Case |
|---|---|---|---|---|
| Integer Encoding | Map to 0, 1, 2, ..., k-1 | ✓ Yes | ✗ Assumes equal | Tree models; quick baseline |
| Custom Mapping | Domain-informed numeric values | ✓ Yes | ~ Approximated | When domain knowledge suggests intervals |
| Target-Based Ordering | Order by target correlation | ✓ Yes | ~ Data-driven | When order is uncertain or task-specific |
| Treat as Nominal | One-hot encoding | ✗ Lost | N/A | When interval uncertainty is high; linear models |
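As a sketch of the Custom Mapping strategy from the table above — the numeric values here are invented for illustration; a real mapping would come from domain research:

```python
import pandas as pd

ratings = pd.Series(["unsatisfied", "neutral", "satisfied", "very satisfied"])

# Equal-interval integer coding: assumes every step is the same size
equal_map = {"unsatisfied": 0, "neutral": 1, "satisfied": 2, "very satisfied": 3}

# Hypothetical domain-informed mapping: the unsatisfied -> neutral gap is
# treated as larger than the satisfied -> very satisfied gap
custom_map = {"unsatisfied": 0, "neutral": 2, "satisfied": 3, "very satisfied": 3.5}

print(ratings.map(equal_map).tolist())
print(ratings.map(custom_map).tolist())
```

Both encodings preserve the order; only the custom mapping attempts to preserve something about the intervals.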
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder

# Sample data
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green", "red", "blue"],
    "size": ["S", "M", "L", "XL", "S", "M", "L", "XL"],
    "target": [0, 1, 1, 1, 0, 0, 1, 1],
})

# 1. One-Hot Encoding for nominal features
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
color_encoded = onehot.fit_transform(df[["color"]])
color_columns = onehot.get_feature_names_out(["color"])
print("One-Hot Encoded Colors:")
print(pd.DataFrame(color_encoded, columns=color_columns))

# 2. Ordinal Encoding with explicit order
size_order = ["S", "M", "L", "XL"]  # Smallest to largest
ordinal = OrdinalEncoder(categories=[size_order])
df["size_ordinal"] = ordinal.fit_transform(df[["size"]])
print(f"Ordinal Encoded Sizes: {df['size_ordinal'].tolist()}")

# 3. Target Encoding (with smoothing to prevent overfitting)
target_enc = TargetEncoder(smoothing=1.0)
df["color_target_encoded"] = target_enc.fit_transform(df["color"], df["target"])
print("Target Encoded Colors:")
print(df[["color", "color_target_encoded"]].drop_duplicates())

# 4. Frequency Encoding
freq_map = df["color"].value_counts(normalize=True).to_dict()
df["color_freq"] = df["color"].map(freq_map)
print("Frequency Encoded Colors:")
print(df[["color", "color_freq"]].drop_duplicates())
```

Features like user_id, product_sku, or IP address can have millions of unique values. One-hot encoding is impractical (millions of columns). Target encoding risks overfitting (some categories appear once). Hashing causes collisions. Embeddings require neural architectures. There's no perfect solution — choose based on model type, data volume, and acceptable complexity.
Binary features represent exactly two mutually exclusive states. While technically categorical, they deserve special treatment due to their simplicity and ubiquity.
Common Binary Feature Types:
| Category | Examples |
|---|---|
| Boolean flags | is_verified, has_subscription, email_confirmed |
| Presence/absence | clicked, purchased, churned, converted |
| Binary classification | pass/fail, spam/ham, positive/negative |
| Demographic | gender (when binary), citizen/non-citizen |
| Threshold-based | high_value_customer, above_median_income |
Encoding is trivial: 0/1 representation works for virtually all algorithms. The main considerations are semantic clarity and handling edge cases.
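One edge case worth handling explicitly is missingness: calling `.astype(int)` on a flag column that contains `None` raises an error, and coding flags via `notna()` silently folds "unknown" into one of the two states. A sketch (the column names are hypothetical):

```python
import pandas as pd

flags = pd.Series([True, False, None, True])

# Map True/False explicitly; missing values stay NaN instead of being
# silently absorbed into either state
encoded = flags.map({True: 1, False: 0})

# Decide explicitly: impute, and optionally keep a missingness indicator
out = pd.DataFrame({
    "flag": encoded.fillna(0).astype(int),       # treat unknown as 0 ...
    "flag_missing": flags.isna().astype(int),    # ... but record that it was unknown
})
print(out)
```

Whether "missing" should become 0, 1, or its own indicator column is a modeling decision, not a formatting detail.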
Creating Binary Features from Other Types
Binary features are often derived by thresholding or categorizing other feature types:
```python
# From numerical: thresholding
df['is_high_income'] = (df['income'] > 100000).astype(int)
df['is_adult'] = (df['age'] >= 18).astype(int)

# From categorical: presence detection
df['is_premium_member'] = (df['membership_type'] == 'premium').astype(int)
df['is_us_based'] = df['country'].isin(['US', 'USA', 'United States']).astype(int)

# From text: pattern matching (na=False treats missing text as "no match")
df['has_email'] = df['email'].notna().astype(int)
df['mentions_discount'] = (
    df['review_text'].str.contains('discount|coupon|sale', case=False, na=False).astype(int)
)
```
These derived binaries often capture domain-specific thresholds that are more predictive than the raw values.
Feature types aren't fixed; they can be converted based on modeling needs. Understanding these conversions and their tradeoffs is essential for effective feature engineering.
Information Flow in Conversions:
```
Continuous → Discrete → Ordinal  → Nominal  → Binary
     ↓           ↓          ↓          ↓         ↓
 (binning)  (ordering)  (collapse)  (one-hot) (present)

Each arrow represents INFORMATION LOSS.
Reverse arrows (binned → continuous) recover approximations, not originals.
```
| From | To | Method | Information Impact |
|---|---|---|---|
| Continuous | Discrete | Rounding, floor, ceiling | Loses precision; useful for counts or when precision is noise |
| Continuous | Ordinal | Quantile binning, equal-width binning | Loses magnitude; gains robustness to outliers |
| Continuous | Binary | Thresholding | Major information loss; captures only one split point |
| Discrete | Categorical | Treat values as labels | Loses numerical relationships; gains flexibility |
| Ordinal | Nominal | Ignore ordering | Loses order information; may be necessary for fair one-hot |
| Ordinal | Numerical | Integer coding | Assumes equal intervals (may be false) |
| Categorical | Binary (each) | One-hot encoding | Preserves all; explodes dimensionality |
| Categorical | Numerical | Target/frequency encoding | Summarizes via statistics; risks leakage |
Convert continuous → binned when: outliers cause problems, relationship is non-linear and bins capture it, or interpretability matters. Convert categorical → numerical when: high cardinality makes one-hot impractical, target encoding reduces overfitting via smoothing, or embedding layers aren't available. Keep original features when possible—let the model discover the best representation.
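As a sketch of the continuous → binned conversion on synthetic income data, comparing the two binning methods from the table above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
income = pd.Series(np.random.lognormal(10, 1, 1000))  # synthetic, right-skewed

# Equal-width bins: sensitive to outliers (one huge value stretches all bins)
width_bins = pd.cut(income, bins=4, labels=["low", "mid", "high", "very_high"])

# Quantile bins: equal-sized groups, robust to outliers
quant_bins = pd.qcut(income, q=4, labels=["q1", "q2", "q3", "q4"])

print(width_bins.value_counts())   # almost everything lands in the lowest bin
print(quant_bins.value_counts())   # 250 observations per bin
```

On skewed data, equal-width binning wastes most of its bins on the long tail, while quantile binning guarantees balanced groups at the cost of unequal bin widths.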
Real datasets rarely come with explicit type annotations. The data format (int, float, string) is a clue but not definitive. Here's a systematic approach to type identification:
```python
import pandas as pd
import numpy as np

def infer_feature_type(series: pd.Series, name: str) -> dict:
    """
    Infer semantic feature type from data characteristics.
    Returns type classification and confidence.
    """
    result = {"name": name, "dtype": str(series.dtype)}
    clean = series.dropna()
    n_unique = clean.nunique()
    n_total = len(clean)
    unique_ratio = n_unique / n_total if n_total > 0 else 0

    # String-like columns
    if series.dtype == "object" or str(series.dtype) == "string":
        if n_unique == 2:
            result["inferred_type"] = "binary"
            result["confidence"] = "high"
        elif n_unique <= 20:
            result["inferred_type"] = "categorical_lowcard"
            result["confidence"] = "high"
        elif n_unique <= 100:
            result["inferred_type"] = "categorical_medcard"
            result["confidence"] = "medium"
        else:
            # Could be high-cardinality categorical or text
            avg_len = clean.astype(str).str.len().mean()
            if avg_len > 50:
                result["inferred_type"] = "text"
                result["confidence"] = "medium"
            else:
                result["inferred_type"] = "categorical_highcard"
                result["confidence"] = "medium"

    # Numeric columns
    elif np.issubdtype(series.dtype, np.number):
        is_integer = clean.apply(lambda x: float(x).is_integer()).all()
        if n_unique == 2:
            result["inferred_type"] = "binary"
            result["confidence"] = "high"
        elif is_integer and n_unique <= 10:
            # Low-cardinality integers: could be ordinal, discrete, or categorical
            result["inferred_type"] = "discrete_or_ordinal"
            result["confidence"] = "low"
            result["note"] = "Inspect values to determine if ordinal or categorical"
        elif is_integer and n_unique <= 50:
            result["inferred_type"] = "discrete_numerical"
            result["confidence"] = "medium"
        elif unique_ratio > 0.9:
            result["inferred_type"] = "continuous"
            result["confidence"] = "high"
        elif unique_ratio > 0.5:
            result["inferred_type"] = "continuous"
            result["confidence"] = "medium"
        else:
            result["inferred_type"] = "discrete_numerical"
            result["confidence"] = "medium"

    # Datetime
    elif pd.api.types.is_datetime64_any_dtype(series):
        result["inferred_type"] = "datetime"
        result["confidence"] = "high"
    else:
        result["inferred_type"] = "unknown"
        result["confidence"] = "low"

    result["unique_values"] = n_unique
    result["unique_ratio"] = unique_ratio
    return result

# Apply to a DataFrame
def analyze_all_features(df: pd.DataFrame) -> pd.DataFrame:
    results = [infer_feature_type(df[col], col) for col in df.columns]
    return pd.DataFrame(results)

# Example
df = pd.DataFrame({
    "user_id": range(1000),
    "age": np.random.randint(18, 80, 1000),
    "income": np.random.lognormal(10, 1, 1000),
    "gender": np.random.choice(["M", "F"], 1000),
    "education": np.random.choice(["HS", "BS", "MS", "PhD"], 1000),
    "rating": np.random.randint(1, 6, 1000),
    "is_premium": np.random.choice([0, 1], 1000),
})

print(analyze_all_features(df).to_string())
```

Feature types are the foundation of feature engineering. Getting them right enables all subsequent transformations; getting them wrong corrupts your entire modeling pipeline. Here are the key takeaways:

- Classify features by semantic meaning, not storage format: zip codes are numbers in form but categories in meaning.
- Match encoding to type: nominal features need one-hot or similar encodings, while ordinal encodings must preserve order.
- Integer-coding nominal categories imposes a false ordering that misleads linear models.
- Every type conversion loses information; keep the original feature alongside the converted one when possible.
You now have a rigorous understanding of feature types—the atomic units of machine learning. Next, we'll explore how domain knowledge transforms raw features into powerful predictors that capture business and scientific understanding.