In chemistry, understanding atomic properties determines how elements combine into molecules. In machine learning, understanding feature types determines how data transforms into predictions. Features are the atomic units of machine learning—the individual measurements, attributes, and characteristics that models consume to make decisions.
Yet many practitioners treat features as interchangeable inputs, feeding whatever data they have into models without considering the fundamental nature of each variable. This approach leaves enormous predictive power on the table and often leads to subtle bugs that silently degrade model performance.
By the end of this page, you will have a rigorous understanding of feature type taxonomy—numerical, categorical, ordinal, binary, and derived features. You'll know how each type behaves mathematically, what encoding strategies they require, and which modeling assumptions they satisfy or violate. This foundation is essential for everything that follows in feature engineering.
Before diving into individual feature types, we need to understand why this taxonomy matters. The type of a feature determines which encodings are valid, which mathematical operations are meaningful, which modeling assumptions hold, and what preprocessing each variable requires.
Getting feature types wrong doesn't cause obvious errors—models still train and produce predictions. But those predictions are based on mathematically incoherent operations, leading to degraded performance that's difficult to diagnose.
| Feature Type | Mathematical Properties | Examples | Key Considerations |
|---|---|---|---|
| Numerical (Continuous) | Ordered, arithmetic meaningful, infinite precision theoretically | Temperature, price, height, duration | Scale-sensitive; requires normalization for many algorithms |
| Numerical (Discrete) | Ordered, countable, integer-valued | Age in years, count of items, number of children | May need binning; consider as continuous or categorical depending on cardinality |
| Categorical (Nominal) | Unordered classes, no arithmetic meaning | Country, color, product category, user ID | Requires encoding; high cardinality is challenging |
| Categorical (Ordinal) | Ordered classes, intervals not equal | Education level, satisfaction rating, size (S/M/L) | Order matters but magnitude doesn't; encoding must preserve order |
| Binary | Two mutually exclusive states | Yes/No, True/False, Male/Female, Purchased/Not | Special case of categorical; simple 0/1 encoding usually sufficient |
| Derived/Composite | Created from other features | Ratios, interactions, aggregations, embeddings | Type depends on construction; often where domain expertise shines |
Real-world features often don't fit cleanly into one category. A 'star rating' (1-5) could be treated as ordinal, discrete numerical, or even continuous after averaging. Zip codes are numerical in representation but categorical in meaning. Always consider the semantic meaning, not just the data format.
Numerical features represent quantities where arithmetic operations are meaningful. The distinction between continuous and discrete is theoretically important but practically blurred—most algorithms treat them identically after preprocessing.
Continuous Numerical Features
Continuous features can theoretically take any value within a range; typical examples include temperature, price, height, and duration. These features have several key properties: they are ordered, arithmetic operations such as differences and averages are meaningful, and their scale matters to many algorithms.
Discrete Numerical Features
Discrete features take countable values, typically integers: age in years, counts of items, number of children.
The key question with discrete numerics is cardinality:
| Cardinality | Treatment Strategy | Example |
|---|---|---|
| Very low (2-5) | Often better as categorical | Number of bedrooms |
| Low (5-20) | Could go either way | Age in decades |
| Medium (20-100) | Usually treat as numeric | Age in years |
| High (100+) | Almost always numeric | Days since signup |
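The cardinality bands in the table above can be turned into a rough triage helper. This is a sketch only; the thresholds are guidance from the table, not hard rules, and the function name is invented for illustration:

```python
import pandas as pd

def suggest_discrete_treatment(series: pd.Series) -> str:
    """Rough treatment suggestion for an integer-valued feature,
    using the cardinality bands from the table above."""
    n = series.nunique()
    if n <= 5:
        return "categorical"       # very low cardinality
    elif n <= 20:
        return "either"            # could go either way
    elif n <= 100:
        return "numeric"           # usually treat as numeric
    return "numeric (high cardinality)"

bedrooms = pd.Series([1, 2, 2, 3, 3, 3, 4, 5])
age_years = pd.Series(range(18, 80))
print(suggest_discrete_treatment(bedrooms))   # low cardinality
print(suggest_discrete_treatment(age_years))  # medium cardinality
```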
```python
import numpy as np
import pandas as pd
from scipy import stats

def analyze_numerical_feature(series: pd.Series, name: str) -> dict:
    """
    Comprehensive analysis of a numerical feature.
    Returns statistics and recommendations for preprocessing.
    """
    analysis = {
        "name": name,
        "dtype": str(series.dtype),
        "count": len(series),
        "missing": series.isna().sum(),
        "missing_pct": series.isna().mean() * 100,
        "unique_values": series.nunique(),
        "is_likely_discrete": (
            series.nunique() < 50
            and series.dropna().apply(lambda x: float(x).is_integer()).all()
        ),
    }

    # Basic statistics (for non-missing values)
    clean = series.dropna()
    if len(clean) > 0:
        analysis.update({
            "min": clean.min(),
            "max": clean.max(),
            "mean": clean.mean(),
            "median": clean.median(),
            "std": clean.std(),
            "skewness": stats.skew(clean),
            "kurtosis": stats.kurtosis(clean),
        })

        # Distribution shape recommendations
        skew = abs(analysis["skewness"])
        if skew > 2:
            analysis["transform_recommendation"] = "log or sqrt transform (highly skewed)"
        elif skew > 1:
            analysis["transform_recommendation"] = "consider log transform (moderately skewed)"
        else:
            analysis["transform_recommendation"] = "standard scaling likely sufficient"

        # Outlier detection using IQR
        q1, q3 = clean.quantile([0.25, 0.75])
        iqr = q3 - q1
        outlier_mask = (clean < q1 - 1.5 * iqr) | (clean > q3 + 1.5 * iqr)
        analysis["outlier_count"] = outlier_mask.sum()
        analysis["outlier_pct"] = outlier_mask.mean() * 100

    return analysis

# Example usage
np.random.seed(42)
income = pd.Series(np.random.lognormal(10, 1, 10000))  # Log-normal income data
analysis = analyze_numerical_feature(income, "household_income")

print(f"Feature: {analysis['name']}")
print(f"Skewness: {analysis['skewness']:.2f}")
print(f"Recommendation: {analysis['transform_recommendation']}")
print(f"Outliers: {analysis['outlier_pct']:.1f}%")
```

For tree-based models (Random Forest, XGBoost), scaling is unnecessary: trees split on thresholds regardless of scale.
For linear models, SVMs, neural networks, and distance-based methods (k-NN, k-means), scaling is critical. When in doubt, standardize (z-score) or min-max normalize, but be aware that both approaches handle outliers differently.
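The difference between the two scalers is easiest to see on data with an outlier. A minimal sketch with invented values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

# Mostly small values plus one extreme outlier
x = np.array([[1.0], [2.0], [3.0], [4.0], [100.0]])

z = StandardScaler().fit_transform(x)    # z-scores: the outlier inflates mean and std
mm = MinMaxScaler().fit_transform(x)     # maps to [0, 1]: inliers crushed near 0

print("standardized:", z.ravel().round(2))
print("min-max:     ", mm.ravel().round(2))
```

Standardization keeps the outlier's distance visible but shifts every z-score; min-max pins the range to [0, 1] but squeezes the four inliers into a tiny sliver near 0. Neither is "correct" in general, which is why outlier handling should be decided before scaling.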
Categorical features represent discrete classes or groups. Unlike numerical features, arithmetic operations are meaningless—you cannot average 'red' and 'blue.' The fundamental distinction within categoricals is whether an order exists.
Nominal (Unordered) Categorical Features
Nominal features have no inherent ordering: country, color, product category, and user ID are typical examples. Any ordering you impose is arbitrary.
The critical insight: assigning integers to categories implies a false ordering. If you encode {red: 1, blue: 2, green: 3}, a linear model will treat green as 'more than' blue, which is semantically meaningless.
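To make the false-ordering problem concrete, here is a small sketch. The data and the color codes are invented for illustration: 'blue' has a low target while 'red' and 'green' are high, so no ordering of the colors yields a straight line.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy data (invented): per-color target means are 10, 1, 9
df = pd.DataFrame({
    "color": ["red", "blue", "green"] * 20,
    "y": [10.0, 1.0, 9.0] * 20,
})

# Integer encoding imposes red < blue < green
X_int = df["color"].map({"red": 1, "blue": 2, "green": 3}).to_frame()
pred_int = LinearRegression().fit(X_int, df["y"]).predict(X_int.iloc[:3])

# One-hot encoding gives each category its own coefficient
X_oh = pd.get_dummies(df["color"])
pred_oh = LinearRegression().fit(X_oh, df["y"]).predict(X_oh.iloc[:3])

print("integer-coded:", pred_int.round(2))  # forced onto a line, far from the true means
print("one-hot:      ", pred_oh.round(2))   # recovers the per-color means
```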
Ordinal Categorical Features
Ordinal features have meaningful order but non-uniform intervals: education level, satisfaction rating, and clothing size (S/M/L) are common examples.
The challenge: the gaps between categories are semantically unequal. The jump from 'unsatisfied' to 'neutral' might be psychologically larger than 'satisfied' to 'very satisfied.'
| Strategy | Description | Preserves Order | Preserves Intervals | Use Case |
|---|---|---|---|---|
| Integer Encoding | Map to 0, 1, 2, ..., k-1 | ✓ Yes | ✗ Assumes equal | Tree models; quick baseline |
| Custom Mapping | Domain-informed numeric values | ✓ Yes | ~ Approximated | When domain knowledge suggests intervals |
| Target-Based Ordering | Order by target correlation | ✓ Yes | ~ Data-driven | When order is uncertain or task-specific |
| Treat as Nominal | One-hot encoding | ✗ Lost | N/A | When interval uncertainty is high; linear models |
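As a sketch of the Custom Mapping strategy from the table above — the numeric values here are invented for illustration; a real mapping would come from domain research:

```python
import pandas as pd

ratings = pd.Series(["unsatisfied", "neutral", "satisfied", "very satisfied"])

# Equal-interval integer coding: assumes every step is the same size
equal_map = {"unsatisfied": 0, "neutral": 1, "satisfied": 2, "very satisfied": 3}

# Hypothetical domain-informed mapping: the unsatisfied -> neutral gap is
# treated as larger than the satisfied -> very satisfied gap
custom_map = {"unsatisfied": 0, "neutral": 2, "satisfied": 3, "very satisfied": 3.5}

print(ratings.map(equal_map).tolist())
print(ratings.map(custom_map).tolist())
```

Both encodings preserve the order; only the custom mapping attempts to preserve something about the intervals.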
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from category_encoders import TargetEncoder

# Sample data
df = pd.DataFrame({
    "color": ["red", "blue", "green", "red", "blue", "green", "red", "blue"],
    "size": ["S", "M", "L", "XL", "S", "M", "L", "XL"],
    "target": [0, 1, 1, 1, 0, 0, 1, 1],
})

# 1. One-Hot Encoding for nominal features
onehot = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
color_encoded = onehot.fit_transform(df[["color"]])
color_columns = onehot.get_feature_names_out(["color"])
print("One-Hot Encoded Colors:")
print(pd.DataFrame(color_encoded, columns=color_columns))

# 2. Ordinal Encoding with explicit order
size_order = ["S", "M", "L", "XL"]  # Smallest to largest
ordinal = OrdinalEncoder(categories=[size_order])
df["size_ordinal"] = ordinal.fit_transform(df[["size"]])
print(f"Ordinal Encoded Sizes: {df['size_ordinal'].tolist()}")

# 3. Target Encoding (with smoothing to prevent overfitting)
target_enc = TargetEncoder(smoothing=1.0)
df["color_target_encoded"] = target_enc.fit_transform(df["color"], df["target"])
print("Target Encoded Colors:")
print(df[["color", "color_target_encoded"]].drop_duplicates())

# 4. Frequency Encoding
freq_map = df["color"].value_counts(normalize=True).to_dict()
df["color_freq"] = df["color"].map(freq_map)
print("Frequency Encoded Colors:")
print(df[["color", "color_freq"]].drop_duplicates())
```

Features like user_id, product_sku, or IP address can have millions of unique values. One-hot encoding is impractical (millions of columns). Target encoding risks overfitting (some categories appear once). Hashing causes collisions. Embeddings require neural architectures. There's no perfect solution — choose based on model type, data volume, and acceptable complexity.
Binary features represent exactly two mutually exclusive states. While technically categorical, they deserve special treatment due to their simplicity and ubiquity.
Common Binary Feature Types:
| Category | Examples |
|---|---|
| Boolean flags | is_verified, has_subscription, email_confirmed |
| Presence/absence | clicked, purchased, churned, converted |
| Binary classification | pass/fail, spam/ham, positive/negative |
| Demographic | gender (when binary), citizen/non-citizen |
| Threshold-based | high_value_customer, above_median_income |
Encoding is trivial: 0/1 representation works for virtually all algorithms. The main considerations are semantic clarity and handling edge cases.
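One edge case worth handling explicitly is missingness: calling `.astype(int)` on a flag column that contains `None` raises an error, and coding flags via `notna()` silently folds "unknown" into one of the two states. A sketch (the column names are hypothetical):

```python
import pandas as pd

flags = pd.Series([True, False, None, True])

# Map True/False explicitly; missing values stay NaN instead of being
# silently absorbed into either state
encoded = flags.map({True: 1, False: 0})

# Decide explicitly: impute, and optionally keep a missingness indicator
out = pd.DataFrame({
    "flag": encoded.fillna(0).astype(int),       # treat unknown as 0 ...
    "flag_missing": flags.isna().astype(int),    # ... but record that it was unknown
})
print(out)
```

Whether "missing" should become 0, 1, or its own indicator column is a modeling decision, not a formatting detail.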
Creating Binary Features from Other Types
Binary features are often derived by thresholding or categorizing other feature types:
```python
# From numerical: thresholding
df['is_high_income'] = (df['income'] > 100000).astype(int)
df['is_adult'] = (df['age'] >= 18).astype(int)

# From categorical: presence detection
df['is_premium_member'] = (df['membership_type'] == 'premium').astype(int)
df['is_us_based'] = df['country'].isin(['US', 'USA', 'United States']).astype(int)

# From text: pattern matching (na=False treats missing text as "no match")
df['has_email'] = df['email'].notna().astype(int)
df['mentions_discount'] = (
    df['review_text'].str.contains('discount|coupon|sale', case=False, na=False).astype(int)
)
```
These derived binaries often capture domain-specific thresholds that are more predictive than the raw values.
Feature types aren't fixed; they can be converted based on modeling needs. Understanding these conversions and their tradeoffs is essential for effective feature engineering.
Information Flow in Conversions:
```
Continuous → Discrete → Ordinal  → Nominal  → Binary
     ↓           ↓          ↓          ↓         ↓
 (binning)  (ordering)  (collapse)  (one-hot) (present)

Each arrow represents INFORMATION LOSS.
Reverse arrows (binned → continuous) recover approximations, not originals.
```
| From | To | Method | Information Impact |
|---|---|---|---|
| Continuous | Discrete | Rounding, floor, ceiling | Loses precision; useful for counts or when precision is noise |
| Continuous | Ordinal | Quantile binning, equal-width binning | Loses magnitude; gains robustness to outliers |
| Continuous | Binary | Thresholding | Major information loss; captures only one split point |
| Discrete | Categorical | Treat values as labels | Loses numerical relationships; gains flexibility |
| Ordinal | Nominal | Ignore ordering | Loses order information; may be necessary for fair one-hot |
| Ordinal | Numerical | Integer coding | Assumes equal intervals (may be false) |
| Categorical | Binary (each) | One-hot encoding | Preserves all; explodes dimensionality |
| Categorical | Numerical | Target/frequency encoding | Summarizes via statistics; risks leakage |
Convert continuous → binned when: outliers cause problems, relationship is non-linear and bins capture it, or interpretability matters. Convert categorical → numerical when: high cardinality makes one-hot impractical, target encoding reduces overfitting via smoothing, or embedding layers aren't available. Keep original features when possible—let the model discover the best representation.
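As a sketch of the continuous → binned conversion on synthetic income data, comparing the two binning methods from the table above:

```python
import numpy as np
import pandas as pd

np.random.seed(0)
income = pd.Series(np.random.lognormal(10, 1, 1000))  # synthetic, right-skewed

# Equal-width bins: sensitive to outliers (one huge value stretches all bins)
width_bins = pd.cut(income, bins=4, labels=["low", "mid", "high", "very_high"])

# Quantile bins: equal-sized groups, robust to outliers
quant_bins = pd.qcut(income, q=4, labels=["q1", "q2", "q3", "q4"])

print(width_bins.value_counts())   # almost everything lands in the lowest bin
print(quant_bins.value_counts())   # 250 observations per bin
```

On skewed data, equal-width binning wastes most of its bins on the long tail, while quantile binning guarantees balanced groups at the cost of unequal bin widths.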
Real datasets rarely come with explicit type annotations. The data format (int, float, string) is a clue but not definitive. Here's a systematic approach to type identification:
```python
import pandas as pd
import numpy as np

def infer_feature_type(series: pd.Series, name: str) -> dict:
    """
    Infer semantic feature type from data characteristics.
    Returns type classification and confidence.
    """
    result = {"name": name, "dtype": str(series.dtype)}
    clean = series.dropna()
    n_unique = clean.nunique()
    n_total = len(clean)
    unique_ratio = n_unique / n_total if n_total > 0 else 0

    # String-like columns
    if series.dtype == "object" or str(series.dtype) == "string":
        if n_unique == 2:
            result["inferred_type"] = "binary"
            result["confidence"] = "high"
        elif n_unique <= 20:
            result["inferred_type"] = "categorical_lowcard"
            result["confidence"] = "high"
        elif n_unique <= 100:
            result["inferred_type"] = "categorical_medcard"
            result["confidence"] = "medium"
        else:
            # Could be high-cardinality categorical or text
            avg_len = clean.astype(str).str.len().mean()
            if avg_len > 50:
                result["inferred_type"] = "text"
                result["confidence"] = "medium"
            else:
                result["inferred_type"] = "categorical_highcard"
                result["confidence"] = "medium"

    # Numeric columns
    elif np.issubdtype(series.dtype, np.number):
        is_integer = clean.apply(lambda x: float(x).is_integer()).all()
        if n_unique == 2:
            result["inferred_type"] = "binary"
            result["confidence"] = "high"
        elif is_integer and n_unique <= 10:
            # Low-cardinality integers: could be ordinal, discrete, or categorical
            result["inferred_type"] = "discrete_or_ordinal"
            result["confidence"] = "low"
            result["note"] = "Inspect values to determine if ordinal or categorical"
        elif is_integer and n_unique <= 50:
            result["inferred_type"] = "discrete_numerical"
            result["confidence"] = "medium"
        elif unique_ratio > 0.9:
            result["inferred_type"] = "continuous"
            result["confidence"] = "high"
        elif unique_ratio > 0.5:
            result["inferred_type"] = "continuous"
            result["confidence"] = "medium"
        else:
            result["inferred_type"] = "discrete_numerical"
            result["confidence"] = "medium"

    # Datetime
    elif pd.api.types.is_datetime64_any_dtype(series):
        result["inferred_type"] = "datetime"
        result["confidence"] = "high"
    else:
        result["inferred_type"] = "unknown"
        result["confidence"] = "low"

    result["unique_values"] = n_unique
    result["unique_ratio"] = unique_ratio
    return result

# Apply to a DataFrame
def analyze_all_features(df: pd.DataFrame) -> pd.DataFrame:
    results = [infer_feature_type(df[col], col) for col in df.columns]
    return pd.DataFrame(results)

# Example
df = pd.DataFrame({
    "user_id": range(1000),
    "age": np.random.randint(18, 80, 1000),
    "income": np.random.lognormal(10, 1, 1000),
    "gender": np.random.choice(["M", "F"], 1000),
    "education": np.random.choice(["HS", "BS", "MS", "PhD"], 1000),
    "rating": np.random.randint(1, 6, 1000),
    "is_premium": np.random.choice([0, 1], 1000),
})

print(analyze_all_features(df).to_string())
```

Feature types are the foundation of feature engineering. Getting them right enables all subsequent transformations; getting them wrong corrupts your entire modeling pipeline. Here are the key takeaways:

- Classify features by semantic meaning, not storage format: zip codes are numbers in form but categories in meaning.
- Match encoding to type: nominal features need one-hot or similar encodings, while ordinal encodings must preserve order.
- Integer-coding nominal categories imposes a false ordering that misleads linear models.
- Every type conversion loses information; keep the original feature alongside the converted one when possible.
You now have a rigorous understanding of feature types—the atomic units of machine learning. Next, we'll explore how domain knowledge transforms raw features into powerful predictors that capture business and scientific understanding.