When humans observe the world, we perceive objects through a rich tapestry of sensory input—colors, textures, sounds, context, and meaning accumulated over a lifetime of experience. A doctor glancing at a patient's face sees not just features, but subtle signs of health or illness. An experienced car buyer evaluates not just specifications, but an intuitive sense of value.
Machine learning algorithms have no such intuition. They cannot 'see' in any meaningful sense. They cannot understand. What they can do is process numbers—vast matrices of numerical values, computed with extraordinary speed and precision.
This creates a fundamental challenge: How do we translate the richness of the real world into numbers that machines can process?
The answer lies in two foundational concepts that form the bedrock of every machine learning system: features and labels. These are not mere vocabulary—they represent a profound philosophical stance about how knowledge can be represented, quantified, and learned from data.
By the end of this page, you will understand:
- What features and labels are in precise mathematical terms
- How raw data is transformed into feature representations
- The different types of features and their implications
- How feature engineering affects model performance
- The relationship between features, labels, and the learning task
- Real-world examples across diverse domains
A feature (also called an attribute, predictor, input variable, or independent variable) is a measurable property or characteristic of an entity that we believe contains information relevant to the prediction task at hand.
Formally, given an entity $x$ from our domain of interest, a feature is a function $\phi: \mathcal{X} \rightarrow \mathbb{R}$ (or more generally, to some measurable space) that extracts a numerical value from that entity.
The Feature Vector:
When we combine multiple features, we create a feature vector—an ordered collection of feature values that completely describes an entity from the machine learning algorithm's perspective:
$$\mathbf{x} = [x_1, x_2, x_3, \ldots, x_d]^T \in \mathbb{R}^d$$
where $d$ is the dimensionality of the feature space, and each $x_i$ represents the value of the $i$-th feature for this particular entity.
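To make this concrete, here is a minimal sketch of feature functions applied to a raw entity and combined into a feature vector. The dictionary and helper functions below are illustrative, not part of any particular library.

```python
# A minimal sketch of feature functions phi: X -> R.
# The 'house' dictionary and the helpers below are illustrative examples.
house = {"sqft": 2500, "bedrooms": 4, "year_built": 2009, "city": "Seattle"}

def phi_sqft(entity):
    """Extract square footage as a numerical feature."""
    return float(entity["sqft"])

def phi_age(entity, current_year=2024):
    """Derive the age of the house from its build year."""
    return float(current_year - entity["year_built"])

def phi_is_seattle(entity):
    """Encode a categorical property as a binary feature."""
    return 1.0 if entity["city"] == "Seattle" else 0.0

# Applying several feature functions yields a feature vector in R^3
x = [phi_sqft(house), phi_age(house), phi_is_seattle(house)]
print(x)  # [2500.0, 15.0, 1.0]
```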
The choice of features is arguably the most critical decision in any machine learning project. The same underlying data can yield completely different feature representations, and thus completely different model performance. A house can be represented by [square_footage, bedrooms, bathrooms] or by [price_per_sqft, age_years, school_district_rating]. Same house, different features, different insights.
Concrete Example: Predicting House Prices
Consider a house in a real estate dataset. The physical house is a complex object with countless properties. But for our ML model, we might represent it as:
$$\mathbf{x}_{house} = \begin{bmatrix} 2500 & \text{(square footage)} \\ 4 & \text{(bedrooms)} \\ 2.5 & \text{(bathrooms)} \\ 15 & \text{(age in years)} \\ 0.25 & \text{(lot size in acres)} \\ 3 & \text{(garage capacity)} \\ 8.5 & \text{(school district rating)} \end{bmatrix}$$
This 7-dimensional feature vector is all the algorithm 'knows' about the house. Any information not captured in these features is invisible to the model. This is both a limitation and a feature (pun intended)—it forces us to be explicit about what information we believe is predictively relevant.
| Feature Type | Definition | Examples | Representation Challenges |
|---|---|---|---|
| Numerical (Continuous) | Features that can take any real value within a range | Temperature, income, height, age | Scaling, normalization, handling outliers |
| Categorical (Nominal) | Features with discrete, unordered categories | Color, country, blood type, gender | One-hot encoding, embedding, cardinality |
| Ordinal | Categorical features with meaningful ordering | Education level, customer rating, T-shirt size | Encoding to preserve order information |
| Binary | Features with exactly two possible values | Is_spam, has_pool, is_weekend | Typically encoded as 0/1 |
| Text | Natural language content | Email content, reviews, tweets | Bag-of-words, TF-IDF, embeddings |
| Temporal | Time-based measurements | Timestamps, durations, sequences | Cyclical encoding, lag features, trends |
| Spatial | Location-based information | Coordinates, addresses, regions | Geo-hashing, distance features, clustering |
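As a brief illustration of how a few of these types are encoded in practice, here is a sketch using pandas and scikit-learn; the toy values are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data illustrating three feature types from the table above (values are hypothetical)
df = pd.DataFrame({
    "color": ["red", "blue", "green"],   # categorical (nominal)
    "shirt_size": ["S", "L", "M"],       # ordinal
    "has_pool": [True, False, True],     # binary
})

# Nominal: one-hot encoding creates one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: integer codes that preserve the category order S < M < L
ordinal = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(df[["shirt_size"]])

# Binary: a simple 0/1 cast
binary = df["has_pool"].astype(int)

print(one_hot)
print(ordinal.ravel())   # [0. 2. 1.]
print(binary.values)     # [1 0 1]
```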
When we represent entities as feature vectors, we implicitly place them in a feature space—a mathematical space where each dimension corresponds to a feature. This geometric perspective is fundamental to understanding how ML algorithms work.
The Feature Space $\mathcal{X}$:
For a problem with $d$ features, each entity becomes a point in $\mathbb{R}^d$. This transformation is profound: it turns questions about real-world entities into questions about geometry.
Why This Matters:
In this geometric view, similar entities are nearby points, and dissimilar entities are distant points. Most ML algorithms exploit this geometry in some way: k-nearest neighbors predicts from the closest points, support vector machines search for separating hyperplanes, and clustering algorithms group points that lie close together.
As dimensionality increases, the geometry of the feature space becomes counterintuitive. In high dimensions, almost all pairs of points are nearly equidistant, and the volume concentrates in thin shells near the surface of hyperspheres. This 'curse of dimensionality' profoundly affects algorithm design and is why dimensionality reduction techniques (PCA, t-SNE, UMAP) are so important.
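A quick numerical sketch of this effect, using random uniform data with point counts and dimensions chosen arbitrarily: as the dimensionality grows, the ratio between the largest and smallest pairwise distances shrinks toward 1, meaning points become nearly equidistant.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Compare pairwise distances of 200 random points in increasingly many dimensions
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))
    dists = pdist(X)  # all pairwise Euclidean distances
    ratio = dists.max() / dists.min()
    print(f"d={d:5d}  max/min distance ratio: {ratio:.2f}")
# The ratio approaches 1 as d grows: nearly all pairs become almost equally far apart.
```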
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.spatial.distance import cdist

# Load the classic Iris dataset
# Each flower has 4 features: sepal length, sepal width, petal length, petal width
iris = load_iris()
X = iris.data    # Shape: (150, 4) - 150 flowers, 4 features each
y = iris.target  # Labels: 0, 1, 2 for three species

# Each row is a point in 4-dimensional feature space
print(f"Feature space dimensionality: {X.shape[1]}")
print(f"Number of data points: {X.shape[0]}")

# Example: First flower as a point in R^4
print(f"\nFirst flower in feature space:")
print(f"  x = [{X[0, 0]:.1f}, {X[0, 1]:.1f}, {X[0, 2]:.1f}, {X[0, 3]:.1f}]")
print(f"  Features: {iris.feature_names}")

# Compute distances between points in feature space
# Distance reflects similarity - closer points are more similar
distances = cdist(X, X, metric='euclidean')
print(f"\nDistance between flower 0 and flower 1: {distances[0, 1]:.2f}")
print(f"Distance between flower 0 and flower 50: {distances[0, 50]:.2f}")
# Note: Flowers 0 and 1 are the same species, so they are likely closer in feature space

# Visualize a 2D projection (we lose information but can see patterns)
fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis',
                     edgecolor='black', s=100, alpha=0.7)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.set_title('Iris Dataset: 2D Projection of 4D Feature Space')
plt.colorbar(scatter, label='Species')
plt.show()
```

Key Insight: Features Define What the Model Can Learn
The feature space is the universe within which your ML algorithm operates. If relevant information isn't encoded in the features, it's as if that information doesn't exist. The algorithm cannot discover patterns that aren't represented in its input.
This is why feature engineering—the art and science of creating informative features—is often more impactful than algorithm selection. A simple model with excellent features will frequently outperform a complex model with poor features.
A label (also called target, outcome, response variable, dependent variable, or ground truth) is the value we want our model to predict. In supervised learning, labels are the 'answers' that the model learns to produce given input features.
Formally, for a supervised learning problem, we have a feature space $\mathcal{X}$, a label space $\mathcal{Y}$, and the goal of predicting a label $y \in \mathcal{Y}$ from a feature vector $\mathbf{x} \in \mathcal{X}$.
The nature of $\mathcal{Y}$ (the label space) determines the type of learning problem:
Regression ($\mathcal{Y} = \mathbb{R}$ or $\mathbb{R}^k$): Labels are continuous values
Classification ($\mathcal{Y}$ is a finite set): Labels are discrete categories
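As a small illustration, the same kind of label vector describes either problem type depending on the label space; the values below are invented.

```python
import numpy as np

# Regression: labels are continuous values (e.g., house prices in $1000s)
y_regression = np.array([312.5, 189.0, 455.2, 270.8])

# Classification: labels are discrete categories (e.g., spam vs. not_spam)
y_classification = np.array(["spam", "not_spam", "not_spam", "spam"])

print(y_regression.dtype)            # float64 -> continuous label space, regression
print(np.unique(y_classification))   # finite set of classes -> classification
```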
Defining the label requires careful thought. Predicting 'customer churn' sounds straightforward, but:
- What counts as churned? No activity for 30 days? 90 days? Account cancellation?
- At what point do we make the prediction? At signup? After first purchase?
- Is it binary (churn/not churn) or continuous (probability of churn)?
The definition of the label fundamentally shapes what the model learns and how useful it is in practice.
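To see how much the label definition matters, here is a hedged sketch: the same hypothetical activity records yield different label columns under two different churn definitions.

```python
import pandas as pd

# Hypothetical customer activity data: days since each customer's last activity
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "days_since_last_activity": [10, 45, 120, 400],
    "account_cancelled": [False, False, False, True],
})

# Definition A: churned = no activity for 90+ days
customers["churn_90d"] = (customers["days_since_last_activity"] >= 90).astype(int)

# Definition B: churned = explicit account cancellation
customers["churn_cancel"] = customers["account_cancelled"].astype(int)

print(customers[["customer_id", "churn_90d", "churn_cancel"]])
# The two definitions disagree on customer 3 - the model would learn different things.
```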
With features and labels defined, we can now formalize the supervised learning setup. This framework underlies everything from simple linear regression to deep neural networks.
The Dataset:
A labeled dataset $\mathcal{D}$ consists of $n$ examples:
$$\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})\}$$
This is often written in matrix form:
$$\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ (\mathbf{x}^{(2)})^T \\ \vdots \\ (\mathbf{x}^{(n)})^T \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^n$$
where each row of $\mathbf{X}$ is the (transposed) feature vector of one example, and the $i$-th entry of $\mathbf{y}$ is the label of the $i$-th example.
| Concept | This Course | Statistics | Deep Learning | Alternative Names |
|---|---|---|---|---|
| Input | $\mathbf{x}$ | $\mathbf{x}$ or $\mathbf{X}$ | $\mathbf{x}$ | features, predictors, covariates |
| Output | $y$ | $y$ or $Y$ | $y$ or $t$ (target) | label, response, outcome |
| Number of samples | $n$ | $n$ | $N$ or $m$ | observations, examples |
| Number of features | $d$ | $p$ | $D$ or $n_x$ | dimensions, attributes |
| Sample index | $(i)$ superscript | $i$ subscript | $(i)$ or $i$ | observation index |
| Feature index | $j$ subscript | $j$ subscript | $j$ subscript | variable index |
| Design matrix | $\mathbf{X}$ | $\mathbf{X}$ | $\mathbf{X}$ | feature matrix, data matrix |
The Learning Goal:
Given the dataset $\mathcal{D}$, the goal is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that maps feature vectors to labels. We want $f$ to fit the training data well and, more importantly, to generalize to new, unseen examples.
The tension between these goals—fitting training data while generalizing well—is the central challenge of machine learning and leads directly to concepts like overfitting, regularization, and model selection.
```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing dataset
housing = fetch_california_housing()

# The Feature Matrix X: shape (n_samples, n_features)
X = housing.data
print(f"Feature Matrix X shape: {X.shape}")
print(f"  n (samples): {X.shape[0]}")
print(f"  d (features): {X.shape[1]}")
print(f"  Feature names: {housing.feature_names}")

# The Label Vector y: shape (n_samples,)
y = housing.target
print(f"\nLabel Vector y shape: {y.shape}")
print(f"  Label: Median house value (in $100,000s)")
print(f"  Range: [{y.min():.2f}, {y.max():.2f}]")

# One example: the i-th data point
i = 0
print(f"\nExample {i}:")
print(f"  Feature vector x^({i}): {X[i]}")
print(f"  Label y^({i}): {y[i]:.3f} (=${y[i] * 100000:.0f})")

# The supervised learning setup
# Goal: Learn f such that f(x) ≈ y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Learn the function f (a linear model in this case)
model = LinearRegression()
model.fit(X_train, y_train)  # Learning from (X, y) pairs

# Evaluate: Does f generalize to unseen data?
y_pred = model.predict(X_test)
print(f"\nModel Performance on Test Set:")
print(f"  R² score: {r2_score(y_test, y_pred):.4f}")
print(f"  RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.0f}")
```

Feature engineering is the process of using domain knowledge to create, transform, and select features that make machine learning algorithms work better. It is often the difference between a mediocre model and a highly effective one.
"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng, Stanford
Feature engineering encompasses several activities: creating new features from raw data, transforming existing ones (scaling, encoding, binning), combining features into interactions, and selecting the most informative subset.
Example: Feature Engineering for Date/Time
Consider a raw timestamp feature: 2024-12-25 14:30:00
A naive approach might encode this as a single number (Unix timestamp). But this loses valuable patterns. A skilled feature engineer might extract:
| Derived Feature | Value | Captures |
|---|---|---|
hour_of_day | 14 | Daily patterns (rush hour, late night) |
day_of_week | 2 (Wednesday) | Weekly patterns (weekday vs weekend) |
month | 12 | Seasonal patterns |
is_weekend | 0 | Binary weekend indicator |
is_holiday | 1 (Christmas) | Holiday effects |
sin_hour | sin(2π × 14/24) | Cyclical encoding (14:00 is close to 13:00 AND 15:00) |
cos_hour | cos(2π × 14/24) | Cyclical encoding companion |
This transformation makes patterns that were implicit in the raw timestamp explicit and learnable.
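As a sketch, several of the derived features in the table can be computed directly from the timestamp with pandas; the holiday flag is omitted here because it requires a holiday calendar.

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2024-12-25 14:30:00")

features = {
    "hour_of_day": ts.hour,                        # 14
    "day_of_week": ts.dayofweek,                   # 2 (Monday=0, so Wednesday)
    "month": ts.month,                             # 12
    "is_weekend": int(ts.dayofweek >= 5),          # 0
    "sin_hour": np.sin(2 * np.pi * ts.hour / 24),  # cyclical encoding
    "cos_hour": np.cos(2 * np.pi * ts.hour / 24),  # cyclical encoding companion
}
print(features)
# The sin/cos pair places 23:00 and 00:00 next to each other,
# which a raw hour value (23 vs. 0) would not.
```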
One reason deep learning has been transformative is that neural networks can learn feature representations directly from raw data (images, audio, text). However, even in deep learning, feature engineering still matters: data preprocessing, augmentation, and architecture design are all forms of encoding human knowledge about the problem structure.
```python
import pandas as pd
import numpy as np

# Raw data with mixed feature types
data = pd.DataFrame({
    'price': [250000, 350000, 180000, 420000, 290000],
    'sqft': [1500, 2200, 1100, 2800, 1700],
    'bedrooms': [3, 4, 2, 5, 3],
    'city': ['Seattle', 'Portland', 'Seattle', 'Portland', 'Seattle'],
    'year_built': [1990, 2005, 1975, 2018, 1998],
    'sale_date': pd.to_datetime(['2024-03-15', '2024-06-20', '2024-01-10',
                                 '2024-09-05', '2024-04-22'])
})
print("Raw Data:")
print(data)

# Feature Engineering Examples

# 1. Create derived features
data['age'] = 2024 - data['year_built']                 # More interpretable than year
data['price_per_sqft'] = data['price'] / data['sqft']   # Normalized metric

# 2. Extract temporal features
data['sale_month'] = data['sale_date'].dt.month
data['sale_quarter'] = data['sale_date'].dt.quarter
data['is_spring_summer'] = data['sale_month'].isin([3, 4, 5, 6, 7, 8]).astype(int)

# 3. Create interaction features
data['sqft_per_bedroom'] = data['sqft'] / data['bedrooms']  # Room size indicator

# 4. Bin continuous features
data['size_category'] = pd.cut(data['sqft'],
                               bins=[0, 1200, 1800, 2500, np.inf],
                               labels=['small', 'medium', 'large', 'xlarge'])

print("\nEngineered Features:")
print(data[['age', 'price_per_sqft', 'sale_month', 'sqft_per_bedroom', 'size_category']])

# 5. Encode categorical features for ML
# One-hot encode 'city'
city_dummies = pd.get_dummies(data['city'], prefix='city')
print("\nOne-Hot Encoded City:")
print(city_dummies)

# The final feature matrix would include:
# - Numeric features (scaled): sqft, bedrooms, age, price_per_sqft, sale_month, sqft_per_bedroom
# - Categorical features (one-hot): city_Portland, city_Seattle
# - Binary features: is_spring_summer
print("\n→ Final feature count increased from 5 raw features to 10+ engineered features")
print("→ Each engineered feature captures domain-specific patterns")
```

Understanding the different types of features and their proper handling is essential for effective machine learning. Each feature type requires specific preprocessing and encoding strategies.
Numerical (Continuous) Features take values from a continuous range. They are the most straightforward to use in ML algorithms but still require careful handling.
Key Considerations:
Scaling: Many algorithms (SVM, neural networks, k-NN) are sensitive to feature magnitudes. Features should often be scaled to comparable ranges.
Distribution: Heavily skewed features may benefit from transformations such as log, square-root, or Box-Cox.
Outliers: Extreme values can dominate learning. Consider clipping (winsorizing), robust scaling, or removing values that are clearly data-entry errors (see the sketch below).
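A brief sketch of two of these remedies on synthetic, deliberately skewed values (the numbers are invented); scaling itself is demonstrated in the next snippet.

```python
import numpy as np

# Synthetic right-skewed incomes with one extreme outlier (illustrative values)
income = np.array([30_000, 42_000, 55_000, 61_000, 75_000, 2_000_000], dtype=float)

# Log transform compresses the long right tail
log_income = np.log1p(income)
print(log_income.round(2))

# Winsorizing / clipping: cap extreme values, here at Q3 + 1.5 * IQR
q1, q3 = np.percentile(income, [25, 75])
upper = q3 + 1.5 * (q3 - q1)
capped = np.clip(income, None, upper)
print(capped.round(0))  # the 2,000,000 outlier is capped near 111,000
```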
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np

# Sample data with different scales
X = np.array([
    [25, 50000, 0.15],   # age, income, savings_rate
    [35, 75000, 0.22],
    [45, 120000, 0.35],
    [28, 45000, 0.08],
    [55, 200000, 0.45]
])

# Standard Scaling: mean = 0, std = 1
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
print("Standard Scaled (z-scores):")
print(X_standard.round(2))

# Min-Max Scaling: to [0, 1] range
scaler = MinMaxScaler()
X_minmax = scaler.fit_transform(X)
print("\nMin-Max Scaled:")
print(X_minmax.round(2))

# Robust Scaling: uses median and IQR, so it is less sensitive to outliers
scaler = RobustScaler()
X_robust = scaler.fit_transform(X)
print("\nRobust Scaled:")
print(X_robust.round(2))
```

The concepts of features and labels apply universally across machine learning applications. Let's examine how they manifest in different domains:
| Domain | Example Task | Sample Features | Label | Challenge |
|---|---|---|---|---|
| Healthcare | Disease prediction | Age, symptoms, lab values, genetic markers, imaging features | Disease present (binary) or severity (continuous) | Class imbalance, missing data, interpretability requirements |
| Finance | Credit scoring | Income, debt ratio, payment history, employment length, credit utilization | Default probability or risk category | Fairness constraints, regulatory requirements, concept drift |
| E-commerce | Product recommendation | User history, item attributes, session behavior, temporal patterns | Purchase probability or rating | Cold start problem, implicit feedback, real-time requirements |
| Autonomous Vehicles | Object detection | Pixel values from cameras, LiDAR point clouds, radar returns | Bounding boxes + object classes | Real-time constraints, safety-critical, rare events |
| Natural Language | Sentiment analysis | Word embeddings, syntactic features, n-grams, sentence length | Positive/negative/neutral sentiment | Sarcasm, context-dependence, domain adaptation |
| Manufacturing | Defect detection | Sensor readings, process parameters, image features, timing data | Defect type or pass/fail | Extreme class imbalance, interpretability, sensor noise |
In many real applications, defining effective features and appropriate labels requires deep domain expertise. A radiologist understands which image features indicate malignancy. A financial analyst knows which market indicators predict volatility. This domain knowledge is often more valuable than ML algorithm expertise.
Case Study: Email Spam Detection
Let's trace through a complete example of defining features and labels for spam detection:
The Label:
y ∈ {spam, not_spam} or {1, 0}

Feature Engineering Process:
Step 1: Text-Based Features (e.g., frequency of trigger words such as 'free' or 'winner', ratio of capital letters, number of exclamation marks)
Step 2: Structural Features (e.g., number of links, presence of attachments, HTML-to-text ratio)
Step 3: Sender Features (e.g., sender domain reputation, whether the sender appears in the recipient's contacts)
Step 4: Behavioral Features (e.g., how often similar messages from this sender were previously marked as spam)
The final feature vector might have hundreds of dimensions, each capturing a different signal of spamminess. The model learns which combination of features best predicts the label.
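As a hedged sketch of what a tiny slice of such an extractor might look like, the heuristics and thresholds below are purely illustrative, not a production spam filter.

```python
import re

def extract_spam_features(email_text: str, sender_domain: str) -> dict:
    """Illustrative feature extraction for spam detection (toy heuristics)."""
    return {
        # Text-based signals
        "contains_free": int("free" in email_text.lower()),
        "exclamation_count": email_text.count("!"),
        # Structural signals
        "capital_ratio": sum(c.isupper() for c in email_text) / max(len(email_text), 1),
        "num_links": len(re.findall(r"https?://", email_text)),
        # Sender signals (toy allow-list)
        "known_domain": int(sender_domain in {"example.com", "university.edu"}),
    }

features = extract_spam_features(
    "CONGRATULATIONS!!! Claim your FREE prize now: http://spam.example",
    "spam.example",
)
print(features)
```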
Even experienced practitioners make mistakes when defining features and labels. The most critical pitfall is data leakage: using features that encode information about the label which would not be available at prediction time.
Data leakage is particularly insidious because it makes models appear to perform brilliantly during evaluation but fail catastrophically in production. Always ask: 'At the moment of prediction, would I actually have access to this feature?' If the answer is no—or even 'maybe not'—exclude it or engineer it differently.
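As a minimal illustration of that question, consider a hypothetical churn table in which one column is only populated after the customer has already churned.

```python
import pandas as pd

# Hypothetical dataset: 'exit_survey_score' is collected only AFTER a customer churns,
# so it perfectly (and uselessly) predicts the label - a classic leaky feature.
df = pd.DataFrame({
    "monthly_spend": [120, 80, 45, 200, 30],
    "support_tickets": [1, 0, 4, 2, 6],
    "exit_survey_score": [None, None, 2.0, None, 1.5],  # only churned customers have it
    "churned": [0, 0, 1, 0, 1],
})

# At prediction time the survey score does not exist yet, so it must be excluded
leaky_cols = ["exit_survey_score"]
X = df.drop(columns=["churned"] + leaky_cols)
y = df["churned"]
print(X.columns.tolist())  # ['monthly_spend', 'support_tickets']
```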
We have established the foundational vocabulary for how machines perceive and process data. Let's consolidate the key concepts:
- A feature is a measurable property $\phi: \mathcal{X} \rightarrow \mathbb{R}$; combining features yields a feature vector $\mathbf{x} \in \mathbb{R}^d$, a point in feature space.
- A label $y \in \mathcal{Y}$ is the value we want to predict; the label space determines whether the task is regression or classification.
- A labeled dataset $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ is the raw material from which a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ is learned.
- Feature engineering (translating domain knowledge into informative features) often matters more than the choice of algorithm, and pitfalls such as data leakage must be avoided.
What's Next:
With features and labels understood, we now turn to a critical question: How do we properly evaluate our models? The next page explores training, validation, and test sets—the methodology that ensures our models generalize beyond the data they were trained on.
You now understand features and labels—the fundamental building blocks of supervised machine learning. You can identify feature types, apply appropriate transformations, and avoid common pitfalls. Next, we'll learn how to split data properly for robust model evaluation.