When humans observe the world, we perceive objects through a rich tapestry of sensory input—colors, textures, sounds, context, and meaning accumulated over a lifetime of experience. A doctor glancing at a patient's face sees not just features, but subtle signs of health or illness. An experienced car buyer evaluates not just specifications, but an intuitive sense of value.
Machine learning algorithms have no such intuition. They cannot 'see' in any meaningful sense. They cannot understand. What they can do is process numbers—vast matrices of numerical values, computed with extraordinary speed and precision.
This creates a fundamental challenge: How do we translate the richness of the real world into numbers that machines can process?
The answer lies in two foundational concepts that form the bedrock of every machine learning system: features and labels. These are not mere vocabulary—they represent a profound philosophical stance about how knowledge can be represented, quantified, and learned from data.
By the end of this page, you will understand:
- What features and labels are in precise mathematical terms
- How raw data is transformed into feature representations
- The different types of features and their implications
- How feature engineering affects model performance
- The relationship between features, labels, and the learning task
- Real-world examples across diverse domains
A feature (also called an attribute, predictor, input variable, or independent variable) is a measurable property or characteristic of an entity that we believe contains information relevant to the prediction task at hand.
Formally, given an entity $x$ from our domain of interest, a feature is a function $\phi: \mathcal{X} \rightarrow \mathbb{R}$ (or more generally, to some measurable space) that extracts a numerical value from that entity.
The Feature Vector:
When we combine multiple features, we create a feature vector—an ordered collection of feature values that completely describes an entity from the machine learning algorithm's perspective:
$$\mathbf{x} = [x_1, x_2, x_3, \ldots, x_d]^T \in \mathbb{R}^d$$
where $d$ is the dimensionality of the feature space, and each $x_i$ represents the value of the $i$-th feature for this particular entity.
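To make this concrete, here is a minimal sketch of feature functions applied to a raw entity and combined into a feature vector. The dictionary and helper functions below are illustrative, not part of any particular library.

```python
# A minimal sketch of feature functions phi: X -> R.
# The 'house' dictionary and the helpers below are illustrative examples.
house = {"sqft": 2500, "bedrooms": 4, "year_built": 2009, "city": "Seattle"}

def phi_sqft(entity):
    """Extract square footage as a numerical feature."""
    return float(entity["sqft"])

def phi_age(entity, current_year=2024):
    """Derive the age of the house from its build year."""
    return float(current_year - entity["year_built"])

def phi_is_seattle(entity):
    """Encode a categorical property as a binary feature."""
    return 1.0 if entity["city"] == "Seattle" else 0.0

# Applying several feature functions yields a feature vector in R^3
x = [phi_sqft(house), phi_age(house), phi_is_seattle(house)]
print(x)  # [2500.0, 15.0, 1.0]
```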
The choice of features is arguably the most critical decision in any machine learning project. The same underlying data can yield completely different feature representations, and thus completely different model performance. A house can be represented by [square_footage, bedrooms, bathrooms] or by [price_per_sqft, age_years, school_district_rating]. Same house, different features, different insights.
Concrete Example: Predicting House Prices
Consider a house in a real estate dataset. The physical house is a complex object with countless properties. But for our ML model, we might represent it as:
$$\mathbf{x}_{house} = \begin{bmatrix} 2500 & \text{(square footage)} \\ 4 & \text{(bedrooms)} \\ 2.5 & \text{(bathrooms)} \\ 15 & \text{(age in years)} \\ 0.25 & \text{(lot size in acres)} \\ 3 & \text{(garage capacity)} \\ 8.5 & \text{(school district rating)} \end{bmatrix}$$
This 7-dimensional feature vector is all the algorithm 'knows' about the house. Any information not captured in these features is invisible to the model. This is both a limitation and a feature (pun intended)—it forces us to be explicit about what information we believe is predictively relevant.
| Feature Type | Definition | Examples | Representation Challenges |
|---|---|---|---|
| Numerical (Continuous) | Features that can take any real value within a range | Temperature, income, height, age | Scaling, normalization, handling outliers |
| Categorical (Nominal) | Features with discrete, unordered categories | Color, country, blood type, gender | One-hot encoding, embedding, cardinality |
| Ordinal | Categorical features with meaningful ordering | Education level, customer rating, T-shirt size | Encoding to preserve order information |
| Binary | Features with exactly two possible values | Is_spam, has_pool, is_weekend | Typically encoded as 0/1 |
| Text | Natural language content | Email content, reviews, tweets | Bag-of-words, TF-IDF, embeddings |
| Temporal | Time-based measurements | Timestamps, durations, sequences | Cyclical encoding, lag features, trends |
| Spatial | Location-based information | Coordinates, addresses, regions | Geo-hashing, distance features, clustering |
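As a brief illustration of how a few of these types are encoded in practice, here is a sketch using pandas and scikit-learn; the toy values are invented for the example.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Toy data illustrating three feature types from the table above (values are hypothetical)
df = pd.DataFrame({
    "color": ["red", "blue", "green"],   # categorical (nominal)
    "shirt_size": ["S", "L", "M"],       # ordinal
    "has_pool": [True, False, True],     # binary
})

# Nominal: one-hot encoding creates one 0/1 column per category
one_hot = pd.get_dummies(df["color"], prefix="color")

# Ordinal: integer codes that preserve the category order S < M < L
ordinal = OrdinalEncoder(categories=[["S", "M", "L"]]).fit_transform(df[["shirt_size"]])

# Binary: a simple 0/1 cast
binary = df["has_pool"].astype(int)

print(one_hot)
print(ordinal.ravel())   # [0. 2. 1.]
print(binary.values)     # [1 0 1]
```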
When we represent entities as feature vectors, we implicitly place them in a feature space—a mathematical space where each dimension corresponds to a feature. This geometric perspective is fundamental to understanding how ML algorithms work.
The Feature Space $\mathcal{X}$:
For a problem with $d$ features, each entity becomes a point in $\mathbb{R}^d$. This transformation is profound: it turns questions about real-world entities into questions about geometry.
Why This Matters:
In this geometric view, similar entities are nearby points, and dissimilar entities are distant points. Most ML algorithms exploit this geometry in some way: k-nearest neighbors predicts from the closest points, support vector machines search for separating hyperplanes, and clustering algorithms group points that lie close together.
As dimensionality increases, the geometry of the feature space becomes counterintuitive. In high dimensions, almost all pairs of points are nearly equidistant, and the volume concentrates in thin shells near the surface of hyperspheres. This 'curse of dimensionality' profoundly affects algorithm design and is why dimensionality reduction techniques (PCA, t-SNE, UMAP) are so important.
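A quick numerical sketch of this effect, using random uniform data with point counts and dimensions chosen arbitrarily: as the dimensionality grows, the ratio between the largest and smallest pairwise distances shrinks toward 1, meaning points become nearly equidistant.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Compare pairwise distances of 200 random points in increasingly many dimensions
for d in [2, 10, 100, 1000]:
    X = rng.uniform(size=(200, d))
    dists = pdist(X)  # all pairwise Euclidean distances
    ratio = dists.max() / dists.min()
    print(f"d={d:5d}  max/min distance ratio: {ratio:.2f}")
# The ratio approaches 1 as d grows: nearly all pairs become almost equally far apart.
```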
```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from scipy.spatial.distance import cdist

# Load the classic Iris dataset
# Each flower has 4 features: sepal length, sepal width, petal length, petal width
iris = load_iris()
X = iris.data    # Shape: (150, 4) - 150 flowers, 4 features each
y = iris.target  # Labels: 0, 1, 2 for three species

# Each row is a point in 4-dimensional feature space
print(f"Feature space dimensionality: {X.shape[1]}")
print(f"Number of data points: {X.shape[0]}")

# Example: First flower as a point in R^4
print(f"\nFirst flower in feature space:")
print(f"  x = [{X[0, 0]:.1f}, {X[0, 1]:.1f}, {X[0, 2]:.1f}, {X[0, 3]:.1f}]")
print(f"  Features: {iris.feature_names}")

# Compute distances between points in feature space
# Distance reflects similarity - closer points are more similar
distances = cdist(X, X, metric='euclidean')
print(f"\nDistance between flower 0 and flower 1: {distances[0, 1]:.2f}")
print(f"Distance between flower 0 and flower 50: {distances[0, 50]:.2f}")
# Note: Flowers 0 and 1 are the same species, so they are likely closer in feature space

# Visualize a 2D projection (we lose information but can see patterns)
fig, ax = plt.subplots(figsize=(10, 8))
scatter = ax.scatter(X[:, 0], X[:, 1], c=y, cmap='viridis',
                     edgecolor='black', s=100, alpha=0.7)
ax.set_xlabel(iris.feature_names[0])
ax.set_ylabel(iris.feature_names[1])
ax.set_title('Iris Dataset: 2D Projection of 4D Feature Space')
plt.colorbar(scatter, label='Species')
plt.show()
```

Key Insight: Features Define What the Model Can Learn
The feature space is the universe within which your ML algorithm operates. If relevant information isn't encoded in the features, it's as if that information doesn't exist. The algorithm cannot discover patterns that aren't represented in its input.
This is why feature engineering—the art and science of creating informative features—is often more impactful than algorithm selection. A simple model with excellent features will frequently outperform a complex model with poor features.
A label (also called target, outcome, response variable, dependent variable, or ground truth) is the value we want our model to predict. In supervised learning, labels are the 'answers' that the model learns to produce given input features.
Formally, for a supervised learning problem, we have a feature space $\mathcal{X}$, a label space $\mathcal{Y}$, and the goal of predicting a label $y \in \mathcal{Y}$ from a feature vector $\mathbf{x} \in \mathcal{X}$.
The nature of $\mathcal{Y}$ (the label space) determines the type of learning problem:
Regression ($\mathcal{Y} = \mathbb{R}$ or $\mathbb{R}^k$): Labels are continuous values
Classification ($\mathcal{Y}$ is a finite set): Labels are discrete categories
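As a small illustration, the same kind of label vector describes either problem type depending on the label space; the values below are invented.

```python
import numpy as np

# Regression: labels are continuous values (e.g., house prices in $1000s)
y_regression = np.array([312.5, 189.0, 455.2, 270.8])

# Classification: labels are discrete categories (e.g., spam vs. not_spam)
y_classification = np.array(["spam", "not_spam", "not_spam", "spam"])

print(y_regression.dtype)            # float64 -> continuous label space, regression
print(np.unique(y_classification))   # finite set of classes -> classification
```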
Defining the label requires careful thought. Predicting 'customer churn' sounds straightforward, but:
- What counts as churned? No activity for 30 days? 90 days? Account cancellation?
- At what point do we make the prediction? At signup? After first purchase?
- Is it binary (churn/not churn) or continuous (probability of churn)?
The definition of the label fundamentally shapes what the model learns and how useful it is in practice.
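To see how much the label definition matters, here is a hedged sketch: the same hypothetical activity records yield different label columns under two different churn definitions.

```python
import pandas as pd

# Hypothetical customer activity data: days since each customer's last activity
customers = pd.DataFrame({
    "customer_id": [1, 2, 3, 4],
    "days_since_last_activity": [10, 45, 120, 400],
    "account_cancelled": [False, False, False, True],
})

# Definition A: churned = no activity for 90+ days
customers["churn_90d"] = (customers["days_since_last_activity"] >= 90).astype(int)

# Definition B: churned = explicit account cancellation
customers["churn_cancel"] = customers["account_cancelled"].astype(int)

print(customers[["customer_id", "churn_90d", "churn_cancel"]])
# The two definitions disagree on customer 3 - the model would learn different things.
```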
With features and labels defined, we can now formalize the supervised learning setup. This framework underlies everything from simple linear regression to deep neural networks.
The Dataset:
A labeled dataset $\mathcal{D}$ consists of $n$ examples:
$$\mathcal{D} = \{(\mathbf{x}^{(1)}, y^{(1)}), (\mathbf{x}^{(2)}, y^{(2)}), \ldots, (\mathbf{x}^{(n)}, y^{(n)})\}$$
This is often written in matrix form:
$$\mathbf{X} = \begin{bmatrix} (\mathbf{x}^{(1)})^T \\ (\mathbf{x}^{(2)})^T \\ \vdots \\ (\mathbf{x}^{(n)})^T \end{bmatrix} \in \mathbb{R}^{n \times d}, \quad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(n)} \end{bmatrix} \in \mathbb{R}^n$$
where each row of $\mathbf{X}$ is the (transposed) feature vector of one example, and the $i$-th entry of $\mathbf{y}$ is the label of the $i$-th example.
| Concept | This Course | Statistics | Deep Learning | Alternative Names |
|---|---|---|---|---|
| Input | $\mathbf{x}$ | $\mathbf{x}$ or $\mathbf{X}$ | $\mathbf{x}$ | features, predictors, covariates |
| Output | $y$ | $y$ or $Y$ | $y$ or $t$ (target) | label, response, outcome |
| Number of samples | $n$ | $n$ | $N$ or $m$ | observations, examples |
| Number of features | $d$ | $p$ | $D$ or $n_x$ | dimensions, attributes |
| Sample index | $(i)$ superscript | $i$ subscript | $(i)$ or $i$ | observation index |
| Feature index | $j$ subscript | $j$ subscript | $j$ subscript | variable index |
| Design matrix | $\mathbf{X}$ | $\mathbf{X}$ | $\mathbf{X}$ | feature matrix, data matrix |
The Learning Goal:
Given the dataset $\mathcal{D}$, the goal is to learn a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ that maps feature vectors to labels. We want $f$ to fit the training data well and, more importantly, to generalize to new, unseen examples.
The tension between these goals—fitting training data while generalizing well—is the central challenge of machine learning and leads directly to concepts like overfitting, regularization, and model selection.
```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

# Load the California housing dataset
housing = fetch_california_housing()

# The Feature Matrix X: shape (n_samples, n_features)
X = housing.data
print(f"Feature Matrix X shape: {X.shape}")
print(f"  n (samples): {X.shape[0]}")
print(f"  d (features): {X.shape[1]}")
print(f"  Feature names: {housing.feature_names}")

# The Label Vector y: shape (n_samples,)
y = housing.target
print(f"\nLabel Vector y shape: {y.shape}")
print(f"  Label: Median house value (in $100,000s)")
print(f"  Range: [{y.min():.2f}, {y.max():.2f}]")

# One example: the i-th data point
i = 0
print(f"\nExample {i}:")
print(f"  Feature vector x^({i}): {X[i]}")
print(f"  Label y^({i}): {y[i]:.3f} (=${y[i] * 100000:.0f})")

# The supervised learning setup
# Goal: Learn f such that f(x) ≈ y
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
print(f"\nTraining set: {X_train.shape[0]} samples")
print(f"Test set: {X_test.shape[0]} samples")

# Learn the function f (a linear model in this case)
model = LinearRegression()
model.fit(X_train, y_train)  # Learning from (X, y) pairs

# Evaluate: Does f generalize to unseen data?
y_pred = model.predict(X_test)
print(f"\nModel Performance on Test Set:")
print(f"  R² score: {r2_score(y_test, y_pred):.4f}")
print(f"  RMSE: ${np.sqrt(mean_squared_error(y_test, y_pred)) * 100000:.0f}")
```

Feature engineering is the process of using domain knowledge to create, transform, and select features that make machine learning algorithms work better. It is often the difference between a mediocre model and a highly effective one.
"Coming up with features is difficult, time-consuming, requires expert knowledge. Applied machine learning is basically feature engineering." — Andrew Ng, Stanford
Feature engineering encompasses several activities: creating new features from raw data, transforming existing ones (scaling, encoding, binning), combining features into interactions, and selecting the most informative subset.
Example: Feature Engineering for Date/Time
Consider a raw timestamp feature: 2024-12-25 14:30:00
A naive approach might encode this as a single number (Unix timestamp). But this loses valuable patterns. A skilled feature engineer might extract:
| Derived Feature | Value | Captures |
|---|---|---|
hour_of_day | 14 | Daily patterns (rush hour, late night) |
day_of_week | 2 (Wednesday) | Weekly patterns (weekday vs weekend) |
month | 12 | Seasonal patterns |
is_weekend | 0 | Binary weekend indicator |
is_holiday | 1 (Christmas) | Holiday effects |
sin_hour | sin(2π × 14/24) | Cyclical encoding (14:00 is close to 13:00 AND 15:00) |
cos_hour | cos(2π × 14/24) | Cyclical encoding companion |
This transformation makes patterns that were implicit in the raw timestamp explicit and learnable.
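As a sketch, several of the derived features in the table can be computed directly from the timestamp with pandas; the holiday flag is omitted here because it requires a holiday calendar.

```python
import numpy as np
import pandas as pd

ts = pd.Timestamp("2024-12-25 14:30:00")

features = {
    "hour_of_day": ts.hour,                        # 14
    "day_of_week": ts.dayofweek,                   # 2 (Monday=0, so Wednesday)
    "month": ts.month,                             # 12
    "is_weekend": int(ts.dayofweek >= 5),          # 0
    "sin_hour": np.sin(2 * np.pi * ts.hour / 24),  # cyclical encoding
    "cos_hour": np.cos(2 * np.pi * ts.hour / 24),  # cyclical encoding companion
}
print(features)
# The sin/cos pair places 23:00 and 00:00 next to each other,
# which a raw hour value (23 vs. 0) would not.
```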
One reason deep learning has been transformative is that neural networks can learn feature representations directly from raw data (images, audio, text). However, even in deep learning, feature engineering still matters: data preprocessing, augmentation, and architecture design are all forms of encoding human knowledge about the problem structure.
```python
import pandas as pd
import numpy as np

# Raw data with mixed feature types
data = pd.DataFrame({
    'price': [250000, 350000, 180000, 420000, 290000],
    'sqft': [1500, 2200, 1100, 2800, 1700],
    'bedrooms': [3, 4, 2, 5, 3],
    'city': ['Seattle', 'Portland', 'Seattle', 'Portland', 'Seattle'],
    'year_built': [1990, 2005, 1975, 2018, 1998],
    'sale_date': pd.to_datetime(['2024-03-15', '2024-06-20', '2024-01-10',
                                 '2024-09-05', '2024-04-22'])
})
print("Raw Data:")
print(data)

# Feature Engineering Examples

# 1. Create derived features
data['age'] = 2024 - data['year_built']                 # More interpretable than year
data['price_per_sqft'] = data['price'] / data['sqft']   # Normalized metric

# 2. Extract temporal features
data['sale_month'] = data['sale_date'].dt.month
data['sale_quarter'] = data['sale_date'].dt.quarter
data['is_spring_summer'] = data['sale_month'].isin([3, 4, 5, 6, 7, 8]).astype(int)

# 3. Create interaction features
data['sqft_per_bedroom'] = data['sqft'] / data['bedrooms']  # Room size indicator

# 4. Bin continuous features
data['size_category'] = pd.cut(data['sqft'],
                               bins=[0, 1200, 1800, 2500, np.inf],
                               labels=['small', 'medium', 'large', 'xlarge'])

print("\nEngineered Features:")
print(data[['age', 'price_per_sqft', 'sale_month', 'sqft_per_bedroom', 'size_category']])

# 5. Encode categorical features for ML
# One-hot encode 'city'
city_dummies = pd.get_dummies(data['city'], prefix='city')
print("\nOne-Hot Encoded City:")
print(city_dummies)

# The final feature matrix would include:
# - Numeric features (scaled): sqft, bedrooms, age, price_per_sqft, sale_month, sqft_per_bedroom
# - Categorical features (one-hot): city_Portland, city_Seattle
# - Binary features: is_spring_summer
print("\n→ Final feature count increased from 5 raw features to 10+ engineered features")
print("→ Each engineered feature captures domain-specific patterns")
```

Understanding the different types of features and their proper handling is essential for effective machine learning. Each feature type requires specific preprocessing and encoding strategies.
Numerical (Continuous) Features take values from a continuous range. They are the most straightforward to use in ML algorithms but still require careful handling.
Key Considerations:
Scaling: Many algorithms (SVM, neural networks, k-NN) are sensitive to feature magnitudes. Features should often be scaled to comparable ranges.
Distribution: Heavily skewed features may benefit from transformations such as log, square-root, or Box-Cox.
Outliers: Extreme values can dominate learning. Consider clipping (winsorizing), robust scaling, or removing values that are clearly data-entry errors (see the sketch below).
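A brief sketch of two of these remedies on synthetic, deliberately skewed values (the numbers are invented); scaling itself is demonstrated in the next snippet.

```python
import numpy as np

# Synthetic right-skewed incomes with one extreme outlier (illustrative values)
income = np.array([30_000, 42_000, 55_000, 61_000, 75_000, 2_000_000], dtype=float)

# Log transform compresses the long right tail
log_income = np.log1p(income)
print(log_income.round(2))

# Winsorizing / clipping: cap extreme values, here at Q3 + 1.5 * IQR
q1, q3 = np.percentile(income, [25, 75])
upper = q3 + 1.5 * (q3 - q1)
capped = np.clip(income, None, upper)
print(capped.round(0))  # the 2,000,000 outlier is capped near 111,000
```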
```python
from sklearn.preprocessing import StandardScaler, MinMaxScaler, RobustScaler
import numpy as np

# Sample data with different scales
X = np.array([
    [25, 50000, 0.15],   # age, income, savings_rate
    [35, 75000, 0.22],
    [45, 120000, 0.35],
    [28, 45000, 0.08],
    [55, 200000, 0.45]
])

# Standard Scaling: mean = 0, std = 1
scaler = StandardScaler()
X_standard = scaler.fit_transform(X)
print("Standard Scaled (z-scores):")
print(X_standard.round(2))

# Min-Max Scaling: to [0, 1] range
scaler = MinMaxScaler()
X_minmax = scaler.fit_transform(X)
print("\nMin-Max Scaled:")
print(X_minmax.round(2))

# Robust Scaling: uses median and IQR, so it is less sensitive to outliers
scaler = RobustScaler()
X_robust = scaler.fit_transform(X)
print("\nRobust Scaled:")
print(X_robust.round(2))
```

The concepts of features and labels apply universally across machine learning applications. Let's examine how they manifest in different domains:
| Domain | Example Task | Sample Features | Label | Challenge |
|---|---|---|---|---|
| Healthcare | Disease prediction | Age, symptoms, lab values, genetic markers, imaging features | Disease present (binary) or severity (continuous) | Class imbalance, missing data, interpretability requirements |
| Finance | Credit scoring | Income, debt ratio, payment history, employment length, credit utilization | Default probability or risk category | Fairness constraints, regulatory requirements, concept drift |
| E-commerce | Product recommendation | User history, item attributes, session behavior, temporal patterns | Purchase probability or rating | Cold start problem, implicit feedback, real-time requirements |
| Autonomous Vehicles | Object detection | Pixel values from cameras, LiDAR point clouds, radar returns | Bounding boxes + object classes | Real-time constraints, safety-critical, rare events |
| Natural Language | Sentiment analysis | Word embeddings, syntactic features, n-grams, sentence length | Positive/negative/neutral sentiment | Sarcasm, context-dependence, domain adaptation |
| Manufacturing | Defect detection | Sensor readings, process parameters, image features, timing data | Defect type or pass/fail | Extreme class imbalance, interpretability, sensor noise |
In many real applications, defining effective features and appropriate labels requires deep domain expertise. A radiologist understands which image features indicate malignancy. A financial analyst knows which market indicators predict volatility. This domain knowledge is often more valuable than ML algorithm expertise.
Case Study: Email Spam Detection
Let's trace through a complete example of defining features and labels for spam detection:
The Label:
y ∈ {spam, not_spam} or {1, 0}

Feature Engineering Process:
Step 1: Text-Based Features (e.g., frequency of trigger words such as 'free' or 'winner', ratio of capital letters, number of exclamation marks)
Step 2: Structural Features (e.g., number of links, presence of attachments, HTML-to-text ratio)
Step 3: Sender Features (e.g., sender domain reputation, whether the sender appears in the recipient's contacts)
Step 4: Behavioral Features (e.g., how often similar messages from this sender were previously marked as spam)
The final feature vector might have hundreds of dimensions, each capturing a different signal of spamminess. The model learns which combination of features best predicts the label.
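As a hedged sketch of what a tiny slice of such an extractor might look like, the heuristics and thresholds below are purely illustrative, not a production spam filter.

```python
import re

def extract_spam_features(email_text: str, sender_domain: str) -> dict:
    """Illustrative feature extraction for spam detection (toy heuristics)."""
    return {
        # Text-based signals
        "contains_free": int("free" in email_text.lower()),
        "exclamation_count": email_text.count("!"),
        # Structural signals
        "capital_ratio": sum(c.isupper() for c in email_text) / max(len(email_text), 1),
        "num_links": len(re.findall(r"https?://", email_text)),
        # Sender signals (toy allow-list)
        "known_domain": int(sender_domain in {"example.com", "university.edu"}),
    }

features = extract_spam_features(
    "CONGRATULATIONS!!! Claim your FREE prize now: http://spam.example",
    "spam.example",
)
print(features)
```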
Even experienced practitioners make mistakes when defining features and labels. The most critical pitfall is data leakage: using features that encode information about the label which would not be available at prediction time.
Data leakage is particularly insidious because it makes models appear to perform brilliantly during evaluation but fail catastrophically in production. Always ask: 'At the moment of prediction, would I actually have access to this feature?' If the answer is no—or even 'maybe not'—exclude it or engineer it differently.
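As a minimal illustration of that question, consider a hypothetical churn table in which one column is only populated after the customer has already churned.

```python
import pandas as pd

# Hypothetical dataset: 'exit_survey_score' is collected only AFTER a customer churns,
# so it perfectly (and uselessly) predicts the label - a classic leaky feature.
df = pd.DataFrame({
    "monthly_spend": [120, 80, 45, 200, 30],
    "support_tickets": [1, 0, 4, 2, 6],
    "exit_survey_score": [None, None, 2.0, None, 1.5],  # only churned customers have it
    "churned": [0, 0, 1, 0, 1],
})

# At prediction time the survey score does not exist yet, so it must be excluded
leaky_cols = ["exit_survey_score"]
X = df.drop(columns=["churned"] + leaky_cols)
y = df["churned"]
print(X.columns.tolist())  # ['monthly_spend', 'support_tickets']
```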
We have established the foundational vocabulary for how machines perceive and process data. Let's consolidate the key concepts:
- A feature is a measurable property $\phi: \mathcal{X} \rightarrow \mathbb{R}$; combining features yields a feature vector $\mathbf{x} \in \mathbb{R}^d$, a point in feature space.
- A label $y \in \mathcal{Y}$ is the value we want to predict; the label space determines whether the task is regression or classification.
- A labeled dataset $\mathcal{D} = \{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^{n}$ is the raw material from which a function $f: \mathcal{X} \rightarrow \mathcal{Y}$ is learned.
- Feature engineering (translating domain knowledge into informative features) often matters more than the choice of algorithm, and pitfalls such as data leakage must be avoided.
What's Next:
With features and labels understood, we now turn to a critical question: How do we properly evaluate our models? The next page explores training, validation, and test sets—the methodology that ensures our models generalize beyond the data they were trained on.
You now understand features and labels—the fundamental building blocks of supervised machine learning. You can identify feature types, apply appropriate transformations, and avoid common pitfalls. Next, we'll learn how to split data properly for robust model evaluation.