Machine learning algorithms operate on numbers. But much of the world's data is text—product descriptions, customer reviews, support tickets, medical notes—and categories—product types, user segments, geographic regions. Bridging the gap between human language and numerical computation is arguably the most impactful domain in feature engineering.
A model predicting customer churn gains little from raw text: 'The product was not what I expected.' But transform that into sentiment scores, keyword flags, and semantic embeddings, and suddenly the model sees what the customer felt.
Similarly, a 'product_category' column with 10,000 unique values can't be naively one-hot encoded (10,000 sparse columns!) or label-encoded (false ordinal relationships). The engineering choices here determine whether your model learns meaningful patterns or drowns in noise.
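To make the dimensionality problem concrete, here is a minimal sketch (the category values are invented for illustration) showing what each naive encoding would produce for a high-cardinality column:

```python
import pandas as pd

# Hypothetical high-cardinality categorical column: 5,000 distinct values
categories = pd.Series([f"CAT_{i % 5000}" for i in range(20000)],
                       name="product_category")

k = categories.nunique()
print(f"One-hot encoding would create {k} sparse columns")

# Label encoding collapses to a single column of integer codes,
# but the codes impose an ordering that has no real meaning
codes = categories.astype("category").cat.codes
print(codes.head())
```

A linear model would treat those integer codes as magnitudes, learning spurious "larger category id means larger effect" relationships; tree models can tolerate them, which is why the taxonomy below restricts label encoding to trees.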
This page covers the full spectrum of text and categorical feature engineering: from basic one-hot and label encoding to target encoding, hashing tricks, and learned embeddings. You'll learn text preprocessing pipelines, bag-of-words and TF-IDF representations, and how to leverage pre-trained language model embeddings. By the end, you'll handle any text or categorical feature with confidence.
Categorical features require encoding into numerical form. The choice of encoding significantly impacts model performance and should match both the feature's semantics and the model type.
Encoding Taxonomy:
| Method | Output Dimension | Preserves Meaning | Best For |
|---|---|---|---|
| One-Hot Encoding | k (number of categories) | No ordinal assumption | Low cardinality (< 50); linear models |
| Dummy Encoding | k - 1 | Avoids multicollinearity | Regression models |
| Label Encoding | 1 | Implies false order | Tree models only; never for linear |
| Ordinal Encoding | 1 | Uses real order | Naturally ordered categories |
| Target Encoding | 1 | Captures predictive value | High cardinality; with regularization |
| Frequency Encoding | 1 | Captures popularity | When frequency correlates with target |
| Hash Encoding | Configurable | Handles unseen; some collisions | Very high cardinality; streaming |
| Embedding Layers | Configurable dense | Learned representations | Neural networks; millions of categories |
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from category_encoders import TargetEncoder, HashingEncoder, LeaveOneOutEncoder

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
    'product_id': [f'PROD_{i:04d}' for i in np.random.randint(1, 10000, 8)],
    'target': [0, 1, 1, 1, 0, 0, 1, 1]
})

# 1. One-Hot Encoding
print("=== One-Hot Encoding ===")
onehot = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
color_onehot = onehot.fit_transform(df[['color']])
print(f"Shape: {color_onehot.shape}")
print(pd.DataFrame(color_onehot, columns=onehot.get_feature_names_out(['color'])))

# 2. Ordinal Encoding (for ordered categories)
print("=== Ordinal Encoding ===")
ordinal = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])  # Explicit order
df['size_ordinal'] = ordinal.fit_transform(df[['size']])
print(df[['size', 'size_ordinal']])

# 3. Label Encoding (simple numeric mapping)
print("=== Label Encoding ===")
label = LabelEncoder()
df['color_label'] = label.fit_transform(df['color'])
print(f"Mapping: {dict(zip(label.classes_, range(len(label.classes_))))}")

# 4. Target Encoding (mean target per category)
print("=== Target Encoding ===")
target_enc = TargetEncoder(smoothing=1.0)  # Smoothing prevents overfitting
df['color_target_enc'] = target_enc.fit_transform(df['color'], df['target'])
print(df[['color', 'color_target_enc', 'target']].drop_duplicates())

# 5. Frequency Encoding
print("=== Frequency Encoding ===")
freq_map = df['color'].value_counts(normalize=True).to_dict()
df['color_freq'] = df['color'].map(freq_map)
print(df[['color', 'color_freq']].drop_duplicates())

# 6. Hash Encoding
print("=== Hash Encoding ===")
hasher = HashingEncoder(n_components=5)  # Fixed output dimension
product_hashed = hasher.fit_transform(df[['product_id']])
print(f"High cardinality ({df['product_id'].nunique()} unique) -> {product_hashed.shape[1]} columns")

# 7. Leave-One-Out Encoding (target encoding variant)
print("=== Leave-One-Out Encoding ===")
loo_enc = LeaveOneOutEncoder()
df['color_loo'] = loo_enc.fit_transform(df['color'], df['target'])
print("LOO encoding excludes current row from mean calculation to reduce overfitting")
```

Target encoding uses the target variable to create features—a potential source of data leakage. Always use cross-validation or leave-one-out approaches: compute the encoding for each row using only data from OTHER rows. The category_encoders library handles this correctly, but manual implementations often don't.
High cardinality categories—user IDs, product SKUs, zip codes, IP addresses—pose unique challenges: naive one-hot encoding explodes the feature space, rare categories have too few samples to estimate reliably, and unseen categories appear at inference time.
Strategies for High Cardinality:
```python
import pandas as pd
import numpy as np

def frequency_threshold_encoding(
    df: pd.DataFrame,
    cat_col: str,
    min_frequency: int = 10,
    other_label: str = '_OTHER_'
) -> pd.Series:
    """
    Keep only categories with >= min_frequency occurrences.
    Rare categories become 'other'.
    """
    freq = df[cat_col].value_counts()
    common = freq[freq >= min_frequency].index
    return df[cat_col].where(df[cat_col].isin(common), other_label)

def smoothed_target_encoding(
    df: pd.DataFrame,
    cat_col: str,
    target_col: str,
    smoothing: float = 10.0,
    noise: float = 0.01
) -> pd.Series:
    """
    Target encoding with Bayesian smoothing.
    Smoothing blends category mean toward global mean based on sample size.
    Formula: (n * category_mean + smoothing * global_mean) / (n + smoothing)
    """
    global_mean = df[target_col].mean()

    # Compute category statistics
    agg = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    agg.columns = ['cat_mean', 'cat_count']

    # Apply smoothing
    agg['smoothed'] = (
        agg['cat_count'] * agg['cat_mean'] + smoothing * global_mean
    ) / (agg['cat_count'] + smoothing)

    # Map to original rows
    encoded = df[cat_col].map(agg['smoothed'])

    # Add small noise to prevent overfitting
    if noise > 0:
        encoded += np.random.normal(0, noise, len(encoded))

    return encoded

def hierarchical_encoding(
    df: pd.DataFrame,
    detail_col: str,
    parent_col: str,
    target_col: str,
    min_detail_count: int = 5
) -> pd.Series:
    """
    Use parent encoding when the detail category is too rare.
    Example: Use 'product_category' encoding for rare 'product_id' values.
    """
    # Count occurrences at detail level
    detail_counts = df[detail_col].value_counts()

    # Compute target mean at both levels
    detail_means = df.groupby(detail_col)[target_col].mean()
    parent_means = df.groupby(parent_col)[target_col].mean()

    def encode_row(row):
        if detail_counts.get(row[detail_col], 0) >= min_detail_count:
            return detail_means[row[detail_col]]
        else:
            return parent_means[row[parent_col]]

    return df.apply(encode_row, axis=1)

class EmbeddingLookup:
    """
    Simple entity embedding using pre-computed statistics.
    For neural network embedding layers, see deep learning frameworks.
    """

    def __init__(self, embedding_dim: int = 8):
        self.embedding_dim = embedding_dim
        self.embeddings = {}

    def fit(self, categories: pd.Series, features: pd.DataFrame):
        """
        Create embedding from average features per category.
        """
        from sklearn.decomposition import PCA

        combined = pd.concat([categories.reset_index(drop=True),
                              features.reset_index(drop=True)], axis=1)
        combined.columns = ['category'] + list(features.columns)

        # Average features per category
        cat_features = combined.groupby('category').mean()

        # Reduce to embedding dimension via PCA
        if len(cat_features.columns) > self.embedding_dim:
            pca = PCA(n_components=self.embedding_dim)
            embeddings = pca.fit_transform(cat_features)
        else:
            embeddings = cat_features.values

        self.embeddings = dict(zip(cat_features.index, embeddings))
        self.default_embedding = np.zeros(self.embedding_dim)
        return self

    def transform(self, categories: pd.Series) -> np.ndarray:
        return np.array([
            self.embeddings.get(cat, self.default_embedding)
            for cat in categories
        ])

# Example usage with product data
np.random.seed(42)
n_products = 10000
n_samples = 50000

product_ids = [f'PROD_{i:05d}' for i in range(n_products)]
df = pd.DataFrame({
    'product_id': np.random.choice(product_ids, n_samples,
                                   p=np.random.dirichlet(np.ones(n_products))),  # Power-law-ish
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Food'], n_samples),
    'price': np.random.lognormal(3, 1, n_samples),
    'rating': np.random.uniform(1, 5, n_samples),
    'purchased': np.random.binomial(1, 0.1, n_samples)
})

# Apply frequency thresholding
df['product_id_thresholded'] = frequency_threshold_encoding(df, 'product_id', min_frequency=10)
print(f"Original cardinality: {df['product_id'].nunique()}")
print(f"After thresholding: {df['product_id_thresholded'].nunique()}")

# Apply smoothed target encoding
df['product_id_target_enc'] = smoothed_target_encoding(
    df, 'product_id', 'purchased', smoothing=20.0
)
```

Most high-cardinality features follow power-law distributions: a few categories are very common, most are very rare. This means aggressive thresholding (keeping the top 100 of 10,000) often retains the large majority of samples while dramatically reducing complexity. Always check the frequency distribution before choosing an encoding strategy.
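Checking that frequency distribution takes only a few lines. A minimal sketch, using synthetic Zipf-like data invented for illustration, that measures how many rows the top 100 categories cover:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cats, n_rows = 10_000, 100_000

# Zipf-like sampling weights: a few categories dominate, most are rare
weights = 1.0 / np.arange(1, n_cats + 1)
weights /= weights.sum()
col = pd.Series(rng.choice(n_cats, size=n_rows, p=weights))

# What fraction of rows do the 100 most common categories account for?
freq = col.value_counts()
top_100_coverage = freq.head(100).sum() / n_rows
print(f"Top 100 of {col.nunique()} categories cover {top_100_coverage:.1%} of rows")
```

The coverage number tells you how aggressive a frequency threshold can be: if it is high, a small "keep top-K, bucket the rest" encoding loses very little information.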
Raw text requires substantial preprocessing before feature extraction. A well-designed pipeline handles:
Standard Preprocessing Steps:
| Step | Purpose | Example |
|---|---|---|
| Lowercasing | Normalize case variations | 'iPhone' → 'iphone' |
| Punctuation removal | Remove non-informative characters | 'Hello!' → 'Hello' |
| Tokenization | Split text into words/tokens | 'good product' → ['good', 'product'] |
| Stop word removal | Remove common, low-info words | Remove 'the', 'is', 'and' |
| Stemming | Reduce to word stems (crude) | 'running' → 'run' |
| Lemmatization | Reduce to dictionary forms | 'better' → 'good' |
| Number handling | Normalize or remove numbers | '$50' → '<PRICE>' or removal |
| Spell correction | Fix common misspellings | 'recieve' → 'receive' |
| Contraction expansion | Expand contracted forms | "don't" → 'do not' |
```python
import re
import string
from typing import List

# Using NLTK for NLP preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required resources (run once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

class TextPreprocessor:
    """
    Configurable text preprocessing pipeline.
    """

    def __init__(
        self,
        lowercase: bool = True,
        remove_punctuation: bool = True,
        remove_numbers: bool = False,
        remove_stopwords: bool = True,
        stemming: bool = False,
        lemmatization: bool = True,
        min_word_length: int = 2,
        custom_stopwords: List[str] = None
    ):
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.remove_numbers = remove_numbers
        self.remove_stopwords = remove_stopwords
        self.stemming = stemming
        self.lemmatization = lemmatization
        self.min_word_length = min_word_length

        # Initialize tools
        self.stemmer = PorterStemmer() if stemming else None
        self.lemmatizer = WordNetLemmatizer() if lemmatization else None
        self.stopwords = set(stopwords.words('english'))
        if custom_stopwords:
            self.stopwords.update(custom_stopwords)

    def preprocess(self, text: str) -> str:
        """
        Apply the full preprocessing pipeline to one text.
        """
        if not isinstance(text, str):
            return ""

        # Lowercase
        if self.lowercase:
            text = text.lower()

        # Remove URLs
        text = re.sub(r'https?://\S+|www\.\S+', '', text)

        # Remove emails
        text = re.sub(r'\S+@\S+', '', text)

        # Handle contractions
        contractions = {
            "don't": "do not", "won't": "will not", "can't": "cannot",
            "it's": "it is", "i'm": "i am", "you're": "you are",
            "they're": "they are", "we're": "we are", "i've": "i have",
            "you've": "you have", "we've": "we have", "isn't": "is not",
            "aren't": "are not", "wasn't": "was not", "weren't": "were not"
        }
        for contraction, expansion in contractions.items():
            text = text.replace(contraction, expansion)

        # Remove punctuation
        if self.remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))

        # Remove numbers
        if self.remove_numbers:
            text = re.sub(r'\d+', '', text)

        # Tokenize
        tokens = word_tokenize(text)

        # Remove stopwords
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stopwords]

        # Apply stemming or lemmatization
        if self.stemmer:
            tokens = [self.stemmer.stem(t) for t in tokens]
        elif self.lemmatizer:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]

        # Filter by length
        tokens = [t for t in tokens if len(t) >= self.min_word_length]

        return ' '.join(tokens)

    def preprocess_batch(self, texts: List[str]) -> List[str]:
        """
        Preprocess a list of texts.
        """
        return [self.preprocess(t) for t in texts]

# Example usage
preprocessor = TextPreprocessor(
    lowercase=True,
    remove_punctuation=True,
    remove_stopwords=True,
    lemmatization=True
)

sample_texts = [
    "I absolutely LOVE this product! It's amazing and works great!!!",
    "The item didn't match the description. Very disappointed :(",
    "Fast shipping, good price. Would recommend to others.",
    "Don't buy this! It broke after 2 days. Total waste of $50."
]

for original in sample_texts:
    processed = preprocessor.preprocess(original)
    print(f"Original:  {original}")
    print(f"Processed: {processed}")
```

Aggressive preprocessing (stemming, stop word removal) works well for bag-of-words and TF-IDF. For modern embeddings (BERT, sentence transformers), minimal preprocessing is often better—these models handle casing, punctuation, and context internally. Always test preprocessing choices empirically.
Classic text representations convert documents into numerical vectors based on word occurrences.
Bag of Words (BoW):
Each document becomes a vector of word counts (or binary presence). Order is lost—hence 'bag.'
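The idea fits in a few lines without any library. A minimal sketch (toy documents invented for illustration) that builds count vectors over a shared vocabulary:

```python
from collections import Counter

docs = ["good product good price", "bad product"]

# Shared vocabulary across the corpus, in a fixed order
vocab = sorted({w for d in docs for w in d.split()})

# Each document becomes a vector of word counts over that vocabulary
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vocab)    # ['bad', 'good', 'price', 'product']
print(vectors)  # [[0, 2, 1, 1], [1, 0, 0, 1]]
```

Note that "good product" and "product good" would produce identical vectors: the representation keeps counts and discards order.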
TF-IDF (Term Frequency–Inverse Document Frequency):
Weights words by:
$$TFIDF(t, d) = TF(t, d) \times \log\frac{N}{DF(t)}$$
Where N = total documents, DF(t) = documents containing term t.
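The formula can be computed by hand to build intuition. A minimal sketch, using raw term counts and the natural log exactly as in the formula above (scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers will differ slightly):

```python
import math

docs = [
    "machine learning is fun",
    "deep learning uses networks",
    "vision interprets images",
]
N = len(docs)

def tfidf(term, doc):
    tf = doc.split().count(term)                # term frequency in this document
    df = sum(term in d.split() for d in docs)   # documents containing the term
    return tf * math.log(N / df)

# 'learning' appears in 2 of 3 docs -> low IDF; 'vision' in 1 of 3 -> high IDF
print(round(tfidf("learning", docs[0]), 3))  # 1 * ln(3/2) ≈ 0.405
print(round(tfidf("vision", docs[2]), 3))    # 1 * ln(3/1) ≈ 1.099
```

A term that appears in every document gets log(N/N) = 0: ubiquitous words are weighted down to nothing, which is the whole point of the IDF factor.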
```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample documents
documents = [
    "machine learning is a branch of artificial intelligence",
    "deep learning uses neural networks for learning",
    "natural language processing enables machines to understand text",
    "computer vision allows machines to interpret images",
    "reinforcement learning trains agents to make decisions"
]

# 1. Bag of Words (CountVectorizer)
print("=== Bag of Words ===")
count_vec = CountVectorizer(
    max_features=20,      # Limit vocabulary size
    min_df=1,             # Minimum document frequency
    max_df=0.9,           # Maximum document frequency (remove too common)
    ngram_range=(1, 2),   # Include unigrams and bigrams
    stop_words='english'
)

bow_matrix = count_vec.fit_transform(documents)
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=count_vec.get_feature_names_out()
)
print(f"Vocabulary size: {len(count_vec.vocabulary_)}")
print(bow_df)

# 2. TF-IDF
print("=== TF-IDF ===")
tfidf_vec = TfidfVectorizer(
    max_features=20,
    min_df=1,
    max_df=0.9,
    ngram_range=(1, 2),
    stop_words='english',
    sublinear_tf=True,  # Use 1 + log(tf) instead of raw tf
    norm='l2'           # L2 normalize each document vector
)

tfidf_matrix = tfidf_vec.fit_transform(documents)
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vec.get_feature_names_out()
)
print(f"Vocabulary size: {len(tfidf_vec.vocabulary_)}")
print(tfidf_df.round(3))

# 3. Compare word importance
print("=== TF-IDF Feature Importance ===")
feature_importance = pd.DataFrame({
    'term': tfidf_vec.get_feature_names_out(),
    'idf': tfidf_vec.idf_
}).sort_values('idf', ascending=False)
print("Most distinctive terms (highest IDF):")
print(feature_importance.head(10))

# 4. Advanced TF-IDF with custom preprocessing
class CustomTfidfVectorizer(TfidfVectorizer):
    """
    TF-IDF with domain-specific enhancements.
    """

    def __init__(self, domain_terms: list = None, **kwargs):
        super().__init__(**kwargs)
        self.domain_terms = domain_terms or []

    def build_analyzer(self):
        base_analyzer = super().build_analyzer()

        def custom_analyzer(doc):
            tokens = base_analyzer(doc)
            # Add domain-specific flag tokens
            for term in self.domain_terms:
                if term.lower() in doc.lower():
                    tokens.append(f'HAS_{term.upper()}')
            return tokens

        return custom_analyzer

# Usage with domain terms
domain_vec = CustomTfidfVectorizer(
    domain_terms=['neural', 'learning', 'AI'],
    max_features=30
)
domain_features = domain_vec.fit_transform(documents)
print(f"With domain terms: {domain_vec.get_feature_names_out()[:10]}...")
```

Unigrams treat 'not good' as two separate tokens, losing the negation. Bigrams capture 'not_good' as a single feature. Use ngram_range=(1, 2) or (1, 3) to include phrases. This often improves sentiment analysis significantly, at the cost of larger feature spaces.
TF-IDF produces sparse, high-dimensional vectors where most entries are zero and semantically similar words are unrelated. Embeddings map text to dense, low-dimensional vectors where semantics are encoded—similar meanings have similar vectors.
Embedding Types:
| Method | Level | Dimension | Pros/Cons |
|---|---|---|---|
| Word2Vec | Word | 100-300 | Fast, captures analogies; ignores context |
| GloVe | Word | 50-300 | Captures global statistics; pre-trained available |
| FastText | Word (subword) | 100-300 | Handles OOV via character n-grams |
| Doc2Vec | Document | 100-500 | Learns document vectors; training required |
| Sentence-BERT | Sentence | 384-768 | State-of-the-art similarity; GPU recommended for speed |
| OpenAI Embeddings | Text | 1536-3072 | High quality; API cost; no local training |
```python
import numpy as np
import pandas as pd

# Using sentence-transformers for modern embeddings
from sentence_transformers import SentenceTransformer

# Sample texts
texts = [
    "The product quality is excellent and exceeded my expectations",
    "Great item, works perfectly and arrived on time",
    "Terrible experience, the item was broken and customer service was unhelpful",
    "Waste of money, completely disappointed with this purchase",
    "Average product, nothing special but works as described"
]

# 1. Using Sentence-BERT (all-MiniLM-L6-v2 is fast and good)
print("=== Sentence-BERT Embeddings ===")
model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings.shape}")  # (5, 384)

# 2. Compute similarity between texts
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
sim_df = pd.DataFrame(
    similarity_matrix,
    index=[f'Text {i}' for i in range(len(texts))],
    columns=[f'Text {i}' for i in range(len(texts))]
)
print("Similarity Matrix:")
print(sim_df.round(3))

# 3. Using embeddings as features
def create_embedding_features(
    texts: list,
    model_name: str = 'all-MiniLM-L6-v2'
) -> pd.DataFrame:
    """
    Convert texts to embedding features.
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts, show_progress_bar=True)

    # Create feature column names
    columns = [f'emb_{i}' for i in range(embeddings.shape[1])]
    return pd.DataFrame(embeddings, columns=columns)

# 4. Aggregating word embeddings (alternative to sentence embeddings)
def average_word_embeddings(
    text: str,
    word_vectors: dict,  # word -> vector mapping
    vector_dim: int = 300
) -> np.ndarray:
    """
    Create a document vector by averaging word vectors.
    Can weight by TF-IDF for TF-IDF weighted embeddings.
    """
    words = text.lower().split()
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return np.zeros(vector_dim)
    return np.mean(vectors, axis=0)

# 5. Clustering texts by embedding similarity
from sklearn.cluster import KMeans

def cluster_texts_by_embedding(
    texts: list,
    n_clusters: int = 3,
    model_name: str = 'all-MiniLM-L6-v2'
) -> np.ndarray:
    """
    Cluster texts based on embedding similarity.
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts)

    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    return clusters

# Example clustering
clusters = cluster_texts_by_embedding(texts, n_clusters=2)
for text, cluster in zip(texts, clusters):
    print(f"Cluster {cluster}: {text[:50]}...")
```

For most applications, 'all-MiniLM-L6-v2' offers the best speed/quality tradeoff. For maximum quality, use 'all-mpnet-base-v2'. For multilingual text, use 'paraphrase-multilingual-MiniLM-L12-v2'. Check the sentence-transformers documentation for specialized models (legal, medical, code).
Beyond general text representations, domain-specific features often provide more direct signal for prediction tasks.
```python
import pandas as pd
import numpy as np
import re
from textblob import TextBlob

def extract_review_features(text: str) -> dict:
    """
    Extract domain-specific features from product reviews.
    """
    if not isinstance(text, str):
        return {}

    features = {}

    # Basic statistics
    features['char_count'] = len(text)
    features['word_count'] = len(text.split())
    features['sentence_count'] = len(re.findall(r'[.!?]+', text)) or 1
    features['avg_word_length'] = np.mean([len(w) for w in text.split()]) if text.split() else 0

    # Punctuation patterns
    features['exclamation_count'] = text.count('!')
    features['question_count'] = text.count('?')
    features['uppercase_ratio'] = sum(1 for c in text if c.isupper()) / (len(text) + 1)

    # Sentiment (using TextBlob)
    blob = TextBlob(text)
    features['sentiment_polarity'] = blob.sentiment.polarity          # -1 to 1
    features['sentiment_subjectivity'] = blob.sentiment.subjectivity  # 0 to 1

    # Keyword presence
    positive_words = ['excellent', 'amazing', 'love', 'great', 'perfect', 'best', 'fantastic']
    negative_words = ['terrible', 'awful', 'hate', 'worst', 'broken', 'disappointed', 'waste']
    text_lower = text.lower()
    features['positive_word_count'] = sum(1 for w in positive_words if w in text_lower)
    features['negative_word_count'] = sum(1 for w in negative_words if w in text_lower)

    # Comparison/contrast indicators
    features['mentions_competitor'] = int(any(w in text_lower for w in ['amazon', 'competitor', 'other brand']))
    features['mentions_price'] = int(any(w in text_lower for w in ['price', 'cost', 'expensive', 'cheap', '$']))
    features['mentions_shipping'] = int(any(w in text_lower for w in ['shipping', 'delivery', 'arrived']))
    features['mentions_return'] = int(any(w in text_lower for w in ['return', 'refund', 'exchange']))

    # Review structure
    features['has_rating_keyword'] = int(any(w in text_lower for w in ['star', 'rating', 'out of']))
    features['asks_question'] = int(features['question_count'] > 0)
    features['is_shouting'] = int(features['uppercase_ratio'] > 0.3)

    return features

def extract_support_ticket_features(text: str) -> dict:
    """
    Extract features specific to customer support tickets.
    """
    if not isinstance(text, str):
        return {}

    features = {}
    text_lower = text.lower()

    # Urgency indicators
    urgent_words = ['urgent', 'asap', 'immediately', 'critical', 'emergency', 'now']
    features['urgency_score'] = sum(1 for w in urgent_words if w in text_lower)

    # Issue categories (rule-based)
    features['is_billing_issue'] = int(any(w in text_lower for w in ['charge', 'invoice', 'bill', 'payment', 'refund']))
    features['is_technical_issue'] = int(any(w in text_lower for w in ['error', 'bug', 'crash', 'not working', 'broken']))
    features['is_account_issue'] = int(any(w in text_lower for w in ['password', 'login', 'account', 'access']))

    # Customer sentiment
    features['frustration_indicators'] = sum(
        1 for w in ['frustrated', 'angry', 'annoyed', 'unacceptable'] if w in text_lower
    )

    # Escalation likelihood
    features['mentions_cancel'] = int('cancel' in text_lower)
    features['mentions_legal'] = int(any(w in text_lower for w in ['lawyer', 'legal', 'lawsuit', 'sue']))
    features['mentions_social_media'] = int(any(w in text_lower for w in ['twitter', 'facebook', 'social media', 'review']))

    return features

# Apply to DataFrame
def batch_extract_text_features(
    df: pd.DataFrame,
    text_col: str,
    feature_extractor: callable
) -> pd.DataFrame:
    """
    Apply a feature extractor to every row of a text column.
    """
    features = df[text_col].apply(feature_extractor)
    return pd.DataFrame(features.tolist())

# Example usage
reviews = pd.DataFrame({
    'text': [
        "ABSOLUTELY TERRIBLE!! Broke after 2 days. Want my money back!",
        "Great product, fast shipping. Would recommend to friends.",
        "It's okay I guess. Works as described but nothing special.",
        "Love love love this! Best purchase ever, exceeded expectations!"
    ]
})

review_features = batch_extract_text_features(reviews, 'text', extract_review_features)
print(review_features[['sentiment_polarity', 'positive_word_count',
                       'negative_word_count', 'is_shouting']])
```

Text and categorical features require the most nuanced feature engineering of any data type. The choice among encoding strategies, and between sparse and dense representations, can determine whether your model succeeds.
You've completed the Feature Engineering Mastery module! You now have a comprehensive toolkit for engineering features across all data types: numerical, categorical, temporal, and textual. The next modules in this chapter will cover automated feature engineering, feature selection theory, and production-grade feature stores.