Machine learning algorithms operate on numbers. But much of the world's data is text—product descriptions, customer reviews, support tickets, medical notes—and categories—product types, user segments, geographic regions. Bridging the gap between human language and numerical computation is arguably the most impactful domain in feature engineering.
A model predicting customer churn gains little from raw text: 'The product was not what I expected.' But transform that into sentiment scores, keyword flags, and semantic embeddings, and suddenly the model sees what the customer felt.
Similarly, a 'product_category' column with 10,000 unique values can't be naively one-hot encoded (10,000 sparse columns!) or label-encoded (false ordinal relationships). The engineering choices here determine whether your model learns meaningful patterns or drowns in noise.
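To make the dimensionality problem concrete, here is a minimal sketch (the category values are invented for illustration) showing what each naive encoding would produce for a high-cardinality column:

```python
import pandas as pd

# Hypothetical high-cardinality categorical column: 5,000 distinct values
categories = pd.Series([f"CAT_{i % 5000}" for i in range(20000)],
                       name="product_category")

k = categories.nunique()
print(f"One-hot encoding would create {k} sparse columns")

# Label encoding collapses to a single column of integer codes,
# but the codes impose an ordering that has no real meaning
codes = categories.astype("category").cat.codes
print(codes.head())
```

A linear model would treat those integer codes as magnitudes, learning spurious "larger category id means larger effect" relationships; tree models can tolerate them, which is why the taxonomy below restricts label encoding to trees.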
This page covers the full spectrum of text and categorical feature engineering: from basic one-hot and label encoding to target encoding, hashing tricks, and learned embeddings. You'll learn text preprocessing pipelines, bag-of-words and TF-IDF representations, and how to leverage pre-trained language model embeddings. By the end, you'll handle any text or categorical feature with confidence.
Categorical features require encoding into numerical form. The choice of encoding significantly impacts model performance and should match both the feature's semantics and the model type.
Encoding Taxonomy:
| Method | Output Dimension | Preserves Meaning | Best For |
|---|---|---|---|
| One-Hot Encoding | k (number of categories) | No ordinal assumption | Low cardinality (< 50); linear models |
| Dummy Encoding | k - 1 | Avoids multicollinearity | Regression models |
| Label Encoding | 1 | Implies false order | Tree models only; never for linear |
| Ordinal Encoding | 1 | Uses real order | Naturally ordered categories |
| Target Encoding | 1 | Captures predictive value | High cardinality; with regularization |
| Frequency Encoding | 1 | Captures popularity | When frequency correlates with target |
| Hash Encoding | Configurable | Handles unseen; some collisions | Very high cardinality; streaming |
| Embedding Layers | Configurable dense | Learned representations | Neural networks; millions of categories |
```python
import pandas as pd
import numpy as np
from sklearn.preprocessing import OneHotEncoder, LabelEncoder, OrdinalEncoder
from category_encoders import TargetEncoder, HashingEncoder, LeaveOneOutEncoder

# Sample data
df = pd.DataFrame({
    'color': ['red', 'blue', 'green', 'red', 'blue', 'green', 'red', 'blue'],
    'size': ['S', 'M', 'L', 'XL', 'S', 'M', 'L', 'XL'],
    'product_id': [f'PROD_{i:04d}' for i in np.random.randint(1, 10000, 8)],
    'target': [0, 1, 1, 1, 0, 0, 1, 1]
})

# 1. One-Hot Encoding
print("=== One-Hot Encoding ===")
onehot = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
color_onehot = onehot.fit_transform(df[['color']])
print(f"Shape: {color_onehot.shape}")
print(pd.DataFrame(color_onehot, columns=onehot.get_feature_names_out(['color'])))

# 2. Ordinal Encoding (for ordered categories)
print("=== Ordinal Encoding ===")
ordinal = OrdinalEncoder(categories=[['S', 'M', 'L', 'XL']])  # Explicit order
df['size_ordinal'] = ordinal.fit_transform(df[['size']])
print(df[['size', 'size_ordinal']])

# 3. Label Encoding (simple numeric mapping)
print("=== Label Encoding ===")
label = LabelEncoder()
df['color_label'] = label.fit_transform(df['color'])
print(f"Mapping: {dict(zip(label.classes_, range(len(label.classes_))))}")

# 4. Target Encoding (mean target per category)
print("=== Target Encoding ===")
target_enc = TargetEncoder(smoothing=1.0)  # Smoothing prevents overfitting
df['color_target_enc'] = target_enc.fit_transform(df['color'], df['target'])
print(df[['color', 'color_target_enc', 'target']].drop_duplicates())

# 5. Frequency Encoding
print("=== Frequency Encoding ===")
freq_map = df['color'].value_counts(normalize=True).to_dict()
df['color_freq'] = df['color'].map(freq_map)
print(df[['color', 'color_freq']].drop_duplicates())

# 6. Hash Encoding
print("=== Hash Encoding ===")
hasher = HashingEncoder(n_components=5)  # Fixed output dimension
product_hashed = hasher.fit_transform(df[['product_id']])
print(f"High cardinality ({df['product_id'].nunique()} unique) -> {product_hashed.shape[1]} columns")

# 7. Leave-One-Out Encoding (target encoding variant)
print("=== Leave-One-Out Encoding ===")
loo_enc = LeaveOneOutEncoder()
df['color_loo'] = loo_enc.fit_transform(df['color'], df['target'])
print("LOO encoding excludes current row from mean calculation to reduce overfitting")
```

Target encoding uses the target variable to create features—a potential source of data leakage. Always use cross-validation or leave-one-out approaches: compute the encoding for each row using only data from OTHER rows. The category_encoders library handles this correctly, but manual implementations often don't.
High cardinality categories—user IDs, product SKUs, zip codes, IP addresses—pose unique challenges: naive one-hot encoding explodes the feature space, rare categories have too few samples to estimate reliably, and unseen categories appear at inference time.
Strategies for High Cardinality:
```python
import pandas as pd
import numpy as np

def frequency_threshold_encoding(
    df: pd.DataFrame,
    cat_col: str,
    min_frequency: int = 10,
    other_label: str = '_OTHER_'
) -> pd.Series:
    """
    Keep only categories with >= min_frequency occurrences.
    Rare categories become 'other'.
    """
    freq = df[cat_col].value_counts()
    common = freq[freq >= min_frequency].index
    return df[cat_col].where(df[cat_col].isin(common), other_label)

def smoothed_target_encoding(
    df: pd.DataFrame,
    cat_col: str,
    target_col: str,
    smoothing: float = 10.0,
    noise: float = 0.01
) -> pd.Series:
    """
    Target encoding with Bayesian smoothing.
    Smoothing blends category mean toward global mean based on sample size.
    Formula: (n * category_mean + smoothing * global_mean) / (n + smoothing)
    """
    global_mean = df[target_col].mean()

    # Compute category statistics
    agg = df.groupby(cat_col)[target_col].agg(['mean', 'count'])
    agg.columns = ['cat_mean', 'cat_count']

    # Apply smoothing
    agg['smoothed'] = (
        agg['cat_count'] * agg['cat_mean'] + smoothing * global_mean
    ) / (agg['cat_count'] + smoothing)

    # Map to original rows
    encoded = df[cat_col].map(agg['smoothed'])

    # Add small noise to prevent overfitting
    if noise > 0:
        encoded += np.random.normal(0, noise, len(encoded))

    return encoded

def hierarchical_encoding(
    df: pd.DataFrame,
    detail_col: str,
    parent_col: str,
    target_col: str,
    min_detail_count: int = 5
) -> pd.Series:
    """
    Use parent encoding when the detail category is too rare.
    Example: Use 'product_category' encoding for rare 'product_id' values.
    """
    # Count occurrences at detail level
    detail_counts = df[detail_col].value_counts()

    # Compute target mean at both levels
    detail_means = df.groupby(detail_col)[target_col].mean()
    parent_means = df.groupby(parent_col)[target_col].mean()

    def encode_row(row):
        if detail_counts.get(row[detail_col], 0) >= min_detail_count:
            return detail_means[row[detail_col]]
        else:
            return parent_means[row[parent_col]]

    return df.apply(encode_row, axis=1)

class EmbeddingLookup:
    """
    Simple entity embedding using pre-computed statistics.
    For neural network embedding layers, see deep learning frameworks.
    """

    def __init__(self, embedding_dim: int = 8):
        self.embedding_dim = embedding_dim
        self.embeddings = {}

    def fit(self, categories: pd.Series, features: pd.DataFrame):
        """
        Create embedding from average features per category.
        """
        from sklearn.decomposition import PCA

        combined = pd.concat([categories.reset_index(drop=True),
                              features.reset_index(drop=True)], axis=1)
        combined.columns = ['category'] + list(features.columns)

        # Average features per category
        cat_features = combined.groupby('category').mean()

        # Reduce to embedding dimension via PCA
        if len(cat_features.columns) > self.embedding_dim:
            pca = PCA(n_components=self.embedding_dim)
            embeddings = pca.fit_transform(cat_features)
        else:
            embeddings = cat_features.values

        self.embeddings = dict(zip(cat_features.index, embeddings))
        self.default_embedding = np.zeros(self.embedding_dim)
        return self

    def transform(self, categories: pd.Series) -> np.ndarray:
        return np.array([
            self.embeddings.get(cat, self.default_embedding)
            for cat in categories
        ])

# Example usage with product data
np.random.seed(42)
n_products = 10000
n_samples = 50000

product_ids = [f'PROD_{i:05d}' for i in range(n_products)]
df = pd.DataFrame({
    'product_id': np.random.choice(product_ids, n_samples,
                                   p=np.random.dirichlet(np.ones(n_products))),  # Power-law-ish
    'product_category': np.random.choice(['Electronics', 'Clothing', 'Home', 'Food'], n_samples),
    'price': np.random.lognormal(3, 1, n_samples),
    'rating': np.random.uniform(1, 5, n_samples),
    'purchased': np.random.binomial(1, 0.1, n_samples)
})

# Apply frequency thresholding
df['product_id_thresholded'] = frequency_threshold_encoding(df, 'product_id', min_frequency=10)
print(f"Original cardinality: {df['product_id'].nunique()}")
print(f"After thresholding: {df['product_id_thresholded'].nunique()}")

# Apply smoothed target encoding
df['product_id_target_enc'] = smoothed_target_encoding(
    df, 'product_id', 'purchased', smoothing=20.0
)
```

Most high-cardinality features follow power-law distributions: a few categories are very common, most are very rare. This means aggressive thresholding (keeping the top 100 of 10,000) often retains the large majority of samples while dramatically reducing complexity. Always check the frequency distribution before choosing an encoding strategy.
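Checking that frequency distribution takes only a few lines. A minimal sketch, using synthetic Zipf-like data invented for illustration, that measures how many rows the top 100 categories cover:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n_cats, n_rows = 10_000, 100_000

# Zipf-like sampling weights: a few categories dominate, most are rare
weights = 1.0 / np.arange(1, n_cats + 1)
weights /= weights.sum()
col = pd.Series(rng.choice(n_cats, size=n_rows, p=weights))

# What fraction of rows do the 100 most common categories account for?
freq = col.value_counts()
top_100_coverage = freq.head(100).sum() / n_rows
print(f"Top 100 of {col.nunique()} categories cover {top_100_coverage:.1%} of rows")
```

The coverage number tells you how aggressive a frequency threshold can be: if it is high, a small "keep top-K, bucket the rest" encoding loses very little information.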
Raw text requires substantial preprocessing before feature extraction. A well-designed pipeline handles:
Standard Preprocessing Steps:
| Step | Purpose | Example |
|---|---|---|
| Lowercasing | Normalize case variations | 'iPhone' → 'iphone' |
| Punctuation removal | Remove non-informative characters | 'Hello!' → 'Hello' |
| Tokenization | Split text into words/tokens | 'good product' → ['good', 'product'] |
| Stop word removal | Remove common, low-info words | Remove 'the', 'is', 'and' |
| Stemming | Reduce to word stems (crude) | 'running' → 'run' |
| Lemmatization | Reduce to dictionary forms | 'better' → 'good' |
| Number handling | Normalize or remove numbers | '$50' → '<PRICE>' or removal |
| Spell correction | Fix common misspellings | 'recieve' → 'receive' |
| Contraction expansion | Expand contracted forms | "don't" → 'do not' |
```python
import re
import string
from typing import List

# Using NLTK for NLP preprocessing
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Download required resources (run once)
# nltk.download('punkt')
# nltk.download('stopwords')
# nltk.download('wordnet')

class TextPreprocessor:
    """
    Configurable text preprocessing pipeline.
    """

    def __init__(
        self,
        lowercase: bool = True,
        remove_punctuation: bool = True,
        remove_numbers: bool = False,
        remove_stopwords: bool = True,
        stemming: bool = False,
        lemmatization: bool = True,
        min_word_length: int = 2,
        custom_stopwords: List[str] = None
    ):
        self.lowercase = lowercase
        self.remove_punctuation = remove_punctuation
        self.remove_numbers = remove_numbers
        self.remove_stopwords = remove_stopwords
        self.stemming = stemming
        self.lemmatization = lemmatization
        self.min_word_length = min_word_length

        # Initialize tools
        self.stemmer = PorterStemmer() if stemming else None
        self.lemmatizer = WordNetLemmatizer() if lemmatization else None
        self.stopwords = set(stopwords.words('english'))
        if custom_stopwords:
            self.stopwords.update(custom_stopwords)

    def preprocess(self, text: str) -> str:
        """
        Apply the full preprocessing pipeline to one text.
        """
        if not isinstance(text, str):
            return ""

        # Lowercase
        if self.lowercase:
            text = text.lower()

        # Remove URLs
        text = re.sub(r'https?://\S+|www\.\S+', '', text)

        # Remove emails
        text = re.sub(r'\S+@\S+', '', text)

        # Handle contractions
        contractions = {
            "don't": "do not", "won't": "will not", "can't": "cannot",
            "it's": "it is", "i'm": "i am", "you're": "you are",
            "they're": "they are", "we're": "we are", "i've": "i have",
            "you've": "you have", "we've": "we have", "isn't": "is not",
            "aren't": "are not", "wasn't": "was not", "weren't": "were not"
        }
        for contraction, expansion in contractions.items():
            text = text.replace(contraction, expansion)

        # Remove punctuation
        if self.remove_punctuation:
            text = text.translate(str.maketrans('', '', string.punctuation))

        # Remove numbers
        if self.remove_numbers:
            text = re.sub(r'\d+', '', text)

        # Tokenize
        tokens = word_tokenize(text)

        # Remove stopwords
        if self.remove_stopwords:
            tokens = [t for t in tokens if t not in self.stopwords]

        # Apply stemming or lemmatization
        if self.stemmer:
            tokens = [self.stemmer.stem(t) for t in tokens]
        elif self.lemmatizer:
            tokens = [self.lemmatizer.lemmatize(t) for t in tokens]

        # Filter by length
        tokens = [t for t in tokens if len(t) >= self.min_word_length]

        return ' '.join(tokens)

    def preprocess_batch(self, texts: List[str]) -> List[str]:
        """
        Preprocess a list of texts.
        """
        return [self.preprocess(t) for t in texts]

# Example usage
preprocessor = TextPreprocessor(
    lowercase=True,
    remove_punctuation=True,
    remove_stopwords=True,
    lemmatization=True
)

sample_texts = [
    "I absolutely LOVE this product! It's amazing and works great!!!",
    "The item didn't match the description. Very disappointed :(",
    "Fast shipping, good price. Would recommend to others.",
    "Don't buy this! It broke after 2 days. Total waste of $50."
]

for original in sample_texts:
    processed = preprocessor.preprocess(original)
    print(f"Original:  {original}")
    print(f"Processed: {processed}")
```

Aggressive preprocessing (stemming, stop word removal) works well for bag-of-words and TF-IDF. For modern embeddings (BERT, sentence transformers), minimal preprocessing is often better—these models handle casing, punctuation, and context internally. Always test preprocessing choices empirically.
Classic text representations convert documents into numerical vectors based on word occurrences.
Bag of Words (BoW):
Each document becomes a vector of word counts (or binary presence). Order is lost—hence 'bag.'
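The idea fits in a few lines without any library. A minimal sketch (toy documents invented for illustration) that builds count vectors over a shared vocabulary:

```python
from collections import Counter

docs = ["good product good price", "bad product"]

# Shared vocabulary across the corpus, in a fixed order
vocab = sorted({w for d in docs for w in d.split()})

# Each document becomes a vector of word counts over that vocabulary
vectors = [[Counter(d.split())[w] for w in vocab] for d in docs]
print(vocab)    # ['bad', 'good', 'price', 'product']
print(vectors)  # [[0, 2, 1, 1], [1, 0, 0, 1]]
```

Note that "good product" and "product good" would produce identical vectors: the representation keeps counts and discards order.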
TF-IDF (Term Frequency–Inverse Document Frequency):
Weights words by:
$$TFIDF(t, d) = TF(t, d) \times \log\frac{N}{DF(t)}$$
Where N = total documents, DF(t) = documents containing term t.
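The formula can be computed by hand to build intuition. A minimal sketch, using raw term counts and the natural log exactly as in the formula above (scikit-learn's TfidfVectorizer uses a smoothed variant, so its numbers will differ slightly):

```python
import math

docs = [
    "machine learning is fun",
    "deep learning uses networks",
    "vision interprets images",
]
N = len(docs)

def tfidf(term, doc):
    tf = doc.split().count(term)                # term frequency in this document
    df = sum(term in d.split() for d in docs)   # documents containing the term
    return tf * math.log(N / df)

# 'learning' appears in 2 of 3 docs -> low IDF; 'vision' in 1 of 3 -> high IDF
print(round(tfidf("learning", docs[0]), 3))  # 1 * ln(3/2) ≈ 0.405
print(round(tfidf("vision", docs[2]), 3))    # 1 * ln(3/1) ≈ 1.099
```

A term that appears in every document gets log(N/N) = 0: ubiquitous words are weighted down to nothing, which is the whole point of the IDF factor.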
```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Sample documents
documents = [
    "machine learning is a branch of artificial intelligence",
    "deep learning uses neural networks for learning",
    "natural language processing enables machines to understand text",
    "computer vision allows machines to interpret images",
    "reinforcement learning trains agents to make decisions"
]

# 1. Bag of Words (CountVectorizer)
print("=== Bag of Words ===")
count_vec = CountVectorizer(
    max_features=20,      # Limit vocabulary size
    min_df=1,             # Minimum document frequency
    max_df=0.9,           # Maximum document frequency (remove too common)
    ngram_range=(1, 2),   # Include unigrams and bigrams
    stop_words='english'
)

bow_matrix = count_vec.fit_transform(documents)
bow_df = pd.DataFrame(
    bow_matrix.toarray(),
    columns=count_vec.get_feature_names_out()
)
print(f"Vocabulary size: {len(count_vec.vocabulary_)}")
print(bow_df)

# 2. TF-IDF
print("=== TF-IDF ===")
tfidf_vec = TfidfVectorizer(
    max_features=20,
    min_df=1,
    max_df=0.9,
    ngram_range=(1, 2),
    stop_words='english',
    sublinear_tf=True,  # Use 1 + log(tf) instead of raw tf
    norm='l2'           # L2 normalize each document vector
)

tfidf_matrix = tfidf_vec.fit_transform(documents)
tfidf_df = pd.DataFrame(
    tfidf_matrix.toarray(),
    columns=tfidf_vec.get_feature_names_out()
)
print(f"Vocabulary size: {len(tfidf_vec.vocabulary_)}")
print(tfidf_df.round(3))

# 3. Compare word importance
print("=== TF-IDF Feature Importance ===")
feature_importance = pd.DataFrame({
    'term': tfidf_vec.get_feature_names_out(),
    'idf': tfidf_vec.idf_
}).sort_values('idf', ascending=False)
print("Most distinctive terms (highest IDF):")
print(feature_importance.head(10))

# 4. Advanced TF-IDF with custom preprocessing
class CustomTfidfVectorizer(TfidfVectorizer):
    """
    TF-IDF with domain-specific enhancements.
    """

    def __init__(self, domain_terms: list = None, **kwargs):
        super().__init__(**kwargs)
        self.domain_terms = domain_terms or []

    def build_analyzer(self):
        base_analyzer = super().build_analyzer()

        def custom_analyzer(doc):
            tokens = base_analyzer(doc)
            # Add domain-specific flag tokens
            for term in self.domain_terms:
                if term.lower() in doc.lower():
                    tokens.append(f'HAS_{term.upper()}')
            return tokens

        return custom_analyzer

# Usage with domain terms
domain_vec = CustomTfidfVectorizer(
    domain_terms=['neural', 'learning', 'AI'],
    max_features=30
)
domain_features = domain_vec.fit_transform(documents)
print(f"With domain terms: {domain_vec.get_feature_names_out()[:10]}...")
```

Unigrams treat 'not good' as two separate tokens, losing the negation. Bigrams capture 'not_good' as a single feature. Use ngram_range=(1, 2) or (1, 3) to include phrases. This often improves sentiment analysis significantly, at the cost of larger feature spaces.
TF-IDF produces sparse, high-dimensional vectors where most entries are zero and semantically similar words are unrelated. Embeddings map text to dense, low-dimensional vectors where semantics are encoded—similar meanings have similar vectors.
Embedding Types:
| Method | Level | Dimension | Pros/Cons |
|---|---|---|---|
| Word2Vec | Word | 100-300 | Fast, captures analogies; ignores context |
| GloVe | Word | 50-300 | Captures global statistics; pre-trained available |
| FastText | Word (subword) | 100-300 | Handles OOV via character n-grams |
| Doc2Vec | Document | 100-500 | Learns document vectors; training required |
| Sentence-BERT | Sentence | 384-768 | State-of-the-art similarity; GPU recommended for speed |
| OpenAI Embeddings | Text | 1536-3072 | High quality; API cost; no local training |
```python
import numpy as np
import pandas as pd

# Using sentence-transformers for modern embeddings
from sentence_transformers import SentenceTransformer

# Sample texts
texts = [
    "The product quality is excellent and exceeded my expectations",
    "Great item, works perfectly and arrived on time",
    "Terrible experience, the item was broken and customer service was unhelpful",
    "Waste of money, completely disappointed with this purchase",
    "Average product, nothing special but works as described"
]

# 1. Using Sentence-BERT (all-MiniLM-L6-v2 is fast and good)
print("=== Sentence-BERT Embeddings ===")
model = SentenceTransformer('all-MiniLM-L6-v2')

embeddings = model.encode(texts)
print(f"Embedding shape: {embeddings.shape}")  # (5, 384)

# 2. Compute similarity between texts
from sklearn.metrics.pairwise import cosine_similarity

similarity_matrix = cosine_similarity(embeddings)
sim_df = pd.DataFrame(
    similarity_matrix,
    index=[f'Text {i}' for i in range(len(texts))],
    columns=[f'Text {i}' for i in range(len(texts))]
)
print("Similarity Matrix:")
print(sim_df.round(3))

# 3. Using embeddings as features
def create_embedding_features(
    texts: list,
    model_name: str = 'all-MiniLM-L6-v2'
) -> pd.DataFrame:
    """
    Convert texts to embedding features.
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts, show_progress_bar=True)

    # Create feature column names
    columns = [f'emb_{i}' for i in range(embeddings.shape[1])]
    return pd.DataFrame(embeddings, columns=columns)

# 4. Aggregating word embeddings (alternative to sentence embeddings)
def average_word_embeddings(
    text: str,
    word_vectors: dict,  # word -> vector mapping
    vector_dim: int = 300
) -> np.ndarray:
    """
    Create a document vector by averaging word vectors.
    Can weight by TF-IDF for TF-IDF weighted embeddings.
    """
    words = text.lower().split()
    vectors = [word_vectors[w] for w in words if w in word_vectors]
    if not vectors:
        return np.zeros(vector_dim)
    return np.mean(vectors, axis=0)

# 5. Clustering texts by embedding similarity
from sklearn.cluster import KMeans

def cluster_texts_by_embedding(
    texts: list,
    n_clusters: int = 3,
    model_name: str = 'all-MiniLM-L6-v2'
) -> np.ndarray:
    """
    Cluster texts based on embedding similarity.
    """
    model = SentenceTransformer(model_name)
    embeddings = model.encode(texts)

    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    clusters = kmeans.fit_predict(embeddings)
    return clusters

# Example clustering
clusters = cluster_texts_by_embedding(texts, n_clusters=2)
for text, cluster in zip(texts, clusters):
    print(f"Cluster {cluster}: {text[:50]}...")
```

For most applications, 'all-MiniLM-L6-v2' offers the best speed/quality tradeoff. For maximum quality, use 'all-mpnet-base-v2'. For multilingual text, use 'paraphrase-multilingual-MiniLM-L12-v2'. Check the sentence-transformers documentation for specialized models (legal, medical, code).
Beyond general text representations, domain-specific features often provide more direct signal for prediction tasks.
```python
import pandas as pd
import numpy as np
import re
from textblob import TextBlob

def extract_review_features(text: str) -> dict:
    """
    Extract domain-specific features from product reviews.
    """
    if not isinstance(text, str):
        return {}

    features = {}

    # Basic statistics
    features['char_count'] = len(text)
    features['word_count'] = len(text.split())
    features['sentence_count'] = len(re.findall(r'[.!?]+', text)) or 1
    features['avg_word_length'] = np.mean([len(w) for w in text.split()]) if text.split() else 0

    # Punctuation patterns
    features['exclamation_count'] = text.count('!')
    features['question_count'] = text.count('?')
    features['uppercase_ratio'] = sum(1 for c in text if c.isupper()) / (len(text) + 1)

    # Sentiment (using TextBlob)
    blob = TextBlob(text)
    features['sentiment_polarity'] = blob.sentiment.polarity          # -1 to 1
    features['sentiment_subjectivity'] = blob.sentiment.subjectivity  # 0 to 1

    # Keyword presence
    positive_words = ['excellent', 'amazing', 'love', 'great', 'perfect', 'best', 'fantastic']
    negative_words = ['terrible', 'awful', 'hate', 'worst', 'broken', 'disappointed', 'waste']
    text_lower = text.lower()
    features['positive_word_count'] = sum(1 for w in positive_words if w in text_lower)
    features['negative_word_count'] = sum(1 for w in negative_words if w in text_lower)

    # Comparison/contrast indicators
    features['mentions_competitor'] = int(any(w in text_lower for w in ['amazon', 'competitor', 'other brand']))
    features['mentions_price'] = int(any(w in text_lower for w in ['price', 'cost', 'expensive', 'cheap', '$']))
    features['mentions_shipping'] = int(any(w in text_lower for w in ['shipping', 'delivery', 'arrived']))
    features['mentions_return'] = int(any(w in text_lower for w in ['return', 'refund', 'exchange']))

    # Review structure
    features['has_rating_keyword'] = int(any(w in text_lower for w in ['star', 'rating', 'out of']))
    features['asks_question'] = int(features['question_count'] > 0)
    features['is_shouting'] = int(features['uppercase_ratio'] > 0.3)

    return features

def extract_support_ticket_features(text: str) -> dict:
    """
    Extract features specific to customer support tickets.
    """
    if not isinstance(text, str):
        return {}

    features = {}
    text_lower = text.lower()

    # Urgency indicators
    urgent_words = ['urgent', 'asap', 'immediately', 'critical', 'emergency', 'now']
    features['urgency_score'] = sum(1 for w in urgent_words if w in text_lower)

    # Issue categories (rule-based)
    features['is_billing_issue'] = int(any(w in text_lower for w in ['charge', 'invoice', 'bill', 'payment', 'refund']))
    features['is_technical_issue'] = int(any(w in text_lower for w in ['error', 'bug', 'crash', 'not working', 'broken']))
    features['is_account_issue'] = int(any(w in text_lower for w in ['password', 'login', 'account', 'access']))

    # Customer sentiment
    features['frustration_indicators'] = sum(
        1 for w in ['frustrated', 'angry', 'annoyed', 'unacceptable'] if w in text_lower
    )

    # Escalation likelihood
    features['mentions_cancel'] = int('cancel' in text_lower)
    features['mentions_legal'] = int(any(w in text_lower for w in ['lawyer', 'legal', 'lawsuit', 'sue']))
    features['mentions_social_media'] = int(any(w in text_lower for w in ['twitter', 'facebook', 'social media', 'review']))

    return features

# Apply to DataFrame
def batch_extract_text_features(
    df: pd.DataFrame,
    text_col: str,
    feature_extractor: callable
) -> pd.DataFrame:
    """
    Apply a feature extractor to every row of a text column.
    """
    features = df[text_col].apply(feature_extractor)
    return pd.DataFrame(features.tolist())

# Example usage
reviews = pd.DataFrame({
    'text': [
        "ABSOLUTELY TERRIBLE!! Broke after 2 days. Want my money back!",
        "Great product, fast shipping. Would recommend to friends.",
        "It's okay I guess. Works as described but nothing special.",
        "Love love love this! Best purchase ever, exceeded expectations!"
    ]
})

review_features = batch_extract_text_features(reviews, 'text', extract_review_features)
print(review_features[['sentiment_polarity', 'positive_word_count',
                       'negative_word_count', 'is_shouting']])
```

Text and categorical features require the most nuanced feature engineering of any data type. The choice among encoding strategies, and between sparse and dense representations, can determine whether your model succeeds.
You've completed the Feature Engineering Mastery module! You now have a comprehensive toolkit for engineering features across all data types: numerical, categorical, temporal, and textual. The next modules in this chapter will cover automated feature engineering, feature selection theory, and production-grade feature stores.