Word2Vec and GloVe share a fundamental limitation: they treat words as atomic units. If a word isn't in the vocabulary, it simply cannot be represented. This creates problems in practice: out-of-vocabulary (OOV) words such as new coinages, domain-specific terms, and proper nouns receive no vector at all; morphological variants ("run", "running", "runs") are learned independently even though they are closely related; and misspellings, which are pervasive in real-world text, are treated as unknown words.
FastText, developed by Facebook AI Research in 2016, elegantly solves these problems by representing words as bags of character n-grams. The word "where" is represented not just as "where" but as the n-grams "<wh", "whe", "her", "ere", "re>" (for n=3), plus the word itself. This enables embeddings for out-of-vocabulary words, shared information between morphologically related words, and robustness to typos.
By the end of this page, you will understand FastText's subword representation scheme, the modified Skip-gram objective incorporating n-grams, how to compute embeddings for OOV words, practical training and usage considerations, and when FastText outperforms traditional word embeddings. This knowledge is essential for handling real-world text with its inherent messiness.
FastText's key innovation is representing each word as a bag of character n-grams plus the word itself. This is accomplished through a simple yet powerful scheme.

The n-gram extraction process:

1. Wrap the word in boundary markers: "where" becomes "<where>".
2. Extract all character n-grams of every length from minn to maxn (defaults: 3 to 6).
3. Add the full word with markers ("<where>") as its own token.

Example for "where" with n=3: "<wh", "whe", "her", "ere", "re>", plus the whole-word token "<where>".
```python
from typing import Set


def extract_ngrams(
    word: str,
    minn: int = 3,
    maxn: int = 6,
    include_word: bool = True
) -> Set[str]:
    """
    Extract character n-grams for FastText representation.

    Args:
        word: Input word
        minn: Minimum n-gram length
        maxn: Maximum n-gram length
        include_word: Whether to include the full word with markers

    Returns:
        Set of character n-grams
    """
    # Add boundary markers
    word_with_markers = f"<{word}>"
    ngrams = set()

    # Extract n-grams for each length
    for n in range(minn, maxn + 1):
        for i in range(len(word_with_markers) - n + 1):
            ngram = word_with_markers[i:i+n]
            ngrams.add(ngram)

    # Optionally add full word
    if include_word:
        ngrams.add(word_with_markers)

    return ngrams


def show_ngram_decomposition(word: str, minn: int = 3, maxn: int = 6):
    """Visualize n-gram decomposition of a word."""
    print(f"Word: '{word}'")
    print(f"With markers: '<{word}>'")
    print(f"N-grams (n={minn} to {maxn}):")

    for n in range(minn, maxn + 1):
        ngrams = extract_ngrams(word, minn=n, maxn=n, include_word=False)
        print(f"  n={n}: {sorted(ngrams)}")

    all_ngrams = extract_ngrams(word, minn, maxn)
    print(f"Total unique n-grams: {len(all_ngrams)}")
    return all_ngrams


# Demonstrate for various words
for word in ["where", "happy", "unhappiness", "run", "running"]:
    show_ngram_decomposition(word)
    print()


# Show n-gram overlap between related words
def compute_ngram_overlap(word1: str, word2: str, minn: int = 3, maxn: int = 6):
    """Compute Jaccard overlap between n-gram sets."""
    ng1 = extract_ngrams(word1, minn, maxn)
    ng2 = extract_ngrams(word2, minn, maxn)

    intersection = ng1 & ng2
    union = ng1 | ng2
    jaccard = len(intersection) / len(union) if union else 0

    print(f"N-gram overlap between '{word1}' and '{word2}':")
    print(f"  Shared n-grams: {sorted(intersection)}")
    print(f"  Jaccard similarity: {jaccard:.3f}")
    return jaccard


# Related words have high overlap
compute_ngram_overlap("happy", "unhappy")
compute_ngram_overlap("happy", "happiness")
compute_ngram_overlap("running", "runner")

# Typos share many n-grams
compute_ngram_overlap("receive", "recieve")

# Unrelated words have low overlap
compute_ngram_overlap("happy", "algorithm")
```

The < and > markers distinguish prefixes, suffixes, and infixes. Without markers, the trigram "her" extracted from "where" would be indistinguishable from the word "her" itself. With markers, "<her" (prefix), "her>" (suffix), and "her" (infix) are all distinct n-grams, preserving positional information.
FastText modifies Word2Vec's Skip-gram objective to incorporate subword information. The key change: a word's embedding is the sum of its n-gram embeddings plus a whole-word embedding.
Modified scoring function:
For Word2Vec Skip-gram, the score for a (center, context) pair is:
$$s(w, c) = u_c^T v_w$$
For FastText, this becomes:
$$s(w, c) = u_c^T \left( v_w + \sum_{g \in G(w)} z_g \right)$$
where $v_w$ is the whole-word vector of the center word $w$, $G(w)$ is the set of character n-grams extracted from $w$, $z_g$ is the embedding of n-gram $g$, and $u_c$ is the context vector of word $c$.
The complete objective (with negative sampling):
$$\sum_{t=1}^{T} \sum_{c \in C_t} \left[ \log \sigma(s(w_t, c)) + \sum_{n \in N_{t,c}} \log \sigma(-s(w_t, n)) \right]$$
where $C_t$ is the set of context words at position $t$ and $N_{t,c}$ is the set of negative samples drawn for the pair $(w_t, c)$.
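As a concrete illustration, here is a minimal NumPy sketch of this scoring function and the negative-sampling term for a single (center, context) pair. The toy vectors below are random placeholders standing in for trained parameters; a real implementation would look n-gram vectors up in the hashed bucket table described later.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def fasttext_score(word_vec, ngram_vecs, context_vec) -> float:
    """s(w, c) = u_c^T (v_w + sum of n-gram vectors z_g)."""
    hidden = word_vec + sum(ngram_vecs)
    return float(context_vec @ hidden)

# Toy parameters (in practice these come from training)
rng = np.random.default_rng(0)
dim = 8
v_w = rng.normal(size=dim)                          # whole-word vector of the center word
z_g = [rng.normal(size=dim) for _ in range(5)]      # vectors of its character n-grams
u_pos = rng.normal(size=dim)                        # context vector of a true context word
u_negs = [rng.normal(size=dim) for _ in range(3)]   # context vectors of negative samples

# Negative-sampling objective for this pair:
# log sigma(s(w, c)) + sum over negatives of log sigma(-s(w, n))
objective = np.log(sigmoid(fasttext_score(v_w, z_g, u_pos)))
objective += sum(np.log(sigmoid(-fasttext_score(v_w, z_g, u_n))) for u_n in u_negs)
print(f"Objective contribution for this pair: {objective:.4f}")
```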
```python
import numpy as np
from typing import Dict, Set


class FastTextEmbedding:
    """
    Demonstrates how FastText computes word embeddings.
    """

    def __init__(
        self,
        ngram_vectors: Dict[str, np.ndarray],  # n-gram embeddings
        word_vectors: Dict[str, np.ndarray],   # whole-word embeddings
        embedding_dim: int,
        minn: int = 3,
        maxn: int = 6
    ):
        self.ngram_vectors = ngram_vectors
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.minn = minn
        self.maxn = maxn

    def extract_ngrams(self, word: str) -> Set[str]:
        """Extract n-grams for a word."""
        word_with_markers = f"<{word}>"
        ngrams = set()
        for n in range(self.minn, self.maxn + 1):
            for i in range(len(word_with_markers) - n + 1):
                ngrams.add(word_with_markers[i:i+n])
        return ngrams

    def get_word_vector(self, word: str) -> np.ndarray:
        """
        Compute word embedding as sum of n-gram vectors + word vector.
        This is the key FastText operation.
        """
        vector = np.zeros(self.embedding_dim)
        count = 0

        # Add whole-word embedding if available
        if word in self.word_vectors:
            vector += self.word_vectors[word]
            count += 1

        # Add n-gram embeddings
        ngrams = self.extract_ngrams(word)
        for ngram in ngrams:
            if ngram in self.ngram_vectors:
                vector += self.ngram_vectors[ngram]
                count += 1

        # Normalize by count (optional; some implementations average)
        if count > 0:
            vector /= count

        return vector

    def is_in_vocabulary(self, word: str) -> bool:
        """Check if word has a direct embedding."""
        return word in self.word_vectors

    def get_oov_coverage(self, word: str) -> float:
        """How many of the word's n-grams are covered?"""
        ngrams = self.extract_ngrams(word)
        covered = sum(1 for ng in ngrams if ng in self.ngram_vectors)
        return covered / len(ngrams) if ngrams else 0.0


# Demonstrate OOV vector computation
def demonstrate_oov_handling():
    """Show how FastText handles OOV words."""
    # Simulated trained model (in practice, use the fasttext library)
    vocab = ['where', 'there', 'here', 'somewhere', 'anywhere', 'everywhere']

    # Pretend we have trained embeddings
    np.random.seed(42)
    dim = 100

    # Build mock embeddings
    word_vectors = {w: np.random.randn(dim) for w in vocab}

    # Build n-gram vocabulary from training words
    all_ngrams = set()
    for word in vocab:
        word_with_markers = f"<{word}>"
        for n in range(3, 7):
            for i in range(len(word_with_markers) - n + 1):
                all_ngrams.add(word_with_markers[i:i+n])

    ngram_vectors = {ng: np.random.randn(dim) for ng in all_ngrams}

    print(f"Vocabulary size: {len(vocab)}")
    print(f"N-gram vocabulary size: {len(ngram_vectors)}")

    # Create embedder
    embedder = FastTextEmbedding(
        ngram_vectors, word_vectors, dim, minn=3, maxn=6
    )

    # Test on in-vocabulary word
    print("In-vocabulary word 'where':")
    vec = embedder.get_word_vector('where')
    print(f"  Vector norm: {np.linalg.norm(vec):.4f}")

    # Test on OOV word that shares n-grams
    print("OOV word 'nowhere' (shares n-grams with vocab):")
    print(f"  Is in vocabulary: {embedder.is_in_vocabulary('nowhere')}")
    print(f"  N-gram coverage: {embedder.get_oov_coverage('nowhere'):.1%}")
    vec = embedder.get_word_vector('nowhere')
    print(f"  Vector norm: {np.linalg.norm(vec):.4f}")

    # Test on completely foreign word
    print("OOV word 'xyzzy' (few shared n-grams):")
    print(f"  N-gram coverage: {embedder.get_oov_coverage('xyzzy'):.1%}")
    vec = embedder.get_word_vector('xyzzy')
    print(f"  Vector norm: {np.linalg.norm(vec):.4f}")


demonstrate_oov_handling()
```

You might worry that storing embeddings for every n-gram is expensive. FastText addresses this with hashing: n-grams are hashed to a fixed number of buckets (default 2 million), and bucket embeddings are learned.
This means multiple n-grams may share an embedding, but in practice the collision rate is low enough not to hurt quality significantly.
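For intuition, here is a sketch of how n-grams might be mapped to bucket indices. The FNV-1a-style hash below mirrors the scheme commonly attributed to the reference implementation, but treat the exact constants as an assumption; the point is simply that each n-gram deterministically selects one of `bucket` rows in a shared embedding matrix, so the parameter count stays fixed no matter how many distinct n-grams occur.

```python
def ngram_hash(ngram: str, bucket: int = 2_000_000) -> int:
    """Map an n-gram to a bucket index with an FNV-1a-style 32-bit hash.
    (Exact constants are an assumption; the idea is a fixed-size table.)"""
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h % bucket

# Each n-gram deterministically indexes one row of a (bucket, dim) embedding
# matrix, so distinct n-grams may collide but memory stays bounded.
for ng in ["<wh", "whe", "her", "ere", "re>", "<where>"]:
    print(f"{ng!r} -> bucket {ngram_hash(ng)}")
```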
Training FastText is similar to Word2Vec, with additional parameters controlling the subword representation:
```python
import os
import tempfile
from typing import List

import fasttext
import numpy as np


def train_fasttext_skipgram(
    sentences: List[str],
    model_path: str = 'fasttext_model',
    dim: int = 100,
    epoch: int = 5,
    lr: float = 0.05,
    window: int = 5,
    minn: int = 3,
    maxn: int = 6,
    min_count: int = 5,
    neg: int = 5,
    bucket: int = 2000000
) -> fasttext.FastText._FastText:
    """
    Train a FastText Skip-gram model.

    Args:
        sentences: List of sentences (strings)
        model_path: Where to save the model
        dim: Embedding dimension
        epoch: Number of training epochs
        lr: Learning rate
        window: Context window size
        minn: Minimum n-gram length
        maxn: Maximum n-gram length
        min_count: Ignore words with fewer occurrences
        neg: Number of negative samples
        bucket: Number of hash buckets for n-grams

    Returns:
        Trained FastText model
    """
    # FastText requires a file input
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as f:
        for sentence in sentences:
            f.write(sentence + '\n')
        temp_path = f.name

    try:
        # Train the model
        model = fasttext.train_unsupervised(
            temp_path,
            model='skipgram',  # or 'cbow'
            dim=dim,
            epoch=epoch,
            lr=lr,
            ws=window,
            minn=minn,
            maxn=maxn,
            minCount=min_count,
            neg=neg,
            bucket=bucket,
            thread=4
        )

        # Save model
        model.save_model(f'{model_path}.bin')
        print(f"Model saved to {model_path}.bin")

        return model
    finally:
        os.unlink(temp_path)


def train_fasttext_cbow(sentences: List[str], **kwargs):
    """Train FastText with the CBOW architecture (also uses subword n-grams)."""
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as f:
        for sentence in sentences:
            f.write(sentence + '\n')
        temp_path = f.name

    try:
        model = fasttext.train_unsupervised(
            temp_path,
            model='cbow',
            dim=kwargs.get('dim', 100),
            epoch=kwargs.get('epoch', 5),
            minn=kwargs.get('minn', 3),
            maxn=kwargs.get('maxn', 6)
        )
        return model
    finally:
        os.unlink(temp_path)


# Example training
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning algorithms process data efficiently",
    "neural networks learn patterns from large datasets",
    "natural language processing enables machines to understand text",
    # ... more sentences for real training
]

model = train_fasttext_skipgram(
    corpus,
    dim=100,
    epoch=10,
    minn=3,
    maxn=6
)

# Test the model
print("Word vector for 'machine':")
vec = model.get_word_vector('machine')
print(f"  Shape: {vec.shape}, Norm: {np.linalg.norm(vec):.4f}")

# OOV word
print("Word vector for 'machinelearning' (OOV):")
vec_oov = model.get_word_vector('machinelearning')
print(f"  Shape: {vec_oov.shape}, Norm: {np.linalg.norm(vec_oov):.4f}")

# Nearest neighbors
print("Nearest neighbors of 'learning':")
neighbors = model.get_nearest_neighbors('learning', k=5)
for score, word in neighbors:
    print(f"  {word}: {score:.4f}")
```

| Parameter | Default | Description | Tuning Guidance |
|---|---|---|---|
| dim | 100 | Embedding dimension | 100-300; higher for larger corpora |
| minn | 3 | Minimum n-gram length | 2-3 for morphologically rich languages |
| maxn | 6 | Maximum n-gram length | 5-6 typically; lower reduces model size |
| bucket | 2000000 | Hash buckets for n-grams | Reduce for smaller models; increase for large vocab |
| epoch | 5 | Training epochs | 5-10 for large corpora; more for small |
| minCount | 5 | Minimum word frequency | 1-5; lower includes more rare words |
FastText models are larger than Word2Vec because they store n-gram embeddings. A typical 300-dim FastText model can be 2-7GB depending on bucket size. For deployment, consider: (1) reducing bucket count, (2) using quantization (model.quantize(), supported for supervised models), or (3) extracting only word vectors for known vocabulary.
FastText provides several APIs for accessing and using embeddings:
```python
import fasttext
import numpy as np
from typing import List

# Load pre-trained or custom model
model = fasttext.load_model('path/to/model.bin')

# ============================================
# Word Vectors
# ============================================

def get_word_embedding(word: str) -> np.ndarray:
    """Get embedding for any word (in-vocab or OOV)."""
    return model.get_word_vector(word)

# Works for in-vocabulary words
vec_king = get_word_embedding('king')

# Works for OOV words too!
vec_coronavirus = get_word_embedding('coronavirus')  # OOV in older models
vec_misspelled = get_word_embedding('recieve')       # Typo still gets an embedding

print(f"'king' vector norm: {np.linalg.norm(vec_king):.4f}")
print(f"'coronavirus' (OOV) vector norm: {np.linalg.norm(vec_coronavirus):.4f}")

# ============================================
# Document Embeddings via Averaging
# ============================================

def get_document_embedding(
    text: str,
    model: fasttext.FastText._FastText,
    normalize: bool = True
) -> np.ndarray:
    """
    Compute document embedding by averaging word embeddings.
    FastText handles OOV words, so we can include all tokens.
    """
    words = text.lower().split()
    if not words:
        return np.zeros(model.get_dimension())

    # Get embeddings for all words (including OOV)
    word_vectors = [model.get_word_vector(word) for word in words]

    # Average
    doc_vector = np.mean(word_vectors, axis=0)

    if normalize:
        norm = np.linalg.norm(doc_vector)
        if norm > 0:
            doc_vector = doc_vector / norm

    return doc_vector

# Example
doc1 = "Machine learning algorithms process large datasets"
doc2 = "Deep neural networks learn patterns from data"
doc3 = "The cat sat on the mat"

vec1 = get_document_embedding(doc1, model)
vec2 = get_document_embedding(doc2, model)
vec3 = get_document_embedding(doc3, model)

# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity

sim_12 = cosine_similarity([vec1], [vec2])[0, 0]
sim_13 = cosine_similarity([vec1], [vec3])[0, 0]
print(f"Similarity(ML, DL): {sim_12:.4f}")
print(f"Similarity(ML, cat): {sim_13:.4f}")

# ============================================
# Sentence Vectors (Built-in)
# ============================================

def get_sentence_embedding(sentence: str) -> np.ndarray:
    """
    FastText's built-in sentence embedding.
    Note: Primarily designed for supervised models,
    but works for unsupervised models too.
    """
    return model.get_sentence_vector(sentence)

# ============================================
# Similarity Queries
# ============================================

def find_similar_words(word: str, k: int = 10) -> List[tuple]:
    """Find k most similar words using cosine similarity."""
    return model.get_nearest_neighbors(word, k=k)

# Find similar words
print("Words similar to 'machine':")
for score, word in find_similar_words('machine', 5):
    print(f"  {word}: {score:.4f}")

# Even works for OOV queries
print("Words similar to 'machinelearning' (OOV):")
for score, word in find_similar_words('machinelearning', 5):
    print(f"  {word}: {score:.4f}")

# ============================================
# Analogy Queries
# ============================================

def analogy(a: str, b: str, c: str, k: int = 5) -> List[tuple]:
    """
    Solve analogy: a is to b as c is to ?
    Returns (score, word) tuples.
    """
    return model.get_analogies(a, b, c, k=k)

# Classic analogy
print("king - man + woman = ?")
for score, word in analogy('king', 'man', 'woman'):
    print(f"  {word}: {score:.4f}")
```

FastText also supports very fast text classification.
The supervised model (fasttext.train_supervised) can train on millions of examples in seconds and achieves competitive accuracy for many classification tasks. It's an excellent baseline for text classification with minimal preprocessing.
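A minimal sketch of that workflow, assuming the standard `__label__` training-file format; the two training examples below are placeholders, not a real dataset:

```python
import os
import tempfile

import fasttext

# train_supervised expects one example per line: "__label__<tag> <text>"
examples = [
    "__label__positive this library is fast and easy to use",
    "__label__negative the documentation was confusing and incomplete",
]

with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(examples) + '\n')
    train_path = f.name

try:
    # Train a small classifier; hyperparameters here are illustrative only
    clf = fasttext.train_supervised(train_path, epoch=25, lr=1.0, wordNgrams=2)
    labels, probs = clf.predict("fast and easy", k=1)
    print(labels, probs)
finally:
    os.unlink(train_path)
```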
Facebook provides pre-trained FastText vectors for 157 languages—a major advantage over Word2Vec and GloVe which are primarily available for English.
Available pre-trained models:
| Model | Languages | Corpus | Dimensions |
|---|---|---|---|
| Wiki word vectors | 294 | Wikipedia | 300 |
| Common Crawl | 157 | Common Crawl + Wikipedia | 300 |
| Wiki + CC (aligned) | 44 | Combined | 300 |
Aligned word vectors are special: different languages are aligned to a common vector space, enabling cross-lingual applications.
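As a sketch of how aligned vectors can be used, the snippet below loads two aligned `.vec` files with gensim and compares words across languages directly. The `wiki.en.align.vec` / `wiki.es.align.vec` file names are assumptions based on the aligned-vectors download page; adjust them to whatever you actually download.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed file names for the aligned vectors
en = KeyedVectors.load_word2vec_format('wiki.en.align.vec', binary=False)
es = KeyedVectors.load_word2vec_format('wiki.es.align.vec', binary=False)

def cross_lingual_similarity(word_en: str, word_es: str) -> float:
    """Cosine similarity across languages; meaningful only for aligned spaces."""
    a, b = en[word_en], es[word_es]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cross_lingual_similarity('cat', 'gato'))   # translation pair: should be high
print(cross_lingual_similarity('cat', 'coche'))  # 'car' in Spanish: should be lower
```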
```python
import fasttext
import fasttext.util
import numpy as np
from typing import List

# ============================================
# Download and load pre-trained vectors
# ============================================

def load_pretrained_fasttext(lang: str = 'en') -> fasttext.FastText._FastText:
    """
    Download and load a pre-trained FastText model for a language.

    Language codes: 'en', 'es', 'fr', 'de', 'zh', 'ja', 'ko', 'ru', etc.
    Full list at https://fasttext.cc/docs/en/crawl-vectors.html
    """
    # Download (around 7GB for the full model)
    fasttext.util.download_model(lang, if_exists='ignore')

    # Load
    model = fasttext.load_model(f'cc.{lang}.300.bin')

    print(f"Loaded {lang} model:")
    print(f"  Vocabulary size: {len(model.words)}")
    print(f"  Embedding dimension: {model.get_dimension()}")

    return model

# ============================================
# Reduce dimensions for faster inference
# ============================================

def reduce_dimensions(model, target_dim: int = 100):
    """
    Reduce embedding dimension using PCA.
    Useful for reducing memory and speeding up computation.
    """
    fasttext.util.reduce_model(model, target_dim)
    print(f"Reduced to {model.get_dimension()} dimensions")
    return model

# ============================================
# Multilingual usage
# ============================================

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def multilingual_demo():
    """Demonstrate multilingual capabilities."""
    # Load English and Spanish models
    en_model = load_pretrained_fasttext('en')
    es_model = load_pretrained_fasttext('es')

    print("Within-language similarities work normally:")
    print(f"  English: cat ↔ dog = "
          f"{cosine(en_model.get_word_vector('cat'), en_model.get_word_vector('dog')):.4f}")
    print(f"  Spanish: gato ↔ perro = "
          f"{cosine(es_model.get_word_vector('gato'), es_model.get_word_vector('perro')):.4f}")

    # Note: Cross-lingual comparison requires aligned vectors
    print("For cross-lingual similarity, use aligned word vectors")

# ============================================
# Via gensim (alternative)
# ============================================

def load_via_gensim():
    """Load FastText vectors via gensim (more memory-efficient)."""
    from gensim.models.fasttext import load_facebook_model
    from gensim.models import KeyedVectors

    # Load from Facebook .bin file (full model, supports OOV)
    model = load_facebook_model('cc.en.300.bin')

    # Or load just the word vectors (smaller, no OOV support)
    word_vectors = KeyedVectors.load_word2vec_format(
        'cc.en.300.vec', binary=False
    )

    return model, word_vectors

# ============================================
# Saving word vectors for deployment
# ============================================

def export_word_vectors(
    model,
    output_path: str,
    vocabulary: List[str] = None
):
    """
    Export word vectors to text format.
    Optionally filter to a specific vocabulary.
    """
    if vocabulary is None:
        vocabulary = model.words

    with open(output_path, 'w', encoding='utf-8') as f:
        # Write header: vocab_size dimension
        f.write(f"{len(vocabulary)} {model.get_dimension()}\n")

        for word in vocabulary:
            vec = model.get_word_vector(word)
            vec_str = ' '.join(f'{x:.6f}' for x in vec)
            f.write(f"{word} {vec_str}\n")

    print(f"Exported {len(vocabulary)} vectors to {output_path}")

# Example: export only the words we need
model = load_pretrained_fasttext('en')
task_vocabulary = ['machine', 'learning', 'algorithm', 'neural', 'network']
export_word_vectors(model, 'task_vectors.vec', task_vocabulary)
```

FastText provides two formats: .bin (full model with n-grams, ~7GB, supports OOV) and .vec (just word vectors, ~2GB, no OOV support). For applications needing OOV handling, use .bin.
For simpler applications with fixed vocabulary, .vec is more memory-efficient.
FastText represents an evolution of Word2Vec with specific strengths. Understanding when to use each is crucial for practical applications.
| Aspect | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Core representation | Word → Vector | Word → Vector | Word → Sum of n-gram vectors |
| OOV handling | ❌ No vector | ❌ No vector | ✅ Computed from n-grams |
| Morphological awareness | ❌ Limited | ❌ Limited | ✅ Via shared n-grams |
| Typo robustness | ❌ None | ❌ None | ✅ Shared n-grams |
| Model size | ~100MB-1GB | ~200MB-1GB | ~2-7GB (with n-grams) |
| Training speed | Fast | Fast | Moderate (more parameters) |
| Inference speed | Fast | Fast | Slower (n-gram lookup) |
| Pre-trained languages | Mostly English | Mostly English | 157 languages |
For most new NLP projects, FastText is the recommended default for word embeddings. The ability to handle OOV words is invaluable in practice, and the pre-trained models cover 157 languages. Only choose Word2Vec/GloVe when model size or inference speed is a binding constraint.
One of FastText's most valuable properties is its implicit understanding of word morphology. Because morphologically related words share n-grams, their embeddings are naturally similar.
Example: Morphological relationships captured through n-grams:
| Word Pair | Shared N-grams | Relationship |
|---|---|---|
| run, running | run | Base + inflection |
| happy, unhappy | hap, app, ppy | Root + prefix |
| teach, teacher | tea, eac, ach | Root + suffix |
| good, goodness | goo, ood | Root + suffix |
| write, writing | wri, rit | Base + inflection |
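These shared trigrams are easy to verify with plain string slicing (boundary markers are omitted here so the output matches the table):

```python
def char_trigrams(word: str) -> set:
    """All 3-character substrings of a word, without boundary markers."""
    return {word[i:i+3] for i in range(len(word) - 2)}

pairs = [("run", "running"), ("happy", "unhappy"), ("teach", "teacher"),
         ("good", "goodness"), ("write", "writing")]

for w1, w2 in pairs:
    shared = sorted(char_trigrams(w1) & char_trigrams(w2))
    print(f"{w1:>6} ∩ {w2:<9} -> {shared}")
```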
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumes `model` is a loaded FastText model (see the loading examples above).

def morphological_analysis(model, word_groups: list):
    """
    Analyze how well FastText captures morphological relationships.
    """
    for group_name, words in word_groups:
        print(f"{group_name}:")

        # Get embeddings
        embeddings = {w: model.get_word_vector(w) for w in words}

        # Show pairwise similarities
        for i, w1 in enumerate(words):
            for w2 in words[i+1:]:
                sim = cosine_similarity(
                    [embeddings[w1]], [embeddings[w2]]
                )[0, 0]
                print(f"  {w1} ↔ {w2}: {sim:.4f}")

# Analyze various morphological patterns
word_groups = [
    ("Verb conjugation", ["run", "runs", "running", "ran"]),
    ("Noun pluralization", ["cat", "cats", "dog", "dogs"]),
    ("Adjective gradation", ["happy", "happier", "happiest"]),
    ("Negation prefix", ["happy", "unhappy", "visible", "invisible"]),
    ("Nominalization", ["teach", "teacher", "teaching"]),
    ("Related concepts", ["king", "queen", "prince", "princess"]),
]

morphological_analysis(model, word_groups)

# Compare to Word2Vec/GloVe (which don't share subword info)
def compare_morphological_generalization():
    """
    Show how FastText generalizes better for morphological variants.
    """
    # FastText: OOV morphological variants still work
    base_word = "algorithm"
    variants = ["algorithms", "algorithmic", "algorithmically"]

    if base_word in model.words:
        base_vec = model.get_word_vector(base_word)
        print(f"Base word: {base_word}")
        for variant in variants:
            variant_vec = model.get_word_vector(variant)
            sim = cosine_similarity([base_vec], [variant_vec])[0, 0]
            in_vocab = variant in model.words
            print(f"  {variant}: sim={sim:.4f}, in_vocab={in_vocab}")

    # Even completely novel morphological forms work
    novel_forms = ["algorithmized", "algorithmwise", "prealgorithm"]
    print("Novel morphological forms (likely OOV):")
    base_vec = model.get_word_vector(base_word)
    for form in novel_forms:
        form_vec = model.get_word_vector(form)
        sim = cosine_similarity([base_vec], [form_vec])[0, 0]
        print(f"  {form}: sim={sim:.4f}")

compare_morphological_generalization()
```

Languages like Finnish, Turkish, and Hungarian have extensive morphology where a single word root can generate dozens of forms. Word2Vec would need to see all forms in training data. FastText captures the relationship implicitly: 'talot' (houses) shares n-grams with 'talo' (house) even if 'talot' wasn't in the training data.
FastText embeddings are particularly valuable for applications dealing with messy, real-world text:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Tuple

# Assumes `model` is a loaded FastText model (see the loading examples above).

# ============================================
# Spelling Correction via Similarity
# ============================================

def find_closest_words(
    query: str,
    vocabulary: List[str],
    model,
    top_k: int = 5
) -> List[Tuple[str, float]]:
    """
    Find vocabulary words most similar to a (possibly misspelled) query.
    """
    query_vec = model.get_word_vector(query)

    similarities = []
    for word in vocabulary:
        word_vec = model.get_word_vector(word)
        sim = cosine_similarity([query_vec], [word_vec])[0, 0]
        similarities.append((word, sim))

    return sorted(similarities, key=lambda x: -x[1])[:top_k]

# Example: typo correction
correct_words = ['receive', 'algorithm', 'machine', 'learning', 'neural']

print("Typo correction via FastText similarity:")
for typo in ['recieve', 'algoritm', 'machien', 'learing', 'nueral']:
    suggestions = find_closest_words(typo, correct_words, model)
    print(f"  {typo} → {suggestions[0][0]} ({suggestions[0][1]:.3f})")

# ============================================
# Hashtag Expansion
# ============================================

def expand_hashtag(hashtag: str, model, top_k: int = 5):
    """
    Find related words for a hashtag.
    FastText handles concatenated words via shared n-grams.
    """
    # Remove # if present
    tag = hashtag.lstrip('#').lower()
    return model.get_nearest_neighbors(tag, k=top_k)

print("Hashtag expansion:")
for tag in ['#MachineLearning', '#COVID19', '#GameOfThrones']:
    print(f"  {tag}:")
    for score, word in expand_hashtag(tag, model):
        print(f"    {word}: {score:.3f}")

# ============================================
# Cross-lingual Application (with aligned vectors)
# ============================================

def multilingual_search(
    query: str,
    documents: List[Tuple[str, str]],  # (text, language)
    models: dict,                      # language -> model mapping
    query_lang: str = 'en'
) -> List[Tuple[int, float]]:
    """
    Search documents in multiple languages using aligned embeddings.
    Note: Requires aligned/compatible embeddings for this to work well.
    """
    # Get query embedding
    query_model = models[query_lang]
    query_vec = query_model.get_sentence_vector(query)

    results = []
    for idx, (text, lang) in enumerate(documents):
        doc_model = models.get(lang, models['en'])  # Fallback to English
        doc_vec = doc_model.get_sentence_vector(text)
        sim = cosine_similarity([query_vec], [doc_vec])[0, 0]
        results.append((idx, sim))

    return sorted(results, key=lambda x: -x[1])

# ============================================
# Robust Text Classification
# ============================================

def robust_text_embedding(text: str, model, normalize: bool = True):
    """
    Create a robust text embedding that handles noise.
    FastText's OOV handling makes this more robust than Word2Vec.
    """
    # Minimal preprocessing - FastText handles messiness
    words = text.lower().split()

    # Get all word vectors (FastText always returns something)
    vectors = [model.get_word_vector(w) for w in words]

    if not vectors:
        return np.zeros(model.get_dimension())

    # Average
    embedding = np.mean(vectors, axis=0)

    if normalize:
        norm = np.linalg.norm(embedding)
        if norm > 0:
            embedding = embedding / norm

    return embedding

# Example: noisy text still works
clean_text = "The machine learning algorithm processes data"
noisy_text = "Teh machien lerning algortihm proceses dta"

vec_clean = robust_text_embedding(clean_text, model)
vec_noisy = robust_text_embedding(noisy_text, model)

sim = cosine_similarity([vec_clean], [vec_noisy])[0, 0]
print(f"Clean vs noisy text similarity: {sim:.4f}")
print("(High similarity shows robustness to typos)")
```

FastText represents an important evolution in word embeddings, addressing fundamental limitations of Word2Vec and GloVe. Let's consolidate the key insights:

- Words are represented as bags of character n-grams plus the whole word, so any string can be embedded.
- OOV words, misspellings, and novel morphological forms receive meaningful vectors computed from their n-grams.
- Morphologically related words share n-grams and therefore end up with similar embeddings.
- Pre-trained models cover 157 languages, with aligned vectors available for cross-lingual work.
- The price is larger models and somewhat slower training and inference than Word2Vec or GloVe.
What's next:
The final page covers Pre-trained Embeddings—practical strategies for using, selecting, and fine-tuning word embeddings from published sources. We'll compare available resources, discuss when to train custom embeddings versus using pre-trained ones, and explore techniques for domain adaptation.
You now have a comprehensive understanding of FastText—from the subword representation through training and usage to practical applications. FastText's ability to handle OOV words makes it particularly valuable for real-world NLP where messy, evolving text is the norm rather than the exception.