Word2Vec and GloVe share a fundamental limitation: they treat words as atomic units. If a word isn't in the vocabulary, it simply cannot be represented. This creates problems in practice: out-of-vocabulary (OOV) words such as new coinages, domain-specific terms, and proper nouns receive no vector at all; morphological variants ("run", "running", "runs") are learned independently even though they are closely related; and misspellings, which are pervasive in real-world text, are treated as unknown words.
FastText, developed by Facebook AI Research in 2016, elegantly solves these problems by representing words as bags of character n-grams. The word "where" is represented not just as "where" but as the n-grams "<wh", "whe", "her", "ere", "re>" (for n=3), plus the word itself. This enables embeddings for out-of-vocabulary words, shared information between morphologically related words, and robustness to typos.
By the end of this page, you will understand FastText's subword representation scheme, the modified Skip-gram objective incorporating n-grams, how to compute embeddings for OOV words, practical training and usage considerations, and when FastText outperforms traditional word embeddings. This knowledge is essential for handling real-world text with its inherent messiness.
FastText's key innovation is representing each word as a bag of character n-grams plus the word itself. This is accomplished through a simple yet powerful scheme.

The n-gram extraction process:

1. Wrap the word in boundary markers: "where" becomes "<where>".
2. Extract all character n-grams of every length from minn to maxn (defaults: 3 to 6).
3. Add the full word with markers ("<where>") as its own token.

Example for "where" with n=3: "<wh", "whe", "her", "ere", "re>", plus the whole-word token "<where>".
```python
from typing import Set


def extract_ngrams(
    word: str,
    minn: int = 3,
    maxn: int = 6,
    include_word: bool = True
) -> Set[str]:
    """
    Extract character n-grams for FastText representation.

    Args:
        word: Input word
        minn: Minimum n-gram length
        maxn: Maximum n-gram length
        include_word: Whether to include the full word with markers

    Returns:
        Set of character n-grams
    """
    # Add boundary markers
    word_with_markers = f"<{word}>"
    ngrams = set()

    # Extract n-grams for each length
    for n in range(minn, maxn + 1):
        for i in range(len(word_with_markers) - n + 1):
            ngram = word_with_markers[i:i+n]
            ngrams.add(ngram)

    # Optionally add full word
    if include_word:
        ngrams.add(word_with_markers)

    return ngrams


def show_ngram_decomposition(word: str, minn: int = 3, maxn: int = 6):
    """Visualize n-gram decomposition of a word."""
    print(f"Word: '{word}'")
    print(f"With markers: '<{word}>'")
    print(f"N-grams (n={minn} to {maxn}):")

    for n in range(minn, maxn + 1):
        ngrams = extract_ngrams(word, minn=n, maxn=n, include_word=False)
        print(f"  n={n}: {sorted(ngrams)}")

    all_ngrams = extract_ngrams(word, minn, maxn)
    print(f"Total unique n-grams: {len(all_ngrams)}")
    return all_ngrams


# Demonstrate for various words
for word in ["where", "happy", "unhappiness", "run", "running"]:
    show_ngram_decomposition(word)
    print()


# Show n-gram overlap between related words
def compute_ngram_overlap(word1: str, word2: str, minn: int = 3, maxn: int = 6):
    """Compute Jaccard overlap between n-gram sets."""
    ng1 = extract_ngrams(word1, minn, maxn)
    ng2 = extract_ngrams(word2, minn, maxn)

    intersection = ng1 & ng2
    union = ng1 | ng2
    jaccard = len(intersection) / len(union) if union else 0

    print(f"N-gram overlap between '{word1}' and '{word2}':")
    print(f"  Shared n-grams: {sorted(intersection)}")
    print(f"  Jaccard similarity: {jaccard:.3f}")
    return jaccard


# Related words have high overlap
compute_ngram_overlap("happy", "unhappy")
compute_ngram_overlap("happy", "happiness")
compute_ngram_overlap("running", "runner")

# Typos share many n-grams
compute_ngram_overlap("receive", "recieve")

# Unrelated words have low overlap
compute_ngram_overlap("happy", "algorithm")
```

The < and > markers distinguish prefixes, suffixes, and infixes. Without markers, the trigram "her" extracted from "where" would be indistinguishable from the word "her" itself. With markers, "<her" (prefix), "her>" (suffix), and "her" (infix) are all distinct n-grams, preserving positional information.
FastText modifies Word2Vec's Skip-gram objective to incorporate subword information. The key change: a word's embedding is the sum of its n-gram embeddings plus a whole-word embedding.
Modified scoring function:
For Word2Vec Skip-gram, the score for a (center, context) pair is:
$$s(w, c) = u_c^T v_w$$
For FastText, this becomes:
$$s(w, c) = u_c^T \left( v_w + \sum_{g \in G(w)} z_g \right)$$
where $v_w$ is the whole-word vector of the center word $w$, $G(w)$ is the set of character n-grams extracted from $w$, $z_g$ is the embedding of n-gram $g$, and $u_c$ is the context vector of word $c$.
The complete objective (with negative sampling):
$$\sum_{t=1}^{T} \sum_{c \in C_t} \left[ \log \sigma(s(w_t, c)) + \sum_{n \in N_{t,c}} \log \sigma(-s(w_t, n)) \right]$$
where $C_t$ is the set of context words at position $t$ and $N_{t,c}$ is the set of negative samples drawn for the pair $(w_t, c)$.
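As a concrete illustration, here is a minimal NumPy sketch of this scoring function and the negative-sampling term for a single (center, context) pair. The toy vectors below are random placeholders standing in for trained parameters; a real implementation would look n-gram vectors up in the hashed bucket table described later.

```python
import numpy as np

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + np.exp(-x))

def fasttext_score(word_vec, ngram_vecs, context_vec) -> float:
    """s(w, c) = u_c^T (v_w + sum of n-gram vectors z_g)."""
    hidden = word_vec + sum(ngram_vecs)
    return float(context_vec @ hidden)

# Toy parameters (in practice these come from training)
rng = np.random.default_rng(0)
dim = 8
v_w = rng.normal(size=dim)                          # whole-word vector of the center word
z_g = [rng.normal(size=dim) for _ in range(5)]      # vectors of its character n-grams
u_pos = rng.normal(size=dim)                        # context vector of a true context word
u_negs = [rng.normal(size=dim) for _ in range(3)]   # context vectors of negative samples

# Negative-sampling objective for this pair:
# log sigma(s(w, c)) + sum over negatives of log sigma(-s(w, n))
objective = np.log(sigmoid(fasttext_score(v_w, z_g, u_pos)))
objective += sum(np.log(sigmoid(-fasttext_score(v_w, z_g, u_n))) for u_n in u_negs)
print(f"Objective contribution for this pair: {objective:.4f}")
```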
```python
import numpy as np
from typing import Dict, Set


class FastTextEmbedding:
    """
    Demonstrates how FastText computes word embeddings.
    """

    def __init__(
        self,
        ngram_vectors: Dict[str, np.ndarray],  # n-gram embeddings
        word_vectors: Dict[str, np.ndarray],   # whole-word embeddings
        embedding_dim: int,
        minn: int = 3,
        maxn: int = 6
    ):
        self.ngram_vectors = ngram_vectors
        self.word_vectors = word_vectors
        self.embedding_dim = embedding_dim
        self.minn = minn
        self.maxn = maxn

    def extract_ngrams(self, word: str) -> Set[str]:
        """Extract n-grams for a word."""
        word_with_markers = f"<{word}>"
        ngrams = set()
        for n in range(self.minn, self.maxn + 1):
            for i in range(len(word_with_markers) - n + 1):
                ngrams.add(word_with_markers[i:i+n])
        return ngrams

    def get_word_vector(self, word: str) -> np.ndarray:
        """
        Compute word embedding as sum of n-gram vectors + word vector.
        This is the key FastText operation.
        """
        vector = np.zeros(self.embedding_dim)
        count = 0

        # Add whole-word embedding if available
        if word in self.word_vectors:
            vector += self.word_vectors[word]
            count += 1

        # Add n-gram embeddings
        ngrams = self.extract_ngrams(word)
        for ngram in ngrams:
            if ngram in self.ngram_vectors:
                vector += self.ngram_vectors[ngram]
                count += 1

        # Normalize by count (optional; some implementations average)
        if count > 0:
            vector /= count

        return vector

    def is_in_vocabulary(self, word: str) -> bool:
        """Check if word has a direct embedding."""
        return word in self.word_vectors

    def get_oov_coverage(self, word: str) -> float:
        """How many of the word's n-grams are covered?"""
        ngrams = self.extract_ngrams(word)
        covered = sum(1 for ng in ngrams if ng in self.ngram_vectors)
        return covered / len(ngrams) if ngrams else 0.0


# Demonstrate OOV vector computation
def demonstrate_oov_handling():
    """Show how FastText handles OOV words."""
    # Simulated trained model (in practice, use the fasttext library)
    vocab = ['where', 'there', 'here', 'somewhere', 'anywhere', 'everywhere']

    # Pretend we have trained embeddings
    np.random.seed(42)
    dim = 100

    # Build mock embeddings
    word_vectors = {w: np.random.randn(dim) for w in vocab}

    # Build n-gram vocabulary from training words
    all_ngrams = set()
    for word in vocab:
        word_with_markers = f"<{word}>"
        for n in range(3, 7):
            for i in range(len(word_with_markers) - n + 1):
                all_ngrams.add(word_with_markers[i:i+n])

    ngram_vectors = {ng: np.random.randn(dim) for ng in all_ngrams}

    print(f"Vocabulary size: {len(vocab)}")
    print(f"N-gram vocabulary size: {len(ngram_vectors)}")

    # Create embedder
    embedder = FastTextEmbedding(
        ngram_vectors, word_vectors, dim, minn=3, maxn=6
    )

    # Test on in-vocabulary word
    print("In-vocabulary word 'where':")
    vec = embedder.get_word_vector('where')
    print(f"  Vector norm: {np.linalg.norm(vec):.4f}")

    # Test on OOV word that shares n-grams
    print("OOV word 'nowhere' (shares n-grams with vocab):")
    print(f"  Is in vocabulary: {embedder.is_in_vocabulary('nowhere')}")
    print(f"  N-gram coverage: {embedder.get_oov_coverage('nowhere'):.1%}")
    vec = embedder.get_word_vector('nowhere')
    print(f"  Vector norm: {np.linalg.norm(vec):.4f}")

    # Test on completely foreign word
    print("OOV word 'xyzzy' (few shared n-grams):")
    print(f"  N-gram coverage: {embedder.get_oov_coverage('xyzzy'):.1%}")
    vec = embedder.get_word_vector('xyzzy')
    print(f"  Vector norm: {np.linalg.norm(vec):.4f}")


demonstrate_oov_handling()
```

You might worry that storing embeddings for every n-gram is expensive. FastText addresses this with hashing: n-grams are hashed to a fixed number of buckets (default 2 million), and bucket embeddings are learned.
This means multiple n-grams may share an embedding, but in practice the collision rate is low enough not to hurt quality significantly.
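For intuition, here is a sketch of how n-grams might be mapped to bucket indices. The FNV-1a-style hash below mirrors the scheme commonly attributed to the reference implementation, but treat the exact constants as an assumption; the point is simply that each n-gram deterministically selects one of `bucket` rows in a shared embedding matrix, so the parameter count stays fixed no matter how many distinct n-grams occur.

```python
def ngram_hash(ngram: str, bucket: int = 2_000_000) -> int:
    """Map an n-gram to a bucket index with an FNV-1a-style 32-bit hash.
    (Exact constants are an assumption; the idea is a fixed-size table.)"""
    h = 2166136261
    for byte in ngram.encode('utf-8'):
        h = ((h ^ byte) * 16777619) & 0xFFFFFFFF
    return h % bucket

# Each n-gram deterministically indexes one row of a (bucket, dim) embedding
# matrix, so distinct n-grams may collide but memory stays bounded.
for ng in ["<wh", "whe", "her", "ere", "re>", "<where>"]:
    print(f"{ng!r} -> bucket {ngram_hash(ng)}")
```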
Training FastText is similar to Word2Vec, with additional parameters controlling the subword representation:
```python
import os
import tempfile
from typing import List

import fasttext
import numpy as np


def train_fasttext_skipgram(
    sentences: List[str],
    model_path: str = 'fasttext_model',
    dim: int = 100,
    epoch: int = 5,
    lr: float = 0.05,
    window: int = 5,
    minn: int = 3,
    maxn: int = 6,
    min_count: int = 5,
    neg: int = 5,
    bucket: int = 2000000
) -> fasttext.FastText._FastText:
    """
    Train a FastText Skip-gram model.

    Args:
        sentences: List of sentences (strings)
        model_path: Where to save the model
        dim: Embedding dimension
        epoch: Number of training epochs
        lr: Learning rate
        window: Context window size
        minn: Minimum n-gram length
        maxn: Maximum n-gram length
        min_count: Ignore words with fewer occurrences
        neg: Number of negative samples
        bucket: Number of hash buckets for n-grams

    Returns:
        Trained FastText model
    """
    # FastText requires a file input
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as f:
        for sentence in sentences:
            f.write(sentence + '\n')
        temp_path = f.name

    try:
        # Train the model
        model = fasttext.train_unsupervised(
            temp_path,
            model='skipgram',  # or 'cbow'
            dim=dim,
            epoch=epoch,
            lr=lr,
            ws=window,
            minn=minn,
            maxn=maxn,
            minCount=min_count,
            neg=neg,
            bucket=bucket,
            thread=4
        )

        # Save model
        model.save_model(f'{model_path}.bin')
        print(f"Model saved to {model_path}.bin")

        return model
    finally:
        os.unlink(temp_path)


def train_fasttext_cbow(sentences: List[str], **kwargs):
    """Train FastText with the CBOW architecture (also uses subword n-grams)."""
    with tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt') as f:
        for sentence in sentences:
            f.write(sentence + '\n')
        temp_path = f.name

    try:
        model = fasttext.train_unsupervised(
            temp_path,
            model='cbow',
            dim=kwargs.get('dim', 100),
            epoch=kwargs.get('epoch', 5),
            minn=kwargs.get('minn', 3),
            maxn=kwargs.get('maxn', 6)
        )
        return model
    finally:
        os.unlink(temp_path)


# Example training
corpus = [
    "the quick brown fox jumps over the lazy dog",
    "machine learning algorithms process data efficiently",
    "neural networks learn patterns from large datasets",
    "natural language processing enables machines to understand text",
    # ... more sentences for real training
]

model = train_fasttext_skipgram(
    corpus,
    dim=100,
    epoch=10,
    minn=3,
    maxn=6
)

# Test the model
print("Word vector for 'machine':")
vec = model.get_word_vector('machine')
print(f"  Shape: {vec.shape}, Norm: {np.linalg.norm(vec):.4f}")

# OOV word
print("Word vector for 'machinelearning' (OOV):")
vec_oov = model.get_word_vector('machinelearning')
print(f"  Shape: {vec_oov.shape}, Norm: {np.linalg.norm(vec_oov):.4f}")

# Nearest neighbors
print("Nearest neighbors of 'learning':")
neighbors = model.get_nearest_neighbors('learning', k=5)
for score, word in neighbors:
    print(f"  {word}: {score:.4f}")
```

| Parameter | Default | Description | Tuning Guidance |
|---|---|---|---|
| dim | 100 | Embedding dimension | 100-300; higher for larger corpora |
| minn | 3 | Minimum n-gram length | 2-3 for morphologically rich languages |
| maxn | 6 | Maximum n-gram length | 5-6 typically; lower reduces model size |
| bucket | 2000000 | Hash buckets for n-grams | Reduce for smaller models; increase for large vocab |
| epoch | 5 | Training epochs | 5-10 for large corpora; more for small |
| minCount | 5 | Minimum word frequency | 1-5; lower includes more rare words |
FastText models are larger than Word2Vec because they store n-gram embeddings. A typical 300-dim FastText model can be 2-7GB depending on bucket size. For deployment, consider: (1) reducing bucket count, (2) using quantization (model.quantize(), supported for supervised models), or (3) extracting only word vectors for known vocabulary.
FastText provides several APIs for accessing and using embeddings:
```python
import fasttext
import numpy as np
from typing import List

# Load pre-trained or custom model
model = fasttext.load_model('path/to/model.bin')

# ============================================
# Word Vectors
# ============================================

def get_word_embedding(word: str) -> np.ndarray:
    """Get embedding for any word (in-vocab or OOV)."""
    return model.get_word_vector(word)

# Works for in-vocabulary words
vec_king = get_word_embedding('king')

# Works for OOV words too!
vec_coronavirus = get_word_embedding('coronavirus')  # OOV in older models
vec_misspelled = get_word_embedding('recieve')       # Typo still gets an embedding

print(f"'king' vector norm: {np.linalg.norm(vec_king):.4f}")
print(f"'coronavirus' (OOV) vector norm: {np.linalg.norm(vec_coronavirus):.4f}")

# ============================================
# Document Embeddings via Averaging
# ============================================

def get_document_embedding(
    text: str,
    model: fasttext.FastText._FastText,
    normalize: bool = True
) -> np.ndarray:
    """
    Compute document embedding by averaging word embeddings.
    FastText handles OOV words, so we can include all tokens.
    """
    words = text.lower().split()
    if not words:
        return np.zeros(model.get_dimension())

    # Get embeddings for all words (including OOV)
    word_vectors = [model.get_word_vector(word) for word in words]

    # Average
    doc_vector = np.mean(word_vectors, axis=0)

    if normalize:
        norm = np.linalg.norm(doc_vector)
        if norm > 0:
            doc_vector = doc_vector / norm

    return doc_vector

# Example
doc1 = "Machine learning algorithms process large datasets"
doc2 = "Deep neural networks learn patterns from data"
doc3 = "The cat sat on the mat"

vec1 = get_document_embedding(doc1, model)
vec2 = get_document_embedding(doc2, model)
vec3 = get_document_embedding(doc3, model)

# Compute similarities
from sklearn.metrics.pairwise import cosine_similarity

sim_12 = cosine_similarity([vec1], [vec2])[0, 0]
sim_13 = cosine_similarity([vec1], [vec3])[0, 0]
print(f"Similarity(ML, DL): {sim_12:.4f}")
print(f"Similarity(ML, cat): {sim_13:.4f}")

# ============================================
# Sentence Vectors (Built-in)
# ============================================

def get_sentence_embedding(sentence: str) -> np.ndarray:
    """
    FastText's built-in sentence embedding.
    Note: Primarily designed for supervised models,
    but works for unsupervised models too.
    """
    return model.get_sentence_vector(sentence)

# ============================================
# Similarity Queries
# ============================================

def find_similar_words(word: str, k: int = 10) -> List[tuple]:
    """Find k most similar words using cosine similarity."""
    return model.get_nearest_neighbors(word, k=k)

# Find similar words
print("Words similar to 'machine':")
for score, word in find_similar_words('machine', 5):
    print(f"  {word}: {score:.4f}")

# Even works for OOV queries
print("Words similar to 'machinelearning' (OOV):")
for score, word in find_similar_words('machinelearning', 5):
    print(f"  {word}: {score:.4f}")

# ============================================
# Analogy Queries
# ============================================

def analogy(a: str, b: str, c: str, k: int = 5) -> List[tuple]:
    """
    Solve analogy: a is to b as c is to ?
    Returns (score, word) tuples.
    """
    return model.get_analogies(a, b, c, k=k)

# Classic analogy
print("king - man + woman = ?")
for score, word in analogy('king', 'man', 'woman'):
    print(f"  {word}: {score:.4f}")
```

FastText also supports very fast text classification.
The supervised model (fasttext.train_supervised) can train on millions of examples in seconds and achieves competitive accuracy for many classification tasks. It's an excellent baseline for text classification with minimal preprocessing.
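A minimal sketch of that workflow, assuming the standard `__label__` training-file format; the two training examples below are placeholders, not a real dataset:

```python
import os
import tempfile

import fasttext

# train_supervised expects one example per line: "__label__<tag> <text>"
examples = [
    "__label__positive this library is fast and easy to use",
    "__label__negative the documentation was confusing and incomplete",
]

with tempfile.NamedTemporaryFile(mode='w', suffix='.txt', delete=False) as f:
    f.write('\n'.join(examples) + '\n')
    train_path = f.name

try:
    # Train a small classifier; hyperparameters here are illustrative only
    clf = fasttext.train_supervised(train_path, epoch=25, lr=1.0, wordNgrams=2)
    labels, probs = clf.predict("fast and easy", k=1)
    print(labels, probs)
finally:
    os.unlink(train_path)
```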
Facebook provides pre-trained FastText vectors for 157 languages—a major advantage over Word2Vec and GloVe which are primarily available for English.
Available pre-trained models:
| Model | Languages | Corpus | Dimensions |
|---|---|---|---|
| Wiki word vectors | 294 | Wikipedia | 300 |
| Common Crawl | 157 | Common Crawl + Wikipedia | 300 |
| Wiki + CC (aligned) | 44 | Combined | 300 |
Aligned word vectors are special: different languages are aligned to a common vector space, enabling cross-lingual applications.
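As a sketch of how aligned vectors can be used, the snippet below loads two aligned `.vec` files with gensim and compares words across languages directly. The `wiki.en.align.vec` / `wiki.es.align.vec` file names are assumptions based on the aligned-vectors download page; adjust them to whatever you actually download.

```python
import numpy as np
from gensim.models import KeyedVectors

# Assumed file names for the aligned vectors
en = KeyedVectors.load_word2vec_format('wiki.en.align.vec', binary=False)
es = KeyedVectors.load_word2vec_format('wiki.es.align.vec', binary=False)

def cross_lingual_similarity(word_en: str, word_es: str) -> float:
    """Cosine similarity across languages; meaningful only for aligned spaces."""
    a, b = en[word_en], es[word_es]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cross_lingual_similarity('cat', 'gato'))   # translation pair: should be high
print(cross_lingual_similarity('cat', 'coche'))  # 'car' in Spanish: should be lower
```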
```python
import fasttext
import fasttext.util
import numpy as np
from typing import List

# ============================================
# Download and load pre-trained vectors
# ============================================

def load_pretrained_fasttext(lang: str = 'en') -> fasttext.FastText._FastText:
    """
    Download and load a pre-trained FastText model for a language.

    Language codes: 'en', 'es', 'fr', 'de', 'zh', 'ja', 'ko', 'ru', etc.
    Full list at https://fasttext.cc/docs/en/crawl-vectors.html
    """
    # Download (around 7GB for the full model)
    fasttext.util.download_model(lang, if_exists='ignore')

    # Load
    model = fasttext.load_model(f'cc.{lang}.300.bin')

    print(f"Loaded {lang} model:")
    print(f"  Vocabulary size: {len(model.words)}")
    print(f"  Embedding dimension: {model.get_dimension()}")

    return model

# ============================================
# Reduce dimensions for faster inference
# ============================================

def reduce_dimensions(model, target_dim: int = 100):
    """
    Reduce embedding dimension using PCA.
    Useful for reducing memory and speeding up computation.
    """
    fasttext.util.reduce_model(model, target_dim)
    print(f"Reduced to {model.get_dimension()} dimensions")
    return model

# ============================================
# Multilingual usage
# ============================================

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def multilingual_demo():
    """Demonstrate multilingual capabilities."""
    # Load English and Spanish models
    en_model = load_pretrained_fasttext('en')
    es_model = load_pretrained_fasttext('es')

    print("Within-language similarities work normally:")
    print(f"  English: cat ↔ dog = "
          f"{cosine(en_model.get_word_vector('cat'), en_model.get_word_vector('dog')):.4f}")
    print(f"  Spanish: gato ↔ perro = "
          f"{cosine(es_model.get_word_vector('gato'), es_model.get_word_vector('perro')):.4f}")

    # Note: Cross-lingual comparison requires aligned vectors
    print("For cross-lingual similarity, use aligned word vectors")

# ============================================
# Via gensim (alternative)
# ============================================

def load_via_gensim():
    """Load FastText vectors via gensim (more memory-efficient)."""
    from gensim.models.fasttext import load_facebook_model
    from gensim.models import KeyedVectors

    # Load from Facebook .bin file (full model, supports OOV)
    model = load_facebook_model('cc.en.300.bin')

    # Or load just the word vectors (smaller, no OOV support)
    word_vectors = KeyedVectors.load_word2vec_format(
        'cc.en.300.vec', binary=False
    )

    return model, word_vectors

# ============================================
# Saving word vectors for deployment
# ============================================

def export_word_vectors(
    model,
    output_path: str,
    vocabulary: List[str] = None
):
    """
    Export word vectors to text format.
    Optionally filter to a specific vocabulary.
    """
    if vocabulary is None:
        vocabulary = model.words

    with open(output_path, 'w', encoding='utf-8') as f:
        # Write header: vocab_size dimension
        f.write(f"{len(vocabulary)} {model.get_dimension()}\n")

        for word in vocabulary:
            vec = model.get_word_vector(word)
            vec_str = ' '.join(f'{x:.6f}' for x in vec)
            f.write(f"{word} {vec_str}\n")

    print(f"Exported {len(vocabulary)} vectors to {output_path}")

# Example: export only the words we need
model = load_pretrained_fasttext('en')
task_vocabulary = ['machine', 'learning', 'algorithm', 'neural', 'network']
export_word_vectors(model, 'task_vectors.vec', task_vocabulary)
```

FastText provides two formats: .bin (full model with n-grams, ~7GB, supports OOV) and .vec (just word vectors, ~2GB, no OOV support). For applications needing OOV handling, use .bin.
For simpler applications with fixed vocabulary, .vec is more memory-efficient.
FastText represents an evolution of Word2Vec with specific strengths. Understanding when to use each is crucial for practical applications.
| Aspect | Word2Vec | GloVe | FastText |
|---|---|---|---|
| Core representation | Word → Vector | Word → Vector | Word → Sum of n-gram vectors |
| OOV handling | ❌ No vector | ❌ No vector | ✅ Computed from n-grams |
| Morphological awareness | ❌ Limited | ❌ Limited | ✅ Via shared n-grams |
| Typo robustness | ❌ None | ❌ None | ✅ Shared n-grams |
| Model size | ~100MB-1GB | ~200MB-1GB | ~2-7GB (with n-grams) |
| Training speed | Fast | Fast | Moderate (more parameters) |
| Inference speed | Fast | Fast | Slower (n-gram lookup) |
| Pre-trained languages | Mostly English | Mostly English | 157 languages |
For most new NLP projects, FastText is the recommended default for word embeddings. The ability to handle OOV words is invaluable in practice, and the pre-trained models cover 157 languages. Only choose Word2Vec/GloVe when model size or inference speed is a binding constraint.
One of FastText's most valuable properties is its implicit understanding of word morphology. Because morphologically related words share n-grams, their embeddings are naturally similar.
Example: Morphological relationships captured through n-grams:
| Word Pair | Shared N-grams | Relationship |
|---|---|---|
| run, running | run | Base + inflection |
| happy, unhappy | hap, app, ppy | Root + prefix |
| teach, teacher | tea, eac, ach | Root + suffix |
| good, goodness | goo, ood | Root + suffix |
| write, writing | wri, rit | Base + inflection |
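These shared trigrams are easy to verify with plain string slicing (boundary markers are omitted here so the output matches the table):

```python
def char_trigrams(word: str) -> set:
    """All 3-character substrings of a word, without boundary markers."""
    return {word[i:i+3] for i in range(len(word) - 2)}

pairs = [("run", "running"), ("happy", "unhappy"), ("teach", "teacher"),
         ("good", "goodness"), ("write", "writing")]

for w1, w2 in pairs:
    shared = sorted(char_trigrams(w1) & char_trigrams(w2))
    print(f"{w1:>6} ∩ {w2:<9} -> {shared}")
```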
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Assumes `model` is a loaded FastText model (see the loading examples above).

def morphological_analysis(model, word_groups: list):
    """
    Analyze how well FastText captures morphological relationships.
    """
    for group_name, words in word_groups:
        print(f"{group_name}:")

        # Get embeddings
        embeddings = {w: model.get_word_vector(w) for w in words}

        # Show pairwise similarities
        for i, w1 in enumerate(words):
            for w2 in words[i+1:]:
                sim = cosine_similarity(
                    [embeddings[w1]], [embeddings[w2]]
                )[0, 0]
                print(f"  {w1} ↔ {w2}: {sim:.4f}")

# Analyze various morphological patterns
word_groups = [
    ("Verb conjugation", ["run", "runs", "running", "ran"]),
    ("Noun pluralization", ["cat", "cats", "dog", "dogs"]),
    ("Adjective gradation", ["happy", "happier", "happiest"]),
    ("Negation prefix", ["happy", "unhappy", "visible", "invisible"]),
    ("Nominalization", ["teach", "teacher", "teaching"]),
    ("Related concepts", ["king", "queen", "prince", "princess"]),
]

morphological_analysis(model, word_groups)

# Compare to Word2Vec/GloVe (which don't share subword info)
def compare_morphological_generalization():
    """
    Show how FastText generalizes better for morphological variants.
    """
    # FastText: OOV morphological variants still work
    base_word = "algorithm"
    variants = ["algorithms", "algorithmic", "algorithmically"]

    if base_word in model.words:
        base_vec = model.get_word_vector(base_word)
        print(f"Base word: {base_word}")
        for variant in variants:
            variant_vec = model.get_word_vector(variant)
            sim = cosine_similarity([base_vec], [variant_vec])[0, 0]
            in_vocab = variant in model.words
            print(f"  {variant}: sim={sim:.4f}, in_vocab={in_vocab}")

    # Even completely novel morphological forms work
    novel_forms = ["algorithmized", "algorithmwise", "prealgorithm"]
    print("Novel morphological forms (likely OOV):")
    base_vec = model.get_word_vector(base_word)
    for form in novel_forms:
        form_vec = model.get_word_vector(form)
        sim = cosine_similarity([base_vec], [form_vec])[0, 0]
        print(f"  {form}: sim={sim:.4f}")

compare_morphological_generalization()
```

Languages like Finnish, Turkish, and Hungarian have extensive morphology where a single word root can generate dozens of forms. Word2Vec would need to see all forms in training data. FastText captures the relationship implicitly: 'talot' (houses) shares n-grams with 'talo' (house) even if 'talot' wasn't in the training data.
FastText embeddings are particularly valuable for applications dealing with messy, real-world text:
```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from typing import List, Tuple

# Assumes `model` is a loaded FastText model (see the loading examples above).

# ============================================
# Spelling Correction via Similarity
# ============================================

def find_closest_words(
    query: str,
    vocabulary: List[str],
    model,
    top_k: int = 5
) -> List[Tuple[str, float]]:
    """
    Find vocabulary words most similar to a (possibly misspelled) query.
    """
    query_vec = model.get_word_vector(query)

    similarities = []
    for word in vocabulary:
        word_vec = model.get_word_vector(word)
        sim = cosine_similarity([query_vec], [word_vec])[0, 0]
        similarities.append((word, sim))

    return sorted(similarities, key=lambda x: -x[1])[:top_k]

# Example: typo correction
correct_words = ['receive', 'algorithm', 'machine', 'learning', 'neural']

print("Typo correction via FastText similarity:")
for typo in ['recieve', 'algoritm', 'machien', 'learing', 'nueral']:
    suggestions = find_closest_words(typo, correct_words, model)
    print(f"  {typo} → {suggestions[0][0]} ({suggestions[0][1]:.3f})")

# ============================================
# Hashtag Expansion
# ============================================

def expand_hashtag(hashtag: str, model, top_k: int = 5):
    """
    Find related words for a hashtag.
    FastText handles concatenated words via shared n-grams.
    """
    # Remove # if present
    tag = hashtag.lstrip('#').lower()
    return model.get_nearest_neighbors(tag, k=top_k)

print("Hashtag expansion:")
for tag in ['#MachineLearning', '#COVID19', '#GameOfThrones']:
    print(f"  {tag}:")
    for score, word in expand_hashtag(tag, model):
        print(f"    {word}: {score:.3f}")

# ============================================
# Cross-lingual Application (with aligned vectors)
# ============================================

def multilingual_search(
    query: str,
    documents: List[Tuple[str, str]],  # (text, language)
    models: dict,                      # language -> model mapping
    query_lang: str = 'en'
) -> List[Tuple[int, float]]:
    """
    Search documents in multiple languages using aligned embeddings.
    Note: Requires aligned/compatible embeddings for this to work well.
    """
    # Get query embedding
    query_model = models[query_lang]
    query_vec = query_model.get_sentence_vector(query)

    results = []
    for idx, (text, lang) in enumerate(documents):
        doc_model = models.get(lang, models['en'])  # Fallback to English
        doc_vec = doc_model.get_sentence_vector(text)
        sim = cosine_similarity([query_vec], [doc_vec])[0, 0]
        results.append((idx, sim))

    return sorted(results, key=lambda x: -x[1])

# ============================================
# Robust Text Classification
# ============================================

def robust_text_embedding(text: str, model, normalize: bool = True):
    """
    Create a robust text embedding that handles noise.
    FastText's OOV handling makes this more robust than Word2Vec.
    """
    # Minimal preprocessing - FastText handles messiness
    words = text.lower().split()

    # Get all word vectors (FastText always returns something)
    vectors = [model.get_word_vector(w) for w in words]

    if not vectors:
        return np.zeros(model.get_dimension())

    # Average
    embedding = np.mean(vectors, axis=0)

    if normalize:
        norm = np.linalg.norm(embedding)
        if norm > 0:
            embedding = embedding / norm

    return embedding

# Example: noisy text still works
clean_text = "The machine learning algorithm processes data"
noisy_text = "Teh machien lerning algortihm proceses dta"

vec_clean = robust_text_embedding(clean_text, model)
vec_noisy = robust_text_embedding(noisy_text, model)

sim = cosine_similarity([vec_clean], [vec_noisy])[0, 0]
print(f"Clean vs noisy text similarity: {sim:.4f}")
print("(High similarity shows robustness to typos)")
```

FastText represents an important evolution in word embeddings, addressing fundamental limitations of Word2Vec and GloVe. Let's consolidate the key insights:

- Words are represented as bags of character n-grams plus the whole word, so any string can be embedded.
- OOV words, misspellings, and novel morphological forms receive meaningful vectors computed from their n-grams.
- Morphologically related words share n-grams and therefore end up with similar embeddings.
- Pre-trained models cover 157 languages, with aligned vectors available for cross-lingual work.
- The price is larger models and somewhat slower training and inference than Word2Vec or GloVe.
What's next:
The final page covers Pre-trained Embeddings—practical strategies for using, selecting, and fine-tuning word embeddings from published sources. We'll compare available resources, discuss when to train custom embeddings versus using pre-trained ones, and explore techniques for domain adaptation.
You now have a comprehensive understanding of FastText—from the subword representation through training and usage to practical applications. FastText's ability to handle OOV words makes it particularly valuable for real-world NLP where messy, evolving text is the norm rather than the exception.