In written English, the same word can appear in multiple forms depending on its position and role in a sentence: "The" at a sentence start, "the" mid-sentence, and "THE" in headlines or for emphasis.
To a human, "The," "the," and "THE" are identical. To a computer performing literal string comparison, they're three distinct tokens. Without normalization, your vocabulary balloons with redundant entries, and semantically identical documents may appear dissimilar.
Case normalization (also called case folding) converts all text to a uniform case—typically lowercase—to ensure consistent representation. While seemingly trivial, this transformation has significant implications for vocabulary size, matching accuracy, and semantic preservation.
By the end of this page, you will understand when case normalization helps and when it hurts, implement case normalization correctly for different scripts, handle language-specific edge cases like Turkish 'I', and make informed decisions about case handling in your NLP pipeline.
Understanding when and why case affects NLP systems helps you make informed normalization decisions.
Information Carried by Case:
In English and many other languages, capitalization serves multiple functions: marking sentence boundaries, distinguishing proper nouns from common nouns, forming acronyms, and signaling emphasis.
The Vocabulary Explosion Problem:
Without case normalization, your vocabulary contains separate entries for every observed variant: "The", "the", and "THE"; "Apple" and "apple"; "NASA" and "nasa".
In a typical corpus, 20-40% of vocabulary entries may be case variants.
```python
from typing import List
import re


def analyze_case_impact(texts: List[str]) -> dict:
    """Analyze the impact of case on vocabulary size."""
    # Tokenize simply (word characters between boundaries)
    all_tokens = []
    for text in texts:
        tokens = re.findall(r'\b\w+\b', text)
        all_tokens.extend(tokens)

    # Original vs. lowercased vocabulary
    original_vocab = set(all_tokens)
    lowered_vocab = {t.lower() for t in all_tokens}

    # Group case variants under their lowercase form
    case_groups = {}
    for token in original_vocab:
        case_groups.setdefault(token.lower(), []).append(token)

    # Tokens observed with multiple case variants
    multi_case = {k: v for k, v in case_groups.items() if len(v) > 1}

    return {
        "original_vocab_size": len(original_vocab),
        "lowered_vocab_size": len(lowered_vocab),
        "reduction_percent": (1 - len(lowered_vocab) / len(original_vocab)) * 100,
        "tokens_with_variants": len(multi_case),
        "sample_variants": dict(list(multi_case.items())[:10]),
    }


# Sample corpus
corpus = [
    "The Machine Learning course covers machine learning fundamentals.",
    "MACHINE LEARNING is transforming industries.",
    "Apple released a new product. I ate an apple today.",
    "The United States of America is often called the US or USA.",
    "Python and python are both important in programming.",
    "NASA announced a new mission. nasa is trending on Twitter.",
]

results = analyze_case_impact(corpus)

print("=== Case Impact Analysis ===\n")
print(f"Original vocabulary size:   {results['original_vocab_size']}")
print(f"Lowercased vocabulary size: {results['lowered_vocab_size']}")
print(f"Vocabulary reduction:       {results['reduction_percent']:.1f}%")
print(f"Tokens with case variants:  {results['tokens_with_variants']}")

print("\nSample case variants:")
for lower, variants in results['sample_variants'].items():
    print(f"  '{lower}' appears as: {variants}")
```

Typical vocabulary reduction varies by domain:

| Domain | Typical Reduction | Special Considerations |
|---|---|---|
| News articles | 15-25% | Many proper nouns, acronyms |
| Social media | 30-40% | Heavy use of caps for emphasis |
| Scientific papers | 10-15% | Technical terms, gene names |
| Code/Technical | 5-10% | Case-sensitive identifiers |
| Legal documents | 10-15% | Proper nouns, defined terms |
Case normalization isn't always appropriate. In several scenarios, case carries critical semantic information that lowercasing would destroy.
Named Entity Recognition (NER):
Capitalization is a strong signal for proper nouns. "apple" (fruit) vs. "Apple" (company) are disambiguated by case. Lowercasing before NER removes this valuable feature.
Acronyms and Abbreviations:
"IT" (Information Technology) vs. "it" (pronoun) differ only by case. "US" (United States) vs. "us" (pronoun) would be conflated.
Case-Sensitive Languages and Contexts:
In code and other case-sensitive contexts, identifiers differ by case alone: myFunction ≠ myfunction. Lowercasing such text changes its meaning outright.
"""Examples where case normalization causes semantic loss.""" semantic_case_examples = [ # (lowercase, uppercase, explanation) ("apple", "Apple", "Fruit vs. technology company"), ("china", "China", "Porcelain vs. country"), ("it", "IT", "Pronoun vs. Information Technology"), ("us", "US", "Pronoun vs. United States"), ("polish", "Polish", "To clean vs. from Poland"), ("march", "March", "To walk vs. month"), ("may", "May", "Possibility vs. month/name"), ("will", "Will", "Future tense vs. name/legal document"), ("job", "Job", "Employment vs. biblical figure"), ("mercury", "Mercury", "Element vs. planet/god"), ("nice", "Nice", "Pleasant vs. French city"), ("turkey", "Turkey", "Bird vs. country"),] print("=== Case-Sensitive Semantic Distinctions ===\n")print(f"{'Lowercase':<12} {'Uppercase':<12} {'Distinction'}")print("-" * 60)for lower, upper, distinction in semantic_case_examples: print(f"{lower:<12} {upper:<12} {distinction}") # Demonstrate impact on NERprint("\n=== Impact on Named Entity Recognition ===\n") import spacynlp = spacy.load("en_core_web_sm") test_sentences = [ "I ate an apple at the Apple store.", "The march will happen in March.", "Nice is a nice city in France.", "Turkey exports turkey meat worldwide.",] for sentence in test_sentences: print(f"Original: {sentence}") # NER on original doc_orig = nlp(sentence) orig_ents = [(ent.text, ent.label_) for ent in doc_orig.ents] # NER on lowercased doc_lower = nlp(sentence.lower()) lower_ents = [(ent.text, ent.label_) for ent in doc_lower.ents] print(f" Original entities: {orig_ents}") print(f" Lowercased entities: {lower_ents}") print(f" Information lost: {len(orig_ents) - len(lower_ents)} entities") print()BERT comes in 'cased' and 'uncased' versions. Uncased models lowercase inputs before tokenization, reducing vocabulary and improving generalization for case-insensitive tasks. Cased models preserve case, crucial for NER and tasks where case carries meaning. Choose based on your downstream task.
Case normalization seems simple—just call .lower(). But proper implementation requires attention to Unicode, locale-specific rules, and edge cases.
Basic Case Conversion:
Python provides three main case methods:
- str.lower(): Converts to lowercase
- str.upper(): Converts to uppercase
- str.casefold(): Aggressive lowercase for case-insensitive matching

The difference between lower() and casefold() matters for international text.
"""Case normalization implementation strategies."""from typing import Optionalimport reimport unicodedata # === Basic Case Conversion === text = "Hello WORLD München Straße" print("=== Basic Python Case Methods ===\n")print(f"Original: {text}")print(f"lower(): {text.lower()}")print(f"upper(): {text.upper()}")print(f"casefold(): {text.casefold()}") # The difference matters for German ßgerman_text = "Straße" # Streetprint(f"\nGerman 'Straße' (street):")print(f" lower(): '{german_text.lower()}'") # straßeprint(f" casefold(): '{german_text.casefold()}'") # strasse (double s) # === Unicode Case Folding === print("\n=== Unicode Considerations ===\n") unicode_examples = [ ("İstanbul", "Turkish capital I with dot"), ("ΣΩΚΡΑΤΗΣ", "Greek uppercase sigma"), ("Σ vs ς", "Greek sigma: different lowercase forms"), ("fi", "Latin small ligature fi"),] for text, desc in unicode_examples: print(f"{desc}:") print(f" Original: '{text}'") print(f" lower(): '{text.lower()}'") print(f" casefold(): '{text.casefold()}'") print() # === The Turkish I Problem === print("=== The Turkish I Problem ===\n") # In Turkish:# - I (dotless I) lowercases to ı (dotless i)# - İ (dotted I) lowercases to i (dotted i)# Standard Python lower() doesn't handle this! turkish_examples = [ ("ISTANBUL", "Standard lowercase"), ("İSTANBUL", "Turkish dotted İ"),] print("In Turkish, 'I' and 'İ' are different letters:")print(" I (dotless) ↔ ı (dotless)")print(" İ (dotted) ↔ i (dotted)")print() for text, desc in turkish_examples: print(f"{desc}: {text}") print(f" Python lower(): {text.lower()}") print(f" Python casefold(): {text.casefold()}") print() # Proper Turkish lowercasing requires locale-aware functions# import locale# locale.setlocale(locale.LC_ALL, 'tr_TR.UTF-8') # === Comprehensive Case Normalizer === class CaseNormalizer: """ Production-ready case normalizer with configurable behavior. """ def __init__( self, method: str = "lower", preserve_acronyms: bool = False, preserve_all_caps: bool = False, min_acronym_length: int = 2, max_acronym_length: int = 6 ): """ Initialize case normalizer. 
Args: method: 'lower', 'upper', or 'casefold' preserve_acronyms: Keep likely acronyms (2-6 uppercase letters) preserve_all_caps: Keep ALL CAPS words (potential emphasis) min_acronym_length: Minimum length to consider as acronym max_acronym_length: Maximum length to consider as acronym """ self.method = method self.preserve_acronyms = preserve_acronyms self.preserve_all_caps = preserve_all_caps self.min_len = min_acronym_length self.max_len = max_acronym_length # Regex for detecting acronyms self.acronym_pattern = re.compile( rf'\b[A-Z]{{{min_acronym_length},{max_acronym_length}}}\b' ) def _is_acronym(self, word: str) -> bool: """Check if word is likely an acronym.""" return bool(self.acronym_pattern.fullmatch(word)) def _is_all_caps(self, word: str) -> bool: """Check if word is all caps (length > max_acronym_length).""" return word.isupper() and len(word) > self.max_len def normalize_word(self, word: str) -> str: """Normalize case of a single word.""" # Preserve acronyms if requested if self.preserve_acronyms and self._is_acronym(word): return word # Preserve all-caps emphasis if requested if self.preserve_all_caps and self._is_all_caps(word): return word # Apply normalization if self.method == "lower": return word.lower() elif self.method == "upper": return word.upper() elif self.method == "casefold": return word.casefold() else: return word def normalize_text(self, text: str) -> str: """Normalize case of entire text, preserving structure.""" words = text.split() normalized = [self.normalize_word(w) for w in words] return " ".join(normalized) # Demonstrationprint("=== Configurable Case Normalizer ===\n") text = "I work at NASA and IBM. The AMAZING results were published in Nature." # Basic lowercasebasic = CaseNormalizer(method="lower")print(f"Original: {text}")print(f"Basic lower: {basic.normalize_text(text)}") # Preserve acronymsacronym_preserving = CaseNormalizer(method="lower", preserve_acronyms=True)print(f"Preserve acronyms: {acronym_preserving.normalize_text(text)}") # Preserve all-caps emphasisemphasis_preserving = CaseNormalizer( method="lower", preserve_acronyms=True, preserve_all_caps=True)print(f"Preserve emphasis: {emphasis_preserving.normalize_text(text)}")When performing case-insensitive matching (search, deduplication), use casefold() instead of lower(). It handles Unicode edge cases like German ß → ss and various diacritical marks correctly. For simple text normalization where you'll tokenize later, lower() is usually sufficient.
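Building on that tip, here is a minimal sketch of case-insensitive deduplication, where casefold() succeeds and lower() does not:

```python
names = ["Straße", "STRASSE", "strasse"]

# lower() leaves ß intact, so 'straße' and 'strasse' stay distinct
print({n.lower() for n in names})     # {'straße', 'strasse'} -> 2 entries

# casefold() folds ß -> ss, so all three variants collapse
print({n.casefold() for n in names})  # {'strasse'} -> 1 entry
```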
Different languages have different relationships with character case. A normalization strategy appropriate for English may be wrong for German, Turkish, or Greek.
German:

- All nouns are capitalized as a rule of grammar, so case carries syntactic as well as lexical information
- ß lowercases to itself under lower() but folds to "ss" under casefold()

Turkish:

- Has two distinct I letters: dotted (İ/i) and dotless (I/ı)
- lower() gives wrong results: "ISTANBUL" → "istanbul" (should be "ıstanbul")

Greek:

- Uppercase Σ has two lowercase forms: medial σ and word-final ς
- Python's lower() applies the final-sigma rule correctly

Chinese/Japanese/Korean:

- No case distinction in the native scripts; case handling only matters for embedded Latin text and romanizations such as Pinyin
| Language | Case System | Normalization Notes |
|---|---|---|
| English | Standard A-Z / a-z | Straightforward, acronyms are main consideration |
| German | A-Z / a-z + ß | Noun capitalization is grammatical; ß handling |
| Turkish | Extended (İ/I, i/ı) | Requires locale-aware conversion |
| Greek | Α-Ω / α-ω + σ/ς | Final sigma (ς) vs medial sigma (σ) |
| Russian | А-Я / а-я | Standard case conversion |
| Arabic | No case distinction | No normalization needed for native script |
| Chinese | No case concept | Only affects romanization (Pinyin) |
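Rather than hand-rolling per-language mappings (as the code below does for Turkish), production systems often delegate locale-aware conversion to ICU. A minimal sketch using the PyICU bindings; this assumes the PyICU package is installed, and the helper name locale_lower is ours:

```python
import icu  # PyICU bindings for ICU (assumed installed: pip install PyICU)


def locale_lower(text: str, locale_code: str) -> str:
    """Lowercase text using ICU's locale-aware case rules."""
    # UnicodeString.toLower(locale) converts in place and returns itself
    return str(icu.UnicodeString(text).toLower(icu.Locale(locale_code)))


print(locale_lower("ISTANBUL", "en_US"))  # istanbul
print(locale_lower("ISTANBUL", "tr_TR"))  # ıstanbul (dotless ı)
```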
"""Language-specific case normalization examples."""import refrom typing import Optional # === Turkish Case Conversion === def turkish_lower(text: str) -> str: """ Locale-correct lowercase for Turkish. Turkish has: - I (dotless capital) → ı (dotless small) - İ (dotted capital) → i (dotted small) Standard Python lower() treats I → i, which is incorrect for Turkish. """ # Direct character mapping for Turkish tr_map = str.maketrans({ 'I': 'ı', # Dotless I to dotless i 'İ': 'i', # Dotted İ to dotted i }) return text.translate(tr_map).lower() def turkish_upper(text: str) -> str: """Locale-correct uppercase for Turkish.""" tr_map = str.maketrans({ 'i': 'İ', # Dotted i to dotted İ 'ı': 'I', # Dotless i to dotless I }) return text.translate(tr_map).upper() print("=== Turkish Case Handling ===\n") turkish_words = ["ISTANBUL", "İSTANBUL", "SIK", "SİK"] for word in turkish_words: print(f"{word}:") print(f" Python lower(): {word.lower()}") print(f" Turkish lower(): {turkish_lower(word)}") print() # === German ß Handling === print("=== German ß Handling ===\n") german_words = ["Straße", "Grüße", "STRASSE", "GRUESSE"] for word in german_words: print(f"{word}:") print(f" lower(): {word.lower()}") print(f" casefold(): {word.casefold()}") print(f" upper(): {word.upper()}") print() # Note: Python 3.8+ has ẞ (capital ß), but it's not universally usedprint("Capital ß (ẞ) introduced in 2017 German orthography:")print(f" 'ß'.upper() = '{'ß'.upper()}'")print(f" 'ẞ'.lower() = '{'ẞ'.lower()}'") # === Greek Final Sigma === print("\n=== Greek Sigma Handling ===\n") greek_text = "ΣΩΚΡΑΤΗΣ" # Socrates in Greek print(f"Greek 'Socrates': {greek_text}")print(f" lower(): {greek_text.lower()}")# Note: The final sigma Σ becomes ς (not σ) when it's the last letter # Python handles this correctlyprint(f" Last char is ς (final sigma): {greek_text.lower()[-1] == 'ς'}") # === Language-Aware Normalizer === class MultilingualCaseNormalizer: """ Language-aware case normalization. """ def __init__(self, language: str = "en"): self.language = language def lower(self, text: str) -> str: """Language-appropriate lowercase.""" if self.language == "tr": # Turkish return turkish_lower(text) else: return text.lower() def upper(self, text: str) -> str: """Language-appropriate uppercase.""" if self.language == "tr": # Turkish return turkish_upper(text) else: return text.upper() def normalize(self, text: str) -> str: """Normalize to lowercase for the target language.""" return self.lower(text) # Test multilingual normalizerprint("\n=== Multilingual Normalizer ===\n") en_norm = MultilingualCaseNormalizer("en")tr_norm = MultilingualCaseNormalizer("tr") test_text = "ISTANBUL"print(f"'{test_text}':")print(f" English normalizer: {en_norm.normalize(test_text)}")print(f" Turkish normalizer: {tr_norm.normalize(test_text)}")Case normalization must be integrated thoughtfully into machine learning pipelines. The position in the preprocessing chain and interaction with other steps matters.
Pipeline Order Considerations:

- Normalize before building vocabularies or training vectorizers, so counts aren't split across case variants
- If any component relies on case-based features (e.g., "starts with uppercase" for NER), extract those features before normalizing (a sketch appears after the pipeline code below)
- Apply the same normalization at training and inference time to avoid train/serve skew
Interaction with Other Preprocessing:

- Stopword lists are typically lowercase, so apply them after normalization (or compare case-insensitively)
- scikit-learn's text vectorizers lowercase by default (lowercase=True), so avoid normalizing twice or disable one of the steps
- Tokenizers for cased transformer models (e.g., bert-base-cased) expect original casing; don't lowercase their input
"""Integrating case normalization into ML preprocessing pipelines."""from sklearn.base import BaseEstimator, TransformerMixinfrom sklearn.pipeline import Pipelinefrom sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizerfrom typing import List, Optionalimport re class CaseNormalizationTransformer(BaseEstimator, TransformerMixin): """ Scikit-learn compatible case normalization transformer. """ def __init__( self, method: str = "lower", preserve_pattern: Optional[str] = None ): """ Args: method: 'lower', 'upper', 'casefold', or 'none' preserve_pattern: Regex pattern for tokens to preserve (e.g., acronyms) """ self.method = method self.preserve_pattern = preserve_pattern self._preserve_regex = None def fit(self, X, y=None): if self.preserve_pattern: self._preserve_regex = re.compile(self.preserve_pattern) return self def transform(self, X: List[str]) -> List[str]: return [self._normalize(text) for text in X] def _normalize(self, text: str) -> str: if self.method == "none": return text if self._preserve_regex: # Find tokens to preserve preserved = {} for match in self._preserve_regex.finditer(text): placeholder = f"__PRESERVED_{len(preserved)}__" preserved[placeholder] = match.group() text = text[:match.start()] + placeholder + text[match.end():] # Normalize if self.method == "lower": text = text.lower() elif self.method == "upper": text = text.upper() elif self.method == "casefold": text = text.casefold() # Restore preserved tokens for placeholder, original in preserved.items(): text = text.replace(placeholder.lower(), original) return text # Simple normalization if self.method == "lower": return text.lower() elif self.method == "upper": return text.upper() elif self.method == "casefold": return text.casefold() return text # === Build preprocessing pipeline === # Simple pipeline with case normalizationsimple_pipeline = Pipeline([ ('case_norm', CaseNormalizationTransformer(method='lower')), ('vectorizer', TfidfVectorizer(max_features=1000)),]) # Advanced pipeline preserving acronymsacronym_pattern = r'\b[A-Z]{2,6}\b' # Match 2-6 letter acronyms advanced_pipeline = Pipeline([ ('case_norm', CaseNormalizationTransformer( method='lower', preserve_pattern=acronym_pattern )), ('vectorizer', TfidfVectorizer(max_features=1000)),]) # Test datatexts = [ "I work at NASA and study machine learning.", "The FBI investigates cyber crimes.", "IBM and Google are tech giants.", "AMAZING breakthrough in AI research!",] print("=== Pipeline Comparison ===\n") for i, text in enumerate(texts): simple_result = simple_pipeline.named_steps['case_norm'].fit_transform([text])[0] advanced_result = advanced_pipeline.named_steps['case_norm'].fit_transform([text])[0] print(f"Original: {text}") print(f"Simple: {simple_result}") print(f"Advanced: {advanced_result}") print() # === Interaction with BERT models === print("=== Case Handling for Transformer Models ===\n") from transformers import AutoTokenizer # BERT uncased (lowercases input automatically)bert_uncased = AutoTokenizer.from_pretrained("bert-base-uncased") # BERT cased (preserves case)bert_cased = AutoTokenizer.from_pretrained("bert-base-cased") test = "Apple announced iPhone in California" print(f"Input: {test}\n") uncased_tokens = bert_uncased.tokenize(test)cased_tokens = bert_cased.tokenize(test) print(f"BERT uncased tokens: {uncased_tokens}")print(f"BERT cased tokens: {cased_tokens}")print(f"\nNote: Uncased model lowercases 'Apple' and 'iPhone' automatically")print(" This may lose useful information for NER tasks")Case normalization typically 
happens early in the pipeline (before tokenization or immediately after). However, if you're using features like 'starts with uppercase' for NER, compute those features BEFORE normalizing. The preprocessing order should align with your model's expectations.
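To make the ordering point concrete, here is a minimal sketch, with a helper name of our choosing, that captures case features before lowercasing so a downstream model can still use them:

```python
from typing import Dict, List, Tuple


def extract_case_features(tokens: List[str]) -> List[Tuple[str, Dict[str, bool]]]:
    """Record case information for each token, then normalize it."""
    results = []
    for token in tokens:
        features = {
            "is_title": token.istitle(),  # 'Apple' -> True
            "is_upper": token.isupper(),  # 'NASA'  -> True
        }
        results.append((token.lower(), features))
    return results


tokens = "Apple announced NASA funding".split()
for normalized, features in extract_case_features(tokens):
    print(normalized, features)
```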
Sometimes you need to reverse case normalization—restoring proper capitalization to lowercased text. This is called true casing or case restoration.
Use Cases:

- Restoring readable case to speech recognition (ASR) output, which is often all lowercase
- Post-processing machine translation or text generation systems that emit lowercased text
- Recovering case for display after a pipeline that normalized aggressively
Approaches:

- Rule-based: lexicons of known proper nouns and acronyms plus sentence-start capitalization rules
- Statistical: learn each word's most frequent casing from a properly cased corpus

Both approaches are implemented below.
"""True casing: restoring proper capitalization to lowercased text."""from typing import List, Set, Dictimport refrom collections import Counter class RuleBasedTrueCaser: """ Simple rule-based true caser for common patterns. """ def __init__(self, proper_nouns: Set[str] = None, acronyms: Set[str] = None): """ Args: proper_nouns: Set of proper nouns to capitalize acronyms: Set of acronyms to uppercase """ self.proper_nouns = proper_nouns or { 'monday', 'tuesday', 'wednesday', 'thursday', 'friday', 'saturday', 'sunday', 'january', 'february', 'march', 'april', 'may', 'june', 'july', 'august', 'september', 'october', 'november', 'december', 'english', 'french', 'german', 'spanish', 'chinese', 'japanese', 'american', 'european', 'african', 'asian' } self.acronyms = acronyms or { 'usa', 'uk', 'eu', 'un', 'nato', 'nasa', 'fbi', 'cia', 'ceo', 'cto', 'cfo', 'phd', 'mba', 'ai', 'ml', 'nlp', 'api', 'url', 'html', 'css', 'js', 'sql', 'aws', 'gcp' } # Sentence-ending punctuation self.sentence_end = re.compile(r'[.!?]\s*$') def truecase_word(self, word: str, is_sentence_start: bool = False) -> str: """Apply true casing rules to a single word.""" word_lower = word.lower() # Check acronyms first if word_lower in self.acronyms: return word.upper() # Check proper nouns if word_lower in self.proper_nouns: return word.capitalize() # Capitalize sentence starts if is_sentence_start: return word.capitalize() # Default to lowercase return word_lower def truecase(self, text: str) -> str: """Apply true casing to entire text.""" sentences = re.split(r'([.!?]\s*)', text) result = [] for i, segment in enumerate(sentences): if i % 2 == 0: # Actual sentence content words = segment.split() if words: # First word is sentence start truecased = [self.truecase_word(words[0], is_sentence_start=True)] # Rest of words truecased.extend([self.truecase_word(w) for w in words[1:]]) result.append(' '.join(truecased)) else: # Punctuation result.append(segment) return ''.join(result) class StatisticalTrueCaser: """ Statistical true caser trained on a corpus. """ def __init__(self): self.word_cases: Dict[str, Counter] = {} self.sentence_start_probs: Dict[str, float] = {} def fit(self, texts: List[str]): """Learn case patterns from a corpus.""" # Count case occurrences for each lowercased word for text in texts: sentences = re.split(r'[.!?]\s*', text) for sentence in sentences: words = sentence.split() for i, word in enumerate(words): lower = word.lower() if lower not in self.word_cases: self.word_cases[lower] = Counter() # Record the observed case self.word_cases[lower][word] += 1 return self def truecase_word(self, word: str) -> str: """Get most likely casing for a word.""" lower = word.lower() if lower in self.word_cases: # Return most common casing return self.word_cases[lower].most_common(1)[0][0] return lower def truecase(self, text: str, capitalize_starts: bool = True) -> str: """Apply learned true casing to text.""" sentences = re.split(r'([.!?]\s*)', text) result = [] for i, segment in enumerate(sentences): if i % 2 == 0: words = segment.split() truecased = [self.truecase_word(w) for w in words] # Capitalize sentence starts if capitalize_starts and truecased: truecased[0] = truecased[0].capitalize() result.append(' '.join(truecased)) else: result.append(segment) return ''.join(result) # Demonstrationprint("=== Rule-Based True Casing ===\n") truecaser = RuleBasedTrueCaser() test_texts = [ "i went to paris on monday.", "the fbi and cia work together. 
nasa launches rockets.", "he speaks english and japanese fluently.", "we visited the usa and the uk last summer.",] for text in test_texts: truecased = truecaser.truecase(text) print(f"Input: {text}") print(f"Output: {truecased}") print() # Statistical true casingprint("=== Statistical True Casing ===\n") # Training corpus (properly cased)training_texts = [ "Apple released a new iPhone at their headquarters in California.", "Google and Microsoft compete in the AI space.", "The United States and European Union signed a trade deal.", "NASA's Mars mission is progressing well.", "The CEO announced record profits at Apple.",] stat_truecaser = StatisticalTrueCaser()stat_truecaser.fit(training_texts) # Test on lowercased inputtest = "apple released iphone and google announced ai updates"print(f"Input: {test}")print(f"Output: {stat_truecaser.truecase(test)}")Case normalization is a simple transformation with significant implications. Understanding when to apply it—and when to preserve case—is essential for building effective NLP systems.
What's Next:
With case normalization covered, we turn to the final essential preprocessing step: punctuation handling. We'll explore when to remove punctuation, when to preserve it, and how to handle the many edge cases that arise in real-world text.
You now understand case normalization from basic concepts through production implementation. You can implement language-aware case conversion, make informed decisions about when to preserve case, and integrate case normalization appropriately into your NLP pipelines.