Punctuation marks seem like trivial noise in text—periods, commas, exclamation points. The simplest preprocessing approach strips them all. But punctuation carries more information than first appears:
emoticons like :), hashtags like #ML, URLs, code, and abbreviations all depend on it.

The right approach to punctuation depends entirely on your NLP task. Bag-of-words models typically discard punctuation; sentiment analysis may preserve exclamation marks; sentence segmentation depends on proper punctuation handling.
By the end of this page, you will know when to remove, preserve, or transform punctuation; how to handle special cases such as emoticons, hashtags, and URLs; how to implement production-ready punctuation processing; and how to make task-appropriate decisions about punctuation in your NLP pipeline.
Not all punctuation is equal. Different marks serve different functions, and understanding these distinctions guides preprocessing decisions.
Terminal Punctuation:
Marks that end sentences and indicate sentence type:
- . (period) — Declarative statement
- ? (question mark) — Interrogative
- ! (exclamation mark) — Exclamative, emphasis

Clause-Level Punctuation:
Marks that organize within sentences:
- , (comma) — Clause separation, lists, pauses
- ; (semicolon) — Related independent clauses
- : (colon) — Introduction, elaboration
- — (em dash) — Interruption, emphasis

Word-Level Punctuation:
Marks that modify individual words:
- ' (apostrophe) — Possession, contractions
- - (hyphen) — Compound words, prefixes
- / (slash) — Alternatives, dates, paths

Quotation and Bracketing:
Marks that delimit quoted material or parenthetical content:
- " " and ' ' — Direct quotation
- ( ) — Parenthetical content
- [ ] — Editorial insertions, citations
- { } — Sets, code blocks

| Category | Examples | NLP Relevance |
|---|---|---|
| Terminal | . ? ! | Sentence segmentation, question detection, emphasis |
| Clause | , ; : — | Clause boundaries, list parsing |
| Word-level | ' - / | Contractions, compounds, dates |
| Quotation | " ' « » | Direct speech, citations |
| Brackets | ( ) [ ] { } | Parentheticals, code, references |
| Special | @ # $ % & * | Mentions, hashtags, symbols |
| Emoticons | :) :( ;-) | Sentiment, emotion |
```python
"""Analyze punctuation distribution in text."""
import string
import re
from collections import Counter
from typing import Dict, List


def analyze_punctuation(text: str) -> Dict:
    """Analyze punctuation usage in text."""
    # Standard ASCII punctuation
    standard_punct = set(string.punctuation)

    # Count punctuation occurrences
    punct_counts = Counter()
    for char in text:
        if char in standard_punct:
            punct_counts[char] += 1

    # Categorize punctuation
    categories = {
        'terminal': set('.?!'),
        'clause': set(',;:'),
        'word_level': set("'-/"),
        'quotation': set('"\'«»“”‘’'),
        'brackets': set('()[]{}'),
        'special': set('@#$%&*'),
    }

    category_counts = {}
    for cat, chars in categories.items():
        category_counts[cat] = sum(punct_counts[c] for c in chars)

    # Find emoticons
    emoticon_pattern = r'[:;=8][\-~]?[\)\(DPp\]\[/\\]|[\)\(DPp\]\[/\\][\-~]?[:;=8]'
    emoticons = re.findall(emoticon_pattern, text)

    # Find hashtags and mentions
    hashtags = re.findall(r'#\w+', text)
    mentions = re.findall(r'@\w+', text)

    return {
        'total_chars': len(text),
        'total_punct': sum(punct_counts.values()),
        'punct_ratio': sum(punct_counts.values()) / len(text) if text else 0,
        'punct_counts': dict(punct_counts.most_common()),
        'category_counts': category_counts,
        'emoticons': emoticons,
        'hashtags': hashtags,
        'mentions': mentions,
    }


# Analyze different text types
texts = {
    'formal': """The research, conducted by Dr. Smith et al., demonstrates
significant findings. Three key results emerged: (1) improved accuracy,
(2) reduced latency, and (3) better scalability.""",
    'social': """OMG!! This is amazing!!! 😍 Can't believe it worked!!!
@JohnDoe check this out #MachineLearning #AI :) :D""",
    'code': """def process(data: Dict[str, Any]) -> List[str]:
    return [x['key'] for x in data.items() if x != None]""",
}

print("=== Punctuation Analysis by Text Type ===\n")

for text_type, text in texts.items():
    analysis = analyze_punctuation(text)
    print(f"{text_type.upper()}:")
    print(f"  Total chars: {analysis['total_chars']}")
    print(f"  Punctuation: {analysis['total_punct']} ({analysis['punct_ratio']*100:.1f}%)")
    print(f"  Top marks: {list(analysis['punct_counts'].items())[:5]}")
    print(f"  Categories: {analysis['category_counts']}")
    if analysis['emoticons']:
        print(f"  Emoticons: {analysis['emoticons']}")
    if analysis['hashtags']:
        print(f"  Hashtags: {analysis['hashtags']}")
    if analysis['mentions']:
        print(f"  Mentions: {analysis['mentions']}")
    print()
```

For many NLP tasks, punctuation is noise that increases vocabulary size without providing useful features. These scenarios typically call for punctuation removal.
```python
"""Various approaches to punctuation removal."""
import string
import re
from typing import List


# === Method 1: Simple Character Filtering ===

def remove_punct_translate(text: str) -> str:
    """
    Remove punctuation using str.translate().
    Fastest method for simple removal.
    """
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


# === Method 2: Regex-Based Removal ===

def remove_punct_regex(text: str) -> str:
    """
    Remove punctuation using regex.
    More flexible for complex patterns.
    """
    return re.sub(r'[^\w\s]', '', text)


def remove_punct_regex_unicode(text: str) -> str:
    """
    Remove punctuation including Unicode punctuation.
    Handles international text properly.
    Requires the third-party `regex` package; the built-in `re`
    module does not support \p{...} character classes.
    """
    import regex
    return regex.sub(r'[\p{P}\p{S}]', '', text)


# === Method 3: Token-Based Removal ===

def remove_punct_tokens(tokens: List[str]) -> List[str]:
    """
    Remove punctuation tokens from tokenized text.
    Works well with pre-tokenized input.
    """
    punct_chars = set(string.punctuation)
    return [t for t in tokens
            if t not in punct_chars
            and not all(c in punct_chars for c in t)]


# === Method 4: Selective Removal ===

def remove_punct_selective(
    text: str,
    preserve: set = None,
    remove_extra: set = None
) -> str:
    """
    Selectively remove punctuation, preserving specific marks.

    Args:
        text: Input text
        preserve: Punctuation to keep (e.g., {'!', '?'} for sentiment)
        remove_extra: Additional characters to remove (e.g., {'@', '#'})
    """
    if preserve is None:
        preserve = set()
    if remove_extra is None:
        remove_extra = set()

    to_remove = (set(string.punctuation) - preserve) | remove_extra
    translator = str.maketrans({c: ' ' for c in to_remove})
    result = text.translate(translator)
    # Collapse multiple spaces
    return re.sub(r'\s+', ' ', result).strip()


# === Comparison ===

test_texts = [
    "Hello, World! How's it going?",
    "I love #MachineLearning!!! :)",
    "Dr. Smith's research (2023) was ground-breaking.",
    "Price: $99.99/month — great deal!",
]

print("=== Punctuation Removal Methods ===\n")

for text in test_texts:
    print(f"Original: {text}")
    print(f"  translate(): {remove_punct_translate(text)}")
    print(f"  regex():     {remove_punct_regex(text)}")
    print()

# Selective removal
print("=== Selective Removal ===\n")

sentiment_text = "I LOVE this!!! Amazing!!! But the service? Not great..."

print(f"Original: {sentiment_text}")
print(f"Remove all: {remove_punct_translate(sentiment_text)}")
print(f"Keep ! and ?: {remove_punct_selective(sentiment_text, preserve={'!', '?'})}")

# Token-based removal
print("\n=== Token-Based Removal ===\n")

tokens = ["Hello", ",", "World", "!", "How", "'s", "it", "going", "?"]
print(f"Original tokens: {tokens}")
print(f"After removal: {remove_punct_tokens(tokens)}")
```

Replacing punctuation with spaces often leaves multiple consecutive spaces ('Hello, World!' → 'Hello  World '), while deleting characters outright can fuse words such as 'ground-breaking' into 'groundbreaking'. Always collapse whitespace to single spaces after removal, or tokenization may produce empty tokens.
Several NLP tasks depend on punctuation for accurate processing. In these scenarios, removing punctuation degrades performance.
Sentiment Analysis:
Exclamation marks amplify sentiment. "Great!" is more positive than "Great." Triple exclamation marks "!!!" often indicate strong emotion. Question marks can indicate uncertainty or negativity.
Sentence Segmentation:
Periods, question marks, and exclamation marks are the primary signals for sentence boundaries. Removing them before segmentation makes the task nearly impossible.
Machine Translation:
Punctuation must be preserved and accurately translated. Question marks, quotation styles, and list punctuation vary by language.
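One concrete pitfall when punctuation varies by language: Python's string.punctuation covers only ASCII, so language-specific marks are invisible to ASCII-based removal or preservation logic. A quick illustration using only the standard library (no particular MT system is assumed):

```python
# Language-specific punctuation is not in string.punctuation (ASCII only),
# so ASCII-based handling silently ignores it.
import string

marks = ['?', '!', '¿', '¡', '«', '»', '。', '、', '？']
for mark in marks:
    print(f"{mark!r}: in string.punctuation = {mark in string.punctuation}")
# Only the ASCII '?' and '!' are members; the rest need Unicode-aware handling.
```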
Text Generation:
Models must learn to generate appropriate punctuation for fluent, readable output.
```python
"""Demonstrate punctuation impact on sentiment analysis."""
import re
from typing import List, Tuple


def extract_punctuation_features(text: str) -> dict:
    """Extract punctuation-based features for sentiment."""
    return {
        'exclamation_count': text.count('!'),
        'question_count': text.count('?'),
        'ellipsis_count': text.count('...'),
        'has_all_caps': any(word.isupper() and len(word) > 1
                            for word in text.split()),
        'multiple_exclamations': '!!' in text,
        'ends_with_question': text.rstrip().endswith('?'),
        'ends_with_exclamation': text.rstrip().endswith('!'),
        'emoticons_positive': len(re.findall(r'[:;][\-~]?[\)D]', text)),
        'emoticons_negative': len(re.findall(r'[:;][\-~]?[\(]', text)),
    }


# Sentiment examples where punctuation matters
sentiment_pairs = [
    # (text with punctuation, text without)
    ("I love this!", "I love this"),
    ("I LOVE this!!!", "I love this"),
    ("Great product...", "Great product"),
    ("Really?", "Really"),
    ("This is good :)", "This is good"),
    ("Not bad, I guess...", "Not bad I guess"),
]

print("=== Punctuation Impact on Sentiment Features ===\n")

for with_punct, without_punct in sentiment_pairs:
    features_with = extract_punctuation_features(with_punct)
    features_without = extract_punctuation_features(without_punct)

    print(f"With punctuation: '{with_punct}'")
    print(f"  Features: {features_with}")
    print(f"Without punctuation: '{without_punct}'")
    print(f"  Features: {features_without}")
    print()

# === Sentence Segmentation Example ===

print("=== Punctuation Impact on Sentence Segmentation ===\n")


def segment_sentences(text: str) -> List[str]:
    """Simple sentence segmentation using terminal punctuation."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]


test_text = "I love this product! It works great. Do you have it in blue? I'd buy that too."

print("Original text:")
print(f"  {test_text}\n")

print("Sentences detected:")
for i, sent in enumerate(segment_sentences(test_text)):
    print(f"  {i+1}. {sent}")

# Without punctuation
no_punct = re.sub(r'[.!?]', '', test_text)
print("\nWithout terminal punctuation:")
print(f"  {no_punct}")
print(f"  Sentences detected: {len(segment_sentences(no_punct))}")

# === Abbreviation Handling ===

print("\n=== Abbreviation Periods ===\n")

abbreviation_examples = [
    "Dr. Smith works at U.S. Bank.",
    "The U.N. met at 3 p.m. today.",
    "I have a Ph.D. from M.I.T.",
]

for text in abbreviation_examples:
    naive_sentences = re.split(r'\.\s+', text)
    print(f"Text: {text}")
    print(f"Naive split: {naive_sentences}")
    print("Problem: Abbreviation periods cause incorrect splits")
    print()
```

Rather than simply preserving or removing punctuation, consider extracting punctuation patterns as features: exclamation count, question mark presence, emoticon polarity. This captures information without including punctuation in the vocabulary.
Modern text, especially from social media, contains constructs that mix punctuation with semantic content. Naive punctuation stripping destroys these meaningful patterns.
Emoticons and Emojis:
Text emoticons like :) or :-D carry sentiment. Stripping punctuation destroys them. Emojis (Unicode characters) are technically not punctuation but require consideration.
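Because emoji are Unicode symbol characters rather than punctuation, a character-category check can keep them while dropping true punctuation. The sketch below is a minimal illustration using unicodedata, not a complete emoji detector:

```python
# A catch-all regex like [^\w\s] deletes emoji along with punctuation,
# because emoji are neither word characters nor whitespace.
import re
import unicodedata


def strip_punct_keep_emoji(text: str) -> str:
    """Drop Unicode punctuation categories (P*) but keep emoji and other symbols."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))


text = "Loved it!!! 😍 (would buy again...)"
print(re.sub(r'[^\w\s]', '', text))   # emoji removed along with the punctuation
print(strip_punct_keep_emoji(text))   # punctuation removed, emoji kept
```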
URLs and Email Addresses:
URLs use periods, slashes, and colons as part of their syntax; https://example.com/path becomes meaningless if its punctuation is stripped.
Hashtags and Mentions:
Twitter-style #hashtags and @mentions use special characters meaningfully.
Contractions and Possessives:
Apostrophes in "don't", "it's", "John's" are grammatically significant. Removing them creates "dont", "its", "Johns"—which may or may not be acceptable.
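If the downstream task is sensitive to contractions, one option is to expand them before any punctuation removal. A minimal sketch with a deliberately tiny mapping (a real pipeline would use a much fuller dictionary or a dedicated library):

```python
# Expand contractions before stripping apostrophes.
# CONTRACTIONS is a small illustrative sample, not a complete list.
import re

CONTRACTIONS = {
    "don't": "do not",
    "it's": "it is",
    "can't": "cannot",
    "i'm": "i am",
    "won't": "will not",
}


def expand_contractions(text: str) -> str:
    """Replace known contractions with their expanded forms (case-insensitive)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I don't think it's broken."))
# -> "I do not think it is broken."
```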
```python
"""Handling special punctuation cases."""
import re
from typing import Tuple, List, Dict


class SmartPunctuationHandler:
    """
    Intelligent punctuation handling that preserves special constructs.
    """

    def __init__(self):
        # Patterns for special constructs
        self.patterns = {
            'url': re.compile(
                r'https?://[^\s]+|www\.[^\s]+',
                re.IGNORECASE
            ),
            'email': re.compile(
                r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
            ),
            'hashtag': re.compile(r'#\w+'),
            'mention': re.compile(r'@\w+'),
            'emoticon': re.compile(
                r'[:;=8B][\-~\^]?[\)\(DPp\]\[oO3><]|'   # Western
                r'[\)\(DPp\]\[oO3><][\-~\^]?[:;=8B]|'   # Reversed
                r'<3|</3|<\\3|'                          # Hearts
                r'\^_\^|\^\^|-_-|>_<|UwU|OwO'            # Text faces
            ),
            'number': re.compile(r'\$?\d+[,.]?\d*%?'),
            'abbreviation': re.compile(
                r'\b(?:Mr|Mrs|Ms|Dr|Prof|Inc|Ltd|Jr|Sr|vs|etc|i\.e|e\.g)\.?',
                re.IGNORECASE
            ),
            'contraction': re.compile(
                r"\b\w+'\w+\b"
            ),
        }

    def find_special_constructs(self, text: str) -> Dict[str, List[str]]:
        """Find all special constructs in text."""
        found = {}
        for name, pattern in self.patterns.items():
            matches = pattern.findall(text)
            if matches:
                found[name] = matches
        return found

    def protect_and_process(
        self,
        text: str,
        process_func,
        protect: List[str] = None
    ) -> str:
        """
        Protect special constructs, process text, then restore.

        Args:
            text: Input text
            process_func: Function to apply (e.g., remove punctuation)
            protect: List of construct names to protect
        """
        if protect is None:
            protect = ['url', 'email', 'hashtag', 'mention', 'emoticon']

        # Store protected strings with placeholders
        placeholders = {}
        modified_text = text
        counter = 0

        for construct_type in protect:
            if construct_type in self.patterns:
                pattern = self.patterns[construct_type]
                for match in pattern.finditer(text):
                    # Placeholder contains no punctuation, so it survives
                    # whatever removal process_func applies.
                    placeholder = f"PROTECTEDTOKEN{counter}X"
                    placeholders[placeholder] = match.group()
                    modified_text = modified_text.replace(
                        match.group(), placeholder, 1
                    )
                    counter += 1

        # Apply processing
        processed = process_func(modified_text)

        # Restore protected content
        for placeholder, original in placeholders.items():
            processed = processed.replace(placeholder, original)

        return processed


# Simple punctuation removal function
def remove_punctuation(text: str) -> str:
    import string
    return text.translate(str.maketrans('', '', string.punctuation))


# Demonstration
handler = SmartPunctuationHandler()

test_texts = [
    "I love this!!! Check out https://example.com :)",
    "Contact support@company.com or @TwitterHandle",
    "Great #MachineLearning talk! Dr. Smith was amazing ^_^",
    "Price: $99.99 (that's 50% off!)",
    "I can't believe it's not butter!",
]

print("=== Smart Punctuation Handling ===\n")

for text in test_texts:
    print(f"Original: {text}")

    # Find special constructs
    special = handler.find_special_constructs(text)
    if special:
        print(f"Special constructs: {special}")

    # Naive removal
    naive = remove_punctuation(text)
    print(f"Naive removal: {naive}")

    # Smart removal
    smart = handler.protect_and_process(text, remove_punctuation)
    print(f"Smart removal: {smart}")
    print()

# === Emoticon Normalization ===

print("=== Emoticon Handling ===\n")


def normalize_emoticons(text: str) -> str:
    """Convert emoticons to normalized tokens."""
    # Positive emoticons
    positive = [':)', ':-)', ':D', ':-D', ';)', ';-)', ':P', ':-P', '^_^', '<3']
    # Negative emoticons
    negative = [':(', ':-(', ":'(", ":'-(", '-_-', '</3']

    result = text
    for emot in positive:
        result = result.replace(emot, ' _EMOTICON_POSITIVE_ ')
    for emot in negative:
        result = result.replace(emot, ' _EMOTICON_NEGATIVE_ ')

    return re.sub(r'\s+', ' ', result).strip()


emoticon_texts = [
    "Great product!\nLove it :) :D",
    "Terrible experience :( :'(",
    "It's okay I guess... -_-",
]

for text in emoticon_texts:
    print(f"Original: {text}")
    print(f"Normalized: {normalize_emoticons(text)}")
    print()
```

| Construct | Example | Handling Strategy |
|---|---|---|
| URLs | https://example.com | Replace with <URL> token or preserve entirely |
| Emails | user@domain.com | Replace with <EMAIL> token for privacy |
| Hashtags | #MachineLearning | Preserve or extract as separate feature |
| Mentions | @username | Preserve, extract, or replace with <USER> |
| Emoticons | :) :( ;-) | Normalize to polarity tokens or preserve |
| Contractions | don't, it's | Expand or preserve depending on task |
| Numbers | $99.99 | Preserve, normalize, or replace with <NUM> |
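As a minimal sketch of the replace-with-token strategies from the table above: the token names <URL>, <EMAIL>, <USER>, and <NUM> are conventions rather than a standard, so pick whatever placeholders your downstream vocabulary expects.

```python
# Swap special constructs for placeholder tokens instead of deleting them.
# Order matters: emails are replaced before @mentions so the two don't collide.
import re

REPLACEMENTS = [
    (re.compile(r'https?://\S+|www\.\S+', re.IGNORECASE), '<URL>'),
    (re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'), '<EMAIL>'),
    (re.compile(r'@\w+'), '<USER>'),
    (re.compile(r'\$?\d+(?:[.,]\d+)?%?'), '<NUM>'),
]


def replace_with_tokens(text: str) -> str:
    """Replace URLs, emails, mentions, and numbers with placeholder tokens."""
    for pattern, token in REPLACEMENTS:
        text = pattern.sub(token, text)
    return text


print(replace_with_tokens(
    "Email sales@example.com or visit https://example.com to get $99.99 off, says @jane!"
))
# -> "Email <EMAIL> or visit <URL> to get <NUM> off, says <USER>!"
```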
Rather than simply removing or preserving punctuation, normalization transforms punctuation into consistent forms. This is particularly important for noisy text from social media, OCR, or multilingual sources.
Common normalizations include mapping Unicode quote and dash variants to their ASCII equivalents, converting the Unicode ellipsis to three periods, collapsing repeated marks such as "!!!", and fixing spacing around punctuation:
```python
"""Punctuation normalization for cleaner text."""
import re
from typing import Dict


class PunctuationNormalizer:
    """
    Normalize punctuation for consistent representation.
    """

    def __init__(self):
        # Unicode quote variants to ASCII
        self.quote_map = {
            '“': '"', '”': '"',   # Curly double quotes
            '‘': "'", '’': "'",   # Curly single quotes
            '«': '"', '»': '"',   # Guillemets
            '‹': "'", '›': "'",   # Single guillemets
            '„': '"', '‚': "'",   # German quotes
        }

        # Dash variants
        self.dash_map = {
            '—': '-',  # Em dash
            '–': '-',  # En dash
            '−': '-',  # Minus sign
            '‐': '-',  # Hyphen (yes, a distinct Unicode character)
            '‑': '-',  # Non-breaking hyphen
        }

        # Ellipsis
        self.ellipsis_map = {
            '…': '...',  # Unicode ellipsis to ASCII
        }

    def normalize_quotes(self, text: str) -> str:
        """Convert all quote variants to ASCII."""
        for fancy, plain in self.quote_map.items():
            text = text.replace(fancy, plain)
        return text

    def normalize_dashes(self, text: str) -> str:
        """Convert all dash variants to ASCII hyphen."""
        for fancy, plain in self.dash_map.items():
            text = text.replace(fancy, plain)
        return text

    def normalize_ellipsis(self, text: str) -> str:
        """Normalize ellipsis to three periods."""
        text = text.replace('…', '...')
        # Collapse runs of four or more periods to three
        text = re.sub(r'\.{4,}', '...', text)
        return text

    def collapse_repeated_punct(self, text: str, max_repeat: int = 1) -> str:
        """
        Collapse repeated punctuation marks.
        '!!!' -> '!' if max_repeat=1
        '!!!' -> '!!' if max_repeat=2
        """
        punct_chars = r'[!?.,;:]'
        pattern = rf'({punct_chars})\1{{{max_repeat},}}'
        return re.sub(pattern, r'\1' * max_repeat, text)

    def mark_repeated_punct(self, text: str) -> str:
        """
        Mark repeated punctuation with tokens instead of collapsing.
        'Great!!!' -> 'Great! _REPEAT_PUNCT_'
        """
        def replacer(match):
            char = match.group(1)
            count = len(match.group(0))
            if count > 1:
                return char + ' _REPEAT_PUNCT_ '
            return char

        return re.sub(r'([!?.])(\1+)', replacer, text)

    def normalize_spacing(self, text: str) -> str:
        """Normalize spacing around punctuation."""
        # Space after terminal punctuation
        text = re.sub(r'([.!?])([A-Z])', r'\1 \2', text)
        # No space before punctuation
        text = re.sub(r'\s+([.,;:!?])', r'\1', text)
        # Single space after commas and other marks
        text = re.sub(r'([.,;:!?])\s*', r'\1 ', text)
        # Collapse multiple spaces
        text = re.sub(r' +', ' ', text)
        return text.strip()

    def normalize_all(self, text: str, collapse_repeat: bool = True) -> str:
        """Apply all normalizations."""
        text = self.normalize_quotes(text)
        text = self.normalize_dashes(text)
        text = self.normalize_ellipsis(text)
        if collapse_repeat:
            text = self.collapse_repeated_punct(text)
        text = self.normalize_spacing(text)
        return text


# Demonstration
normalizer = PunctuationNormalizer()

test_cases = [
    # Curly quotes
    '“Hello,” she said, ‘how are you?’',
    # Multiple exclamation
    "This is AMAZING!!! I love it!!!",
    # Various dashes
    "Machine learning — a broad field — is fascinating.",
    # Ellipsis
    "I don't know… maybe……",
    # Spacing issues
    "Hello.How are you?I'm fine,thanks.",
    # Mixed issues
    '“Wow!!!” said John—he couldn’t believe it…',
]

print("=== Punctuation Normalization ===\n")

for text in test_cases:
    normalized = normalizer.normalize_all(text)
    print(f"Original:   {text}")
    print(f"Normalized: {normalized}")
    print()

# Repeated punctuation as feature
print("=== Repeated Punctuation as Feature ===\n")

feature_texts = [
    "Great product!!!",
    "Really???",
    "Okay.",
]

for text in feature_texts:
    marked = normalizer.mark_repeated_punct(text)
    print(f"Original: {text}")
    print(f"Marked:   {marked}")
    print()
```

Modern tokenizers handle punctuation differently, and understanding these behaviors helps you make appropriate preprocessing decisions.
Whitespace Tokenizers:
Split on whitespace only, leaving punctuation attached to words ("Hello," includes comma).
Word Tokenizers (NLTK, spaCy):
Separate punctuation as distinct tokens ("Hello" and "," are separate).
Subword Tokenizers (BPE, WordPiece):
May include punctuation marks in the vocabulary or handle them as part of subword units.
```python
"""How different tokenizers handle punctuation."""
import re
from nltk.tokenize import word_tokenize, RegexpTokenizer
import spacy

# Load spaCy
nlp = spacy.load("en_core_web_sm")

text = "Hello, World! How's it going? I'd say 99% of this works... mostly."

print("=== Tokenizer Punctuation Handling ===\n")
print(f"Input: {text}\n")

# Whitespace tokenizer
whitespace_tokens = text.split()
print("Whitespace split:")
print(f"  {whitespace_tokens}")
print("  Note: Punctuation attached ('Hello,' 'going?')\n")

# NLTK word_tokenize
nltk_tokens = word_tokenize(text)
print("NLTK word_tokenize:")
print(f"  {nltk_tokens}")
print("  Note: Punctuation separated, contractions split\n")

# spaCy tokenizer
doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print("spaCy tokenizer:")
print(f"  {spacy_tokens}")
print("  Note: Similar to NLTK, punctuation as tokens\n")

# spaCy with token attributes
print("spaCy token details:")
for token in doc:
    if token.is_punct:
        print(f"  '{token.text}' - is_punct=True")

# NLTK RegexpTokenizer (customizable)
print("\nCustom regex tokenizers:")

# Words only (no punctuation)
word_only = RegexpTokenizer(r'\w+')
print(f"  Words only:  {word_only.tokenize(text)}")

# Words and punctuation separately
word_punct = RegexpTokenizer(r"\w+|[^\w\s]")
print(f"  Words+punct: {word_punct.tokenize(text)}")

# === Handling in sklearn ===

print("\n=== Scikit-learn Vectorizers ===\n")

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "Hello, World!",
    "Hello World",
    "Machine learning is great!!!",
]

# Default behavior (removes punctuation via token pattern)
vec_default = CountVectorizer()
vec_default.fit(texts)
print("Default tokenizer vocabulary:")
print(f"  {vec_default.get_feature_names_out()}")
print("  Note: Punctuation removed by default\n")

# Custom token pattern to include punctuation
vec_with_punct = CountVectorizer(token_pattern=r"\b\w+\b|[^\w\s]")
vec_with_punct.fit(texts)
print("Custom pattern (keep punct) vocabulary:")
print(f"  {vec_with_punct.get_feature_names_out()}")

# === Transformer Tokenizers ===

print("\n=== Transformer Tokenizers ===\n")

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

test = "Hello! How's it going???"

print(f"Input: {test}\n")

bert_tokens = bert_tokenizer.tokenize(test)
gpt2_tokens = gpt2_tokenizer.tokenize(test)

print(f"BERT tokens:  {bert_tokens}")
print(f"GPT-2 tokens: {gpt2_tokens}")
print("\nNote: Modern transformers handle punctuation natively.")
print("      Pre-processing should generally NOT remove punctuation.")
```

BERT, GPT, and other transformer models were pre-trained on text with punctuation. Their tokenizers handle punctuation appropriately. Removing punctuation before passing text to these models creates a distribution mismatch and typically hurts performance.
A production-ready punctuation handling pipeline combines multiple strategies based on task requirements.
```python
"""Production-ready punctuation handling pipeline."""
import re
import string
from typing import List, Dict, Set, Optional, Callable
from dataclasses import dataclass
from enum import Enum


class PunctuationStrategy(Enum):
    REMOVE_ALL = "remove_all"
    PRESERVE_ALL = "preserve_all"
    NORMALIZE = "normalize"
    SELECTIVE = "selective"
    SMART = "smart"


@dataclass
class PunctuationConfig:
    """Configuration for punctuation handling."""
    strategy: PunctuationStrategy = PunctuationStrategy.SMART
    preserve_marks: Set[str] = None
    remove_marks: Set[str] = None
    collapse_repeated: bool = True
    max_repeat: int = 1
    normalize_unicode: bool = True
    protect_urls: bool = True
    protect_emails: bool = True
    protect_hashtags: bool = True
    protect_mentions: bool = True
    protect_emoticons: bool = True
    emoticon_tokens: bool = False  # Convert to _POS_/_NEG_ tokens


class ProductionPunctuationHandler:
    """
    Production-ready punctuation handler with configurable behavior.
    """

    def __init__(self, config: PunctuationConfig = None):
        self.config = config or PunctuationConfig()

        # Compile patterns
        self.patterns = {
            'url': re.compile(r'https?://[^\s]+|www\.[^\s]+', re.I),
            'email': re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
            'hashtag': re.compile(r'#\w+'),
            'mention': re.compile(r'@\w+'),
            'emoticon_pos': re.compile(r'[:;][\-~]?[\)D\]>]|<3|\^_\^'),
            'emoticon_neg': re.compile(r'[:;][\-~]?[\(\[<]|</3|-_-'),
        }

        # Unicode normalization
        self.unicode_map = str.maketrans({
            '“': '"', '”': '"',
            '‘': "'", '’': "'",
            '—': '-', '–': '-',
            '…': '...',
        })

    def _protect_patterns(self, text: str) -> tuple:
        """Protect special patterns with placeholders."""
        placeholders = {}
        counter = 0

        protect_list = []
        if self.config.protect_urls:
            protect_list.append('url')
        if self.config.protect_emails:
            protect_list.append('email')
        if self.config.protect_hashtags:
            protect_list.append('hashtag')
        if self.config.protect_mentions:
            protect_list.append('mention')
        if self.config.protect_emoticons:
            protect_list.extend(['emoticon_pos', 'emoticon_neg'])

        for pattern_name in protect_list:
            pattern = self.patterns.get(pattern_name)
            if pattern:
                def _replace(match, name=pattern_name):
                    nonlocal counter
                    # Placeholder contains no punctuation, so it survives removal
                    placeholder = f"PROTECTEDTOKEN{counter}X"
                    placeholders[placeholder] = (match.group(), name)
                    counter += 1
                    return placeholder

                text = pattern.sub(_replace, text)

        return text, placeholders

    def _restore_patterns(self, text: str, placeholders: dict) -> str:
        """Restore protected patterns."""
        for placeholder, (original, pattern_name) in placeholders.items():
            if self.config.emoticon_tokens and 'emoticon' in pattern_name:
                if 'pos' in pattern_name:
                    replacement = ' _EMOTICON_POS_ '
                else:
                    replacement = ' _EMOTICON_NEG_ '
                text = text.replace(placeholder, replacement)
            else:
                text = text.replace(placeholder, original)
        return text

    def _remove_punctuation(self, text: str) -> str:
        """Remove punctuation based on config."""
        if self.config.preserve_marks:
            to_remove = set(string.punctuation) - self.config.preserve_marks
        elif self.config.remove_marks:
            to_remove = self.config.remove_marks
        else:
            to_remove = set(string.punctuation)

        for char in to_remove:
            text = text.replace(char, ' ')

        return re.sub(r'\s+', ' ', text).strip()

    def _normalize_punctuation(self, text: str) -> str:
        """Normalize punctuation."""
        # Unicode normalization
        if self.config.normalize_unicode:
            text = text.translate(self.unicode_map)

        # Collapse repeated marks
        if self.config.collapse_repeated:
            for char in '!?.':
                pattern = f'\\{char}' if char in '.?' else char
                text = re.sub(f'{pattern}{{2,}}',
                              char * self.config.max_repeat, text)

        # Normalize spacing
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def process(self, text: str) -> str:
        """Process text according to configuration."""
        # Protect special patterns
        text, placeholders = self._protect_patterns(text)

        # Apply strategy
        if self.config.strategy == PunctuationStrategy.REMOVE_ALL:
            text = self._remove_punctuation(text)
        elif self.config.strategy == PunctuationStrategy.NORMALIZE:
            text = self._normalize_punctuation(text)
        elif self.config.strategy == PunctuationStrategy.SELECTIVE:
            text = self._remove_punctuation(text)
            text = self._normalize_punctuation(text)
        elif self.config.strategy == PunctuationStrategy.SMART:
            text = self._normalize_punctuation(text)
        # elif PRESERVE_ALL: do nothing

        # Restore protected patterns
        text = self._restore_patterns(text, placeholders)

        return text

    def batch_process(self, texts: List[str]) -> List[str]:
        """Process multiple texts."""
        return [self.process(text) for text in texts]


# === Usage Examples ===

print("=== Production Punctuation Handler ===\n")

# Different configurations for different tasks
configs = {
    'bag_of_words': PunctuationConfig(
        strategy=PunctuationStrategy.REMOVE_ALL,
        protect_urls=False,
        protect_emails=False,
    ),
    'sentiment_analysis': PunctuationConfig(
        strategy=PunctuationStrategy.SELECTIVE,
        preserve_marks={'!', '?'},
        protect_emoticons=True,
        emoticon_tokens=True,
        collapse_repeated=True,
        max_repeat=1,
    ),
    'transformer_prep': PunctuationConfig(
        strategy=PunctuationStrategy.NORMALIZE,
        normalize_unicode=True,
        collapse_repeated=True,
        max_repeat=2,
    ),
}

test_text = "OMG!!! Check https://example.com :) I can't believe it??? @john #amazing"

print(f"Input: {test_text}\n")

for config_name, config in configs.items():
    handler = ProductionPunctuationHandler(config)
    result = handler.process(test_text)
    print(f"{config_name}:")
    print(f"  {result}\n")
```

Punctuation handling requires thoughtful consideration of your NLP task. The right approach varies from complete removal to careful preservation and normalization.
Module Complete:
With this page, you've completed the Text Preprocessing module. You now have a comprehensive understanding of the essential preprocessing steps: tokenization, stop-word removal, stemming, lemmatization, case normalization, and punctuation handling. These techniques form the foundation for all text-based machine learning applications.
You now understand punctuation handling from basic removal through production-ready pipelines. You can make informed decisions about when to remove, preserve, or normalize punctuation, handle special cases appropriately, and integrate punctuation processing into your NLP workflows.