Punctuation marks seem like trivial noise in text—periods, commas, exclamation points. The simplest preprocessing approach strips them all. But punctuation carries more information than first appears:
emoticons like :), hashtags like #ML, URLs, code, and abbreviations all depend on it.

The right approach to punctuation depends entirely on your NLP task. Bag-of-words models typically discard punctuation; sentiment analysis may preserve exclamation marks; sentence segmentation depends on proper punctuation handling.
By the end of this page, you will know when to remove, preserve, or transform punctuation; how to handle special cases such as emoticons, hashtags, and URLs; how to implement production-ready punctuation processing; and how to make task-appropriate decisions about punctuation in your NLP pipeline.
Not all punctuation is equal. Different marks serve different functions, and understanding these distinctions guides preprocessing decisions.
Terminal Punctuation:
Marks that end sentences and indicate sentence type:
- . (period) — Declarative statement
- ? (question mark) — Interrogative
- ! (exclamation mark) — Exclamative, emphasis

Clause-Level Punctuation:
Marks that organize within sentences:
- , (comma) — Clause separation, lists, pauses
- ; (semicolon) — Related independent clauses
- : (colon) — Introduction, elaboration
- — (em dash) — Interruption, emphasis

Word-Level Punctuation:
Marks that modify individual words:
- ' (apostrophe) — Possession, contractions
- - (hyphen) — Compound words, prefixes
- / (slash) — Alternatives, dates, paths

Quotation and Bracketing:
Marks that delimit quoted material or parenthetical content:
- " " and ' ' — Direct quotation
- ( ) — Parenthetical content
- [ ] — Editorial insertions, citations
- { } — Sets, code blocks

| Category | Examples | NLP Relevance |
|---|---|---|
| Terminal | . ? ! | Sentence segmentation, question detection, emphasis |
| Clause | , ; : — | Clause boundaries, list parsing |
| Word-level | ' - / | Contractions, compounds, dates |
| Quotation | " ' « » | Direct speech, citations |
| Brackets | ( ) [ ] { } | Parentheticals, code, references |
| Special | @ # $ % & * | Mentions, hashtags, symbols |
| Emoticons | :) :( ;-) | Sentiment, emotion |
```python
"""Analyze punctuation distribution in text."""
import string
import re
from collections import Counter
from typing import Dict, List


def analyze_punctuation(text: str) -> Dict:
    """Analyze punctuation usage in text."""
    # Standard ASCII punctuation
    standard_punct = set(string.punctuation)

    # Count punctuation occurrences
    punct_counts = Counter()
    for char in text:
        if char in standard_punct:
            punct_counts[char] += 1

    # Categorize punctuation
    categories = {
        'terminal': set('.?!'),
        'clause': set(',;:'),
        'word_level': set("'-/"),
        'quotation': set('"\'«»“”‘’'),
        'brackets': set('()[]{}'),
        'special': set('@#$%&*'),
    }

    category_counts = {}
    for cat, chars in categories.items():
        category_counts[cat] = sum(punct_counts[c] for c in chars)

    # Find emoticons
    emoticon_pattern = r'[:;=8][\-~]?[\)\(DPp\]\[/\\]|[\)\(DPp\]\[/\\][\-~]?[:;=8]'
    emoticons = re.findall(emoticon_pattern, text)

    # Find hashtags and mentions
    hashtags = re.findall(r'#\w+', text)
    mentions = re.findall(r'@\w+', text)

    return {
        'total_chars': len(text),
        'total_punct': sum(punct_counts.values()),
        'punct_ratio': sum(punct_counts.values()) / len(text) if text else 0,
        'punct_counts': dict(punct_counts.most_common()),
        'category_counts': category_counts,
        'emoticons': emoticons,
        'hashtags': hashtags,
        'mentions': mentions,
    }


# Analyze different text types
texts = {
    'formal': """The research, conducted by Dr. Smith et al., demonstrates
significant findings. Three key results emerged: (1) improved accuracy,
(2) reduced latency, and (3) better scalability.""",
    'social': """OMG!! This is amazing!!! 😍 Can't believe it worked!!!
@JohnDoe check this out #MachineLearning #AI :) :D""",
    'code': """def process(data: Dict[str, Any]) -> List[str]:
    return [x['key'] for x in data.items() if x != None]""",
}

print("=== Punctuation Analysis by Text Type ===\n")

for text_type, text in texts.items():
    analysis = analyze_punctuation(text)
    print(f"{text_type.upper()}:")
    print(f"  Total chars: {analysis['total_chars']}")
    print(f"  Punctuation: {analysis['total_punct']} ({analysis['punct_ratio']*100:.1f}%)")
    print(f"  Top marks: {list(analysis['punct_counts'].items())[:5]}")
    print(f"  Categories: {analysis['category_counts']}")
    if analysis['emoticons']:
        print(f"  Emoticons: {analysis['emoticons']}")
    if analysis['hashtags']:
        print(f"  Hashtags: {analysis['hashtags']}")
    if analysis['mentions']:
        print(f"  Mentions: {analysis['mentions']}")
    print()
```

For many NLP tasks, punctuation is noise that increases vocabulary size without providing useful features. These scenarios typically call for punctuation removal.
```python
"""Various approaches to punctuation removal."""
import string
import re
from typing import List


# === Method 1: Simple Character Filtering ===

def remove_punct_translate(text: str) -> str:
    """
    Remove punctuation using str.translate().
    Fastest method for simple removal.
    """
    translator = str.maketrans('', '', string.punctuation)
    return text.translate(translator)


# === Method 2: Regex-Based Removal ===

def remove_punct_regex(text: str) -> str:
    """
    Remove punctuation using regex.
    More flexible for complex patterns.
    """
    return re.sub(r'[^\w\s]', '', text)


def remove_punct_regex_unicode(text: str) -> str:
    """
    Remove punctuation including Unicode punctuation.
    Handles international text properly.
    Requires the third-party `regex` package; the built-in `re`
    module does not support \p{...} character classes.
    """
    import regex
    return regex.sub(r'[\p{P}\p{S}]', '', text)


# === Method 3: Token-Based Removal ===

def remove_punct_tokens(tokens: List[str]) -> List[str]:
    """
    Remove punctuation tokens from tokenized text.
    Works well with pre-tokenized input.
    """
    punct_chars = set(string.punctuation)
    return [t for t in tokens
            if t not in punct_chars
            and not all(c in punct_chars for c in t)]


# === Method 4: Selective Removal ===

def remove_punct_selective(
    text: str,
    preserve: set = None,
    remove_extra: set = None
) -> str:
    """
    Selectively remove punctuation, preserving specific marks.

    Args:
        text: Input text
        preserve: Punctuation to keep (e.g., {'!', '?'} for sentiment)
        remove_extra: Additional characters to remove (e.g., {'@', '#'})
    """
    if preserve is None:
        preserve = set()
    if remove_extra is None:
        remove_extra = set()

    to_remove = (set(string.punctuation) - preserve) | remove_extra
    translator = str.maketrans({c: ' ' for c in to_remove})
    result = text.translate(translator)
    # Collapse multiple spaces
    return re.sub(r'\s+', ' ', result).strip()


# === Comparison ===

test_texts = [
    "Hello, World! How's it going?",
    "I love #MachineLearning!!! :)",
    "Dr. Smith's research (2023) was ground-breaking.",
    "Price: $99.99/month — great deal!",
]

print("=== Punctuation Removal Methods ===\n")

for text in test_texts:
    print(f"Original: {text}")
    print(f"  translate(): {remove_punct_translate(text)}")
    print(f"  regex():     {remove_punct_regex(text)}")
    print()

# Selective removal
print("=== Selective Removal ===\n")

sentiment_text = "I LOVE this!!! Amazing!!! But the service? Not great..."

print(f"Original: {sentiment_text}")
print(f"Remove all: {remove_punct_translate(sentiment_text)}")
print(f"Keep ! and ?: {remove_punct_selective(sentiment_text, preserve={'!', '?'})}")

# Token-based removal
print("\n=== Token-Based Removal ===\n")

tokens = ["Hello", ",", "World", "!", "How", "'s", "it", "going", "?"]
print(f"Original tokens: {tokens}")
print(f"After removal: {remove_punct_tokens(tokens)}")
```

Replacing punctuation with spaces often leaves multiple consecutive spaces ('Hello, World!' → 'Hello  World '), while deleting characters outright can fuse words such as 'ground-breaking' into 'groundbreaking'. Always collapse whitespace to single spaces after removal, or tokenization may produce empty tokens.
Several NLP tasks depend on punctuation for accurate processing. In these scenarios, removing punctuation degrades performance.
Sentiment Analysis:
Exclamation marks amplify sentiment. "Great!" is more positive than "Great." Triple exclamation marks "!!!" often indicate strong emotion. Question marks can indicate uncertainty or negativity.
Sentence Segmentation:
Periods, question marks, and exclamation marks are the primary signals for sentence boundaries. Removing them before segmentation makes the task nearly impossible.
Machine Translation:
Punctuation must be preserved and accurately translated. Question marks, quotation styles, and list punctuation vary by language.
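One concrete pitfall when punctuation varies by language: Python's string.punctuation covers only ASCII, so language-specific marks are invisible to ASCII-based removal or preservation logic. A quick illustration using only the standard library (no particular MT system is assumed):

```python
# Language-specific punctuation is not in string.punctuation (ASCII only),
# so ASCII-based handling silently ignores it.
import string

marks = ['?', '!', '¿', '¡', '«', '»', '。', '、', '？']
for mark in marks:
    print(f"{mark!r}: in string.punctuation = {mark in string.punctuation}")
# Only the ASCII '?' and '!' are members; the rest need Unicode-aware handling.
```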
Text Generation:
Models must learn to generate appropriate punctuation for fluent, readable output.
```python
"""Demonstrate punctuation impact on sentiment analysis."""
import re
from typing import List, Tuple


def extract_punctuation_features(text: str) -> dict:
    """Extract punctuation-based features for sentiment."""
    return {
        'exclamation_count': text.count('!'),
        'question_count': text.count('?'),
        'ellipsis_count': text.count('...'),
        'has_all_caps': any(word.isupper() and len(word) > 1
                            for word in text.split()),
        'multiple_exclamations': '!!' in text,
        'ends_with_question': text.rstrip().endswith('?'),
        'ends_with_exclamation': text.rstrip().endswith('!'),
        'emoticons_positive': len(re.findall(r'[:;][\-~]?[\)D]', text)),
        'emoticons_negative': len(re.findall(r'[:;][\-~]?[\(]', text)),
    }


# Sentiment examples where punctuation matters
sentiment_pairs = [
    # (text with punctuation, text without)
    ("I love this!", "I love this"),
    ("I LOVE this!!!", "I love this"),
    ("Great product...", "Great product"),
    ("Really?", "Really"),
    ("This is good :)", "This is good"),
    ("Not bad, I guess...", "Not bad I guess"),
]

print("=== Punctuation Impact on Sentiment Features ===\n")

for with_punct, without_punct in sentiment_pairs:
    features_with = extract_punctuation_features(with_punct)
    features_without = extract_punctuation_features(without_punct)

    print(f"With punctuation: '{with_punct}'")
    print(f"  Features: {features_with}")
    print(f"Without punctuation: '{without_punct}'")
    print(f"  Features: {features_without}")
    print()

# === Sentence Segmentation Example ===

print("=== Punctuation Impact on Sentence Segmentation ===\n")


def segment_sentences(text: str) -> List[str]:
    """Simple sentence segmentation using terminal punctuation."""
    sentences = re.split(r'(?<=[.!?])\s+', text)
    return [s.strip() for s in sentences if s.strip()]


test_text = "I love this product! It works great. Do you have it in blue? I'd buy that too."

print("Original text:")
print(f"  {test_text}\n")

print("Sentences detected:")
for i, sent in enumerate(segment_sentences(test_text)):
    print(f"  {i+1}. {sent}")

# Without punctuation
no_punct = re.sub(r'[.!?]', '', test_text)
print("\nWithout terminal punctuation:")
print(f"  {no_punct}")
print(f"  Sentences detected: {len(segment_sentences(no_punct))}")

# === Abbreviation Handling ===

print("\n=== Abbreviation Periods ===\n")

abbreviation_examples = [
    "Dr. Smith works at U.S. Bank.",
    "The U.N. met at 3 p.m. today.",
    "I have a Ph.D. from M.I.T.",
]

for text in abbreviation_examples:
    naive_sentences = re.split(r'\.\s+', text)
    print(f"Text: {text}")
    print(f"Naive split: {naive_sentences}")
    print("Problem: Abbreviation periods cause incorrect splits")
    print()
```

Rather than simply preserving or removing punctuation, consider extracting punctuation patterns as features: exclamation count, question mark presence, emoticon polarity. This captures information without including punctuation in the vocabulary.
Modern text, especially from social media, contains constructs that mix punctuation with semantic content. Naive punctuation stripping destroys these meaningful patterns.
Emoticons and Emojis:
Text emoticons like :) or :-D carry sentiment. Stripping punctuation destroys them. Emojis (Unicode characters) are technically not punctuation but require consideration.
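Because emoji are Unicode symbol characters rather than punctuation, a character-category check can keep them while dropping true punctuation. The sketch below is a minimal illustration using unicodedata, not a complete emoji detector:

```python
# A catch-all regex like [^\w\s] deletes emoji along with punctuation,
# because emoji are neither word characters nor whitespace.
import re
import unicodedata


def strip_punct_keep_emoji(text: str) -> str:
    """Drop Unicode punctuation categories (P*) but keep emoji and other symbols."""
    return ''.join(ch for ch in text
                   if not unicodedata.category(ch).startswith('P'))


text = "Loved it!!! 😍 (would buy again...)"
print(re.sub(r'[^\w\s]', '', text))   # emoji removed along with the punctuation
print(strip_punct_keep_emoji(text))   # punctuation removed, emoji kept
```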
URLs and Email Addresses:
URLs use periods, slashes, and colons as part of their syntax; https://example.com/path becomes meaningless if its punctuation is stripped.
Hashtags and Mentions:
Twitter-style #hashtags and @mentions use special characters meaningfully.
Contractions and Possessives:
Apostrophes in "don't", "it's", "John's" are grammatically significant. Removing them creates "dont", "its", "Johns"—which may or may not be acceptable.
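If the downstream task is sensitive to contractions, one option is to expand them before any punctuation removal. A minimal sketch with a deliberately tiny mapping (a real pipeline would use a much fuller dictionary or a dedicated library):

```python
# Expand contractions before stripping apostrophes.
# CONTRACTIONS is a small illustrative sample, not a complete list.
import re

CONTRACTIONS = {
    "don't": "do not",
    "it's": "it is",
    "can't": "cannot",
    "i'm": "i am",
    "won't": "will not",
}


def expand_contractions(text: str) -> str:
    """Replace known contractions with their expanded forms (case-insensitive)."""
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(c) for c in CONTRACTIONS) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(lambda m: CONTRACTIONS[m.group(0).lower()], text)


print(expand_contractions("I don't think it's broken."))
# -> "I do not think it is broken."
```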
```python
"""Handling special punctuation cases."""
import re
from typing import Tuple, List, Dict


class SmartPunctuationHandler:
    """
    Intelligent punctuation handling that preserves special constructs.
    """

    def __init__(self):
        # Patterns for special constructs
        self.patterns = {
            'url': re.compile(
                r'https?://[^\s]+|www\.[^\s]+',
                re.IGNORECASE
            ),
            'email': re.compile(
                r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
            ),
            'hashtag': re.compile(r'#\w+'),
            'mention': re.compile(r'@\w+'),
            'emoticon': re.compile(
                r'[:;=8B][\-~\^]?[\)\(DPp\]\[oO3><]|'   # Western
                r'[\)\(DPp\]\[oO3><][\-~\^]?[:;=8B]|'   # Reversed
                r'<3|</3|<\\3|'                          # Hearts
                r'\^_\^|\^\^|-_-|>_<|UwU|OwO'            # Text faces
            ),
            'number': re.compile(r'\$?\d+[,.]?\d*%?'),
            'abbreviation': re.compile(
                r'\b(?:Mr|Mrs|Ms|Dr|Prof|Inc|Ltd|Jr|Sr|vs|etc|i\.e|e\.g)\.?',
                re.IGNORECASE
            ),
            'contraction': re.compile(
                r"\b\w+'\w+\b"
            ),
        }

    def find_special_constructs(self, text: str) -> Dict[str, List[str]]:
        """Find all special constructs in text."""
        found = {}
        for name, pattern in self.patterns.items():
            matches = pattern.findall(text)
            if matches:
                found[name] = matches
        return found

    def protect_and_process(
        self,
        text: str,
        process_func,
        protect: List[str] = None
    ) -> str:
        """
        Protect special constructs, process text, then restore.

        Args:
            text: Input text
            process_func: Function to apply (e.g., remove punctuation)
            protect: List of construct names to protect
        """
        if protect is None:
            protect = ['url', 'email', 'hashtag', 'mention', 'emoticon']

        # Store protected strings with placeholders
        placeholders = {}
        modified_text = text
        counter = 0

        for construct_type in protect:
            if construct_type in self.patterns:
                pattern = self.patterns[construct_type]
                for match in pattern.finditer(text):
                    # Placeholder contains no punctuation, so it survives
                    # whatever removal process_func applies.
                    placeholder = f"PROTECTEDTOKEN{counter}X"
                    placeholders[placeholder] = match.group()
                    modified_text = modified_text.replace(
                        match.group(), placeholder, 1
                    )
                    counter += 1

        # Apply processing
        processed = process_func(modified_text)

        # Restore protected content
        for placeholder, original in placeholders.items():
            processed = processed.replace(placeholder, original)

        return processed


# Simple punctuation removal function
def remove_punctuation(text: str) -> str:
    import string
    return text.translate(str.maketrans('', '', string.punctuation))


# Demonstration
handler = SmartPunctuationHandler()

test_texts = [
    "I love this!!! Check out https://example.com :)",
    "Contact support@company.com or @TwitterHandle",
    "Great #MachineLearning talk! Dr. Smith was amazing ^_^",
    "Price: $99.99 (that's 50% off!)",
    "I can't believe it's not butter!",
]

print("=== Smart Punctuation Handling ===\n")

for text in test_texts:
    print(f"Original: {text}")

    # Find special constructs
    special = handler.find_special_constructs(text)
    if special:
        print(f"Special constructs: {special}")

    # Naive removal
    naive = remove_punctuation(text)
    print(f"Naive removal: {naive}")

    # Smart removal
    smart = handler.protect_and_process(text, remove_punctuation)
    print(f"Smart removal: {smart}")
    print()

# === Emoticon Normalization ===

print("=== Emoticon Handling ===\n")


def normalize_emoticons(text: str) -> str:
    """Convert emoticons to normalized tokens."""
    # Positive emoticons
    positive = [':)', ':-)', ':D', ':-D', ';)', ';-)', ':P', ':-P', '^_^', '<3']
    # Negative emoticons
    negative = [':(', ':-(', ":'(", ":'-(", '-_-', '</3']

    result = text
    for emot in positive:
        result = result.replace(emot, ' _EMOTICON_POSITIVE_ ')
    for emot in negative:
        result = result.replace(emot, ' _EMOTICON_NEGATIVE_ ')

    return re.sub(r'\s+', ' ', result).strip()


emoticon_texts = [
    "Great product!\nLove it :) :D",
    "Terrible experience :( :'(",
    "It's okay I guess... -_-",
]

for text in emoticon_texts:
    print(f"Original: {text}")
    print(f"Normalized: {normalize_emoticons(text)}")
    print()
```

| Construct | Example | Handling Strategy |
|---|---|---|
| URLs | https://example.com | Replace with <URL> token or preserve entirely |
| Emails | user@domain.com | Replace with <EMAIL> token for privacy |
| Hashtags | #MachineLearning | Preserve or extract as separate feature |
| Mentions | @username | Preserve, extract, or replace with <USER> |
| Emoticons | :) :( ;-) | Normalize to polarity tokens or preserve |
| Contractions | don't, it's | Expand or preserve depending on task |
| Numbers | $99.99 | Preserve, normalize, or replace with <NUM> |
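As a minimal sketch of the replace-with-token strategies from the table above: the token names <URL>, <EMAIL>, <USER>, and <NUM> are conventions rather than a standard, so pick whatever placeholders your downstream vocabulary expects.

```python
# Swap special constructs for placeholder tokens instead of deleting them.
# Order matters: emails are replaced before @mentions so the two don't collide.
import re

REPLACEMENTS = [
    (re.compile(r'https?://\S+|www\.\S+', re.IGNORECASE), '<URL>'),
    (re.compile(r'[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}'), '<EMAIL>'),
    (re.compile(r'@\w+'), '<USER>'),
    (re.compile(r'\$?\d+(?:[.,]\d+)?%?'), '<NUM>'),
]


def replace_with_tokens(text: str) -> str:
    """Replace URLs, emails, mentions, and numbers with placeholder tokens."""
    for pattern, token in REPLACEMENTS:
        text = pattern.sub(token, text)
    return text


print(replace_with_tokens(
    "Email sales@example.com or visit https://example.com to get $99.99 off, says @jane!"
))
# -> "Email <EMAIL> or visit <URL> to get <NUM> off, says <USER>!"
```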
Rather than simply removing or preserving punctuation, normalization transforms punctuation into consistent forms. This is particularly important for noisy text from social media, OCR, or multilingual sources.
Common normalizations include mapping Unicode quote and dash variants to their ASCII equivalents, converting the Unicode ellipsis to three periods, collapsing repeated marks such as "!!!", and fixing spacing around punctuation:
```python
"""Punctuation normalization for cleaner text."""
import re
from typing import Dict


class PunctuationNormalizer:
    """
    Normalize punctuation for consistent representation.
    """

    def __init__(self):
        # Unicode quote variants to ASCII
        self.quote_map = {
            '“': '"', '”': '"',   # Curly double quotes
            '‘': "'", '’': "'",   # Curly single quotes
            '«': '"', '»': '"',   # Guillemets
            '‹': "'", '›': "'",   # Single guillemets
            '„': '"', '‚': "'",   # German quotes
        }

        # Dash variants
        self.dash_map = {
            '—': '-',  # Em dash
            '–': '-',  # En dash
            '−': '-',  # Minus sign
            '‐': '-',  # Hyphen (yes, a distinct Unicode character)
            '‑': '-',  # Non-breaking hyphen
        }

        # Ellipsis
        self.ellipsis_map = {
            '…': '...',  # Unicode ellipsis to ASCII
        }

    def normalize_quotes(self, text: str) -> str:
        """Convert all quote variants to ASCII."""
        for fancy, plain in self.quote_map.items():
            text = text.replace(fancy, plain)
        return text

    def normalize_dashes(self, text: str) -> str:
        """Convert all dash variants to ASCII hyphen."""
        for fancy, plain in self.dash_map.items():
            text = text.replace(fancy, plain)
        return text

    def normalize_ellipsis(self, text: str) -> str:
        """Normalize ellipsis to three periods."""
        text = text.replace('…', '...')
        # Collapse runs of four or more periods to three
        text = re.sub(r'\.{4,}', '...', text)
        return text

    def collapse_repeated_punct(self, text: str, max_repeat: int = 1) -> str:
        """
        Collapse repeated punctuation marks.
        '!!!' -> '!' if max_repeat=1
        '!!!' -> '!!' if max_repeat=2
        """
        punct_chars = r'[!?.,;:]'
        pattern = rf'({punct_chars})\1{{{max_repeat},}}'
        return re.sub(pattern, r'\1' * max_repeat, text)

    def mark_repeated_punct(self, text: str) -> str:
        """
        Mark repeated punctuation with tokens instead of collapsing.
        'Great!!!' -> 'Great! _REPEAT_PUNCT_'
        """
        def replacer(match):
            char = match.group(1)
            count = len(match.group(0))
            if count > 1:
                return char + ' _REPEAT_PUNCT_ '
            return char

        return re.sub(r'([!?.])(\1+)', replacer, text)

    def normalize_spacing(self, text: str) -> str:
        """Normalize spacing around punctuation."""
        # Space after terminal punctuation
        text = re.sub(r'([.!?])([A-Z])', r'\1 \2', text)
        # No space before punctuation
        text = re.sub(r'\s+([.,;:!?])', r'\1', text)
        # Single space after commas and other marks
        text = re.sub(r'([.,;:!?])\s*', r'\1 ', text)
        # Collapse multiple spaces
        text = re.sub(r' +', ' ', text)
        return text.strip()

    def normalize_all(self, text: str, collapse_repeat: bool = True) -> str:
        """Apply all normalizations."""
        text = self.normalize_quotes(text)
        text = self.normalize_dashes(text)
        text = self.normalize_ellipsis(text)
        if collapse_repeat:
            text = self.collapse_repeated_punct(text)
        text = self.normalize_spacing(text)
        return text


# Demonstration
normalizer = PunctuationNormalizer()

test_cases = [
    # Curly quotes
    '“Hello,” she said, ‘how are you?’',
    # Multiple exclamation
    "This is AMAZING!!! I love it!!!",
    # Various dashes
    "Machine learning — a broad field — is fascinating.",
    # Ellipsis
    "I don't know… maybe……",
    # Spacing issues
    "Hello.How are you?I'm fine,thanks.",
    # Mixed issues
    '“Wow!!!” said John—he couldn’t believe it…',
]

print("=== Punctuation Normalization ===\n")

for text in test_cases:
    normalized = normalizer.normalize_all(text)
    print(f"Original:   {text}")
    print(f"Normalized: {normalized}")
    print()

# Repeated punctuation as feature
print("=== Repeated Punctuation as Feature ===\n")

feature_texts = [
    "Great product!!!",
    "Really???",
    "Okay.",
]

for text in feature_texts:
    marked = normalizer.mark_repeated_punct(text)
    print(f"Original: {text}")
    print(f"Marked:   {marked}")
    print()
```

Modern tokenizers handle punctuation differently, and understanding these behaviors helps you make appropriate preprocessing decisions.
Whitespace Tokenizers:
Split on whitespace only, leaving punctuation attached to words ("Hello," includes comma).
Word Tokenizers (NLTK, spaCy):
Separate punctuation as distinct tokens ("Hello" and "," are separate).
Subword Tokenizers (BPE, WordPiece):
May include punctuation marks in the vocabulary or handle them as part of subword units.
```python
"""How different tokenizers handle punctuation."""
import re
from nltk.tokenize import word_tokenize, RegexpTokenizer
import spacy

# Load spaCy
nlp = spacy.load("en_core_web_sm")

text = "Hello, World! How's it going? I'd say 99% of this works... mostly."

print("=== Tokenizer Punctuation Handling ===\n")
print(f"Input: {text}\n")

# Whitespace tokenizer
whitespace_tokens = text.split()
print("Whitespace split:")
print(f"  {whitespace_tokens}")
print("  Note: Punctuation attached ('Hello,' 'going?')\n")

# NLTK word_tokenize
nltk_tokens = word_tokenize(text)
print("NLTK word_tokenize:")
print(f"  {nltk_tokens}")
print("  Note: Punctuation separated, contractions split\n")

# spaCy tokenizer
doc = nlp(text)
spacy_tokens = [token.text for token in doc]
print("spaCy tokenizer:")
print(f"  {spacy_tokens}")
print("  Note: Similar to NLTK, punctuation as tokens\n")

# spaCy with token attributes
print("spaCy token details:")
for token in doc:
    if token.is_punct:
        print(f"  '{token.text}' - is_punct=True")

# NLTK RegexpTokenizer (customizable)
print("\nCustom regex tokenizers:")

# Words only (no punctuation)
word_only = RegexpTokenizer(r'\w+')
print(f"  Words only:  {word_only.tokenize(text)}")

# Words and punctuation separately
word_punct = RegexpTokenizer(r"\w+|[^\w\s]")
print(f"  Words+punct: {word_punct.tokenize(text)}")

# === Handling in sklearn ===

print("\n=== Scikit-learn Vectorizers ===\n")

from sklearn.feature_extraction.text import CountVectorizer

texts = [
    "Hello, World!",
    "Hello World",
    "Machine learning is great!!!",
]

# Default behavior (removes punctuation via token pattern)
vec_default = CountVectorizer()
vec_default.fit(texts)
print("Default tokenizer vocabulary:")
print(f"  {vec_default.get_feature_names_out()}")
print("  Note: Punctuation removed by default\n")

# Custom token pattern to include punctuation
vec_with_punct = CountVectorizer(token_pattern=r"\b\w+\b|[^\w\s]")
vec_with_punct.fit(texts)
print("Custom pattern (keep punct) vocabulary:")
print(f"  {vec_with_punct.get_feature_names_out()}")

# === Transformer Tokenizers ===

print("\n=== Transformer Tokenizers ===\n")

from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
gpt2_tokenizer = AutoTokenizer.from_pretrained("gpt2")

test = "Hello! How's it going???"

print(f"Input: {test}\n")

bert_tokens = bert_tokenizer.tokenize(test)
gpt2_tokens = gpt2_tokenizer.tokenize(test)

print(f"BERT tokens:  {bert_tokens}")
print(f"GPT-2 tokens: {gpt2_tokens}")
print("\nNote: Modern transformers handle punctuation natively.")
print("      Pre-processing should generally NOT remove punctuation.")
```

BERT, GPT, and other transformer models were pre-trained on text with punctuation. Their tokenizers handle punctuation appropriately. Removing punctuation before passing text to these models creates a distribution mismatch and typically hurts performance.
A production-ready punctuation handling pipeline combines multiple strategies based on task requirements.
```python
"""Production-ready punctuation handling pipeline."""
import re
import string
from typing import List, Dict, Set, Optional, Callable
from dataclasses import dataclass
from enum import Enum


class PunctuationStrategy(Enum):
    REMOVE_ALL = "remove_all"
    PRESERVE_ALL = "preserve_all"
    NORMALIZE = "normalize"
    SELECTIVE = "selective"
    SMART = "smart"


@dataclass
class PunctuationConfig:
    """Configuration for punctuation handling."""
    strategy: PunctuationStrategy = PunctuationStrategy.SMART
    preserve_marks: Set[str] = None
    remove_marks: Set[str] = None
    collapse_repeated: bool = True
    max_repeat: int = 1
    normalize_unicode: bool = True
    protect_urls: bool = True
    protect_emails: bool = True
    protect_hashtags: bool = True
    protect_mentions: bool = True
    protect_emoticons: bool = True
    emoticon_tokens: bool = False  # Convert to _POS_/_NEG_ tokens


class ProductionPunctuationHandler:
    """
    Production-ready punctuation handler with configurable behavior.
    """

    def __init__(self, config: PunctuationConfig = None):
        self.config = config or PunctuationConfig()

        # Compile patterns
        self.patterns = {
            'url': re.compile(r'https?://[^\s]+|www\.[^\s]+', re.I),
            'email': re.compile(r'[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'),
            'hashtag': re.compile(r'#\w+'),
            'mention': re.compile(r'@\w+'),
            'emoticon_pos': re.compile(r'[:;][\-~]?[\)D\]>]|<3|\^_\^'),
            'emoticon_neg': re.compile(r'[:;][\-~]?[\(\[<]|</3|-_-'),
        }

        # Unicode normalization
        self.unicode_map = str.maketrans({
            '“': '"', '”': '"',
            '‘': "'", '’': "'",
            '—': '-', '–': '-',
            '…': '...',
        })

    def _protect_patterns(self, text: str) -> tuple:
        """Protect special patterns with placeholders."""
        placeholders = {}
        counter = 0

        protect_list = []
        if self.config.protect_urls:
            protect_list.append('url')
        if self.config.protect_emails:
            protect_list.append('email')
        if self.config.protect_hashtags:
            protect_list.append('hashtag')
        if self.config.protect_mentions:
            protect_list.append('mention')
        if self.config.protect_emoticons:
            protect_list.extend(['emoticon_pos', 'emoticon_neg'])

        for pattern_name in protect_list:
            pattern = self.patterns.get(pattern_name)
            if pattern:
                def _replace(match, name=pattern_name):
                    nonlocal counter
                    # Placeholder contains no punctuation, so it survives removal
                    placeholder = f"PROTECTEDTOKEN{counter}X"
                    placeholders[placeholder] = (match.group(), name)
                    counter += 1
                    return placeholder

                text = pattern.sub(_replace, text)

        return text, placeholders

    def _restore_patterns(self, text: str, placeholders: dict) -> str:
        """Restore protected patterns."""
        for placeholder, (original, pattern_name) in placeholders.items():
            if self.config.emoticon_tokens and 'emoticon' in pattern_name:
                if 'pos' in pattern_name:
                    replacement = ' _EMOTICON_POS_ '
                else:
                    replacement = ' _EMOTICON_NEG_ '
                text = text.replace(placeholder, replacement)
            else:
                text = text.replace(placeholder, original)
        return text

    def _remove_punctuation(self, text: str) -> str:
        """Remove punctuation based on config."""
        if self.config.preserve_marks:
            to_remove = set(string.punctuation) - self.config.preserve_marks
        elif self.config.remove_marks:
            to_remove = self.config.remove_marks
        else:
            to_remove = set(string.punctuation)

        for char in to_remove:
            text = text.replace(char, ' ')

        return re.sub(r'\s+', ' ', text).strip()

    def _normalize_punctuation(self, text: str) -> str:
        """Normalize punctuation."""
        # Unicode normalization
        if self.config.normalize_unicode:
            text = text.translate(self.unicode_map)

        # Collapse repeated marks
        if self.config.collapse_repeated:
            for char in '!?.':
                pattern = f'\\{char}' if char in '.?' else char
                text = re.sub(f'{pattern}{{2,}}',
                              char * self.config.max_repeat, text)

        # Normalize spacing
        text = re.sub(r'\s+', ' ', text)
        return text.strip()

    def process(self, text: str) -> str:
        """Process text according to configuration."""
        # Protect special patterns
        text, placeholders = self._protect_patterns(text)

        # Apply strategy
        if self.config.strategy == PunctuationStrategy.REMOVE_ALL:
            text = self._remove_punctuation(text)
        elif self.config.strategy == PunctuationStrategy.NORMALIZE:
            text = self._normalize_punctuation(text)
        elif self.config.strategy == PunctuationStrategy.SELECTIVE:
            text = self._remove_punctuation(text)
            text = self._normalize_punctuation(text)
        elif self.config.strategy == PunctuationStrategy.SMART:
            text = self._normalize_punctuation(text)
        # elif PRESERVE_ALL: do nothing

        # Restore protected patterns
        text = self._restore_patterns(text, placeholders)

        return text

    def batch_process(self, texts: List[str]) -> List[str]:
        """Process multiple texts."""
        return [self.process(text) for text in texts]


# === Usage Examples ===

print("=== Production Punctuation Handler ===\n")

# Different configurations for different tasks
configs = {
    'bag_of_words': PunctuationConfig(
        strategy=PunctuationStrategy.REMOVE_ALL,
        protect_urls=False,
        protect_emails=False,
    ),
    'sentiment_analysis': PunctuationConfig(
        strategy=PunctuationStrategy.SELECTIVE,
        preserve_marks={'!', '?'},
        protect_emoticons=True,
        emoticon_tokens=True,
        collapse_repeated=True,
        max_repeat=1,
    ),
    'transformer_prep': PunctuationConfig(
        strategy=PunctuationStrategy.NORMALIZE,
        normalize_unicode=True,
        collapse_repeated=True,
        max_repeat=2,
    ),
}

test_text = "OMG!!! Check https://example.com :) I can't believe it??? @john #amazing"

print(f"Input: {test_text}\n")

for config_name, config in configs.items():
    handler = ProductionPunctuationHandler(config)
    result = handler.process(test_text)
    print(f"{config_name}:")
    print(f"  {result}\n")
```

Punctuation handling requires thoughtful consideration of your NLP task. The right approach varies from complete removal to careful preservation and normalization.
Module Complete:
With this page, you've completed the Text Preprocessing module. You now have a comprehensive understanding of the essential preprocessing steps: tokenization, stop-word removal, stemming, lemmatization, case normalization, and punctuation handling. These techniques form the foundation for all text-based machine learning applications.
You now understand punctuation handling from basic removal through production-ready pipelines. You can make informed decisions about when to remove, preserve, or normalize punctuation, handle special cases appropriately, and integrate punctuation processing into your NLP workflows.