Consider an e-commerce platform serving customers in Japan, Germany, and Brazil. A Japanese customer searches for "パソコン" (personal computer), a German customer searches for "Notebook-Computer," and a Brazilian customer searches for "computador portátil." All three are looking for essentially the same product, but the search system must understand each query on its own linguistic terms: different scripts, different tokenization rules, different morphology.

This is the language handling challenge.
A search system that works beautifully for English may completely fail for other languages. Tokenization rules break down. Stemming algorithms produce garbage. Character matching fails on accents. Building truly multilingual search requires understanding the linguistic characteristics of each language and applying appropriate processing.
By the end of this page, you will understand how to build search systems that work across multiple languages. You'll learn about character encoding, script-specific processing, language detection, CJK (Chinese/Japanese/Korean) handling, and strategies for mixed-language content.
Before any linguistic processing, search systems must correctly interpret the bytes that represent text. Character encoding maps byte sequences to characters, and getting this wrong corrupts everything downstream.
The evolution of character encoding:
| Encoding | Year | Characters | Bytes per Char | Usage |
|---|---|---|---|---|
| ASCII | 1963 | 128 (English only) | 1 | Legacy systems, basic protocols |
| ISO-8859-1 (Latin-1) | 1987 | 256 (Western European) | 1 | Legacy web pages, some databases |
| Windows-1252 | 1985 | 256 (Western European) | 1 | Windows systems, MS Office |
| UTF-8 | 1993 | 1,112,064 (Unicode) | 1-4 | Modern web, APIs, databases |
| UTF-16 | 1996 | 1,112,064 (Unicode) | 2-4 | Windows internals, Java, JavaScript |
| UTF-32 | 2003 | 1,112,064 (Unicode) | 4 | Processing where fixed width helps |
UTF-8 is the standard for search systems. It's backward-compatible with ASCII, compact for Latin-based text, and can represent every Unicode character. All modern search engines (Elasticsearch, Solr, Algolia) expect UTF-8 encoded input.
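The variable-width nature of UTF-8 is easy to see directly. A minimal sketch, assuming a Node.js runtime for the global `Buffer`:

```typescript
// UTF-8 uses 1-4 bytes per character depending on the code point.
const samples: Array<[string, string]> = [
  ["A", "ASCII letter"],          // 1 byte
  ["é", "accented Latin"],        // 2 bytes
  ["東", "CJK ideograph"],        // 3 bytes
  ["😀", "emoji (astral plane)"], // 4 bytes
];

for (const [ch, label] of samples) {
  console.log(`${label}: ${Buffer.byteLength(ch, "utf8")} byte(s)`);
}
```

Because Latin text stays compact while every script remains representable, UTF-8 is almost always the right choice for an index.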
```typescript
// Common encoding issues in search systems

// Issue 1: Mixed encodings in source data
// A document from a legacy system might have:
const mixedEncodingDocument = {
  title: "Café Résumé", // UTF-8
  // But the same bytes read as Latin-1 would render as:
  // "CafÃ© RÃ©sumÃ©" - the classic "mojibake" corruption
};

// Detection and normalization:
function normalizeEncoding(rawBytes: Buffer): string {
  // Libraries like 'chardet' can detect encoding
  // const detected = chardet.detect(rawBytes);
  // Always convert to UTF-8 for indexing
  // return iconv.decode(rawBytes, detected.encoding);
  return rawBytes.toString('utf-8');
}

// Issue 2: Unicode normalization forms
// The same visual character can have multiple byte representations:
const sameCharacter = {
  // "é" as a single character (composed form, NFC)
  composed: "é", // U+00E9 (1 codepoint)
  // "é" as base + combining accent (decomposed form, NFD)
  decomposed: "e\u0301", // U+0065 + U+0301 (2 codepoints)
};

// These look identical but won't match in a byte comparison!
console.log(sameCharacter.composed === sameCharacter.decomposed); // false

// Solution: Apply Unicode Normalization Form C (NFC) before indexing
const normalizedText = text.normalize('NFC');

// Elasticsearch ICU normalization filter:
const icuNormalizerConfig = {
  "filter": {
    "icu_normalizer": {
      "type": "icu_normalizer",
      "name": "nfc" // Canonical composition
    }
  }
};

// Issue 3: Zero-width characters and invisible content
const invisibleCharacters = {
  zeroWidthSpace: "\u200B", // Often copied from web pages
  zeroWidthNonJoiner: "\u200C",
  zeroWidthJoiner: "\u200D",
  bom: "\uFEFF", // Byte Order Mark
  softHyphen: "\u00AD"
};

// These can break exact matches without visible difference
// "hello" !== "hello\u200B" even though they look the same

// Solution: Strip invisible characters during analysis
function cleanInvisible(text: string): string {
  return text.replace(/[\u200B-\u200D\uFEFF\u00AD]/g, '');
}
```

Apply Unicode normalization (NFC) and invisible character stripping at the earliest stage of your pipeline, before tokenization. Encoding issues that reach the index are extremely difficult to debug later. When in doubt, use the ICU (International Components for Unicode) plugins available for most search engines.
To apply language-specific processing, you first need to know what language you're dealing with. Language detection is surprisingly challenging, especially for short text, mixed-language content, or closely related languages.
Language detection approaches:
```typescript
// Language detection strategies and challenges

interface DetectionResult {
  language: string;
  confidence: number;
  isReliable: boolean;
}

// Library-based detection (using fastText or langdetect)
async function detectLanguage(text: string): Promise<DetectionResult> {
  // Pseudocode for fastText-based detection
  // const result = await fasttext.predict(text, 1);
  // return {
  //   language: result[0].label.replace('__label__', ''),
  //   confidence: result[0].probability,
  //   isReliable: result[0].probability > 0.7
  // };
  return { language: "en", confidence: 0.95, isReliable: true };
}

// Challenge 1: Short text
const shortTextChallenges = [
  { text: "OK", possibleLanguages: ["en", "de", "fr", "es", "..."] }, // Impossible to determine
  { text: "Paris", possibleLanguages: ["en", "fr", "de"] }, // City name in many languages
  { text: "Hello", mostLikely: "en", confidence: 0.6 }, // Low confidence
];

// Challenge 2: Mixed language content
const mixedContent = {
  text: "The sushi was très délicieux!", // English + French (plus a Japanese loan word)
  detectedLanguages: [
    { language: "en", segments: ["The", "was"] },
    { language: "fr", segments: ["très délicieux"] },
    { language: "ja", segments: ["sushi"] } // Loan word
  ]
};

// Challenge 3: Similar languages
const confusingPairs = [
  ["Norwegian", "Danish"],   // Extremely similar written forms
  ["Serbian", "Croatian"],   // Same language, different scripts
  ["Malay", "Indonesian"],   // ~80% lexical similarity
  ["Spanish", "Portuguese"], // Many shared words
];

// Best practice: Fallback chain
interface LanguageDetectionConfig {
  primaryMethod: string;
  fallbackLanguage: string;
  minConfidence: number;
  useMetadataHints: boolean;
}

// Small helper: "en-US" → "en"
function extractLanguage(locale: string): string {
  return locale.split('-')[0];
}

async function detectWithFallback(
  text: string,
  metadata: { htmlLang?: string; acceptLanguage?: string; userLocale?: string }
): Promise<string> {
  // 1. Check explicit language tag
  if (metadata.htmlLang) {
    return metadata.htmlLang;
  }

  // 2. Try automatic detection (requires sufficient text)
  if (text.length > 50) {
    const detection = await detectLanguage(text);
    if (detection.isReliable) {
      return detection.language;
    }
  }

  // 3. Fall back to user locale
  if (metadata.userLocale) {
    return extractLanguage(metadata.userLocale); // "en-US" → "en"
  }

  // 4. Final fallback
  return "en";
}

// Strategy for search: Index multiple language-analyzed versions
// Query strategy: Detect language and route to appropriate index/field
```

For short queries where detection is unreliable, consider querying multiple language-specific fields simultaneously and combining results. A search for "café" can match against English, French, and Spanish fields, with the best matches rising to the top regardless of actual language.
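A sketch of what that multi-field fallback could look like as an Elasticsearch query body (the field names are illustrative):

```typescript
// Query several language-analyzed fields at once; the best match wins
// regardless of which language the user actually typed in.
const shortQueryFallback = {
  query: {
    multi_match: {
      query: "café",
      fields: ["content.en", "content.fr", "content.es"],
      type: "most_fields" // Sum scores across fields that match
    }
  }
};

console.log(JSON.stringify(shortQueryFallback, null, 2));
```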
Chinese, Japanese, and Korean (CJK) present unique challenges because they don't use spaces between words. A sentence like "東京都港区" (Tokyo, Minato Ward) appears as one continuous string of characters. Without proper segmentation, a standard tokenizer produces useless single-character tokens.
CJK-specific challenges:
| Language | Scripts | Word Segmentation | Special Challenges |
|---|---|---|---|
| Chinese (Simplified/Traditional) | Han characters | Dictionary-based; no spaces | Simplified/Traditional variants; no verb conjugation |
| Japanese | Hiragana, Katakana, Kanji, Latin | Complex; mixed scripts | Multiple readings for Kanji; script mixing |
| Korean | Hangul (Jamo), occasional Hanja | Spaces exist but optional | Complex morphology; particle handling |
```typescript
// CJK tokenization strategies

// Problem: "東京都港区赤坂" (Tokyo, Minato Ward, Akasaka)
// Standard tokenizer: ["東", "京", "都", "港", "区", "赤", "坂"] - Useless!
// Dictionary segmenter: ["東京都", "港区", "赤坂"] - Meaningful words!

// =============== JAPANESE ===============
// Recommended: Kuromoji tokenizer (included in Elasticsearch)

const japaneseAnalyzerConfig = {
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "search", // "normal", "search", or "extended"
          // search mode: over-segments to improve recall
          // For: 東京都 → [東京都, 東京, 都] (compound and parts)
          "discard_punctuation": true
        }
      },
      "filter": {
        // Convert to reading for matching variations
        "kuromoji_readingform": {
          "type": "kuromoji_readingform",
          "use_romaji": false // Use katakana reading
        },
        // Normalize full-width/half-width
        "cjk_width": {
          "type": "cjk_width"
          // Folds full-width ASCII to half-width
          // Folds half-width Katakana to full-width
        },
        // Remove Japanese stopwords
        "ja_stop": {
          "type": "ja_stop"
        }
      },
      "analyzer": {
        "japanese_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "cjk_width",
            "lowercase",
            "kuromoji_readingform",
            "ja_stop"
          ]
        }
      }
    }
  }
};

// =============== CHINESE ===============
// Recommended: SmartCN or ICU tokenizer

const chineseAnalyzerConfig = {
  "settings": {
    "analysis": {
      "analyzer": {
        "smartcn_analyzer": {
          "type": "smartcn" // Built-in Chinese analyzer
          // Handles segmentation with a hidden Markov model
        },
        // Alternative: ICU tokenizer for multiple languages
        "icu_chinese": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
};

// =============== KOREAN ===============
// Recommended: Nori analyzer (official Korean plugin)

const koreanAnalyzerConfig = {
  "settings": {
    "analysis": {
      "tokenizer": {
        "nori_tokenizer": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed" // "none", "discard", "mixed"
          // mixed: index both compound and parts
        }
      },
      "analyzer": {
        "korean_analyzer": {
          "type": "custom",
          "tokenizer": "nori_tokenizer",
          "filter": [
            "nori_readingform", // Hanja → Hangul
            "lowercase",
            "nori_part_of_speech" // Remove particles
          ]
        }
      }
    }
  }
};

// =============== CROSS-CJK BIGRAM FALLBACK ===============
// When dictionary-based segmentation isn't available:
// CJK bigrams provide reasonable matching

const cjkBigramConfig = {
  "analyzer": {
    "cjk_bigram_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["cjk_bigram", "lowercase"]
    }
  }
};

// "東京都" with bigrams: ["東京", "京都"]
// Less accurate but works for any CJK text without dictionaries
```

CJK bigrams (overlapping pairs of characters) provide a simple fallback when proper segmentation isn't available. They generate more tokens but capture word boundaries probabilistically. This is why "京都" (Kyoto) matches even when searching "東京都" (Tokyo), a trade-off between recall and precision.
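To make the bigram fallback concrete, here is a minimal sketch of generating overlapping character pairs, a simplified version of what a `cjk_bigram`-style filter does (it ignores script detection and unigram output options):

```typescript
// Generate overlapping character bigrams from a CJK string.
function cjkBigrams(text: string): string[] {
  const chars = Array.from(text); // code-point aware split
  if (chars.length < 2) return chars; // single character: emit as-is
  const grams: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams;
}

console.log(cjkBigrams("東京都")); // ["東京", "京都"]
```

Both "東京都" and "京都" produce the bigram "京都", which is exactly why a query for Kyoto can surface Tokyo documents under this scheme.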
European languages share the Latin alphabet but have significant variations that affect search. German compounds, French accents, Nordic characters, and Eastern European diacritics all require specific handling.
```typescript
// European language-specific processing

// =============== GERMAN ===============
// Challenge: Compound words can be arbitrarily long
// "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz"
// (beef labeling supervision duties delegation law)

const germanConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Decompose compounds for better matching
        "german_decompounder": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/german_dictionary.txt",
          "hyphenation_patterns_path": "analysis/de_DR.xml",
          "only_longest_match": true,
          "min_subword_size": 4
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        },
        // Handle umlauts: ä→ae, ö→oe, ü→ue
        "german_normalization": {
          "type": "german_normalization"
        }
      },
      "analyzer": {
        "german": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_decompounder",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
};

// "Handschuhe" (gloves) decomposes to ["hand", "schuhe"]
// Searches for "hand" will find glove products

// =============== FRENCH/SPANISH/PORTUGUESE ===============
// Challenge: Diacritics (accents) should match with or without

const romanceLanguageConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Elision: l'école → école
        "french_elision": {
          "type": "elision",
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c"]
        },
        // ASCII folding: café → cafe (but preserve original too)
        "asciifolding_preserve": {
          "type": "asciifolding",
          "preserve_original": true // Indexes both "café" AND "cafe"
        },
        "french_stemmer": { "type": "stemmer", "language": "light_french" },
        "spanish_stemmer": { "type": "stemmer", "language": "light_spanish" },
        "portuguese_stemmer": { "type": "stemmer", "language": "light_portuguese" }
      },
      "analyzer": {
        "french": {
          "tokenizer": "standard",
          "filter": ["french_elision", "lowercase", "asciifolding_preserve", "french_stemmer"]
        }
      }
    }
  }
};

// =============== NORDIC LANGUAGES ===============
// Challenge: Special characters (å, ä, ö, ø, æ)

const nordicConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // ICU folding handles Nordic characters properly
        "icu_folding_nordic": {
          "type": "icu_folding",
          // Optionally exclude specific characters from folding
          "unicodeSetFilter": "[^åäöøæ]" // Keep Nordic chars distinct
        },
        "swedish_stemmer": { "type": "stemmer", "language": "swedish" },
        "norwegian_stemmer": { "type": "stemmer", "language": "norwegian" }
      }
    }
  }
};

// =============== EASTERN EUROPEAN ===============
// Challenge: Rich diacritic systems (ą, ę, ć, ś, ź, ż, etc.)

const eastEuropeanConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Polish-specific normalization
        "polish_fold": {
          "type": "asciifolding",
          "preserve_original": true
        },
        // Stempel algorithm for Polish stemming
        "stempel_stemmer": {
          "type": "stemmer",
          "language": "polish"
        }
      }
    }
  }
};
```

When using ASCII folding for diacritics, enable `preserve_original: true`. This indexes both "café" and "cafe", allowing users who type either form to find matches. The slight index size increase is worth the improved user experience.
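The folding itself can be approximated in plain TypeScript by decomposing to NFD and stripping combining marks. This is a sketch of roughly what `asciifolding` does for accented Latin; the real filter handles many more cases:

```typescript
// Decompose, then strip combining diacritical marks (U+0300-U+036F).
// Note: ø and æ have no combining form, so this sketch does not fold them,
// which matches the Nordic advice above about keeping them distinct.
function foldDiacritics(s: string): string {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// With preserve_original semantics, both forms would be indexed:
function indexedTokens(token: string): string[] {
  const folded = foldDiacritics(token);
  return folded === token ? [token] : [token, folded];
}

console.log(indexedTokens("café")); // ["café", "cafe"]
```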
Right-to-left (RTL) languages like Arabic and Hebrew present unique challenges beyond text direction. Arabic has complex morphology, diacritics that may or may not be present, and letter forms that change based on position in a word.
| Feature | Arabic | Hebrew | Impact on Search |
|---|---|---|---|
| Root-based morphology | Extensive: k-t-b yields 100+ forms | Moderate: similar pattern | Stemming to root is essential |
| Diacritics (vowel marks) | Optional; formal text includes them | Usually omitted | Must match with or without |
| Prefix/suffix clitics | Common: والكتاب = و + ال + كتاب | Common: הבית = ה + בית | Decomposition needed |
| Letter forms | Initial, medial, final, isolated | Final forms (5 letters) | Unicode handles; normalize |
| Numerals | Eastern Arabic or Western | Hebrew numerals or Western | Consider normalizing |
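The numeral row can be illustrated with a small sketch that maps Eastern Arabic-Indic digits to Western digits (one of the normalizations the table suggests considering):

```typescript
// Eastern Arabic-Indic digits ٠-٩ occupy the contiguous range U+0660-U+0669,
// so each digit's value is its offset from U+0660.
function normalizeArabicDigits(s: string): string {
  return s.replace(/[\u0660-\u0669]/g,
    d => String(d.charCodeAt(0) - 0x0660));
}

console.log(normalizeArabicDigits("٢٠٢٤")); // "2024"
```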
```typescript
// Arabic and Hebrew search configuration

// =============== ARABIC ===============
const arabicConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Normalize Arabic text
        "arabic_normalization": {
          "type": "arabic_normalization"
          // Normalizes variations of Alef (أ، إ، آ → ا)
          // Removes tatweel (ـ)
          // Normalizes Yeh (ى → ي)
        },
        // Arabic stemmer (light stemming of prefixes/suffixes)
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        },
        // Remove diacritics (tashkeel/harakat)
        "arabic_diacritics": {
          "type": "pattern_replace",
          "pattern": "[\u064B-\u0652]", // Fatha, Damma, Kasra, etc.
          "replacement": ""
        }
      },
      "analyzer": {
        "arabic_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "arabic_diacritics", // Remove vowel marks
            "arabic_normalization",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
};

// Example: كتب، كاتب، مكتبة، كتاب all derive from the root ك-ت-ب
// "Writing", "writer", "library", "book" - all from the same root

// =============== HEBREW ===============
const hebrewConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Hebrew-specific handling
        "hebrew_normalization": {
          "type": "icu_normalizer",
          "name": "nfkc", // Normalize Hebrew presentation forms
          "mode": "compose"
        }
      },
      "analyzer": {
        "hebrew_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "hebrew_normalization"
            // Note: Hebrew stemming is complex; consider the HebMorph plugin
          ]
        }
      }
    }
  }
};

// For production Hebrew search, consider:
// - HebMorph (https://github.com/synhershko/HebMorph)
// - Commercial Hebrew NLP solutions
// Standard stemmers work poorly for Hebrew morphology

// =============== MIXED RTL/LTR CONTENT ===============
// When documents contain both Arabic and English:

const mixedRtlConfig = {
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "arabic_analyzer",
        "fields": {
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
};

// Query both fields with appropriate boosting
const mixedQuery = {
  "query": {
    "multi_match": {
      "query": "محرك بحث search engine",
      "fields": ["content^2", "content.english"]
    }
  }
};
```

Arabic words derive from 3-4 letter roots through complex patterns. Light stemming (removing prefixes/suffixes) works for basic matching, but true root extraction requires morphological analysis. For high-quality Arabic search, consider specialized tools like CAMeL Tools or MADAMIRA.
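The light-stemming idea can be shown with a toy sketch: strip one common prefix while keeping a minimum stem length. This is not a production stemmer, and the prefix list is deliberately abbreviated:

```typescript
// Strip one common Arabic prefix, keeping at least 3 letters of stem.
// Longest prefixes first so "وال" wins over "و".
const arabicPrefixes = ["وال", "بال", "كال", "فال", "ال", "و"];

function lightStem(word: string): string {
  for (const p of arabicPrefixes) {
    if (word.startsWith(p) && word.length - p.length >= 3) {
      return word.slice(p.length);
    }
  }
  return word;
}

console.log(lightStem("والكتاب")); // "كتاب" (conjunction و + article ال stripped)
```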
Real-world content often mixes languages: product names in English on French websites, technical terms in English within Chinese documents, or user reviews that switch languages mid-sentence. Handling this requires strategic decisions about analysis and querying.
```typescript
// Strategies for mixed-language content

// =============== STRATEGY 1: Multi-field indexing ===============
// Index the same content with multiple language analyzers

const multiFieldConfig = {
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard", // Generic
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "fr": { "type": "text", "analyzer": "french" },
          "de": { "type": "text", "analyzer": "german" },
          "ja": { "type": "text", "analyzer": "japanese_analyzer" }
        }
      }
    }
  }
};

// Query all language fields:
const multiLangQuery = {
  "query": {
    "multi_match": {
      "query": "ordinateur portable",
      "fields": ["content", "content.en", "content.fr", "content.de"],
      "type": "best_fields"
    }
  }
};

// Pros: Broad coverage, no language detection needed at query time
// Cons: Index size multiplied by number of languages

// =============== STRATEGY 2: Language-per-field ===============
// Detect language at indexing time, store in appropriate field

interface MultiLangDocument {
  id: string;
  content_en?: string;
  content_fr?: string;
  content_de?: string;
  content_ja?: string;
  detected_language: string;
}

async function indexMultiLang(doc: { id: string; content: string }) {
  const lang = await detectLanguage(doc.content);
  return {
    id: doc.id,
    [`content_${lang}`]: doc.content,
    detected_language: lang
  };
}

// Pros: More storage efficient, precise routing
// Cons: Detection errors propagate; language must be detected at query time

// =============== STRATEGY 3: Paragraph-level detection ===============
// For long documents, detect and index each paragraph separately

async function indexParagraphs(docId: string, content: string) {
  const paragraphs = content.split('\n\n'); // Split on blank lines, not spaces
  const langSegments: Array<{ text: string; lang: string; position: number }> = [];

  for (let i = 0; i < paragraphs.length; i++) {
    const lang = await detectLanguage(paragraphs[i]);
    langSegments.push({ text: paragraphs[i], lang, position: i });
  }

  return { id: docId, segments: langSegments };
}

// =============== STRATEGY 4: Universal analyzer ===============
// Use a single analyzer that works "well enough" for multiple languages

const universalAnalyzerConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // ICU folding handles diacritics across languages
        "icu_fold": {
          "type": "icu_folding"
        }
      },
      "tokenizer": {
        "icu_tokenizer": {
          "type": "icu_tokenizer" // Smart word segmentation
        }
      },
      "analyzer": {
        "universal": {
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase", "icu_fold"]
          // Note: No stemmer - stemmers are language-specific
        }
      }
    }
  }
};

// Pros: Simple, single index, works across languages
// Cons: No stemming, less precise than language-specific analyzers

// =============== STRATEGY 5: English core, others supplemental ===============
// Common pattern: English is primary, detect/handle only when non-English

async function intelligentIndex(content: string, hints: { locale?: string }) {
  // Default to English
  let primaryLang = 'en';

  // Check if content is non-English
  if (hints.locale && hints.locale !== 'en') {
    primaryLang = hints.locale;
  } else if (content.length > 100) {
    const detected = await detectLanguage(content);
    if (detected.confidence > 0.8 && detected.language !== 'en') {
      primaryLang = detected.language;
    }
  }

  return { content, language: primaryLang };
}
```

Most applications have a primary language representing 80%+ of content. Optimize heavily for that language, then ensure other languages are at least searchable. Don't over-engineer for edge cases before validating that multilingual search quality matters to your users.
Users often search for non-Latin content using Latin characters. A Japanese user might search "Tokyo" instead of "東京", or "sushi" instead of "寿司". Transliteration converts one script to another, typically to Latin/Roman characters. Supporting this dramatically improves search accessibility.
```typescript
// Transliteration strategies for search

// =============== ICU TRANSFORM (General Purpose) ===============
// ICU provides script transformation across many languages

const icuTransliterationConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Any script to Latin transliteration
        "any_to_latin": {
          "type": "icu_transform",
          "id": "Any-Latin"
          // 東京 → Dōngjīng (Mandarin pronunciation)
          // Note: May not match user expectations
        },
        // Specific transformation chains
        "japanese_to_romaji": {
          "type": "icu_transform",
          "id": "Katakana-Hiragana; Hiragana-Latin"
          // コンピュータ → konpyuuta
        },
        // Cyrillic to Latin
        "cyrillic_to_latin": {
          "type": "icu_transform",
          "id": "Cyrillic-Latin"
          // Москва → Moskva
        },
        // Greek to Latin
        "greek_to_latin": {
          "type": "icu_transform",
          "id": "Greek-Latin"
          // Αθήνα → Athína
        }
      },
      "analyzer": {
        "transliterated_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "any_to_latin",
            "asciifolding" // Remove diacritics from result
          ]
        }
      }
    }
  }
};

// =============== JAPANESE-SPECIFIC ROMANIZATION ===============
// Japanese has multiple romanization systems (Hepburn, Kunrei, etc.)

const japaneseRomajiConfig = {
  "settings": {
    "analysis": {
      "filter": {
        "ja_romaji": {
          "type": "kuromoji_readingform",
          "use_romaji": true
          // 東京 → tokyo (from the Kanji reading)
          // コンピュータ → konpyuuta
        }
      },
      "analyzer": {
        "japanese_romaji": {
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_baseform",
            "ja_romaji",
            "lowercase"
          ]
        }
      }
    }
  },
  // Index both original and romanized
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "japanese_analyzer", // Original
        "fields": {
          "romaji": {
            "type": "text",
            "analyzer": "japanese_romaji" // Romanized
          }
        }
      }
    }
  }
};

// Query example: User types "tokyo"
// Matches documents containing "東京" through the romaji field
const romajiQuery = {
  "query": {
    "multi_match": {
      "query": "tokyo",
      "fields": ["title", "title.romaji"]
    }
  }
};

// =============== PHONETIC MATCHING ===============
// For name search where spelling varies

const phoneticConfig = {
  "settings": {
    "analysis": {
      "filter": {
        "phonetic": {
          "type": "phonetic",
          "encoder": "double_metaphone", // or "soundex", "caverphone2"
          "replace": false // Keep original token too
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "phonetic"]
        }
      }
    }
  }
};

// "Smith", "Smyth", "Schmidt" all encode to similar phonetic codes
// Helps with international name variations
```

For the best user experience, index both the original script and the romanized form in separate fields. This allows native speakers to search in their script while giving international users the ability to search using Latin characters. Query both fields with appropriate boosting.
Building a production multilingual search system requires decisions at multiple levels: index structure, analyzer configuration, query routing, and result ranking. Here's a recommended architecture for a globally deployed search system.
```typescript
// Production multilingual search architecture

// =============== OPTION A: Single Index, Multiple Fields ===============
// Best for: Limited languages, mixed-language content

const singleIndexApproach = {
  indexSettings: {
    "settings": {
      "analysis": {
        "analyzer": {
          "en": { /* English config */ },
          "de": { /* German config */ },
          "fr": { /* French config */ },
          "ja": { /* Japanese config */ },
          "universal": { /* ICU-based fallback */ }
        }
      }
    },
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "universal",
          "fields": {
            "en": { "type": "text", "analyzer": "en" },
            "de": { "type": "text", "analyzer": "de" },
            "fr": { "type": "text", "analyzer": "fr" },
            "ja": { "type": "text", "analyzer": "ja" }
          }
        },
        "language": { "type": "keyword" }
      }
    }
  },
  pros: [
    "Single index to manage",
    "Cross-language search easy",
    "Schema changes apply everywhere"
  ],
  cons: [
    "Index size grows with each language",
    "All analyzers must be defined upfront",
    "Hard to optimize per-language"
  ]
};

// =============== OPTION B: Index Per Language/Region ===============
// Best for: Many languages, language-specific ranking, geo-distribution

const multiIndexApproach = {
  indices: {
    "products-en": { /* English analyzer */ },
    "products-de": { /* German analyzer */ },
    "products-fr": { /* French analyzer */ },
    "products-ja": { /* Japanese analyzer */ },
    "products-default": { /* Universal fallback */ }
  },
  // Alias for unified querying
  alias: {
    name: "products-all",
    indices: ["products-en", "products-de", "products-fr", "products-ja"]
  },
  // Routing logic
  searchRouter: async (query: string, userLocale?: string) => {
    // Determine target indices
    let targetIndices: string[];

    if (userLocale) {
      // User has explicit preference
      const primaryIndex = `products-${userLocale}`;
      targetIndices = [primaryIndex, "products-en"]; // Fallback to English
    } else {
      // Detect from query
      const detected = await detectLanguage(query);
      if (detected.isReliable) {
        targetIndices = [`products-${detected.language}`, "products-en"];
      } else {
        // Query all
        targetIndices = ["products-all"];
      }
    }
    return targetIndices;
  },
  pros: [
    "Optimized index per language",
    "Can geo-replicate specific languages",
    "Independent scaling and management",
    "Easy to add new languages"
  ],
  cons: [
    "More indices to manage",
    "Cross-language search requires alias",
    "Schema changes must sync across indices"
  ]
};

// =============== QUERY ARCHITECTURE ===============
// (detectFromQuery, extractLocaleLanguage, getIndexForLang, mergeResults,
// and client are application-level helpers assumed to exist)

interface SearchRequest {
  query: string;
  language?: string;  // Explicit language preference
  locale?: string;    // User locale
  fallback?: boolean; // Whether to fall back to other languages
}

async function multilingualSearch(request: SearchRequest) {
  // Step 1: Resolve target language
  const queryLang = request.language
    || await detectFromQuery(request.query)
    || extractLocaleLanguage(request.locale)
    || 'en';

  // Step 2: Build language-appropriate query
  const esQuery = buildLangQuery(request.query, queryLang);

  // Step 3: Execute search
  const primaryResults = await client.search({
    index: getIndexForLang(queryLang),
    body: esQuery
  });

  // Step 4: Optional fallback for low results
  if (request.fallback && primaryResults.hits.total.value < 5) {
    const fallbackResults = await client.search({
      index: "products-all",
      body: esQuery
    });
    return mergeResults(primaryResults, fallbackResults);
  }

  return primaryResults;
}

function buildLangQuery(query: string, lang: string) {
  // Language-specific query construction
  const contentField = `content.${lang}`;
  return {
    query: {
      bool: {
        should: [
          { match: { [contentField]: { query, boost: 2.0 } } },
          { match: { "content": query } } // Fallback to universal
        ]
      }
    }
  };
}
```

Language handling transforms search from a simple string matching problem into a complex linguistic challenge. Getting it right means understanding the unique characteristics of each language you support and building appropriate processing pipelines.
What's next:
With language handling covered, we now turn to the heart of search quality: relevance scoring. The next page explores TF-IDF, BM25, and modern approaches to ranking search results so that the most relevant documents appear at the top.
You now understand how to build search systems that work across multiple languages. From character encoding to script-specific tokenization to transliteration, you have the tools to serve users globally with appropriate linguistic processing.