Consider an e-commerce platform serving customers in Japan, Germany, and Brazil. A Japanese customer searches for "パソコン" (personal computer), a German customer searches for "Notebook-Computer," and a Brazilian customer searches for "computador portátil." All three are looking for essentially the same product, but the search system must understand each query on its own linguistic terms: different scripts, different tokenization rules, different morphology.

This is the language handling challenge.
A search system that works beautifully for English may completely fail for other languages. Tokenization rules break down. Stemming algorithms produce garbage. Character matching fails on accents. Building truly multilingual search requires understanding the linguistic characteristics of each language and applying appropriate processing.
By the end of this page, you will understand how to build search systems that work across multiple languages. You'll learn about character encoding, script-specific processing, language detection, CJK (Chinese/Japanese/Korean) handling, and strategies for mixed-language content.
Before any linguistic processing, search systems must correctly interpret the bytes that represent text. Character encoding maps byte sequences to characters, and getting this wrong corrupts everything downstream.
The evolution of character encoding:
| Encoding | Year | Characters | Bytes per Char | Usage |
|---|---|---|---|---|
| ASCII | 1963 | 128 (English only) | 1 | Legacy systems, basic protocols |
| ISO-8859-1 (Latin-1) | 1987 | 256 (Western European) | 1 | Legacy web pages, some databases |
| Windows-1252 | 1985 | 256 (Western European) | 1 | Windows systems, MS Office |
| UTF-8 | 1993 | 1,112,064 (Unicode) | 1-4 | Modern web, APIs, databases |
| UTF-16 | 1996 | 1,112,064 (Unicode) | 2-4 | Windows internals, Java, JavaScript |
| UTF-32 | 2003 | 1,112,064 (Unicode) | 4 | Processing where fixed width helps |
UTF-8 is the standard for search systems. It's backward-compatible with ASCII, compact for Latin-based text, and can represent every Unicode character. All modern search engines (Elasticsearch, Solr, Algolia) expect UTF-8 encoded input.
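The variable-width nature of UTF-8 is easy to see directly. A minimal sketch, assuming a Node.js runtime for the global `Buffer`:

```typescript
// UTF-8 uses 1-4 bytes per character depending on the code point.
const samples: Array<[string, string]> = [
  ["A", "ASCII letter"],          // 1 byte
  ["é", "accented Latin"],        // 2 bytes
  ["東", "CJK ideograph"],        // 3 bytes
  ["😀", "emoji (astral plane)"], // 4 bytes
];

for (const [ch, label] of samples) {
  console.log(`${label}: ${Buffer.byteLength(ch, "utf8")} byte(s)`);
}
```

Because Latin text stays compact while every script remains representable, UTF-8 is almost always the right choice for an index.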
```typescript
// Common encoding issues in search systems

// Issue 1: Mixed encodings in source data
// A document from a legacy system might have:
const mixedEncodingDocument = {
  title: "Café Résumé", // UTF-8
  // But the same bytes read as Latin-1 would render as:
  // "CafÃ© RÃ©sumÃ©" - the classic "mojibake" corruption
};

// Detection and normalization:
function normalizeEncoding(rawBytes: Buffer): string {
  // Libraries like 'chardet' can detect encoding
  // const detected = chardet.detect(rawBytes);
  // Always convert to UTF-8 for indexing
  // return iconv.decode(rawBytes, detected.encoding);
  return rawBytes.toString('utf-8');
}

// Issue 2: Unicode normalization forms
// The same visual character can have multiple byte representations:
const sameCharacter = {
  // "é" as a single character (composed form, NFC)
  composed: "é", // U+00E9 (1 codepoint)
  // "é" as base + combining accent (decomposed form, NFD)
  decomposed: "e\u0301", // U+0065 + U+0301 (2 codepoints)
};

// These look identical but won't match in a byte comparison!
console.log(sameCharacter.composed === sameCharacter.decomposed); // false

// Solution: Apply Unicode Normalization Form C (NFC) before indexing
const normalizedText = text.normalize('NFC');

// Elasticsearch ICU normalization filter:
const icuNormalizerConfig = {
  "filter": {
    "icu_normalizer": {
      "type": "icu_normalizer",
      "name": "nfc" // Canonical composition
    }
  }
};

// Issue 3: Zero-width characters and invisible content
const invisibleCharacters = {
  zeroWidthSpace: "\u200B", // Often copied from web pages
  zeroWidthNonJoiner: "\u200C",
  zeroWidthJoiner: "\u200D",
  bom: "\uFEFF", // Byte Order Mark
  softHyphen: "\u00AD"
};

// These can break exact matches without visible difference
// "hello" !== "hello\u200B" even though they look the same

// Solution: Strip invisible characters during analysis
function cleanInvisible(text: string): string {
  return text.replace(/[\u200B-\u200D\uFEFF\u00AD]/g, '');
}
```

Apply Unicode normalization (NFC) and invisible character stripping at the earliest stage of your pipeline, before tokenization. Encoding issues that reach the index are extremely difficult to debug later. When in doubt, use the ICU (International Components for Unicode) plugins available for most search engines.
To apply language-specific processing, you first need to know what language you're dealing with. Language detection is surprisingly challenging, especially for short text, mixed-language content, or closely related languages.
Language detection approaches:
```typescript
// Language detection strategies and challenges

interface DetectionResult {
  language: string;
  confidence: number;
  isReliable: boolean;
}

// Library-based detection (using fastText or langdetect)
async function detectLanguage(text: string): Promise<DetectionResult> {
  // Pseudocode for fastText-based detection
  // const result = await fasttext.predict(text, 1);
  // return {
  //   language: result[0].label.replace('__label__', ''),
  //   confidence: result[0].probability,
  //   isReliable: result[0].probability > 0.7
  // };
  return { language: "en", confidence: 0.95, isReliable: true };
}

// Challenge 1: Short text
const shortTextChallenges = [
  { text: "OK", possibleLanguages: ["en", "de", "fr", "es", "..."] }, // Impossible to determine
  { text: "Paris", possibleLanguages: ["en", "fr", "de"] }, // City name in many languages
  { text: "Hello", mostLikely: "en", confidence: 0.6 }, // Low confidence
];

// Challenge 2: Mixed language content
const mixedContent = {
  text: "The sushi was très délicieux!", // English + French (plus a Japanese loan word)
  detectedLanguages: [
    { language: "en", segments: ["The", "was"] },
    { language: "fr", segments: ["très délicieux"] },
    { language: "ja", segments: ["sushi"] } // Loan word
  ]
};

// Challenge 3: Similar languages
const confusingPairs = [
  ["Norwegian", "Danish"],   // Extremely similar written forms
  ["Serbian", "Croatian"],   // Same language, different scripts
  ["Malay", "Indonesian"],   // ~80% lexical similarity
  ["Spanish", "Portuguese"], // Many shared words
];

// Best practice: Fallback chain
interface LanguageDetectionConfig {
  primaryMethod: string;
  fallbackLanguage: string;
  minConfidence: number;
  useMetadataHints: boolean;
}

// Small helper: "en-US" → "en"
function extractLanguage(locale: string): string {
  return locale.split('-')[0];
}

async function detectWithFallback(
  text: string,
  metadata: { htmlLang?: string; acceptLanguage?: string; userLocale?: string }
): Promise<string> {
  // 1. Check explicit language tag
  if (metadata.htmlLang) {
    return metadata.htmlLang;
  }

  // 2. Try automatic detection (requires sufficient text)
  if (text.length > 50) {
    const detection = await detectLanguage(text);
    if (detection.isReliable) {
      return detection.language;
    }
  }

  // 3. Fall back to user locale
  if (metadata.userLocale) {
    return extractLanguage(metadata.userLocale); // "en-US" → "en"
  }

  // 4. Final fallback
  return "en";
}

// Strategy for search: Index multiple language-analyzed versions
// Query strategy: Detect language and route to appropriate index/field
```

For short queries where detection is unreliable, consider querying multiple language-specific fields simultaneously and combining results. A search for "café" can match against English, French, and Spanish fields, with the best matches rising to the top regardless of actual language.
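A sketch of what that multi-field fallback could look like as an Elasticsearch query body (the field names are illustrative):

```typescript
// Query several language-analyzed fields at once; the best match wins
// regardless of which language the user actually typed in.
const shortQueryFallback = {
  query: {
    multi_match: {
      query: "café",
      fields: ["content.en", "content.fr", "content.es"],
      type: "most_fields" // Sum scores across fields that match
    }
  }
};

console.log(JSON.stringify(shortQueryFallback, null, 2));
```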
Chinese, Japanese, and Korean (CJK) present unique challenges because they don't use spaces between words. A sentence like "東京都港区" (Tokyo, Minato Ward) appears as one continuous string of characters. Without proper segmentation, a standard tokenizer produces useless single-character tokens.
CJK-specific challenges:
| Language | Scripts | Word Segmentation | Special Challenges |
|---|---|---|---|
| Chinese (Simplified/Traditional) | Han characters | Dictionary-based; no spaces | Simplified/Traditional variants; no verb conjugation |
| Japanese | Hiragana, Katakana, Kanji, Latin | Complex; mixed scripts | Multiple readings for Kanji; script mixing |
| Korean | Hangul (Jamo), occasional Hanja | Spaces exist but optional | Complex morphology; particle handling |
```typescript
// CJK tokenization strategies

// Problem: "東京都港区赤坂" (Tokyo, Minato Ward, Akasaka)
// Standard tokenizer: ["東", "京", "都", "港", "区", "赤", "坂"] - Useless!
// Dictionary segmenter: ["東京都", "港区", "赤坂"] - Meaningful words!

// =============== JAPANESE ===============
// Recommended: Kuromoji tokenizer (included in Elasticsearch)

const japaneseAnalyzerConfig = {
  "settings": {
    "analysis": {
      "tokenizer": {
        "kuromoji_tokenizer": {
          "type": "kuromoji_tokenizer",
          "mode": "search", // "normal", "search", or "extended"
          // search mode: over-segments to improve recall
          // For: 東京都 → [東京都, 東京, 都] (compound and parts)
          "discard_punctuation": true
        }
      },
      "filter": {
        // Convert to reading for matching variations
        "kuromoji_readingform": {
          "type": "kuromoji_readingform",
          "use_romaji": false // Use katakana reading
        },
        // Normalize full-width/half-width
        "cjk_width": {
          "type": "cjk_width"
          // Folds full-width ASCII to half-width
          // Folds half-width Katakana to full-width
        },
        // Remove Japanese stopwords
        "ja_stop": {
          "type": "ja_stop"
        }
      },
      "analyzer": {
        "japanese_analyzer": {
          "type": "custom",
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "cjk_width",
            "lowercase",
            "kuromoji_readingform",
            "ja_stop"
          ]
        }
      }
    }
  }
};

// =============== CHINESE ===============
// Recommended: SmartCN or ICU tokenizer

const chineseAnalyzerConfig = {
  "settings": {
    "analysis": {
      "analyzer": {
        "smartcn_analyzer": {
          "type": "smartcn" // Built-in Chinese analyzer
          // Handles segmentation with a hidden Markov model
        },
        // Alternative: ICU tokenizer for multiple languages
        "icu_chinese": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
};

// =============== KOREAN ===============
// Recommended: Nori analyzer (official Korean plugin)

const koreanAnalyzerConfig = {
  "settings": {
    "analysis": {
      "tokenizer": {
        "nori_tokenizer": {
          "type": "nori_tokenizer",
          "decompound_mode": "mixed" // "none", "discard", "mixed"
          // mixed: index both compound and parts
        }
      },
      "analyzer": {
        "korean_analyzer": {
          "type": "custom",
          "tokenizer": "nori_tokenizer",
          "filter": [
            "nori_readingform", // Hanja → Hangul
            "lowercase",
            "nori_part_of_speech" // Remove particles
          ]
        }
      }
    }
  }
};

// =============== CROSS-CJK BIGRAM FALLBACK ===============
// When dictionary-based segmentation isn't available:
// CJK bigrams provide reasonable matching

const cjkBigramConfig = {
  "analyzer": {
    "cjk_bigram_analyzer": {
      "type": "custom",
      "tokenizer": "standard",
      "filter": ["cjk_bigram", "lowercase"]
    }
  }
};

// "東京都" with bigrams: ["東京", "京都"]
// Less accurate but works for any CJK text without dictionaries
```

CJK bigrams (overlapping pairs of characters) provide a simple fallback when proper segmentation isn't available. They generate more tokens but capture word boundaries probabilistically. This is why "京都" (Kyoto) matches even when searching "東京都" (Tokyo), a trade-off between recall and precision.
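To make the bigram fallback concrete, here is a minimal sketch of generating overlapping character pairs, a simplified version of what a `cjk_bigram`-style filter does (it ignores script detection and unigram output options):

```typescript
// Generate overlapping character bigrams from a CJK string.
function cjkBigrams(text: string): string[] {
  const chars = Array.from(text); // code-point aware split
  if (chars.length < 2) return chars; // single character: emit as-is
  const grams: string[] = [];
  for (let i = 0; i < chars.length - 1; i++) {
    grams.push(chars[i] + chars[i + 1]);
  }
  return grams;
}

console.log(cjkBigrams("東京都")); // ["東京", "京都"]
```

Both "東京都" and "京都" produce the bigram "京都", which is exactly why a query for Kyoto can surface Tokyo documents under this scheme.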
European languages share the Latin alphabet but have significant variations that affect search. German compounds, French accents, Nordic characters, and Eastern European diacritics all require specific handling.
```typescript
// European language-specific processing

// =============== GERMAN ===============
// Challenge: Compound words can be arbitrarily long
// "Rindfleischetikettierungsüberwachungsaufgabenübertragungsgesetz"
// (beef labeling supervision duties delegation law)

const germanConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Decompose compounds for better matching
        "german_decompounder": {
          "type": "hyphenation_decompounder",
          "word_list_path": "analysis/german_dictionary.txt",
          "hyphenation_patterns_path": "analysis/de_DR.xml",
          "only_longest_match": true,
          "min_subword_size": 4
        },
        "german_stemmer": {
          "type": "stemmer",
          "language": "light_german"
        },
        // Handle umlauts: ä→ae, ö→oe, ü→ue
        "german_normalization": {
          "type": "german_normalization"
        }
      },
      "analyzer": {
        "german": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "german_decompounder",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
};

// "Handschuhe" (gloves) decomposes to ["hand", "schuhe"]
// Searches for "hand" will find glove products

// =============== FRENCH/SPANISH/PORTUGUESE ===============
// Challenge: Diacritics (accents) should match with or without

const romanceLanguageConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Elision: l'école → école
        "french_elision": {
          "type": "elision",
          "articles": ["l", "m", "t", "qu", "n", "s", "j", "d", "c"]
        },
        // ASCII folding: café → cafe (but preserve original too)
        "asciifolding_preserve": {
          "type": "asciifolding",
          "preserve_original": true // Indexes both "café" AND "cafe"
        },
        "french_stemmer": { "type": "stemmer", "language": "light_french" },
        "spanish_stemmer": { "type": "stemmer", "language": "light_spanish" },
        "portuguese_stemmer": { "type": "stemmer", "language": "light_portuguese" }
      },
      "analyzer": {
        "french": {
          "tokenizer": "standard",
          "filter": ["french_elision", "lowercase", "asciifolding_preserve", "french_stemmer"]
        }
      }
    }
  }
};

// =============== NORDIC LANGUAGES ===============
// Challenge: Special characters (å, ä, ö, ø, æ)

const nordicConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // ICU folding handles Nordic characters properly
        "icu_folding_nordic": {
          "type": "icu_folding",
          // Optionally exclude specific characters from folding
          "unicodeSetFilter": "[^åäöøæ]" // Keep Nordic chars distinct
        },
        "swedish_stemmer": { "type": "stemmer", "language": "swedish" },
        "norwegian_stemmer": { "type": "stemmer", "language": "norwegian" }
      }
    }
  }
};

// =============== EASTERN EUROPEAN ===============
// Challenge: Rich diacritic systems (ą, ę, ć, ś, ź, ż, etc.)

const eastEuropeanConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Polish-specific normalization
        "polish_fold": {
          "type": "asciifolding",
          "preserve_original": true
        },
        // Stempel algorithm for Polish stemming
        "stempel_stemmer": {
          "type": "stemmer",
          "language": "polish"
        }
      }
    }
  }
};
```

When using ASCII folding for diacritics, enable `preserve_original: true`. This indexes both "café" and "cafe", allowing users who type either form to find matches. The slight index size increase is worth the improved user experience.
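The folding itself can be approximated in plain TypeScript by decomposing to NFD and stripping combining marks. This is a sketch of roughly what `asciifolding` does for accented Latin; the real filter handles many more cases:

```typescript
// Decompose, then strip combining diacritical marks (U+0300-U+036F).
// Note: ø and æ have no combining form, so this sketch does not fold them,
// which matches the Nordic advice above about keeping them distinct.
function foldDiacritics(s: string): string {
  return s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// With preserve_original semantics, both forms would be indexed:
function indexedTokens(token: string): string[] {
  const folded = foldDiacritics(token);
  return folded === token ? [token] : [token, folded];
}

console.log(indexedTokens("café")); // ["café", "cafe"]
```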
Right-to-left (RTL) languages like Arabic and Hebrew present unique challenges beyond text direction. Arabic has complex morphology, diacritics that may or may not be present, and letter forms that change based on position in a word.
| Feature | Arabic | Hebrew | Impact on Search |
|---|---|---|---|
| Root-based morphology | Extensive: k-t-b yields 100+ forms | Moderate: similar pattern | Stemming to root is essential |
| Diacritics (vowel marks) | Optional; formal text includes them | Usually omitted | Must match with or without |
| Prefix/suffix clitics | Common: والكتاب = و + ال + كتاب | Common: הבית = ה + בית | Decomposition needed |
| Letter forms | Initial, medial, final, isolated | Final forms (5 letters) | Unicode handles; normalize |
| Numerals | Eastern Arabic or Western | Hebrew numerals or Western | Consider normalizing |
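The numeral row can be illustrated with a small sketch that maps Eastern Arabic-Indic digits to Western digits (one of the normalizations the table suggests considering):

```typescript
// Eastern Arabic-Indic digits ٠-٩ occupy the contiguous range U+0660-U+0669,
// so each digit's value is its offset from U+0660.
function normalizeArabicDigits(s: string): string {
  return s.replace(/[\u0660-\u0669]/g,
    d => String(d.charCodeAt(0) - 0x0660));
}

console.log(normalizeArabicDigits("٢٠٢٤")); // "2024"
```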
```typescript
// Arabic and Hebrew search configuration

// =============== ARABIC ===============
const arabicConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Normalize Arabic text
        "arabic_normalization": {
          "type": "arabic_normalization"
          // Normalizes variations of Alef (أ، إ، آ → ا)
          // Removes tatweel (ـ)
          // Normalizes Yeh (ى → ي)
        },
        // Arabic stemmer (light stemming of prefixes/suffixes)
        "arabic_stemmer": {
          "type": "stemmer",
          "language": "arabic"
        },
        // Remove diacritics (tashkeel/harakat)
        "arabic_diacritics": {
          "type": "pattern_replace",
          "pattern": "[\u064B-\u0652]", // Fatha, Damma, Kasra, etc.
          "replacement": ""
        }
      },
      "analyzer": {
        "arabic_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "arabic_diacritics", // Remove vowel marks
            "arabic_normalization",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
};

// Example: كتب، كاتب، مكتبة، كتاب all derive from the root ك-ت-ب
// "Writing", "writer", "library", "book" - all from the same root

// =============== HEBREW ===============
const hebrewConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Hebrew-specific handling
        "hebrew_normalization": {
          "type": "icu_normalizer",
          "name": "nfkc", // Normalize Hebrew presentation forms
          "mode": "compose"
        }
      },
      "analyzer": {
        "hebrew_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "hebrew_normalization"
            // Note: Hebrew stemming is complex; consider the HebMorph plugin
          ]
        }
      }
    }
  }
};

// For production Hebrew search, consider:
// - HebMorph (https://github.com/synhershko/HebMorph)
// - Commercial Hebrew NLP solutions
// Standard stemmers work poorly for Hebrew morphology

// =============== MIXED RTL/LTR CONTENT ===============
// When documents contain both Arabic and English:

const mixedRtlConfig = {
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "arabic_analyzer",
        "fields": {
          "english": { "type": "text", "analyzer": "english" }
        }
      }
    }
  }
};

// Query both fields with appropriate boosting
const mixedQuery = {
  "query": {
    "multi_match": {
      "query": "محرك بحث search engine",
      "fields": ["content^2", "content.english"]
    }
  }
};
```

Arabic words derive from 3-4 letter roots through complex patterns. Light stemming (removing prefixes/suffixes) works for basic matching, but true root extraction requires morphological analysis. For high-quality Arabic search, consider specialized tools like CAMeL Tools or MADAMIRA.
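The light-stemming idea can be shown with a toy sketch: strip one common prefix while keeping a minimum stem length. This is not a production stemmer, and the prefix list is deliberately abbreviated:

```typescript
// Strip one common Arabic prefix, keeping at least 3 letters of stem.
// Longest prefixes first so "وال" wins over "و".
const arabicPrefixes = ["وال", "بال", "كال", "فال", "ال", "و"];

function lightStem(word: string): string {
  for (const p of arabicPrefixes) {
    if (word.startsWith(p) && word.length - p.length >= 3) {
      return word.slice(p.length);
    }
  }
  return word;
}

console.log(lightStem("والكتاب")); // "كتاب" (conjunction و + article ال stripped)
```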
Real-world content often mixes languages: product names in English on French websites, technical terms in English within Chinese documents, or user reviews that switch languages mid-sentence. Handling this requires strategic decisions about analysis and querying.
```typescript
// Strategies for mixed-language content

// =============== STRATEGY 1: Multi-field indexing ===============
// Index the same content with multiple language analyzers

const multiFieldConfig = {
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "analyzer": "standard", // Generic
        "fields": {
          "en": { "type": "text", "analyzer": "english" },
          "fr": { "type": "text", "analyzer": "french" },
          "de": { "type": "text", "analyzer": "german" },
          "ja": { "type": "text", "analyzer": "japanese_analyzer" }
        }
      }
    }
  }
};

// Query all language fields:
const multiLangQuery = {
  "query": {
    "multi_match": {
      "query": "ordinateur portable",
      "fields": ["content", "content.en", "content.fr", "content.de"],
      "type": "best_fields"
    }
  }
};

// Pros: Broad coverage, no language detection needed at query time
// Cons: Index size multiplied by number of languages

// =============== STRATEGY 2: Language-per-field ===============
// Detect language at indexing time, store in appropriate field

interface MultiLangDocument {
  id: string;
  content_en?: string;
  content_fr?: string;
  content_de?: string;
  content_ja?: string;
  detected_language: string;
}

async function indexMultiLang(doc: { id: string; content: string }) {
  const lang = await detectLanguage(doc.content);
  return {
    id: doc.id,
    [`content_${lang}`]: doc.content,
    detected_language: lang
  };
}

// Pros: More storage efficient, precise routing
// Cons: Detection errors propagate; language must be detected at query time

// =============== STRATEGY 3: Paragraph-level detection ===============
// For long documents, detect and index each paragraph separately

async function indexParagraphs(docId: string, content: string) {
  const paragraphs = content.split('\n\n'); // Split on blank lines, not spaces
  const langSegments: Array<{ text: string; lang: string; position: number }> = [];

  for (let i = 0; i < paragraphs.length; i++) {
    const lang = await detectLanguage(paragraphs[i]);
    langSegments.push({ text: paragraphs[i], lang, position: i });
  }

  return { id: docId, segments: langSegments };
}

// =============== STRATEGY 4: Universal analyzer ===============
// Use a single analyzer that works "well enough" for multiple languages

const universalAnalyzerConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // ICU folding handles diacritics across languages
        "icu_fold": {
          "type": "icu_folding"
        }
      },
      "tokenizer": {
        "icu_tokenizer": {
          "type": "icu_tokenizer" // Smart word segmentation
        }
      },
      "analyzer": {
        "universal": {
          "tokenizer": "icu_tokenizer",
          "filter": ["lowercase", "icu_fold"]
          // Note: No stemmer - stemmers are language-specific
        }
      }
    }
  }
};

// Pros: Simple, single index, works across languages
// Cons: No stemming, less precise than language-specific analyzers

// =============== STRATEGY 5: English core, others supplemental ===============
// Common pattern: English is primary, detect/handle only when non-English

async function intelligentIndex(content: string, hints: { locale?: string }) {
  // Default to English
  let primaryLang = 'en';

  // Check if content is non-English
  if (hints.locale && hints.locale !== 'en') {
    primaryLang = hints.locale;
  } else if (content.length > 100) {
    const detected = await detectLanguage(content);
    if (detected.confidence > 0.8 && detected.language !== 'en') {
      primaryLang = detected.language;
    }
  }

  return { content, language: primaryLang };
}
```

Most applications have a primary language representing 80%+ of content. Optimize heavily for that language, then ensure other languages are at least searchable. Don't over-engineer for edge cases before validating that multilingual search quality matters to your users.
Users often search for non-Latin content using Latin characters. A Japanese user might search "Tokyo" instead of "東京", or "sushi" instead of "寿司". Transliteration converts one script to another, typically to Latin/Roman characters. Supporting this dramatically improves search accessibility.
```typescript
// Transliteration strategies for search

// =============== ICU TRANSFORM (General Purpose) ===============
// ICU provides script transformation across many languages

const icuTransliterationConfig = {
  "settings": {
    "analysis": {
      "filter": {
        // Any script to Latin transliteration
        "any_to_latin": {
          "type": "icu_transform",
          "id": "Any-Latin"
          // 東京 → Dōngjīng (Mandarin pronunciation)
          // Note: May not match user expectations
        },
        // Specific transformation chains
        "japanese_to_romaji": {
          "type": "icu_transform",
          "id": "Katakana-Hiragana; Hiragana-Latin"
          // コンピュータ → konpyuuta
        },
        // Cyrillic to Latin
        "cyrillic_to_latin": {
          "type": "icu_transform",
          "id": "Cyrillic-Latin"
          // Москва → Moskva
        },
        // Greek to Latin
        "greek_to_latin": {
          "type": "icu_transform",
          "id": "Greek-Latin"
          // Αθήνα → Athína
        }
      },
      "analyzer": {
        "transliterated_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "any_to_latin",
            "asciifolding" // Remove diacritics from result
          ]
        }
      }
    }
  }
};

// =============== JAPANESE-SPECIFIC ROMANIZATION ===============
// Japanese has multiple romanization systems (Hepburn, Kunrei, etc.)

const japaneseRomajiConfig = {
  "settings": {
    "analysis": {
      "filter": {
        "ja_romaji": {
          "type": "kuromoji_readingform",
          "use_romaji": true
          // 東京 → tokyo (from the Kanji reading)
          // コンピュータ → konpyuuta
        }
      },
      "analyzer": {
        "japanese_romaji": {
          "tokenizer": "kuromoji_tokenizer",
          "filter": [
            "kuromoji_baseform",
            "ja_romaji",
            "lowercase"
          ]
        }
      }
    }
  },
  // Index both original and romanized
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "japanese_analyzer", // Original
        "fields": {
          "romaji": {
            "type": "text",
            "analyzer": "japanese_romaji" // Romanized
          }
        }
      }
    }
  }
};

// Query example: User types "tokyo"
// Matches documents containing "東京" through the romaji field
const romajiQuery = {
  "query": {
    "multi_match": {
      "query": "tokyo",
      "fields": ["title", "title.romaji"]
    }
  }
};

// =============== PHONETIC MATCHING ===============
// For name search where spelling varies

const phoneticConfig = {
  "settings": {
    "analysis": {
      "filter": {
        "phonetic": {
          "type": "phonetic",
          "encoder": "double_metaphone", // or "soundex", "caverphone2"
          "replace": false // Keep original token too
        }
      },
      "analyzer": {
        "phonetic_analyzer": {
          "tokenizer": "standard",
          "filter": ["lowercase", "phonetic"]
        }
      }
    }
  }
};

// "Smith", "Smyth", "Schmidt" all encode to similar phonetic codes
// Helps with international name variations
```

For the best user experience, index both the original script and the romanized form in separate fields. This allows native speakers to search in their script while giving international users the ability to search using Latin characters. Query both fields with appropriate boosting.
Building a production multilingual search system requires decisions at multiple levels: index structure, analyzer configuration, query routing, and result ranking. Here's a recommended architecture for a globally deployed search system.
```typescript
// Production multilingual search architecture

// =============== OPTION A: Single Index, Multiple Fields ===============
// Best for: Limited languages, mixed-language content

const singleIndexApproach = {
  indexSettings: {
    "settings": {
      "analysis": {
        "analyzer": {
          "en": { /* English config */ },
          "de": { /* German config */ },
          "fr": { /* French config */ },
          "ja": { /* Japanese config */ },
          "universal": { /* ICU-based fallback */ }
        }
      }
    },
    "mappings": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "universal",
          "fields": {
            "en": { "type": "text", "analyzer": "en" },
            "de": { "type": "text", "analyzer": "de" },
            "fr": { "type": "text", "analyzer": "fr" },
            "ja": { "type": "text", "analyzer": "ja" }
          }
        },
        "language": { "type": "keyword" }
      }
    }
  },
  pros: [
    "Single index to manage",
    "Cross-language search easy",
    "Schema changes apply everywhere"
  ],
  cons: [
    "Index size grows with each language",
    "All analyzers must be defined upfront",
    "Hard to optimize per-language"
  ]
};

// =============== OPTION B: Index Per Language/Region ===============
// Best for: Many languages, language-specific ranking, geo-distribution

const multiIndexApproach = {
  indices: {
    "products-en": { /* English analyzer */ },
    "products-de": { /* German analyzer */ },
    "products-fr": { /* French analyzer */ },
    "products-ja": { /* Japanese analyzer */ },
    "products-default": { /* Universal fallback */ }
  },
  // Alias for unified querying
  alias: {
    name: "products-all",
    indices: ["products-en", "products-de", "products-fr", "products-ja"]
  },
  // Routing logic
  searchRouter: async (query: string, userLocale?: string) => {
    // Determine target indices
    let targetIndices: string[];

    if (userLocale) {
      // User has explicit preference
      const primaryIndex = `products-${userLocale}`;
      targetIndices = [primaryIndex, "products-en"]; // Fallback to English
    } else {
      // Detect from query
      const detected = await detectLanguage(query);
      if (detected.isReliable) {
        targetIndices = [`products-${detected.language}`, "products-en"];
      } else {
        // Query all
        targetIndices = ["products-all"];
      }
    }
    return targetIndices;
  },
  pros: [
    "Optimized index per language",
    "Can geo-replicate specific languages",
    "Independent scaling and management",
    "Easy to add new languages"
  ],
  cons: [
    "More indices to manage",
    "Cross-language search requires alias",
    "Schema changes must sync across indices"
  ]
};

// =============== QUERY ARCHITECTURE ===============
// (detectFromQuery, extractLocaleLanguage, getIndexForLang, mergeResults,
// and client are application-level helpers assumed to exist)

interface SearchRequest {
  query: string;
  language?: string;  // Explicit language preference
  locale?: string;    // User locale
  fallback?: boolean; // Whether to fall back to other languages
}

async function multilingualSearch(request: SearchRequest) {
  // Step 1: Resolve target language
  const queryLang = request.language
    || await detectFromQuery(request.query)
    || extractLocaleLanguage(request.locale)
    || 'en';

  // Step 2: Build language-appropriate query
  const esQuery = buildLangQuery(request.query, queryLang);

  // Step 3: Execute search
  const primaryResults = await client.search({
    index: getIndexForLang(queryLang),
    body: esQuery
  });

  // Step 4: Optional fallback for low results
  if (request.fallback && primaryResults.hits.total.value < 5) {
    const fallbackResults = await client.search({
      index: "products-all",
      body: esQuery
    });
    return mergeResults(primaryResults, fallbackResults);
  }

  return primaryResults;
}

function buildLangQuery(query: string, lang: string) {
  // Language-specific query construction
  const contentField = `content.${lang}`;
  return {
    query: {
      bool: {
        should: [
          { match: { [contentField]: { query, boost: 2.0 } } },
          { match: { "content": query } } // Fallback to universal
        ]
      }
    }
  };
}
```

Language handling transforms search from a simple string matching problem into a complex linguistic challenge. Getting it right means understanding the unique characteristics of each language you support and building appropriate processing pipelines.
What's next:
With language handling covered, we now turn to the heart of search quality: relevance scoring. The next page explores TF-IDF, BM25, and modern approaches to ranking search results so that the most relevant documents appear at the top.
You now understand how to build search systems that work across multiple languages. From character encoding to script-specific tokenization to transliteration, you have the tools to serve users globally with appropriate linguistic processing.