A search for "python programming tutorial" in a well-indexed documentation system might match 50,000 documents. All of them contain those words. But which documents should appear first?
This is the ranking problem—arguably the most important challenge in search. Finding matching documents is necessary but insufficient. Users expect the most relevant results at the top, with relevance declining smoothly as they scroll. A search system that returns results in random order is essentially useless, no matter how complete its coverage.
Relevance scoring assigns a numerical score to each matching document, enabling ranking. The algorithms that compute these scores—primarily TF-IDF and BM25—represent decades of information retrieval research distilled into production-ready formulas. Understanding them is essential for any engineer building or tuning search systems.
By the end of this page, you will understand the mathematical foundations of TF-IDF and BM25, how modern search engines compute relevance scores, techniques for tuning and customizing scoring, and when to go beyond traditional text-based ranking with learning-to-rank and semantic search.
Before diving into algorithms, we need to understand what makes a document "relevant" and how we can quantify relevance computationally.
Two fundamental insights drive relevance scoring:

- A term that appears many times in a document suggests the document is about that term (term frequency).
- A term that appears in only a few documents is more informative than one that appears everywhere (inverse document frequency).

These two insights combine into TF-IDF and its successor BM25. But there are additional factors that influence perceived relevance:

- How frequently each query term appears in the document
- How rare each query term is across the corpus
- How long the document is relative to the rest of the corpus
- Where the match occurs (title, heading, body, metadata)
- How fresh, popular, or authoritative the document is
- What the individual searcher is actually looking for
Classic TF-IDF and BM25 handle the first three factors. The others require additional signals and custom scoring.
```typescript
// Intuition behind relevance scoring

interface Document {
  id: string;
  content: string;
  wordCount: number;
}

// Query: "python asyncio tutorial"
const candidates: Document[] = [
  {
    id: "doc1",
    content: "Python asyncio tutorial: Learn Python asyncio from scratch...",
    wordCount: 500,
    // "python" appears 15 times, "asyncio" 12 times, "tutorial" 5 times
  },
  {
    id: "doc2",
    content: "Programming tutorials for beginners. Python is popular...",
    wordCount: 2000,
    // "python" appears 3 times, "tutorial" 8 times, no "asyncio"
  },
  {
    id: "doc3",
    content: "The complete guide to asyncio patterns in Python applications...",
    wordCount: 800,
    // "python" appears 6 times, "asyncio" 20 times, no "tutorial"
  }
];

// Intuitive ranking:
// 1. doc1 - All query terms, high frequency, focused document
// 2. doc3 - Most asyncio mentions (rare term), good Python coverage
// 3. doc2 - Common terms only, no asyncio, less focused

// TF-IDF and BM25 formalize this intuition mathematically
```

TF-IDF (Term Frequency-Inverse Document Frequency) is the foundational relevance scoring algorithm. Invented in the 1970s, it remains conceptually important even though most modern search engines use BM25. Understanding TF-IDF is essential for understanding BM25.
The formula:
$$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$$
Where:

- $t$ is a query term, $d$ is the document being scored, and $D$ is the full document collection
- $TF(t, d)$ measures how often term $t$ appears in document $d$
- $IDF(t, D)$ measures how rare term $t$ is across the collection $D$
```typescript
// TF-IDF Implementation and Examples

// =============== TERM FREQUENCY (TF) ===============
// Several variants exist:

// Raw TF: Simply count occurrences
const rawTF = (term: string, document: string[]): number => {
  return document.filter(word => word === term).length;
};

// Logarithmic TF: Dampens the effect of high frequencies
// Prevents a document with "python" 100 times from scoring 100x
// higher than one with it 1 time
const logTF = (term: string, document: string[]): number => {
  const raw = rawTF(term, document);
  return raw > 0 ? 1 + Math.log(raw) : 0;
};

// Helper: frequency of each distinct term in the document
const getTermFrequencies = (document: string[]): number[] => {
  const counts = new Map<string, number>();
  for (const word of document) counts.set(word, (counts.get(word) || 0) + 1);
  return [...counts.values()];
};

// Augmented TF: Normalizes by maximum term frequency in document
// Prevents bias toward longer documents
const augmentedTF = (term: string, document: string[]): number => {
  const raw = rawTF(term, document);
  const maxTF = Math.max(...getTermFrequencies(document));
  return 0.5 + (0.5 * raw / maxTF);
};

// =============== INVERSE DOCUMENT FREQUENCY (IDF) ===============
// Measures how rare/informative a term is across the corpus

// Standard IDF
const idf = (term: string, corpus: string[][]): number => {
  const N = corpus.length;                                    // Total documents
  const df = corpus.filter(doc => doc.includes(term)).length; // Docs containing term
  // Add 1 to prevent division by zero for unseen terms
  return Math.log(N / (df + 1));
};

// Smooth IDF (used by Elasticsearch)
const smoothIDF = (totalDocs: number, docFreq: number): number => {
  // log(1 + (N - df + 0.5) / (df + 0.5))
  return Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
};

// =============== COMPLETE TF-IDF ===============
const tfIdf = (term: string, document: string[], corpus: string[][]): number => {
  const tf = logTF(term, document);
  const idfScore = idf(term, corpus);
  return tf * idfScore;
};

// =============== EXAMPLE CALCULATION ===============
// Corpus of 10,000 documents
// Query: "python asyncio"
// Document: 500 words, "python" appears 10 times, "asyncio" appears 5 times

const exampleScoring = {
  // Term: "python"
  python: {
    tf: 1 + Math.log(10),          // = 3.30 (log TF)
    docsContaining: 5000,          // 50% of corpus
    idf: Math.log(10000 / 5000),   // = 0.69 (common term, low IDF)
    tfIdf: 3.30 * 0.69             // = 2.28
  },

  // Term: "asyncio"
  asyncio: {
    tf: 1 + Math.log(5),           // = 2.61 (log TF)
    docsContaining: 200,           // 2% of corpus
    idf: Math.log(10000 / 200),    // = 3.91 (rare term, high IDF)
    tfIdf: 2.61 * 3.91             // = 10.21
  },

  // Total document score for query
  totalScore: 2.28 + 10.21         // = 12.49
};

// Note: "asyncio" contributes more despite lower TF
// because its rarity (high IDF) makes it more informative
```

The power of TF-IDF comes primarily from IDF. Without it, common words like "the" and "is" would dominate scoring. IDF ensures that distinctive terms—the ones that actually distinguish relevant documents from irrelevant ones—receive appropriate weight.
BM25 (Best Match 25) is the default ranking algorithm in Elasticsearch, Solr, and most modern search engines. Developed in the 1990s based on probabilistic information retrieval theory, BM25 addresses several limitations of classic TF-IDF.
Key improvements over TF-IDF:

- **Term frequency saturation:** additional occurrences of a term yield diminishing returns, controlled by the k1 parameter, so a document repeating "python" 100 times doesn't score wildly higher than one mentioning it a handful of times.
- **Document length normalization:** scores are adjusted by document length relative to the corpus average, controlled by the b parameter, so long documents don't win simply by containing more words.
- **Tunable parameters:** k1 and b can be adjusted per corpus and use case rather than being fixed by the formula.
The BM25 formula:

$$score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \times \frac{f(q_i, D) \times (k_1 + 1)}{f(q_i, D) + k_1 \times \left(1 - b + b \times \frac{|D|}{avgdl}\right)}$$
```typescript
// BM25 Algorithm Implementation

/**
 * BM25 Scoring Formula:
 *
 * For a query Q containing terms q1, q2, ... qn:
 *
 * score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
 *
 * Where:
 *   f(qi, D) = frequency of term qi in document D
 *   |D|      = document length (in words)
 *   avgdl    = average document length in the corpus
 *   k1       = term frequency saturation parameter (typically 1.2-2.0)
 *   b        = document length normalization parameter (typically 0.75)
 */

interface BM25Config {
  k1: number; // Term frequency saturation (1.2-2.0)
  b: number;  // Length normalization (0-1, typically 0.75)
}

interface CorpusStats {
  totalDocs: number;
  avgDocLength: number;
  docFrequencies: Map<string, number>; // term -> docs containing term
}

// IDF component (same concept as TF-IDF, slightly different formula)
function bm25IDF(term: string, stats: CorpusStats): number {
  const N = stats.totalDocs;
  const df = stats.docFrequencies.get(term) || 0;
  // Elasticsearch formula: log(1 + (N - df + 0.5) / (df + 0.5))
  return Math.log(1 + (N - df + 0.5) / (df + 0.5));
}

// Score for a single term
function bm25TermScore(
  termFreq: number,
  docLength: number,
  idf: number,
  config: BM25Config,
  avgDocLength: number
): number {
  const { k1, b } = config;
  // Length normalization factor
  const lengthNorm = 1 - b + b * (docLength / avgDocLength);
  // TF saturation formula
  const tfComponent = (termFreq * (k1 + 1)) / (termFreq + k1 * lengthNorm);
  return idf * tfComponent;
}

// Complete BM25 score for a document
function bm25Score(
  queryTerms: string[],
  document: { terms: Map<string, number>; length: number },
  stats: CorpusStats,
  config: BM25Config = { k1: 1.2, b: 0.75 }
): number {
  let totalScore = 0;
  for (const term of queryTerms) {
    const termFreq = document.terms.get(term) || 0;
    if (termFreq > 0) {
      const idf = bm25IDF(term, stats);
      const termScore = bm25TermScore(
        termFreq,
        document.length,
        idf,
        config,
        stats.avgDocLength
      );
      totalScore += termScore;
    }
  }
  return totalScore;
}

// =============== WORKED EXAMPLE ===============

const exampleStats: CorpusStats = {
  totalDocs: 10000,
  avgDocLength: 500,
  docFrequencies: new Map([
    ["python", 5000],   // 50% of docs
    ["asyncio", 200],   // 2% of docs
    ["tutorial", 3000], // 30% of docs
  ])
};

const exampleDoc = {
  terms: new Map([
    ["python", 10],
    ["asyncio", 5],
    ["tutorial", 2],
  ]),
  length: 400 // Shorter than average
};

// Score breakdown (k1=1.2, b=0.75):
// "python":   IDF ≈ 0.69, TF = 10, score ≈ 1.4
// "asyncio":  IDF ≈ 3.91, TF = 5,  score ≈ 7.1
// "tutorial": IDF ≈ 1.20, TF = 2,  score ≈ 1.8
//
// Total ≈ 10.3

// Compare to a longer document (1000 words) with same term frequencies:
// Scores would be lower due to length normalization
```

k1 controls term frequency saturation. Higher k1 (e.g., 2.0) means term frequency continues to matter more; lower k1 (e.g., 0.5) saturates quickly. b controls length normalization. b=1 fully normalizes by length; b=0 ignores length. The defaults (k1=1.2, b=0.75) work well for most content, but you may need to tune for short/long documents.
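To make saturation concrete, the short sketch below reuses the bm25TermScore function from the implementation above, holding IDF and document length fixed so only the term-frequency component varies. The specific values are illustrative, not prescriptive.

```typescript
// Illustrative sketch: how the BM25 TF component saturates as term frequency grows.
// Assumes the bm25TermScore() helper defined in the example above.

const demoConfigLow = { k1: 0.5, b: 0.75 };  // saturates quickly
const demoConfigHigh = { k1: 2.0, b: 0.75 }; // keeps rewarding repetition longer

for (const tf of [1, 2, 5, 10, 50]) {
  // IDF fixed at 1.0 so only the TF component varies; doc length equals the average
  const low = bm25TermScore(tf, 500, 1.0, demoConfigLow, 500);
  const high = bm25TermScore(tf, 500, 1.0, demoConfigHigh, 500);
  console.log(`tf=${tf}  k1=0.5 -> ${low.toFixed(2)}  k1=2.0 -> ${high.toFixed(2)}`);
}

// With k1=0.5 the score approaches its ceiling (k1 + 1 = 1.5) after a few occurrences;
// with k1=2.0 the ceiling is 3.0 and each extra occurrence still adds noticeable score.
```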
The default BM25 parameters work well for general content, but specific domains often benefit from tuning. Understanding what each parameter does enables informed optimization.
Parameter effects:
| Parameter | Default | Low Value Effect | High Value Effect | When to Adjust |
|---|---|---|---|---|
| k1 (saturation) | 1.2 | Quick saturation; TF matters less beyond first few occurrences | Slow saturation; high TF documents score much higher | Lower for short docs (tweets); higher for long technical content |
| b (length norm) | 0.75 | Minimal length penalty; long docs may dominate | Strong length normalization; short docs favored | Lower when doc length correlates with quality; higher when length is noise |
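The table's effects become concrete with a few numbers. The sketch below again assumes the bm25TermScore helper from the BM25 implementation above and a hypothetical corpus with a 500-word average document; the exact scores are only illustrative.

```typescript
// Illustrative sketch: effect of the b parameter on short vs. long documents.
// Assumes bm25TermScore() from the BM25 implementation above; average doc length is 500 words.

const idfFixed = 2.0; // hypothetical IDF, held constant
const freq = 5;       // same term frequency in both documents

for (const b of [0.0, 0.75, 1.0]) {
  const shortDoc = bm25TermScore(freq, 200, idfFixed, { k1: 1.2, b }, 500);
  const longDoc = bm25TermScore(freq, 2000, idfFixed, { k1: 1.2, b }, 500);
  console.log(`b=${b}  200-word doc: ${shortDoc.toFixed(2)}  2000-word doc: ${longDoc.toFixed(2)}`);
}

// b = 0:   both documents score the same (length is ignored)
// b > 0:   the short document (below average length) gets a boost while the
//          long document is penalized; the effect grows as b approaches 1
```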
```typescript
// BM25 parameter tuning strategies

// =============== Elasticsearch BM25 Configuration ===============

const customBM25Settings = {
  "settings": {
    "index": {
      "similarity": {
        "custom_bm25": {
          "type": "BM25",
          "k1": 1.2, // Default
          "b": 0.75  // Default
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "custom_bm25"
      }
    }
  }
};

// =============== Use Case-Specific Tuning ===============

const tuningGuide = {
  // Tweet/short text search
  shortText: {
    k1: 0.5, // Saturate quickly - in 280 chars, term appearing once is enough signal
    b: 0.3,  // Minimal length normalization - all docs are short anyway
    rationale: "Short documents have limited term frequency range; length differences are meaningful"
  },

  // Legal/academic documents
  longDocuments: {
    k1: 2.0, // Allow TF to keep mattering in 50-page documents
    b: 0.75, // Standard length normalization
    rationale: "Long documents benefit from high TF distinguishing comprehensive coverage"
  },

  // Product titles
  productTitles: {
    k1: 1.5, // Moderate saturation
    b: 0.0,  // NO length normalization - "iPhone 15 Pro" shouldn't beat "iPhone" just for being longer
    rationale: "Title length is intentional and meaningful"
  },

  // Q&A / Stack Overflow style
  questionsAndAnswers: {
    k1: 1.2,
    b: 0.9, // Strong length normalization - concise answers often better
    rationale: "Longer answers aren't necessarily better; reward focused content"
  },

  // Log/event search
  logSearch: {
    k1: 0.7, // Logs are often sparse; term appearing once is significant
    b: 0.1,  // Log length varies but doesn't indicate quality
    rationale: "Structure matters more than term frequency in logs"
  }
};

// =============== A/B Testing for Parameters ===============

interface SearchExperiment {
  name: string;
  k1: number;
  b: number;
  ndcg: number; // Normalized Discounted Cumulative Gain
  mrr: number;  // Mean Reciprocal Rank
}

const experimentResults: SearchExperiment[] = [
  { name: "baseline", k1: 1.2, b: 0.75, ndcg: 0.72, mrr: 0.68 },
  { name: "low_k1",   k1: 0.8, b: 0.75, ndcg: 0.69, mrr: 0.71 },
  { name: "high_k1",  k1: 1.8, b: 0.75, ndcg: 0.74, mrr: 0.67 },
  { name: "low_b",    k1: 1.2, b: 0.4,  ndcg: 0.73, mrr: 0.69 },
  { name: "high_b",   k1: 1.2, b: 0.9,  ndcg: 0.70, mrr: 0.72 },
];

// Best configuration depends on your evaluation metric priorities
// NDCG values overall result quality; MRR values top-1 precision
```

Never tune BM25 parameters based on intuition alone. Use relevance judgments (query + relevant document pairs) and measure impact with metrics like NDCG, MRR, or precision@k. Small parameter changes can significantly affect ranking quality—both positively and negatively.
In real documents, not all content is equally important. A match in the title is usually more relevant than a match in the footer. Field-level boosting allows you to weight matches by where they occur.
Common boosting patterns:
```typescript
// Field boosting strategies in Elasticsearch

// =============== QUERY-TIME BOOSTING ===============

// Multi-match with field boosts
const multiMatchQuery = {
  "query": {
    "multi_match": {
      "query": "python asyncio tutorial",
      "fields": [
        "title^5",      // Title matches worth 5x
        "headings^3",   // Section headings worth 3x
        "content^1",    // Body content baseline
        "tags^4",       // Tags are highly relevant
        "metadata^0.5"  // Metadata less important
      ],
      "type": "best_fields", // Score based on best matching field
      "tie_breaker": 0.3     // Also consider other field matches
    }
  }
};

// Understanding tie_breaker:
// - 0:   Only best field counts (default for best_fields)
// - 0.3: Best field + 30% of other matching fields
// - 1.0: Sum of all fields (like most_fields)

// =============== MULTI-MATCH TYPES ===============

const multiMatchTypes = {
  // best_fields: Use score from best matching field
  // Good for: Finding documents where query appears in ONE key field
  bestFields: {
    "type": "best_fields",
    "fields": ["title^2", "content"]
  },

  // most_fields: Sum scores from all matching fields
  // Good for: Documents where query might appear across multiple fields
  mostFields: {
    "type": "most_fields",
    "fields": ["title", "title.stemmed", "title.exact"] // Multiple analyzers
  },

  // cross_fields: Treat all fields as one big field
  // Good for: Queries that span fields (person's first + last name)
  crossFields: {
    "type": "cross_fields",
    "fields": ["first_name", "last_name"],
    "operator": "and" // All terms must be found across fields
  },

  // phrase: Match as phrase in each field
  // Good for: Exact phrase matching with field preferences
  phrase: {
    "type": "phrase",
    "fields": ["title^2", "content"],
    "slop": 2 // Allow 2 words between query terms
  }
};

// =============== FUNCTION SCORE FOR COMPLEX BOOSTING ===============

const functionScoreQuery = {
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "python tutorial",
          "fields": ["title^2", "content"]
        }
      },
      "functions": [
        // Boost newer documents
        {
          "filter": { "range": { "date": { "gte": "now-1M" } } },
          "weight": 1.5
        },
        // Boost popular documents (log scale)
        {
          "field_value_factor": {
            "field": "view_count",
            "factor": 1.2,
            "modifier": "log1p", // log(1 + view_count * 1.2)
            "missing": 1
          }
        },
        // Boost verified authors
        {
          "filter": { "term": { "author.verified": true } },
          "weight": 1.2
        },
        // Decay by age (documents get less relevant over time)
        {
          "gauss": {
            "date": {
              "origin": "now",
              "scale": "30d",
              "offset": "7d",
              "decay": 0.5
            }
          }
        }
      ],
      "score_mode": "multiply", // How to combine function scores
      "boost_mode": "multiply"  // How to combine with query score
    }
  }
};
```

Begin with basic field boosting (title^2, content^1). Only add function_score complexity when basic boosting proves insufficient. Each boost factor you add is a parameter to tune and maintain. Simpler ranking functions are easier to understand and debug.
When search results don't rank as expected, you need to understand how scores were computed. Search engines provide explain APIs that break down exactly how each score was calculated.
Debugging relevance issues:
```typescript
// Debugging relevance with Elasticsearch explain API

// =============== GET SCORE EXPLANATION ===============

// Add "explain": true to your search
const explainRequest = {
  "query": {
    "match": { "content": "python tutorial" }
  },
  "explain": true // Returns score breakdown
};

// Response includes explanation for each hit:
const explainResponse = {
  "hits": {
    "hits": [{
      "_id": "doc1",
      "_score": 12.5,
      "_explanation": {
        "value": 12.5,
        "description": "sum of:",
        "details": [
          {
            "value": 8.3,
            "description": "weight(content:python) [BM25]",
            "details": [
              { "value": 2.1, "description": "idf, log(1 + (N - n + 0.5) / (n + 0.5))" },
              { "value": 3.95, "description": "tf, freq=10.0, with k1=1.2, b=0.75" }
            ]
          },
          {
            "value": 4.2,
            "description": "weight(content:tutorial) [BM25]",
            "details": [/* ... */]
          }
        ]
      }
    }]
  }
};

// =============== EXPLAIN A SPECIFIC DOCUMENT ===============

// Why did this document score what it did?
const explainDocRequest = {
  // POST /my_index/_explain/doc_id
  "query": {
    "match": { "content": "python tutorial" }
  }
};

// =============== COMMON DEBUGGING SCENARIOS ===============

const debuggingGuide = {
  // Issue: Document with more term occurrences ranks lower
  unexpectedlyLowScore: {
    possibleCauses: [
      "Document is very long (length normalization)",
      "Terms are common (low IDF)",
      "Field boost is lower than expected",
      "Terms match a stemmed version (different IDF)"
    ],
    debugSteps: [
      "Check explain output for IDF values",
      "Compare document lengths",
      "Verify field boosts in query",
      "Look at tf normalization factor"
    ]
  },

  // Issue: Irrelevant document ranks high
  unexpectedlyHighScore: {
    possibleCauses: [
      "Document is very short (benefits from length norm)",
      "Term appears in boosted field (title)",
      "Query terms are accidentally common in this doc",
      "Synonym expansion matched unintended terms"
    ],
    debugSteps: [
      "Check which field matched (explain shows field)",
      "Verify analyzer behavior with _analyze",
      "Check for synonym expansion issues",
      "Review field boosting configuration"
    ]
  },

  // Issue: Perfect document doesn't appear at all
  missingDocument: {
    possibleCauses: [
      "Analyzer processes query differently than document",
      "Stopword removal eliminated query",
      "Stemmer created mismatch",
      "Document was filtered (not a scoring issue)"
    ],
    debugSteps: [
      "Use _analyze API on query terms",
      "Use _termvectors to see how doc was indexed",
      "Check if filters are excluding the document",
      "Verify index contains the document"
    ]
  }
};

// =============== ANALYZER DEBUGGING ===============

// Check how a term is analyzed
const analyzeRequest = {
  // POST /my_index/_analyze
  "analyzer": "my_analyzer",
  "text": "running pythons"
};
// Response: ["run", "python"] (after stemming)
// If document was indexed with different analyzer, mismatch occurs!

// Check how a document was actually indexed
const termVectorsRequest = {
  // GET /my_index/_termvectors/doc_id
  "fields": ["content"],
  "term_statistics": true,
  "field_statistics": true
};
```

The explain parameter significantly slows down queries—it's for debugging only. Never enable it in production search requests. Use it on a specific document ID or a limited result set during development and troubleshooting.
While BM25 remains the backbone of most search systems, modern approaches add layers on top for improved relevance. These techniques address BM25's limitations: it only considers lexical matching, ignores user behavior, and can't learn from feedback.
Modern relevance enhancements:
| Technique | How It Works | Best For | Complexity |
|---|---|---|---|
| Learning to Rank (LTR) | ML model trained on click data/judgments combines multiple signals | E-commerce, content sites with click data | High: requires training data and infrastructure |
| Semantic Search (Vector/Dense Retrieval) | Encode queries and docs as vectors; find nearest neighbors | When lexical mismatch is common; Q&A systems | High: requires embedding models, vector DB |
| Hybrid (BM25 + Vectors) | Combine BM25 lexical scores with semantic similarity | Best of both worlds; recommended starting point | Medium: add vector search to existing |
| Query Understanding | Rewrite/expand queries before BM25 scoring | Handling synonyms, spell correction, intent | Medium: NLP pipeline before search |
| Personalization | Adjust scores based on user history/preferences | Content platforms, e-commerce | Medium: requires user data |
```typescript
// Modern relevance approaches

// =============== LEARNING TO RANK (LTR) ===============
// Use ML to combine signals: BM25, popularity, freshness, etc.

interface LTRFeatures {
  bm25_score: number;
  title_match: boolean;
  doc_length: number;
  page_views: number;
  days_since_publish: number;
  author_reputation: number;
  has_images: boolean;
}

// Training data: [features, relevance_label]
// Model learns optimal combination of signals

const ltrPluginQuery = {
  "query": {
    "sltr": {
      "_name": "logged_featureset",
      "featureset": "my_features",
      "params": { "keywords": "python tutorial" },
      "model": "my_trained_model"
    }
  }
};

// =============== SEMANTIC / VECTOR SEARCH ===============
// Encode text as dense vectors; find similar vectors

// First, encode documents and queries using an embedding model
// (e.g., Sentence-BERT, OpenAI embeddings)

const vectorSearchQuery = {
  // Elasticsearch 8.x+ with dense_vector field
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56, /* ... 768 dims */],
    "k": 10,
    "num_candidates": 100
  }
};

// =============== HYBRID SEARCH (BM25 + VECTOR) ===============
// Best practice: combine lexical and semantic signals

const hybridSearchQuery = {
  "query": {
    "bool": {
      "should": [
        // Lexical matching (BM25)
        {
          "multi_match": {
            "query": "python async programming",
            "fields": ["title^2", "content"],
            "boost": 1.0
          }
        }
      ]
    }
  },
  // Semantic matching (KNN)
  "knn": {
    "field": "content_vector",
    "query_vector": embedQuery("python async programming"),
    "k": 10,
    "boost": 0.5 // Weight relative to BM25
  }
};

// Reciprocal Rank Fusion (RRF) for combining rankings
function reciprocalRankFusion(
  bm25Ranking: string[],
  vectorRanking: string[],
  k: number = 60 // Tunable parameter
): Map<string, number> {
  const scores = new Map<string, number>();
  // Score = Σ 1 / (k + rank)
  bm25Ranking.forEach((docId, rank) => {
    const current = scores.get(docId) || 0;
    scores.set(docId, current + 1 / (k + rank + 1));
  });
  vectorRanking.forEach((docId, rank) => {
    const current = scores.get(docId) || 0;
    scores.set(docId, current + 1 / (k + rank + 1));
  });
  return scores;
}

// =============== QUERY EXPANSION ===============
// Improve BM25 input by expanding the query

interface ExpandedQuery {
  original: string;
  synonyms: string[];
  spellingCorrections: string[];
  relatedTerms: string[];
}

async function expandQuery(query: string): Promise<ExpandedQuery> {
  return {
    original: "python asyncio",
    synonyms: ["python async io", "python asynchronous"],
    spellingCorrections: [],
    relatedTerms: ["coroutines", "event loop", "await"]
  };
}

// Build expanded query
const expandedQueryDSL = {
  "query": {
    "bool": {
      "should": [
        { "match": { "content": { "query": "python asyncio", "boost": 2.0 } } },
        { "match": { "content": { "query": "coroutines event loop", "boost": 0.5 } } }
      ]
    }
  }
};
```

BM25 with proper field boosting handles 80% of relevance needs. Add semantic search when you have vocabulary mismatch problems. Add LTR when you have enough click/judgment data to train on. Premature complexity adds maintenance burden without proportional benefit.
How do you know if your relevance tuning actually improved search quality? You need quantitative metrics computed against labeled test data. These metrics are essential for comparing configurations and tracking quality over time.
| Metric | What It Measures | Range | Best For |
|---|---|---|---|
| Precision@k | Proportion of top-k results that are relevant | 0-1 (higher is better) | When users only look at first page |
| Recall@k | Proportion of all relevant docs that appear in top-k | 0-1 (higher is better) | When finding ALL relevant docs matters |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | 0-1 (higher is better) | When users want ONE good result (navigation) |
| MAP (Mean Average Precision) | Average precision at each relevant result position | 0-1 (higher is better) | Balanced precision/recall across result set |
| NDCG (Normalized DCG) | Weighted relevance with position discount | 0-1 (higher is better) | When you have graded relevance (not just binary) |
```typescript
// Implementing search quality metrics

interface RankedResult {
  docId: string;
  rank: number;
  isRelevant: boolean;
  relevanceGrade?: number; // 0-3 for graded relevance
}

interface RelevanceJudgments {
  queryId: string;
  relevantDocs: Set<string>;
  gradedRelevance?: Map<string, number>; // docId -> grade
}

// =============== PRECISION@K ===============
function precisionAtK(results: RankedResult[], k: number): number {
  const topK = results.slice(0, k);
  const relevantCount = topK.filter(r => r.isRelevant).length;
  return relevantCount / k;
}

// Example: [R, R, N, R, N, N, N, R, N, N]
// P@5 = 3/5 = 0.60
// P@10 = 4/10 = 0.40

// =============== RECALL@K ===============
function recallAtK(
  results: RankedResult[],
  allRelevant: Set<string>,
  k: number
): number {
  const topK = results.slice(0, k);
  const relevantFound = topK.filter(r => r.isRelevant).length;
  return relevantFound / allRelevant.size;
}

// =============== MRR (Mean Reciprocal Rank) ===============
function reciprocalRank(results: RankedResult[]): number {
  const firstRelevant = results.find(r => r.isRelevant);
  if (!firstRelevant) return 0;
  return 1 / firstRelevant.rank;
}

function meanReciprocalRank(queryResults: RankedResult[][]): number {
  const rrs = queryResults.map(reciprocalRank);
  return rrs.reduce((a, b) => a + b, 0) / rrs.length;
}

// =============== NDCG (Normalized Discounted Cumulative Gain) ===============
function dcg(results: RankedResult[], k: number): number {
  return results.slice(0, k).reduce((sum, result, i) => {
    const relevance = result.relevanceGrade || (result.isRelevant ? 1 : 0);
    const position = i + 1;
    // DCG = Σ (2^rel - 1) / log2(position + 1)
    return sum + (Math.pow(2, relevance) - 1) / Math.log2(position + 1);
  }, 0);
}

function idealDCG(relevanceGrades: number[], k: number): number {
  // Sort by relevance descending for ideal ranking
  const sorted = [...relevanceGrades].sort((a, b) => b - a);
  return sorted.slice(0, k).reduce((sum, rel, i) => {
    return sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2);
  }, 0);
}

function ndcg(results: RankedResult[], idealGrades: number[], k: number): number {
  const actual = dcg(results, k);
  const ideal = idealDCG(idealGrades, k);
  return ideal > 0 ? actual / ideal : 0;
}

// =============== RUNNING EVALUATION ===============

interface EvaluationResult {
  queryId: string;
  precisionAt5: number;
  precisionAt10: number;
  mrr: number;
  ndcgAt10: number;
}

async function evaluateSearch(
  queries: { id: string; text: string }[],
  judgments: Map<string, RelevanceJudgments>,
  searchFn: (query: string) => Promise<RankedResult[]>
): Promise<EvaluationResult[]> {
  const results: EvaluationResult[] = [];

  for (const query of queries) {
    const searchResults = await searchFn(query.text);
    const judged = judgments.get(query.id);

    if (judged) {
      // Mark relevance based on judgments
      searchResults.forEach(r => {
        r.isRelevant = judged.relevantDocs.has(r.docId);
        r.relevanceGrade = judged.gradedRelevance?.get(r.docId) || 0;
      });

      results.push({
        queryId: query.id,
        precisionAt5: precisionAtK(searchResults, 5),
        precisionAt10: precisionAtK(searchResults, 10),
        mrr: reciprocalRank(searchResults),
        ndcgAt10: ndcg(searchResults, Array.from(judged.gradedRelevance?.values() || []), 10)
      });
    }
  }

  return results;
}
```

Creating relevance judgments (human-labeled query-document relevance pairs) is time-consuming but essential for meaningful evaluation. Start with a small set of important queries. Use click data as a proxy when human judgments aren't available, but be aware of biases (users can only click what they see).
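If you do fall back on click data, one minimal and entirely hypothetical way to bootstrap a judgment set is sketched below. The ClickEvent shape and the click-count threshold are assumptions rather than any library API, and the function reuses the RelevanceJudgments interface from the example above.

```typescript
// Hypothetical sketch: deriving rough binary relevance judgments from click logs.
// The ClickEvent shape and the minClicks threshold are illustrative assumptions.

interface ClickEvent {
  queryId: string;
  docId: string;
  position: number; // rank at which the result was shown (1-based)
}

function judgmentsFromClicks(
  clicks: ClickEvent[],
  minClicks: number = 3 // require repeated clicks to reduce noise
): Map<string, RelevanceJudgments> {
  const counts = new Map<string, Map<string, number>>(); // queryId -> docId -> click count

  for (const click of clicks) {
    const perQuery = counts.get(click.queryId) ?? new Map<string, number>();
    perQuery.set(click.docId, (perQuery.get(click.docId) ?? 0) + 1);
    counts.set(click.queryId, perQuery);
  }

  const judgments = new Map<string, RelevanceJudgments>();
  for (const [queryId, perQuery] of counts) {
    const relevantDocs = new Set<string>();
    for (const [docId, n] of perQuery) {
      if (n >= minClicks) relevantDocs.add(docId); // crude proxy for relevance
    }
    judgments.set(queryId, { queryId, relevantDocs });
  }

  // Caveat: clicks are position-biased (users click what they see near the top),
  // so treat these judgments as a starting point, not ground truth.
  return judgments;
}
```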
Relevance scoring determines whether your search system delights users or frustrates them. BM25 provides a robust foundation, but effective search requires understanding the algorithm, tuning its parameters, and knowing when to go beyond lexical matching.
Module Complete:
With this final page, you've completed the Full-Text Search module. You now understand the complete journey from raw text to ranked results: tokenization breaks text into searchable units, stemming and lemmatization normalize word forms, stopwords are strategically handled, language-specific processing addresses linguistic diversity, and relevance scoring ranks results by importance.
This foundation prepares you for the next module, where we'll dive into Elasticsearch architecture—how these algorithms scale across distributed clusters to handle billions of documents and thousands of queries per second.
You've mastered the complete full-text search pipeline: from tokenization through relevance scoring. You understand how text becomes searchable, how linguistic variations are handled, and how BM25 ranks results. This knowledge enables you to build, tune, and debug search systems that deliver relevant results to users.