A search for "python programming tutorial" in a well-indexed documentation system might match 50,000 documents. All of them contain those words. But which documents should appear first?
This is the ranking problem—arguably the most important challenge in search. Finding matching documents is necessary but insufficient. Users expect the most relevant results at the top, with relevance declining smoothly as they scroll. A search system that returns results in random order is essentially useless, no matter how complete its coverage.
Relevance scoring assigns a numerical score to each matching document, enabling ranking. The algorithms that compute these scores—primarily TF-IDF and BM25—represent decades of information retrieval research distilled into production-ready formulas. Understanding them is essential for any engineer building or tuning search systems.
By the end of this page, you will understand the mathematical foundations of TF-IDF and BM25, how modern search engines compute relevance scores, techniques for tuning and customizing scoring, and when to go beyond traditional text-based ranking with learning-to-rank and semantic search.
Before diving into algorithms, we need to understand what makes a document "relevant" and how we can quantify relevance computationally.
Two fundamental insights drive relevance scoring:

- A term that appears many times in a document suggests the document is about that term (term frequency).
- A term that appears in only a few documents is more informative than one that appears everywhere (inverse document frequency).

These two insights combine into TF-IDF and its successor BM25. But there are additional factors that influence perceived relevance:

- How frequently each query term appears in the document
- How rare each query term is across the corpus
- How long the document is relative to the rest of the corpus
- Where the match occurs (title, heading, body, metadata)
- How fresh, popular, or authoritative the document is
- What the individual searcher is actually looking for
Classic TF-IDF and BM25 handle the first three factors. The others require additional signals and custom scoring.
```typescript
// Intuition behind relevance scoring

interface Document {
  id: string;
  content: string;
  wordCount: number;
}

// Query: "python asyncio tutorial"
const candidates: Document[] = [
  {
    id: "doc1",
    content: "Python asyncio tutorial: Learn Python asyncio from scratch...",
    wordCount: 500,
    // "python" appears 15 times, "asyncio" 12 times, "tutorial" 5 times
  },
  {
    id: "doc2",
    content: "Programming tutorials for beginners. Python is popular...",
    wordCount: 2000,
    // "python" appears 3 times, "tutorial" 8 times, no "asyncio"
  },
  {
    id: "doc3",
    content: "The complete guide to asyncio patterns in Python applications...",
    wordCount: 800,
    // "python" appears 6 times, "asyncio" 20 times, no "tutorial"
  }
];

// Intuitive ranking:
// 1. doc1 - All query terms, high frequency, focused document
// 2. doc3 - Most asyncio mentions (rare term), good Python coverage
// 3. doc2 - Common terms only, no asyncio, less focused

// TF-IDF and BM25 formalize this intuition mathematically
```

TF-IDF (Term Frequency-Inverse Document Frequency) is the foundational relevance scoring algorithm. Invented in the 1970s, it remains conceptually important even though most modern search engines use BM25. Understanding TF-IDF is essential for understanding BM25.
The formula:
$$TF\text{-}IDF(t, d, D) = TF(t, d) \times IDF(t, D)$$
Where:

- $t$ is a query term, $d$ is the document being scored, and $D$ is the full document collection
- $TF(t, d)$ measures how often term $t$ appears in document $d$
- $IDF(t, D)$ measures how rare term $t$ is across the collection $D$
```typescript
// TF-IDF Implementation and Examples

// =============== TERM FREQUENCY (TF) ===============
// Several variants exist:

// Raw TF: Simply count occurrences
const rawTF = (term: string, document: string[]): number => {
  return document.filter(word => word === term).length;
};

// Logarithmic TF: Dampens the effect of high frequencies
// Prevents a document with "python" 100 times from scoring 100x
// higher than one with it 1 time
const logTF = (term: string, document: string[]): number => {
  const raw = rawTF(term, document);
  return raw > 0 ? 1 + Math.log(raw) : 0;
};

// Helper: frequency of each distinct term in the document
const getTermFrequencies = (document: string[]): number[] => {
  const counts = new Map<string, number>();
  for (const word of document) counts.set(word, (counts.get(word) || 0) + 1);
  return [...counts.values()];
};

// Augmented TF: Normalizes by maximum term frequency in document
// Prevents bias toward longer documents
const augmentedTF = (term: string, document: string[]): number => {
  const raw = rawTF(term, document);
  const maxTF = Math.max(...getTermFrequencies(document));
  return 0.5 + (0.5 * raw / maxTF);
};

// =============== INVERSE DOCUMENT FREQUENCY (IDF) ===============
// Measures how rare/informative a term is across the corpus

// Standard IDF
const idf = (term: string, corpus: string[][]): number => {
  const N = corpus.length;                                    // Total documents
  const df = corpus.filter(doc => doc.includes(term)).length; // Docs containing term
  // Add 1 to prevent division by zero for unseen terms
  return Math.log(N / (df + 1));
};

// Smooth IDF (used by Elasticsearch)
const smoothIDF = (totalDocs: number, docFreq: number): number => {
  // log(1 + (N - df + 0.5) / (df + 0.5))
  return Math.log(1 + (totalDocs - docFreq + 0.5) / (docFreq + 0.5));
};

// =============== COMPLETE TF-IDF ===============
const tfIdf = (term: string, document: string[], corpus: string[][]): number => {
  const tf = logTF(term, document);
  const idfScore = idf(term, corpus);
  return tf * idfScore;
};

// =============== EXAMPLE CALCULATION ===============
// Corpus of 10,000 documents
// Query: "python asyncio"
// Document: 500 words, "python" appears 10 times, "asyncio" appears 5 times

const exampleScoring = {
  // Term: "python"
  python: {
    tf: 1 + Math.log(10),          // = 3.30 (log TF)
    docsContaining: 5000,          // 50% of corpus
    idf: Math.log(10000 / 5000),   // = 0.69 (common term, low IDF)
    tfIdf: 3.30 * 0.69             // = 2.28
  },

  // Term: "asyncio"
  asyncio: {
    tf: 1 + Math.log(5),           // = 2.61 (log TF)
    docsContaining: 200,           // 2% of corpus
    idf: Math.log(10000 / 200),    // = 3.91 (rare term, high IDF)
    tfIdf: 2.61 * 3.91             // = 10.21
  },

  // Total document score for query
  totalScore: 2.28 + 10.21         // = 12.49
};

// Note: "asyncio" contributes more despite lower TF
// because its rarity (high IDF) makes it more informative
```

The power of TF-IDF comes primarily from IDF. Without it, common words like "the" and "is" would dominate scoring. IDF ensures that distinctive terms—the ones that actually distinguish relevant documents from irrelevant ones—receive appropriate weight.
BM25 (Best Match 25) is the default ranking algorithm in Elasticsearch, Solr, and most modern search engines. Developed in the 1990s based on probabilistic information retrieval theory, BM25 addresses several limitations of classic TF-IDF.
Key improvements over TF-IDF:

- **Term frequency saturation:** additional occurrences of a term yield diminishing returns, controlled by the k1 parameter, so a document repeating "python" 100 times doesn't score wildly higher than one mentioning it a handful of times.
- **Document length normalization:** scores are adjusted by document length relative to the corpus average, controlled by the b parameter, so long documents don't win simply by containing more words.
- **Tunable parameters:** k1 and b can be adjusted per corpus and use case rather than being fixed by the formula.
The BM25 formula:

$$score(D, Q) = \sum_{i=1}^{n} IDF(q_i) \times \frac{f(q_i, D) \times (k_1 + 1)}{f(q_i, D) + k_1 \times \left(1 - b + b \times \frac{|D|}{avgdl}\right)}$$
```typescript
// BM25 Algorithm Implementation

/**
 * BM25 Scoring Formula:
 *
 * For a query Q containing terms q1, q2, ... qn:
 *
 * score(D, Q) = Σ IDF(qi) × (f(qi, D) × (k1 + 1)) / (f(qi, D) + k1 × (1 - b + b × |D|/avgdl))
 *
 * Where:
 *   f(qi, D) = frequency of term qi in document D
 *   |D|      = document length (in words)
 *   avgdl    = average document length in the corpus
 *   k1       = term frequency saturation parameter (typically 1.2-2.0)
 *   b        = document length normalization parameter (typically 0.75)
 */

interface BM25Config {
  k1: number; // Term frequency saturation (1.2-2.0)
  b: number;  // Length normalization (0-1, typically 0.75)
}

interface CorpusStats {
  totalDocs: number;
  avgDocLength: number;
  docFrequencies: Map<string, number>; // term -> docs containing term
}

// IDF component (same concept as TF-IDF, slightly different formula)
function bm25IDF(term: string, stats: CorpusStats): number {
  const N = stats.totalDocs;
  const df = stats.docFrequencies.get(term) || 0;
  // Elasticsearch formula: log(1 + (N - df + 0.5) / (df + 0.5))
  return Math.log(1 + (N - df + 0.5) / (df + 0.5));
}

// Score for a single term
function bm25TermScore(
  termFreq: number,
  docLength: number,
  idf: number,
  config: BM25Config,
  avgDocLength: number
): number {
  const { k1, b } = config;
  // Length normalization factor
  const lengthNorm = 1 - b + b * (docLength / avgDocLength);
  // TF saturation formula
  const tfComponent = (termFreq * (k1 + 1)) / (termFreq + k1 * lengthNorm);
  return idf * tfComponent;
}

// Complete BM25 score for a document
function bm25Score(
  queryTerms: string[],
  document: { terms: Map<string, number>; length: number },
  stats: CorpusStats,
  config: BM25Config = { k1: 1.2, b: 0.75 }
): number {
  let totalScore = 0;
  for (const term of queryTerms) {
    const termFreq = document.terms.get(term) || 0;
    if (termFreq > 0) {
      const idf = bm25IDF(term, stats);
      const termScore = bm25TermScore(
        termFreq,
        document.length,
        idf,
        config,
        stats.avgDocLength
      );
      totalScore += termScore;
    }
  }
  return totalScore;
}

// =============== WORKED EXAMPLE ===============

const exampleStats: CorpusStats = {
  totalDocs: 10000,
  avgDocLength: 500,
  docFrequencies: new Map([
    ["python", 5000],   // 50% of docs
    ["asyncio", 200],   // 2% of docs
    ["tutorial", 3000], // 30% of docs
  ])
};

const exampleDoc = {
  terms: new Map([
    ["python", 10],
    ["asyncio", 5],
    ["tutorial", 2],
  ]),
  length: 400 // Shorter than average
};

// Score breakdown (k1=1.2, b=0.75):
// "python":   IDF ≈ 0.69, TF = 10, score ≈ 1.4
// "asyncio":  IDF ≈ 3.91, TF = 5,  score ≈ 7.1
// "tutorial": IDF ≈ 1.20, TF = 2,  score ≈ 1.8
//
// Total ≈ 10.3

// Compare to a longer document (1000 words) with same term frequencies:
// Scores would be lower due to length normalization
```

k1 controls term frequency saturation. Higher k1 (e.g., 2.0) means term frequency continues to matter more; lower k1 (e.g., 0.5) saturates quickly. b controls length normalization. b=1 fully normalizes by length; b=0 ignores length. The defaults (k1=1.2, b=0.75) work well for most content, but you may need to tune for short/long documents.
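To make saturation concrete, the short sketch below reuses the bm25TermScore function from the implementation above, holding IDF and document length fixed so only the term-frequency component varies. The specific values are illustrative, not prescriptive.

```typescript
// Illustrative sketch: how the BM25 TF component saturates as term frequency grows.
// Assumes the bm25TermScore() helper defined in the example above.

const demoConfigLow = { k1: 0.5, b: 0.75 };  // saturates quickly
const demoConfigHigh = { k1: 2.0, b: 0.75 }; // keeps rewarding repetition longer

for (const tf of [1, 2, 5, 10, 50]) {
  // IDF fixed at 1.0 so only the TF component varies; doc length equals the average
  const low = bm25TermScore(tf, 500, 1.0, demoConfigLow, 500);
  const high = bm25TermScore(tf, 500, 1.0, demoConfigHigh, 500);
  console.log(`tf=${tf}  k1=0.5 -> ${low.toFixed(2)}  k1=2.0 -> ${high.toFixed(2)}`);
}

// With k1=0.5 the score approaches its ceiling (k1 + 1 = 1.5) after a few occurrences;
// with k1=2.0 the ceiling is 3.0 and each extra occurrence still adds noticeable score.
```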
The default BM25 parameters work well for general content, but specific domains often benefit from tuning. Understanding what each parameter does enables informed optimization.
Parameter effects:
| Parameter | Default | Low Value Effect | High Value Effect | When to Adjust |
|---|---|---|---|---|
| k1 (saturation) | 1.2 | Quick saturation; TF matters less beyond first few occurrences | Slow saturation; high TF documents score much higher | Lower for short docs (tweets); higher for long technical content |
| b (length norm) | 0.75 | Minimal length penalty; long docs may dominate | Strong length normalization; short docs favored | Lower when doc length correlates with quality; higher when length is noise |
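The table's effects become concrete with a few numbers. The sketch below again assumes the bm25TermScore helper from the BM25 implementation above and a hypothetical corpus with a 500-word average document; the exact scores are only illustrative.

```typescript
// Illustrative sketch: effect of the b parameter on short vs. long documents.
// Assumes bm25TermScore() from the BM25 implementation above; average doc length is 500 words.

const idfFixed = 2.0; // hypothetical IDF, held constant
const freq = 5;       // same term frequency in both documents

for (const b of [0.0, 0.75, 1.0]) {
  const shortDoc = bm25TermScore(freq, 200, idfFixed, { k1: 1.2, b }, 500);
  const longDoc = bm25TermScore(freq, 2000, idfFixed, { k1: 1.2, b }, 500);
  console.log(`b=${b}  200-word doc: ${shortDoc.toFixed(2)}  2000-word doc: ${longDoc.toFixed(2)}`);
}

// b = 0:   both documents score the same (length is ignored)
// b > 0:   the short document (below average length) gets a boost while the
//          long document is penalized; the effect grows as b approaches 1
```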
```typescript
// BM25 parameter tuning strategies

// =============== Elasticsearch BM25 Configuration ===============

const customBM25Settings = {
  "settings": {
    "index": {
      "similarity": {
        "custom_bm25": {
          "type": "BM25",
          "k1": 1.2, // Default
          "b": 0.75  // Default
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "content": {
        "type": "text",
        "similarity": "custom_bm25"
      }
    }
  }
};

// =============== Use Case-Specific Tuning ===============

const tuningGuide = {
  // Tweet/short text search
  shortText: {
    k1: 0.5, // Saturate quickly - in 280 chars, term appearing once is enough signal
    b: 0.3,  // Minimal length normalization - all docs are short anyway
    rationale: "Short documents have limited term frequency range; length differences are meaningful"
  },

  // Legal/academic documents
  longDocuments: {
    k1: 2.0, // Allow TF to keep mattering in 50-page documents
    b: 0.75, // Standard length normalization
    rationale: "Long documents benefit from high TF distinguishing comprehensive coverage"
  },

  // Product titles
  productTitles: {
    k1: 1.5, // Moderate saturation
    b: 0.0,  // NO length normalization - "iPhone 15 Pro" shouldn't beat "iPhone" just for being longer
    rationale: "Title length is intentional and meaningful"
  },

  // Q&A / Stack Overflow style
  questionsAndAnswers: {
    k1: 1.2,
    b: 0.9, // Strong length normalization - concise answers often better
    rationale: "Longer answers aren't necessarily better; reward focused content"
  },

  // Log/event search
  logSearch: {
    k1: 0.7, // Logs are often sparse; term appearing once is significant
    b: 0.1,  // Log length varies but doesn't indicate quality
    rationale: "Structure matters more than term frequency in logs"
  }
};

// =============== A/B Testing for Parameters ===============

interface SearchExperiment {
  name: string;
  k1: number;
  b: number;
  ndcg: number; // Normalized Discounted Cumulative Gain
  mrr: number;  // Mean Reciprocal Rank
}

const experimentResults: SearchExperiment[] = [
  { name: "baseline", k1: 1.2, b: 0.75, ndcg: 0.72, mrr: 0.68 },
  { name: "low_k1",   k1: 0.8, b: 0.75, ndcg: 0.69, mrr: 0.71 },
  { name: "high_k1",  k1: 1.8, b: 0.75, ndcg: 0.74, mrr: 0.67 },
  { name: "low_b",    k1: 1.2, b: 0.4,  ndcg: 0.73, mrr: 0.69 },
  { name: "high_b",   k1: 1.2, b: 0.9,  ndcg: 0.70, mrr: 0.72 },
];

// Best configuration depends on your evaluation metric priorities
// NDCG values overall result quality; MRR values top-1 precision
```

Never tune BM25 parameters based on intuition alone. Use relevance judgments (query + relevant document pairs) and measure impact with metrics like NDCG, MRR, or precision@k. Small parameter changes can significantly affect ranking quality—both positively and negatively.
In real documents, not all content is equally important. A match in the title is usually more relevant than a match in the footer. Field-level boosting allows you to weight matches by where they occur.
Common boosting patterns:
```typescript
// Field boosting strategies in Elasticsearch

// =============== QUERY-TIME BOOSTING ===============

// Multi-match with field boosts
const multiMatchQuery = {
  "query": {
    "multi_match": {
      "query": "python asyncio tutorial",
      "fields": [
        "title^5",      // Title matches worth 5x
        "headings^3",   // Section headings worth 3x
        "content^1",    // Body content baseline
        "tags^4",       // Tags are highly relevant
        "metadata^0.5"  // Metadata less important
      ],
      "type": "best_fields", // Score based on best matching field
      "tie_breaker": 0.3     // Also consider other field matches
    }
  }
};

// Understanding tie_breaker:
// - 0:   Only best field counts (default for best_fields)
// - 0.3: Best field + 30% of other matching fields
// - 1.0: Sum of all fields (like most_fields)

// =============== MULTI-MATCH TYPES ===============

const multiMatchTypes = {
  // best_fields: Use score from best matching field
  // Good for: Finding documents where query appears in ONE key field
  bestFields: {
    "type": "best_fields",
    "fields": ["title^2", "content"]
  },

  // most_fields: Sum scores from all matching fields
  // Good for: Documents where query might appear across multiple fields
  mostFields: {
    "type": "most_fields",
    "fields": ["title", "title.stemmed", "title.exact"] // Multiple analyzers
  },

  // cross_fields: Treat all fields as one big field
  // Good for: Queries that span fields (person's first + last name)
  crossFields: {
    "type": "cross_fields",
    "fields": ["first_name", "last_name"],
    "operator": "and" // All terms must be found across fields
  },

  // phrase: Match as phrase in each field
  // Good for: Exact phrase matching with field preferences
  phrase: {
    "type": "phrase",
    "fields": ["title^2", "content"],
    "slop": 2 // Allow 2 words between query terms
  }
};

// =============== FUNCTION SCORE FOR COMPLEX BOOSTING ===============

const functionScoreQuery = {
  "query": {
    "function_score": {
      "query": {
        "multi_match": {
          "query": "python tutorial",
          "fields": ["title^2", "content"]
        }
      },
      "functions": [
        // Boost newer documents
        {
          "filter": { "range": { "date": { "gte": "now-1M" } } },
          "weight": 1.5
        },
        // Boost popular documents (log scale)
        {
          "field_value_factor": {
            "field": "view_count",
            "factor": 1.2,
            "modifier": "log1p", // log(1 + view_count * 1.2)
            "missing": 1
          }
        },
        // Boost verified authors
        {
          "filter": { "term": { "author.verified": true } },
          "weight": 1.2
        },
        // Decay by age (documents get less relevant over time)
        {
          "gauss": {
            "date": {
              "origin": "now",
              "scale": "30d",
              "offset": "7d",
              "decay": 0.5
            }
          }
        }
      ],
      "score_mode": "multiply", // How to combine function scores
      "boost_mode": "multiply"  // How to combine with query score
    }
  }
};
```

Begin with basic field boosting (title^2, content^1). Only add function_score complexity when basic boosting proves insufficient. Each boost factor you add is a parameter to tune and maintain. Simpler ranking functions are easier to understand and debug.
When search results don't rank as expected, you need to understand how scores were computed. Search engines provide explain APIs that break down exactly how each score was calculated.
Debugging relevance issues:
```typescript
// Debugging relevance with Elasticsearch explain API

// =============== GET SCORE EXPLANATION ===============

// Add "explain": true to your search
const explainRequest = {
  "query": {
    "match": { "content": "python tutorial" }
  },
  "explain": true // Returns score breakdown
};

// Response includes explanation for each hit:
const explainResponse = {
  "hits": {
    "hits": [{
      "_id": "doc1",
      "_score": 12.5,
      "_explanation": {
        "value": 12.5,
        "description": "sum of:",
        "details": [
          {
            "value": 8.3,
            "description": "weight(content:python) [BM25]",
            "details": [
              { "value": 2.1, "description": "idf, log(1 + (N - n + 0.5) / (n + 0.5))" },
              { "value": 3.95, "description": "tf, freq=10.0, with k1=1.2, b=0.75" }
            ]
          },
          {
            "value": 4.2,
            "description": "weight(content:tutorial) [BM25]",
            "details": [/* ... */]
          }
        ]
      }
    }]
  }
};

// =============== EXPLAIN A SPECIFIC DOCUMENT ===============

// Why did this document score what it did?
const explainDocRequest = {
  // POST /my_index/_explain/doc_id
  "query": {
    "match": { "content": "python tutorial" }
  }
};

// =============== COMMON DEBUGGING SCENARIOS ===============

const debuggingGuide = {
  // Issue: Document with more term occurrences ranks lower
  unexpectedlyLowScore: {
    possibleCauses: [
      "Document is very long (length normalization)",
      "Terms are common (low IDF)",
      "Field boost is lower than expected",
      "Terms match a stemmed version (different IDF)"
    ],
    debugSteps: [
      "Check explain output for IDF values",
      "Compare document lengths",
      "Verify field boosts in query",
      "Look at tf normalization factor"
    ]
  },

  // Issue: Irrelevant document ranks high
  unexpectedlyHighScore: {
    possibleCauses: [
      "Document is very short (benefits from length norm)",
      "Term appears in boosted field (title)",
      "Query terms are accidentally common in this doc",
      "Synonym expansion matched unintended terms"
    ],
    debugSteps: [
      "Check which field matched (explain shows field)",
      "Verify analyzer behavior with _analyze",
      "Check for synonym expansion issues",
      "Review field boosting configuration"
    ]
  },

  // Issue: Perfect document doesn't appear at all
  missingDocument: {
    possibleCauses: [
      "Analyzer processes query differently than document",
      "Stopword removal eliminated query",
      "Stemmer created mismatch",
      "Document was filtered (not a scoring issue)"
    ],
    debugSteps: [
      "Use _analyze API on query terms",
      "Use _termvectors to see how doc was indexed",
      "Check if filters are excluding the document",
      "Verify index contains the document"
    ]
  }
};

// =============== ANALYZER DEBUGGING ===============

// Check how a term is analyzed
const analyzeRequest = {
  // POST /my_index/_analyze
  "analyzer": "my_analyzer",
  "text": "running pythons"
};
// Response: ["run", "python"] (after stemming)
// If document was indexed with different analyzer, mismatch occurs!

// Check how a document was actually indexed
const termVectorsRequest = {
  // GET /my_index/_termvectors/doc_id
  "fields": ["content"],
  "term_statistics": true,
  "field_statistics": true
};
```

The explain parameter significantly slows down queries—it's for debugging only. Never enable it in production search requests. Use it on a specific document ID or a limited result set during development and troubleshooting.
While BM25 remains the backbone of most search systems, modern approaches add layers on top for improved relevance. These techniques address BM25's limitations: it only considers lexical matching, ignores user behavior, and can't learn from feedback.
Modern relevance enhancements:
| Technique | How It Works | Best For | Complexity |
|---|---|---|---|
| Learning to Rank (LTR) | ML model trained on click data/judgments combines multiple signals | E-commerce, content sites with click data | High: requires training data and infrastructure |
| Semantic Search (Vector/Dense Retrieval) | Encode queries and docs as vectors; find nearest neighbors | When lexical mismatch is common; Q&A systems | High: requires embedding models, vector DB |
| Hybrid (BM25 + Vectors) | Combine BM25 lexical scores with semantic similarity | Best of both worlds; recommended starting point | Medium: add vector search to existing |
| Query Understanding | Rewrite/expand queries before BM25 scoring | Handling synonyms, spell correction, intent | Medium: NLP pipeline before search |
| Personalization | Adjust scores based on user history/preferences | Content platforms, e-commerce | Medium: requires user data |
```typescript
// Modern relevance approaches

// =============== LEARNING TO RANK (LTR) ===============
// Use ML to combine signals: BM25, popularity, freshness, etc.

interface LTRFeatures {
  bm25_score: number;
  title_match: boolean;
  doc_length: number;
  page_views: number;
  days_since_publish: number;
  author_reputation: number;
  has_images: boolean;
}

// Training data: [features, relevance_label]
// Model learns optimal combination of signals

const ltrPluginQuery = {
  "query": {
    "sltr": {
      "_name": "logged_featureset",
      "featureset": "my_features",
      "params": { "keywords": "python tutorial" },
      "model": "my_trained_model"
    }
  }
};

// =============== SEMANTIC / VECTOR SEARCH ===============
// Encode text as dense vectors; find similar vectors

// First, encode documents and queries using an embedding model
// (e.g., Sentence-BERT, OpenAI embeddings)

const vectorSearchQuery = {
  // Elasticsearch 8.x+ with dense_vector field
  "knn": {
    "field": "content_vector",
    "query_vector": [0.12, -0.34, 0.56, /* ... 768 dims */],
    "k": 10,
    "num_candidates": 100
  }
};

// =============== HYBRID SEARCH (BM25 + VECTOR) ===============
// Best practice: combine lexical and semantic signals

const hybridSearchQuery = {
  "query": {
    "bool": {
      "should": [
        // Lexical matching (BM25)
        {
          "multi_match": {
            "query": "python async programming",
            "fields": ["title^2", "content"],
            "boost": 1.0
          }
        }
      ]
    }
  },
  // Semantic matching (KNN)
  "knn": {
    "field": "content_vector",
    "query_vector": embedQuery("python async programming"),
    "k": 10,
    "boost": 0.5 // Weight relative to BM25
  }
};

// Reciprocal Rank Fusion (RRF) for combining rankings
function reciprocalRankFusion(
  bm25Ranking: string[],
  vectorRanking: string[],
  k: number = 60 // Tunable parameter
): Map<string, number> {
  const scores = new Map<string, number>();
  // Score = Σ 1 / (k + rank)
  bm25Ranking.forEach((docId, rank) => {
    const current = scores.get(docId) || 0;
    scores.set(docId, current + 1 / (k + rank + 1));
  });
  vectorRanking.forEach((docId, rank) => {
    const current = scores.get(docId) || 0;
    scores.set(docId, current + 1 / (k + rank + 1));
  });
  return scores;
}

// =============== QUERY EXPANSION ===============
// Improve BM25 input by expanding the query

interface ExpandedQuery {
  original: string;
  synonyms: string[];
  spellingCorrections: string[];
  relatedTerms: string[];
}

async function expandQuery(query: string): Promise<ExpandedQuery> {
  return {
    original: "python asyncio",
    synonyms: ["python async io", "python asynchronous"],
    spellingCorrections: [],
    relatedTerms: ["coroutines", "event loop", "await"]
  };
}

// Build expanded query
const expandedQueryDSL = {
  "query": {
    "bool": {
      "should": [
        { "match": { "content": { "query": "python asyncio", "boost": 2.0 } } },
        { "match": { "content": { "query": "coroutines event loop", "boost": 0.5 } } }
      ]
    }
  }
};
```

BM25 with proper field boosting handles 80% of relevance needs. Add semantic search when you have vocabulary mismatch problems. Add LTR when you have enough click/judgment data to train on. Premature complexity adds maintenance burden without proportional benefit.
How do you know if your relevance tuning actually improved search quality? You need quantitative metrics computed against labeled test data. These metrics are essential for comparing configurations and tracking quality over time.
| Metric | What It Measures | Range | Best For |
|---|---|---|---|
| Precision@k | Proportion of top-k results that are relevant | 0-1 (higher is better) | When users only look at first page |
| Recall@k | Proportion of all relevant docs that appear in top-k | 0-1 (higher is better) | When finding ALL relevant docs matters |
| MRR (Mean Reciprocal Rank) | Average of 1/rank of first relevant result | 0-1 (higher is better) | When users want ONE good result (navigation) |
| MAP (Mean Average Precision) | Average precision at each relevant result position | 0-1 (higher is better) | Balanced precision/recall across result set |
| NDCG (Normalized DCG) | Weighted relevance with position discount | 0-1 (higher is better) | When you have graded relevance (not just binary) |
```typescript
// Implementing search quality metrics

interface RankedResult {
  docId: string;
  rank: number;
  isRelevant: boolean;
  relevanceGrade?: number; // 0-3 for graded relevance
}

interface RelevanceJudgments {
  queryId: string;
  relevantDocs: Set<string>;
  gradedRelevance?: Map<string, number>; // docId -> grade
}

// =============== PRECISION@K ===============
function precisionAtK(results: RankedResult[], k: number): number {
  const topK = results.slice(0, k);
  const relevantCount = topK.filter(r => r.isRelevant).length;
  return relevantCount / k;
}

// Example: [R, R, N, R, N, N, N, R, N, N]
// P@5 = 3/5 = 0.60
// P@10 = 4/10 = 0.40

// =============== RECALL@K ===============
function recallAtK(
  results: RankedResult[],
  allRelevant: Set<string>,
  k: number
): number {
  const topK = results.slice(0, k);
  const relevantFound = topK.filter(r => r.isRelevant).length;
  return relevantFound / allRelevant.size;
}

// =============== MRR (Mean Reciprocal Rank) ===============
function reciprocalRank(results: RankedResult[]): number {
  const firstRelevant = results.find(r => r.isRelevant);
  if (!firstRelevant) return 0;
  return 1 / firstRelevant.rank;
}

function meanReciprocalRank(queryResults: RankedResult[][]): number {
  const rrs = queryResults.map(reciprocalRank);
  return rrs.reduce((a, b) => a + b, 0) / rrs.length;
}

// =============== NDCG (Normalized Discounted Cumulative Gain) ===============
function dcg(results: RankedResult[], k: number): number {
  return results.slice(0, k).reduce((sum, result, i) => {
    const relevance = result.relevanceGrade || (result.isRelevant ? 1 : 0);
    const position = i + 1;
    // DCG = Σ (2^rel - 1) / log2(position + 1)
    return sum + (Math.pow(2, relevance) - 1) / Math.log2(position + 1);
  }, 0);
}

function idealDCG(relevanceGrades: number[], k: number): number {
  // Sort by relevance descending for ideal ranking
  const sorted = [...relevanceGrades].sort((a, b) => b - a);
  return sorted.slice(0, k).reduce((sum, rel, i) => {
    return sum + (Math.pow(2, rel) - 1) / Math.log2(i + 2);
  }, 0);
}

function ndcg(results: RankedResult[], idealGrades: number[], k: number): number {
  const actual = dcg(results, k);
  const ideal = idealDCG(idealGrades, k);
  return ideal > 0 ? actual / ideal : 0;
}

// =============== RUNNING EVALUATION ===============

interface EvaluationResult {
  queryId: string;
  precisionAt5: number;
  precisionAt10: number;
  mrr: number;
  ndcgAt10: number;
}

async function evaluateSearch(
  queries: { id: string; text: string }[],
  judgments: Map<string, RelevanceJudgments>,
  searchFn: (query: string) => Promise<RankedResult[]>
): Promise<EvaluationResult[]> {
  const results: EvaluationResult[] = [];

  for (const query of queries) {
    const searchResults = await searchFn(query.text);
    const judged = judgments.get(query.id);

    if (judged) {
      // Mark relevance based on judgments
      searchResults.forEach(r => {
        r.isRelevant = judged.relevantDocs.has(r.docId);
        r.relevanceGrade = judged.gradedRelevance?.get(r.docId) || 0;
      });

      results.push({
        queryId: query.id,
        precisionAt5: precisionAtK(searchResults, 5),
        precisionAt10: precisionAtK(searchResults, 10),
        mrr: reciprocalRank(searchResults),
        ndcgAt10: ndcg(searchResults, Array.from(judged.gradedRelevance?.values() || []), 10)
      });
    }
  }

  return results;
}
```

Creating relevance judgments (human-labeled query-document relevance pairs) is time-consuming but essential for meaningful evaluation. Start with a small set of important queries. Use click data as a proxy when human judgments aren't available, but be aware of biases (users can only click what they see).
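If you do fall back on click data, one minimal and entirely hypothetical way to bootstrap a judgment set is sketched below. The ClickEvent shape and the click-count threshold are assumptions rather than any library API, and the function reuses the RelevanceJudgments interface from the example above.

```typescript
// Hypothetical sketch: deriving rough binary relevance judgments from click logs.
// The ClickEvent shape and the minClicks threshold are illustrative assumptions.

interface ClickEvent {
  queryId: string;
  docId: string;
  position: number; // rank at which the result was shown (1-based)
}

function judgmentsFromClicks(
  clicks: ClickEvent[],
  minClicks: number = 3 // require repeated clicks to reduce noise
): Map<string, RelevanceJudgments> {
  const counts = new Map<string, Map<string, number>>(); // queryId -> docId -> click count

  for (const click of clicks) {
    const perQuery = counts.get(click.queryId) ?? new Map<string, number>();
    perQuery.set(click.docId, (perQuery.get(click.docId) ?? 0) + 1);
    counts.set(click.queryId, perQuery);
  }

  const judgments = new Map<string, RelevanceJudgments>();
  for (const [queryId, perQuery] of counts) {
    const relevantDocs = new Set<string>();
    for (const [docId, n] of perQuery) {
      if (n >= minClicks) relevantDocs.add(docId); // crude proxy for relevance
    }
    judgments.set(queryId, { queryId, relevantDocs });
  }

  // Caveat: clicks are position-biased (users click what they see near the top),
  // so treat these judgments as a starting point, not ground truth.
  return judgments;
}
```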
Relevance scoring determines whether your search system delights users or frustrates them. BM25 provides a robust foundation, but effective search requires understanding the algorithm, tuning its parameters, and knowing when to go beyond lexical matching.
Module Complete:
With this final page, you've completed the Full-Text Search module. You now understand the complete journey from raw text to ranked results: tokenization breaks text into searchable units, stemming and lemmatization normalize word forms, stopwords are strategically handled, language-specific processing addresses linguistic diversity, and relevance scoring ranks results by importance.
This foundation prepares you for the next module, where we'll dive into Elasticsearch architecture—how these algorithms scale across distributed clusters to handle billions of documents and thousands of queries per second.
You've mastered the complete full-text search pipeline: from tokenization through relevance scoring. You understand how text becomes searchable, how linguistic variations are handled, and how BM25 ranks results. This knowledge enables you to build, tune, and debug search systems that deliver relevant results to users.