A search engine that returns 10,000 results is useless if the best matches aren't on the first page. Relevance ranking—the science of ordering results so the most relevant appear first—is what separates a useful search system from a mere filter.
We've learned about TF-IDF scoring, but production systems go much further. They incorporate dozens of signals: term proximity, field importance, document quality, freshness, and even user behavior. This page brings together everything into complete ranking systems.
By the end of this page, you will understand BM25 in depth, field weighting strategies, proximity and phrase scoring, learning to rank approaches, and how to evaluate search quality using precision, recall, and nDCG metrics.
BM25 (Best Match 25) is the de facto standard for full-text relevance scoring. It's the default in Elasticsearch, Apache Solr, and most production search systems. Let's understand it deeply.
The Full BM25 Formula:
```
BM25 score for document D given query Q:

                 n             f(qi, D) × (k1 + 1)
  score(D, Q) =  Σ  IDF(qi) × ──────────────────────────────────────────
                i=1           f(qi, D) + k1 × (1 - b + b × |D| / avgdl)

Where:
  qi        = query term i
  f(qi, D)  = frequency of qi in document D
  |D|       = length of document D (in words)
  avgdl     = average document length in the collection
  k1        = term frequency saturation parameter (typical: 1.2)
  b         = length normalization parameter (typical: 0.75)

IDF formula (BM25 variant):

                   N - n(qi) + 0.5
  IDF(qi) = log ( ───────────────── )
                     n(qi) + 0.5

Where:
  N      = total documents in collection
  n(qi)  = documents containing qi
```

Understanding the Parameters:
| Parameter | Range | Low Value Effect | High Value Effect |
|---|---|---|---|
| k1 | 0.0 - 3.0+ | Frequency matters less (saturation early) | Frequency matters more (linear-ish) |
| b | 0.0 - 1.0 | No length penalty (long docs favored) | Full length normalization (short docs favored) |
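To make the formula and its two parameters concrete, here is a minimal Python sketch of BM25 scoring under the definitions above; the function names and the toy collection statistics are illustrative, not taken from any particular library.

```python
import math

def bm25_term_score(tf, doc_len, avgdl, df, n_docs, k1=1.2, b=0.75):
    """Score one query term against one document using the BM25 formula above.

    tf      -- frequency of the term in the document, f(qi, D)
    doc_len -- document length in words, |D|
    avgdl   -- average document length in the collection
    df      -- number of documents containing the term, n(qi)
    n_docs  -- total number of documents, N
    """
    idf = math.log((n_docs - df + 0.5) / (df + 0.5))
    norm = 1 - b + b * (doc_len / avgdl)          # length normalization factor
    return idf * (tf * (k1 + 1)) / (tf + k1 * norm)

def bm25_score(query_terms, doc_tf, doc_len, avgdl, df, n_docs, k1=1.2, b=0.75):
    """Sum the per-term scores for all query terms present in the document."""
    return sum(
        bm25_term_score(doc_tf[t], doc_len, avgdl, df[t], n_docs, k1, b)
        for t in query_terms if t in doc_tf
    )

# Toy example: 1,000-document collection, average length 500 words.
doc_tf = {"database": 3, "optimization": 1}       # term frequencies in this document
df = {"database": 120, "optimization": 45}        # document frequencies in the collection
print(bm25_score(["database", "optimization"], doc_tf, 400, 500.0, df, 1000))
```

Note that with this classic IDF variant, a term appearing in more than half the collection gets a negative IDF; implementations such as Lucene smooth the formula (adding 1 inside the logarithm) to keep scores non-negative.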
```
TF Saturation in BM25 (k1 = 1.2, assuming |D| = avgdl so the length factor is neutral):

  Raw TF | BM25 TF Component
  -------|-------------------
     1   |       1.00
     2   |       1.38
     3   |       1.57
     5   |       1.77
    10   |       1.96
    20   |       2.08
    50   |       2.15
   100   |       2.17   (approaching the limit of k1 + 1 = 2.2)

Key insight:
- The first occurrence is the most valuable
- Additional occurrences have diminishing returns
- After ~20 occurrences, the score barely increases

Compare to raw TF, where 100 occurrences = 100× the weight!
BM25's saturation prevents frequency spam from dominating.
```

Length Normalization in Action:
```
Length Normalization Factor:  (1 - b + b × |D| / avgdl)

Assume avgdl = 500 words, b = 0.75:

  Doc Length | Factor | Effect
  -----------|--------|--------------------------------------------
     100     |  0.40  | Short doc: boost (divide by a smaller number)
     250     |  0.625 | Below average: moderate boost
     500     |  1.0   | Average: neutral
    1000     |  1.75  | Long: penalty
    2500     |  4.0   | Very long: strong penalty

Why? Long documents naturally contain more term occurrences
just by having more words. Length normalization levels the field.

With b = 0:    Factor = 1 always (no normalization)
With b = 1:    Factor = |D| / avgdl (full normalization)
With b = 0.75: Balanced middle ground
```

The default k1 = 1.2 and b = 0.75 work well for general text search. For short documents (tweets, titles), reduce b toward 0.5. For technical documents where repetition matters more, increase k1 toward 2.0. Always validate with your actual queries and relevance judgments.
Documents have structure. Matching a term in the title is more significant than matching it in a footnote. Field weighting assigns different importance to matches in different document fields.
Common Field Importance Hierarchy:
```
Typical Document Fields by Importance:

  Field               | Suggested Weight | Rationale
  --------------------|------------------|----------------------------------
  Title               | 4.0 - 10.0       | Highest: directly describes content
  H1/H2 Headings      | 2.0 - 4.0        | Structure indicates importance
  Abstract/Summary    | 2.0 - 3.0        | Concentrated information
  Keywords/Tags       | 2.0 - 4.0        | Explicit topic markers
  First paragraph     | 1.5 - 2.0        | Often contains key points
  Body text           | 1.0 (baseline)   | Default weight
  Footnotes/Citations | 0.5 - 0.8        | Less relevant to main topic
  Comments/Metadata   | 0.3 - 0.5        | User-generated, less reliable

Example scoring:
  Query: "machine learning"

  Doc A: "machine learning" in title          → score × 5.0
  Doc B: "machine learning" in body (5 times) → score × 1.0

  Doc A likely ranks higher despite fewer occurrences!
```

Implementing Field Weights in Elasticsearch:
```
// Multi-field query with boosting
GET /articles/_search
{
  "query": {
    "multi_match": {
      "query": "database optimization",
      "fields": [
        "title^5",        // Title matches worth 5x
        "headings^3",     // Heading matches worth 3x
        "abstract^2",     // Abstract matches worth 2x
        "body",           // Body is baseline (1x)
        "footnotes^0.5"   // Footnotes discounted
      ],
      "type": "best_fields",   // Score = max of field scores
      "tie_breaker": 0.3       // Plus 0.3 × other field scores
    }
  }
}

// Alternative: Combine with bool query
{
  "query": {
    "bool": {
      "should": [
        { "match": { "title": { "query": "database", "boost": 5 }}},
        { "match": { "body":  { "query": "database", "boost": 1 }}}
      ]
    }
  }
}
```

PostgreSQL Field Weighting:
```sql
-- PostgreSQL uses weight labels A, B, C, D
-- By default: A=1.0, B=0.4, C=0.2, D=0.1

-- Create weighted tsvector
SELECT setweight(to_tsvector('english', COALESCE(title, '')), 'A') ||
       setweight(to_tsvector('english', COALESCE(abstract, '')), 'B') ||
       setweight(to_tsvector('english', COALESCE(body, '')), 'C') ||
       setweight(to_tsvector('english', COALESCE(footnotes, '')), 'D') AS document_vector
FROM articles;

-- Custom weights in ts_rank (the optional weights array is the FIRST argument)
SELECT title,
       ts_rank(
         ARRAY[0.1, 0.2, 0.4, 1.0],   -- Custom weights for D, C, B, A
         document_vector,
         to_tsquery('database & optimization')
       ) AS relevance
FROM articles
WHERE document_vector @@ to_tsquery('database & optimization')
ORDER BY relevance DESC;
```

Field weights should be tuned based on your specific content. E-commerce product search might heavily favor product name, while academic search might favor abstract equally with title. Use real user queries and click data to validate weights.
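For intuition on how the best_fields strategy combines per-field scores, here is a small Python sketch of the dis_max-style combination with a tie_breaker, as used in the Elasticsearch example above; the per-field scores are made-up numbers for illustration.

```python
def best_fields_score(field_scores, tie_breaker=0.3):
    """Combine per-field scores the way a dis_max / best_fields query does:
    take the best field's score, then add tie_breaker × each other field's score."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)

# Already-boosted per-field scores for one document, e.g. title^5, headings^3, body:
scores = [8.4, 3.1, 1.9]
print(best_fields_score(scores))         # 8.4 + 0.3 × (3.1 + 1.9) = 9.9
print(best_fields_score(scores, 0.0))    # Pure best_fields: 8.4
```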
When a user searches 'database optimization', documents where these words appear close together are likely more relevant than those where they're in different paragraphs. Proximity scoring rewards terms appearing near each other.
Proximity Concepts:
```
Query: "database optimization"

Document A: "This section covers database optimization techniques."
  Span: 2 words (adjacent)
  Score boost: Maximum (exact phrase)

Document B: "Database systems require optimization for performance."
  Span: 4 words (database ... optimization)
  Score boost: High (very close)

Document C: "We will optimize the database connection pool."
  Span: 3 words (optimize ... database)
  Order: Reversed
  Score boost: Moderate (close but reordered)

Document D: "The database stores millions of records. Performance
  optimization is critical for the application."
  Span: ~12 words across sentences
  Score boost: Low (same paragraph but far)

Document E: "Chapter 1: Database Design... Chapter 5: Optimization."
  Span: Hundreds of words
  Score boost: Minimal (different sections entirely)

Ranking: A > B > C > D > E
```

Minimum Span Scoring:
```python
def compute_minimum_span_score(query_terms, term_positions):
    """
    Compute proximity score based on the minimum span containing all query terms.

    term_positions: dict mapping term -> list of positions in the document
    """
    if not all(term in term_positions for term in query_terms):
        return 0  # Not all terms present

    # Create a merged list of (position, term) sorted by position
    all_positions = []
    for term in query_terms:
        for pos in term_positions[term]:
            all_positions.append((pos, term))
    all_positions.sort()

    min_span = float('inf')
    terms_needed = set(query_terms)

    # Sliding window to find the minimum span
    left = 0
    terms_in_window = {}

    for pos, term in all_positions:
        terms_in_window[term] = terms_in_window.get(term, 0) + 1

        # Shrink window from the left while all terms are covered
        while len(terms_in_window) == len(terms_needed):
            left_pos, left_term = all_positions[left]
            span = pos - left_pos + 1
            min_span = min(min_span, span)

            terms_in_window[left_term] -= 1
            if terms_in_window[left_term] == 0:
                del terms_in_window[left_term]
            left += 1

    # Convert span to score (smaller span = higher score)
    if min_span == float('inf'):
        return 0

    # Score formula: inverse of span, with a minimum cap
    return len(query_terms) / max(min_span, len(query_terms))


# Example:
#   query = ["database", "optimization"]
#   positions = {"database": [5, 42, 100], "optimization": [7, 105]}
#   Best span: positions 5-7 → span = 3
#   Score: 2 / 3 = 0.67
```

Exact phrase matching requires positional indexes. Proximity scoring also requires positions but is more forgiving—it still rewards nearness for non-exact matches. Many systems boost exact phrases separately from general proximity.
Beyond text matching, production systems incorporate many additional signals to determine ranking.
Document-Level Signals:
| Signal | Description | Use Case |
|---|---|---|
| PageRank/Authority | Link-based importance | Web search, citation networks |
| Freshness/Recency | Document publication date | News, time-sensitive content |
| Document Length | Word/character count | Prefer substantial content |
| Popularity/Views | Historical access count | E-commerce, social content |
| User Ratings | Explicit quality signals | Product reviews, Q&A sites |
| Click-Through Rate | Historical user engagement | Search result refinement |
| Geographic Relevance | Distance from user | Local search, maps |
Query-Document Interaction Signals:
```python
from math import log

def compute_final_score(doc, query, user_context):
    """
    Combine multiple signals into a final ranking score.
    Production systems may have 100+ features!

    Helpers used below (compute_bm25, match_in_*, compute_proximity,
    user_affinity, today) are assumed to be defined elsewhere.
    """
    # Text relevance (BM25)
    text_score = compute_bm25(doc, query)

    # Field-weighted matching
    field_score = (
        5.0 * match_in_title(doc, query) +
        2.0 * match_in_headings(doc, query) +
        1.0 * match_in_body(doc, query)
    )

    # Proximity bonus
    proximity_score = compute_proximity(doc, query)

    # Document quality signals
    quality_score = (
        0.3 * doc.pagerank +
        0.2 * doc.avg_rating +
        0.1 * log(doc.view_count + 1)
    )

    # Freshness (decay older content)
    days_old = (today - doc.published_date).days
    freshness_score = 1.0 / (1 + 0.01 * days_old)  # Reciprocal decay with age

    # Personalization
    personal_score = user_affinity(user_context, doc.topics)

    # Combine with learned weights
    final_score = (
        0.40 * text_score +
        0.20 * field_score +
        0.15 * proximity_score +
        0.10 * quality_score +
        0.10 * freshness_score +
        0.05 * personal_score
    )

    return final_score
```

Manually tuning signal weights is error-prone and doesn't scale. Modern systems learn these weights from user behavior data (Learning to Rank). The weights above are illustrative—real values come from training on relevance judgments.
Learning to Rank (LTR) uses machine learning to combine ranking signals optimally. Rather than manually tuning weights, we train a model on labeled relevance data.
The LTR Pipeline:
LTR Approaches:
| Approach | Objective | Algorithm Examples |
|---|---|---|
| Pointwise | Predict absolute relevance score | Linear Regression, Neural Net |
| Pairwise | Correctly order document pairs | RankNet, RankSVM, LambdaRank |
| Listwise | Optimize entire ranking list | ListNet, LambdaMART, ListMLE |
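As a concrete taste of the pairwise family, here is a tiny sketch of a RankNet-style logistic pair loss: the model is penalized whenever it scores a less relevant document above a more relevant one. The scores and labels below are made up, and a real trainer would backpropagate this loss into the scoring model.

```python
import math

def pairwise_rank_loss(score_preferred, score_other):
    """RankNet-style logistic loss for one ordered pair:
    small when the preferred document scores higher, large when the order is inverted."""
    return math.log(1 + math.exp(-(score_preferred - score_other)))

# Model scores for three documents with graded relevance labels (higher = better).
docs = [("doc_a", 2.1, 3), ("doc_b", 2.6, 1), ("doc_c", 0.4, 0)]  # (id, score, label)

total = 0.0
for _, s_i, y_i in docs:
    for _, s_j, y_j in docs:
        if y_i > y_j:                      # i should be ranked above j
            total += pairwise_rank_loss(s_i, s_j)
print(total)  # The mis-ordered pair (doc_a vs doc_b) contributes most of the loss
```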
```
Learning to Rank Feature Vector:

For each (query, document) pair, extract features:

  Feature Category   | Example Features
  -------------------|------------------------------------------------
  Text Matching      | BM25 score, TF-IDF score, term overlap
  Field Matching     | Title match, heading match, URL match
  Proximity          | Min span, phrase match count, same sentence
  Query Features     | Query length, term count, rare term count
  Document Features  | PageRank, length, freshness, ratings
  Match Statistics   | Terms matched, % coverage, IDF sum

Example feature vector:
  query:    "machine learning tutorial"
  document: "Introduction to Machine Learning - Stanford"

  features = [
      4.23,    # BM25 score
      0.85,    # Title match (85% terms match)
      0.67,    # Heading match score
      3,       # Query length (words)
      3,       # Document matches 3 of 3 terms
      0.92,    # Phrase proximity score
      7.2,     # PageRank score
      30,      # Days since publication
      4.5,     # Average user rating
      12500,   # View count
      ...
  ]

  Label: 4 (highly relevant) on 0-4 scale

Training data: Thousands of such labeled examples
```

LambdaMART: The Industry Workhorse
LambdaMART combines gradient boosted trees with a ranking-aware loss function. It's used in Bing, Yahoo, and many commercial search systems.
```
LambdaMART Algorithm Overview:

1. Initialize: Simple ranking (e.g., BM25 only)

2. For each boosting iteration:
   a. Compute gradients based on ranking quality
      - Consider all pairs of documents
      - The gradient focuses on pairs where a swap improves the ranking
      - Weight by how much nDCG would improve from the swap
   b. Train a regression tree to predict the gradient
   c. Update the model: add the tree with a learning rate

3. Final model: Ensemble of trees

Why it works:
- Trees handle non-linear feature interactions
- Lambda gradients directly optimize ranking metrics (nDCG)
- Focuses training on the important ranking decisions

Typical model:
- 500-2000 trees
- Max depth 6-10
- 50-200 features
```

LTR shines when you have relevance judgments (human labels or click data) and multiple ranking signals to combine. For simple cases with just text matching, BM25 may suffice. LTR adds complexity—use it when the quality improvement justifies the engineering cost.
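If you want to experiment, gradient-boosted LTR is available off the shelf. The sketch below trains a LambdaMART-style ranker with LightGBM's LGBMRanker, assuming LightGBM is installed; the feature matrix, labels, and group sizes are random stand-ins for real (query, document) feature vectors and relevance judgments.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(42)

# Stand-in training data: 100 queries × 20 candidate documents each,
# 10 features per (query, document) pair, graded labels 0-4.
n_queries, docs_per_query, n_features = 100, 20, 10
X = rng.normal(size=(n_queries * docs_per_query, n_features))
y = rng.integers(0, 5, size=n_queries * docs_per_query)
group = [docs_per_query] * n_queries   # number of documents per query, in row order

ranker = lgb.LGBMRanker(
    objective="lambdarank",   # LambdaMART-style ranking objective
    n_estimators=500,
    max_depth=8,
    learning_rate=0.05,
)
ranker.fit(X, y, group=group)

# Rank the candidates for one new query: higher predicted score = rank higher.
candidates = rng.normal(size=(20, n_features))
order = np.argsort(-ranker.predict(candidates))
print(order[:10])   # indices of the top-10 candidates after re-ranking
```

With random labels the model learns nothing useful, of course; the point is the shape of the pipeline: per-pair feature rows, per-query group sizes, and a ranking-aware objective.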
How do we measure whether a ranking is good? Evaluation metrics provide objective measures of search quality.
Precision and Recall:
```
Precision and Recall (Binary Relevance):

Consider a query with results and relevance judgments:
  R = Set of truly relevant documents (from human annotation)
  A = Set of returned documents (search results)

              |R ∩ A|     Relevant docs returned
  Precision = ───────── = ────────────────────────
                 |A|        Total docs returned

              |R ∩ A|     Relevant docs returned
  Recall    = ───────── = ────────────────────────
                 |R|        Total relevant docs

Example:
  True relevant: {doc1, doc2, doc3, doc5, doc8}   (5 relevant)
  Returned:      {doc1, doc2, doc4, doc6, doc3}   (5 returned)
  Intersection:  {doc1, doc2, doc3}               (3 correct)

  Precision = 3/5 = 0.6   (60% of results are relevant)
  Recall    = 3/5 = 0.6   (60% of relevant docs found)

Tradeoff: Returning more increases recall but may decrease precision.
```

Precision at K (P@K):
Precision@K measures precision considering only the top K results—what users actually see:
```
Precision@K: Consider only the top K results

Results (ordered): [doc1✓, doc4✗, doc2✓, doc6✗, doc3✓, doc5✓, ...]

  P@1  = 1/1  = 1.0    (top 1 result is relevant)
  P@3  = 2/3  = 0.67   (2 of top 3 are relevant)
  P@5  = 3/5  = 0.60   (3 of top 5 are relevant)
  P@10 = 4/10 = 0.40   (4 of top 10 are relevant)

Why P@K matters:
- Users rarely look past the first page (top 10)
- P@10 is often the most critical metric
- Different K values for different use cases:
  - Web search:    P@10 (first page)
  - Mobile search: P@3  (small screen)
  - Voice search:  P@1  (single answer)
```

Normalized Discounted Cumulative Gain (nDCG):
Unlike precision (binary: relevant or not), nDCG handles graded relevance (e.g., 0-4 scale) and rewards results appearing higher:
```
Normalized Discounted Cumulative Gain (nDCG):

Graded relevance scale: 0 (irrelevant) to 4 (perfect)

Step 1: Compute Discounted Cumulative Gain (DCG)

           n
  DCG@n =  Σ  relevance(i) / log₂(i + 1)
          i=1

  Position discount: log₂(2)=1, log₂(3)=1.58, log₂(4)=2, log₂(5)=2.32

Example ranking with relevance scores:
  Position:  1  2  3  4  5
  Relevance: 4  2  0  3  1

  DCG@5 = 4/log₂(2) + 2/log₂(3) + 0/log₂(4) + 3/log₂(5) + 1/log₂(6)
        = 4/1 + 2/1.58 + 0/2 + 3/2.32 + 1/2.58
        = 4.0 + 1.26 + 0 + 1.29 + 0.39
        = 6.94

Step 2: Compute Ideal DCG (IDCG) - the best possible order
  Ideal order: 4, 3, 2, 1, 0 (descending by relevance)

  IDCG@5 = 4/1 + 3/1.58 + 2/2 + 1/2.32 + 0/2.58
         = 4.0 + 1.90 + 1.0 + 0.43 + 0
         = 7.33

Step 3: Normalize
  nDCG@5 = DCG@5 / IDCG@5 = 6.94 / 7.33 = 0.947

Interpretation: 94.7% of ideal ranking quality achieved.
Perfect ranking: nDCG = 1.0
```

| Metric | Relevance Model | Position Aware | Best For |
|---|---|---|---|
| Precision | Binary | No | Simple quality checks |
| Recall | Binary | No | Coverage analysis |
| P@K | Binary | Partial (cutoff) | User-facing quality |
| MAP | Binary | Yes (averaged) | Cross-query comparison |
| nDCG | Graded | Yes (discounted) | Fine-grained ranking quality |
nDCG is the most widely used metric in commercial search because it handles graded relevance and rewards placing the best results at the top. Most Learning to Rank systems directly optimize for nDCG.
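To tie these metrics together, here is a minimal Python sketch of P@K, DCG, and nDCG that reproduces the worked examples above; the function names are illustrative.

```python
import math

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that are in the relevant set (binary relevance)."""
    return sum(1 for doc in ranked_ids[:k] if doc in relevant_ids) / k

def dcg_at_k(relevances, k):
    """Discounted cumulative gain: graded relevance discounted by log2 of (1-based position + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (descending-relevance) ordering."""
    idcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / idcg if idcg > 0 else 0.0

# P@K example from above: relevant docs are doc1, doc2, doc3, doc5, doc8
ranked = ["doc1", "doc4", "doc2", "doc6", "doc3"]
print(precision_at_k(ranked, {"doc1", "doc2", "doc3", "doc5", "doc8"}, 5))  # 0.6

# nDCG example from above: graded relevance of the results in ranked order
print(ndcg_at_k([4, 2, 0, 3, 1], 5))  # ~0.948 (the 0.947 above comes from rounding intermediate terms)
```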
Building production ranking systems involves practical considerations beyond the core algorithms.
Query Understanding:
Two-Stage Ranking:
Production systems typically use two stages for efficiency:
```
Two-Stage Ranking Architecture:

Stage 1: Candidate Generation (Fast)
------------------------------------------
- Use inverted index for recall
- Apply simple scoring (BM25)
- Retrieve top 1000 candidates
- Speed: ~10ms for millions of docs

Stage 2: Re-Ranking (Sophisticated)
------------------------------------------
- Extract rich features for the 1000 candidates
- Apply ML model (LambdaMART, Neural)
- Consider expensive signals (proximity, ML)
- Return top 10-20 results
- Speed: ~50-100ms for 1000 candidates

Why two stages?
- Can't run an expensive ML model on millions of docs
- The inverted index efficiently filters to candidates
- The re-ranker provides quality polish on a small set

Total latency: ~60-110ms for world-class results
```
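Here is a small, self-contained Python sketch of the two-stage pattern; the term-overlap stage-1 scorer and the "rich" stage-2 scorer are deliberately simplistic stand-ins for BM25 and a learned re-ranker.

```python
import heapq

def stage1_score(query_terms, doc_text):
    """Cheap candidate-generation score: count of query terms present (stand-in for BM25)."""
    words = set(doc_text.lower().split())
    return sum(1 for t in query_terms if t in words)

def stage2_score(query_terms, doc_text):
    """More expensive re-ranking score: term overlap plus an exact-phrase bonus
    (stand-in for a learned model over rich features)."""
    score = stage1_score(query_terms, doc_text)
    if " ".join(query_terms) in doc_text.lower():
        score += 2.0   # phrase/proximity bonus
    return score

def search(query, corpus, candidates=1000, top_k=10):
    query_terms = query.lower().split()
    # Stage 1: cheap scoring over the whole corpus, keep only the best `candidates`.
    shortlist = heapq.nlargest(candidates, corpus.items(),
                               key=lambda kv: stage1_score(query_terms, kv[1]))
    # Stage 2: expensive scoring over the shortlist only.
    reranked = sorted(shortlist,
                      key=lambda kv: stage2_score(query_terms, kv[1]), reverse=True)
    return [doc_id for doc_id, _ in reranked[:top_k]]

corpus = {
    1: "a guide to database optimization and indexing",
    2: "optimization of delivery routes for a database of couriers",
    3: "gardening tips for spring",
}
print(search("database optimization", corpus, candidates=2, top_k=2))  # [1, 2]
```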
Diversity and Freshness:

```
Result Diversity:

Problem: Top 10 results all from the same website = poor experience

Solution: Maximal Marginal Relevance (MMR)

  MMR(d) = λ × Relevance(d) - (1-λ) × max Similarity(d, selected)

At each position:
  1. Pick the document with the highest MMR
  2. Add it to the selected set
  3. Repeat

Effect: Balance relevance against redundancy

Example:
  Position 1: Pick the most relevant doc
  Position 2: Pick a relevant doc that's different from #1
  Position 3: Pick a relevant doc different from #1 and #2
  ...

Result: A diverse top 10 even if similar docs score high

Freshness Boost:

For time-sensitive queries (news, events):
  freshness_boost = 1.0 / (1 + decay_rate × days_old)

  Score = relevance_score × freshness_boost

Setting decay_rate by query type:
  - Breaking news:     High decay (hours)
  - General news:      Moderate decay (days)
  - Evergreen content: Low decay (months)
```

Ranking is never 'done'. Monitor click-through rates, dwell time, and search satisfaction. Use A/B tests to validate ranking changes. Collect user feedback to continuously improve relevance judgments and retrain models.
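A minimal sketch of MMR re-ranking, assuming you already have a relevance score per document and some pairwise similarity function; here similarity is a simple word-overlap (Jaccard) stand-in.

```python
def jaccard(a, b):
    """Word-overlap similarity between two documents (stand-in for a real similarity model)."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

def mmr_rerank(docs, relevance, top_k=10, lam=0.7):
    """Greedy MMR: at each step pick the document maximizing
    λ × relevance − (1 − λ) × max similarity to the already-selected documents."""
    selected = []
    remaining = list(docs)
    while remaining and len(selected) < top_k:
        def mmr(doc_id):
            redundancy = max((jaccard(docs[doc_id], docs[s]) for s in selected), default=0.0)
            return lam * relevance[doc_id] - (1 - lam) * redundancy
        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected

docs = {
    "a": "database optimization guide for postgres",
    "b": "postgres database optimization guide",       # near-duplicate of "a"
    "c": "mysql query tuning and index design",
}
relevance = {"a": 0.95, "b": 0.93, "c": 0.80}
print(mmr_rerank(docs, relevance, top_k=3))  # "c" jumps ahead of the near-duplicate "b"
```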
We've comprehensively explored relevance ranking in full-text search, from BM25 and field weighting through proximity scoring, learning to rank, and evaluation metrics.
Module Complete:
Congratulations! You've completed the comprehensive module on Full-Text Indexes.
This knowledge enables you to design, optimize, and troubleshoot full-text search systems at scale.
You've mastered full-text indexing—from the inverted index data structure through sophisticated ranking algorithms. Whether you're using PostgreSQL's tsvector, Elasticsearch, or building a custom search system, you now have the theoretical foundation and practical knowledge to build world-class text search experiences.