Database Management SystemsFull-Text Indexes

Full-Text Indexing and Search

LevelIntermediate

Duration75 mins

TopicFull-Text Indexes

1 / 5

Text Search Fundamentals

The Text Search Challenge

Consider a simple question: "Find all documents containing the word 'database'." This seemingly trivial query exposes one of the most fundamental challenges in data management—searching unstructured textual content efficiently.

Unlike numeric data where we can leverage comparison operators and range queries, text presents unique challenges. A word might appear at the beginning, middle, or end of a document. It might be capitalized differently. It might appear in plural form, past tense, or as part of a compound word. And with millions of documents, we cannot afford to scan every character of every document for every query.

What You Will Learn

By the end of this page, you will understand why traditional B+-tree and hash indexes are inadequate for text search, how text search differs fundamentally from structured queries, and the key requirements that drive full-text search system design. This foundation prepares you for understanding inverted indexes and ranking algorithms in subsequent pages.

The Nature of Textual Data

To understand why text search requires specialized solutions, we must first understand what makes textual data fundamentally different from other data types in database systems.

Structured vs. Unstructured Data:

Database systems excel at managing structured data—information organized into well-defined fields with predictable formats. An employee's salary is a number, their hire date is a date, and their department ID is a foreign key. Each field has a single, atomic value.

Textual data is inherently unstructured. A product description, an email body, a research paper—these contain free-form text with no predefined structure. The 'meaning' isn't in a single atomic value; it's distributed across words, phrases, and contextual relationships.

Structured vs. Unstructured Data Characteristics
Characteristic	Structured Data	Unstructured Text
Value Format	Fixed, predictable format	Variable-length, free-form
Query Pattern	Exact match, range comparison	Pattern matching, relevance
Atomic Unit	Entire field value	Individual words/tokens
Search Target	Complete value equality	Partial content matching
Result Order	Arbitrary or explicit sort	Relevance-ranked
Index Granularity	Per-row	Per-word-per-row

The Word as Fundamental Unit:

In structured data, the smallest meaningful unit is the field value. In text, the smallest unit is the word (or more precisely, the token). A document containing 1,000 words has 1,000 searchable units, each of which might independently match a query.

This shifts our indexing strategy fundamentally. Instead of indexing complete column values, we must index individual words and their locations within documents.

Key Properties of Text Data

•Variable Length — Documents range from a few words to millions of words, unlike fixed-width numeric or date fields
•High Cardinality — Natural language contains hundreds of thousands of unique words, creating massive index sizes
•Semantic Complexity — Words have meanings, synonyms, inflections, and context-dependent interpretations
•Positional Significance — The order and proximity of words affects meaning ('not good' differs from 'good')
•Language Dependence — Different languages have different tokenization rules, stop words, and grammatical structures

Why B+-Trees and Hash Indexes Fail

Before building new solutions, let's understand precisely why traditional indexing approaches are inadequate for text search. This isn't about minor inefficiencies—these approaches fundamentally cannot solve the text search problem.

Attempt 1: B+-Tree on Text Column

Suppose we create a B+-tree index on a description column. What does this enable?

btree_text_index.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Create B+-tree index on text column
CREATE INDEX idx_product_desc ON products(description);
 
-- What queries can this index support?
 
-- Exact match: YES (but rarely useful for text)
SELECT * FROM products WHERE description = 'High-performance laptop with 16GB RAM';
 
-- Prefix match: YES (limited utility)
SELECT * FROM products WHERE description LIKE 'High-performance%';
 
-- Contains word: NO - requires full scan!
SELECT * FROM products WHERE description LIKE '%laptop%';
 
-- Contains phrase: NO - requires full scan!
SELECT * FROM products WHERE description LIKE '%with 16GB%';

The B+-tree index is useless for the most common text search operation: finding documents that contain a specific word. The LIKE '%pattern%' query with leading wildcards cannot use a B+-tree because the index is ordered by prefix, not by internal content.

Why Prefix Ordering Fails:

A B+-tree on text stores entries ordered lexicographically by the complete text value:

'A high-performance laptop' < 'High-performance laptop' < 'Laptop with high performance'

This ordering tells us nothing about which documents contain 'laptop'. The word 'laptop' appears at different positions, and the B+-tree cannot jump to all entries containing it.

The Wildcard Problem

A LIKE '%keyword%' query with leading wildcard forces a full table scan regardless of any B+-tree index on the column. This makes B+-tree indexes essentially worthless for text search operations that look for words within documents.

Attempt 2: Hash Index on Text Column

Hash indexes are even more limited:

hash_text_limitations.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
-- Hash index on text column (hypothetical)
CREATE INDEX idx_product_desc_hash ON products USING HASH(description);
 
-- Only supports exact match
SELECT * FROM products WHERE description = 'Exact complete text match';
 
-- Cannot support:
-- - Partial matches
-- - Word containment
-- - Prefix matching
-- - Range queries
 
-- The hash of 'laptop' tells us nothing about documents containing 'laptop'

Hash indexes hash the complete text value, producing one hash per document. Searching for a word requires finding documents whose hashes... but we'd need to know the complete text to compute the hash. It's circular and useless.

Attempt 3: Multiple Indexes on Extracted Words?

A clever attempt might be to extract words and create indexes on them:

word_extraction_attempt.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
-- Attempt: Create normalized word table
CREATE TABLE product_words (
    product_id INTEGER REFERENCES products(id),
    word VARCHAR(100),
    position INTEGER  -- Word position in document
);
 
-- Index on word
CREATE INDEX idx_word ON product_words(word);
 
-- Now we can search!
SELECT DISTINCT p.* 
FROM products p
JOIN product_words pw ON p.id = pw.product_id
WHERE pw.word = 'laptop';
 
-- But what about phrases?
-- Find "high performance laptop"
SELECT DISTINCT p.*
FROM products p
JOIN product_words pw1 ON p.id = pw1.product_id AND pw1.word = 'high'
JOIN product_words pw2 ON p.id = pw2.product_id AND pw2.word = 'performance'
JOIN product_words pw3 ON p.id = pw3.product_id AND pw3.word = 'laptop'
WHERE pw2.position = pw1.position + 1
  AND pw3.position = pw2.position + 1;
-- Extremely complex, inefficient, and hard to optimize!

This approach (manual word extraction) is moving in the right direction—it's essentially a primitive inverted index. But it's inefficient, lacks ranking, doesn't handle linguistic variations, and integrates poorly with SQL optimization.

The core insight:

We need a fundamentally different indexing structure—one that:

Indexes by word rather than by document
Supports efficient containment queries (which documents contain word X?)
Handles multiple words and their relationships (proximity, phrase matching)
Provides relevance information (not just yes/no, but how relevant?)

Types of Text Search Queries

Full-text search systems must support a rich variety of query types, far beyond simple 'contains word' searches. Understanding these query types helps us appreciate the sophistication required in full-text index design.

Fundamental Query Types

•Single Word Query — Find all documents containing the word 'database'. The simplest case, requiring lookup of one term.
•Multi-Word Query (AND) — Find documents containing 'database' AND 'index'. Requires intersection of document sets.
•Multi-Word Query (OR) — Find documents containing 'database' OR 'datastore'. Requires union of document sets.
•Phrase Query — Find documents containing the exact phrase 'relational database management'. Words must appear consecutively.
•Proximity Query — Find documents where 'index' appears within 5 words of 'performance'. Requires position information.
•Negation Query — Find documents containing 'database' but NOT 'nosql'. Requires set difference operations.
•Prefix Query — Find documents containing words starting with 'optim*' (optimize, optimization, optimizer). Supports partial matching.
•Fuzzy Query — Find documents containing words similar to 'databse' (typo-tolerant search). Requires edit distance computation.

Boolean Model vs. Relevance Model:

Text search systems typically support two fundamental paradigms:

Boolean Model: Returns documents that match (true) or don't match (false). No ranking—results are an unordered set. Query: (database AND index) OR (search AND engine)

Relevance Model: Returns documents ranked by how well they match. A document containing 'database' 10 times might rank higher than one containing it once. Each document gets a relevance score.

Boolean vs. Relevance Search Models
Aspect	Boolean Model	Relevance Model
Result Type	Set of matching documents	Ranked list of documents with scores
Query Complexity	Boolean operators (AND, OR, NOT)	Free-form natural language
Partial Matches	Not supported	Supported (partial matches score lower)
Result Ordering	Arbitrary or explicit sort	By relevance score (best match first)
User Skill Required	Higher (must know boolean logic)	Lower (natural language queries)
Use Case	Precise legal/patent search	Web search, product search

Modern full-text search systems support both models, often combining them. A user might write:

+database +performance -nosql

This is boolean (+ means required, - means excluded) but results are still ranked by relevance within the matching set.

Field-Specific Search:

In database contexts, text often spans multiple fields:

field_specific_search.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Search across multiple text fields with different weights
-- Find products where:
--   'laptop' appears in title (high importance)
--   OR 'portable' appears in description (medium importance)
--   OR 'computer' appears in category (lower importance)
 
SELECT p.*, 
       ts_rank(to_tsvector(p.title), to_tsquery('laptop')) * 4.0 +
       ts_rank(to_tsvector(p.description), to_tsquery('portable')) * 2.0 +
       ts_rank(to_tsvector(p.category), to_tsquery('computer')) * 1.0 AS relevance
FROM products p
WHERE to_tsvector(p.title || ' ' || p.description || ' ' || p.category) 
      @@ to_tsquery('laptop | portable | computer')
ORDER BY relevance DESC;
 
-- Title matches are weighted 4x higher than category matches

Query Complexity Drives Index Design

The variety of query types significantly impacts index design. Simple 'contains word' queries need only word→document mappings. Phrase queries require position information. Fuzzy queries need edit-distance-friendly structures. A production index must balance these needs.

The Text Processing Pipeline

Before text can be indexed, it must be transformed into a form suitable for matching. This process—the text processing pipeline—is critical to search quality. The pipeline converts raw text into normalized tokens that can be efficiently indexed and matched.

The Core Pipeline Stages:

Converting Mermaid diagram...

Stage 1: Tokenization

Tokenization splits raw text into individual tokens (words). This sounds simple but involves many decisions:

tokenization_examples.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Input: "The high-performance B+-tree index isn't slow—it's O(log n)!"
 
Different tokenization approaches:
 
Whitespace Only:
["The", "high-performance", "B+-tree", "index", "isn't", "slow—it's", "O(log", "n)!"]
 
Standard Word Boundaries:
["The", "high", "performance", "B", "tree", "index", "isn't", "slow", "it's", "O", "log", "n"]
 
Preserve Hyphens and Technical Terms:
["The", "high-performance", "B+-tree", "index", "isn't", "slow", "it's", "O(log n)"]
 
Questions tokenization must answer:
- Are hyphens word separators? (high-performance → 1 or 2 tokens?)
- Are contractions split? (isn't → isn't, isn or is not?)  
- Are numbers indexed? (What about version numbers like "3.14"?)
- Are special characters preserved? (C++ vs C plus plus?)
- Are punctuation marks removed? (What about URLs with dots?)

Stage 2: Normalization

Normalization converts tokens to a standard form for consistent matching:

Common Normalization Operations

•Case Folding — Convert to lowercase: 'Database' → 'database'
•Accent Removal — Normalize diacritics: 'café' → 'cafe'
•Unicode Normalization — Handle equivalent encodings: 'ﬁ' (ligature) → 'fi'
•Number Normalization — Standardize formats: '1,000' → '1000'
•Punctuation Handling — Remove or standardize: 'end.' → 'end'

Stage 3: Stop Word Removal

Stop words are extremely common words that add little search value. Removing them reduces index size and improves precision. We'll cover this in depth in a later page, but the concept is important here:

stop_word_removal.txt
1
2
3
4
5
6
7
8
9
Before stop word removal:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
 
Common English stop words: the, a, an, is, are, was, were, the, and, or, but, in, on, at, to, for, of, with, as, by, about, into, through, during, before, after, above, below, between, under, again, further, then, once, here, there, when, where, why, how, all, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, can, will, just, should, now...
 
After stop word removal:
["quick", "brown", "fox", "jumps", "lazy", "dog"]
 
Reduction: 9 tokens → 6 tokens (33% reduction)

Stage 4: Stemming/Lemmatization

Stemming reduces words to their root form so that variations match:

running, runs, ran → run
databases, database's → databas (stem)
indexing, indexed, indexes → index

Lemmatization is more sophisticated, using linguistic knowledge to find the true base form (lemma) rather than just cutting suffixes.

Stemming vs. Lemmatization
Word	Stemmed (Porter)	Lemmatized
better	better	good
running	run	run
studies	studi	study
indices	indic	index
were	were	be
meeting	meet	meeting (noun) or meet (verb)

Pipeline Decisions Affect Search Quality

Every pipeline decision involves tradeoffs. Aggressive stemming improves recall (finding more matches) but may hurt precision (returning irrelevant matches). The same pipeline must be applied to both indexed documents and query terms for matching to work correctly.

Requirements for Full-Text Search Systems

Based on our understanding of text data characteristics and query types, we can now articulate the requirements that any production full-text search system must satisfy.

Functional Requirements:

Core Functional Requirements

•Word Containment Queries — Given a word, find all documents containing it in O(k) time where k is the number of matching documents
•Boolean Query Support — Combine terms with AND, OR, NOT operators and evaluate efficiently
•Phrase Query Support — Match exact sequences of words in specific order
•Relevance Ranking — Score and rank results by relevance, returning best matches first
•Linguistic Processing — Handle stemming, synonyms, and language-specific variations
•Incremental Updates — Add/remove/modify documents without rebuilding entire index

Performance Requirements:

Performance Requirements for Full-Text Search
Operation	Typical Requirement	Rationale
Single-term lookup	< 10ms	Interactive user experience
Complex boolean query	< 100ms	Reasonable response time
Index single document	< 100ms	Real-time content ingestion
Full re-index (1M docs)	< 1 hour	Offline maintenance window
Index size overhead	30-100% of text size	Practical storage limits
Memory per query	< 100 MB	Concurrent query support

Scalability Requirements:

A full-text search system must scale across multiple dimensions:

Scalability Dimensions

•Document Count — From thousands to billions of documents
•Document Size — From tweets (280 chars) to entire books (millions of chars)
•Vocabulary Size — From narrow technical domains to broad multilingual content
•Query Throughput — From single-user to millions of concurrent queries
•Update Frequency — From static archives to real-time streams

The Inverted Index Solution

The fundamental data structure that meets these requirements is the inverted index—a structure that inverts the document→words relationship to words→documents. Instead of scanning documents to find words, we look up words to find documents. This simple inversion is the foundation of all modern full-text search systems.

Database Full-Text vs. Dedicated Search Engines

Organizations face a fundamental architectural choice: use the database's built-in full-text capabilities or deploy a dedicated search engine. Understanding this tradeoff is crucial for system design.

Database Built-In Full-Text Search:

Most modern relational databases offer full-text search capabilities:

PostgreSQL: tsvector, tsquery, GIN/GiST indexes
MySQL: FULLTEXT indexes on InnoDB/MyISAM
SQL Server: Full-Text Search service
Oracle: Oracle Text

postgresql_fulltext.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- PostgreSQL Full-Text Search Example
 
-- Create table with text column
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT,
    search_vector TSVECTOR  -- Pre-computed search vector
);
 
-- Create GIN index on search vector
CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);
 
-- Populate search vector (combine title and body)
UPDATE articles SET search_vector = 
    setweight(to_tsvector('english', COALESCE(title, '')), 'A') ||
    setweight(to_tsvector('english', COALESCE(body, '')), 'B');
 
-- Create trigger for automatic updates
CREATE TRIGGER articles_search_update
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION
    tsvector_update_trigger(search_vector, 'pg_catalog.english', title, body);
 
-- Query with ranking
SELECT id, title, ts_rank(search_vector, query) AS rank
FROM articles, to_tsquery('english', 'database & index') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;

Dedicated Search Engines:

Dedicated search engines like Elasticsearch, Apache Solr, and Meilisearch are purpose-built for search:

Optimized data structures and algorithms
Distributed architecture for massive scale
Rich query language and ranking features
Specialized APIs and integrations

Database Full-Text vs. Dedicated Search Engine
Aspect	Database Full-Text	Dedicated Search Engine
Data Consistency	ACID transactions, joins with other data	Eventually consistent, separate sync
Operational Complexity	Single system to manage	Additional service to deploy/monitor
Feature Richness	Basic to moderate	Advanced (facets, aggregations, ML)
Query Performance	Good for moderate scale	Optimized for extreme scale
Ranking Quality	Good with tuning	Excellent out of box
Real-Time Updates	Native support	Requires sync mechanism
Best For	Light-moderate search load	Search-heavy applications

Starting Point Recommendation

Unless you have search-heavy requirements from day one, start with your database's built-in full-text search. It's simpler, keeps data consistent, and handles moderate scale well. Migrate to a dedicated search engine when you need advanced features, extreme scale, or search becomes a core product differentiator.

Summary: Text Search Foundations

We've established the foundational understanding needed to dive into full-text index implementation. Let's consolidate the key concepts:

Key Takeaways

•Text is fundamentally unstructured — Unlike scalar values, meaning is distributed across words, requiring granular per-word indexing
•Traditional indexes fail for text search — B+-trees and hash indexes cannot efficiently support 'contains word' queries
•Query types drive index design — Boolean, phrase, proximity, and ranked queries each impose requirements on index structure
•Text processing is critical — Tokenization, normalization, stop words, and stemming transform raw text into matchable tokens
•Requirements are demanding — Sub-10ms lookups, incremental updates, and scalability to billions of documents
•Inverted indexes are the solution — Inverting the document→words relationship enables efficient word→documents lookups

What's Next:

With this foundation established, the next page dives deep into inverted indexes—the data structure at the heart of all full-text search systems. We'll explore how they're constructed, stored, and used to answer queries with remarkable efficiency.

Foundation Complete

You now understand why full-text search requires specialized solutions beyond traditional database indexes. The text processing pipeline and query type taxonomy prepare you for understanding the inverted index structure that makes efficient text search possible.

1 / 5

Loading learning content...

Database Management SystemsFull-Text Indexes

Full-Text Indexing and Search

LevelIntermediate

Duration75 mins

TopicFull-Text Indexes

1 / 5

Text Search Fundamentals

The Text Search Challenge

What You Will Learn

The Nature of Textual Data

To understand why text search requires specialized solutions, we must first understand what makes textual data fundamentally different from other data types in database systems.

Structured vs. Unstructured Data:

Structured vs. Unstructured Data Characteristics
Characteristic	Structured Data	Unstructured Text
Value Format	Fixed, predictable format	Variable-length, free-form
Query Pattern	Exact match, range comparison	Pattern matching, relevance
Atomic Unit	Entire field value	Individual words/tokens
Search Target	Complete value equality	Partial content matching
Result Order	Arbitrary or explicit sort	Relevance-ranked
Index Granularity	Per-row	Per-word-per-row

The Word as Fundamental Unit:

This shifts our indexing strategy fundamentally. Instead of indexing complete column values, we must index individual words and their locations within documents.

Key Properties of Text Data

•Variable Length — Documents range from a few words to millions of words, unlike fixed-width numeric or date fields
•High Cardinality — Natural language contains hundreds of thousands of unique words, creating massive index sizes
•Semantic Complexity — Words have meanings, synonyms, inflections, and context-dependent interpretations
•Positional Significance — The order and proximity of words affects meaning ('not good' differs from 'good')
•Language Dependence — Different languages have different tokenization rules, stop words, and grammatical structures

Why B+-Trees and Hash Indexes Fail

Attempt 1: B+-Tree on Text Column

Suppose we create a B+-tree index on a description column. What does this enable?

btree_text_index.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Create B+-tree index on text column
CREATE INDEX idx_product_desc ON products(description);
 
-- What queries can this index support?
 
-- Exact match: YES (but rarely useful for text)
SELECT * FROM products WHERE description = 'High-performance laptop with 16GB RAM';
 
-- Prefix match: YES (limited utility)
SELECT * FROM products WHERE description LIKE 'High-performance%';
 
-- Contains word: NO - requires full scan!
SELECT * FROM products WHERE description LIKE '%laptop%';
 
-- Contains phrase: NO - requires full scan!
SELECT * FROM products WHERE description LIKE '%with 16GB%';

Why Prefix Ordering Fails:

A B+-tree on text stores entries ordered lexicographically by the complete text value:

'A high-performance laptop' < 'High-performance laptop' < 'Laptop with high performance'

This ordering tells us nothing about which documents contain 'laptop'. The word 'laptop' appears at different positions, and the B+-tree cannot jump to all entries containing it.

The Wildcard Problem

Attempt 2: Hash Index on Text Column

Hash indexes are even more limited:

hash_text_limitations.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
-- Hash index on text column (hypothetical)
CREATE INDEX idx_product_desc_hash ON products USING HASH(description);
 
-- Only supports exact match
SELECT * FROM products WHERE description = 'Exact complete text match';
 
-- Cannot support:
-- - Partial matches
-- - Word containment
-- - Prefix matching
-- - Range queries
 
-- The hash of 'laptop' tells us nothing about documents containing 'laptop'

Attempt 3: Multiple Indexes on Extracted Words?

A clever attempt might be to extract words and create indexes on them:

word_extraction_attempt.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
-- Attempt: Create normalized word table
CREATE TABLE product_words (
    product_id INTEGER REFERENCES products(id),
    word VARCHAR(100),
    position INTEGER  -- Word position in document
);
 
-- Index on word
CREATE INDEX idx_word ON product_words(word);
 
-- Now we can search!
SELECT DISTINCT p.* 
FROM products p
JOIN product_words pw ON p.id = pw.product_id
WHERE pw.word = 'laptop';
 
-- But what about phrases?
-- Find "high performance laptop"
SELECT DISTINCT p.*
FROM products p
JOIN product_words pw1 ON p.id = pw1.product_id AND pw1.word = 'high'
JOIN product_words pw2 ON p.id = pw2.product_id AND pw2.word = 'performance'
JOIN product_words pw3 ON p.id = pw3.product_id AND pw3.word = 'laptop'
WHERE pw2.position = pw1.position + 1
  AND pw3.position = pw2.position + 1;
-- Extremely complex, inefficient, and hard to optimize!

The core insight:

We need a fundamentally different indexing structure—one that:

Indexes by word rather than by document
Supports efficient containment queries (which documents contain word X?)
Handles multiple words and their relationships (proximity, phrase matching)
Provides relevance information (not just yes/no, but how relevant?)

Types of Text Search Queries

Fundamental Query Types

•Single Word Query — Find all documents containing the word 'database'. The simplest case, requiring lookup of one term.
•Multi-Word Query (AND) — Find documents containing 'database' AND 'index'. Requires intersection of document sets.
•Multi-Word Query (OR) — Find documents containing 'database' OR 'datastore'. Requires union of document sets.
•Phrase Query — Find documents containing the exact phrase 'relational database management'. Words must appear consecutively.
•Proximity Query — Find documents where 'index' appears within 5 words of 'performance'. Requires position information.
•Negation Query — Find documents containing 'database' but NOT 'nosql'. Requires set difference operations.
•Prefix Query — Find documents containing words starting with 'optim*' (optimize, optimization, optimizer). Supports partial matching.
•Fuzzy Query — Find documents containing words similar to 'databse' (typo-tolerant search). Requires edit distance computation.

Boolean Model vs. Relevance Model:

Text search systems typically support two fundamental paradigms:

Boolean Model: Returns documents that match (true) or don't match (false). No ranking—results are an unordered set. Query: (database AND index) OR (search AND engine)

Boolean vs. Relevance Search Models
Aspect	Boolean Model	Relevance Model
Result Type	Set of matching documents	Ranked list of documents with scores
Query Complexity	Boolean operators (AND, OR, NOT)	Free-form natural language
Partial Matches	Not supported	Supported (partial matches score lower)
Result Ordering	Arbitrary or explicit sort	By relevance score (best match first)
User Skill Required	Higher (must know boolean logic)	Lower (natural language queries)
Use Case	Precise legal/patent search	Web search, product search

Modern full-text search systems support both models, often combining them. A user might write:

+database +performance -nosql

This is boolean (+ means required, - means excluded) but results are still ranked by relevance within the matching set.

Field-Specific Search:

In database contexts, text often spans multiple fields:

field_specific_search.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
-- Search across multiple text fields with different weights
-- Find products where:
--   'laptop' appears in title (high importance)
--   OR 'portable' appears in description (medium importance)
--   OR 'computer' appears in category (lower importance)
 
SELECT p.*, 
       ts_rank(to_tsvector(p.title), to_tsquery('laptop')) * 4.0 +
       ts_rank(to_tsvector(p.description), to_tsquery('portable')) * 2.0 +
       ts_rank(to_tsvector(p.category), to_tsquery('computer')) * 1.0 AS relevance
FROM products p
WHERE to_tsvector(p.title || ' ' || p.description || ' ' || p.category) 
      @@ to_tsquery('laptop | portable | computer')
ORDER BY relevance DESC;
 
-- Title matches are weighted 4x higher than category matches

Query Complexity Drives Index Design

The Text Processing Pipeline

The Core Pipeline Stages:

Converting Mermaid diagram...

Stage 1: Tokenization

Tokenization splits raw text into individual tokens (words). This sounds simple but involves many decisions:

tokenization_examples.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
Input: "The high-performance B+-tree index isn't slow—it's O(log n)!"
 
Different tokenization approaches:
 
Whitespace Only:
["The", "high-performance", "B+-tree", "index", "isn't", "slow—it's", "O(log", "n)!"]
 
Standard Word Boundaries:
["The", "high", "performance", "B", "tree", "index", "isn't", "slow", "it's", "O", "log", "n"]
 
Preserve Hyphens and Technical Terms:
["The", "high-performance", "B+-tree", "index", "isn't", "slow", "it's", "O(log n)"]
 
Questions tokenization must answer:
- Are hyphens word separators? (high-performance → 1 or 2 tokens?)
- Are contractions split? (isn't → isn't, isn or is not?)  
- Are numbers indexed? (What about version numbers like "3.14"?)
- Are special characters preserved? (C++ vs C plus plus?)
- Are punctuation marks removed? (What about URLs with dots?)

Stage 2: Normalization

Normalization converts tokens to a standard form for consistent matching:

Common Normalization Operations

•Case Folding — Convert to lowercase: 'Database' → 'database'
•Accent Removal — Normalize diacritics: 'café' → 'cafe'
•Unicode Normalization — Handle equivalent encodings: 'ﬁ' (ligature) → 'fi'
•Number Normalization — Standardize formats: '1,000' → '1000'
•Punctuation Handling — Remove or standardize: 'end.' → 'end'

Stage 3: Stop Word Removal

stop_word_removal.txt
1
2
3
4
5
6
7
8
9
Before stop word removal:
["the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"]
 
Common English stop words: the, a, an, is, are, was, were, the, and, or, but, in, on, at, to, for, of, with, as, by, about, into, through, during, before, after, above, below, between, under, again, further, then, once, here, there, when, where, why, how, all, each, few, more, most, other, some, such, no, nor, not, only, own, same, so, than, too, very, can, will, just, should, now...
 
After stop word removal:
["quick", "brown", "fox", "jumps", "lazy", "dog"]
 
Reduction: 9 tokens → 6 tokens (33% reduction)

Stage 4: Stemming/Lemmatization

Stemming reduces words to their root form so that variations match:

running, runs, ran → run
databases, database's → databas (stem)
indexing, indexed, indexes → index

Lemmatization is more sophisticated, using linguistic knowledge to find the true base form (lemma) rather than just cutting suffixes.

Stemming vs. Lemmatization
Word	Stemmed (Porter)	Lemmatized
better	better	good
running	run	run
studies	studi	study
indices	indic	index
were	were	be
meeting	meet	meeting (noun) or meet (verb)

Pipeline Decisions Affect Search Quality

Requirements for Full-Text Search Systems

Based on our understanding of text data characteristics and query types, we can now articulate the requirements that any production full-text search system must satisfy.

Functional Requirements:

Core Functional Requirements

•Word Containment Queries — Given a word, find all documents containing it in O(k) time where k is the number of matching documents
•Boolean Query Support — Combine terms with AND, OR, NOT operators and evaluate efficiently
•Phrase Query Support — Match exact sequences of words in specific order
•Relevance Ranking — Score and rank results by relevance, returning best matches first
•Linguistic Processing — Handle stemming, synonyms, and language-specific variations
•Incremental Updates — Add/remove/modify documents without rebuilding entire index

Performance Requirements:

Performance Requirements for Full-Text Search
Operation	Typical Requirement	Rationale
Single-term lookup	< 10ms	Interactive user experience
Complex boolean query	< 100ms	Reasonable response time
Index single document	< 100ms	Real-time content ingestion
Full re-index (1M docs)	< 1 hour	Offline maintenance window
Index size overhead	30-100% of text size	Practical storage limits
Memory per query	< 100 MB	Concurrent query support

Scalability Requirements:

A full-text search system must scale across multiple dimensions:

Scalability Dimensions

•Document Count — From thousands to billions of documents
•Document Size — From tweets (280 chars) to entire books (millions of chars)
•Vocabulary Size — From narrow technical domains to broad multilingual content
•Query Throughput — From single-user to millions of concurrent queries
•Update Frequency — From static archives to real-time streams

The Inverted Index Solution

Database Full-Text vs. Dedicated Search Engines

Database Built-In Full-Text Search:

Most modern relational databases offer full-text search capabilities:

PostgreSQL: tsvector, tsquery, GIN/GiST indexes
MySQL: FULLTEXT indexes on InnoDB/MyISAM
SQL Server: Full-Text Search service
Oracle: Oracle Text

postgresql_fulltext.sql
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
-- PostgreSQL Full-Text Search Example
 
-- Create table with text column
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    title TEXT,
    body TEXT,
    search_vector TSVECTOR  -- Pre-computed search vector
);
 
-- Create GIN index on search vector
CREATE INDEX idx_articles_search ON articles USING GIN(search_vector);
 
-- Populate search vector (combine title and body)
UPDATE articles SET search_vector = 
    setweight(to_tsvector('english', COALESCE(title, '')), 'A') ||
    setweight(to_tsvector('english', COALESCE(body, '')), 'B');
 
-- Create trigger for automatic updates
CREATE TRIGGER articles_search_update
    BEFORE INSERT OR UPDATE ON articles
    FOR EACH ROW EXECUTE FUNCTION
    tsvector_update_trigger(search_vector, 'pg_catalog.english', title, body);
 
-- Query with ranking
SELECT id, title, ts_rank(search_vector, query) AS rank
FROM articles, to_tsquery('english', 'database & index') AS query
WHERE search_vector @@ query
ORDER BY rank DESC
LIMIT 20;

Dedicated Search Engines:

Dedicated search engines like Elasticsearch, Apache Solr, and Meilisearch are purpose-built for search:

Optimized data structures and algorithms
Distributed architecture for massive scale
Rich query language and ranking features
Specialized APIs and integrations

Database Full-Text vs. Dedicated Search Engine
Aspect	Database Full-Text	Dedicated Search Engine
Data Consistency	ACID transactions, joins with other data	Eventually consistent, separate sync
Operational Complexity	Single system to manage	Additional service to deploy/monitor
Feature Richness	Basic to moderate	Advanced (facets, aggregations, ML)
Query Performance	Good for moderate scale	Optimized for extreme scale
Ranking Quality	Good with tuning	Excellent out of box
Real-Time Updates	Native support	Requires sync mechanism
Best For	Light-moderate search load	Search-heavy applications

Starting Point Recommendation

Summary: Text Search Foundations

We've established the foundational understanding needed to dive into full-text index implementation. Let's consolidate the key concepts:

Key Takeaways

•Text is fundamentally unstructured — Unlike scalar values, meaning is distributed across words, requiring granular per-word indexing
•Traditional indexes fail for text search — B+-trees and hash indexes cannot efficiently support 'contains word' queries
•Query types drive index design — Boolean, phrase, proximity, and ranked queries each impose requirements on index structure
•Text processing is critical — Tokenization, normalization, stop words, and stemming transform raw text into matchable tokens
•Requirements are demanding — Sub-10ms lookups, incremental updates, and scalability to billions of documents
•Inverted indexes are the solution — Inverting the document→words relationship enables efficient word→documents lookups

What's Next:

Foundation Complete

1 / 5