Loading content...
In the previous page, we introduced the bitmap index concept—representing data presence as bit vectors. Now we dive deeper into the mechanics: how exactly are these bit vectors constructed?
This seemingly simple question reveals fascinating complexity. Different data types require different encoding strategies. The order of bits matters profoundly. Storage layout affects cache performance. And the mathematical properties of bit vectors enable powerful optimizations.
Understanding bit vector construction is essential for database engineers who tune bitmap indexes, analysts who need to predict query performance, and system architects who evaluate when bitmap indexes are appropriate.
By the end of this page, you will understand: (1) How bit vectors are constructed from table scans, (2) Encoding strategies for different data types (strings, integers, dates), (3) The mathematical properties of bitmap representations, (4) Bit ordering and storage layout considerations, and (5) How NULL values are handled in bitmap indexes.
Building a bitmap index requires a systematic process of scanning the source data and constructing bit vectors for each distinct value. Let's trace through the complete construction algorithm:
Algorithm: Bitmap Index Construction
CREATE_BITMAP_INDEX(table T, column C):
// Phase 1: Discover distinct values
distinct_values = {}
FOR each row R in T:
value = R[C]
IF value NOT IN distinct_values:
distinct_values[value] = new_empty_bitmap(size = row_count(T))
// Phase 2: Populate bit vectors
row_position = 0
FOR each row R in T:
value = R[C]
bitmap = distinct_values[value]
bitmap[row_position] = 1
row_position = row_position + 1
// Phase 3: Store index
FOR each (value, bitmap) in distinct_values:
store_bitmap(value, bitmap)
update_value_lookup_structure(value)
RETURN bitmap_index
Concrete example:
Let's construct a bitmap index on the Status column of an Orders table:
| RowId | OrderId | Status |
|---|---|---|
| 0 | ORD001 | Pending |
| 1 | ORD002 | Shipped |
| 2 | ORD003 | Delivered |
| 3 | ORD004 | Pending |
| 4 | ORD005 | Cancelled |
| 5 | ORD006 | Shipped |
| 6 | ORD007 | Delivered |
| 7 | ORD008 | Pending |
Step 1: Scan and discover distinct values
00000000 eachStep 2: Populate bit vectors
100000000100000000100000Final bitmaps:
10010001 (rows 0, 3, 7)01000100 (rows 1, 5)00100010 (rows 2, 6)00001000 (row 4)The algorithm above shows a two-pass approach (discover values, then populate). In practice, databases often use single-pass construction with dynamic bitmap allocation, or leverage existing statistics to pre-allocate bitmaps. For very large tables, parallel construction across partitions is common.
Different data types require different approaches to value identification and bitmap assignment. The goal is always the same—map each distinct value to a unique bitmap—but the mechanics vary:
String/Character Columns
Strings are the most straightforward case. Each distinct string value gets its own bitmap. The value-to-bitmap lookup typically uses a hash table or B+-tree on the string values.
Considerations:
Example: Status column with strings
'Pending' → Bitmap 0
'Shipped' → Bitmap 1
'Delivered' → Bitmap 2
'Cancelled' → Bitmap 3
Integer Columns
Integer columns can use the integer value directly as an index into an array of bitmaps (for dense ranges) or use a hash/tree lookup (for sparse ranges).
Dense range optimization:
If a column contains values from a small, continuous range (e.g., rating from 1-5), store bitmaps in an array indexed by value:
rating = 1 → bitmaps[0]
rating = 2 → bitmaps[1]
rating = 3 → bitmaps[2]
rating = 4 → bitmaps[3]
rating = 5 → bitmaps[4]
This eliminates lookup overhead entirely—accessing bitmap for value V is O(1) array access.
Sparse range handling: If integers are sparse (e.g., error codes 100, 200, 404, 500), use a hash table or sorted array with binary search.
Date/Time Columns
Dates can be encoded at different granularities depending on query patterns:
Day-level encoding:
Month-level encoding:
Decomposed encoding: Some systems create separate bitmap indexes on year, month, and day:
Query "sales in June 2024" becomes: (year=2024) AND (month=6)
Decomposed date encoding creates fewer bitmaps (12 for months vs. 365 for days) but requires more bitmap operations to filter precise dates. The right choice depends on query patterns—if most queries filter by month, decomposed encoding is efficient; if queries need exact dates, per-day bitmaps are better.
Boolean Columns
Boolean columns are the simplest case—only two distinct values (TRUE/FALSE or 1/0):
TRUE → Bitmap 0: 10110010...
FALSE → Bitmap 1: 01001101...
Important observation: For booleans, the two bitmaps are complements of each other:
Bitmap_TRUE = NOT(Bitmap_FALSE)Some databases optimize this by storing only one bitmap and computing the other on demand. This single-bitmap optimization halves storage for boolean columns.
Bitmap indexes have elegant mathematical properties that enable correctness proofs and optimizations. Understanding these properties helps database engineers reason about bitmap behavior:
popcount(bitmap_V) = COUNT(*) WHERE column = V.A AND A = A, A OR A = A. Bitmap operations follow Boolean algebra laws.Formal notation:
Let B_v denote the bitmap for value v in column C of table T with n rows.
Property 1: Mutual Exclusivity $$\forall i \in [0, n-1]: \sum_{v \in domain(C)} B_v[i] = 1$$
(For every row position, exactly one bitmap has a 1)
Property 2: Completeness $$\bigvee_{v \in domain(C)} B_v = \mathbf{1}$$
(OR of all bitmaps equals the all-ones vector)
Property 3: Disjointness $$\forall v_1, v_2 \in domain(C), v_1 eq v_2: B_{v_1} \wedge B_{v_2} = \mathbf{0}$$
(AND of any two different value bitmaps equals zero vector)
Why these properties matter:
1. Query correctness proofs:
These properties prove that bitmap queries return correct results. For example, if a row matches column = 'A' OR column = 'B', its bit will be set in exactly one of B_A or B_B, so B_A OR B_B will include it.
2. Optimization opportunities:
B_B OR B_C OR B_D... (OR of all other bitmaps). This avoids issues with unindexed values.B_A AND B_B = 0 (disjointness), we know no row matches column = 'A' AND column = 'B' without scanning.popcount(B_A) + popcount(B_B) + ... = n (total rows). If we know some popcounts, we can derive others.3. Error detection: Violations of these properties indicate data corruption or index inconsistency:
The properties above assume no NULL values. With NULLs, completeness breaks—the OR of all value bitmaps has 0-bits where NULLs exist. We'll address NULL handling specifically in a later section.
How bits are ordered within bytes and how bitmaps are stored affects both correctness and performance. These low-level details matter for implementations:
Bit numbering within bytes:
Given a byte (8 bits), two conventions exist:
Little-endian bit order (LSB first):
Bit position: 0 1 2 3 4 5 6 7
Byte value: 1 0 0 1 0 1 0 0 = 0x29 (binary 00101001)
Row 0 → bit 0 (rightmost in binary representation)
Big-endian bit order (MSB first):
Bit position: 7 6 5 4 3 2 1 0
Byte value: 0 0 1 0 1 0 0 1 = 0x29
Row 0 → bit 7 (leftmost in binary representation)
Most database systems use little-endian bit order because it aligns with how CPUs naturally number bits, and the POPCNT instruction counts bits regardless of ordering.
Multi-word bitmap storage:
For tables with more than 64 rows, bitmaps span multiple 64-bit words. Storage options include:
Contiguous storage:
Bitmap for value 'A':
Word 0: bits 0-63
Word 1: bits 64-127
Word 2: bits 128-191
...
This layout optimizes sequential scanning of a single bitmap.
Interleaved storage (Vertical Bitmap):
For bit position 0:
Bitmap A[0], Bitmap B[0], Bitmap C[0], ...
For bit position 1:
Bitmap A[1], Bitmap B[1], Bitmap C[1], ...
This layout optimizes queries that combine multiple bitmaps (AND/OR operations) because related bits are adjacent in memory.
| Layout | Best For | Drawback |
|---|---|---|
| Contiguous (Horizontal) | Single-bitmap scans, popcount, compression | Multiple bitmap combinations require jumping between locations |
| Interleaved (Vertical) | Multi-bitmap AND/OR operations | Single-bitmap operations touch scattered locations |
| Block-interleaved (Hybrid) | Balanced workloads | More complex implementation |
Alignment considerations:
Modern CPUs load data in cache lines (typically 64 bytes). Misaligned bitmap access causes performance penalties:
Aligned access:
Bitmap starts at address divisible by 64
→ Single cache line loads entire first 512 bits
→ Optimal SIMD performance
Misaligned access:
Bitmap starts at odd address
→ First 512 bits span two cache lines
→ Double memory bandwidth, slower operations
Production bitmap implementations always align bitmap storage to cache line boundaries. Padding bytes at the end of bitmaps (to round up to word boundaries) are also common.
For AVX-512 operations (processing 512 bits at once), bitmaps should be 64-byte aligned. For AVX2 (256 bits), 32-byte alignment is optimal. Database systems that leverage SIMD carefully design their memory allocators to guarantee these alignments.
NULL values introduce complexity into bitmap indexing. NULLs are not equal to any value (including other NULLs), which breaks the clean mathematical properties we established earlier.
The NULL problem:
Consider a column with values 'A', 'B', and NULL:
| RowId | Value |
|---|---|
| 0 | A |
| 1 | B |
| 2 | NULL |
| 3 | A |
| 4 | NULL |
If we only create bitmaps for 'A' and 'B':
1001001000The query WHERE Value IS NULL cannot be answered!
The query WHERE Value != 'A' should return rows 1, 2, 4—but NOT(Bitmap_A) = 01101 includes row 4 (correct) but also row 2 (the NULL, which we may or may not want).
Solution: Create a NULL bitmap
Most databases create an explicit bitmap for NULL values:
100100100000101Now:
IS NULL → return Bitmap_NULLIS NOT NULL → return NOT(Bitmap_NULL) = 11010Value = 'A' → return Bitmap_AValue != 'A' → depends on NULL semantics...NULL semantics in comparisons:
SQL's three-valued logic (TRUE, FALSE, UNKNOWN) affects bitmap operations:
WHERE Value != 'A'
Bitmap_B only (rows with non-NULL, non-A values)(NOT Bitmap_A) AND (NOT Bitmap_NULL)WHERE Value = 'A' OR Value IS NULL
Bitmap_A OR Bitmap_NULL = 10111WHERE Value IN ('A', 'B')
Value = 'A' OR Value = 'B'Bitmap_A OR Bitmap_B = 11010Key insight: Most comparison operations need to exclude NULLs. The safe pattern for Value != X is:
(NOT Bitmap_X) AND (NOT Bitmap_NULL)
If a column is defined as NOT NULL, the NULL bitmap can be omitted, and NOT(Bitmap_A) safely returns all other values. This saves one bitmap per column and simplifies NOT operations. Schema design decisions directly impact bitmap index efficiency.
Alternative NULL representations:
1. Implicit complement computation: Instead of storing a NULL bitmap, compute it as:
Bitmap_NULL = NOT(Bitmap_A OR Bitmap_B OR ... OR Bitmap_Z)
This saves storage but requires computing the OR of all bitmaps for NULL checks.
2. Existence bitmap: Store a single 'existence' bitmap where bit i = 1 if row i has a non-NULL value:
Bitmap_EXISTS: 11010
NULL bitmap = NOT(Bitmap_EXISTS) = 00101
3. Encode NULL as a special value: Treat NULL as just another distinct value with its own bitmap. Simplest implementation, but requires careful handling in comparisons.
So far, we've assumed each row has exactly one value per column. But some data models allow multi-valued attributes—a row can have multiple values for a single attribute. Examples include tags, categories, and array columns.
Example: Product tags
| ProductId | Tags |
|---|---|
| 1 | {Electronics, Portable} |
| 2 | {Electronics} |
| 3 | {Home, Kitchen} |
| 4 | {Electronics, Home, Portable} |
| 5 | {Kitchen} |
A product can have multiple tags. Bitmap indexing still works:
11010 (products 1, 2, 4)10010 (products 1, 4)00110 (products 3, 4)00101 (products 3, 5)Key difference: Non-mutual exclusivity
For multi-valued attributes, the mutual exclusivity property no longer holds:
Bitmap_Electronics AND Bitmap_Home ≠ 0 (it's 00010, just product 4)This changes query semantics:
Query: Products with Electronics tag
SELECT * FROM Products WHERE 'Electronics' IN Tags;
Result: Bitmap_Electronics = 11010
Query: Products with BOTH Electronics AND Portable
SELECT * FROM Products WHERE 'Electronics' IN Tags AND 'Portable' IN Tags;
Result: Bitmap_Electronics AND Bitmap_Portable = 10010
Query: Products with ONLY Electronics (no other tags)
SELECT * FROM Products
WHERE 'Electronics' IN Tags
AND 'Portable' NOT IN Tags
AND 'Home' NOT IN Tags
AND 'Kitchen' NOT IN Tags;
Result: Bitmap_Electronics AND NOT(Bitmap_Portable) AND NOT(Bitmap_Home) AND NOT(Bitmap_Kitchen)
= 11010 AND 01101 AND 11001 AND 11010 = 01000 (only product 2)
For multi-valued attributes, interesting queries become possible: 'How many products have exactly 2 tags?' requires counting which rows have exactly 2 bits set across all tag bitmaps. This is more complex but still efficient with bitmap operations and conditional counting.
Standard bitmap encoding (one bitmap per value) is called equality encoding—each bitmap answers "Does this row have this exact value?" Alternative encoding schemes trade different trade-offs:
Range Encoding
Instead of equality bitmaps, create bitmaps that answer range questions directly:
Example: Age column with values 18-65
Equality encoding: 48 bitmaps (one per age)
Range encoding: 48 bitmaps
Query: Age BETWEEN 25 AND 35
With equality encoding:
Bitmap_25 OR Bitmap_26 OR ... OR Bitmap_35 (11 bitmap ORs)
With range encoding:
Bitmap_≤35 AND NOT(Bitmap_≤24) (just 2 bitmap operations!)
Interval Encoding
A hybrid approach where bitmaps represent intervals:
Fewer bitmaps (6 instead of 48), but some queries require ORing multiple bitmaps.
Bit-Sliced Indexing (BSI)
Encode numeric values in binary and create one bitmap per bit position:
Value 5 = binary 101
Value 7 = binary 111
Value 3 = binary 011
Bitmap_bit0 (ones place): 1, 1, 1
Bitmap_bit1 (twos place): 0, 1, 1
Bitmap_bit2 (fours place): 1, 1, 0
BSI enables arithmetic operations on indexed columns using binary arithmetic on bitmaps. Useful for SUM queries without fetching actual values.
| Encoding | Number of Bitmaps | Equality Query | Range Query | Best For |
|---|---|---|---|---|
| Equality | C (cardinality) | 1 lookup | OR many bitmaps | Low-cardinality, equality queries |
| Range | C | Derived (B_≤v AND NOT B_≤v-1) | 2 lookups + AND/NOT | Range-heavy queries |
| Interval | C / interval_size | Lookup + filter | OR intervals + filter | Medium cardinality |
| Bit-Sliced | log2(max_value) | Complex decode | Arithmetic | Numeric aggregations |
Most production systems use equality encoding for categorical data and bit-sliced or range encoding for numeric columns where range queries are common. Oracle, for example, supports multiple encoding strategies and can choose based on column statistics and query patterns.
We've thoroughly explored how bit vectors are constructed and represented in bitmap indexes. Let's consolidate the key concepts:
What's next:
We've seen how to construct bitmaps for individual values. In the next page, we'll focus on low cardinality columns—the sweet spot where bitmap indexes truly shine—and understand why cardinality is the critical factor in bitmap index effectiveness.
You now understand the detailed mechanics of bit vector construction—from data type-specific encoding to NULL handling to alternative encoding strategies. Next, we'll explore why low-cardinality columns are the ideal candidates for bitmap indexing.