Loading content...
Throughout our exploration of bitmap indexes, we've repeatedly mentioned that they excel for low-cardinality columns. But what exactly does "low cardinality" mean? How low is low enough? And what happens when cardinality increases?
This page provides definitive answers. We'll quantify the relationship between cardinality and bitmap effectiveness, establish practical thresholds for real-world decision-making, and understand the mathematical basis for why cardinality matters so profoundly.
Cardinality refers to the number of distinct values in a column. A column with cardinality 3 has three unique values (like Status = {'Active', 'Pending', 'Closed'}). A column with cardinality 1 million has a million unique values (like UserID in a large system).
By the end of this page, you will understand: (1) The mathematical relationship between cardinality and bitmap space consumption, (2) Practical cardinality thresholds for bitmap index decisions, (3) How selectivity interacts with cardinality, (4) The crossover point where B+-trees become more efficient, and (5) Strategies for handling medium-cardinality columns.
The space consumed by a bitmap index is directly proportional to the cardinality of the indexed column. Understanding this relationship is crucial for capacity planning and index selection.
Uncompressed bitmap space formula:
For a table with N rows and a column with C distinct values (cardinality C):
Total bits = N × C
Total bytes = (N × C) / 8
Total megabytes = (N × C) / 8,000,000
Example calculations:
| Table Size (N) | Cardinality (C) | Bitmap Size |
|---|---|---|
| 1 million | 2 (boolean) | 250 KB |
| 1 million | 10 (status codes) | 1.25 MB |
| 1 million | 100 (categories) | 12.5 MB |
| 1 million | 1,000 (products) | 125 MB |
| 1 million | 10,000 (customers) | 1.25 GB |
| 1 million | 1,000,000 (unique IDs) | 125 GB |
The growth is linear in both N and C, but notice how dramatically space increases with cardinality:
When cardinality approaches row count (C ≈ N), uncompressed bitmap space becomes O(N²). For a billion-row table with a billion unique values, you'd need over 100 petabytes—physically impossible. This is why bitmap indexes are strictly limited to low-cardinality scenarios without compression.
Space comparison with B+-tree:
A B+-tree index on the same column requires approximately:
B+-tree size ≈ N × (key_size + pointer_size) / fill_factor
≈ N × (8 + 8) / 0.67
≈ 24 × N bytes
For 1 million rows: ~24 MB (constant regardless of cardinality)
Crossover analysis:
Bitmaps are smaller when:
(N × C) / 8 < 24 × N
C / 8 < 24
C < 192
For uncompressed bitmaps, the break-even point is roughly C < 200. Above this cardinality, B+-trees use less space than uncompressed bitmaps.
With compression, bitmaps remain competitive to higher cardinalities (often C < 10,000 or more), as we'll explore in the compression section.
The term "low cardinality" is context-dependent. A column with 1,000 distinct values might be low-cardinality in a billion-row table but high-cardinality in a thousand-row table. The key metric is the cardinality ratio:
Cardinality Ratio = C / N
Where C = number of distinct values and N = number of rows
| Cardinality Ratio | Classification | Bitmap Suitability |
|---|---|---|
| < 0.01% (C/N < 0.0001) | Very Low | Excellent |
| 0.01% - 1% | Low | Good |
| 1% - 10% | Medium | Consider with compression |
| 10% - 50% | High | Generally not suitable |
50% | Very High | Avoid bitmaps entirely |
Practical examples:
Very Low Cardinality (< 0.01%):
Low Cardinality (0.01% - 1%):
Medium Cardinality (1% - 10%):
High Cardinality (> 10%):
A practical guideline: consider bitmap indexes when cardinality is less than 1% of row count, and strongly favor them when cardinality is less than 0.1%. Above 1%, carefully analyze compression benefits and query patterns before committing to bitmaps.
Low cardinality benefits bitmap indexes in three interconnected ways: space efficiency, query efficiency, and compression effectiveness. Let's examine each:
11111111111111111111 (20 ones) to just 'run of 20 ones' regardless of length.The density insight:
For a column with cardinality C and N rows (assuming uniform distribution):
| Cardinality | Bitmap Density | Bits set per million rows |
|---|---|---|
| 2 | 50% | 500,000 |
| 10 | 10% | 100,000 |
| 100 | 1% | 10,000 |
| 1,000 | 0.1% | 1,000 |
| 10,000 | 0.01% | 100 |
At high cardinality, bitmaps become extremely sparse—mostly zeros with rare ones. Operations become overhead-dominated, and compression becomes essential just to achieve reasonable sizes.
Selectivity measures what fraction of rows a predicate selects. It interacts with cardinality in important ways that affect query planning:
Selectivity formulas:
For equality predicates on uniformly distributed data:
Selectivity(col = value) = 1 / C
Where C is cardinality.
| Cardinality | Selectivity per value | Rows selected (1M table) |
|---|---|---|
| 2 | 50% | 500,000 |
| 10 | 10% | 100,000 |
| 100 | 1% | 10,000 |
| 1,000 | 0.1% | 1,000 |
| 10,000 | 0.01% | 100 |
| 100,000 | 0.001% | 10 |
The selectivity-cardinality paradox:
High-cardinality columns have high selectivity (each value selects few rows)—which is traditionally good for indexing. But for bitmaps, this is a problem:
Low cardinality (good for bitmaps):
High cardinality (bad for bitmaps):
This is the opposite of B+-tree behavior! B+-trees excel when selectivity is high (quickly navigating to specific rows). Bitmaps excel when selectivity is low but multiple predicates combine to create high overall selectivity.
Bitmaps shine when individual predicates have low selectivity (match many rows) but combining multiple predicates via AND produces high selectivity (match few rows). This is the quintessential analytical query pattern: 'Sales in Q4 AND Region=West AND Category=Electronics AND Channel=Online'.
Combined selectivity example:
A sales data warehouse query:
SELECT COUNT(*) FROM Sales
WHERE Region = 'West' -- 20% selectivity (5 regions)
AND Quarter = 'Q4' -- 25% selectivity (4 quarters)
AND Category = 'Electronics' -- 10% selectivity (10 categories)
AND Channel = 'Online' -- 33% selectivity (3 channels)
Combined selectivity (assuming independence):
0.20 × 0.25 × 0.10 × 0.33 = 0.00165 = 0.165%
On a 100 million row table, this matches ~165,000 rows.
With B+-tree indexes:
With bitmap indexes:
Let's analyze typical columns in real database schemas to assess bitmap suitability:
| Column Type | Typical Cardinality | Bitmap Verdict | Reasoning |
|---|---|---|---|
| Boolean flags | 2 | ✅ Excellent | Maximum density, minimal storage |
| Gender | 2-3 | ✅ Excellent | Very low cardinality, high frequency |
| Status/State | 3-20 | ✅ Excellent | Common in workflow systems |
| Priority levels | 3-5 | ✅ Excellent | Low, high, urgent, critical, etc. |
| Day of week | 7 | ✅ Excellent | Fixed, low cardinality |
| Month | 12 | ✅ Excellent | Fixed, low cardinality |
| Quarter | 4 | ✅ Excellent | Very low cardinality |
| Year | 5-50 | ✅ Good | Depends on data range |
| Country code | 200-250 | ✅ Good | Manageable cardinality |
| US State | 50 | ✅ Good | Low cardinality |
| Product category | 10-500 | ⚠️ Consider | Depends on catalog size |
| Age | 0-120 | ⚠️ Consider | Medium cardinality, consider ranges |
| Hour of day | 24 | ✅ Good | Fixed, manageable |
| Minute of hour | 60 | ⚠️ Consider | Borderline, evaluate queries |
| Department | 10-100 | ✅ Good | Typically low cardinality |
| Brand | 100-10,000 | ⚠️ Consider | Varies widely by industry |
| City | 1,000-100,000 | ❌ Avoid | Usually too high |
| ZIP code | ~42,000 (US) | ❌ Avoid | High cardinality |
| Date (exact) | 365-36,500 | ❌ Avoid | Use range encoding or partition |
| Timestamp | Very high | ❌ Avoid | Near-unique values |
| ~unique | ❌ Avoid | Cardinality ≈ row count | |
| User ID | ~unique | ❌ Avoid | Primary key, use B+-tree |
| Phone number | ~unique | ❌ Avoid | Near-unique values |
The analysis above assumes equality queries. If your workload heavily uses range queries on dates or numeric columns, consider range-encoded bitmaps instead of equality bitmaps, or use B+-trees. Always profile with actual query patterns.
Columns with medium cardinality (roughly 100-10,000 distinct values) present a dilemma. They're too high for simple bitmap efficiency but may still benefit from bitmap operations. Several strategies address this challenge:
Strategy 1: Bitmap Compression
Compression dramatically extends bitmap viability. Run-length encoding (RLE) and Word-Aligned Hybrid (WAH) can reduce bitmap sizes by 10-100x for sparse bitmaps.
Example:
ProductID with 5,000 distinct valuesWe'll cover compression in detail in the next page.
Strategy 2: Value Grouping
Group individual values into logical categories and index the groups:
Instead of:
Use:
Category = 'Electronics' (bitmap operation)This hybrid approach uses bitmaps for coarse filtering and secondary mechanisms for fine filtering.
Strategy 3: Range/Interval Encoding
For numeric columns with medium cardinality, use interval bitmaps:
Instead of:
Use:
age BETWEEN 25 AND 45: OR intervals [20-29], [30-39], [40-49]Strategy 4: Hybrid Indexing
Maintain both bitmap and B+-tree indexes on the same column:
The query optimizer chooses the appropriate index based on query structure.
Cost: Double index maintenance overhead Benefit: Optimal performance for both query types
Many modern analytical databases (Vertica, ClickHouse, SingleStore) automatically handle medium-cardinality columns with adaptive encoding—starting with dictionary encoding, switching to bitmaps for frequent values, and using other structures for rare values. Manual optimization is becoming less necessary.
When deciding whether to use bitmap indexes, follow this systematic framework:
Step 1: Calculate cardinality ratio
ratio = COUNT(DISTINCT column) / COUNT(*)
Step 2: Classify the cardinality
IF ratio < 0.0001 (0.01%) → Very Low → STRONG bitmap candidate
IF ratio < 0.01 (1%) → Low → Good bitmap candidate
IF ratio < 0.1 (10%) → Medium → Consider with compression
IF ratio >= 0.1 → High → Avoid bitmaps
Step 3: Evaluate query patterns
IF queries heavily use multi-column AND → +1 for bitmaps
IF queries use range predicates → Consider range encoding
IF queries are point lookups → B+-tree likely better
IF queries are aggregations/counts → +1 for bitmaps
Step 4: Assess update patterns
IF table is append-only or batch-loaded → +1 for bitmaps
IF table has frequent random updates → -1 for bitmaps
IF table is partitioned by time → Bitmaps on historical partitions
Decision matrix:
| Cardinality | Query Pattern | Update Pattern | Recommendation |
|---|---|---|---|
| Very Low | Multi-column AND | Batch load | ✅ Bitmap (ideal case) |
| Very Low | Point lookup | Batch load | ⚠️ Bitmap or B+-tree |
| Very Low | Any | Frequent updates | ⚠️ Consider update cost |
| Low | Multi-column AND | Batch load | ✅ Bitmap |
| Low | Multi-column AND | Frequent updates | ⚠️ Bitmap with caution |
| Low | Range queries | Any | ⚠️ Range-encoded bitmap |
| Medium | Multi-column AND | Batch load | ⚠️ Bitmap with compression |
| Medium | Any | Frequent updates | ❌ B+-tree |
| High | Any | Any | ❌ B+-tree or hash |
Use bitmap indexes when: (1) Cardinality ratio is below 1%, (2) Queries combine multiple low-cardinality predicates, (3) The table is read-heavy or batch-loaded, and (4) Workload is analytical rather than transactional.
We've established why cardinality is the critical factor in bitmap index suitability. Let's consolidate the key insights:
What's next:
With cardinality fundamentals established, we'll explore bitmap operations in depth—the AND, OR, NOT, and XOR operations that make bitmaps powerful, including SIMD optimizations and operation ordering strategies.
You now understand why cardinality is the essential factor in bitmap index decisions—affecting space, performance, and compression effectiveness. Next, we'll explore the bit-level operations that give bitmap indexes their remarkable query performance.