Every database query ultimately needs to examine data stored on disk. The table scan—also known as a full table scan or sequential scan—is the most fundamental access method in relational database systems. It works exactly as its name suggests: the database reads every single row in a table, examines each one, and returns those that satisfy the query conditions.
While table scans are often viewed negatively (associated with poor performance), they are far more nuanced than their reputation suggests. Understanding when table scans are appropriate—and when they are catastrophic—is essential knowledge for any database professional.
By the end of this page, you will understand exactly how table scans work at the storage layer, why they can be both the best and worst access method choice, how to recognize when table scans are appropriate, and how modern databases optimize sequential access patterns for maximum throughput.
A table scan is conceptually simple but involves sophisticated machinery at the storage engine level. Let's trace exactly what happens when a database executes a table scan:
Step 1: Page-Level Organization
Database tables are not stored as flat files but as collections of fixed-size pages (typically 4KB, 8KB, or 16KB depending on the DBMS). Each page contains multiple rows, along with metadata such as page headers, slot directories, and free space pointers.
When executing a table scan, the database doesn't read individual rows—it reads entire pages. This is crucial because disk I/O operates at the block level, not the byte level. Reading one byte costs the same as reading 8KB if that's your page size.
```
function TableScan(table, predicate):
    result = []
    // Iterate through all pages belonging to this table
    for each page_id in table.page_list:
        // Read page from disk into buffer pool
        page = buffer_pool.fetch(page_id)
        // Iterate through all rows (tuples) in the page
        for each row in page.rows:
            // Apply the WHERE clause predicate
            if predicate.evaluate(row) == true:
                result.append(row)
        // Page may be unpinned from buffer (LRU eviction)
        buffer_pool.unpin(page_id)
    return result
```

Step 2: Buffer Pool Interaction
The database doesn't directly read from disk for every page. Instead, it uses a buffer pool—an in-memory cache of recently accessed pages. When a table scan requests a page:

- If the page is already cached (a buffer hit), it is returned directly from memory.
- If not (a buffer miss), the page is read from disk—possibly evicting another cached page—and then returned.
For a fresh table scan on a cold cache, nearly every page will be a buffer miss. For repeated scans of the same table, many pages may already be cached.
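The hit-or-miss behavior can be sketched with a toy LRU buffer pool. This is a minimal Python illustration; `BufferPool` and its methods are invented for this sketch, not any engine's real API:

```python
from collections import OrderedDict

class BufferPool:
    """Minimal LRU buffer pool sketch: fetch returns a cached page or
    simulates a disk read on a miss, evicting the least recently used page."""
    def __init__(self, capacity, disk):
        self.capacity = capacity
        self.disk = disk          # simulated disk: page_id -> page contents
        self.cache = OrderedDict()
        self.hits = 0
        self.misses = 0

    def fetch(self, page_id):
        if page_id in self.cache:
            self.hits += 1
            self.cache.move_to_end(page_id)   # mark as recently used
            return self.cache[page_id]
        self.misses += 1
        page = self.disk[page_id]             # simulated disk I/O
        if len(self.cache) >= self.capacity:
            self.cache.popitem(last=False)    # evict the LRU page
        self.cache[page_id] = page
        return page

# A cold scan of 10 pages misses every time; a second scan of the
# same pages (pool large enough) hits every time.
pool = BufferPool(capacity=16, disk={i: f"page-{i}" for i in range(10)})
for i in range(10):
    pool.fetch(i)
print(pool.hits, pool.misses)  # 0 10
for i in range(10):
    pool.fetch(i)
print(pool.hits, pool.misses)  # 10 10
```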
Step 3: Sequential Access Optimization
Modern disk subsystems and operating systems are highly optimized for sequential access patterns. When a database detects that it's reading pages sequentially (adjacent page_id values), several optimizations kick in:

- Read-ahead (prefetching): the database or OS fetches upcoming pages before they are requested, overlapping I/O with processing.
- Request merging: the I/O scheduler combines adjacent block requests into fewer, larger reads.
- Minimal seek overhead: on spinning disks, the head stays on track instead of repositioning for every read.
On traditional hard drives, sequential reads can achieve 100-200 MB/s, while random reads may only achieve 1-5 MB/s due to seek overhead. This 20-100x performance difference is why table scans often outperform index scans when a large fraction of the table must be read.
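The throughput gap translates directly into scan times. A quick back-of-the-envelope calculation, using mid-range values from the figures above:

```python
table_mb = 1000                     # a 1 GB table

seq_mb_per_s = 150                  # mid-range sequential HDD throughput
rand_mb_per_s = 3                   # mid-range random HDD throughput

seq_seconds = table_mb / seq_mb_per_s
rand_seconds = table_mb / rand_mb_per_s

print(f"sequential: {seq_seconds:.1f}s, random: {rand_seconds:.1f}s, "
      f"ratio: {rand_seconds / seq_seconds:.0f}x")
# sequential: 6.7s, random: 333.3s, ratio: 50x
```

Reading the same gigabyte via random page accesses takes 50x longer at these throughput figures, which is exactly the regime where a sequential table scan beats chasing index pointers.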
Understanding the cost of a table scan is essential for query optimization. The cost model must account for both I/O and CPU components.
I/O Cost
The I/O cost of a table scan is straightforward: you must read every page of the table. If a table has B pages, and each page read costs t_I/O time units, the I/O cost is:
I/O Cost = B × t_I/O
However, this formula assumes all pages require disk I/O. In reality, some pages may be in the buffer pool. A more accurate model incorporates the buffer hit rate (h):
I/O Cost = B × (1 - h) × t_disk + B × h × t_memory
Where:
- t_disk ≈ 5-10 ms for HDD, 0.1-0.5 ms for SSD
- t_memory ≈ 0.0001 ms (100 nanoseconds)
- h varies from 0% (cold cache) to approaching 100% (hot table)

| Scenario | Pages (B) | Buffer Hit Rate | Disk I/O Time | Total I/O Cost |
|---|---|---|---|---|
| Small table, cold cache | 100 | 0% | 5ms/page | 500ms |
| Small table, warm cache | 100 | 80% | 5ms/page | 100ms |
| Large table, cold cache | 100,000 | 0% | 5ms/page | 500 seconds (8.3 min) |
| Large table, SSD | 100,000 | 0% | 0.2ms/page | 20 seconds |
| Large table, fully cached | 100,000 | 100% | N/A | 10ms (memory only) |
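The buffer-aware formula reproduces the table above directly. A small helper function (hypothetical, for illustration; t_memory defaults to the 0.0001 ms figure from the text):

```python
def io_cost_ms(pages, hit_rate, t_disk_ms, t_memory_ms=0.0001):
    """I/O cost = B * (1 - h) * t_disk + B * h * t_memory, in milliseconds."""
    return pages * (1 - hit_rate) * t_disk_ms + pages * hit_rate * t_memory_ms

print(io_cost_ms(100, 0.0, 5))        # 500.0 ms   (small table, cold cache)
print(io_cost_ms(100, 0.8, 5))        # ~100 ms    (small table, warm cache)
print(io_cost_ms(100_000, 0.0, 5))    # 500000.0 ms = 500 s (large table, cold HDD)
print(io_cost_ms(100_000, 0.0, 0.2))  # ~20000 ms = 20 s    (large table, SSD)
print(io_cost_ms(100_000, 1.0, 5))    # ~10 ms              (fully cached)
```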
CPU Cost
Beyond I/O, the CPU must examine every row to evaluate the WHERE clause predicate. If a table has N rows, and evaluating the predicate costs t_cpu per row:
CPU Cost = N × t_cpu
For simple predicates (equality checks, comparisons), t_cpu is negligible. For complex predicates involving string matching, function calls, or subqueries, CPU cost can become significant.
Total Cost Formula
Combining both components:
Total Table Scan Cost = Pages × t_I/O + Rows × t_cpu
Or with buffer considerations:
Total Cost = Pages × [(1-h) × t_disk + h × t_memory] + Rows × t_cpu
For typical OLTP workloads on spinning disks, I/O cost dominates by 4-6 orders of magnitude. A 5ms disk read is 50,000 times slower than a 100-nanosecond memory access. This is why reducing I/O is the primary focus of query optimization.
Despite their reputation, table scans are often the best choice. Understanding these scenarios is crucial for proper query tuning.
Scenario 1: Low Selectivity (Large Result Fractions)
When a query must return a large fraction of the table's rows (typically >10-20%), table scans outperform index scans. Consider queries that return 30-40% of the rows:
```sql
-- Returns ~40% of rows (assume status distribution is skewed)
-- Table scan likely optimal here
SELECT * FROM orders WHERE status != 'pending';

-- Returns all rows - table scan is the ONLY option
SELECT * FROM orders;

-- Aggregation over entire table - no alternative to full scan
SELECT COUNT(*), SUM(total_amount) FROM orders;
```

Scenario 2: Small Tables
For very small tables (fitting in a few pages), table scans are almost always optimal. The overhead of using an index—traversing B-tree levels, following pointers—exceeds the cost of just reading all rows directly.
Rule of thumb: If a table fits in 10-20 pages (80-320KB at 8KB page size), index overhead likely exceeds index benefit.
| Table Size | Pages (8KB) | Rows (~100B each) | Recommended Access |
|---|---|---|---|
| < 100 KB | < 12 | < 1,000 | Always table scan |
| 100 KB - 1 MB | 12 - 125 | 1,000 - 10,000 | Table scan unless selectivity < 1% |
| 1 MB - 100 MB | 125 - 12,500 | 10K - 1M | Depends on selectivity (~5-15% threshold) |
| > 100 MB | > 12,500 | > 1M | Index scan for selectivity < 10-20% |
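The selectivity thresholds above can be derived from a simplified cost comparison: a table scan reads every page once, while an unclustered index scan pays a B+-tree descent plus, in the worst case, one heap page fetch per matching row. A sketch under those assumptions (a deliberately crude model, not any real optimizer's):

```python
import math

def scan_pages(total_pages):
    # A table scan always reads every page exactly once.
    return total_pages

def index_pages(total_rows, selectivity):
    # Simplified: B+-tree descent (~log base 100 of the row count) plus
    # one heap page per matching row (worst case: no clustering).
    tree_levels = max(1, math.ceil(math.log(total_rows, 100)))
    return tree_levels + math.ceil(total_rows * selectivity)

def break_even_selectivity(total_rows, rows_per_page=100):
    total_pages = math.ceil(total_rows / rows_per_page)
    s = 0.0
    while index_pages(total_rows, s) < scan_pages(total_pages) and s < 1.0:
        s += 0.0001
    return s

# With ~100 rows/page and no clustering, the index stops paying off at
# around 1% selectivity: each extra matching row can cost a whole page read.
print(round(break_even_selectivity(1_000_000), 3))  # 0.01
```

Real optimizers use richer models (clustering factor, cached fractions, correlated columns), which pushes the practical break-even point up toward the 5-20% range the table shows.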
Scenario 3: No Suitable Index Exists
If no index supports the query's filter conditions, a table scan is the only option. This is the most common reason for table scans in practice.
Scenario 4: Clustered Table Organization
If the table is physically sorted by (clustered on) the column being queried, a table scan can efficiently skip irrelevant sections through early termination.
Scenario 5: Analytical Workloads (OLAP)
Data warehousing and analytics queries typically aggregate, transform, or analyze entire data sets. These workloads are inherently full-scan operations, which is why columnar databases optimize specifically for this pattern.
Not every table scan indicates a missing index. Some queries genuinely need to examine most of the table. Adding an index may actually slow these queries down while consuming storage and slowing writes. Always analyze selectivity before creating indexes.
While table scans have legitimate uses, they can also cripple system performance. Recognizing these anti-patterns is critical.
Anti-Pattern 1: Needle in a Haystack
Querying for a single row (or tiny fraction of rows) in a large table via table scan is the classic performance disaster:
```sql
-- Finding one user in 10 million: O(n) vs O(log n)
-- Without index: Reads potentially ALL 10 million rows
-- With index: Reads ~4 tree levels + 1 data page = 5 pages
SELECT * FROM users WHERE user_id = 12345;

-- Pagination on large tables without proper indexing
-- This gets progressively worse as offset increases
SELECT * FROM logs ORDER BY created_at LIMIT 20 OFFSET 1000000;

-- Join with table scan on large table
-- Forces full scan for EACH row in the driving table
SELECT o.*, c.name FROM orders o JOIN customers c ON c.customer_id = o.customer_id;
-- If customers has no index on customer_id, disaster ensues
```

Anti-Pattern 2: Table Scan in Loop / Nested Operations
When a table scan occurs inside a nested loop join or application-level iteration, the cost multiplies:
Total Cost = Outer Loop Iterations × Table Scan Cost
If the outer loop runs 10,000 times and each table scan takes 1 second, total time is 10,000 seconds (2.8 hours)!
Anti-Pattern 3: Buffer Pool Pollution
Large table scans can evict other frequently-accessed pages from the buffer pool, causing collateral damage to unrelated queries. A single analytical query scanning a terabyte table can flush the entire cache, suddenly making all OLTP queries hit disk.
Modern databases address this with scan-resistant buffer pools that use separate buffers for sequential scans, or algorithms like LRU-K that don't promote sequential scan pages as aggressively.
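The ring-buffer idea can be sketched as follows: sequential scans recycle a small, fixed set of slots, so even a huge scan cannot evict the main cache. This is an illustrative toy, loosely modeled on the ring-buffer strategy PostgreSQL uses for large scans, not a real implementation:

```python
from collections import OrderedDict

class ScanResistantPool:
    """Toy scan-resistant buffer pool: sequential reads go through a small
    ring buffer so a large scan cannot pollute the main LRU cache."""
    def __init__(self, main_capacity, ring_capacity):
        self.main = OrderedDict()         # LRU cache for normal accesses
        self.main_capacity = main_capacity
        self.ring = []                    # fixed-size ring for scans
        self.ring_capacity = ring_capacity

    def fetch(self, page_id, sequential=False):
        if sequential:
            self.ring.append(page_id)
            if len(self.ring) > self.ring_capacity:
                self.ring.pop(0)          # recycle the oldest ring slot
            return
        if page_id in self.main:
            self.main.move_to_end(page_id)
            return
        if len(self.main) >= self.main_capacity:
            self.main.popitem(last=False)
        self.main[page_id] = True

pool = ScanResistantPool(main_capacity=100, ring_capacity=8)
for p in range(100):
    pool.fetch(p)                     # hot OLTP pages fill the main pool
for p in range(1000, 11_000):
    pool.fetch(p, sequential=True)    # a 10,000-page scan stays in the ring
print(len(pool.main), len(pool.ring))  # 100 8 -- the hot pages survive
```

With a plain shared LRU, the 10,000-page scan would have flushed all 100 hot pages; here it only ever occupies 8 ring slots.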
The most dangerous table scans are those that work fine in development (small data) but fail in production (large data). Always test queries with production-scale data volumes. A 10ms query on 1,000 rows becomes a 10,000ms query on 1,000,000 rows if it's doing a table scan.
Modern databases implement several optimizations and variants of the basic table scan.
Parallel Sequential Scan
For large tables, databases can divide the table into segments and scan multiple segments concurrently using separate worker processes. PostgreSQL, Oracle, and SQL Server all support parallel table scans.
With P parallel workers and B pages:
Parallel Scan Time ≈ (B / P) × t_I/O
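Plugging numbers into the formula shows the idealized speedup. This assumes perfect work division and zero coordination overhead:

```python
def parallel_scan_seconds(pages, workers, t_io_ms):
    """Idealized model from the text: time ≈ (B / P) * t_I/O."""
    return (pages / workers) * t_io_ms / 1000

# A 100,000-page table at 5 ms/page: 500 s serial, 125 s with 4 workers.
print(parallel_scan_seconds(100_000, 1, 5))   # 500.0
print(parallel_scan_seconds(100_000, 4, 5))   # 125.0
```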
However, parallelism is limited by:

- I/O bandwidth: all workers share the same storage device, so once its throughput is saturated, extra workers add nothing.
- Worker startup and coordination overhead, which can exceed the benefit for small tables.
- Configured limits on worker counts (e.g. max_parallel_workers_per_gather in PostgreSQL).
```sql
-- PostgreSQL parallel scan execution plan
EXPLAIN (ANALYZE) SELECT COUNT(*) FROM large_table;

/*
Finalize Aggregate (actual=1 loops=1)
  -> Gather (actual=4 loops=1)
       Workers Planned: 3
       Workers Launched: 3
       -> Partial Aggregate (actual=1 loops=4)
            -> Parallel Seq Scan on large_table (actual=250000 loops=4)
               -- Each worker scans ~250K rows of the 1M total
*/
```

Synchronized Scans
When multiple queries scan the same large table concurrently, modern databases can synchronize them. Instead of each query starting from page 1, a new scan joins an already-in-progress scan at its current position, wraps around the table, and finishes where it started.
This provides:

- Better buffer pool reuse: pages read by one scan are consumed by the others while still cached.
- Reduced total disk I/O when multiple scans of the same table overlap in time.
Sample Scans
For approximate analytics, databases can scan a random sample of pages rather than all pages:
```sql
-- PostgreSQL tablesample
SELECT AVG(amount) FROM orders TABLESAMPLE SYSTEM (10); -- 10% of pages
```
Sample scans trade accuracy for speed—useful for exploratory data analysis.
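The trade-off can be simulated. Averaging a 10% page-level sample, in the spirit of TABLESAMPLE SYSTEM (10), gets close to the exact answer while touching a tenth of the pages (synthetic data, illustrative only):

```python
import random

random.seed(42)
# Simulated table: 10,000 pages x 100 rows of order amounts in [0, 1000).
pages = [[random.uniform(0, 1000) for _ in range(100)] for _ in range(10_000)]

# Full scan: exact average over every row.
all_rows = [amount for page in pages for amount in page]
exact = sum(all_rows) / len(all_rows)

# Page-level 10% sample: pick whole pages at random (as SYSTEM sampling
# does) and average only their rows.
sampled_pages = random.sample(pages, k=len(pages) // 10)
sampled_rows = [amount for page in sampled_pages for amount in page]
estimate = sum(sampled_rows) / len(sampled_rows)

print(f"exact={exact:.1f} estimate={estimate:.1f}")
# The estimate lands within a few units of the true mean (~500) while
# reading only 10% of the pages.
```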
Each major RDBMS implements table scans with subtle differences. Understanding these helps in cross-platform optimization.
| DBMS | Operator Name | Parallel Support | Special Features |
|---|---|---|---|
| PostgreSQL | Seq Scan / Parallel Seq Scan | Yes (configurable workers) | Synchronized scans, TABLESAMPLE |
| MySQL (InnoDB) | Full Table Scan | Limited (8.0+) | Clustered index scan (always uses PK order) |
| Oracle | TABLE ACCESS FULL | Yes (DOP parameter) | Direct path reads, serial/parallel modes |
| SQL Server | Table Scan / Clustered Index Scan | Yes (MaxDOP) | Batch mode, columnstore integration |
| SQLite | SCAN TABLE | No | Very simple implementation |
PostgreSQL Details
PostgreSQL's sequential scan reads the table's pages in physical order. (The visibility map is used by vacuum and index-only scans, not to skip pages during a sequential scan.) The seq_page_cost parameter (default 1.0) lets you tune the optimizer's estimated cost of a sequential page read relative to a random one (random_page_cost, default 4.0).
MySQL/InnoDB Details
InnoDB tables are organized as clustered indexes (B+ trees ordered by primary key). A 'full table scan' in InnoDB is actually a full clustered index scan—it reads leaf pages in primary key order, not physical order. This means:

- Rows come back in primary key order, not insertion or on-disk order.
- Scan throughput degrades as the clustered index becomes fragmented, since logically adjacent leaf pages may not be physically adjacent.
- There is no separate heap to scan; the table is the leaf level of the B+ tree.
Oracle Details
Oracle distinguishes between buffered scans (read through buffer cache) and direct path reads (bypass cache, read directly to process memory). Direct path reads are faster for large scans because they avoid buffer pool contention, but they skip the cache entirely.
In execution plans, look for 'Seq Scan' (PostgreSQL), 'Full Table Scan' or 'type: ALL' (MySQL), 'TABLE ACCESS FULL' (Oracle), or 'Table Scan' (SQL Server). These all indicate the full table scan access method.
Let's examine real execution plans and analyze when table scans are appropriate.
```sql
-- Create a sample table with 1 million rows
CREATE TABLE orders (
    order_id SERIAL PRIMARY KEY,
    customer_id INTEGER NOT NULL,
    status VARCHAR(20) NOT NULL,
    total_amount DECIMAL(10,2) NOT NULL,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Insert test data
INSERT INTO orders (customer_id, status, total_amount)
SELECT
    (random() * 10000)::int,
    (ARRAY['pending','processing','shipped','delivered'])[1 + (random()*3)::int],
    (random() * 1000)::decimal(10,2)
FROM generate_series(1, 1000000);

-- Force statistics update
ANALYZE orders;

-- Example 1: Full aggregation (table scan is optimal)
EXPLAIN (ANALYZE, BUFFERS) SELECT COUNT(*), AVG(total_amount) FROM orders;

/*
Aggregate (cost=20834.00..20834.01 rows=1) (actual time=347.123..347.124 rows=1)
  Buffers: shared hit=8334
  -> Seq Scan on orders (cost=0.00..18334.00 rows=1000000) (actual time=0.011..127.456 rows=1000000)
     Buffers: shared hit=8334
Planning Time: 0.052 ms
Execution Time: 347.156 ms
*/

-- Example 2: Low selectivity (table scan appropriate)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE status = 'delivered';

/*
Seq Scan on orders (cost=0.00..20834.00 rows=250032) (actual time=0.017..312.456 rows=250089)
  Filter: ((status)::text = 'delivered'::text)
  Rows Removed by Filter: 749911
  Buffers: shared hit=8334
*/
-- ~25% selectivity = table scan is faster than index + heap lookups

-- Example 3: High selectivity (index would be better)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE order_id = 500000;

/*
Seq Scan on orders (cost=0.00..20834.00 rows=1) (actual time=156.234..223.456 rows=1)
  Filter: (order_id = 500000)
  Rows Removed by Filter: 999999
  Buffers: shared hit=8334
*/
-- Finding 1 row scanned ALL 1 million - catastrophically inefficient!

-- Create index to fix
CREATE INDEX idx_orders_id ON orders(order_id);

EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE order_id = 500000;

/*
Index Scan using idx_orders_id on orders (cost=0.42..8.44 rows=1) (actual time=0.019..0.020 rows=1)
  Index Cond: (order_id = 500000)
  Buffers: shared hit=4
*/
-- Now: 4 buffer hits instead of 8334!
```

You now have comprehensive knowledge of table scans—the most fundamental access method in relational databases.
What's Next:
Now that we understand table scans, we'll explore index scans—the primary alternative access method. Index scans trade sequential access simplicity for targeted, selective data retrieval, enabling efficient lookups in tables of any size.
You now understand table scans at a deep level—their mechanics, cost models, optimal use cases, and dangerous anti-patterns. This knowledge is fundamental to understanding all other access methods and query optimization decisions.