Few SQL practices are as universally discouraged—yet as widely used—as SELECT *. This seemingly innocent shortcut appears in countless production systems, tutorials, and even documentation examples. It's quick to type and appears to 'just work.' But beneath this convenience lurks a cascade of performance problems that can cripple database systems at scale.
Understanding why SELECT * is problematic—and more importantly, how to avoid it—is one of the most impactful optimizations any developer can learn. The principles explored here touch every aspect of database performance: network bandwidth, memory utilization, query planning, index optimization, and long-term maintainability.
By the end of this page, you will understand the full spectrum of problems caused by SELECT *, learn to quantify its impact on your systems, master techniques for column selection optimization, and recognize the few legitimate scenarios where SELECT * is acceptable.
Before examining the problems, let's understand exactly what happens when the database engine encounters SELECT *. This knowledge is fundamental to appreciating why explicit column selection matters.
Query Expansion at Parse Time:
When you write SELECT * FROM employees, the database doesn't literally fetch 'everything.' During the parsing phase, the query processor expands the asterisk into an explicit list of every column in the table, in their defined order. If the employees table has columns id, first_name, last_name, email, department_id, salary, hire_date, manager_id, address, and phone, the query becomes:
```sql
-- What you write:
SELECT * FROM employees;

-- What the database actually executes:
SELECT id, first_name, last_name, email, department_id, salary,
       hire_date, manager_id, address, phone
FROM employees;
```
Metadata Lookup Overhead:
To perform this expansion, the database must:

- Look up the table definition in the system catalog
- Retrieve the full list of column names in their defined order
- Verify the caller's permissions on every column
This metadata lookup adds latency—small per query, but significant at high volumes. More critically, it creates a dependency on the current schema state that can cause problems we'll explore shortly.
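The expansion is observable from any database driver: the cursor metadata for a SELECT * query lists every column in its defined order. A minimal sketch using SQLite; the table and columns are illustrative:

```python
import sqlite3

# SELECT * is expanded at parse time into the table's full column list,
# in defined order -- visible via the cursor's result metadata.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, first_name TEXT, last_name TEXT, email TEXT)"
)
cursor = conn.execute("SELECT * FROM employees")
cols = [d[0] for d in cursor.description]
print(cols)  # ['id', 'first_name', 'last_name', 'email'] -- defined order
```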
Dynamic vs. Static Column Lists:
With SELECT *, your query's column list is dynamic—it changes whenever the table schema changes. With explicit column selection, your query is static—it returns exactly what you specified, regardless of schema evolution. This distinction has profound implications for application stability.
Using SELECT * creates invisible coupling between your application code and the database schema. The query 'looks' simple, but it's actually a function of the table's current column definitions—a dependency that's easy to forget until it causes production failures.
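The dynamic/static distinction can be demonstrated in a few lines. This sketch (SQLite, with a hypothetical products table) shows SELECT * silently changing shape after a schema change while an explicit column list does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
conn.execute("INSERT INTO products VALUES (1, 'Widget', 9.99)")

before = conn.execute("SELECT * FROM products").fetchone()
print(len(before))  # 3 columns

# A DBA later adds a column; no application code changes.
conn.execute("ALTER TABLE products ADD COLUMN description TEXT")

after = conn.execute("SELECT * FROM products").fetchone()
print(len(after))  # now 4 columns -- the result shape changed underneath us

explicit = conn.execute("SELECT id, name, price FROM products").fetchone()
print(len(explicit))  # still 3 columns, regardless of schema evolution
```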
The most immediate cost of SELECT * is the sheer volume of data transferred. Every unnecessary column consumes network bandwidth, increases disk I/O, and occupies memory buffers. Let's quantify this impact.
Case Study: User Profile Retrieval
Consider a typical web application that needs to display a user's name on a dashboard. The users table contains:
| Column | Data Type | Typical Size (bytes) | Needed for Display? |
|---|---|---|---|
| id | BIGINT | 8 | No |
| username | VARCHAR(50) | 25 avg | Yes |
| email | VARCHAR(255) | 30 avg | No |
| password_hash | CHAR(60) | 60 | Never |
| first_name | VARCHAR(50) | 15 avg | Yes |
| last_name | VARCHAR(50) | 15 avg | Yes |
| profile_photo | BLOB | 500,000 avg | No |
| bio | TEXT | 2,000 avg | No |
| created_at | TIMESTAMP | 8 | No |
| last_login | TIMESTAMP | 8 | No |
| settings_json | JSON | 5,000 avg | No |
The multiplication effect:

- Data actually needed per row: ~55 bytes (username, first_name, last_name)
- Data fetched per row by SELECT *: ~507 KB, dominated by profile_photo, settings_json, and bio

This roughly 9,000x difference compounds across your application:

- Every dashboard load repeats the waste
- Thousands of concurrent users multiply it
- Millions of requests per day accumulate it
*(Figure: SELECT * transfers unnecessary data)*

For high-traffic applications, the cumulative impact of SELECT * across hundreds of queries can consume terabytes of unnecessary bandwidth monthly.
Large object columns (BLOB, TEXT, JSON) are particularly dangerous with SELECT *. Even if you never use the data, the database must read it from disk (often from separate storage pages), transfer it over the network, and allocate memory for it. One forgotten TEXT column can dwarf the cost of all other columns combined.
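The back-of-envelope arithmetic behind the 9,000x figure, using the average sizes from the table above (all values are the illustrative averages, not measurements):

```python
# Sum of the "Typical Size" column for every field in the users table:
# id + username + email + password_hash + first_name + last_name
# + profile_photo + bio + created_at + last_login + settings_json
full_row = 8 + 25 + 30 + 60 + 15 + 15 + 500_000 + 2_000 + 8 + 8 + 5_000

# Only three fields are needed for the dashboard display:
needed = 25 + 15 + 15  # username + first_name + last_name

print(full_row)           # 507169 bytes, ~507 KB per row
print(full_row // needed) # 9221 -- roughly the "9,000x" figure
```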
Beyond network transfer, SELECT * affects database memory management in ways that ripple throughout system performance.
Buffer Pool Contamination:
Most databases use a buffer pool (or cache) to keep frequently accessed data pages in memory. When you request unnecessary columns, the database must load their data pages into the buffer pool, potentially evicting more useful data.
Consider a query fetching order IDs for processing:
```sql
-- Anti-pattern: Loads entire rows including large description TEXT
SELECT * FROM orders WHERE status = 'pending';

-- Optimized: Only loads pages containing order IDs
SELECT order_id FROM orders WHERE status = 'pending';
```
If the orders table includes a description TEXT column averaging 5KB per order, the first query loads orders of magnitude more data into the buffer pool than the second. This means:

- Useful pages are evicted to make room for description data nobody reads
- The cache hit ratio drops for every other query on the system
- Subsequent queries pay extra disk I/O to reload the evicted pages
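The data-volume gap between the two queries above can be sketched with SQLite and a toy orders table (row counts and the ~5 KB description are illustrative; the string-length of each fetched row is used as a crude proxy for bytes moved):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, status TEXT, description TEXT)"
)
# 100 pending orders, each carrying a ~5 KB description nobody displays.
conn.executemany(
    "INSERT INTO orders VALUES (?, 'pending', ?)",
    [(i, "x" * 5000) for i in range(100)],
)

wide = sum(
    len(str(row))
    for row in conn.execute("SELECT * FROM orders WHERE status = 'pending'")
)
narrow = sum(
    len(str(row))
    for row in conn.execute("SELECT order_id FROM orders WHERE status = 'pending'")
)
print(wide // narrow)  # hundreds of times more data for SELECT *
```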
Result Set Memory Allocation:
The database must allocate memory for the entire result set before transferring it. With SELECT *, this allocation includes space for columns you never examine (the figures below assume a 10,000-row result):
| Query Type | Columns Retrieved | Memory Allocated | Actual Data Used |
|---|---|---|---|
| SELECT * | 20 columns (1.5 KB/row) | 15 MB | 500 KB (3 columns) |
| SELECT a, b, c | 3 columns (50 bytes/row) | 500 KB | 500 KB (100%) |
Client-Side Memory Waste:
The problem extends to your application. ORMs and database drivers allocate objects or arrays to hold query results. With SELECT *, every result object contains fields for unused columns:
*(Figure: ResultSet row allocates all column objects)*

For long-running services processing millions of rows daily, this waste accumulates into gigabytes of unnecessary memory allocation and garbage collection overhead.
Memory waste from SELECT * compounds across the entire request lifecycle: database buffer pool → database result buffer → network buffers → driver buffers → application memory → serialization buffers. Each layer allocates space for columns you never use.
Perhaps the most significant—yet least understood—impact of SELECT * is on query execution plans and index utilization. This is where the difference between SELECT * and explicit column selection moves from linear waste to exponential slowdown.
Destroying Covering Index Potential:
A covering index (also called an index-only scan) is one of the most powerful optimizations available. When all columns needed by a query exist in an index, the database can satisfy the query entirely from the index without touching the table data at all.
Consider a query to find employee email addresses by department:
```sql
-- Assume this covering index exists:
CREATE INDEX idx_emp_dept_email ON employees(department_id, email);

-- Query 1: Cannot use covering index - must access table
SELECT * FROM employees WHERE department_id = 5;
-- Execution: Index seek + Table lookup for EVERY row (expensive)

-- Query 2: Uses covering index exclusively - never touches table
SELECT email FROM employees WHERE department_id = 5;
-- Execution: Index-only scan (10-100x faster)
```
Performance difference explained:
With the covering index approach:

- The index stores department_id and email together in sorted order
- The database reads only the compact index pages
- The table's data pages are never touched

With SELECT *:

- The index locates matching rows, but contains only two of the columns
- Every matching row requires a separate table lookup to fetch the rest
- For 10,000 matching rows, that means 10,000 additional random I/O operations
| Operation | SELECT * | SELECT email |
|---|---|---|
| Index scan time | 5 ms | 5 ms |
| Table lookups required | 10,000 random seeks | 0 |
| Disk I/O (worst case) | 10,000 × 10ms = 100 sec | None |
| Disk I/O (cached) | 10,000 × 0.1ms = 1 sec | None |
| Total query time (worst) | ~100 seconds | 5 ms |
| Total query time (cached) | ~1 second | 5 ms |
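The covering-index effect is easy to observe with SQLite's EXPLAIN QUERY PLAN (used here purely as an illustration; the table and index names mirror the example above). When only indexed columns are requested, the plan reports a covering-index scan; with SELECT *, it does not:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (id INTEGER, department_id INTEGER, email TEXT, salary REAL)"
)
conn.execute(
    "CREATE INDEX idx_emp_dept_email ON employees(department_id, email)"
)

def plan(sql):
    """Concatenate the 'detail' column of EXPLAIN QUERY PLAN output."""
    return " ".join(row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql))

# Only indexed columns requested: satisfied from the index alone.
print(plan("SELECT email FROM employees WHERE department_id = 5"))

# SELECT * also needs id and salary, so the table rows must be visited.
print(plan("SELECT * FROM employees WHERE department_id = 5"))
```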
Optimizer Decision Influence:
The query optimizer considers all requested columns when choosing an execution plan. With SELECT *, the optimizer knows it must eventually retrieve all columns, which influences its decisions:

- Index-only plans are ruled out, since no index covers every column
- Wider intermediate rows make sorts and joins more likely to spill to disk
- Join strategies that materialize intermediate results become more expensive
By requesting only needed columns, you give the optimizer more freedom to choose efficient plans.
EXPLAIN plans don't always show the full cost difference. A plan might show 'index seek' for both queries, but the SELECT * version incurs thousands of additional table lookups that may not be visible in the plan cost estimates. Always test with actual execution times.
Beyond immediate performance, SELECT * creates long-term maintainability challenges that compound as systems evolve.
The Fragile Query Problem:
When a schema changes, queries using SELECT * behave unpredictably:
```sql
-- Original table
CREATE TABLE products (id INT, name VARCHAR(100), price DECIMAL(10,2));

-- Application code expects: [id, name, price]
SELECT * FROM products;

-- Six months later, DBA adds columns:
ALTER TABLE products ADD COLUMN description TEXT;       -- 10KB average
ALTER TABLE products ADD COLUMN image_url VARCHAR(500);
ALTER TABLE products ADD COLUMN inventory_count INT;
ALTER TABLE products ADD COLUMN metadata JSON;          -- 50KB average

-- Same query now returns 60KB per row instead of 200 bytes
-- Application silently slows down by 300x
-- No code change, no deployment, no alert
```
Symptoms of this problem:

- Performance degrades after a schema migration, with no application deploy to blame
- Network and memory usage climb gradually as columns accumulate
- The root cause is hard to trace because "nothing changed" in the application code
Column Order Dependencies:
Some application code implicitly depends on column order when processing SELECT * results:
```python
# Dangerous: Assumes column positions
cursor.execute("SELECT * FROM users")
for row in cursor.fetchall():
    user_id = row[0]   # Assumes id is first
    username = row[1]  # Assumes username is second
    email = row[2]     # Assumes email is third

# If DBA reorders columns or adds a column before 'email',
# this code silently produces wrong results

# Safe: Explicit column selection
cursor.execute("SELECT id, username, email FROM users")
for row in cursor.fetchall():
    user_id = row[0]   # Always id
    username = row[1]  # Always username
    email = row[2]     # Always email
```
Documentation and Readability:
SELECT * tells readers nothing about the query's purpose, whereas an explicit column list documents exactly which data the surrounding code depends on.
Months later, when debugging or optimizing, explicit columns provide essential context that SELECT * obscures.
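As a complement to the positional-access pitfall shown earlier, most drivers also support access by column name, which is immune to both column reordering and newly added columns. A sketch using sqlite3.Row (the users table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # rows become name-addressable
conn.execute("CREATE TABLE users (id INTEGER, username TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES (1, 'ada', 'ada@example.com')")

row = conn.execute("SELECT id, username, email FROM users").fetchone()
print(row["email"])  # access by name, not by position
```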
SELECT * can inadvertently expose sensitive columns. When new sensitive data (SSN, password hashes, financial info) is added to a table, all SELECT * queries immediately start returning it. This has caused real data breaches when developers assumed SELECT * results wouldn't contain sensitive data.
Now that we understand the problems, let's establish concrete practices for effective column selection.
Rule 1: Always Specify Columns Explicitly
Make explicit column selection your default, not an optimization added later:
```sql
-- Instead of:
SELECT * FROM orders WHERE customer_id = 123;

-- Write:
SELECT order_id, order_date, total_amount, status
FROM orders
WHERE customer_id = 123;

-- Benefits:
-- 1. Documents exactly what data is needed
-- 2. Enables covering index optimization
-- 3. Immune to schema changes
-- 4. Uses minimal network and memory
```
Rule 2: Request Only What You Process
Audit your code to ensure every selected column is actually used:
```javascript
// Bad: Selects 20 columns, uses 3
const users = await db.query('SELECT * FROM users WHERE active = true');
const names = users.map(u => `${u.first_name} ${u.last_name}`);

// Good: Selects exactly what's needed
const users = await db.query(
  'SELECT first_name, last_name FROM users WHERE active = true'
);
const names = users.map(u => `${u.first_name} ${u.last_name}`);
```
Rule 3: Use Table Aliases in Joins
When joining tables, always qualify column names to avoid ambiguity and ensure clarity:
```sql
-- Bad: Ambiguous and fetches all columns from both tables
SELECT *
FROM orders o
JOIN customers c ON o.customer_id = c.id;

-- Good: Clear, minimal, and optimizable
SELECT
    o.order_id,
    o.order_date,
    o.total_amount,
    c.customer_name,
    c.email
FROM orders o
JOIN customers c ON o.customer_id = c.id;
```
Rule 4: Create Views for Common Column Sets
If multiple queries need the same column subset, encapsulate it in a view:
```sql
-- Create a view for the common display fields
CREATE VIEW customer_display AS
SELECT id, customer_name, email, phone, city, country
FROM customers;

-- Queries can now use the view, knowing it contains minimum necessary fields
SELECT * FROM customer_display WHERE country = 'USA';

-- Note: SELECT * from a well-designed view is acceptable because
-- the view itself enforces column discipline
```
Despite everything discussed, there are legitimate scenarios where SELECT * is appropriate. Understanding these exceptions helps apply the rules pragmatically.
Ad-Hoc Query Exploration:
During development or debugging, SELECT * helps you understand table structure and data:
```sql
-- Acceptable: Quick exploration during development
SELECT * FROM unknown_table LIMIT 10;

-- Acceptable: Debugging to see all data related to an issue
SELECT * FROM orders WHERE order_id = 12345;

-- Acceptable: Export/backup scenarios where all data is needed
SELECT * INTO backup_table FROM production_table;

-- Acceptable: Subquery where outer query filters columns
SELECT id, name FROM (
    SELECT * FROM employees WHERE department = 'Engineering'
) AS eng_employees;
```
EXISTS Subqueries:
SELECT * in an EXISTS clause is optimized away by smart query engines:
```sql
-- Acceptable: The SELECT * is never actually executed
SELECT customer_name
FROM customers c
WHERE EXISTS (
    SELECT * FROM orders o WHERE o.customer_id = c.id
);

-- The optimizer recognizes this as an existence check
-- No columns are actually retrieved from the orders table
-- Equivalent to SELECT 1 or SELECT NULL
```
Row Count Queries:
For COUNT(*), the asterisk means 'count rows' rather than 'select all columns':
```sql
-- COUNT(*) is a special case - counts rows, doesn't retrieve columns
SELECT COUNT(*) FROM orders WHERE status = 'pending';

-- This is fully optimized and does NOT suffer from SELECT * problems
-- The database can use any available index or table statistics
-- No actual row data is retrieved

-- Note: COUNT(1) and COUNT(*) are equivalent in modern databases
```
If the query will reach production code, specify columns explicitly. If you're exploring data interactively or writing a one-time script, SELECT * is fine. The distinction is: will this query be executed thousands or millions of times?
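A quick sanity check of the two special cases above, sketched with SQLite and toy data: EXISTS (SELECT *) behaves as a pure existence test, and COUNT(1) matches COUNT(*):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER, customer_name TEXT);
    CREATE TABLE orders (order_id INTEGER, customer_id INTEGER, status TEXT);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace');
    INSERT INTO orders VALUES (10, 1, 'pending'), (11, 1, 'shipped');
""")

# EXISTS (SELECT *) vs EXISTS (SELECT 1): identical results.
exists_star = conn.execute("""
    SELECT customer_name FROM customers c
    WHERE EXISTS (SELECT * FROM orders o WHERE o.customer_id = c.id)
""").fetchall()
exists_one = conn.execute("""
    SELECT customer_name FROM customers c
    WHERE EXISTS (SELECT 1 FROM orders o WHERE o.customer_id = c.id)
""").fetchall()
print(exists_star == exists_one)  # True -- only Ada has orders either way

# COUNT(*) vs COUNT(1): identical row counts.
count_star = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
count_one = conn.execute("SELECT COUNT(1) FROM orders").fetchone()[0]
print(count_star == count_one)  # True
```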
To convince stakeholders or measure improvement in your systems, use these techniques to quantify SELECT * impact.
Query Comparison Testing:
```sql
-- PostgreSQL: Compare execution statistics
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 1000;

EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT order_id, order_date, total FROM orders WHERE customer_id = 1000;

-- Key metrics to compare:
-- - 'Buffers: shared hit/read' - I/O operations
-- - 'actual time' - execution duration
-- - 'rows' vs 'width' - data volume

-- MySQL: Enable profiling
SET profiling = 1;
SELECT * FROM orders WHERE customer_id = 1000;
SELECT order_id, order_date, total FROM orders WHERE customer_id = 1000;
SHOW PROFILES; -- Compare query durations
```
Network Transfer Measurement:
```sql
-- PostgreSQL: Estimate row sizes
SELECT
    pg_column_size(t.*) AS row_size_bytes,
    pg_column_size((t.order_id, t.order_date, t.total)) AS minimal_size_bytes
FROM orders t
LIMIT 100;

-- MySQL: Get column sizes
SELECT
    column_name,
    data_type,
    character_maximum_length,
    numeric_precision
FROM information_schema.columns
WHERE table_name = 'orders';

-- Calculate total transfer size difference
-- Row size × Row count = Total bytes transferred
```
| Metric | How to Measure | Target Improvement |
|---|---|---|
| Query execution time | EXPLAIN ANALYZE | 50% reduction |
| Buffer/page reads | BUFFERS in EXPLAIN | 70% reduction |
| Network bytes | Row size × count | 80% reduction |
| Memory allocation | Application profiler | 60% reduction |
| Client processing time | Application timing | 30% reduction |
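The same before/after methodology can be wrapped in a minimal application-side harness. This sketch uses an in-memory SQLite table (table, column sizes, and row count are all illustrative) and times both query shapes with the standard library:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, total REAL, description TEXT)"
)
# 5,000 orders, each dragging along a ~2 KB description column.
conn.executemany(
    "INSERT INTO orders VALUES (?, '2024-01-01', 9.99, ?)",
    [(i, "x" * 2000) for i in range(5000)],
)

def timed_fetch(sql):
    """Return (elapsed_seconds, rows) for a query."""
    start = time.perf_counter()
    rows = conn.execute(sql).fetchall()
    return time.perf_counter() - start, rows

t_star, rows_star = timed_fetch("SELECT * FROM orders")
t_cols, rows_cols = timed_fetch("SELECT order_id, order_date, total FROM orders")

# Same row count either way -- only the payload per row differs.
print(len(rows_star), len(rows_cols))
print(f"SELECT *: {t_star:.4f}s, explicit columns: {t_cols:.4f}s")
```
Timings vary by machine and cache state, so treat the printed durations as directional rather than exact; the point is that the comparison lives in your own application code, where client-side processing cost is included.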
Search your codebase for 'SELECT *' patterns. Each instance is a potential quick win. Prioritize queries that: (1) execute frequently, (2) touch tables with large columns, and (3) return many rows. These offer the highest return on optimization effort.
We've explored one of the most impactful yet simplest query optimizations available. Let's consolidate the key principles:

- SELECT * expands at parse time into every column, coupling queries to the current schema
- Unneeded columns waste network bandwidth, buffer pool space, and client memory
- Explicit column lists enable covering indexes and give the optimizer better plans
- Explicit columns stay stable under schema changes and document intent
- SELECT * remains acceptable for ad-hoc exploration, EXISTS subqueries, and COUNT(*)
What's next:
The next page explores another fundamental optimization: limiting result sets. You'll learn techniques for efficient pagination, the performance implications of unbounded queries, and strategies for processing large data sets without overwhelming your systems.
You now understand why SELECT * is one of the most common SQL anti-patterns and how explicit column selection can dramatically improve query performance. This single practice—applied consistently—can reduce network traffic, memory usage, and query times by orders of magnitude.