When you query a database, how does the system know that the 'customers' table exists? How does it verify that 'email' is a valid column? How does the query optimizer know that there's an index on 'customer_id'? The answer lies in metadata—the data about data that makes database systems intelligent, self-describing, and manageable.

Metadata is like the nervous system of a database. It's invisible to most users, but without it, nothing would work. The database engine consults metadata thousands of times per second to parse queries, plan executions, enforce security, and manage resources.

Understanding metadata transforms you from a database user into a database practitioner. You move from trusting the magic to understanding the machinery.
By the end of this page, you will understand what metadata is, why it's essential, how it's organized in the system catalog, how to query and use metadata programmatically, and how metadata powers everything from query optimization to data governance. You'll gain practical skills for introspecting any database system.
Metadata is data that provides information about other data. In the context of databases, metadata describes the structure, relationships, constraints, and properties of the data stored in the database.

Consider this analogy: a library card catalog doesn't contain the books—it contains information about the books: titles, authors, subjects, publication dates, shelf locations. Similarly, database metadata doesn't contain your customer records—it contains information about the tables that hold customers, the columns in those tables, the indexes that speed up searches, and the constraints that ensure data quality.

Types of Metadata in Databases:

1. Structural Metadata:
Describes the organization of data:
- Table and view definitions
- Column names, types, and constraints
- Primary and foreign key relationships
- Index definitions

2. Descriptive Metadata:
Provides context for understanding data:
- Table and column comments/descriptions
- Business glossary definitions
- Data lineage (where data came from)
- Ownership and custodian information

3. Administrative Metadata:
Supports database operations:
- Storage statistics (row counts, table sizes)
- Access permissions and security policies
- Performance statistics (index usage, query patterns)
- Backup and recovery information

4. Operational Metadata:
Tracks data processing:
- Last modification times
- ETL job execution history
- Data quality metrics
- Change data capture logs
| Category | What It Describes | Example |
|---|---|---|
| Structural | Data organization | customers table has columns: id, email, name |
| Semantic | Data meaning | email column contains RFC 5322 compliant addresses |
| Administrative | Data governance | PII data requires encryption at rest |
| Operational | Data processing | ETL job last ran at 2024-01-15 03:00:00 UTC |
| Statistical | Data characteristics | customers table has 10.5 million rows |
| Quality | Data health | email column has 0.3% NULL rate |
Every DBMS maintains an internal repository of metadata called the system catalog (also known as the data dictionary or metadata repository). The system catalog is itself a set of tables and views—metadata is stored as data!

This creates an elegant self-referential design: the system catalog tables describe all tables in the database, including themselves. You can query the catalog to discover what's in the catalog.

System Catalog Contents:

A typical system catalog contains information about:
- Schemas/Databases: Logical groupings of objects
- Tables: Relations that store data
- Views: Virtual tables defined by queries
- Columns: Attributes of tables and views
- Constraints: Rules enforced on data
- Indexes: Structures for fast data access
- Functions/Procedures: Stored code
- Triggers: Automatic responses to events
- Users/Roles: Security principals
- Privileges: Access permissions
- Statistics: Cardinality and distribution data
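To make the self-reference concrete, here is a minimal sketch in PostgreSQL: the catalog table that lists relations contains a row describing itself. The broader exploration queries follow in the next block.

```sql
-- The catalog describes itself: pg_class lists every relation,
-- including the pg_class table that stores the listing.
SELECT relname, relkind
FROM pg_catalog.pg_class
WHERE relname = 'pg_class';
```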
```sql
-- PostgreSQL system catalog exploration
-- The pg_catalog schema contains core metadata tables

-- List all tables in your database
SELECT schemaname, tablename, tableowner
FROM pg_catalog.pg_tables
WHERE schemaname NOT IN ('pg_catalog', 'information_schema')
ORDER BY schemaname, tablename;

-- The pg_catalog tables themselves follow naming conventions:
-- pg_class → all relations (tables, indexes, sequences, views)
-- pg_attribute → all columns
-- pg_constraint → all constraints
-- pg_index → index information
-- pg_proc → functions and procedures
-- pg_type → data types
-- pg_namespace → schemas
-- pg_roles → users and roles

-- Example: Find all columns and their types for a specific table
SELECT
    a.attname AS column_name,
    t.typname AS data_type,
    a.attnotnull AS not_null,
    pg_get_expr(d.adbin, d.adrelid) AS default_value
FROM pg_attribute a
JOIN pg_class c ON a.attrelid = c.oid
JOIN pg_type t ON a.atttypid = t.oid
LEFT JOIN pg_attrdef d ON a.attrelid = d.adrelid AND a.attnum = d.adnum
WHERE c.relname = 'customers'
  AND a.attnum > 0          -- Exclude system columns
  AND NOT a.attisdropped
ORDER BY a.attnum;
```

The INFORMATION_SCHEMA is a SQL standard for metadata access, supported by most databases with consistent naming. Native catalogs (pg_catalog in PostgreSQL, the mysql database in MySQL) offer more detailed, DBMS-specific information. Use INFORMATION_SCHEMA for portability; use native catalogs when you need deep system details.
Being able to query metadata programmatically is a superpower. It enables automation, documentation generation, validation, and introspection that would be tedious or impossible manually.

Common Metadata Queries:
```sql
-- ============================================
-- PRACTICAL METADATA QUERIES
-- ============================================

-- 1. DISCOVER TABLE STRUCTURE
-- Useful for dynamic code generation or documentation
SELECT
    column_name,
    data_type,
    character_maximum_length,
    is_nullable,
    column_default
FROM information_schema.columns
WHERE table_name = 'orders'
ORDER BY ordinal_position;

-- 2. FIND ALL FOREIGN KEY RELATIONSHIPS
-- Invaluable for understanding the data model
SELECT
    tc.table_name AS source_table,
    kcu.column_name AS source_column,
    ccu.table_name AS target_table,
    ccu.column_name AS target_column
FROM information_schema.table_constraints tc
JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
JOIN information_schema.constraint_column_usage ccu
    ON tc.constraint_name = ccu.constraint_name
WHERE tc.constraint_type = 'FOREIGN KEY';

-- 3. FIND TABLES WITHOUT PRIMARY KEYS
-- Data quality check - every table should have a PK
SELECT t.table_name
FROM information_schema.tables t
LEFT JOIN information_schema.table_constraints tc
    ON t.table_name = tc.table_name
    AND tc.constraint_type = 'PRIMARY KEY'
WHERE t.table_schema = 'public'
  AND t.table_type = 'BASE TABLE'
  AND tc.constraint_name IS NULL;

-- 4. TABLE SIZES AND ROW COUNTS
-- Essential for capacity planning
SELECT
    relname AS table_name,
    n_live_tup AS estimated_row_count,
    pg_size_pretty(pg_total_relation_size(relid)) AS total_size,
    pg_size_pretty(pg_relation_size(relid)) AS data_size,
    pg_size_pretty(pg_indexes_size(relid)) AS index_size
FROM pg_stat_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- 5. FIND UNUSED INDEXES
-- Indexes cost write performance; remove unused ones
SELECT
    schemaname,
    relname AS table_name,
    indexrelname AS index_name,
    idx_scan AS times_used,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size
FROM pg_stat_user_indexes
WHERE idx_scan = 0
ORDER BY pg_relation_size(indexrelid) DESC;

-- 6. FIND COLUMNS WITH MOST NULLs
-- Data quality indicator
SELECT
    attname AS column_name,
    null_frac AS null_fraction,
    n_distinct AS distinct_values
FROM pg_stats
WHERE tablename = 'customers'
ORDER BY null_frac DESC;

-- 7. GENERATE CREATE TABLE STATEMENT
-- Documentation and migration helper
-- Some systems expose a built-in DDL generator (e.g., GET_DDL in Snowflake);
-- PostgreSQL does not, so use pg_dump --schema-only or approximate it:
SELECT 'CREATE TABLE ' || table_name || ' (' ||
       string_agg(column_name || ' ' || data_type, ', ') || ');'
FROM information_schema.columns
WHERE table_name = 'customers'
GROUP BY table_name;
```

Structural metadata describes what data looks like; descriptive metadata explains what it means. Most databases support adding comments to objects, storing business-friendly descriptions alongside technical definitions.

Adding Comments to Database Objects:
```sql
-- Add descriptive comments to tables and columns
-- These become part of the schema and are queryable

-- Table comment
COMMENT ON TABLE customers IS
    'Master table of customer accounts. Contains one row per registered customer. Excludes anonymous/guest checkouts. See also: orders, subscriptions.';

-- Column comments
COMMENT ON COLUMN customers.id IS
    'Auto-generated unique identifier. Exposed in APIs as customer_id.';

COMMENT ON COLUMN customers.email IS
    'Primary contact email. Used for authentication. Must be unique per tenant. Verified emails have verified_at set.';

COMMENT ON COLUMN customers.status IS
    'Account status: active (normal), suspended (payment failed), banned (TOS violation). Only active customers can place orders.';

COMMENT ON COLUMN customers.created_at IS
    'Timestamp of account creation in UTC. Immutable after initial insert.';

COMMENT ON COLUMN customers.metadata IS
    'JSONB field for flexible additional data. Common keys: referral_source, preferred_language, marketing_preferences. Not indexed by default.';

-- Query comments back
SELECT
    c.column_name,
    c.data_type,
    c.is_nullable,
    pgd.description AS comment
FROM information_schema.columns c
LEFT JOIN pg_catalog.pg_statio_user_tables st
    ON c.table_name = st.relname
LEFT JOIN pg_catalog.pg_description pgd
    ON pgd.objoid = st.relid
    AND pgd.objsubid = c.ordinal_position
WHERE c.table_name = 'customers'
ORDER BY c.ordinal_position;

-- Get table comment
SELECT obj_description('customers'::regclass, 'pg_class');
```

Building a Data Dictionary:

A data dictionary is a comprehensive reference document describing every element of your database. While you can create this manually, generating it from metadata + comments ensures it stays synchronized with the actual schema:
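As a starting point, here is a minimal sketch of such a generator for PostgreSQL. It joins the standard column metadata with the comments added above; restricting to the public schema is an assumption.

```sql
-- Sketch: build a column-level data dictionary from metadata plus comments.
-- Assumes PostgreSQL and that the tables of interest live in the public schema.
SELECT
    c.table_name,
    c.column_name,
    c.data_type,
    c.is_nullable,
    col_description(
        format('%I.%I', c.table_schema, c.table_name)::regclass,
        c.ordinal_position::int
    ) AS description
FROM information_schema.columns c
WHERE c.table_schema = 'public'
ORDER BY c.table_name, c.ordinal_position;
```

The result set can be exported to Markdown or HTML as part of a documentation build, so the dictionary is regenerated whenever the schema changes.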
Add comments when you create tables, not as an afterthought. Make COMMENT ON part of your migration files. Treat undocumented columns as technical debt. Future team members (and future you) will thank past you for the context.
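For example, a minimal sketch of a migration step that documents a column in the same change that creates it (the loyalty_tier column is hypothetical):

```sql
-- Hypothetical migration: add a column and its documentation together
ALTER TABLE customers ADD COLUMN loyalty_tier text;

COMMENT ON COLUMN customers.loyalty_tier IS
    'Loyalty program tier: bronze, silver, or gold. NULL means not enrolled.';
```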
Extended Metadata Systems:

For enterprise data governance, built-in comments aren't enough. Organizations use dedicated metadata management systems:

- Apache Atlas — Open-source metadata and governance for Hadoop ecosystems
- AWS Glue Data Catalog — Managed metadata for AWS data lakes
- Collibra — Enterprise data governance platform
- Alation — Data catalog with ML-powered discovery
- dbt — Data transformation tool with rich documentation support

These systems track lineage (where did this data come from?), impact analysis (what breaks if I change this?), and business glossaries (what does 'customer' mean in our company?).
One of the most critical uses of metadata is powering the query optimizer. To choose the best execution plan, the optimizer needs statistical information about data distribution.

Key Statistics:

- Row count (cardinality) — How many rows in each table?
- Distinct values — How many unique values in each column?
- NULL fraction — What percentage of values are NULL?
- Histograms — How are values distributed across the range?
- Correlation — How much do column values align with physical row order?
- Most common values — What values appear most frequently?

These statistics dramatically affect query planning. Without accurate statistics, the optimizer might:
- Choose a full table scan when an index would be faster
- Use a nested loop join when a hash join would be better
- Seriously underestimate result sizes, causing memory issues
```sql
-- View statistical metadata in PostgreSQL
-- pg_stats contains detailed column statistics

SELECT
    tablename,
    attname AS column_name,
    null_frac AS fraction_null,
    avg_width AS avg_bytes_per_value,
    n_distinct AS distinct_values,   -- Positive: actual count; Negative: -fraction of rows
    most_common_vals AS common_values,
    most_common_freqs AS common_frequencies,
    correlation
FROM pg_stats
WHERE tablename = 'orders'
ORDER BY attname;

-- Example output interpretation:
-- n_distinct = -1.0 means all values are unique (like a primary key)
-- n_distinct = 5 means exactly 5 distinct values (like an enum)
-- n_distinct = -0.1 means about 10% of rows have unique values

-- correlation = 1.0 means perfectly ordered (ascending)
-- correlation = -1.0 means reverse ordered (descending)
-- correlation = 0.0 means random order (index range scans less efficient)

-- UPDATE STATISTICS
-- Statistics become stale as data changes
-- Run ANALYZE to refresh:
ANALYZE customers;   -- Update stats for one table
ANALYZE;             -- Update stats for entire database

-- In production, autovacuum handles this automatically
-- But after bulk loads, run ANALYZE manually

-- View when statistics were last updated
SELECT
    relname AS table_name,
    last_analyze,
    last_autoanalyze,
    n_live_tup AS estimated_rows,
    n_dead_tup AS dead_rows
FROM pg_stat_user_tables
WHERE schemaname = 'public';
```

After large data loads or deletions, statistics can be wildly inaccurate. A table that grew from 1,000 to 10 million rows will have plans optimized for 1,000 rows until ANALYZE runs. Always run ANALYZE after bulk operations. Many mysterious 'the database suddenly got slow' issues trace to stale statistics.
How the Optimizer Uses Statistics:

When you run EXPLAIN ANALYZE on a query, you can see the optimizer's estimates:

```sql
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE customer_id = 12345;
```

The output shows:
- Estimated rows — What the optimizer predicted based on statistics
- Actual rows — What actually happened during execution
- Estimated cost — Relative effort (in arbitrary units)
- Actual time — Real execution time in milliseconds

When estimated and actual rows differ significantly, statistics may be stale or the data model has unusual distributions that confuse the optimizer.
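As an illustration only, the output of the query above might look roughly like this; the plan shape, the index name idx_orders_customer_id, and every number are invented for the example:

```
Index Scan using idx_orders_customer_id on orders  (cost=0.43..12.05 rows=4 width=88) (actual time=0.025..0.061 rows=37 loops=1)
  Index Cond: (customer_id = 12345)
  Buffers: shared hit=6
Planning Time: 0.110 ms
Execution Time: 0.095 ms
```

Here the planner estimated 4 matching rows but the scan returned 37. A gap that small is harmless; when the estimate is off by orders of magnitude, it usually means statistics are stale (run ANALYZE) or the column needs a larger statistics target.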
In enterprise environments, metadata is central to data governance—the policies, processes, and standards that ensure data is secure, compliant, and trustworthy.

Key Governance Use Cases:

1. Data Classification:

Metadata tags identify sensitive data:
- PII (Personally Identifiable Information)
- PHI (Protected Health Information)
- Financial data
- Public/Internal/Confidential classifications

2. Access Control:

Metadata defines who can access what:
- Role-based permissions
- Row-level security policies
- Column-level encryption requirements

3. Data Lineage:

Metadata tracks data flow:
- Where did this data originate?
- What transformations were applied?
- What reports depend on this table?

4. Data Quality:

Metadata captures quality metrics:
- Completeness (% non-null)
- Accuracy (validation pass rate)
- Timeliness (freshness)
- Consistency (cross-system agreement)
Implementing Governance Metadata:

Most databases don't have built-in regulatory classification. Organizations implement this through:

1. Naming conventions — Prefixes like pii_, encrypted_ signal sensitivity
2. Separate metadata tables — Custom tables mapping columns to classifications
3. External catalogs — Tools like Apache Atlas, Collibra, or AWS Glue
4. Tags/Labels — Cloud databases often support key-value tags on resources
5. Comments — Structured comments following a convention (e.g., JSON in comments)
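A minimal sketch of the second approach, a custom classification table, under the assumption that no built-in tagging feature is available (the table name, labels, and rows are hypothetical):

```sql
-- Hypothetical governance table: map columns to sensitivity classifications
CREATE TABLE data_classification (
    table_name  text NOT NULL,
    column_name text NOT NULL,
    sensitivity text NOT NULL
        CHECK (sensitivity IN ('public', 'internal', 'confidential', 'pii')),
    notes       text,
    PRIMARY KEY (table_name, column_name)
);

INSERT INTO data_classification VALUES
    ('customers', 'email', 'pii', 'Mask in non-production environments'),
    ('customers', 'status', 'internal', NULL);

-- Cross-check against the catalog: list PII columns that actually exist
SELECT dc.table_name, dc.column_name, dc.notes
FROM data_classification dc
JOIN information_schema.columns c
  ON c.table_name = dc.table_name
 AND c.column_name = dc.column_name
WHERE dc.sensitivity = 'pii';
```

Because the classification lives in an ordinary table, it can be queried and joined like any other metadata, and a periodic check can flag classified columns that have been dropped or new columns that have never been classified.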
Regulations like GDPR require knowing where personal data lives (data discovery), who accessed it (audit logs), and proving it's deleted on request (lineage tracking). This compliance pressure has elevated metadata management from 'nice to have' to 'legally required' in many industries.
Beyond SQL queries, applications often access metadata programmatically through database drivers and APIs. This enables building dynamic, schema-aware applications.

Common Programmatic Approaches:
```typescript
// Programmatic metadata access examples

// Using Prisma (TypeScript ORM)
import { PrismaClient } from '@prisma/client';
// Using pg (raw PostgreSQL driver)
import { Client } from 'pg';

const prisma = new PrismaClient();

// Prisma generates types from schema - compile-time metadata!
// The schema IS the metadata, enforced by TypeScript
const customers = await prisma.customer.findMany({
  select: {
    id: true,
    email: true,
    // IDE autocomplete shows available columns
  },
});

// Raw introspection query
const tables = await prisma.$queryRaw`
  SELECT table_name, column_name, data_type
  FROM information_schema.columns
  WHERE table_schema = 'public'
`;

const client = new Client();
await client.connect();

// DatabaseMetaData equivalent: describe a table
const { rows } = await client.query(
  `SELECT column_name, data_type, is_nullable, column_default
   FROM information_schema.columns
   WHERE table_name = $1
   ORDER BY ordinal_position`,
  ['customers']
);

console.log('Columns:', rows);
// Output: Array of column definitions

// Dynamic query building based on metadata
function buildSelectQuery(tableName: string, columns: string[]) {
  // Validate requested columns against the introspected metadata
  // (prevents SQL injection through column names)
  const validColumns = columns.filter((col) =>
    rows.some((r) => r.column_name === col)
  );

  if (validColumns.length === 0) {
    throw new Error('No valid columns specified');
  }

  return `SELECT ${validColumns.join(', ')} FROM ${tableName}`;
}
```

We've explored metadata comprehensively—from its definition through its practical applications. Let's consolidate the key insights:
Module Complete:

With this page on metadata, we've completed our exploration of Schemas and Instances. You now understand:

- What schemas are and how they're defined (Page 1)
- What instances are and how they change (Page 2)
- The crucial differences between schemas and instances (Page 3)
- How schemas evolve over time (Page 4)
- How metadata makes it all manageable (Page 5)

This knowledge forms the foundation for everything that follows in database systems—from data modeling to query optimization to administration. Every concept builds on the schema-instance-metadata triad.
Congratulations! You've mastered the concepts of database schemas and instances—the fundamental duality at the heart of database systems. You understand what schemas define, how instances change, why they must remain separate, how schemas evolve, and how metadata ties it all together. This foundational knowledge will serve you throughout your database career.