A database administrator who keeps all knowledge in their head is a liability, not an asset. When that person goes on vacation, gets sick, or leaves the organization, their knowledge leaves with them. Critical procedures become guesswork. Simple tasks become archeological expeditions.
Documentation transforms individual expertise into organizational capability.
Good documentation enables consistent operations regardless of who is on call. It reduces errors during stressful incidents. It accelerates onboarding of new team members. It provides the institutional memory that survives personnel changes.
Yet documentation is often neglected—seen as overhead rather than essential infrastructure. This is a mistake that compounds over time, creating technical debt that eventually manifests as extended outages and costly mistakes.
By the end of this page, you will understand the essential categories of database documentation, master runbook creation for operational procedures, develop architecture and configuration documentation, establish change logging and audit trails, and create disaster recovery documentation that works under pressure.
Database documentation serves multiple audiences and purposes. Understanding these helps prioritize what to document and how to structure it.
The essential categories of database documentation, along with their purpose, update cadence, and primary audience:
| Category | Purpose | Update Frequency | Primary Audience |
|---|---|---|---|
| Architecture Diagrams | Understand system topology and relationships | On significant changes | All |
| Configuration Documentation | Record settings and their rationale | When configs change | Operations, New hires |
| Operational Runbooks | Step-by-step procedures for common tasks | Quarterly review | Operations, On-call |
| Disaster Recovery Plans | Recovery procedures for major failures | Semi-annual review + test | All engineers |
| Change Logs | Record of all changes with context | Every change | All, Auditors |
| Incident Postmortems | Lessons learned from outages | After each incident | All engineers |
| Capacity Planning | Growth projections and resource planning | Quarterly | Management, DBAs |
| Security Documentation | Access controls, encryption, compliance | On policy changes | Security, Auditors |
Documentation becomes outdated the moment it's written. Systems change, but docs don't get updated. Outdated documentation is often worse than no documentation—it misleads during critical moments. Build documentation review into your regular maintenance cycles.
Architecture documentation provides the high-level view of database systems—what exists, how components connect, and why they're structured that way. This is often the first thing new team members need and the reference point during major changes.
````markdown
# Database Architecture Documentation

## 1. Environment Overview

| Environment | Purpose | Databases | HA Strategy |
|-------------|----------------|--------------------------------|--------------|
| Production | Live workloads | OrdersDB, UsersDB, AnalyticsDB | Always On AG |
| Staging | Pre-prod tests | OrdersDB, UsersDB | Single node |
| Development | Dev/Testing | OrdersDB, UsersDB | Single node |

## 2. Production Cluster Topology

### Primary Datacenter (DC-EAST)
- **db-prod-01** (Primary for OrdersDB, UsersDB)
  - CPU: 64 cores | RAM: 512GB | Storage: 20TB SAN
  - IP: 10.0.1.10 | FQDN: db-prod-01.internal
- **db-prod-02** (Secondary Replica)
  - CPU: 64 cores | RAM: 512GB | Storage: 20TB SAN
  - IP: 10.0.1.11 | FQDN: db-prod-02.internal

### Secondary Datacenter (DC-WEST)
- **db-prod-03** (Async Replica for DR)
  - CPU: 64 cores | RAM: 512GB | Storage: 20TB SAN
  - IP: 10.1.1.10 | FQDN: db-prod-03.internal

## 3. Connection Information

### Application Connection Strings
```
Production Listener: prod-db-ag.internal:1433
Read-Only Routing: prod-db-ag.internal:1433;ApplicationIntent=ReadOnly
```

### Administrative Access
- Jump host: db-jump.internal (requires MFA)
- Direct RDP: Restricted to DB team (VPN + certificate)

## 4. Database Inventory

| Database | Size | Recovery Model | Backup Schedule |
|-------------|--------|----------------|------------------------------|
| OrdersDB | 2.1 TB | Full | Full: Daily 2AM, Log: 15min |
| UsersDB | 150 GB | Full | Full: Daily 2AM, Log: 15min |
| AnalyticsDB | 5.4 TB | Simple | Full: Weekly Sun 3AM |

## 5. Replication Topology

```
[OrdersDB - Primary (db-prod-01)]
    │
    ├──> [Sync Replica (db-prod-02)]  ──> Used for failover
    │
    └──> [Async Replica (db-prod-03)] ──> DR site, read offload
```

## 6. Network Architecture

```
Internet -> WAF -> App Servers (10.0.2.0/24)
                        │
                        v
              Load Balancer (10.0.1.5)
                        │
                        v
           Database Tier (10.0.1.0/24)
           [Firewall: 1433 only from 10.0.2.0/24]
```

---
*Last Updated: 2024-01-15 by J. Smith*
*Review Schedule: Quarterly*
````

Consider using diagram-as-code tools like Mermaid, PlantUML, or Graphviz. These allow diagrams to be version-controlled, diffed, and updated alongside other documentation. They're also easier to keep current than manually edited graphics.
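Architecture documents drift when the live system changes quietly underneath them. One way to catch drift during reviews is to compare the documented replica topology against the server's own view of it. A minimal sketch, assuming SQL Server Always On as in the example above (the availability group and replica names are illustrative):

```sql
-- Compare the documented replication topology against the live Availability Group state
SELECT
    ar.replica_server_name,          -- e.g. db-prod-01, db-prod-02, db-prod-03
    ars.role_desc,                   -- PRIMARY / SECONDARY
    ar.availability_mode_desc,       -- SYNCHRONOUS_COMMIT / ASYNCHRONOUS_COMMIT
    ar.failover_mode_desc            -- AUTOMATIC / MANUAL
FROM sys.availability_replicas AS ar
JOIN sys.dm_hadr_availability_replica_states AS ars
    ON ars.replica_id = ar.replica_id
ORDER BY ars.role_desc DESC, ar.replica_server_name;
```

If the output disagrees with Section 5 of the document, one of the two is wrong, and either way the review has done its job.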
Configuration documentation records what settings are in place and why. The 'why' is crucial—it preserves the reasoning behind decisions so future engineers don't unknowingly reverse important changes.
| Element | What to Document | Example |
|---|---|---|
| Server Parameters | All non-default settings with rationale | max_connections = 500 (anticipated peak: 400) |
| Memory Configuration | Buffer pools, caches, work memory | shared_buffers = 128GB (25% of 512GB RAM) |
| Storage Configuration | File locations, sizes, growth settings | Log file pre-sized to 100GB to prevent auto-growth |
| Security Settings | Authentication modes, encryption | SSL required for all connections |
| Replication Settings | Sync mode, timeouts, commit behavior | synchronous_commit = on for durability |
| Maintenance Settings | Autovacuum, checkpoints, jobs | checkpoint_timeout = 15min (balance recovery time vs performance) |
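Rather than transcribing settings by hand, you can pull every explicitly set parameter straight from the server and annotate the rationale afterwards. A minimal sketch for PostgreSQL, the DBMS used in the example document below; the filter on `source` is one reasonable choice, not the only one:

```sql
-- List parameters that were explicitly set (i.e., not left at their defaults),
-- as a starting point for the configuration document
SELECT name,
       setting,                  -- current value
       unit,                     -- unit of the setting, where applicable
       boot_val AS default_value,
       source                    -- where the value came from (configuration file, etc.)
FROM pg_settings
WHERE source NOT IN ('default', 'override')
ORDER BY name;
```

The query captures the "what"; the rationale column in the document still has to be written by a person.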
```markdown
# PostgreSQL Configuration Documentation
## Server: db-prod-01

### Memory Configuration

| Parameter | Value | Default | Rationale |
|-----------|-------|---------|-----------|
| shared_buffers | 128GB | 128MB | 25% of 512GB RAM per PostgreSQL recommendations |
| effective_cache_size | 384GB | 4GB | 75% of RAM - OS cache + shared_buffers |
| work_mem | 256MB | 4MB | Higher for complex analytics queries. Caution: multiplied by (max_connections × sort operations) |
| maintenance_work_mem | 4GB | 64MB | Faster VACUUM, index builds. Only one maintenance op at a time per session |

**Note**: work_mem was increased from 64MB to 256MB on 2023-08-15 after analyzing slow sort operations. See ticket DB-1234 for analysis.

### Connection Configuration

| Parameter | Value | Default | Rationale |
|-----------|-------|---------|-----------|
| max_connections | 500 | 100 | Peak connections observed: 380. Headroom for spikes |
| superuser_reserved_connections | 5 | 3 | Ensure admin access during connection exhaustion |

**Warning**: max_connections × work_mem = potential 128GB memory usage for sorts.
If connection pool grows, may need to reduce work_mem.

### Write-Ahead Log (WAL) Configuration

| Parameter | Value | Default | Rationale |
|-----------|-------|---------|-----------|
| wal_level | replica | replica | Required for streaming replication |
| max_wal_senders | 5 | 10 | 2 standby + 3 backup connections |
| wal_keep_size | 4GB | 0 | Retain WAL for standby catchup during network issues |
| archive_mode | on | off | Required for PITR |
| archive_command | pgbackrest --stanza=main archive-push %p | - | Using pgBackRest for WAL archiving |

### Checkpoint Configuration

| Parameter | Value | Default | Rationale |
|-----------|-------|---------|-----------|
| checkpoint_timeout | 15min | 5min | Reduce checkpoint frequency, accept longer recovery |
| checkpoint_completion_target | 0.9 | 0.9 | Spread checkpoint I/O over 90% of interval |
| max_wal_size | 8GB | 1GB | Allow more WAL between checkpoints |

**Trade-off**: Longer checkpoint intervals improve write performance but extend crash recovery time. Current setting = ~10 min recovery time.

---
*Reviewed: 2024-01-01 by DBA Team*
*Next Review: 2024-04-01*
```

Runbooks are step-by-step procedures for common operational tasks. They enable consistent execution regardless of who performs the task and are invaluable during high-stress incidents when clear thinking is compromised.
An effective runbook states its purpose and impact, lists prerequisites, gives exact commands with expected output for each step, and includes verification, rollback, and escalation paths. The example below shows this structure:
````markdown
# Runbook: Database Failover to DR Site

## Overview
- **Purpose**: Failover production databases from DC-EAST to DC-WEST
- **When to Use**: Primary datacenter failure, planned maintenance, DR drill
- **Duration**: ~15 minutes (planned), ~30 minutes (unplanned)
- **Impact**: Brief connection interruption during DNS update
- **Owner**: Database Team
- **Last Tested**: 2024-01-10 (quarterly DR drill)

## Prerequisites
- [ ] Confirm DR replica is synchronized (lag < 5 seconds)
- [ ] Notify Application Team (app-team@company.com)
- [ ] Notify NOC (noc@company.com)
- [ ] Ensure you have admin access to DR site

## Procedure

### Step 1: Assess Current State
```sql
-- Run on DR replica to check synchronization
SELECT DB_NAME(database_id) AS database_name,
       synchronization_state_desc,
       synchronization_health_desc,
       secondary_lag_seconds
FROM sys.dm_hadr_database_replica_states
WHERE is_local = 1;
```
**Expected**: synchronization_state = SYNCHRONIZED, lag < 5 seconds

### Step 2: Verify DR Replica Readiness
```powershell
# Check replica is accessible
Test-NetConnection -ComputerName db-prod-03.internal -Port 1433
```
**Expected**: TcpTestSucceeded = True

### Step 3: Initiate Failover

#### If Primary is Available (Graceful):
```sql
-- Run on the target secondary replica (the replica that will become primary)
ALTER AVAILABILITY GROUP [ProdAG] FAILOVER;
```

#### If Primary is Unavailable (Forced):
```sql
-- Run on DR replica (db-prod-03)
ALTER AVAILABILITY GROUP [ProdAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
```
⚠️ **Warning**: Forced failover may lose committed transactions that had not yet been synchronized to the DR replica.

### Step 4: Verify Failover Success
```sql
-- Run on new primary (db-prod-03)
SELECT replica_server_name, role_desc
FROM sys.dm_hadr_availability_replica_states ars
JOIN sys.availability_replicas ar ON ars.replica_id = ar.replica_id;
```
**Expected**: db-prod-03 shows role_desc = PRIMARY

### Step 5: Update DNS (if required)
```powershell
# If AG listener didn't handle automatically:
# Update DNS record for prod-db.company.com to 10.1.1.10
# Contact: network-team@company.com
```

### Step 6: Verify Application Connectivity
- [ ] Application health check: https://app.company.com/health
- [ ] Run test transaction through application
- [ ] Check application logs for database connection errors

### Step 7: Post-Failover Tasks
- [ ] Notify stakeholders of successful failover
- [ ] Update status page
- [ ] Schedule failback (if this was planned maintenance)
- [ ] Document any issues in incident ticket

## Rollback Procedure
If failover fails or causes issues:

1. If old primary is available:
   ```sql
   ALTER AVAILABILITY GROUP [ProdAG] FAILOVER;
   ```
2. If old primary is unavailable: Restore from backup (see DR-RESTORE-001)

## Escalation
- Database Team Lead: John Smith (+1-555-0100)
- On-call DBA: Via PagerDuty
- VP Infrastructure: Jane Doe (+1-555-0101) - for extended outage

## Related Documents
- DR-RESTORE-001: Full Database Restore Procedure
- DR-NETWORK-001: Network Failover Procedure
- BACKUP-RESTORE-001: Backup Verification

---
*Last Updated: 2024-01-10 after DR drill*
*Next Review: 2024-04-10*
````

A good runbook should be executable by someone who has never performed the task before. Test this by having a junior team member (or someone from another team) follow the runbook while you observe. Every question they ask reveals a gap in the documentation.
Change logging records every modification to database systems—configurations, schema changes, data fixes, and deployments. This audit trail is essential for troubleshooting, compliance, and understanding how systems evolved over time.
| Component | Description | Example |
|---|---|---|
| Timestamp | When the change was made | 2024-01-15 14:30:00 UTC |
| Who | Person or service making the change | jsmith, deploy-service |
| What | Description of change | Added index on orders.customer_id |
| Why | Business/technical justification | Query DB-1234 was causing timeouts |
| Ticket Reference | Link to change request/ticket | JIRA: DB-1234, CHG-5678 |
| Before State | State before change (if applicable) | No index on customer_id |
| After State | State after change | Index IX_orders_customer_id created |
| Rollback Plan | How to revert if needed | DROP INDEX IX_orders_customer_id |
```sql
-- SQL Server: Automated change tracking table
CREATE TABLE dbo.DatabaseChangeLog (
    ChangeID INT IDENTITY PRIMARY KEY,
    ChangeTimestamp DATETIME2 DEFAULT SYSDATETIME(),
    DatabaseName SYSNAME,
    ChangeType VARCHAR(50),
    ObjectName SYSNAME NULL,
    ChangedBy SYSNAME DEFAULT SUSER_SNAME(),
    ChangeDescription NVARCHAR(MAX),
    TicketReference VARCHAR(100),
    BeforeState NVARCHAR(MAX),
    AfterState NVARCHAR(MAX),
    RollbackScript NVARCHAR(MAX),
    ExecutedScript NVARCHAR(MAX)
);
GO

-- DDL trigger to automatically log schema changes
CREATE TRIGGER LogDatabaseChanges
ON DATABASE
FOR DDL_DATABASE_LEVEL_EVENTS
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @EventData XML = EVENTDATA();

    INSERT INTO dbo.DatabaseChangeLog (
        DatabaseName, ChangeType, ObjectName,
        ChangeDescription, ExecutedScript
    )
    VALUES (
        @EventData.value('(/EVENT_INSTANCE/DatabaseName)[1]', 'nvarchar(128)'),
        @EventData.value('(/EVENT_INSTANCE/EventType)[1]', 'varchar(50)'),
        @EventData.value('(/EVENT_INSTANCE/ObjectName)[1]', 'nvarchar(128)'),
        'Automated DDL capture',
        @EventData.value('(/EVENT_INSTANCE/TSQLCommand/CommandText)[1]', 'nvarchar(max)')
    );
END;
GO

-- Manual change log entry (for non-DDL changes)
INSERT INTO dbo.DatabaseChangeLog (
    DatabaseName, ChangeType, ObjectName, ChangeDescription,
    TicketReference, BeforeState, AfterState, RollbackScript
)
VALUES (
    'OrdersDB',
    'Configuration Change',
    'max_degree_of_parallelism',
    'Increased MAXDOP from 4 to 8 to improve query performance',
    'JIRA: DBA-1234',
    'MAXDOP = 4',
    'MAXDOP = 8',
    'EXEC sp_configure ''max_degree_of_parallelism'', 4; RECONFIGURE;'
);

-- Query recent changes
SELECT ChangeTimestamp, ChangeType, ObjectName, ChangedBy,
       ChangeDescription, TicketReference
FROM dbo.DatabaseChangeLog
WHERE ChangeTimestamp > DATEADD(DAY, -7, GETDATE())
ORDER BY ChangeTimestamp DESC;
```

Disaster recovery documentation is the most critical documentation you'll create—and the most important to get right. During a disaster, stress is high, systems may be unavailable, and there's no time for improvisation.
DR documentation must be accessible when primary systems are down, specific enough to follow under stress, and validated by regular testing. The example plan below illustrates the structure:
````markdown
# Disaster Recovery Plan: Database Systems

## 1. Recovery Objectives

| System | RTO | RPO | Priority | Recovery Strategy |
|--------|-----|-----|----------|-------------------|
| OrdersDB | 1 hour | 15 minutes | P1 - Critical | Failover to DR replica |
| UsersDB | 1 hour | 15 minutes | P1 - Critical | Failover to DR replica |
| AnalyticsDB | 24 hours | 24 hours | P3 - Low | Restore from backup |
| ReportingDB | 4 hours | 1 hour | P2 - High | Rebuild from OLTP |

## 2. Emergency Contacts

### Internal
| Role | Primary | Secondary | Email |
|------|---------|-----------|-------|
| DBA On-Call | PagerDuty | +1-555-0100 | dba@company.com |
| DBA Manager | J. Smith (+1-555-0101) | M. Chen (+1-555-0102) | jsmith@company.com |
| VP Infrastructure | A. Johnson (+1-555-0103) | - | ajohnson@company.com |
| Application Team | PagerDuty | +1-555-0200 | app-team@company.com |

### Vendors
| Vendor | Support Number | Contract ID | SLA |
|--------|----------------|-------------|-----|
| Microsoft SQL | 1-800-936-3500 | MSFT-12345 | 1 hour response (Premier) |
| AWS Support | Via Console | - | 15 min response (Enterprise) |
| SAN Vendor | 1-800-555-0300 | SAN-67890 | 4 hour onsite |

## 3. Recovery Procedures

### Scenario: Complete Primary Datacenter Failure

**Assumptions**:
- DC-EAST is completely unavailable
- DR site (DC-WEST) is operational
- Last synchronization was within RPO

**Step 1: Confirm Disaster Status (5 min)**

1. Attempt to contact DC-EAST NOC: +1-555-0400
2. Verify inability to access any DC-EAST resources
3. Decision authority: VP Infrastructure or DBA Manager
4. If confirmed, proceed with DR activation

**Step 2: Notify Stakeholders (parallel with Step 3)**

Send to: Executive-Team, App-Team, NOC, Support
Subject: [DISASTER] Database DR Activation - [TIMESTAMP]

```
STATUS: DISASTER RECOVERY IN PROGRESS

Impact: All production database services
Cause: Primary datacenter unavailable
Action: Failing over to DR site (DC-WEST)
ETA to Recovery: 60 minutes

Next Update: 30 minutes

DBA Team Lead: [Name, Phone]
```

**Step 3: Activate DR Databases (15 min)**

3.1 Connect to DR database server:
   - Server: db-prod-03.internal (DC-WEST)
   - Access: Via DC-WEST jump host (dr-jump.internal)
   - Credentials: See DR Credential Safe (physical binder in DR site)

3.2 Check last synchronization state:
   ```sql
   SELECT DB_NAME(database_id) AS database_name,
          last_hardened_lsn,
          last_hardened_time,
          DATEDIFF(SECOND, last_hardened_time, GETDATE()) AS lag_seconds
   FROM sys.dm_hadr_database_replica_states
   WHERE is_local = 1;
   ```
   **Document the lag for incident report**

3.3 Force failover:
   ```sql
   ALTER AVAILABILITY GROUP [ProdAG] FORCE_FAILOVER_ALLOW_DATA_LOSS;
   ```
   ⚠️ Data since last_hardened_time will be lost

3.4 Verify databases are online:
   ```sql
   SELECT name, state_desc FROM sys.databases;
   ```

**Step 4: Update Network Routing (10 min)**

- Contact Network Team: +1-555-0500
- Request DNS update: prod-db.company.com → 10.1.1.10
- Verify propagation: `nslookup prod-db.company.com`

**Step 5: Validate Application Connectivity (15 min)**

- [ ] Applications can connect to database
- [ ] Health checks passing
- [ ] Test transactions succeeding
- [ ] Error rates normal in monitoring

**Step 6: Communicate Recovery (5 min)**

Send to: Executive-Team, App-Team, NOC, Support
Subject: [RECOVERED] Database Services Restored - [TIMESTAMP]

```
STATUS: RECOVERY COMPLETE

Database services have been restored to DR site.
Data loss (if any): [X seconds of transactions]

Monitoring: DBA team is actively monitoring

Post-Incident Review: Scheduled for [DATE/TIME]

Questions: Contact DBA On-Call
```

---
**DR Plan Version**: 2.3
**Last Full DR Test**: 2024-01-10
**Last Update**: 2024-01-11
**Next Scheduled Review**: 2024-04-01
````

If your DR documentation is only accessible through systems that might fail during a disaster, it's useless when you need it most. Maintain printed copies in a physical binder at each site, and have offline copies on personal devices of key personnel.
Creating documentation is only half the battle. Keeping it current and useful requires ongoing effort and organizational commitment.
| Document Type | Review Frequency | Review Includes | Owner |
|---|---|---|---|
| DR Procedures | Quarterly + after changes | Test execution, contact updates | DBA Manager |
| Architecture Diagrams | Semi-annually + after changes | Accuracy verification | Lead DBA |
| Operational Runbooks | Quarterly | Test execution, step verification | Document owner |
| Configuration Docs | After every change | Match against live systems | Change implementer |
| Security Policies | Annually + after incidents | Compliance review | Security Team + DBAs |
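One way to make this schedule enforceable is to track documents and their review dates somewhere queryable, so stale items surface during routine maintenance checks. A minimal sketch in T-SQL; the table and column names are hypothetical, not part of any standard tooling:

```sql
-- Hypothetical documentation inventory for tracking review cadence
CREATE TABLE dbo.DocumentationInventory (
    DocID              INT IDENTITY PRIMARY KEY,
    DocName            NVARCHAR(200) NOT NULL,
    DocType            VARCHAR(50)   NOT NULL,  -- e.g. 'Runbook', 'DR Plan', 'Architecture'
    OwnerTeam          NVARCHAR(100) NOT NULL,
    ReviewIntervalDays INT           NOT NULL,  -- 90 = quarterly, 180 = semi-annual, 365 = annual
    LastReviewed       DATE          NOT NULL
);

-- Flag documents whose scheduled review date has passed
SELECT DocName, DocType, OwnerTeam, LastReviewed,
       DATEADD(DAY, ReviewIntervalDays, LastReviewed) AS ReviewDue
FROM dbo.DocumentationInventory
WHERE DATEADD(DAY, ReviewIntervalDays, LastReviewed) < CAST(GETDATE() AS DATE)
ORDER BY ReviewDue;
```

Reviewing the overdue list can simply be added as a step in an existing weekly or monthly maintenance checklist.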
After every incident, update documentation. What information was missing? What steps were unclear? What would have helped? Incidents are expensive lessons—capture the learnings in documentation so you don't pay again.
Documentation is infrastructure. It enables consistent operations, reduces dependency on individuals, accelerates incident response, and preserves institutional knowledge. Investing in documentation pays dividends every time someone needs to understand, operate, or troubleshoot your database systems.
Module Complete:
Congratulations! You have completed the Maintenance Tasks module. You now understand the essential ongoing activities that keep databases healthy, secure, and performant.
These maintenance activities, performed consistently and skillfully, are what separate amateur database administration from professional database engineering. They're the invisible work that keeps systems running reliably—day after day, year after year.
You have mastered the essential maintenance tasks that define professional database administration. Apply these practices to build robust, reliable, and well-documented database environments that enable your organization's success.