Every data warehouse journey begins with a fundamental challenge: How do we reliably extract data from operational systems without disrupting their primary purpose? This question sits at the heart of the Extract phase in ETL (Extract, Transform, Load) processing—the critical first step in building any analytical data infrastructure.
The Extract phase isn't merely about 'copying data.' It's a sophisticated engineering discipline that must navigate heterogeneous source systems, handle constantly changing data, respect operational constraints, and maintain data lineage—all while guaranteeing that the exact state of business data at any point in time can be accurately captured and replayed.
Production databases serve transactions. They're optimized for writes, updates, and point queries. Extracting analytical workloads from these systems—potentially billions of rows across hundreds of tables—requires strategies that minimize impact while maximizing completeness and accuracy. Get extraction wrong, and every downstream transformation and analysis inherits the error.
By the end of this page, you will understand the complete landscape of data extraction: source system types and their characteristics, full versus incremental extraction strategies, change data capture (CDC) mechanisms, extraction scheduling patterns, and the engineering trade-offs that govern extraction architecture decisions.
Before designing an extraction strategy, you must deeply understand the source systems from which data will be pulled. Each source type presents unique access patterns, data formats, and operational constraints that fundamentally shape how extraction must proceed.
The heterogeneity problem:
In enterprise environments, data rarely lives in a single, unified system. A typical organization might have relational OLTP databases behind its applications, legacy mainframe systems, SaaS platforms such as a CRM, event streaming infrastructure, flat-file feeds from partners, and NoSQL document stores.
Each of these systems has different authentication mechanisms, query interfaces, rate limits, data models, and change tracking capabilities. The Extract phase must normalize this chaos into a coherent, reliable data flow.
| Source Type | Access Method | Change Tracking | Extraction Challenge |
|---|---|---|---|
| Relational DBMS | SQL/JDBC/ODBC | Timestamps, CDC, Log Mining | Lock contention, query performance impact |
| Mainframe/Legacy | Flat file exports, MQ | Batch file timestamps | Complex record formats, encoding issues |
| SaaS Applications | REST APIs, Webhooks | API-provided timestamps | Rate limits, pagination, API versioning |
| Event Streams | Kafka, Kinesis consumers | Offsets, partitions | Ordering guarantees, exactly-once semantics |
| Files (CSV, JSON) | File system, S3, SFTP | File modification time | Schema drift, encoding, corrupt files |
| NoSQL Databases | Native drivers, change streams | Oplogs, change feeds | Document variability, denormalized data |
Operational database considerations:
When extracting from production OLTP systems, you're essentially a secondary workload competing for shared resources. Critical considerations include:
Read replicas: Many organizations provision read replicas specifically for reporting and ETL. Extracting from replicas protects primary database performance but introduces replication lag considerations.
Connection pooling: Long-running extraction queries consume connection slots. Misconfigured extractions can exhaust connection pools, blocking application traffic.
Query optimization: Extraction queries should use appropriate indexes and avoid table scans during peak hours. Monitoring execution plans is essential.
Transaction isolation: The isolation level affects both data consistency and locking behavior. READ COMMITTED typically balances accuracy with minimal blocking.
Never let ETL extraction degrade the operational systems it reads from. If your extraction causes production outages, you've violated the fundamental contract. Design for minimal impact: use read replicas, schedule during off-peak windows, implement query governors, and monitor resource consumption continuously.
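As a concrete illustration, some of these guardrails can be set on the extraction connection itself. The following is a minimal SQL Server sketch, assuming the extraction runs over its own dedicated session; the specific values are illustrative, not recommendations.

```sql
-- Session-level guardrails for an extraction connection (SQL Server)
SET TRANSACTION ISOLATION LEVEL READ COMMITTED;  -- read only committed data
SET LOCK_TIMEOUT 5000;                           -- wait at most 5 seconds on blocking locks, then error out
SET QUERY_GOVERNOR_COST_LIMIT 300;               -- refuse queries whose estimated cost exceeds this limit
```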
The most fundamental architectural decision in extraction design is choosing between full extraction (pulling all data every time) and incremental extraction (pulling only changed data since the last extraction). This decision impacts performance, storage requirements, data freshness, and recovery capabilities.
When to use each approach:
| Scenario | Recommended Approach |
|---|---|
| Reference tables (<100K rows, infrequent changes) | Full extraction |
| Transaction tables (millions of rows, continuous changes) | Incremental extraction |
| Initial data warehouse population | Full extraction |
| Daily/hourly warehouse updates | Incremental extraction |
| Source system lacks change tracking | Full extraction with diff detection |
| Hard delete detection required | Full extraction or CDC with delete tracking |
| Schema evolution expected | Full extraction simplifies handling |
| Near-real-time requirements | Incremental with CDC |
Hybrid strategies:
Production systems often employ hybrid approaches: incremental extraction for routine daily or hourly loads combined with a periodic full extraction that reconciles drift and surfaces hard deletes, or full extraction for small reference tables alongside incremental extraction for large transaction tables.
Incremental extraction struggles with detecting deleted rows. If a source row is deleted, timestamp-based incremental extraction won't see it—the row simply vanishes. Solutions include: (1) Soft deletes with status columns, (2) CDC capturing delete operations, (3) Periodic full extraction to identify missing rows, or (4) Tombstone tables recording deletions.
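To make options (3) and (4) concrete, the sketch below compares a full key-and-hash snapshot of the source against the warehouse's current state. Table and column names are illustrative, and the row hash is assumed to be computed during extraction.

```sql
-- stg.orders_snapshot: today's full extract of (order_id, row_hash)
-- dwh.orders_current : the warehouse's current view of the same table

-- New or changed rows: present in the snapshot with no match or a different hash
SELECT s.order_id
FROM stg.orders_snapshot AS s
LEFT JOIN dwh.orders_current AS d
       ON d.order_id = s.order_id
WHERE d.order_id IS NULL
   OR d.row_hash <> s.row_hash;

-- Hard deletes: present in the warehouse but missing from the snapshot;
-- record them in a tombstone table rather than silently losing them
INSERT INTO etl_metadata.orders_tombstones (order_id, detected_at)
SELECT d.order_id, SYSUTCDATETIME()
FROM dwh.orders_current AS d
WHERE NOT EXISTS (
    SELECT 1 FROM stg.orders_snapshot AS s WHERE s.order_id = d.order_id
);
```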
For incremental extraction to work, you need a reliable mechanism to identify which rows have changed. This is the change detection problem, and different approaches offer varying trade-offs between accuracy, performance, and invasiveness.
Timestamp-based extraction in detail:
The most common approach relies on timestamp columns. The extraction query pattern looks like:
```sql
-- Standard timestamp-based incremental extraction
SELECT *
FROM orders
WHERE last_modified_timestamp >= :last_extraction_timestamp
  AND last_modified_timestamp < :current_extraction_timestamp;
```
Critical considerations:
Timestamp precision: Millisecond precision matters. If your timestamp column only stores seconds and multiple updates occur within a second, you might miss changes.
Transaction boundaries: A transaction that modifies a row at 10:00:00.500 might not commit until 10:00:01.200. An extraction running at 10:00:00.800 under READ COMMITTED cannot see the uncommitted row, and by the next run the watermark has already moved past its timestamp, so the change is silently skipped unless an overlap window is used.
Clock skew: In distributed systems, server clocks can drift. A row might receive a timestamp in the 'past' relative to your extraction watermark.
High-water mark management: The 'last_extraction_timestamp' must be persisted reliably. Losing this value means you must fall back to full extraction.
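Persisting the watermark usually means a small metadata table. One possible shape, matching the `etl_metadata.extraction_watermarks` table used in the pattern below (columns are illustrative):

```sql
CREATE TABLE etl_metadata.extraction_watermarks (
    source_table        SYSNAME      NOT NULL PRIMARY KEY,  -- one watermark row per extracted table
    last_extraction_ts  DATETIME2(3) NOT NULL,              -- upper bound of the last successful window
    rows_extracted      BIGINT       NULL,                  -- bookkeeping from the last run
    updated_at          DATETIME2(3) NOT NULL DEFAULT SYSUTCDATETIME()
);
```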
```sql
-- Safe incremental extraction pattern with overlap window
-- Overlap handles clock skew and in-flight transactions

DECLARE @extraction_window_start DATETIME2;
DECLARE @extraction_window_end DATETIME2;
DECLARE @overlap_minutes INT = 5;

-- Retrieve last successful extraction timestamp
SELECT @extraction_window_start = DATEADD(MINUTE, -@overlap_minutes, last_extraction_ts)
FROM etl_metadata.extraction_watermarks
WHERE source_table = 'orders';

SET @extraction_window_end = SYSUTCDATETIME();

-- Extract with overlap (will include some rows extracted previously)
SELECT
    order_id,
    customer_id,
    order_date,
    total_amount,
    status,
    last_modified_timestamp,
    -- Include extraction metadata
    @extraction_window_end AS extraction_timestamp
FROM source_db.dbo.orders WITH (NOLOCK)
WHERE last_modified_timestamp >= @extraction_window_start
  AND last_modified_timestamp < @extraction_window_end
ORDER BY last_modified_timestamp;

-- After successful load, update watermark
UPDATE etl_metadata.extraction_watermarks
SET last_extraction_ts = @extraction_window_end,
    rows_extracted = @@ROWCOUNT
WHERE source_table = 'orders';
```

Always extract with a small overlap window (5-15 minutes before the last extraction timestamp). This catches rows that were in-flight during the previous extraction. The downstream staging layer should handle deduplication using primary keys and 'latest wins' logic.
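A sketch of that 'latest wins' deduplication in the staging layer, assuming rows from overlapping extraction windows land in a table named `stg.orders_raw` (the name is illustrative):

```sql
-- Collapse duplicates created by the overlap window, keeping the newest version per key
WITH ranked AS (
    SELECT *,
           ROW_NUMBER() OVER (
               PARTITION BY order_id
               ORDER BY last_modified_timestamp DESC, extraction_timestamp DESC
           ) AS rn
    FROM stg.orders_raw
)
SELECT order_id, customer_id, order_date, total_amount, status, last_modified_timestamp
FROM ranked
WHERE rn = 1;
```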
Change Data Capture (CDC) represents the gold standard for incremental extraction. Rather than querying tables to infer what changed, CDC reads the database's transaction log to observe the exact sequence of insert, update, and delete operations as they occur.
This approach offers profound advantages: every insert, update, and delete is observed, including hard deletes and intermediate states that timestamp-based queries miss; before and after images are available for auditing; latency can approach real time; and source tables are not burdened with repeated extraction queries. Native CDC support varies by database:
| Database | Native CDC Mechanism | Tooling Support |
|---|---|---|
| SQL Server | SQL Server CDC (change tables) | Debezium, Attunity, native |
| PostgreSQL | Logical Replication, pgoutput | Debezium, Airbyte, built-in |
| MySQL | Binary Log (binlog) | Debezium, Maxwell, Airbyte |
| Oracle | LogMiner, GoldenGate | GoldenGate, Debezium, Attunity |
| MongoDB | Change Streams, Oplog | Debezium, native driver |
| Cassandra | CDC log tables | Debezium (incubating) |
CDC architecture with Debezium:
Debezium is the most widely adopted open-source CDC platform, typically deployed as Kafka Connect connectors. The architecture follows this flow:
```
┌─────────────┐  binlog  ┌───────────┐      ┌─────────┐      ┌────────────┐      ┌────────────┐
│  Source DB  │  reads   │ Debezium  │      │  Kafka  │      │    ETL     │      │    Data    │
│   (MySQL)   │─────────▶│ Connector │─────▶│ Topics  │─────▶│  Consumer  │─────▶│ Warehouse  │
└─────────────┘          └───────────┘      └─────────┘      └────────────┘      └────────────┘
```
CDC event structure:
A typical CDC event contains rich metadata beyond just the new row values:
{ "schema": { ... }, "payload": { "before": { "order_id": 12345, "status": "PENDING", "total_amount": 150.00 }, "after": { "order_id": 12345, "status": "SHIPPED", "total_amount": 150.00 }, "source": { "version": "2.4.0", "connector": "mysql", "name": "ecommerce", "ts_ms": 1704067200000, "db": "orders_db", "table": "orders", "server_id": 12345, "file": "mysql-bin.000003", "pos": 56789, "row": 0 }, "op": "u", // Operation: c=create, u=update, d=delete, r=read (snapshot) "ts_ms": 1704067200500, "transaction": { "id": "file=mysql-bin.000003,pos=56500", "total_order": 3, "data_collection_order": 2 } }}While CDC is powerful, it introduces operational complexity: (1) Transaction log retention must be configured—logs pruned too early lose changes, (2) Schema changes require careful handling—column additions/removals affect downstream consumers, (3) Initial snapshots for new tables require coordination with streaming changes, (4) Monitoring and alerting for connector failures is critical—silent failures mean data loss.
Real-world extraction systems employ recognizable architectural patterns that balance performance, reliability, and operational complexity. Understanding these patterns helps you design appropriate solutions for specific requirements.
Parallelization strategies:
Extracting large tables requires parallelization to meet time windows. Common approaches include:
Range partitioning: Divide extraction by key ranges. Multiple extractors each handle a key range:
```sql
-- Extractor 1: IDs 1-1000000
-- Extractor 2: IDs 1000001-2000000
-- Extractor 3: IDs 2000001-3000000
SELECT * FROM orders
WHERE order_id BETWEEN :range_start AND :range_end;
```
Time partitioning: Divide extraction by time ranges. Particularly effective for append-only fact tables:
```sql
-- Extract one day at a time across parallel workers
SELECT * FROM transactions
WHERE transaction_date = :target_date;
```
Hash partitioning: Extract rows based on hash of a column. Ensures even distribution:
```sql
-- 4 parallel extractors, each handling 25% of data
SELECT * FROM customers
WHERE MOD(HASH(customer_id), 4) = :worker_id;
```
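Range partitioning also needs boundary values before workers start. One sketch derives approximately even ranges from the current key distribution, assuming `order_id` is the partitioning key; a production system would more likely sample or use histogram statistics than scan the full table:

```sql
-- Derive start/end keys for 4 parallel workers of roughly equal row counts
SELECT worker_id,
       MIN(order_id) AS range_start,
       MAX(order_id) AS range_end
FROM (
    SELECT order_id,
           NTILE(4) OVER (ORDER BY order_id) AS worker_id
    FROM orders
) AS buckets
GROUP BY worker_id
ORDER BY worker_id;
```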
Whichever partitioning scheme is used, parallel workers share the source system's capacity, so concurrency must be throttled to respect the operational constraints discussed earlier. More broadly, the choice of extraction pattern is driven by the requirements at hand:
| Requirement | Recommended Pattern | Rationale |
|---|---|---|
| < 15 minute latency | Streaming CDC | Batch scheduling can't achieve this |
| Legacy mainframe source | File-based extraction | Often only option available |
| Salesforce/HubSpot source | API-based extraction | Only interface available |
| Minimal operational complexity | Pull-based batch | Simpler infrastructure required |
| 10TB+ daily extraction | Parallel range extraction | Single-threaded won't complete in time |
| Audit trail with before/after | CDC | Only CDC captures before images |
Production extraction systems must be designed for failure. Networks fail, databases become unavailable, credentials expire, and queries time out. Robust extraction architecture assumes failures will occur and provides mechanisms to detect, recover, and prevent data loss.
Common failure modes include mid-extraction network interruptions, source databases becoming unavailable or failing over, expired or rotated credentials, long-running queries hitting timeouts, and unexpected schema changes that break extraction queries.
Recovery strategies:
Checkpoint-based recovery: Persist extraction progress at regular intervals. On failure, resume from the last checkpoint rather than restarting from the beginning.
```
Extraction progress: [=====|=========>          ]
                           ↑
                           Checkpoint saved
                           On failure, resume here
```
Idempotent extraction: Design extraction to be safely repeatable. Re-extracting the same data range should produce identical results, allowing failed batches to be retried without duplication or data loss.
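A common way to achieve idempotence is to make each batch load a delete-then-insert over the same window, so a retried batch simply overwrites its own partial results. A sketch, assuming the staging table carries a batch identifier and that `@batch_id`, `@window_start`, and `@window_end` are bound by the orchestration layer (all names are illustrative):

```sql
BEGIN TRANSACTION;

-- Discard anything left behind by a failed earlier attempt at this batch
DELETE FROM stg.orders_raw
WHERE extraction_batch_id = @batch_id;

-- Re-land the same window; repeating this block yields the same staged result
INSERT INTO stg.orders_raw
    (order_id, customer_id, order_date, total_amount, status,
     last_modified_timestamp, extraction_batch_id)
SELECT order_id, customer_id, order_date, total_amount, status,
       last_modified_timestamp, @batch_id
FROM source_db.dbo.orders
WHERE last_modified_timestamp >= @window_start
  AND last_modified_timestamp <  @window_end;

COMMIT TRANSACTION;
```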
Dead letter handling: When specific rows fail extraction (encoding issues, constraint violations), route them to a 'dead letter' queue for investigation while continuing with other rows.
Circuit breaker pattern: After repeated failures, stop attempting extraction and alert operators rather than overwhelming failing source systems with retry storms.
Document and enforce extraction SLAs: Maximum extraction duration, retry policy, escalation procedures, and data freshness guarantees. When extraction fails, stakeholders should know exactly what happens, who is notified, and how recovery proceeds.
The Extract phase is your first line of defense for data quality. Problems detected early are far cheaper to resolve than problems that propagate through transformation layers and into analytical reports.
Extraction-time validations:
| Check Type | Implementation | Action on Failure |
|---|---|---|
| Row count validation | Compare extracted count to source count | Alert if difference exceeds threshold |
| Null ratio monitoring | Track % of nulls in critical columns | Alert if exceeds historical baseline |
| Value range validation | Check min/max of numeric fields | Reject or flag outliers |
| Referential integrity | Verify foreign keys exist in related extracts | Log orphans for investigation |
| Duplicate detection | Check for duplicate primary keys | Deduplicate or fail extraction |
| Schema validation | Verify expected columns exist with correct types | Fail fast on schema drift |
| Freshness validation | Confirm recent timestamps exist | Alert if data appears stale |
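Two of these checks are simple enough to show directly. A sketch against a staged batch, assuming the source-side count was captured by the extraction job itself and that the table names are illustrative:

```sql
DECLARE @source_count BIGINT = 1250000;   -- count reported by the source query at extract time
DECLARE @staged_count BIGINT;

SELECT @staged_count = COUNT(*) FROM stg.orders_raw;

-- Row count validation: fail the batch if the difference exceeds a 0.1% threshold
IF ABS(@staged_count - @source_count) > 0.001 * @source_count
    THROW 50001, 'Row count mismatch between source and staged extract.', 1;

-- Duplicate detection: any result rows mean the batch needs deduplication or investigation
SELECT order_id, COUNT(*) AS copies
FROM stg.orders_raw
GROUP BY order_id
HAVING COUNT(*) > 1;
```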
Schema evolution handling:
Source schemas change over time: new columns are added, columns are renamed, data types change. Extraction systems must detect these changes, typically by comparing the observed source schema against an expected schema on every run, and then handle them according to policy: fail fast, alert, or propagate additive changes downstream.
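A sketch of that comparison, assuming the ETL team maintains an expected-schema table (`etl_metadata.expected_columns` is an illustrative name) and the source exposes `INFORMATION_SCHEMA`:

```sql
-- Report expected columns that are missing or whose type has drifted
SELECT e.column_name,
       e.data_type AS expected_type,
       c.DATA_TYPE AS actual_type
FROM etl_metadata.expected_columns AS e
LEFT JOIN INFORMATION_SCHEMA.COLUMNS AS c
       ON  c.TABLE_SCHEMA = e.table_schema
       AND c.TABLE_NAME   = e.table_name
       AND c.COLUMN_NAME  = e.column_name
WHERE e.table_name = 'orders'
  AND (c.COLUMN_NAME IS NULL            -- expected column no longer exists: fail fast
       OR c.DATA_TYPE <> e.data_type);  -- data type changed: fail fast or alert
```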
Data lineage tracking:
Every extracted row should carry metadata enabling full traceability: the source system and table it came from, the extraction timestamp, a batch or run identifier, and the watermark window that produced it.
This metadata enables debugging, auditing, and impact analysis when issues are discovered downstream.
Perfect data quality is rarely achievable. Establish acceptable thresholds based on business requirements. A 0.01% duplicate rate might be acceptable for analytics but unacceptable for financial reporting. Document these thresholds and alert when they're exceeded, rather than blocking all extraction on every minor issue.
The Extract phase sets the foundation for everything that follows in the ETL pipeline. Extraction done well provides reliable, timely, high-quality data to downstream processes. Extraction done poorly creates data quality issues, operational incidents, and eroded trust in analytical outputs.
What's next:
With data successfully extracted from source systems, we turn to the Transform phase—where raw operational data is cleaned, integrated, conformed to business rules, and structured for analytical consumption. The next page explores transformation techniques from simple data cleansing to complex business logic application.
You now understand the Extract phase of ETL: source system characteristics, full vs. incremental extraction trade-offs, change detection mechanisms including CDC, architectural patterns, failure handling, and data quality considerations. Next, we'll explore the Transform phase where extracted data is refined and enriched.