If extraction is about acquiring raw materials, transformation is about craftsmanship—the skilled process of converting disparate, messy, operational data into consistent, integrated, analytically-ready information. This is where data engineering truly happens.
Raw extracted data carries the DNA of its source systems: inconsistent formats, business logic embedded in application code, cryptic codes meaningful only to legacy applications, duplicates from systems that never enforced uniqueness, and relationships implied but never explicitly declared. The Transform phase exists to resolve this chaos.
Transformation isn't merely technical data manipulation. It's the implementation of business rules, the creation of common language across organizational silos, and the establishment of single sources of truth for critical business entities. The customer record that means one thing in sales, another in support, and yet another in billing must become one coherent entity that all analytical consumers can trust.
This is where you turn data into information—where bytes become insights.
By the end of this page, you will master the full spectrum of data transformation: cleansing techniques for quality remediation, standardization for consistency, deduplication strategies, data type conversions, derived field calculations, surrogate key generation, and dimension conformance. You'll understand when to apply each technique and how to design transformation pipelines that production systems can rely upon.
Transformation encompasses a broad spectrum of operations, from simple format conversions to complex business logic implementation. Understanding this landscape helps you architect appropriate transformation pipelines.
Categories of transformation:
| Category | Purpose | Examples |
|---|---|---|
| Data Cleansing | Fix quality issues in source data | Null handling, trimming whitespace, fixing encoding |
| Standardization | Enforce consistent formats | Date formats, phone formats, address normalization |
| Data Type Conversion | Match target schema types | String to number, timestamp parsing, boolean mapping |
| Deduplication | Identify and resolve duplicate records | Exact match, fuzzy matching, survivorship rules |
| Derived Fields | Calculate new values from existing | Age from birthdate, total from quantity×price |
| Aggregation | Summarize detail to higher grain | Daily sales totals, customer lifetime value |
| Enrichment | Add external data | Geocoding, market data lookup, demographics |
| Filtering | Include/exclude rows | Remove test data, filter to active records |
| Conformance | Align to dimensional model | Surrogate keys, conformed dimensions |
| Business Rules | Apply domain-specific logic | Revenue recognition, status derivation |
Transformation ordering matters:
The sequence in which transformations execute can significantly impact results. Consider this example:
Source: " 123.45 " (string with whitespace)
| Step | Order A | Order B |
|---|---|---|
| 1 | Convert to number | Trim whitespace |
| 2 | Trim whitespace | Convert to number |
| Result | ERROR (conversion fails) | 123.45 (correct) |
General transformation ordering principles: cleanse and standardize before converting types, convert types before calculating derived values, deduplicate before aggregating, and apply business rules only once the inputs are clean and conformed.
Design transformations to be idempotent—running them twice should produce the same result as running once. This enables safe retry on failure, simplifies testing, and allows reprocessing historical data when transformation logic changes.
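As a concrete illustration, here is a minimal sketch of an idempotent transformation step, assuming PostgreSQL and hypothetical stg_orders and daily_sales tables; the delete is scoped to the batch being processed, so a rerun for the same date leaves the target unchanged.

```sql
-- Idempotent transformation step: delete-then-insert scoped to one batch date,
-- so running it twice for the same date produces the same final state.
BEGIN;

DELETE FROM daily_sales
WHERE sales_date = DATE '2024-01-15';      -- the batch date being (re)processed

INSERT INTO daily_sales (sales_date, store_id, total_revenue)
SELECT sales_date, store_id, SUM(order_total)
FROM stg_orders
WHERE sales_date = DATE '2024-01-15'
GROUP BY sales_date, store_id;

COMMIT;
```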
Data cleansing addresses quality issues that make data unreliable or unusable. The goal isn't perfection—it's fitness for purpose. Different analytical use cases have different quality thresholds.
Common data quality issues and remediation:
```sql
-- Comprehensive cleansing transformation examples

-- 1. NULL handling with business-appropriate defaults
SELECT
    customer_id,
    -- Use COALESCE with appropriate defaults
    COALESCE(first_name, 'Unknown') AS first_name,
    COALESCE(email, 'no-email@placeholder.com') AS email,
    -- Numeric NULLs often default to 0
    COALESCE(credit_limit, 0) AS credit_limit,
    -- Dates may use sentinel values
    COALESCE(last_purchase_date, '1900-01-01') AS last_purchase_date
FROM raw_customers;

-- 2. Whitespace normalization
SELECT
    customer_id,
    -- Trim and normalize internal whitespace
    REGEXP_REPLACE(TRIM(company_name), '\s+', ' ', 'g') AS company_name,
    LOWER(TRIM(email)) AS email_normalized
FROM raw_customers;

-- 3. Value standardization and mapping
SELECT
    order_id,
    CASE UPPER(TRIM(status))
        WHEN 'PENDING'   THEN 'PENDING'
        WHEN 'P'         THEN 'PENDING'
        WHEN 'PEND'      THEN 'PENDING'
        WHEN 'SHIPPED'   THEN 'SHIPPED'
        WHEN 'S'         THEN 'SHIPPED'
        WHEN 'SHIP'      THEN 'SHIPPED'
        WHEN 'DELIVERED' THEN 'DELIVERED'
        WHEN 'D'         THEN 'DELIVERED'
        WHEN 'CANCELLED' THEN 'CANCELLED'
        WHEN 'C'         THEN 'CANCELLED'
        WHEN 'CANCELED'  THEN 'CANCELLED'  -- Spelling variation
        ELSE 'UNKNOWN'
    END AS status_standardized
FROM raw_orders;

-- 4. Data validation with rejection tracking
INSERT INTO rejected_records (source_table, record_key, rejection_reason, record_data)
SELECT
    'orders',
    order_id::text,
    CASE
        WHEN order_total < 0 THEN 'Negative order total'
        WHEN order_date > CURRENT_DATE THEN 'Future order date'
        WHEN customer_id IS NULL THEN 'Missing customer ID'
    END,
    row_to_json(raw_orders.*)::text
FROM raw_orders
WHERE order_total < 0
   OR order_date > CURRENT_DATE
   OR customer_id IS NULL;
```

While automation is valuable, blindly 'fixing' data can hide systemic source issues. Track cleansing actions, report fix rates, and escalate when thresholds are exceeded. If 30% of addresses need geocoding correction, that's a source data problem requiring upstream attention.
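A minimal sketch of that kind of fix-rate reporting, assuming PostgreSQL and the raw_customers table from the examples above; the escalation threshold is illustrative:

```sql
-- Report how often each cleansing rule fires so systemic source issues surface
SELECT
    COUNT(*)                                               AS total_rows,
    COUNT(*) FILTER (WHERE email IS NULL)                  AS null_emails,
    COUNT(*) FILTER (WHERE first_name IS NULL)             AS null_first_names,
    ROUND(100.0 * COUNT(*) FILTER (WHERE email IS NULL) / NULLIF(COUNT(*), 0), 1)
                                                           AS pct_null_emails
FROM raw_customers;
-- Escalate (fail the pipeline or alert someone) if pct_null_emails exceeds
-- an agreed threshold, e.g. 30%, instead of silently defaulting the values.
```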
Standardization ensures that equivalent values are represented identically, enabling accurate matching, grouping, and analysis. Without standardization, 'United States', 'US', 'U.S.A.', and 'USA' appear as four different countries.
Critical standardization domains:
| Data Type | Common Variations | Standard Format |
|---|---|---|
| Dates | MM/DD/YYYY, DD-MM-YYYY, 'Jan 1, 2024' | ISO 8601: YYYY-MM-DD |
| Timestamps | Various timezones, formats | UTC timestamp with timezone |
| Phone numbers | (123) 456-7890, 123.456.7890 | E.164: +11234567890 |
| Addresses | St., Street, Str.; Ave., Avenue | USPS standardization or local postal format |
| Names | Case variations, titles, suffixes | Proper case, parsed components |
| Currency | $1,234.56, 1234,56 EUR | Numeric value + ISO currency code |
| Countries | US, USA, United States | ISO 3166-1 alpha-2 codes |
| Boolean | Yes/No, Y/N, True/False, 1/0 | Consistent TRUE/FALSE or 1/0 |
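The SQL later in this section covers dates, phones, and addresses; as a complement, here is a hedged sketch of boolean and country standardization, assuming PostgreSQL and a hypothetical country_mapping reference table keyed by raw spelling variants:

```sql
-- Boolean and country standardization (country_mapping is a hypothetical
-- reference table mapping raw variants like 'USA' to ISO 3166-1 alpha-2 codes)
SELECT
    rc.customer_id,
    CASE
        WHEN UPPER(TRIM(rc.is_active_raw)) IN ('Y', 'YES', 'TRUE', '1') THEN TRUE
        WHEN UPPER(TRIM(rc.is_active_raw)) IN ('N', 'NO', 'FALSE', '0') THEN FALSE
        ELSE NULL                                  -- unknown values flagged, not guessed
    END AS is_active,
    COALESCE(cm.iso_alpha2, 'ZZ') AS country_code  -- 'ZZ' as an explicit "unknown" bucket
FROM raw_customers rc
LEFT JOIN country_mapping cm
    ON UPPER(TRIM(rc.country)) = cm.raw_variant;
```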
Date and timestamp standardization:
Date handling is notoriously error-prone. Key considerations include ambiguous formats (is 01/02/2024 January 2nd or February 1st?), two-digit years, timezone and daylight-saving conversions, and invalid or sentinel dates that parse 'successfully' into misleading values.
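For the timezone consideration specifically, a minimal sketch of normalizing local timestamps to UTC, assuming PostgreSQL and a hypothetical raw_events table whose source_tz column holds IANA timezone names:

```sql
-- Normalize naive local timestamps to UTC before loading
SELECT
    event_id,
    -- Interpret the local timestamp in its source timezone,
    -- then express the result in UTC
    (event_local_ts AT TIME ZONE source_tz) AT TIME ZONE 'UTC' AS event_ts_utc
FROM raw_events;
```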
```sql
-- Date standardization with explicit format handling
SELECT
    order_id,
    -- Parse various date formats to standard ISO format
    CASE
        -- MM/DD/YYYY format
        WHEN raw_date ~ '^\d{2}/\d{2}/\d{4}$' THEN TO_DATE(raw_date, 'MM/DD/YYYY')
        -- DD-MM-YYYY format
        WHEN raw_date ~ '^\d{2}-\d{2}-\d{4}$' THEN TO_DATE(raw_date, 'DD-MM-YYYY')
        -- YYYY-MM-DD (already ISO)
        WHEN raw_date ~ '^\d{4}-\d{2}-\d{2}$' THEN TO_DATE(raw_date, 'YYYY-MM-DD')
        -- Month name formats: 'Jan 1, 2024'
        WHEN raw_date ~ '^[A-Za-z]{3} \d{1,2}, \d{4}$' THEN TO_DATE(raw_date, 'Mon DD, YYYY')
        ELSE NULL  -- Unknown format - flag for review
    END AS order_date_standardized
FROM raw_orders;

-- Phone number standardization to E.164 format
SELECT
    customer_id,
    -- Strip all non-numeric characters
    '+1' || REGEXP_REPLACE(phone, '[^0-9]', '', 'g') AS phone_e164
FROM raw_customers
WHERE LENGTH(REGEXP_REPLACE(phone, '[^0-9]', '', 'g')) = 10;

-- Address standardization
SELECT
    address_id,
    UPPER(TRIM(street_line_1)) AS street_1,
    -- Standardize common abbreviations
    REGEXP_REPLACE(
        REGEXP_REPLACE(
            REGEXP_REPLACE(
                UPPER(TRIM(street_line_1)),
                '\bSTREET\b|\bSTR\.?\b', 'ST', 'g'
            ),
            '\bAVENUE\b|\bAVE\.?\b', 'AVE', 'g'
        ),
        '\bBOULEVARD\b|\bBLVD\.?\b', 'BLVD', 'g'
    ) AS street_1_standardized,
    UPPER(TRIM(city)) AS city,
    -- State abbreviation mapping
    COALESCE(state_mapping.abbreviation, UPPER(TRIM(state))) AS state_abbr,
    -- ZIP code format: 5 digits or 5+4
    REGEXP_REPLACE(postal_code, '[^0-9]', '', 'g') AS zip_clean
FROM raw_addresses
LEFT JOIN state_mapping
    ON UPPER(TRIM(raw_addresses.state)) = state_mapping.full_name;
```

Don't reinvent standardization logic. Libraries like libpostal (addresses), phonenumbers (phone), and dateutil (dates) encode years of edge case handling. SQL-based standardization works for simple cases; complex standardization often requires purpose-built tooling.
Duplicate records are endemic in operational systems. The same customer may exist multiple times with slight variations. The same product may be entered repeatedly with typos. Mergers and acquisitions combine databases with overlapping entities. Deduplication is the process of identifying and consolidating these duplicates.
Deduplication approaches range from exact matching on normalized keys, to fuzzy matching that scores the similarity of candidate pairs, to probabilistic record linkage that weighs evidence across many fields.
Fuzzy matching techniques:
| Algorithm | Best For | How It Works |
|---|---|---|
| Levenshtein distance | Typos, misspellings | Counts edit operations to transform one string to another |
| Jaro-Winkler | Names | Weighs character matches, favoring prefix similarity |
| Soundex/Metaphone | Phonetic variations | Encodes words by sound, matching pronunciations |
| N-gram similarity | General text | Compares frequency of character sequences |
| TF-IDF + Cosine | Documents, descriptions | Vector space similarity for longer text |
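Several of these algorithms are available directly in the database. A hedged sketch, assuming PostgreSQL with the fuzzystrmatch and pg_trgm extensions installed (Jaro-Winkler typically comes from an external library or UDF rather than a built-in):

```sql
-- Fuzzy comparison functions from the fuzzystrmatch and pg_trgm extensions
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
CREATE EXTENSION IF NOT EXISTS pg_trgm;

SELECT
    levenshtein('Jonathan', 'Johnathan')        AS edit_distance,  -- typos, misspellings
    soundex('Smith') = soundex('Smyth')         AS sounds_alike,   -- phonetic match
    similarity('Acme Corp', 'ACME Corporation') AS trigram_sim;    -- n-gram similarity
```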
Blocking for scalability:
Comparing every record to every other record is O(n²)—prohibitive for large datasets. Blocking reduces the search space by only comparing records within the same 'block' defined by a coarse key:
Without blocking: 1,000,000 records → 500 billion comparisons
With blocking by first letter of last name:
26 blocks of ~38,500 records each → ~19 billion comparisons (96% reduction)
With blocking by ZIP code:
40,000 blocks of ~25 records each → ~12.5 million comparisons (99.997% reduction)
Choose blocking keys that balance reduction ratio against missed matches (records that should match but are in different blocks).
```sql
-- Exact match deduplication - keep most recent
WITH ranked_customers AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY
                LOWER(TRIM(email)),          -- Dedupe key
                LOWER(TRIM(first_name)),
                LOWER(TRIM(last_name))
            ORDER BY
                last_modified_date DESC,     -- Most recent first
                customer_id ASC              -- Tie-breaker
        ) AS duplicate_rank
    FROM staging_customers
)
SELECT * FROM ranked_customers WHERE duplicate_rank = 1;

-- Fuzzy match candidate identification using blocking
-- Step 1: Create blocks and identify candidates
WITH blocked_customers AS (
    SELECT
        c1.customer_id AS id1,
        c2.customer_id AS id2,
        c1.first_name AS first1, c2.first_name AS first2,
        c1.last_name AS last1, c2.last_name AS last2,
        c1.email AS email1, c2.email AS email2,
        -- Calculate similarity scores
        similarity(c1.last_name, c2.last_name) AS last_name_sim,
        similarity(c1.first_name, c2.first_name) AS first_name_sim,
        similarity(c1.email, c2.email) AS email_sim
    FROM staging_customers c1
    JOIN staging_customers c2
        ON c1.customer_id < c2.customer_id                 -- Avoid self-join and duplicates
        AND LEFT(c1.last_name, 2) = LEFT(c2.last_name, 2)  -- Blocking key
        AND c1.state = c2.state                            -- Additional blocking
)
-- Step 2: Score and threshold
SELECT
    id1, id2,
    first1, first2,
    last1, last2,
    email1, email2,
    -- Weighted composite score
    (last_name_sim * 0.35 + first_name_sim * 0.25 + email_sim * 0.40) AS match_score
FROM blocked_customers
WHERE (last_name_sim * 0.35 + first_name_sim * 0.25 + email_sim * 0.40) > 0.85
ORDER BY match_score DESC;
```

When duplicates are identified, you must decide which value survives for each field. Common rules: (1) Most recent non-null value, (2) Most complete record, (3) Most trusted source system, (4) Most frequent value across duplicates. Document survivorship rules explicitly—they encode business decisions.
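As a sketch of survivorship rule (1), most recent non-null value, assuming PostgreSQL and a hypothetical match_groups table from the matching step that maps each customer_id to a duplicate group_id:

```sql
-- Survivorship: take the most recent non-null value per field within each group
SELECT
    g.group_id,
    (ARRAY_AGG(c.email ORDER BY c.last_modified_date DESC)
         FILTER (WHERE c.email IS NOT NULL))[1]   AS surviving_email,
    (ARRAY_AGG(c.phone ORDER BY c.last_modified_date DESC)
         FILTER (WHERE c.phone IS NOT NULL))[1]   AS surviving_phone,
    MAX(c.last_modified_date)                     AS latest_activity
FROM staging_customers c
JOIN match_groups g ON g.customer_id = c.customer_id
GROUP BY g.group_id;
```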
Data integration is the process of combining data from multiple sources into a unified, consistent view. This is where disparate definitions of 'customer', 'product', and 'revenue' across organizational silos become single, authoritative concepts.
Key integration challenges include structural differences (schemas, column names, data types), semantic differences (the same term meaning different things in different systems), conflicting identifiers, mismatched granularity, and sources that update on different schedules.
Conformed dimensions:
In dimensional modeling, conformed dimensions are dimension tables that are shared across multiple fact tables and data marts. They provide the consistent vocabulary that enables enterprise-wide analysis.
For example, a conformed Date dimension ensures that 'Q1 2024' means the same thing whether you're analyzing sales, support tickets, or marketing campaigns. A conformed Customer dimension ensures customer attributes are consistent across all analyses.
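A minimal sketch of generating such a conformed Date dimension, assuming PostgreSQL and a hypothetical dim_date table; the calendar range is illustrative:

```sql
-- Populate a conformed Date dimension from a generated calendar
INSERT INTO dim_date (date_key, full_date, year, quarter, month, day_of_week)
SELECT
    TO_CHAR(d, 'YYYYMMDD')::INT    AS date_key,     -- e.g. 20240101
    d::DATE                        AS full_date,
    EXTRACT(YEAR FROM d)::INT      AS year,
    EXTRACT(QUARTER FROM d)::INT   AS quarter,
    EXTRACT(MONTH FROM d)::INT     AS month,
    EXTRACT(ISODOW FROM d)::INT    AS day_of_week   -- 1 = Monday ... 7 = Sunday
FROM GENERATE_SERIES('2020-01-01'::DATE, '2030-12-31'::DATE, INTERVAL '1 day') AS d;
```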
Creating conformed dimensions:
Source 1: CRM System Source 2: ERP System Conformed Customer Dimension
┌─────────────────────┐ ┌─────────────────────┐ ┌─────────────────────────┐
│ crm_customer_id │ │ erp_cust_id │ │ customer_sk (surrogate) │
│ contact_name │ ───▶ │ company_name │ ───▶ │ customer_id (natural) │
│ email │ │ email_address │ │ customer_name │
│ industry_code │ │ naics_code │ │ email │
│ annual_revenue │ │ credit_limit │ │ industry_code │
└─────────────────────┘ └─────────────────────┘ │ annual_revenue │
│ source_system │
│ effective_date │
│ expiration_date │
└─────────────────────────┘
Surrogate keys:
Conformed dimensions typically use surrogate keys—system-generated unique identifiers (usually integers)—rather than source system business keys. Surrogate keys insulate the warehouse from source-system key changes and reuse, keep fact-table joins compact, and allow multiple historical versions of the same business entity to coexist (which the effective and expiration dates in the example below rely on).
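The NEXTVAL('customer_sk_seq') call in the example below assumes a dedicated database sequence; a minimal definition (PostgreSQL):

```sql
-- Sequence backing the customer surrogate keys generated below
CREATE SEQUENCE IF NOT EXISTS customer_sk_seq START WITH 1;
```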
```sql
-- Building a conformed Customer dimension from multiple sources

-- Step 1: Combine and integrate source customer records
WITH integrated_customers AS (
    -- Match customers across CRM and ERP using email (after standardization)
    SELECT
        -- Fall back to a prefixed ERP id when no CRM record exists
        COALESCE(crm.crm_customer_id, 'ERP-' || erp.erp_cust_id) AS customer_natural_key,
        COALESCE(crm.contact_name, erp.company_name) AS customer_name,
        COALESCE(crm.email, erp.email_address) AS email,
        crm.industry_code,
        erp.credit_limit,
        CASE
            WHEN crm.crm_customer_id IS NOT NULL AND erp.erp_cust_id IS NOT NULL THEN 'BOTH'
            WHEN crm.crm_customer_id IS NOT NULL THEN 'CRM'
            ELSE 'ERP'
        END AS source_system,
        GREATEST(crm.last_modified, erp.last_update) AS last_updated
    FROM staging_crm_customers crm
    FULL OUTER JOIN staging_erp_customers erp
        ON LOWER(TRIM(crm.email)) = LOWER(TRIM(erp.email_address))
)

-- Step 2: Generate surrogate keys and create dimension records
INSERT INTO dim_customer (
    customer_sk, customer_natural_key, customer_name, email,
    industry_code, credit_limit, source_system,
    effective_date, expiration_date, is_current
)
SELECT
    NEXTVAL('customer_sk_seq') AS customer_sk,
    customer_natural_key,
    customer_name,
    email,
    industry_code,
    credit_limit,
    source_system,
    CURRENT_DATE AS effective_date,
    '9999-12-31'::DATE AS expiration_date,
    TRUE AS is_current
FROM integrated_customers
WHERE customer_natural_key NOT IN (
    SELECT customer_natural_key FROM dim_customer WHERE is_current = TRUE
);
```

Many analytical fields don't exist in source systems—they must be derived through calculation. These derivations encode business logic that should be applied consistently across all analyses.
Categories of derived fields:
| Category | Examples | Implementation Notes |
|---|---|---|
| Calculated measures | Total = Quantity × Price, Tax = Subtotal × Rate | Simple arithmetic; ensure consistent calculation |
| Date-based derivations | Age, Tenure, Days Since Last Purchase | Handle NULL dates and edge cases |
| Classification/Buckets | Revenue Tier, Customer Segment, Age Group | Document bucket boundaries clearly |
| Status derivations | Is Active, Is VIP, Churn Risk Level | Complex logic with multiple conditions |
| Aggregate rollups | Lifetime Value, YTD Sales, Running Total | May require window functions or pre-aggregation |
| External lookups | Industry Name from Code, Country from IP | Join to reference tables |
| Flags and indicators | Has Email, Is Complete, Needs Review | Boolean derivations for filtering |
Centralizing business logic:
Derived field calculations should be implemented once and applied consistently. Avoid situations where different reports calculate 'customer lifetime value' differently.
Approaches include pre-calculating derived fields during transformation, defining them once in shared views or a semantic layer, and packaging the logic in version-controlled transformation models that every report reuses.
Trade-off: Pre-calculated fields are faster to query but require reprocessing when logic changes. Query-time calculations are always current but may be slower.
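A minimal sketch of the two options for a single derived field, assuming PostgreSQL; the view names are illustrative:

```sql
-- Query-time calculation: always current, recomputed on every query
CREATE OR REPLACE VIEW vw_customer_lifetime_value AS
SELECT customer_id, SUM(order_total) AS lifetime_value
FROM fact_orders
GROUP BY customer_id;

-- Pre-calculated: faster to query, but must be refreshed when data or logic changes
CREATE MATERIALIZED VIEW mv_customer_lifetime_value AS
SELECT customer_id, SUM(order_total) AS lifetime_value
FROM fact_orders
GROUP BY customer_id;

-- Refresh after each load (or after the calculation logic changes):
-- REFRESH MATERIALIZED VIEW mv_customer_lifetime_value;
```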
```sql
-- Comprehensive derived field examples

-- Customer lifetime value calculation
SELECT
    c.customer_id,
    c.customer_name,
    c.registration_date,

    -- Tenure in months
    DATE_PART('year', AGE(CURRENT_DATE, c.registration_date)) * 12 +
        DATE_PART('month', AGE(CURRENT_DATE, c.registration_date)) AS tenure_months,

    -- Aggregate purchase metrics
    COUNT(DISTINCT o.order_id) AS total_orders,
    SUM(o.order_total) AS total_revenue,
    AVG(o.order_total) AS avg_order_value,

    -- Days since last purchase
    DATE_PART('day', CURRENT_DATE - MAX(o.order_date)) AS days_since_last_order,

    -- Customer lifetime value (simplified: total revenue / tenure in years)
    CASE
        WHEN DATE_PART('year', AGE(CURRENT_DATE, c.registration_date)) > 0
        THEN ROUND(SUM(o.order_total) /
                   DATE_PART('year', AGE(CURRENT_DATE, c.registration_date)), 2)
        ELSE SUM(o.order_total)
    END AS annual_value,

    -- Customer segment classification
    CASE
        WHEN SUM(o.order_total) >= 10000 AND COUNT(o.order_id) >= 20 THEN 'VIP'
        WHEN SUM(o.order_total) >= 5000 OR COUNT(o.order_id) >= 10 THEN 'High Value'
        WHEN SUM(o.order_total) >= 1000 OR COUNT(o.order_id) >= 5 THEN 'Medium Value'
        WHEN COUNT(o.order_id) >= 1 THEN 'Low Value'
        ELSE 'Inactive'
    END AS customer_segment,

    -- Churn risk indicator
    CASE
        WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '365 days' THEN 'HIGH'
        WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '180 days' THEN 'MEDIUM'
        WHEN MAX(o.order_date) < CURRENT_DATE - INTERVAL '90 days' THEN 'LOW'
        ELSE 'NONE'
    END AS churn_risk

FROM dim_customer c
LEFT JOIN fact_orders o ON c.customer_id = o.customer_id
GROUP BY c.customer_id, c.customer_name, c.registration_date;

-- Revenue tier derivation with explicit bucket definitions
SELECT
    order_id,
    order_total,
    CASE
        WHEN order_total >= 1000 THEN 'Large (≥$1000)'
        WHEN order_total >= 500 THEN 'Medium ($500-$999)'
        WHEN order_total >= 100 THEN 'Small ($100-$499)'
        WHEN order_total >= 0 THEN 'Micro (<$100)'
        ELSE 'Invalid'
    END AS order_size_tier,
    -- Numeric bucket for sorting
    CASE
        WHEN order_total >= 1000 THEN 4
        WHEN order_total >= 500 THEN 3
        WHEN order_total >= 100 THEN 2
        WHEN order_total >= 0 THEN 1
        ELSE 0
    END AS order_size_tier_rank
FROM fact_orders;
```

Every derived field should have documentation explaining: (1) What it represents, (2) How it's calculated, (3) Edge case handling, (4) When it was last changed. Business users depend on these fields; they need to understand what they're seeing.
Modern transformation pipelines follow recognizable patterns and increasingly use specialized frameworks that provide structure, testing, and lineage tracking.
Common transformation patterns include layered pipelines (raw → staging → intermediate → marts), ELT designs that push SQL transformations into the warehouse itself, and incremental models that process only new or changed data.
dbt (data build tool):
dbt has become the dominant framework for transformation in modern data stacks. Key features include SQL models templated with Jinja, dependency-aware execution via ref(), built-in schema tests and documentation, and incremental materializations like the one below:
```sql
-- Example dbt model: dim_customer.sql
{{ config(materialized='incremental', unique_key='customer_id') }}

SELECT
    {{ dbt_utils.surrogate_key(['source_system', 'source_customer_id']) }} AS customer_sk,
    source_customer_id AS customer_id,
    INITCAP(TRIM(first_name)) AS first_name,
    INITCAP(TRIM(last_name)) AS last_name,
    LOWER(TRIM(email)) AS email,
    CURRENT_TIMESTAMP AS dbt_updated_at
FROM {{ ref('stg_customers') }}

{% if is_incremental() %}
WHERE updated_at > (SELECT MAX(dbt_updated_at) FROM {{ this }})
{% endif %}
```
Transformation logic should be tested like application code. Implement: (1) Unit tests for individual transformation functions, (2) Data quality tests for null counts, uniqueness, and referential integrity, (3) Reconciliation tests comparing source and target counts/sums, (4) Regression tests to catch unintended changes.
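A minimal sketch of tests (2) and (3) as plain SQL that a scheduler or test runner could assert returns zero rows, using table names from the examples above:

```sql
-- Data quality test: surrogate keys in dim_customer must be unique
SELECT customer_sk, COUNT(*) AS occurrences
FROM dim_customer
GROUP BY customer_sk
HAVING COUNT(*) > 1;          -- any rows returned = test failure

-- Reconciliation test: total revenue must match between staging and the fact table
SELECT 'revenue_mismatch' AS failure
WHERE (SELECT COALESCE(SUM(order_total), 0) FROM staging_orders)
   <> (SELECT COALESCE(SUM(order_total), 0) FROM fact_orders);
```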
The Transform phase is where raw operational data becomes analytical gold. It's the implementation of business understanding, the resolution of organizational inconsistencies, and the creation of trusted information assets.
What's next:
With data extracted, cleansed, standardized, integrated, and enriched, we're ready to Load it into the target data warehouse. The next page covers loading patterns, including initial loads, incremental updates, slowly changing dimension handling, and performance optimization techniques.
You now understand the Transform phase: cleansing techniques, standardization patterns, deduplication strategies, data integration, conformed dimensions, derived fields, and transformation frameworks. Next, we'll explore the Load phase where transformed data reaches its final destination.