Imagine an enterprise with decades of accumulated systems: a mainframe from the 1980s storing customer records as fixed-width COBOL data, an ERP implemented in 2005 using Oracle, a CRM acquired in a merger running on Salesforce, and a modern e-commerce platform built on MongoDB.
Each system speaks its own language:
Gender alone appears as M/F in one system, Male/Female in another, Man/Woman/Non-Binary/Other in a third, and 1/2/3/4 in a fourth. Dates come as YYYYMMDD, MM/DD/YYYY, DD-Mon-YY, and Unix timestamps. Currency appears as cents in one system, dollars with two decimals in another, and local currency with variable precision in a third.
This is the data integration challenge—and it goes far deeper than format inconsistencies. Semantic differences lurk beneath: What counts as a 'sale'? When is a transaction 'complete'? Who qualifies as an 'active customer'?
A data warehouse addresses this chaos through integration: the systematic process of unifying heterogeneous data into a consistent, meaningful whole.
By the end of this page, you will deeply understand data integration: the dimensions of heterogeneity it addresses, the techniques for achieving consistency, the role of ETL in integration, and the governance structures that maintain integration over time. You'll learn to identify integration issues and design solutions that create genuinely unified data.
Integration in the data warehouse context means that data from multiple, disparate source systems is transformed into a consistent, unified format with reconciled semantics and resolved conflicts.
"Data is gathered from the many operational systems and is put into the data warehouse. The data is coded consistently—naming, attribute measures, encoding structure, physical attributes, data formats, and so forth." — Bill Inmon
What Integration Achieves:
Integration is not simply combining data—it's reconciling data. You can aggregate disparate records, but if 'customer' means different things in each source, the aggregate is meaningless. True integration requires semantic alignment before combination.
Understanding integration requires understanding what creates heterogeneity in the first place. Differences arise from multiple dimensions:
| Type | Description | Examples |
|---|---|---|
| Schematic | Different structures for same concepts | Customer in CRM has 50 columns; in ERP, 20 columns. Address as single field vs. decomposed fields. |
| Syntactic | Different representations of same values | Date: 2024-01-15 vs 01/15/2024 vs 15-JAN-24. Boolean: true/false vs 1/0 vs Y/N. |
| Semantic | Different meanings for same terms | 'Active Customer': CRM = any login in 90 days; Sales = purchase in 365 days. |
| Granularity | Different levels of detail | Web: per-click events; POS: daily summaries; Finance: monthly totals. |
| Temporal | Different time references | Real-time feed vs. daily batch vs. monthly snapshot. Time zones differ. |
| Technical | Different platforms and formats | Oracle tables, Salesforce APIs, JSON files, CSV exports, MongoDB documents. |
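The syntactic row of the table is the most mechanical kind of heterogeneity to resolve. A minimal Python sketch of date normalization, assuming a hypothetical list of source formats (the `standardize_date` helper and format list are illustrative, not from any particular system):

```python
from datetime import datetime, timezone

# Candidate formats seen across the (hypothetical) source systems
DATE_FORMATS = ["%Y%m%d", "%m/%d/%Y", "%d-%b-%y", "%Y-%m-%d"]

def standardize_date(value):
    """Normalize heterogeneous date representations to ISO 8601 (YYYY-MM-DD)."""
    text = str(value).strip()
    # Unix timestamps arrive as all-digit strings longer than YYYYMMDD
    if text.isdigit() and len(text) > 8:
        return datetime.fromtimestamp(int(text), tz=timezone.utc).strftime("%Y-%m-%d")
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    return None  # Unparseable: route to a data-quality queue rather than guess

print(standardize_date("20240115"))    # 2024-01-15
print(standardize_date("01/15/2024"))  # 2024-01-15
print(standardize_date("15-JAN-24"))   # 2024-01-15
```

Note the design choice at the end: an unrecognized value returns `None` instead of a guessed date, so bad inputs surface in quality metrics rather than silently corrupting the warehouse.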
The Root Causes:
Heterogeneity isn't accidental—it emerges from organizational realities: systems built in different decades, by different vendors, for different departments, and inherited through mergers and acquisitions.
Technical differences are solvable with code. Semantic differences require organizational decisions. When Finance says 'revenue' includes pending orders and Sales says it doesn't, no ETL logic resolves this—only business governance can. This is why integration projects are as much about process as technology.
Achieving integration requires a systematic approach applied during the ETL (Extract-Transform-Load) process. Here are the core techniques:
```sql
-- ==============================================
-- EXAMPLE: Gender Code Standardization
-- ==============================================

-- Source mappings table
CREATE TABLE etl_code_mapping (
    domain          VARCHAR(50),   -- 'GENDER', 'ORDER_STATUS', etc.
    source_system   VARCHAR(50),
    source_value    VARCHAR(100),
    standard_value  VARCHAR(100)
);

INSERT INTO etl_code_mapping VALUES
-- Gender mappings from various sources
('GENDER', 'CRM_LEGACY', 'M', 'Male'),
('GENDER', 'CRM_LEGACY', 'F', 'Female'),
('GENDER', 'ERP_SAP', 'Male', 'Male'),
('GENDER', 'ERP_SAP', 'Female', 'Female'),
('GENDER', 'ECOM_MONGO', '1', 'Male'),
('GENDER', 'ECOM_MONGO', '2', 'Female'),
('GENDER', 'ECOM_MONGO', '3', 'Non-Binary'),
('GENDER', 'ECOM_MONGO', '4', 'Unknown');

-- Transformation query
SELECT
    src.customer_id,
    COALESCE(
        map.standard_value,
        'Unknown'  -- Default for unmapped values
    ) AS gender_standardized
FROM source_customer src
LEFT JOIN etl_code_mapping map
    ON  map.domain = 'GENDER'
    AND map.source_system = src.source_name
    AND map.source_value = src.gender_code;
```

Perhaps the most challenging aspect of integration is entity resolution (also called record linkage or data matching)—determining when records from different sources represent the same real-world entity.
The Problem:
Are these the same person? The answer is critical—merge incorrectly and you conflate different people; fail to merge and you fragment your customer view.
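A minimal sketch of how matching logic might weigh evidence, using hypothetical records and a simple scoring rule (exact email match is decisive; otherwise name similarity plus a postal-code check):

```python
import re
from difflib import SequenceMatcher

def normalize(s):
    """Lowercase, strip punctuation, and collapse whitespace for comparison."""
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", (s or "").lower())).strip()

def match_score(a, b):
    """Score two customer records in [0, 1]. An exact email match is decisive;
    otherwise combine fuzzy name similarity with a postal-code bonus."""
    if a.get("email") and a["email"].lower() == (b.get("email") or "").lower():
        return 1.0
    name_sim = SequenceMatcher(None, normalize(a["name"]), normalize(b["name"])).ratio()
    zip_bonus = 0.2 if a.get("zip") and a.get("zip") == b.get("zip") else 0.0
    return min(1.0, 0.8 * name_sim + zip_bonus)

crm  = {"name": "Robert J. Smith", "email": "rsmith@example.com", "zip": "60614"}
erp  = {"name": "SMITH, ROBERT",   "email": None,                 "zip": "60614"}
ecom = {"name": "Bob Smith",       "email": "rsmith@example.com", "zip": None}

print(match_score(crm, ecom))           # 1.0 -- same email: confident match
print(match_score(crm, erp) > 0.5)      # True -- candidate match, route to review
```

Real entity-resolution engines use more sophisticated techniques (blocking, phonetic keys, probabilistic weights), but the shape is the same: deterministic rules first, fuzzy scoring second, and a human-review queue for the uncertain middle.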
Once records are matched, a golden record is created—the single, authoritative version of the entity. This requires survivorship rules: which source provides the 'best' value for each attribute? Email from CRM (primary system), address from most recent order, name from official billing records. These rules codify business logic into the integration process.
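The survivorship rules described above can be sketched as a small function. All record shapes and field names here are hypothetical; the point is that each rule (CRM wins for email, ERP for address, most complete name wins) is explicit, inspectable business logic:

```python
def build_golden_record(crm, erp, ecom):
    """Apply survivorship rules to three matched source records."""
    def first_non_null(*values):
        return next((v for v in values if v), None)

    return {
        # Email: CRM is the designated system of record
        "email": first_non_null(crm.get("email"), erp.get("email"), ecom.get("email")),
        # Address: billing (ERP) is most reliable, then CRM, then e-commerce
        "address": first_non_null(erp.get("address"), crm.get("address"), ecom.get("address")),
        # Name: prefer the longest (most complete) non-null variant
        "name": max(filter(None, [crm.get("name"), erp.get("name"), ecom.get("name")]),
                    key=len, default=None),
        # Lineage: record which sources contributed
        "sources": [s for s, r in (("CRM", crm), ("ERP", erp), ("ECOM", ecom)) if r],
    }

golden = build_golden_record(
    crm={"name": "Robert J. Smith", "email": "rsmith@example.com", "address": None},
    erp={"name": "R. Smith", "email": None, "address": "123 Oak St, Chicago IL 60614"},
    ecom={"name": "Bob Smith", "email": "bob@example.com", "address": "123 Oak Street"},
)
print(golden["email"])    # rsmith@example.com  (CRM wins)
print(golden["address"])  # 123 Oak St, Chicago IL 60614  (ERP wins)
```

Keeping lineage (`sources`) alongside the surviving values matters: when a stakeholder disputes a golden-record value, you can trace exactly which system supplied it and which rule selected it.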
Conformed dimensions are a critical concept from Ralph Kimball's dimensional modeling methodology. A conformed dimension is a dimension that means the same thing across all fact tables and data marts that reference it.
Why Conformance Matters:
Without conformed dimensions, integration breaks down at the analytical level. If 'dim_customer' in the Sales mart has different keys or attributes than 'dim_customer' in the Marketing mart, they cannot be joined. Reports become incompatible. 'Customer' stops being a unified concept.
Building Conformed Dimensions:
The date dimension is the canonical example of a conformed dimension. Every fact table has date keys. If sales data and marketing data use different date definitions (fiscal vs. calendar, different week boundaries), cross-analysis fails. A single, conformed dim_date ensures 'Q4 2024' means the same thing everywhere.
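A sketch of generating such a conformed date dimension. The fiscal calendar here (fiscal year starting in February, named for its ending year) is purely an illustrative assumption; the point is that every mart consumes the same pre-computed rows, so calendar and fiscal rollups agree everywhere:

```python
from datetime import date, timedelta

def build_dim_date(start, end, fiscal_start_month=2):
    """Generate one conformed row per calendar day between start and end.
    fiscal_start_month=2 (February) is an assumed, illustrative fiscal calendar."""
    rows, d = [], start
    while d <= end:
        # Fiscal year is named for the calendar year in which it ends
        fiscal_year = d.year + (1 if d.month >= fiscal_start_month else 0)
        months_into_fy = (d.month - fiscal_start_month) % 12
        rows.append({
            "date_key": int(d.strftime("%Y%m%d")),   # surrogate key, e.g. 20240115
            "full_date": d.isoformat(),
            "calendar_quarter": f"Q{(d.month - 1) // 3 + 1} {d.year}",
            "fiscal_quarter": f"FQ{months_into_fy // 3 + 1} FY{fiscal_year}",
            "iso_week": d.isocalendar()[1],
        })
        d += timedelta(days=1)
    return rows

dim_date = build_dim_date(date(2024, 1, 1), date(2024, 12, 31))
print(dim_date[14]["date_key"], dim_date[14]["calendar_quarter"])  # 20240115 Q1 2024
```

Because every fact table joins to these same rows by `date_key`, 'Q4 2024' (or 'FQ3 FY2024') resolves to an identical set of days in every report.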
ETL (Extract-Transform-Load) processes are the primary mechanism for achieving integration. The 'Transform' phase is where integration happens—where heterogeneous source data becomes unified warehouse data.
```sql
-- ==============================================
-- INTEGRATION TRANSFORMATION EXAMPLE
-- Unified Customer from Multiple Sources
-- ==============================================

INSERT INTO dim_customer (
    customer_key, customer_id, full_name, email, phone,
    address_standard, city, state, postal_code,
    first_purchase_date, customer_segment, loyalty_tier,
    source_crm, source_erp, source_ecom,
    effective_date, is_current
)
WITH matched_customers AS (
    -- Entity resolution already performed
    SELECT * FROM stg_customer_matches
),
survivorship AS (
    SELECT
        m.unified_customer_id,

        -- Name: prefer CRM (primary system)
        COALESCE(crm.full_name, erp.full_name, ecom.full_name) AS full_name,

        -- Email: prefer most recently updated
        CASE
            WHEN crm.email_updated > COALESCE(erp.email_updated, '1900-01-01')
             AND crm.email_updated > COALESCE(ecom.email_updated, '1900-01-01')
                THEN crm.email
            WHEN erp.email_updated > COALESCE(ecom.email_updated, '1900-01-01')
                THEN erp.email
            ELSE ecom.email
        END AS email,

        -- Address: prefer ERP (billing system)
        COALESCE(erp.address, crm.address, ecom.address) AS address_standard,

        -- Contact and location fields carried through for later standardization
        COALESCE(crm.phone, erp.phone, ecom.phone) AS phone,
        COALESCE(erp.city, crm.city, ecom.city) AS city,
        COALESCE(erp.state, crm.state, ecom.state) AS state,
        COALESCE(erp.postal_code, crm.postal_code, ecom.postal_code) AS postal_code,
        COALESCE(erp.country, crm.country, ecom.country) AS country,

        -- Segmentation inputs (maintained in the CRM)
        crm.lifetime_value,
        crm.recency,

        -- First purchase: earliest across all sources
        LEAST(
            COALESCE(crm.first_order_date, '2099-12-31'),
            COALESCE(erp.first_order_date, '2099-12-31'),
            COALESCE(ecom.first_order_date, '2099-12-31')
        ) AS first_purchase_date,

        -- Track which sources contributed
        crm.customer_id AS source_crm,
        erp.customer_id AS source_erp,
        ecom.customer_id AS source_ecom
    FROM matched_customers m
    LEFT JOIN stg_crm_customer  crm  ON m.crm_key  = crm.customer_id
    LEFT JOIN stg_erp_customer  erp  ON m.erp_key  = erp.customer_id
    LEFT JOIN stg_ecom_customer ecom ON m.ecom_key = ecom.customer_id
)
SELECT
    NEXT VALUE FOR seq_customer_key,
    unified_customer_id,
    full_name,
    email,
    -- Additional standardization
    standardize_phone(phone) AS phone,
    address_standard,
    city,
    state,
    standardize_postal_code(postal_code, country) AS postal_code,
    first_purchase_date,
    calculate_segment(lifetime_value, recency) AS customer_segment,
    loyalty.tier AS loyalty_tier,
    source_crm,
    source_erp,
    source_ecom,
    CURRENT_DATE AS effective_date,
    TRUE AS is_current
FROM survivorship s
LEFT JOIN stg_loyalty loyalty ON s.unified_customer_id = loyalty.customer_id;
```

Integration isn't a one-time project—it's an ongoing discipline that requires governance structures to maintain consistency as systems evolve.
| Component | Purpose | Implementation |
|---|---|---|
| Business Glossary | Authoritative definitions for business terms | Documented definitions of 'customer', 'sale', 'revenue' with source-of-truth designation |
| Data Stewardship | Assigned ownership for data domains | Named stewards for Customer, Product, Finance data with decision authority |
| Code/Value Standards | Master lists of valid values | Centralized reference tables for countries, currencies, product categories |
| Integration Rules | Documented transformation logic | Version-controlled specifications for how source values map to warehouse values |
| Data Quality Metrics | Ongoing measurement of integration quality | Match rates, standardization coverage, conformance scores tracked over time |
| Change Management | Process for evolving standards | Governed procedure when source systems change or new sources onboard |
Without governance, integration degrades over time. New source systems onboard without proper mapping. Existing mappings become stale as source systems evolve. Business definitions drift as new stakeholders reinterpret them. Governance structures exist to prevent this entropy—they're not bureaucracy, they're preservation of integration investment.
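The data quality metrics row of the table above can be made concrete with a small sketch. The record shape and field names are hypothetical, but the two metrics mirror the ones named in the table: match rate (records resolved to more than one source) and standardization coverage (records whose codes mapped successfully):

```python
def integration_quality_metrics(records):
    """Compute simple integration-quality metrics over golden records.
    Tracked over time, a declining value flags mapping drift or a new,
    unmapped source system."""
    total = len(records)
    matched = sum(1 for r in records if len(r.get("sources", [])) > 1)
    standardized = sum(1 for r in records if r.get("gender") != "Unknown")
    return {
        "match_rate": round(matched / total, 3),                 # share linked to >1 source
        "standardization_coverage": round(standardized / total, 3),
    }

sample = [
    {"sources": ["CRM", "ERP"],  "gender": "Female"},
    {"sources": ["CRM"],         "gender": "Unknown"},  # unmapped source code
    {"sources": ["CRM", "ECOM"], "gender": "Male"},
    {"sources": ["ERP"],         "gender": "Male"},
]
print(integration_quality_metrics(sample))  # {'match_rate': 0.5, 'standardization_coverage': 0.75}
```

Published on a dashboard, these numbers turn governance from policy documents into observable trends: a match rate that drops after a source-system upgrade is an early warning that mappings have gone stale.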
Modern Integration Platforms:
While integration principles are timeless, modern platforms automate much of the mechanics: maintaining code mappings, fuzzy matching at scale, and monitoring data quality.
Integration is the engine that transforms chaotic, contradictory data into a coherent analytical foundation.
What's Next:
With subject-orientation and integration established, we now turn to Time-Variance—the characteristic that preserves historical depth. Unlike operational systems that maintain current state, data warehouses capture data across time, enabling trend analysis, point-in-time reporting, and understanding of how subjects evolve.
You now understand Integration—the process of unifying heterogeneous data into consistent warehouse content. From standardization techniques to entity resolution to conformed dimensions, you have the conceptual toolkit for achieving true data integration. Next, we explore Time-Variance: how warehouses preserve history.