Loading learning content...
Real-world data rarely arrives in neat relational form. It comes from spreadsheets with merged cells, JSON documents with nested objects, XML files with hierarchical elements, and legacy systems with pre-relational designs. Before this data can be properly normalized, it must be flattened into the regular, rectangular structure that the relational model requires.
Flattening is the process of transforming hierarchical, nested, or irregularly structured data into relations where:
This page provides systematic techniques for flattening various data structures into 1NF-compliant relations.
By the end of this page, you will understand what flattening means in the context of relational databases, how to flatten nested hierarchical data into multiple related tables, techniques for handling variable-depth hierarchies, strategies for transforming document-oriented data (JSON, XML) to relations, how to preserve data integrity and relationships during flattening, and common pitfalls that lead to data loss during transformation.
Flattening in database design refers to the process of converting data with irregular, nested, or hierarchical structure into the uniform, rectangular format required by the relational model.
Why flattening is necessary:
The relational model, as defined by E.F. Codd, requires that:
Hierarchical data violates these requirements because:
| Characteristic | Hierarchical Data | Relational (Flat) Data |
|---|---|---|
| Structure | Trees with parent-child nesting | Flat tables with rows and columns |
| Cell contents | May contain objects, arrays, nested structures | Single atomic values only |
| Row uniformity | Different nodes may have different attributes | All rows have identical columns |
| Relationships | Embedded within structure | Explicit via foreign keys |
| Query approach | Path traversal (XPath, JSONPath) | Set operations (SQL) |
| Schema flexibility | Often schema-less or flexible | Strict schema enforcement |
The flattening transformation:
Flattening is not merely about layout—it's about decomposing complex structures into simple, atomic components. The process typically involves:
Think of flattening as the inverse of denormalization. Where denormalization combines related data for read performance, flattening separates combined data for write integrity and query flexibility. A deeply nested JSON document might become 5-10 related tables when properly flattened.
The simplest flattening scenario involves single embedded objects—complex attributes that should be decomposed into multiple atomic attributes or extracted into related tables.
Case 1: Embedded object with single occurrence
When a row contains an embedded object that occurs exactly once, you can often flatten by promoting the object's properties to top-level columns:
1234567891011121314151617181920212223
-- BEFORE: Conceptual representation with embedded address-- (Might come from JSON: {"id":1, "name":"Alice", "address":{"street":"123 Main", "city":"Boston", "zip":"02101"}}) -- Non-relational representation:-- CustomerID | Name | Address (embedded object)-- 1 | Alice | {street: "123 Main", city: "Boston", zip: "02101"} -- AFTER: Flattened into atomic columnsCREATE TABLE Customers ( CustomerID INT PRIMARY KEY, Name VARCHAR(100) NOT NULL, -- Address properties promoted to columns Street VARCHAR(200), City VARCHAR(100), State VARCHAR(50), ZipCode VARCHAR(20), Country VARCHAR(50)); -- This works when:-- 1. Each customer has exactly one address-- 2. The address components are accessed/queried individually-- 3. No address sharing between customers is neededCase 2: Embedded object that should be a separate entity
When the embedded object represents a distinct entity with its own identity, or when it might be shared or referenced independently, extract it to a separate table:
1234567891011121314151617181920212223242526272829303132
-- BEFORE: Orders with embedded product details-- OrderID | CustomerID | Product (embedded)-- 1 | 101 | {sku: "LAPTOP", name: "Pro Laptop", price: 999.99, category: "Electronics"} -- This is wrong because:-- - Products exist independently of orders-- - Same product appears in multiple orders-- - Product details shouldn't be duplicated -- AFTER: Proper separation into related tablesCREATE TABLE Products ( SKU VARCHAR(20) PRIMARY KEY, ProductName VARCHAR(100) NOT NULL, Price DECIMAL(10,2) NOT NULL, Category VARCHAR(50)); CREATE TABLE Orders ( OrderID INT PRIMARY KEY, CustomerID INT NOT NULL, OrderDate DATE NOT NULL); CREATE TABLE OrderItems ( OrderID INT NOT NULL, SKU VARCHAR(20) NOT NULL, Quantity INT NOT NULL, UnitPrice DECIMAL(10,2) NOT NULL, -- Price at time of order PRIMARY KEY (OrderID, SKU), FOREIGN KEY (OrderID) REFERENCES Orders(OrderID), FOREIGN KEY (SKU) REFERENCES Products(SKU));Promote embedded attributes to columns when: the object occurs exactly once per parent, has no independent identity, and won't be shared. Extract to a separate table when: the object could exist independently, might be referenced by multiple parents, needs its own constraints, or represents a distinct business concept.
Nested arrays represent one-to-many relationships embedded within a parent object. Flattening them requires creating child tables with foreign key references back to the parent.
The general pattern:
For each array in the source data:
123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960
-- SOURCE: JSON document with nested arrays/*{ "orderId": 1001, "customer": "Alice Johnson", "orderDate": "2024-03-15", "items": [ {"sku": "LAPTOP", "qty": 1, "price": 999.99}, {"sku": "MOUSE", "qty": 2, "price": 29.99}, {"sku": "KEYBOARD", "qty": 1, "price": 79.99} ], "payments": [ {"method": "CREDIT", "amount": 500.00, "date": "2024-03-15"}, {"method": "CREDIT", "amount": 639.96, "date": "2024-03-20"} ]}*/ -- FLATTENED SCHEMA: Three related tables -- Parent table for order headerCREATE TABLE Orders ( OrderID INT PRIMARY KEY, CustomerName VARCHAR(100) NOT NULL, OrderDate DATE NOT NULL); -- Child table for items arrayCREATE TABLE OrderItems ( OrderID INT NOT NULL, ItemSequence INT NOT NULL, -- Preserves array order SKU VARCHAR(20) NOT NULL, Quantity INT NOT NULL, Price DECIMAL(10,2) NOT NULL, PRIMARY KEY (OrderID, ItemSequence), FOREIGN KEY (OrderID) REFERENCES Orders(OrderID) ON DELETE CASCADE); -- Child table for payments arrayCREATE TABLE OrderPayments ( OrderID INT NOT NULL, PaymentSequence INT NOT NULL, PaymentMethod VARCHAR(20) NOT NULL, Amount DECIMAL(10,2) NOT NULL, PaymentDate DATE NOT NULL, PRIMARY KEY (OrderID, PaymentSequence), FOREIGN KEY (OrderID) REFERENCES Orders(OrderID) ON DELETE CASCADE); -- DATA INSERTION:INSERT INTO Orders VALUES (1001, 'Alice Johnson', '2024-03-15'); INSERT INTO OrderItems VALUES (1001, 1, 'LAPTOP', 1, 999.99), (1001, 2, 'MOUSE', 2, 29.99), (1001, 3, 'KEYBOARD', 1, 79.99); INSERT INTO OrderPayments VALUES (1001, 1, 'CREDIT', 500.00, '2024-03-15'), (1001, 2, 'CREDIT', 639.96, '2024-03-20');When flattening arrays that are truly dependent on their parent (like order items), use ON DELETE CASCADE on the foreign key. This maintains the semantic that the child rows have no meaning without their parent—deleting an order automatically removes its items.
Complex data often contains multiple levels of nesting. Flattening these requires creating a table for each level, with foreign keys forming a reference chain.
Example: Corporate hierarchy with nested departments and employees
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576
-- SOURCE: Three-level nested structure/*{ "companyId": 1, "companyName": "TechCorp", "divisions": [ { "divisionName": "Engineering", "departments": [ { "deptName": "Backend", "employees": [ {"empId": 101, "name": "Alice", "role": "Senior Engineer"}, {"empId": 102, "name": "Bob", "role": "Engineer"} ] }, { "deptName": "Frontend", "employees": [ {"empId": 103, "name": "Charlie", "role": "Lead Developer"} ] } ] }, { "divisionName": "Sales", "departments": [ { "deptName": "Enterprise", "employees": [ {"empId": 201, "name": "Diana", "role": "Account Executive"} ] } ] } ]}*/ -- FLATTENED SCHEMA: Four tables with referential chain CREATE TABLE Companies ( CompanyID INT PRIMARY KEY, CompanyName VARCHAR(100) NOT NULL); CREATE TABLE Divisions ( DivisionID INT PRIMARY KEY AUTO_INCREMENT, CompanyID INT NOT NULL, DivisionName VARCHAR(100) NOT NULL, FOREIGN KEY (CompanyID) REFERENCES Companies(CompanyID), UNIQUE (CompanyID, DivisionName) -- Division names unique within company); CREATE TABLE Departments ( DepartmentID INT PRIMARY KEY AUTO_INCREMENT, DivisionID INT NOT NULL, DepartmentName VARCHAR(100) NOT NULL, FOREIGN KEY (DivisionID) REFERENCES Divisions(DivisionID), UNIQUE (DivisionID, DepartmentName) -- Dept names unique within division); CREATE TABLE Employees ( EmployeeID INT PRIMARY KEY, -- From source data DepartmentID INT NOT NULL, EmployeeName VARCHAR(100) NOT NULL, Role VARCHAR(50), FOREIGN KEY (DepartmentID) REFERENCES Departments(DepartmentID)); -- To query all employees in Engineering division:SELECT e.EmployeeName, e.Role, dep.DepartmentNameFROM Employees eJOIN Departments dep ON e.DepartmentID = dep.DepartmentIDJOIN Divisions div ON dep.DivisionID = div.DivisionIDWHERE div.DivisionName = 'Engineering';Flattening decisions at each level:
For each level of nesting, decide:
| Level | Source Structure | Resulting Table | Key Strategy |
|---|---|---|---|
| 1 | Root object | Companies | Natural or surrogate PK |
| 2 | divisions[] | Divisions | Surrogate PK, FK to level 1 |
| 3 | departments[] | Departments | Surrogate PK, FK to level 2 |
| 4 | employees[] | Employees | Natural PK (empId), FK to level 3 |
Each level of flattened hierarchy requires an additional JOIN to traverse. A 5-level hierarchy needs 4 JOINs to get from leaf to root. Consider whether frequently-needed queries might benefit from denormalized columns (like storing CompanyID directly on Employees for reporting), balanced against the update anomalies this introduces.
Some hierarchies don't have a fixed depth—the same type of node can nest arbitrarily deep. Examples include organizational charts, file systems, comment threads, and category trees. These require special modeling techniques.
The adjacency list model stores each node with a reference to its parent. It's simple to implement but requires recursive queries to traverse.
1234567891011121314151617181920212223242526272829303132
-- Adjacency List for variable-depth categoriesCREATE TABLE Categories ( CategoryID INT PRIMARY KEY, CategoryName VARCHAR(100) NOT NULL, ParentCategoryID INT NULL, -- NULL for root nodes FOREIGN KEY (ParentCategoryID) REFERENCES Categories(CategoryID)); -- Sample data: Electronics > Computers > Laptops > Gaming LaptopsINSERT INTO Categories VALUES (1, 'Electronics', NULL);INSERT INTO Categories VALUES (2, 'Computers', 1);INSERT INTO Categories VALUES (3, 'Laptops', 2);INSERT INTO Categories VALUES (4, 'Gaming Laptops', 3);INSERT INTO Categories VALUES (5, 'Phones', 1); -- Query ancestors using recursive CTE (SQL standard)WITH RECURSIVE CategoryPath AS ( -- Base case: start from the target category SELECT CategoryID, CategoryName, ParentCategoryID, 1 AS Depth FROM Categories WHERE CategoryID = 4 -- Gaming Laptops UNION ALL -- Recursive case: join with parent SELECT c.CategoryID, c.CategoryName, c.ParentCategoryID, cp.Depth + 1 FROM Categories c JOIN CategoryPath cp ON c.CategoryID = cp.ParentCategoryID)SELECT * FROM CategoryPath; -- Result: Gaming Laptops -> Laptops -> Computers -> ElectronicsModern data ingestion often involves JSON from APIs or XML from enterprise systems. Here's a systematic approach to flattening these formats.
JSON Flattening Process:
12345678910111213141516171819202122232425262728293031323334353637383940414243444546474849505152535455565758596061626364656667686970717273747576777879808182
-- COMPLEX JSON SOURCE/*{ "userId": 1001, "profile": { "firstName": "Alice", "lastName": "Smith", "email": "alice@example.com", "preferences": { "theme": "dark", "language": "en", "notifications": true } }, "orders": [ { "orderId": 5001, "date": "2024-03-15", "items": [ {"sku": "A1", "qty": 2, "price": 29.99}, {"sku": "B2", "qty": 1, "price": 49.99} ], "shipping": { "address": "123 Main St", "city": "Boston", "method": "EXPRESS" } } ], "tags": ["premium", "early-adopter", "referrer"]}*/ -- FLATTENED SCHEMA -- Users table: combines profile scalars (promoted)CREATE TABLE Users ( UserID INT PRIMARY KEY, FirstName VARCHAR(50) NOT NULL, LastName VARCHAR(50) NOT NULL, Email VARCHAR(100) NOT NULL, -- Preferences promoted since 1:1 relationship PreferenceTheme VARCHAR(20) DEFAULT 'light', PreferenceLanguage VARCHAR(10) DEFAULT 'en', PreferenceNotifications BOOLEAN DEFAULT TRUE); -- Orders table: from orders arrayCREATE TABLE UserOrders ( OrderID INT PRIMARY KEY, UserID INT NOT NULL, OrderDate DATE NOT NULL, -- Shipping promoted since 1:1 with order ShippingAddress VARCHAR(200), ShippingCity VARCHAR(100), ShippingMethod VARCHAR(20), FOREIGN KEY (UserID) REFERENCES Users(UserID)); -- Order Items: from nested items array within ordersCREATE TABLE OrderItems ( OrderID INT NOT NULL, SKU VARCHAR(20) NOT NULL, Quantity INT NOT NULL, Price DECIMAL(10,2) NOT NULL, PRIMARY KEY (OrderID, SKU), FOREIGN KEY (OrderID) REFERENCES UserOrders(OrderID)); -- User Tags: from tags arrayCREATE TABLE UserTags ( UserID INT NOT NULL, Tag VARCHAR(50) NOT NULL, PRIMARY KEY (UserID, Tag), FOREIGN KEY (UserID) REFERENCES Users(UserID)); -- DATA POPULATIONINSERT INTO Users VALUES (1001, 'Alice', 'Smith', 'alice@example.com', 'dark', 'en', TRUE);INSERT INTO UserOrders VALUES (5001, 1001, '2024-03-15', '123 Main St', 'Boston', 'EXPRESS');INSERT INTO OrderItems VALUES (5001, 'A1', 2, 29.99), (5001, 'B2', 1, 49.99);INSERT INTO UserTags VALUES (1001, 'premium'), (1001, 'early-adopter'), (1001, 'referrer');JSON is often schema-less with optional fields. When flattening, make columns for optional fields NULLable, or use default values where semantic defaults exist. Document which fields may be absent in your data dictionary.
Spreadsheet data presents unique flattening challenges: merged cells, implicit hierarchies through indentation, repeating groups, and header rows embedded within data. Here's how to handle common patterns.
Challenge 1: Merged header cells indicating grouping
Spreadsheets often merge cells to create visual groups:
1234567891011121314151617181920212223242526272829303132333435
ORIGINAL SPREADSHEET FORMAT: | Employee | Department | | Contact | || Name | Name | Manager | Email | Phone ||----------|---------|---------|------------------|-----------------|| Alice | Eng | Bob | alice@co.com | 555-1234 || Charlie | Sales | Diana | charlie@co.com | 555-5678 | The visual grouping ("Department" spans Name and Manager) is lost in CSV export. FLATTENING APPROACH:1. Identify that "Manager" is a Department property, not Employee property2. Create separate tables if Departments can exist independently3. Or flatten if it's purely denormalized data CREATE TABLE EmployeeData ( EmployeeName VARCHAR(100) PRIMARY KEY, DepartmentName VARCHAR(100), DepartmentManager VARCHAR(100), Email VARCHAR(100), Phone VARCHAR(20)); -- If departments should be entities:CREATE TABLE Departments ( DepartmentName VARCHAR(100) PRIMARY KEY, Manager VARCHAR(100)); CREATE TABLE Employees ( EmployeeName VARCHAR(100) PRIMARY KEY, DepartmentName VARCHAR(100) REFERENCES Departments(DepartmentName), Email VARCHAR(100), Phone VARCHAR(20));Challenge 2: Implicit hierarchy through indentation
Some spreadsheets show hierarchy through indentation rather than explicit relationships:
123456789101112131415161718192021222324252627282930313233343536373839404142
ORIGINAL SPREADSHEET (indentation shows hierarchy): | Category | SKU | Stock ||--------------------|--------|-------|| Electronics | | || Computers | | || Laptop X | SKU001 | 50 || Desktop Y | SKU002 | 30 || Phones | | || Phone A | SKU003 | 100 || Clothing | | || Shirts | | || Blue Shirt | SKU004 | 75 | FLATTENING APPROACH:1. Parse indentation levels to determine parent-child relationships2. Create adjacency list or path enumeration structure3. SKU rows are leaf products; non-SKU rows are categories -- After parsing:CREATE TABLE Categories ( CategoryID INT PRIMARY KEY, CategoryName VARCHAR(100), ParentCategoryID INT REFERENCES Categories(CategoryID)); CREATE TABLE Products ( SKU VARCHAR(20) PRIMARY KEY, ProductName VARCHAR(100), CategoryID INT REFERENCES Categories(CategoryID), Stock INT); INSERT INTO Categories VALUES (1, 'Electronics', NULL);INSERT INTO Categories VALUES (2, 'Computers', 1);INSERT INTO Categories VALUES (3, 'Phones', 1);INSERT INTO Categories VALUES (4, 'Clothing', NULL);INSERT INTO Categories VALUES (5, 'Shirts', 4); INSERT INTO Products VALUES ('SKU001', 'Laptop X', 2, 50);INSERT INTO Products VALUES ('SKU002', 'Desktop Y', 2, 30);-- ... etc.For complex spreadsheet transformations, consider using Python pandas for data wrangling, Apache tools like Spark for large datasets, or ETL tools with visual transformations. Parse the human-readable format into intermediate structures before generating SQL.
Flattening transforms hierarchical, nested, or irregular data into the uniform relational structure that 1NF requires. It's often the first step when ingesting data from external sources or migrating from non-relational systems. Let's consolidate the key learnings:
What's next:
With understanding of atomicity, repeating groups, and flattening, we're ready to look at what actually constitutes a 1NF violation. The next page provides a comprehensive taxonomy of 1NF violations, helping you identify issues in existing schemas and avoid them in new designs.
You can now transform hierarchical data into flat relational structures, handle embedded objects and nested arrays, model variable-depth hierarchies with appropriate techniques, and flatten JSON, XML, and spreadsheet data for relational storage. Next, we'll examine common 1NF violations in detail.