Generalization - Learning Module

Bottom-Up Approach

Building Abstract from Concrete

Imagine you're an archaeologist who has discovered artifacts from an ancient civilization: clay pots, stone tools, metal jewelry, and wooden utensils. Initially, each artifact seems unique, but as you study them, patterns emerge. You begin grouping items by material, by purpose, by era. Eventually, you realize that certain artifacts—though superficially different—share deep structural properties. They're all tools, or all ceremonial objects, or all household items.

This is precisely what database designers do when applying the bottom-up approach to generalization. You start with concrete, specific entity types—perhaps inherited from a legacy system, discovered through requirements gathering, or created during iterative design. Then, through careful analysis, you discover hidden commonalities that suggest a more elegant, unified model.

The bottom-up approach is not merely a technique; it is a disciplined methodology for discovering abstraction in data. It transforms a collection of seemingly disparate entities into a coherent type hierarchy that reflects the true conceptual structure of the domain.

What You Will Learn

By the end of this page, you will master the systematic bottom-up methodology for generalization: how to identify candidate entities, analyze their attributes and relationships, discover meaningful commonalities, create well-formed supertypes, and validate your generalizations against domain requirements.

Understanding Bottom-Up vs Top-Down Approaches

Before diving into the methodology, it's essential to understand the distinction between bottom-up and top-down approaches to type hierarchies, as they represent fundamentally different cognitive and design processes.

Bottom-Up (Generalization):

Starting Point: Specific, concrete entity types already exist in the model
Process: Analyze entities for common features and relationships
Direction: Move from specific to general (toward abstraction)
Discovery: Commonalities are discovered through analysis
Result: A supertype that emerges from observed similarities

Top-Down (Specialization):

Starting Point: A general entity type exists or is conceived
Process: Identify subgroups with distinct characteristics
Direction: Move from general to specific (toward concretization)
Discovery: Differences are identified to justify subtypes
Result: Subtypes that subdivide the original entity

While these approaches are conceptually inverse, they often converge to the same hierarchical structure. The choice of approach depends on how the design problem presents itself.

Bottom-Up vs Top-Down: Key Differences
Aspect	Bottom-Up (Generalization)	Top-Down (Specialization)
Initial model state	Multiple specific entity types exist	Single general entity exists
Design question	"What do these have in common?"	"What are the different kinds of this?"
Cognitive process	Synthesis (combining into whole)	Analysis (breaking into parts)
When typically used	Legacy system integration, data consolidation	New system design, domain decomposition
Risk	Forcing artificial commonalities	Missing important variations
Challenge	Naming the discovered supertype	Completely enumerating subtypes

Hybrid Approaches Are Common

In practice, database design often involves both approaches iteratively. You might start bottom-up by generalizing CHECKING and SAVINGS into ACCOUNT, then apply top-down thinking to realize you're missing other account types like CD and MONEY_MARKET. The approaches complement each other.

The Systematic Bottom-Up Methodology

The bottom-up approach to generalization follows a disciplined, repeatable methodology. Each step builds on the previous, leading from initial observation to validated generalization.

Phase 1: Entity Catalog and Initial Analysis

•Inventory All Entities — Create a complete catalog of all entity types in the current model or domain. Include entities from all sources: requirements documents, legacy systems, stakeholder interviews, and existing databases.
•Document Attributes Thoroughly — For each entity, list all attributes with their data types, constraints, and semantic meanings. Don't just note 'name' — note 'customer full name, VARCHAR(100), NOT NULL, format: Last, First'.
•Map Relationships — Document all relationships each entity participates in, including cardinality, participation constraints, and relationship attributes. This provides insight into how entities interact with the broader model.
•Gather Domain Context — Collect business rules, constraints, and domain knowledge associated with each entity. Understanding why attributes exist and what they mean is crucial for meaningful generalization.

Phase 2: Similarity Analysis

•Create Comparison Matrix — Build a matrix comparing entities pairwise for attribute overlap, relationship similarity, and semantic relatedness. Quantify similarity where possible.
•Identify Attribute Clusters — Group attributes that appear across multiple entities. These represent potential generalized properties. Watch for synonyms: 'clientName' vs 'customerName' vs 'accountHolder'.
•Analyze Relationship Patterns — Look for entities that participate in the same or similar relationships. If three entities all relate to DEPARTMENT, ADDRESS, and AUDIT_LOG, they likely share a common identity.
•Consider Semantic Context — Ask domain experts whether entities conceptually belong together. Technical similarity without semantic relatedness leads to poor generalizations.

Phase 3: Supertype Construction

•Name the Supertype — Choose a name that accurately reflects the common concept. The name should be a natural noun that domain experts would use when speaking generally about all subtypes.
•Define Supertype Attributes — Include all attributes common to all subtypes. Ensure consistent data types and semantics. Handle naming variations by choosing the most appropriate term.
•Establish Supertype Relationships — Move common relationships to the supertype level. The relationship should make semantic sense with all subtypes, not just some of them.
•Refine Subtype Definitions — Remove now-inherited attributes from subtypes. Verify that remaining attributes are genuinely specific to each subtype and not shared.
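
The construction steps above can be sketched with plain set operations. This is a minimal illustration, not a prescribed procedure: it assumes each entity is reduced to a set of attribute names and that synonyms have already been reconciled. CHECKING and SAVINGS echo the earlier ACCOUNT example, but the attribute names are hypothetical.

```python
# Hypothetical attribute sets for two candidate subtypes (names are illustrative).
entities = {
    "CHECKING": {"accountNumber", "ownerName", "balance", "overdraftLimit"},
    "SAVINGS": {"accountNumber", "ownerName", "balance", "interestRate"},
}


def build_supertype(entities):
    """Return (supertype attributes, refined subtype attributes)."""
    # The supertype receives only the attributes present in every subtype.
    common = set.intersection(*entities.values())
    # Each subtype keeps just the attributes it does not inherit.
    refined = {name: attrs - common for name, attrs in entities.items()}
    return common, refined


account_attrs, subtypes = build_supertype(entities)
print(sorted(account_attrs))
print({name: sorted(attrs) for name, attrs in subtypes.items()})
```

The sketch covers only the mechanical part of Phase 3. Choosing the supertype name and checking that each shared attribute means the same thing in every subtype still requires domain judgment.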

Phase 4: Validation and Refinement

•Domain Expert Review — Present the generalization to stakeholders. Does the supertype concept make sense to them? Can they naturally use the term? Are any subtypes incorrectly grouped?
•Query Analysis — Consider the queries the system needs to support. Does the generalization simplify common queries? Does it unnecessarily complicate specific queries?
•Constraint Verification — Ensure that all constraints from the original entities are preserved. Common constraints should be in the supertype; specific constraints remain in subtypes.
•Documentation — Document the generalization decision, including the rationale, the common attributes identified, and the expected benefits. Future maintainers need this context.

Detailed Case Study: Healthcare System

Let's walk through the complete bottom-up generalization process using a realistic healthcare system scenario. This extended example demonstrates each phase of the methodology in action.

Initial Situation:

A hospital information system has evolved over years, with different departments creating their own entity models. You've been tasked with consolidating the data model. The current state includes several person-related entities that appear redundant:

Step 1.1: Entity Inventory

We identify four person-related entities in the current model:

PHYSICIAN

physicianId (PK, system-generated ID)
firstName, lastName, middleName
dateOfBirth
ssn (encrypted)
phone, email, address
medicalLicenseNumber
specialty
boardCertifications
hospitalPrivileges

NURSE

nurseId (PK, system-generated)
firstName, lastName
dateOfBirth
ssn
phone, email, address
nursingLicenseNumber
certifications (RN, LPN, NP, etc.)
department
shiftPreference

PATIENT

mrn (Medical Record Number, PK)
firstName, lastName, middleName
dateOfBirth
ssn
phone, email, address
emergencyContact
insuranceProvider
insurancePolicyNumber
bloodType
allergies

ADMINISTRATIVE_STAFF

employeeId (PK)
firstName, lastName
dateOfBirth
phone, email, address
department
role
hireDate
accessLevel

Step 1.2: Relationship Documentation

Entity	Relationships
PHYSICIAN	TREATS→Patient, WORKS_IN→Department, HAS_PRIVILEGES→Hospital, ORDERS→Test/Medication
NURSE	ASSIGNED_TO→Patient, WORKS_IN→Department, REPORTS_TO→Physician, ADMINISTERS→Medication
PATIENT	TREATED_BY→Physician, HAS→Appointment, RECEIVES→Test/Medication, BILLED_TO→Insurance
ADMINISTRATIVE_STAFF	WORKS_IN→Department, SCHEDULES→Appointment, PROCESSES→Billing
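
As a rough illustration of how the inventory above could feed Phase 2, the following sketch transcribes the four attribute lists into Python sets and reports which attributes are shared by all entities and which by only some. It assumes that identically named attributes carry the same meaning, which the semantic review would still need to confirm. The primary keys are kept under their original names, so they do not count as shared.

```python
from collections import Counter

# Attribute names transcribed from the Step 1.1 inventory.
inventory = {
    "PHYSICIAN": {
        "physicianId", "firstName", "lastName", "middleName", "dateOfBirth",
        "ssn", "phone", "email", "address", "medicalLicenseNumber",
        "specialty", "boardCertifications", "hospitalPrivileges",
    },
    "NURSE": {
        "nurseId", "firstName", "lastName", "dateOfBirth", "ssn", "phone",
        "email", "address", "nursingLicenseNumber", "certifications",
        "department", "shiftPreference",
    },
    "PATIENT": {
        "mrn", "firstName", "lastName", "middleName", "dateOfBirth", "ssn",
        "phone", "email", "address", "emergencyContact", "insuranceProvider",
        "insurancePolicyNumber", "bloodType", "allergies",
    },
    "ADMINISTRATIVE_STAFF": {
        "employeeId", "firstName", "lastName", "dateOfBirth", "phone",
        "email", "address", "department", "role", "hireDate", "accessLevel",
    },
}

# Attributes present in every entity: candidates for a shared supertype.
common = set.intersection(*inventory.values())

# Attributes present in more than one entity but not in all of them.
counts = Counter(attr for attrs in inventory.values() for attr in attrs)
partial = {attr: n for attr, n in counts.items() if 1 < n < len(inventory)}

print(sorted(common))
print(sorted(partial.items()))
```

Under these assumptions, six attributes (name, date of birth, and contact details) appear in all four entities, while ssn, middleName, and department are shared by only some. Those partially shared attributes are the ones that need the closer domain analysis described in the following sections.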

Attribute Analysis Techniques

The heart of bottom-up generalization is attribute analysis. Several systematic techniques help identify common attributes that should move to a supertype.

Similarity Metrics for Attribute Comparison

•Jaccard Similarity — J(A,B) = |A∩B| / |A∪B|. For two entities with attribute sets A and B, this measures the proportion of shared attributes among all distinct attributes. As a rule of thumb, values above 0.5 suggest generalization potential. If ENTITY_X has 10 attributes and ENTITY_Y has 12, with 6 shared, then |A∪B| = 10 + 12 − 6 = 16 and Jaccard = 6/16 = 0.375, which falls below that threshold.
•Name Matching — Identify attributes with identical names across entities. These are prime candidates for the supertype. Be careful of false positives: 'status' in ORDER and 'status' in EMPLOYEE may have different semantics.
•Semantic Matching — Look for synonymous attributes: 'clientId' vs 'customerId', 'createdAt' vs 'dateCreated', 'amt' vs 'amount'. Domain expertise is essential for recognizing these.
•Type Compatibility — Attributes with the same data type and similar constraints often represent the same concept. Two VARCHAR(50) 'name' fields are likely the same; a VARCHAR(50) 'name' and DECIMAL(10,2) 'name' warrant investigation.
•Position Analysis — In legacy systems, attributes in similar positions (e.g., first 5 columns) often represent standard header fields that should generalize.
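
The Jaccard calculation from the list above can be reproduced in a few lines. This is a small sketch rather than a full comparison tool: the attribute names are placeholders generated only to match the counts in the worked example (10 and 12 attributes, 6 shared).

```python
def jaccard(a: set, b: set) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B|; returns 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


# Placeholder attribute names matching the counts in the worked example.
shared = {f"shared_{i}" for i in range(6)}
entity_x = shared | {f"x_only_{i}" for i in range(4)}   # 10 attributes
entity_y = shared | {f"y_only_{i}" for i in range(6)}   # 12 attributes

score = jaccard(entity_x, entity_y)
print(score)  # 6 / 16 = 0.375
```

The function compares names literally, so it inherits the false positives and missed synonyms described under Name Matching and Semantic Matching. Normalizing attribute names first would give a more meaningful score.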

Handling Attribute Variations:

Real-world entities often have 'almost identical' attributes that require careful handling:

Case 1: Naming Differences

CUSTOMER.clientName vs SUPPLIER.vendorName
Both represent 'organization name'
Solution: Generalize to ORGANIZATION.name, document original names

Case 2: Type Differences

EMPLOYEE.employeeId (INT) vs CONTRACTOR.contractorId (VARCHAR)
Both serve as identifiers
Solution: Choose the more general type (VARCHAR accommodates both), or create a new unified identifier

Case 3: Constraint Differences

FULL_TIME.startDate (NOT NULL) vs PART_TIME.startDate (NULL allowed)
Same concept, different constraints
Solution: Use the looser constraint in supertype (nullable), enforce stricter constraint only in FULL_TIME subtype

Case 4: Presence Differences

DOMESTIC_CUSTOMER.taxId (present) vs INTERNATIONAL_CUSTOMER.taxId (absent)
Concept exists in some subtypes only
Solution: Include in supertype as optional (nullable), or keep in applicable subtypes only
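
Case 3 can be made concrete with a small SQLite sketch. It shows one possible way to keep the looser constraint at the supertype level while still enforcing the stricter rule for one subtype. It assumes a single-table mapping with a hypothetical `kind` discriminator column and a hypothetical EMPLOYEE supertype; other mappings (separate subtype tables, triggers) are equally possible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE employee (
        employee_id INTEGER PRIMARY KEY,
        kind        TEXT NOT NULL CHECK (kind IN ('FULL_TIME', 'PART_TIME')),
        start_date  TEXT,  -- nullable: the looser, supertype-level constraint
        -- Stricter rule that applies to FULL_TIME rows only.
        CHECK (kind <> 'FULL_TIME' OR start_date IS NOT NULL)
    )
""")

# A part-time row may omit the start date; a full-time row with one is fine.
conn.execute("INSERT INTO employee (kind, start_date) VALUES ('PART_TIME', NULL)")
conn.execute("INSERT INTO employee (kind, start_date) VALUES ('FULL_TIME', '2024-01-15')")

# A full-time row without a start date violates the conditional CHECK.
try:
    conn.execute("INSERT INTO employee (kind, start_date) VALUES ('FULL_TIME', NULL)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(rejected)
```

The conditional CHECK keeps the rule in the schema rather than in application code, at the cost of tying the supertype table to knowledge of its subtypes.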

The 80% Rule

A common heuristic: if an attribute appears in 80% or more of the candidate subtypes, consider placing it in the supertype (possibly as nullable for the exceptions). If it appears in fewer than 50%, it is likely a subtype-specific attribute. Between 50% and 80%, careful domain analysis is needed.
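
The heuristic can be written as a small helper. This is only a sketch of the thresholds stated above; the function name and return labels are made up for illustration, and treating exactly 50% as part of the middle band is an assumption.

```python
def placement(appears_in: int, subtype_count: int) -> str:
    """Suggest where an attribute belongs, using the 80% and 50% thresholds."""
    if subtype_count <= 0 or not 0 <= appears_in <= subtype_count:
        raise ValueError("appears_in must be between 0 and subtype_count")
    # Integer arithmetic avoids floating-point surprises at the boundaries.
    if appears_in * 10 >= subtype_count * 8:   # 80% or more
        return "supertype"
    if appears_in * 2 < subtype_count:         # below 50%
        return "subtype-specific"
    return "needs domain analysis"             # from 50% up to, but not including, 80%


print(placement(4, 5))  # 80% of subtypes
print(placement(3, 4))  # 75% of subtypes
print(placement(1, 4))  # 25% of subtypes
```

With only four candidate subtypes, an attribute present in three of them sits at 75% and lands in the middle band, so small hierarchies will often need the manual review rather than an automatic answer.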

Relationship Analysis for Generalization

Beyond attributes, relationship patterns provide powerful signals for generalization. When multiple entities participate in the same relationships, they likely share a common abstract identity.

Pattern 1: Common Participation

Multiple entities participate in relationships with the same target entity:

CUSTOMER ──(has)──→ ADDRESS
EMPLOYEE ──(has)──→ ADDRESS
SUPPLIER ──(has)──→ ADDRESS

This pattern suggests all three should generalize to a PARTY or ADDRESSABLE_ENTITY supertype that has the ADDRESS relationship.

Pattern 2: Role Substitutability

Multiple entities can play the same role in a relationship:

CUSTOMER ──(places)──→ ORDER
DISTRIBUTOR ──(places)──→ ORDER
INTERNAL_DEPT ──(places)──→ ORDER

Any of these can place orders. Generalization to ORDER_PARTY or ACCOUNT simplifies the ORDER entity, which now relates to one supertype.

Pattern 3: Audit/Tracking Relationships

Entities with common tracking relationships:

CONTRACT ──(created_by)──→ USER
CONTRACT ──(modified_by)──→ USER
INVOICE ──(created_by)──→ USER
INVOICE ──(modified_by)──→ USER
PROPOSAL ──(created_by)──→ USER

This suggests an AUDITABLE_DOCUMENT supertype with common audit relationships.

Relationship Pattern Analysis Template
Pattern Type	Detection Signal	Generalization Action	Example
Common Target	Same entity appears as target of multiple relationships	Create supertype for sources	PARTY for all entities linked to ADDRESS
Role Substitution	Multiple entities fill same role in different instances	Create supertype to fill role	ACCOUNT for all entities that can place orders
Parallel Relationships	Multiple entities have identical relationship sets	Create supertype, move relationships up	DOCUMENT for all entities with audit trails
Hierarchical Pattern	Entity relates to entities of the same type	Consider self-referential on supertype	EMPLOYEE.manages→EMPLOYEE generalizes to PERSON

Relationship-First Discovery

Sometimes, examining relationships reveals generalization opportunities that attribute analysis misses. If five entities all participate in relationships with DEPARTMENT, BUILDING, and ACCESS_CONTROL, they likely share an ORGANIZATIONAL_ENTITY identity—even if their attributes seem quite different.
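
Relationship-first discovery can be approximated by grouping source entities by the targets they relate to. The sketch below is illustrative only: the relationship data loosely follows Patterns 1 and 2 above, the EMPLOYEE to DEPARTMENT link is a hypothetical addition, and the function names are not from any tool.

```python
from collections import defaultdict

# Relationship targets per entity (illustrative data).
targets = {
    "CUSTOMER": {"ADDRESS", "ORDER"},
    "EMPLOYEE": {"ADDRESS", "DEPARTMENT"},
    "SUPPLIER": {"ADDRESS"},
    "DISTRIBUTOR": {"ORDER"},
}


def sources_by_target(targets):
    """Invert the mapping: for each target, collect the entities relating to it."""
    grouped = defaultdict(set)
    for entity, related in targets.items():
        for target in related:
            grouped[target].add(entity)
    return dict(grouped)


def shared_targets(targets, minimum=2):
    """Keep only targets reached by at least `minimum` different entities."""
    return {
        target: sources
        for target, sources in sources_by_target(targets).items()
        if len(sources) >= minimum
    }


candidates = shared_targets(targets)
for target, sources in sorted(candidates.items()):
    print(target, sorted(sources))
```

Each group in the output is only a candidate. Whether CUSTOMER, EMPLOYEE, and SUPPLIER really form one concept still depends on the semantic checks described earlier.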

Tools and Techniques for Bottom-Up Analysis

Several practical tools and techniques support the bottom-up generalization process in real-world database design projects.

Analysis Techniques

•Entity Comparison Spreadsheets — Create a matrix with entities as rows and all possible attributes as columns. Mark which entities have which attributes. Visual patterns emerge quickly, revealing clusters of common attributes.
•Attribute Frequency Analysis — Count how many entities each attribute appears in. Attributes that appear in many entities are candidates for supertype attributes.
•Relationship Dependency Graphs — Visualize which entities connect to which other entities. Entities with similar connection patterns belong together.
•Domain Glossary Cross-Reference — Map entity names and attribute names to domain glossary terms. Entities mapping to the same high-level concept should generalize.
•Query Pattern Analysis — Examine existing or planned queries. Queries that UNION multiple entity types indicate a need for generalization.

attribute-analysis-query.sql
-- Identify columns shared across candidate tables, for analyzing an existing
-- database for generalization opportunities.
-- Assumes PostgreSQL ('public' schema, STRING_AGG with an explicit text cast).
-- Grouping by data_type as well as column_name means a same-named column with
-- different types shows up as separate rows, which flags a type mismatch.

WITH column_counts AS (
    SELECT
        column_name,
        data_type,
        COUNT(DISTINCT table_name) AS table_count,
        STRING_AGG(table_name::text, ', ' ORDER BY table_name) AS appears_in_tables
    FROM information_schema.columns
    WHERE table_schema = 'public'
      AND table_name IN ('customer', 'supplier', 'employee', 'contractor')
    GROUP BY column_name, data_type
)
SELECT
    column_name,
    data_type,
    table_count,
    appears_in_tables,
    CASE
        WHEN table_count = 4 THEN 'SUPERTYPE CANDIDATE'  -- present in all four tables
        WHEN table_count >= 3 THEN 'PROBABLE SUPERTYPE'
        WHEN table_count = 2 THEN 'CONSIDER'
        ELSE 'SUBTYPE SPECIFIC'
    END AS recommendation
FROM column_counts
ORDER BY table_count DESC, column_name;
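
For experimenting without a database server, the same idea can be tried against an in-memory SQLite database. This is a rough port under stated assumptions, not an equivalent of the query above: SQLite has no `information_schema`, so the sketch reads column metadata through `PRAGMA table_info`, and the table definitions are hypothetical (only the four table names come from the query).

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
# Hypothetical tables; only the table names come from the query above.
conn.executescript("""
    CREATE TABLE customer   (id INTEGER PRIMARY KEY, name TEXT, email TEXT, credit_limit REAL);
    CREATE TABLE supplier   (id INTEGER PRIMARY KEY, name TEXT, email TEXT, payment_terms TEXT);
    CREATE TABLE employee   (id INTEGER PRIMARY KEY, name TEXT, email TEXT, hire_date TEXT);
    CREATE TABLE contractor (id INTEGER PRIMARY KEY, name TEXT, contract_end TEXT);
""")

TABLES = ["customer", "supplier", "employee", "contractor"]

# Map (column name, declared type) -> tables containing that column.
appears_in = defaultdict(list)
for table in TABLES:
    # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
    for _cid, column, col_type, *_rest in conn.execute(f"PRAGMA table_info({table})"):
        appears_in[(column, col_type)].append(table)


def recommend(table_count: int) -> str:
    """Same thresholds as the CASE expression in the SQL version."""
    if table_count == 4:
        return "SUPERTYPE CANDIDATE"
    if table_count >= 3:
        return "PROBABLE SUPERTYPE"
    if table_count == 2:
        return "CONSIDER"
    return "SUBTYPE SPECIFIC"


report = {key: recommend(len(tables)) for key, tables in appears_in.items()}
for (column, col_type), verdict in sorted(report.items()):
    print(f"{column:15} {col_type:8} {verdict}")
```

As with the SQL version, the thresholds are hard-coded for four tables and would need adjusting for a different number of candidates.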

Automated Discovery Tools:

Several database design and data modeling tools offer features that assist with generalization discovery:

ERwin Data Modeler: Subtype clustering analysis, attribute comparison reports
ER/Studio: Automatic common attribute detection, generalization wizards
Oracle SQL Developer Data Modeler: Supertype suggestion based on attribute overlap
PowerDesigner: Model comparison and merge features that highlight commonalities

While these tools can assist, human judgment remains essential. The tools identify potential generalizations; domain expertise determines which are meaningful.

Common Pitfalls in Bottom-Up Generalization

Even experienced database designers can fall into traps when applying bottom-up generalization. Recognizing these pitfalls helps avoid them.

Anti-Patterns to Avoid

•Over-Generalization — Creating supertypes for entities that share only superficial similarities. Just because CAR and REFRIGERATOR both have 'model' and 'manufacturer' doesn't mean they should share an APPLIANCE supertype (unless that's actually the domain).
•Forced Naming — Struggling to name a supertype is a sign it might not exist. If the best name you can devise is 'THING_WITH_DATE_AND_ID', reconsider whether the generalization is valid.
•Ignoring Semantic Differences — Treating attributes as identical because they have the same name, when they mean different things. An ORDER.status and a PATIENT.status may have entirely different value domains and semantics.
•Premature Generalization — Creating generalizations before fully understanding the domain. Early in design, entities may seem similar but diverge significantly as requirements clarify.
•Losing Subtype Identity — Generalizing so aggressively that subtypes become meaningless. If subtypes have no local attributes and no specific relationships, the distinction may not be meaningful.
•Ignoring Query Impact — Creating generalizations that make simple queries complex. If every query that previously targeted a single entity now needs joins to reassemble subtype data, reconsider the generalization.

The Naming Test

If you can't explain the supertype to a domain expert in one natural sentence—'X is a general category that includes [subtypes] because they all [common property]'—the generalization may be artificial. The supertype should have a natural name that domain experts would use.

Bad Generalization

•Supertype: DATED_ITEM
•Subtypes: ORDER, EMPLOYEE, LOG_ENTRY
•Common: They all have a date field
•Problem: No semantic relationship
•Name is forced and meaningless

Good Generalization

•Supertype: DOCUMENT
•Subtypes: CONTRACT, PROPOSAL, INVOICE
•Common: creation date, author, status, version
•Semantic: All are formal business documents
•Name is natural and domain-appropriate
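
To make the DOCUMENT example more tangible, here is a minimal SQLite sketch of one common relational mapping: a supertype table plus one table per subtype sharing its primary key. All column names beyond those listed above (for example `counterparty` and `amount_due`) are hypothetical, and a PROPOSAL table would follow the same pattern as the two subtype tables shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    -- Supertype: attributes common to every business document.
    CREATE TABLE document (
        document_id INTEGER PRIMARY KEY,
        doc_type    TEXT NOT NULL CHECK (doc_type IN ('CONTRACT', 'PROPOSAL', 'INVOICE')),
        created_on  TEXT NOT NULL,
        author      TEXT NOT NULL,
        status      TEXT NOT NULL,
        version     INTEGER NOT NULL DEFAULT 1
    );
    -- Subtypes: share the supertype's key and hold only their own attributes.
    CREATE TABLE contract (
        document_id  INTEGER PRIMARY KEY REFERENCES document(document_id),
        counterparty TEXT NOT NULL
    );
    CREATE TABLE invoice (
        document_id INTEGER PRIMARY KEY REFERENCES document(document_id),
        amount_due  REAL NOT NULL
    );
""")

conn.execute("INSERT INTO document VALUES (1, 'CONTRACT', '2024-03-01', 'alice', 'draft', 1)")
conn.execute("INSERT INTO contract VALUES (1, 'Acme Corp')")
conn.execute("INSERT INTO document VALUES (2, 'INVOICE', '2024-03-05', 'bob', 'sent', 1)")
conn.execute("INSERT INTO invoice VALUES (2, 199.0)")

# A query over all documents needs only the supertype table.
all_docs = conn.execute(
    "SELECT document_id, doc_type, author FROM document ORDER BY document_id"
).fetchall()

# A subtype-specific query has to join back to the supertype.
contracts = conn.execute(
    "SELECT d.author, c.counterparty FROM document d JOIN contract c USING (document_id)"
).fetchall()

print(all_docs)
print(contracts)
```

The two queries show the trade-off named under Ignoring Query Impact: cross-subtype queries become simpler, while subtype-specific queries gain a join.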

Summary: The Bottom-Up Approach

We've explored the complete bottom-up methodology for discovering and implementing generalization. Let's consolidate the essential practices:

Key Takeaways

•Bottom-up moves from specific to general — You start with concrete entities and synthesize abstract supertypes based on observed commonalities.
•Follow a systematic methodology — Catalog entities, analyze similarities, construct supertypes, and validate thoroughly. Shortcuts lead to poor generalizations.
•Analyze both attributes and relationships — Common participation in relationships is as important as common attributes for identifying generalization opportunities.
•Use quantitative techniques — Similarity metrics, comparison matrices, and frequency analysis provide objective evidence for generalization decisions.
•Validate with domain experts — Technical similarity without semantic relatedness creates artificial, unhelpful generalizations. The supertype must represent a real domain concept.
•Avoid common pitfalls — Over-generalization, forced naming, and ignoring semantics undermine the benefits of proper generalization.

What's Next:

Now that we understand the bottom-up process for discovering generalizations, we'll examine common attributes in detail—how to identify, analyze, and properly move shared attributes to the supertype while handling edge cases like nullable values, type mismatches, and constraint differences.
