Graph Databases - Learning Module

Loading content...

0/273

The Graph Data Model: Nodes, Edges, and Properties

When Relationships Are the Data

Consider a seemingly simple question: "Find me all the friends of friends of Alice who work at the same company as Bob and have recommended a product that Carol purchased."

In a relational database, answering this question requires joining multiple tables—users to friendships to users again, users to employment, users to recommendations, users to purchases. The query plan explodes with JOIN operations. At scale, it becomes prohibitively expensive.

What if the relationships themselves were first-class citizens?

This is the fundamental insight behind graph databases. While relational databases excel at storing structured data with well-defined schemas, they struggle when the connections between entities are as important as the entities themselves. Graph databases flip this paradigm: relationships aren't expensive operations to compute—they're pre-materialized pointers that can be traversed in constant time.

What You Will Learn

By the end of this page, you will understand the fundamental graph data model—nodes, edges (relationships), and properties. You'll learn how graphs differ mathematically and practically from relational tables, when the graph model provides decisive advantages, and how property graphs extend classical graph theory for real-world applications. This foundation is essential for everything that follows in graph databases.

Mathematical Foundations of Graphs

Before diving into database specifics, we must understand the mathematical structures that underpin graph databases. Graph theory—a branch of discrete mathematics—provides the formal language for describing connected data.

Definition: A Graph G = (V, E)

A graph G consists of two fundamental components:

V (Vertices/Nodes): A set of entities, objects, or data points
E (Edges/Relationships): A set of connections between pairs of vertices

Each edge e ∈ E connects two vertices (v₁, v₂), potentially with directionality and additional attributes.

Types of Graphs in Mathematical Theory
Graph Type	Definition	Edge Representation	Database Example
Undirected Graph	Edges have no direction; relationship is symmetric	(A, B) = (B, A)	Facebook friendships (mutual)
Directed Graph (Digraph)	Edges have direction; relationship is asymmetric	(A → B) ≠ (B → A)	Twitter follows (one-way)
Weighted Graph	Edges carry numerical weights or costs	(A, B, w) where w = weight	Road networks (distance)
Multigraph	Multiple edges allowed between same vertex pair	{(A, B)₁, (A, B)₂, ...}	Multiple relationships between users
Hypergraph	Single edge can connect multiple vertices	(A, B, C, D)	Group memberships, tags

Key Graph Metrics:

Understanding these metrics is crucial for query optimization and capacity planning:

Degree: Number of edges connected to a node. In directed graphs, we distinguish between in-degree (incoming edges) and out-degree (outgoing edges).
Path: A sequence of vertices where each adjacent pair is connected by an edge.
Path Length: Number of edges in a path (or sum of weights in weighted graphs).
Shortest Path: Minimum-length path between two vertices.
Cycle: A path that starts and ends at the same vertex.
Connected Component: Maximal subgraph where every vertex is reachable from every other vertex.
Diameter: Maximum shortest path length between any two vertices in the graph.

The Small World Phenomenon

Real-world social graphs exhibit the 'small world' property—despite potentially billions of nodes, the average path length between any two nodes is remarkably short (typically 4-6 hops). This is why graph databases can efficiently answer multi-hop traversal queries that would require exponentially expensive JOINs in relational databases.

The Property Graph Model

Pure mathematical graphs—while theoretically elegant—lack the richness needed for real-world data modeling. The Property Graph Model extends basic graph theory with three essential enhancements:

Labels: Nodes and edges can have type classifications
Properties: Key-value pairs attached to nodes and edges
Directed Relationships: Edges have explicit direction with semantic meaning

This creates a powerful, flexible data model that combines graph structure with document-like richness.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// Node with label and properties
(alice:User {
  id: "user-001",
  name: "Alice Chen",
  email: "alice@example.com",
  joinDate: "2021-03-15",
  tier: "premium"
})
 
// Another node
(bob:User {
  id: "user-002", 
  name: "Bob Smith",
  email: "bob@example.com",
  joinDate: "2020-11-22",
  tier: "standard"
})
 
// Relationship with type, direction, and properties
(alice)-[:FOLLOWS {
  since: "2022-01-10",
  source: "suggested",
  notificationEnabled: true
}]->(bob)
 
// Multiple relationship types between same nodes
(alice)-[:WORKS_WITH {
  department: "Engineering",
  since: "2021-06-01"
}]->(bob)

Anatomy of a Property Graph:

Nodes (Vertices):

Represent entities: users, products, locations, events
Have one or more labels (types): :User, :Product, :Company
Contain properties: key-value pairs with typed values
Have internal unique identifiers (implementation-specific)

Relationships (Edges):

Connect exactly two nodes (start node → end node)
Have exactly one type: :FOLLOWS, :PURCHASED, :WORKS_AT
Always directed (though can be traversed in either direction during queries)
Can carry properties: weight, timestamp, metadata
Are first-class citizens, not afterthoughts

Labels and Types:

Enable efficient indexing and query filtering
Support multiple labels per node (:User:Admin:Premium)
Provide schema-like structure without rigid schema enforcement

Relationships Are Not Foreign Keys

In relational databases, relationships are implicit—discovered through expensive JOIN operations that match foreign keys. In property graphs, relationships are explicit pointers stored with the node. Traversing a relationship is O(1)—like following a pointer—not O(log n) or O(n) like index lookups or table scans.

Nodes: Modeling Entities

Nodes are the fundamental units of data in a graph database—analogous to rows in relational tables or documents in document stores. However, nodes have properties that make them uniquely suited for connected data:

Node Identity:

Every node has an internal, unique identifier generated by the database. This ID is:

Immutable for the lifetime of the node
Used internally for relationship pointers
Not suitable for external references (use application-defined IDs as properties)

Labels as Flexible Types:

Labels provide typing without the rigidity of relational schemas:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
// Single label
(:User {name: "Alice"})
 
// Multiple labels - entity has multiple types
(:User:Admin:Auditor {
  name: "Bob",
  role: "Platform Administrator",
  auditLevel: "full"
})
 
// Labels enable polymorphism
(:Vehicle:Truck:Rental {
  licensePlate: "ABC-123",
  make: "Ford",
  model: "F-150",
  rentalPricePerDay: 89.99
})
 
// Query can target any label
MATCH (v:Vehicle) RETURN v          // All vehicles
MATCH (r:Rental) RETURN r           // All rentable things  
MATCH (t:Truck) RETURN t            // All trucks
MATCH (rt:Rental:Truck) RETURN rt   // Rentable trucks

Node Properties:

Properties are key-value pairs with typed values. Supported property types typically include:

Primitives: strings, integers, floats, booleans
Temporal: dates, times, datetimes, durations
Spatial: points (2D or 3D coordinates)
Collections: lists of primitives (homogeneous)

Best Practices for Node Modeling:

Use meaningful labels: :Person not :Node1
Include application IDs: Don't rely on internal database IDs
Balance properties vs. relationships: Primitive attributes go in properties; complex associations become relationships
Consider query patterns: Properties you filter on frequently should be indexed

Avoid the "God Node" Anti-pattern

Don't create nodes with thousands of relationships (e.g., a single :Company node connected to all employees). This creates hotspots during traversal. Instead, introduce intermediate nodes: departments, teams, locations. Well-distributed graphs perform better at scale.

Edges: Modeling Relationships

If nodes represent the nouns in your data, relationships are the verbs—the actions, associations, and connections that give meaning to isolated entities. In graph databases, relationships are first-class citizens with their own identity, type, and properties.

Relationship Anatomy:

Every relationship has:

Start Node: The originating vertex
End Node: The destination vertex
Type: A semantic label (:FOLLOWS, :PURCHASED, :REPORTED_TO)
Direction: Always stored as directed (start → end)
Properties: Optional key-value metadata

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// Simple relationship
(alice)-[:FOLLOWS]->(bob)
 
// Relationship with rich properties
(alice)-[:PURCHASED {
  orderId: "ORD-2024-001",
  purchaseDate: datetime("2024-03-15T14:30:00Z"),
  quantity: 2,
  totalPrice: 149.98,
  paymentMethod: "credit_card",
  shippingAddress: "123 Main St"
}]->(product)
 
// Temporal relationships - capture history
(employee)-[:WORKS_AT {
  startDate: date("2020-03-01"),
  endDate: date("2023-05-15"),
  role: "Senior Engineer",
  department: "Platform"
}]->(previousCompany)
 
(employee)-[:WORKS_AT {
  startDate: date("2023-05-16"),
  endDate: null,  // Current position
  role: "Staff Engineer",
  department: "Infrastructure"
}]->(currentCompany)
 
// Self-relationships are allowed
(user)-[:REFERS_TO {weight: 0.8}]->(user)  // Self-referencing loop

Direction Semantics:

Relationships are always stored with direction, but can be queried bidirectionally:

// Match specific direction
(a)-[:FOLLOWS]->(b)      // a follows b
(a)<-[:FOLLOWS]-(b)      // b follows a

// Match either direction
(a)-[:KNOWS]-(b)         // a knows b OR b knows a

// Direction matters for semantics
(manager)-[:MANAGES]->(employee)   // Clear hierarchy
(alice)-[:SENT]->(message)-[:TO]->(bob)  // Message flow

Relationship Types as Schema:

While graph databases are schema-optional, relationship types provide implicit schema:

:FOLLOWS likely connects :User to :User
:PURCHASED likely connects :User to :Product
:LOCATED_IN likely connects various entities to :Location

This implicit schema enables efficient query planning and provides documentation.

Why Relationships Are O(1) to Traverse

Graph databases store relationships as direct pointers between nodes—not as separate tables requiring lookup. When you traverse from Alice to her friends, the database follows in-memory pointers directly to friend nodes. No index lookups, no hash probes, no table scans. This is 'index-free adjacency'—the killer feature of graph databases.

Graph vs Relational: The JOIN Problem

To understand why graph databases exist, we must examine the problem they solve: the cost of JOIN operations in relational databases when modeling connected data.

The Relational Model for Relationships:

In relational databases, many-to-many relationships require junction (join) tables:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
-- Schema
CREATE TABLE users (
  id SERIAL PRIMARY KEY,
  name VARCHAR(255)
);
 
CREATE TABLE friendships (
  user_id INT REFERENCES users(id),
  friend_id INT REFERENCES users(id),
  PRIMARY KEY (user_id, friend_id)
);
 
-- Query: Find friends of friends of Alice
SELECT DISTINCT fof.name AS friend_of_friend
FROM users u
JOIN friendships f1 ON u.id = f1.user_id
JOIN users friend ON f1.friend_id = friend.id
JOIN friendships f2 ON friend.id = f2.user_id
JOIN users fof ON f2.friend_id = fof.id
WHERE u.name = 'Alice'
  AND fof.id != u.id  -- Exclude Alice
  AND fof.id NOT IN (  -- Exclude direct friends
    SELECT f3.friend_id FROM friendships f3 WHERE f3.user_id = u.id
  );

The Problem: JOIN Cost Explosion

Each JOIN in a relational query typically requires:

Index lookups: O(log n) for B-tree indexes
Hash probes: O(1) average, O(n) worst case
Nested loop joins: O(n × m) for n and m rows

For k degrees of separation, the query requires k JOINs. If each user has d average friends, the intermediate result sets grow exponentially: O(d^k).

Concrete Example:

1 million users, average 150 friends each
Friends of Alice: 150 rows (1 JOIN)
Friends of friends: 150 × 150 = 22,500 rows (2 JOINs)
3 degrees: 3.4 million rows (3 JOINs)
4 degrees: 500+ million rows (4 JOINs) — database melts

Relational Approach

•JOINs compute relationships at query time
•Performance degrades with depth (O(d^k))
•Index lookups for each relationship level
•Query planner struggles with variable-depth
•Recursive CTEs possible but still expensive
•Schema rigid; relationship changes costly

Graph Approach

•Relationships pre-materialized as pointers
•Traversal cost proportional to results
•O(1) per hop—follow pointer, not lookup
•Variable-depth queries natural (MATCH *1..5)
•Path finding is native operation
•Schema-flexible; new relationships trivial

The Crossover Point

For queries involving 1-2 relationship hops, relational databases with proper indexing perform comparably to graph databases. For 3+ hops, or for queries with variable depth, graph databases increasingly dominate. The more connected your data and the deeper your traversals, the more compelling the graph advantage.

Index-Free Adjacency: The Secret Weapon

The performance advantage of graph databases stems from a fundamental architectural difference called index-free adjacency. This concept is central to understanding why graph databases can traverse millions of relationships efficiently.

What is Index-Free Adjacency?

In a graph database with index-free adjacency, each node directly references its adjacent nodes through physical pointers (typically memory addresses or record IDs). Traversing a relationship means dereferencing a pointer—no index lookup required.

Contrast with Relational Databases:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
RELATIONAL DATABASE STORAGE:
┌─────────────────────────────────────────────────────┐
│ users table                                         │
│ ┌────┬──────────┐                                   │
│ │ id │ name     │                                   │
│ ├────┼──────────┤                                   │
│ │ 1  │ Alice    │                                   │
│ │ 2  │ Bob      │                                   │
│ │ 3  │ Charlie  │                                   │
│ └────┴──────────┘                                   │
│                                                     │
│ friendships table (junction)                        │
│ ┌─────────┬───────────┐                             │
│ │ user_id │ friend_id │                             │
│ ├─────────┼───────────┤                             │
│ │ 1       │ 2         │ ← Must lookup in users table│
│ │ 1       │ 3         │                             │
│ │ 2       │ 3         │                             │
│ └─────────┴───────────┘                             │
│                                                     │
│ Query: "Find Alice's friends"                       │
│ 1. Find Alice's id (index scan)                     │
│ 2. Find friendships where user_id=1 (index scan)    │
│ 3. For each friend_id, lookup user (index scan × N) │
│                                                     │
│ Cost: O(log n) + O(log n) + O(k × log n) = O(k log n)│
└─────────────────────────────────────────────────────┘
 
GRAPH DATABASE WITH INDEX-FREE ADJACENCY:
┌─────────────────────────────────────────────────────┐
│ Node: Alice                                         │
│ ├── Properties: {name: "Alice"}                     │
│ └── Relationships:                                  │
│     ├── FRIENDS → [pointer to Bob node]             │
│     └── FRIENDS → [pointer to Charlie node]         │
│                                                     │
│ Node: Bob                                           │
│ ├── Properties: {name: "Bob"}                       │
│ └── Relationships:                                  │
│     ├── FRIENDS → [pointer to Alice node]           │
│     └── FRIENDS → [pointer to Charlie node]         │
│                                                     │
│ Query: "Find Alice's friends"                       │
│ 1. Find Alice node (index scan for starting point)  │
│ 2. Follow FRIENDS pointers directly to Bob, Charlie │
│                                                     │
│ Cost: O(log n) + O(k) = O(k) for k friends          │
│       (Only initial lookup uses index!)             │
└─────────────────────────────────────────────────────┘

Why This Matters at Scale:

The key insight: traversal cost is proportional to the amount of graph explored, not the total graph size.

In a relational database with 1 billion users, each JOIN still requires index lookups proportional to log(1 billion) ≈ 30 comparisons
In a graph database, traversing from Alice to her 150 friends requires 150 pointer dereferences—regardless of whether the graph has 1 million or 1 billion nodes

This is locality of reference in action. The graph database only touches the subgraph relevant to your query.

Caveat: Initial Node Lookup

Graph databases still need indexes to find starting nodes. The query "Find Alice" requires an index lookup. But once you have a starting node, all traversals are index-free. This is why graph query patterns typically start with a lookup and then traverse:

MATCH (alice:User {name: 'Alice'})-[:FRIENDS]->(friend)
      ^-- index lookup                ^-- pointer traversal

Native vs. Non-Native Graph Databases

Not all graph databases implement true index-free adjacency. Some are 'graph layers' on top of relational or document stores, translating graph operations into underlying queries. These lack the performance characteristics of native graph databases like Neo4j. When evaluating graph databases, confirm they use native graph storage—the performance difference can be orders of magnitude.

Graph Data Modeling Principles

Effective graph data modeling requires a paradigm shift from relational thinking. Rather than normalizing data into tables, you model based on how the data will be traversed.

Core Principles:

1. Model for Query Patterns, Not Normalization

In relational databases, we normalize to eliminate redundancy. In graph databases, we denormalize toward usage patterns. If a query needs to traverse from User to Order to Product, ensure those relationships exist directly.

2. Nodes for Nouns, Relationships for Verbs

Entities become nodes. Actions and associations become relationships. The sentence "Alice purchased a laptop from TechStore" translates directly:

(:User {name:'Alice'})-[:PURCHASED]->(:Product {name:'Laptop'})
(:User {name:'Alice'})-[:PURCHASED_FROM]->(:Store {name:'TechStore'})

3. Relationship Types Over Generic Relationships

Instead of one :RELATES_TO with a type property, use specific relationship types:

// ❌ Generic (poor query performance)
(a)-[:RELATES {type:'works_at'}]->(b)
(a)-[:RELATES {type:'lives_in'}]->(c)

// ✅ Specific (efficient filtering)
(a)-[:WORKS_AT]->(b)
(a)-[:LIVES_IN]->(c)

Database engines optimize for relationship types, not property values.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
// Anti-pattern: Direct relationship with too many properties
(customer)-[:PURCHASED {
  orderId: "...",
  date: "...",
  quantity: 5,
  totalPrice: 499.95,
  discountCode: "...",
  shippingAddress: "...",
  billingAddress: "...",
  paymentMethod: "...",
  status: "shipped",
  trackingNumber: "..."
}]->(product)
 
// Better: Intermediate 'event' node
(customer)-[:PLACED]->(order:Order {
  orderId: "ORD-001",
  date: datetime("2024-03-15"),
  status: "shipped",
  totalPrice: 499.95
})
(order)-[:CONTAINS {quantity: 5, unitPrice: 99.99}]->(product)
(order)-[:SHIPPED_TO]->(address:Address {...})
(order)-[:PAID_VIA]->(payment:Payment {...})
 
// Benefits:
// - Order is queryable: "Find all orders over $100"
// - Multiple products per order: order-[:CONTAINS]->products
// - Order history: easy to find all orders for customer
// - Status tracking: order status is a property, not buried in relationship

4. Use Intermediate Nodes for Rich Relationships

When relationships carry significant data or need to be connected to other entities, promote them to nodes:

Job positions: (person)-[:HELD]->(position)-[:AT]->(company)
Transactions: (buyer)-[:INITIATED]->(transaction)-[:FOR]->(item)
Ratings: (user)-[:GAVE]->(rating)-[:TO]->(movie)

5. Balance Depth vs. Breadth

Deep traversals are expensive. If you frequently need Order → LineItem → Product → Category → Department, consider modeling shortcuts:

// Direct shortcut relationship
(order)-[:DEPARTMENT]->(department)  // Precomputed association

This trades storage for query performance—a classic engineering tradeoff.

The Whiteboard Test

Graph models should look like whiteboard diagrams of your domain. If your domain experts draw circles and arrows when explaining concepts, that's your graph model. Property graphs excel at matching mental models—one reason they're intuitive to design and query.

When to Use Graph Databases

Graph databases aren't a universal solution—they excel in specific scenarios where the relationship-centric model provides decisive advantages. Understanding these scenarios helps you make informed technology choices.

Strong Fit: Use Graph Databases When:

Ideal Use Cases

•Relationship-intensive queries dominate — Finding paths, patterns, and connections between entities is your primary access pattern
•Variable-depth traversals required — "Find all ancestors", "shortest path", "reachable within N hops"
•Complex pattern matching — Detecting subgraph patterns, finding cycles, motif discovery
•Highly connected data — Each entity connects to many others; JOIN tables would explode
•Schema evolves frequently — Adding new relationship types without migrations
•Real-time recommendations — User→Product→Category→SimilarProduct traversals
•Access control hierarchies — User→Role→Permission→Resource graphs
•Network and dependency analysis — Software dependencies, infrastructure topology

Poor Fit: Avoid Graph Databases When

•Bulk aggregations dominate — "Sum of all sales in 2024" — OLAP systems excel here
•Simple CRUD patterns — Basic create, read, update, delete without traversals
•Tabular/reporting workloads — Generating reports, pivoting data, analytics
•High-throughput writes with minimal reads — Log ingestion, event streaming
•Full-text search is primary need — Search engines (Elasticsearch) specialize here
•Data is naturally tabular — Rows with identical schemas, sparse relationships
•Team lacks graph expertise — Learning curve can slow initial development

The Hybrid Approach:

Modern architectures often combine graph databases with other storage systems:

Relational for transactional CRUD and aggregations
Graph for relationship queries and recommendations
Search for full-text and fuzzy matching
Cache for hot data and session state

Graph databases sync well with other systems via Change Data Capture (CDC) or materialized views, allowing you to leverage their strengths without abandoning your existing stack.

Don't Over-Graph

Not every foreign key is a relationship worth materializing. If you only ever query users by ID and never traverse their connections, a graph database adds complexity without benefit. Choose graphs when relationships are actively traversed, not merely referenced.

Summary: The Graph Data Model

We've explored the fundamental building blocks of graph databases—the mathematical model, property graph extensions, and the architectural advantages that enable performant traversal queries. Let's consolidate the key insights:

Key Takeaways

•Graphs are G = (V, E) — Vertices (nodes) are entities; Edges (relationships) are connections. Property graphs add labels and key-value properties to both.
•Relationships are first-class citizens — Unlike relational JOINs that compute relationships at query time, graph relationships are pre-materialized pointers.
•Index-free adjacency enables O(1) traversals — Following a relationship is a pointer dereference, not an index lookup. Traversal cost scales with subgraph explored, not total graph size.
•JOINs explode at depth; graph traversals scale gracefully — Multi-hop relational queries grow exponentially. Graph queries pay per hop linearly.
•Model for query patterns, not normalization — Graph schemas optimize for how data is traversed, not how it's stored.
•Use specific relationship types — :WORKS_AT, :PURCHASED, :FOLLOWS are more efficient than generic :RELATED with type properties.
•Intermediate nodes for rich relationships — When relationships carry significant data, promote them to nodes (e.g., Order, Transaction, Rating).
•Graphs excel at connected data — Social networks, recommendations, fraud detection, knowledge graphs—anywhere relationships matter.

What's Next:

With the graph data model understood, we'll dive into Neo4j—the leading property graph database. We'll explore its query language (Cypher), storage engine, architectural patterns, and operational characteristics. Neo4j puts these graph concepts into practice at production scale.

Foundation Complete

You now understand the graph data model—nodes, edges, properties—and why it matters. The property graph model provides a powerful abstraction for connected data, with index-free adjacency enabling performant traversals that would cripple relational databases. Next, we'll see these concepts implemented in Neo4j.