Graph Databases - Learning Module

Neo4j: The Leading Property Graph Database

The Industry Standard for Graph Databases

When organizations evaluate graph databases, one name dominates the conversation: Neo4j. Founded in Sweden in 2007, Neo4j pioneered the property graph model and has grown to become the most widely deployed graph database in production. From startups to Fortune 500 companies, Neo4j powers recommendations at eBay, fraud detection at financial institutions, knowledge graphs at NASA, and network management at telecom giants.

What makes Neo4j the de facto standard? It's not just first-mover advantage—Neo4j embodies a complete vision: a native graph storage engine, an expressive query language (Cypher), a mature operational model, and an ecosystem of tools. Understanding Neo4j is essential for any system designer working with connected data.

What You Will Learn

This page covers Neo4j comprehensively: its native graph storage engine, the Cypher query language, indexing and constraints, transaction semantics, clustering for high availability, and operational best practices. You'll understand when Neo4j shines, its limitations, and how to integrate it into production systems.

Neo4j Architecture Overview

Neo4j is a native graph database—meaning both its storage engine and query processor are designed from the ground up for graph operations, not adapted from relational or document models.

Core Architectural Principles:

1. Native Graph Storage

Neo4j stores nodes and relationships as first-class entities with direct physical pointers between them. This is not a graph abstraction layer over tables—the on-disk format mirrors the graph model.

2. Index-Free Adjacency

Each node directly references its adjacent nodes through relationship pointers. Traversing a relationship is O(1)—no index lookup, no hash table probe. Query performance depends on the subgraph traversed, not total database size.

3. Property Graph First-Class

Labels (node types), relationship types, and properties are core constructs with dedicated storage and indexing, not simulated through conventions.

4. ACID Transactions

Neo4j provides full ACID guarantees—even for multi-statement, multi-node updates. This is unusual among NoSQL databases and essential for applications requiring consistency.

┌─────────────────────────────────────────────────────────────────┐
│                       CLIENT APPLICATIONS                       │
│    (Bolt Protocol, HTTP API, Official Drivers, GraphQL)         │
└───────────────────────────────┬─────────────────────────────────┘
                                │
┌───────────────────────────────▼─────────────────────────────────┐
│                         NEO4J SERVER                            │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    CYPHER RUNTIME                        │    │
│  │   (Parser → Planner → Optimizer → Execution Engine)      │    │
│  └─────────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                 TRANSACTION MANAGER                      │    │
│  │         (ACID, Write-Ahead Logging, Locking)             │    │
│  └─────────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                   STORAGE ENGINE                         │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │    │
│  │  │  Node Store  │  │ Relationship │  │  Property    │   │    │
│  │  │  (nodes.db)  │  │    Store     │  │    Store     │   │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘   │    │
│  │  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐   │    │
│  │  │ Label Store  │  │ Index Store  │  │ Schema Store │   │    │
│  │  └──────────────┘  └──────────────┘  └──────────────┘   │    │
│  └─────────────────────────────────────────────────────────┘    │
│  ┌─────────────────────────────────────────────────────────┐    │
│  │                    PAGE CACHE                            │    │
│  │            (Memory-Mapped File Access)                   │    │
│  └─────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────┘

Community vs. Enterprise Edition

Neo4j offers two editions. Community Edition is open-source, single-instance, suitable for development and small deployments. Enterprise Edition adds clustering, role-based access control, hot backups, and advanced monitoring—essential for production at scale.

Native Graph Storage Engine

Neo4j's storage engine is purpose-built for graph data. Understanding its structure explains why graph operations are so performant.

Fixed-Size Record Stores:

Neo4j uses fixed-size records for nodes and relationships, enabling O(1) lookups by record ID:

Node Store (neostore.nodestore.db): ~15 bytes per node record
Relationship Store (neostore.relationshipstore.db): ~34 bytes per relationship
Property Store: Variable-size property values with fixed index entries
Label Store: Efficient label-to-node mappings
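Why fixed-size records give O(1) lookups can be sketched in a few lines: the byte offset of a record is just its ID multiplied by the record size. The field widths below (1 + 4 + 4 + 5 + 1 bytes) are an assumption chosen to total 15 bytes, and `write_nodes` / `read_node` are hypothetical helpers, not Neo4j's actual store code.

```python
import struct

# Assumed layout for illustration: in-use flag, first relationship ID,
# first property ID, 5 label bytes, 1 extra byte. "<" disables padding.
NODE_RECORD = struct.Struct("<BII5sB")


def write_nodes(nodes):
    """Pack (in_use, next_rel, next_prop) tuples into one contiguous buffer."""
    buf = bytearray()
    for in_use, next_rel, next_prop in nodes:
        buf += NODE_RECORD.pack(in_use, next_rel, next_prop, b"\x00" * 5, 0)
    return bytes(buf)


def read_node(store, node_id):
    """Fetch one record by ID with a single offset computation, no search."""
    offset = node_id * NODE_RECORD.size
    in_use, next_rel, next_prop, _labels, _extra = NODE_RECORD.unpack_from(store, offset)
    return {"in_use": bool(in_use), "next_rel": next_rel, "next_prop": next_prop}


store = write_nodes([(1, 10, 100), (1, 11, 101), (0, 0, 0), (1, 13, 103)])
record = read_node(store, 3)
```

Reading node 3 touches only bytes 45 through 59 of the buffer, regardless of how many records the store holds.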

Node Record Structure:

Node Record (15 bytes):
┌─────────────────────────────────────────────────────────────────┐
│ inUse │ nextRelId │ nextPropId │ labels │ extra │              │
│ (1b)  │   (4b)    │    (4b)    │  (5b)  │ (1b)  │              │
└─────────────────────────────────────────────────────────────────┘
 
inUse:        Whether this record is active (not deleted)
nextRelId:    Pointer to first relationship in chain
nextPropId:   Pointer to first property in chain
labels:       Inline labels or pointer to label store
 
Relationship Record (34 bytes):
┌─────────────────────────────────────────────────────────────────┐
│ inUse │ firstNode │ secondNode │ type │ firstPrev │ firstNext │ │
│       │           │            │      │ secondPrev│ secondNext│ │
└─────────────────────────────────────────────────────────────────┘
 
firstNode:    Start node of relationship (pointer)
secondNode:   End node of relationship (pointer)  
type:         Relationship type ID
first/second: Doubly-linked list pointers for relationship chains

Doubly-Linked Relationship Chains:

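A node record points at the first relationship in its chain (nextRelId). Each relationship record then sits in two doubly linked lists at once:

The relationship chain of its start node (firstPrev/firstNext pointers)
The relationship chain of its end node (secondPrev/secondNext pointers)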

This enables O(k) iteration over a node's k relationships—without scanning all relationships in the database.
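The chain walk can be sketched as follows. This is a simplified model under stated assumptions: only the "next" pointers are kept (the "prev" pointers matter for deletion, not iteration), self-loops are ignored, and `Rel`, `first_rel`, and `relationships_of` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Rel:
    first_node: int
    second_node: int
    first_next: Optional[int] = None   # next relationship in first_node's chain
    second_next: Optional[int] = None  # next relationship in second_node's chain


def relationships_of(node_id, first_rel, rels):
    """Yield the IDs of every relationship attached to node_id by following its chain."""
    rel_id = first_rel.get(node_id)
    while rel_id is not None:
        rel = rels[rel_id]
        yield rel_id
        # Follow the pointer that belongs to this node's chain.
        rel_id = rel.first_next if rel.first_node == node_id else rel.second_next


# Three relationships: 0->1, 0->2, 2->1
rels = {
    0: Rel(first_node=0, second_node=1, first_next=1, second_next=2),
    1: Rel(first_node=0, second_node=2, first_next=None, second_next=2),
    2: Rel(first_node=2, second_node=1, first_next=None, second_next=None),
}
first_rel = {0: 0, 1: 0, 2: 1}  # node ID -> first relationship in its chain

chain_of_node_1 = list(relationships_of(1, first_rel, rels))
```

The loop visits exactly the relationships attached to the node, so its cost grows with the node's degree and not with the total number of relationships stored.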

Property Storage:

Properties use a chain structure with fixed-size index blocks pointing to variable-size value storage:

Small values (≤8 bytes): stored inline in property record
Medium values: stored in dedicated string/array store
Large values: chunked storage with linked blocks

Locality Benefits:

Records that are created together tend to sit in nearby disk pages, which helps cache locality during traversals. The page cache (configured via server.memory.pagecache.size in Neo4j 5, or dbms.memory.pagecache.size in 4.x) holds hot pages in memory.

Right-Size Your Page Cache

For optimal performance, the page cache should fit your entire graph (or at least the hot portion). A common rule: allocate 50-70% of available RAM to page cache. If pages must be fetched from disk during traversal, latency increases dramatically.
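A small simulation shows how sharply hit rates can fall once the working set no longer fits. It assumes a strict LRU eviction policy and a cyclic scan over the working set, which is a worst case; Neo4j's real page cache uses its own eviction strategy, so treat the numbers as illustrative only. `PageCache` and `hit_ratio` are hypothetical names.

```python
from collections import OrderedDict


class PageCache:
    """Toy LRU cache that only counts hits and misses."""

    def __init__(self, capacity_pages):
        self.capacity = capacity_pages
        self.pages = OrderedDict()
        self.hits = 0
        self.misses = 0

    def access(self, page_id):
        if page_id in self.pages:
            self.pages.move_to_end(page_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1                 # would be a disk read
            self.pages[page_id] = True
            if len(self.pages) > self.capacity:
                self.pages.popitem(last=False)  # evict least recently used


def hit_ratio(capacity_pages, working_set_pages, rounds=10):
    """Scan the working set repeatedly and return the fraction of cache hits."""
    cache = PageCache(capacity_pages)
    for _ in range(rounds):
        for page_id in range(working_set_pages):
            cache.access(page_id)
    return cache.hits / (cache.hits + cache.misses)


fits = hit_ratio(capacity_pages=100, working_set_pages=100)
too_small = hit_ratio(capacity_pages=99, working_set_pages=100)
```

With the cache one page short of the working set, this access pattern evicts each page just before it is needed again, so every access becomes a miss.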

Cypher: The Graph Query Language

Cypher is Neo4j's declarative query language—designed specifically for graphs with an ASCII-art syntax that visually represents patterns. If SQL expresses "what rows match these criteria," Cypher expresses "what subgraphs match this pattern."

Core Syntax Elements:

Nodes: (variable:Label {prop: value})
Relationships: -[variable:TYPE {prop: value}]->
Patterns: Combine nodes and relationships: (a)-[:KNOWS]->(b)
Direction: -> outgoing, <- incoming, - either
Variable-length paths: * or *1..3 for hop ranges

// CREATE: Insert nodes and relationships
CREATE (alice:User {name: 'Alice', email: 'alice@example.com'})
CREATE (bob:User {name: 'Bob'})
CREATE (alice)-[:FOLLOWS {since: date('2024-01-15')}]->(bob)
 
// MATCH: Pattern matching - the core of Cypher
MATCH (u:User {name: 'Alice'})
RETURN u
 
// Pattern with relationship
MATCH (a:User)-[:FOLLOWS]->(b:User)
WHERE a.name = 'Alice'
RETURN b.name AS following
 
// Variable-length paths: friends within 1-3 hops
MATCH (alice:User {name: 'Alice'})-[:KNOWS*1..3]->(friend)
RETURN DISTINCT friend.name
 
// Filtering with WHERE
MATCH (u:User)-[f:FOLLOWS]->(target)
WHERE f.since > date('2023-01-01')
  AND u.tier = 'premium'
RETURN u.name, target.name
 
// Multiple patterns
MATCH (a:User)-[:FOLLOWS]->(b:User),
      (b)-[:PURCHASED]->(p:Product)
WHERE a.name = 'Alice'
RETURN p.name AS products_bought_by_following
 
// Optional match (LEFT JOIN equivalent)
MATCH (u:User)
OPTIONAL MATCH (u)-[:MANAGES]->(team:Team)
RETURN u.name, team.name AS managed_team
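To make the variable-length pattern concrete, here is a rough Python sketch of what a query like `(alice)-[:KNOWS*1..3]->(friend)` with `RETURN DISTINCT` computes. It assumes an in-memory adjacency map and mimics Cypher's rule that a relationship may not repeat within one matched path; `var_length_targets` and the sample graph are hypothetical.

```python
def var_length_targets(graph, start, rel_type, min_hops, max_hops):
    """Return the distinct end nodes of directed paths of min_hops..max_hops hops.

    graph maps a node to a list of (rel_id, rel_type, target) tuples.
    """
    found = set()

    def expand(node, depth, used_rels):
        if depth >= min_hops:
            found.add(node)
        if depth == max_hops:
            return
        for rel_id, rtype, target in graph.get(node, []):
            # A relationship may appear only once in a single path.
            if rtype == rel_type and rel_id not in used_rels:
                expand(target, depth + 1, used_rels | {rel_id})

    expand(start, 0, frozenset())
    return found


graph = {
    "Alice": [(1, "KNOWS", "Bob"), (2, "FOLLOWS", "Zed")],
    "Bob": [(3, "KNOWS", "Carol")],
    "Carol": [(4, "KNOWS", "Dave")],
    "Dave": [(5, "KNOWS", "Erin")],
}
friends = var_length_targets(graph, "Alice", "KNOWS", 1, 3)
```

Erin is four hops away and Zed is reached through a different relationship type, so neither appears in the result. The upper bound is what keeps the expansion from exploring the whole graph.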

Key Cypher Clauses:

Clause	Purpose	SQL Equivalent
`MATCH`	Pattern matching	FROM + JOIN
`WHERE`	Filtering	WHERE
`RETURN`	Projection	SELECT
`CREATE`	Insert nodes/relationships	INSERT
`MERGE`	Create if not exists	INSERT...ON CONFLICT
`SET`	Update properties	UPDATE
`DELETE`	Remove nodes/relationships	DELETE
`WITH`	Chaining, aggregation	Subquery
`UNWIND`	Expand collections	UNNEST
`ORDER BY`	Sorting	ORDER BY
`SKIP/LIMIT`	Pagination	OFFSET/LIMIT

// Aggregation
MATCH (u:User)-[:PURCHASED]->(p:Product)
RETURN u.name, count(p) AS purchase_count, sum(p.price) AS total_spent
ORDER BY total_spent DESC
LIMIT 10
 
// Shortest path
MATCH path = shortestPath(
  (alice:User {name: 'Alice'})-[:KNOWS*]-(bob:User {name: 'Bob'})
)
RETURN path, length(path) AS degrees_of_separation
 
// All shortest paths
MATCH paths = allShortestPaths(
  (a:User {name: 'Alice'})-[:KNOWS*]-(b:User {name: 'Bob'})
)
RETURN paths
 
// Collect into lists
MATCH (u:User)-[:FOLLOWS]->(following:User)
RETURN u.name, collect(following.name) AS following_list
 
// MERGE - create if not exists (idempotent)
MERGE (u:User {email: 'new@example.com'})
ON CREATE SET u.createdAt = datetime()
ON MATCH SET u.lastLogin = datetime()
RETURN u
 
// Existential subqueries
MATCH (u:User)
WHERE EXISTS {
  MATCH (u)-[:PURCHASED]->(p:Product)
  WHERE p.price > 1000
}
RETURN u.name AS big_spenders

Avoid Cartesian Products

Multiple unconnected patterns in MATCH create Cartesian products: MATCH (a:User), (b:Product) evaluates every user × every product. Always connect patterns through relationships, or use explicit WITH clauses to control scope.

Indexes and Constraints

While index-free adjacency handles traversals, you still need indexes for finding starting nodes. Neo4j provides several index types and constraint mechanisms.

Index Types in Neo4j:

1. Range Indexes (default in Neo4j 5, replacing the B-tree indexes of 4.x)

Range queries: WHERE age > 30
Prefix matching: WHERE name STARTS WITH 'Ali'
Equality: WHERE email = 'alice@example.com'

2. Full-Text Indexes

Text search with tokenization
Supports fuzzy matching, stemming
Powered by Apache Lucene

3. Point Indexes (Spatial)

Geographic and geometric queries
Distance calculations, bounding boxes

4. Text Indexes

String properties only
Speed up CONTAINS and ENDS WITH predicates

// Single property index on node label
CREATE INDEX user_email FOR (u:User) ON (u.email)
 
// Composite index (multiple properties)
CREATE INDEX user_name_age FOR (u:User) ON (u.lastName, u.firstName, u.age)
 
// Index on relationship property
CREATE INDEX follows_since FOR ()-[f:FOLLOWS]-() ON (f.since)
 
// Full-text index (Lucene-backed)
CREATE FULLTEXT INDEX user_search FOR (u:User) ON EACH [u.name, u.bio]
 
// Point/spatial index
CREATE POINT INDEX location_idx FOR (l:Location) ON (l.coordinates)
 
// List indexes to verify
SHOW INDEXES
 
// Drop an index
DROP INDEX user_email

Constraints:

Constraints enforce data integrity at the database level:

1. Uniqueness Constraints

Ensures property values are unique across all nodes with a label
Automatically creates an index

2. Node Key Constraints (Enterprise)

Composite uniqueness (unique combination of properties)
Properties cannot be null

3. Property Existence Constraints (Enterprise)

Ensures a property exists on all nodes/relationships of a type
Schema enforcement for required fields

// Uniqueness constraint (also creates index)
CREATE CONSTRAINT user_email_unique FOR (u:User) REQUIRE u.email IS UNIQUE
 
// Node key - composite uniqueness (Enterprise)
CREATE CONSTRAINT person_key FOR (p:Person) 
REQUIRE (p.firstName, p.lastName, p.birthDate) IS NODE KEY
 
// Property existence constraint (Enterprise)
CREATE CONSTRAINT user_email_exists FOR (u:User) REQUIRE u.email IS NOT NULL
 
// Property type constraint (Neo4j 5.9+, Enterprise)
CREATE CONSTRAINT user_age_type FOR (u:User) REQUIRE u.age IS :: INTEGER
 
// Relationship property existence (Enterprise)
CREATE CONSTRAINT follows_since_exists 
FOR ()-[f:FOLLOWS]-() REQUIRE f.since IS NOT NULL
 
// List constraints
SHOW CONSTRAINTS
 
// Drop constraint
DROP CONSTRAINT user_email_unique

Index Strategy Best Practices

Index properties used in WHERE clauses for starting node lookups. Don't index properties that are only used in RETURN or after traversal; index-free adjacency handles those. Run EXPLAIN on a query to verify that the plan uses the index. A composite index is only used when the query has predicates on all of its properties.
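The difference between scanning a label and seeking through an index can be illustrated with plain Python. This is a sketch under loud assumptions: a dict stands in for the index (a real range index is a tree, not a hash map), and `label_scan` / `index_seek` are hypothetical names loosely mirroring the NodeByLabelScan and NodeIndexSeek operators.

```python
users = [{"id": i, "email": f"user{i}@example.com"} for i in range(10_000)]

# Built once, maintained on write: email -> node
email_index = {user["email"]: user for user in users}


def label_scan(nodes, email):
    """Check every node with the label until one matches."""
    checked = 0
    for node in nodes:
        checked += 1
        if node["email"] == email:
            return node, checked
    return None, checked


def index_seek(index, email):
    """Jump straight to the matching node via the index."""
    return index.get(email), 1


scan_node, scan_cost = label_scan(users, "user9999@example.com")
seek_node, seek_cost = index_seek(email_index, "user9999@example.com")
```

Both calls return the same node, but the scan inspects 10,000 candidates to find it. Once the starting node is located, traversal proceeds through relationship pointers and needs no further index.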

Transactions and ACID Compliance

One of Neo4j's distinguishing features among NoSQL databases is its full ACID compliance. This makes Neo4j suitable for applications where data integrity is paramount—financial systems, healthcare records, compliance-sensitive domains.

ACID in Neo4j:

Atomicity: Each Cypher statement executes atomically. Multi-statement transactions (explicit BEGIN/COMMIT) are fully atomic—all changes succeed or all are rolled back.

Consistency: Constraints are enforced at commit time. If a transaction violates a uniqueness or existence constraint, the entire transaction rolls back.

Isolation: Neo4j uses read-committed isolation by default. Stronger isolation is available by explicitly taking locks on the nodes and relationships involved. Plain reads take no locks, so readers don't block writers and writers don't block readers.

Durability: Write-ahead logging (WAL) ensures committed transactions survive crashes. Transaction logs are fsynced to disk before commit acknowledgment.

Locking Model:

Neo4j uses a fine-grained locking model:

Read locks: Shared; multiple readers allowed
Write locks: Exclusive; block other writers
Deadlock handling: Deadlocks are detected, and one of the participating transactions is aborted so the others can proceed
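Deadlock detection is commonly modeled as finding a cycle in a "waits-for" graph. The sketch below assumes each transaction waits on at most one other transaction and uses an arbitrary victim policy; it illustrates the idea and is not Neo4j's lock manager. `find_deadlock` and `choose_victim` are hypothetical names.

```python
def find_deadlock(waits_for):
    """Return the list of transactions forming a wait cycle, or None.

    waits_for maps a transaction to the transaction whose lock it is waiting on.
    """
    for start in waits_for:
        seen = []
        tx = start
        while tx in waits_for:
            if tx in seen:
                return seen[seen.index(tx):]  # the cycle itself
            seen.append(tx)
            tx = waits_for[tx]
    return None


def choose_victim(cycle):
    """Assumed policy for illustration: abort the transaction with the highest name."""
    return max(cycle)


# tx1 holds a lock tx2 wants, and tx2 holds a lock tx1 wants.
cycle = find_deadlock({"tx1": "tx2", "tx2": "tx1"})
victim = choose_victim(cycle)
no_cycle = find_deadlock({"tx1": "tx2", "tx2": "tx3"})
```

Application code should expect the aborted transaction to fail with an error and be prepared to retry it.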

// Single-statement transactions (auto-commit)
CREATE (u:User {name: 'Alice'})
// Automatically committed
 
// Managed transaction with the Java driver 5.x
// (sketch; assumes an open Session named session and an SLF4J-style logger named log)
session.executeWriteWithoutResult(tx -> {
    tx.run("CREATE (a:Account {id: $accountId})", Map.of("accountId", 1001)).consume();
    tx.run("CREATE (a:Account {id: $accountId})", Map.of("accountId", 1002)).consume();
    tx.run("""
        MATCH (a1:Account {id: $from}), (a2:Account {id: $to})
        CREATE (a1)-[:LINKED_TO]->(a2)
        """, Map.of("from", 1001, "to", 1002)).consume();
    // All three operations commit together or roll back together
});
 
// Handling constraint violations
try {
    session.executeWriteWithoutResult(tx -> {
        // Throws ClientException if the email uniqueness constraint is violated
        tx.run("CREATE (u:User {email: $email})", Map.of("email", "dupe@example.com")).consume();
    });
} catch (ClientException e) {
    // Transaction already rolled back
    log.error("Constraint violation: {}", e.getMessage());
}

Write-Ahead Logging (WAL):

All changes are first written to a transaction log before being applied to the main store:

Transaction executes, locks acquired
Changes written to transaction log (sequential writes)
Log fsynced to disk
Success returned to client
Background checkpoint applies changes to store files

This pattern provides durability while maintaining performance—sequential log writes are fast, and store updates happen asynchronously.
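The commit, checkpoint, and recovery flow can be sketched with ordinary files. This is a toy model, assuming JSON lines for the log and a single JSON file for the store; `TinyWalStore` is a hypothetical class and bears no relation to Neo4j's on-disk formats.

```python
import json
import os
import tempfile


class TinyWalStore:
    def __init__(self, directory):
        self.log_path = os.path.join(directory, "tx.log")
        self.store_path = os.path.join(directory, "store.json")
        self.data = {}
        self._recover()

    def _recover(self):
        """Load the last checkpoint, then replay any logged transactions."""
        if os.path.exists(self.store_path):
            with open(self.store_path) as f:
                self.data = json.load(f)
        if os.path.exists(self.log_path):
            with open(self.log_path) as f:
                for line in f:
                    self.data.update(json.loads(line))

    def commit(self, changes):
        """Append to the log and fsync before acknowledging the commit."""
        with open(self.log_path, "a") as log:
            log.write(json.dumps(changes) + "\n")
            log.flush()
            os.fsync(log.fileno())
        self.data.update(changes)  # in-memory state; store file is updated later

    def checkpoint(self):
        """Flush in-memory state to the store file, then truncate the log."""
        tmp = self.store_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.data, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.store_path)
        open(self.log_path, "w").close()


with tempfile.TemporaryDirectory() as directory:
    store = TinyWalStore(directory)
    store.commit({"alice": 1})
    store.checkpoint()
    store.commit({"bob": 2})  # logged, but not yet checkpointed
    # Simulate a crash and restart: rebuild state from the store file plus the log.
    recovered_data = dict(TinyWalStore(directory).data)
```

The second commit survives the simulated crash even though it never reached the store file, because replaying the log on startup reapplies it.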

Checkpoint Process:

Periodically, Neo4j checkpoints, flushing in-memory changes to store files. Checkpoint frequency is tunable (Neo4j 5 setting names; 4.x used the dbms. prefix):

db.checkpoint.interval.time - time-based trigger
db.checkpoint.interval.tx - transaction-count trigger
db.checkpoint.iops.limit - I/O rate limiting

Transaction Timeouts

Long-running transactions can hold locks and consume memory. Configure transaction timeouts (db.transaction.timeout) to kill runaway queries. For ETL workloads, batch commits every N records to limit transaction scope.
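Batching commits for ETL can be sketched independently of any driver. In the example below, `commit` is a hypothetical callback standing in for one write transaction; `batches` and `load_in_batches` are likewise illustrative names.

```python
from itertools import islice


def batches(records, batch_size):
    """Yield lists of at most batch_size records from any iterable."""
    iterator = iter(records)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def load_in_batches(records, batch_size, commit):
    """Call commit once per batch and return the number of transactions used."""
    transactions = 0
    for batch in batches(records, batch_size):
        commit(batch)  # one short transaction per batch
        transactions += 1
    return transactions


batch_sizes = []
transactions = load_in_batches(range(2500), 1000, lambda batch: batch_sizes.append(len(batch)))
```

Each transaction holds locks and memory for at most 1,000 records here, and a failure only loses the batch in flight.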

Neo4j Clustering for High Availability

Production Neo4j deployments require clustering for high availability, read scalability, and fault tolerance. Neo4j Enterprise Edition provides a Raft-based clustering model.

Cluster Architecture:

Neo4j clusters consist of Primary and Secondary servers:

Primary Servers (Core Servers):

Accept all write operations
Participate in Raft consensus for write coordination
Must have majority (quorum) available for writes
Recommended: 3 or 5 for fault tolerance

Secondary Servers (Read Replicas):

Asynchronously replicate from primaries
Accept read-only queries
Horizontal read scaling
Can be promoted in disaster scenarios

Raft Consensus:

Neo4j uses the Raft protocol for write coordination:

Client sends write to the leader (drivers route it there automatically)
Leader coordinates write with followers
Once majority acknowledges, transaction commits
Leader confirms to client

┌─────────────────────────────────────────────────────────────────────┐
│                        NEO4J CLUSTER                                │
│                                                                     │
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐           │
│    │  Primary 1  │◄──►│  Primary 2  │◄──►│  Primary 3  │           │
│    │  (Leader)   │    │ (Follower)  │    │ (Follower)  │           │
│    └──────┬──────┘    └──────┬──────┘    └──────┬──────┘           │
│           │                  │                  │                   │
│           │    Raft Consensus (synchronous)     │                   │
│           │                                     │                   │
│           ▼                  ▼                  ▼                   │
│    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐           │
│    │ Read Replica│    │ Read Replica│    │ Read Replica│           │
│    │     1       │    │     2       │    │     3       │           │
│    └─────────────┘    └─────────────┘    └─────────────┘           │
│           │                  │                  │                   │
│           └────────────┬─────┴──────────┬───────┘                   │
│                        ▼                ▼                           │
│                   Asynchronous replication                          │
└─────────────────────────────────────────────────────────────────────┘
 
Client Connectivity:
┌───────────────────┐
│   Load Balancer   │  ◄── Read queries distributed to replicas
│   (or driver)     │  ◄── Write queries routed to leader primary
└───────────────────┘

Causal Consistency:

With asynchronous replication, read replicas may lag behind primaries. Neo4j provides causal consistency through bookmarks:

Write to primary returns a bookmark (transaction ID)
Client sends bookmark with subsequent read
Read replica waits until it has replicated that transaction
Read returns consistent data

This gives clients read-your-writes consistency without requiring synchronous replication to the replicas.
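The bookmark mechanism can be modeled in a few lines. This sketch assumes a bookmark is simply the ID of the last committed transaction and that a replica "waits" by pulling from the primary until it has caught up; `Primary` and `Replica` are hypothetical classes, not driver APIs.

```python
class Primary:
    def __init__(self):
        self.log = []

    def write(self, change):
        """Commit a change and return a bookmark (the last transaction ID)."""
        self.log.append(change)
        return len(self.log)


class Replica:
    def __init__(self, primary):
        self.primary = primary
        self.applied = 0  # ID of the last transaction applied locally
        self.data = []

    def pull(self, max_txs=1):
        """Asynchronously replicate up to max_txs transactions."""
        new = self.primary.log[self.applied:self.applied + max_txs]
        self.data.extend(new)
        self.applied += len(new)

    def read(self, bookmark=0):
        """Serve a read, first catching up to the bookmark if one is supplied."""
        while self.applied < bookmark:
            self.pull()  # stands in for waiting on replication
        return list(self.data)


primary = Primary()
replica = Replica(primary)
primary.write("create Alice")
bookmark = primary.write("create Bob")

stale = replica.read()                   # no bookmark: may miss recent writes
fresh = replica.read(bookmark=bookmark)  # with bookmark: sees the client's own writes
```

Only reads that carry a bookmark pay the cost of waiting, so unrelated reads stay fast.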

Driver Routing:

Neo4j drivers automatically route queries:

Write transactions → Primary (leader)
Read transactions → Read replicas (load balanced)

// JavaScript driver with routing (the neo4j:// scheme enables routing)
const neo4j = require('neo4j-driver');

const driver = neo4j.driver(
  'neo4j://cluster.example.com',
  neo4j.auth.basic('user', 'password')
);

// Reads go to replicas
const readSession = driver.session({ defaultAccessMode: neo4j.session.READ });
// Writes go to primary
const writeSession = driver.session({ defaultAccessMode: neo4j.session.WRITE });

Quorum Requirements

A 3-node primary cluster tolerates 1 failure. A 5-node cluster tolerates 2 failures. If majority is lost, writes halt until recovered. Design for your failure tolerance requirements—but remember, more primaries means more consensus overhead per write.
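The fault-tolerance figures follow from majority arithmetic, sketched here with two hypothetical helper functions.

```python
def quorum(primaries):
    """Smallest number of primaries that forms a majority."""
    return primaries // 2 + 1


def tolerated_failures(primaries):
    """How many primaries can fail while a majority remains."""
    return primaries - quorum(primaries)


summary = {n: (quorum(n), tolerated_failures(n)) for n in (3, 4, 5)}
```

A fourth primary raises the quorum to 3 without tolerating any more failures than 3 primaries do, which is why odd cluster sizes are the usual recommendation.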

Performance Tuning

Neo4j performance depends on proper configuration, query optimization, and capacity planning. Here are the key tuning levers:

Memory Configuration:

# Heap memory for query processing, caches, transaction state
# Recommended: 8-16GB for most workloads
# (Neo4j 5 setting names; 4.x uses the dbms.memory.* prefix)
server.memory.heap.initial_size=8g
server.memory.heap.max_size=16g
 
# Page cache - holds graph data pages in memory
# Critical for performance - size to fit your graph
# Rule of thumb: store file sizes + 20% headroom
server.memory.pagecache.size=32g
 
# Transaction memory limits
db.memory.transaction.total.max=2g
db.memory.transaction.max=512m
 
# Example allocation for 64GB server:
# - Page cache: 40GB (graph data)
# - Heap: 16GB (query execution)
# - OS/buffers: 8GB (remaining)
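The example allocation is simple subtraction, and the "store size plus 20% headroom" rule is a single comparison. The helpers below are hypothetical and only restate that arithmetic.

```python
def page_cache_gb(total_ram_gb, heap_gb, os_reserve_gb):
    """RAM left for the page cache after heap and OS reserve are set aside."""
    remaining = total_ram_gb - heap_gb - os_reserve_gb
    if remaining <= 0:
        raise ValueError("heap and OS reserve leave no room for the page cache")
    return remaining


def fits_in_cache(store_size_gb, cache_gb, headroom=0.20):
    """Apply the rule of thumb: store file sizes plus headroom must fit."""
    return store_size_gb * (1 + headroom) <= cache_gb


cache = page_cache_gb(total_ram_gb=64, heap_gb=16, os_reserve_gb=8)
small_graph_fits = fits_in_cache(store_size_gb=30, cache_gb=cache)
large_graph_fits = fits_in_cache(store_size_gb=35, cache_gb=cache)
```

On the 64GB server, a 30GB store fits with headroom to spare, while a 35GB store would already call for more RAM or acceptance of some disk reads.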

Query Optimization with EXPLAIN and PROFILE:

Cypher provides execution plan analysis:

EXPLAIN - Shows planned execution without running
PROFILE - Executes and shows actual row counts, db hits

Key Metrics in Execution Plans:

Estimated Rows: Planner's prediction
Actual Rows (PROFILE): Real rows processed
DB Hits: Storage engine operations (lower is better)
Operators: NodeByLabelScan (slow), NodeIndexSeek (fast)

// See execution plan without running
EXPLAIN MATCH (u:User)-[:FOLLOWS]->(f:User)
WHERE u.email = 'alice@example.com'
RETURN f.name
 
// Run and profile actual execution
PROFILE MATCH (u:User)-[:FOLLOWS]->(f:User)
WHERE u.email = 'alice@example.com'
RETURN f.name
 
// Good: NodeIndexSeek for starting node
// Bad: AllNodesScan or NodeByLabelScan with many nodes
 
// Bad pattern: Cartesian product
PROFILE MATCH (u:User), (p:Product) RETURN count(*)
// Produces Users × Products rows!
 
// Better: filter early, connect patterns
PROFILE MATCH (u:User)-[:PURCHASED]->(p:Product)
WHERE u.tier = 'premium' AND p.category = 'electronics'
RETURN u.name, p.name

Query Optimization Best Practices

• Index your starting points: Ensure WHERE predicates for starting nodes hit indexes
• Filter early: Put restrictive WHERE clauses before traversals
• Use query parameters: $userId not 'user-123' for query plan caching
• Limit traversal depth: Use explicit bounds such as *1..3 instead of *
• Avoid COLLECT on huge sets: Memory-intensive; paginate instead
• Use EXISTS subqueries for existence checks: EXISTS {} is usually cheaper than OPTIONAL MATCH + filter
• Consider query splitting: One complex query can often be faster as two simpler ones

Monitor Slow Queries

Enable query logging with db.logs.query.enabled=INFO, which logs queries that run longer than db.logs.query.threshold. Review slow query logs regularly; performance issues often come from a few problematic queries.

Data Import and ETL Patterns

Loading data into Neo4j efficiently requires understanding the available import mechanisms and their tradeoffs.

Import Methods:

1. LOAD CSV (Online Import)

Reads CSV files directly in Cypher
Good for medium datasets (up to ~10M rows)
Runs as transactions within running database

2. neo4j-admin import (Bulk Import)

Fastest method for initial load
Requires stopped database
Directly writes store files, bypassing transaction layer
Good for billions of records

3. Cypher Batching

Process data in chunks with APOC procedures
Handles large datasets with backpressure
Good for ongoing ETL, streaming sources

// Basic LOAD CSV with headers
LOAD CSV WITH HEADERS FROM 'file:///users.csv' AS row
CREATE (u:User {
  id: row.user_id,
  name: row.name,
  email: row.email
})
 
// Batched import with periodic commit
:auto LOAD CSV WITH HEADERS FROM 'file:///users.csv' AS row
CALL {
  WITH row
  MERGE (u:User {id: row.user_id})
  SET u.name = row.name, u.email = row.email
} IN TRANSACTIONS OF 1000 ROWS
 
// Import relationships
LOAD CSV WITH HEADERS FROM 'file:///follows.csv' AS row
MATCH (follower:User {id: row.follower_id})
MATCH (followee:User {id: row.followee_id})
CREATE (follower)-[:FOLLOWS {since: date(row.since)}]->(followee)
 
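// APOC for complex transformations
// The first statement must return the variable (event) that the second one uses.
// parallel is off because concurrent MERGE batches can contend for the same locks.
CALL apoc.periodic.iterate(
  "CALL apoc.load.json('https://api.example.com/users') YIELD value RETURN value AS event",
  "MERGE (u:User {id: event.id}) SET u += event",
  {batchSize: 500, parallel: false}
)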

# Stop the database first!
neo4j stop
 
# Bulk import from CSV files
neo4j-admin database import full \
  --nodes=User=import/users_header.csv,import/users.csv \
  --nodes=Product=import/products_header.csv,import/products.csv \
  --relationships=PURCHASED=import/purchases_header.csv,import/purchases.csv \
  --skip-bad-relationships=true \
  --skip-duplicate-nodes=true \
  neo4j
 
# Header files define schema:
# users_header.csv: userId:ID,name,email,:LABEL
# purchases_header.csv: :START_ID,:END_ID,date,:TYPE
 
# Start database with new data
neo4j start
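Header and data files for the bulk importer are often generated by a script. The sketch below writes files matching the header formats in the comments above, using Python's csv module; the sample records, the dictionary keys, and `write_import_files` are hypothetical.

```python
import csv
import os
import tempfile


def write_import_files(directory, users, purchases):
    """Write header and data CSVs in the layout the bulk importer expects."""
    paths = {}

    def write(name, rows):
        path = os.path.join(directory, name)
        with open(path, "w", newline="") as f:
            csv.writer(f).writerows(rows)
        paths[name] = path

    write("users_header.csv", [["userId:ID", "name", "email", ":LABEL"]])
    write("users.csv", [[u["id"], u["name"], u["email"], "User"] for u in users])
    write("purchases_header.csv", [[":START_ID", ":END_ID", "date", ":TYPE"]])
    write("purchases.csv",
          [[p["user"], p["product"], p["date"], "PURCHASED"] for p in purchases])
    return paths


with tempfile.TemporaryDirectory() as directory:
    paths = write_import_files(
        directory,
        users=[{"id": "u1", "name": "Alice", "email": "alice@example.com"}],
        purchases=[{"user": "u1", "product": "p9", "date": "2024-01-15"}],
    )
    with open(paths["users_header.csv"], newline="") as f:
        header = next(csv.reader(f))
    with open(paths["purchases.csv"], newline="") as f:
        purchase_rows = list(csv.reader(f))
```

Keeping headers in separate files means the large data files can be produced by an export job without a header row and reused across imports.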

Import Performance Tips

For LOAD CSV: create indexes before import, use MERGE with unique properties, commit every 1000-10000 rows. For bulk import: sort CSV files by ID columns, use SSDs, and set --high-parallel-io for fast storage (the flag was named --high-io in Neo4j 4.x). Avoid importing during production traffic.

Summary: Neo4j in Production

Neo4j represents the mature, production-proven implementation of the property graph model. Let's consolidate the key insights:

Key Takeaways

•Native graph storage — Neo4j stores nodes and relationships with physical pointers, enabling O(1) traversals via index-free adjacency.
•Cypher query language — Declarative, pattern-based queries with ASCII-art syntax. Master MATCH, WHERE, RETURN, and variable-length paths.
•ACID compliance — Full transactional guarantees distinguish Neo4j from eventually consistent graph systems. Essential for data integrity.
•Indexes for starting points — Index properties used in WHERE clauses for initial node lookup. Traversals don't need additional indexes.
•Clustering with Raft — Primary servers for writes (consensus), read replicas for scaling reads. Causal consistency via bookmarks.
•Page cache is critical — Size to fit your graph in memory. Performance degrades rapidly when pages spill to disk.
•EXPLAIN/PROFILE for optimization — Always analyze execution plans. Look for NodeIndexSeek, avoid NodeByLabelScan on large labels.
•Choose import method wisely — LOAD CSV for online, neo4j-admin import for bulk initial load, APOC for streaming ETL.

What's Next:

With Neo4j fundamentals understood, we'll explore relationship-heavy query patterns—traversals, path finding, pattern matching, and graph algorithms. These patterns unlock the full power of graph databases for real-world applications.

Neo4j Fundamentals Complete

You now understand Neo4j's architecture, Cypher query language, indexing, transactions, and clustering. This foundation prepares you to model domains, write efficient queries, and operate Neo4j in production. Next, we'll explore advanced traversal and query patterns.