Loading content...
A graph database storing nodes and relationships is merely a storage system. The true power of graphs emerges when you traverse those relationships—finding paths, matching patterns, discovering communities, and computing influence. These operations would be prohibitively expensive in relational databases, but they're the native language of graph engines.
Consider queries that are trivial in graphs but nightmarish in SQL:
These relationship-heavy queries are where graph databases demonstrate their decisive advantage. Understanding how to express and optimize them is essential for any system designer working with connected data.
This page covers advanced graph query patterns: variable-length path traversals, shortest path algorithms, subgraph pattern matching, and analytical graph algorithms. You'll learn Cypher patterns for each, understand their computational complexity, and know when to apply them in production systems.
The most distinguishing feature of graph query languages is the ability to express variable-length paths—traversals where the number of hops is unknown at query time. In SQL, this requires recursive CTEs with complex termination logic. In Cypher, it's a single character: *.
Basic Syntax:
(start)-[:RELATIONSHIP*]->(end) // Any number of hops
(start)-[:RELATIONSHIP*1..3]->(end) // 1 to 3 hops
(start)-[:RELATIONSHIP*2]->(end) // Exactly 2 hops
(start)-[:RELATIONSHIP*..5]->(end) // Up to 5 hops (includes 0)
Key Concepts:
p = (a)-[*]->(b)relationships(p)nodes(p)12345678910111213141516171819202122232425262728
// Find everyone within 3 hops of Alice in social networkMATCH (alice:User {name: 'Alice'})-[:KNOWS*1..3]->(connection)RETURN DISTINCT connection.name, count(*) AS path_count // Capture and analyze pathsMATCH path = (alice:User {name: 'Alice'})-[:KNOWS*1..3]->(bob:User {name: 'Bob'})RETURN path, length(path) AS hops, [n IN nodes(path) | n.name] AS node_names, [r IN relationships(path) | type(r)] AS rel_types // Find dependency chains in software modulesMATCH chain = (app:Module {name: 'MyApp'})-[:DEPENDS_ON*]->(lib:Module)WHERE lib.deprecated = trueRETURN app.name, lib.name AS deprecated_dependency, length(chain) AS chain_depth // Organizational hierarchy - find all reports (direct and indirect)MATCH (ceo:Employee {title: 'CEO'})<-[:REPORTS_TO*]-(employee)RETURN employee.name, employee.titleORDER BY employee.lastName // Limit depth for performanceMATCH (start:Node)-[:CONNECTED*1..6]->(end:Node) // Max 6 hopsWHERE start.id = 123RETURN DISTINCT end.idUnbounded traversals (* without upper limit) can explore exponentially growing result sets. If each node has 100 connections, 6 hops explores 100^6 = 1 trillion potential paths. Always bound traversals in production: use *1..4 not *, and apply LIMIT to results.
Filtering During Traversal:
You can filter which relationships and nodes are traversed, pruning the search space:
1234567891011121314151617181920212223
// Only traverse active relationshipsMATCH path = (start)-[:KNOWS*1..4 {active: true}]->(end)WHERE start.id = 'user-1'RETURN path // Filter nodes during traversal (Cypher 5+)MATCH path = (start) ((intermediate WHERE intermediate.status = 'active') -[:CONNECTED]->)+ (end)WHERE start.id = 'node-1'RETURN path // Traditional approach: filter after, then check path validityMATCH path = (start)-[:CONNECTED*1..5]->(end)WHERE start.id = 'node-1' AND all(n IN nodes(path)[1..-1] WHERE n.status = 'active') // Intermediate nodes onlyRETURN path // Multiple relationship types in pathMATCH (a)-[:KNOWS|WORKS_WITH*1..3]->(b) // Either relationship typeWHERE a.name = 'Alice'RETURN DISTINCT b.nameFinding the shortest path between nodes is a fundamental graph operation. Neo4j provides built-in functions for efficient path finding, leveraging algorithms like bidirectional BFS.
Key Functions:
| Function | Purpose | Algorithm |
|---|---|---|
shortestPath() | Single shortest path | Bidirectional BFS |
allShortestPaths() | All paths of minimum length | Bidirectional BFS |
apoc.algo.dijkstra() | Weighted shortest path | Dijkstra's |
apoc.algo.aStar() | Weighted with heuristic | A* |
Unweighted Shortest Path:
1234567891011121314151617181920212223242526272829303132
// Find single shortest path (any path of minimum length)MATCH path = shortestPath( (alice:User {name: 'Alice'})-[:KNOWS*]-(bob:User {name: 'Bob'}))RETURN path, length(path) AS degrees_of_separation, [n IN nodes(path) | n.name] AS connection_chain // Find ALL shortest paths (may be multiple of same length)MATCH paths = allShortestPaths( (alice:User {name: 'Alice'})-[:KNOWS*]-(bob:User {name: 'Bob'}))RETURN paths, length(paths) AS path_length // Constrain path length for performanceMATCH path = shortestPath( (a:Station {name: 'Kings Cross'})-[:CONNECTS*..15]-(b:Station {name: 'Heathrow'}))RETURN path // Filter during shortest path searchMATCH path = shortestPath( (a:City {name: 'London'})-[:ROAD* {toll: false}]-(b:City {name: 'Edinburgh'}))RETURN path // Only toll-free roads // Shortest path with relationship type filterMATCH path = shortestPath( (a:User {id: $userId})-[:FOLLOWS|KNOWS*..6]-(b:User {id: $targetId}))RETURN pathWeighted Shortest Path (Dijkstra):
When relationships have costs (distance, time, weight), use Dijkstra's algorithm to find the minimum-cost path:
123456789101112131415161718192021222324252627282930
// Using APOC Dijkstra for weighted pathsMATCH (start:Station {name: 'London'}), (end:Station {name: 'Edinburgh'})CALL apoc.algo.dijkstra(start, end, 'CONNECTS', 'distance')YIELD path, weightRETURN path, weight AS total_distance, [n IN nodes(path) | n.name] AS stations // A* algorithm with heuristic (for geographic data)MATCH (start:Location {name: 'Paris'}), (end:Location {name: 'Berlin'})CALL apoc.algo.aStar( start, end, 'ROAD', // Relationship type 'distance', // Weight property 'latitude', // Latitude property for heuristic 'longitude' // Longitude property for heuristic) YIELD path, weightRETURN path, weight // Multiple weighted criteria (compute custom weight)MATCH (start:City {name: 'NYC'}), (end:City {name: 'LA'})CALL apoc.algo.dijkstraWithDefaultWeight( start, end, 'FLIGHT', 'cost', // Primary weight property 1.0 // Default weight if property missing) YIELD path, weightRETURN path, weight AS total_costNeo4j's shortestPath uses bidirectional BFS—simultaneously searching from both start and end nodes. This dramatically reduces the search space compared to unidirectional search, especially in dense graphs. The search terminates when wavefronts meet.
Beyond paths between two nodes, graph databases excel at finding patterns—specific configurations of nodes and relationships that match a template. This is the essence of graph query languages: "Find all subgraphs matching this shape."
Common Pattern Types:
123456789101112131415161718192021222324252627282930313233
// Triangle pattern: three mutual friendsMATCH (a:User)-[:KNOWS]->(b:User)-[:KNOWS]->(c:User)-[:KNOWS]->(a)WHERE a.name < b.name AND b.name < c.name // Avoid duplicatesRETURN a.name, b.name, c.name AS triangle_members // Star pattern: hub with many connectionsMATCH (hub:User)-[:FOLLOWS]->(follower:User)WITH hub, count(follower) AS follower_countWHERE follower_count > 1000RETURN hub.name AS influencer, follower_countORDER BY follower_count DESC // Fork pattern: shared connectionsMATCH (a:User)-[:KNOWS]->(shared:User)<-[:KNOWS]-(b:User)WHERE a.name = 'Alice' AND b.name = 'Bob' AND a <> b // Different peopleRETURN shared.name AS mutual_friend // Chain pattern: referral chainMATCH chain = (referrer:User)-[:REFERRED*1..5]->(referee:User)WHERE referrer.name = 'Alice'RETURN chain, length(chain) AS chain_length, [n IN nodes(chain) | n.name] AS referral_chain // Recommendation pattern: friend bought, you might likeMATCH (me:User {name: 'Alice'})-[:KNOWS]->(friend:User) -[:PURCHASED]->(product:Product)WHERE NOT (me)-[:PURCHASED]->(product) // I haven't bought itRETURN product.name, count(friend) AS friends_who_boughtORDER BY friends_who_bought DESCLIMIT 10Cycle Detection:
Finding cycles is crucial for dependency analysis, fraud detection, and data integrity checks:
12345678910111213141516171819202122232425262728
// Find circular dependencies in modulesMATCH cycle = (m:Module)-[:DEPENDS_ON*]->(m)RETURN [n IN nodes(cycle) | n.name] AS circular_dependency // Detect fraud rings: accounts in transaction cyclesMATCH cycle = (a:Account)-[:TRANSFERRED_TO*2..6]->(a)WHERE sum([r IN relationships(cycle) | r.amount]) > 10000RETURN cycle // Check for circular references in hierarchies (data integrity)MATCH (e:Employee)-[:REPORTS_TO*]->(e)RETURN e.name AS cycle_participant, 'Circular reporting chain detected' AS issue // Negative cycles in weighted graphs (arbitrage detection)// Requires specific algorithm - APOC or GDSCALL gds.bellmanFord.stream({ nodeProjection: 'Currency', relationshipProjection: { EXCHANGE: { type: 'EXCHANGE_RATE', properties: ['rate'] } }, sourceNode: 'USD'})YIELD nodeId, totalCost// If negative cycle exists, arbitrage opportunityPattern matching complexity depends on pattern size and graph structure. Simple patterns (triangles, pairs) are fast. Complex patterns with many nodes and relationships require careful optimization—use MATCH... WITH... MATCH to break into stages, filter aggressively, and consider approximate algorithms for very large patterns.
Beyond pattern matching, graph databases support analytical graph algorithms—computations that analyze the entire graph structure. Neo4j provides these through the Graph Data Science (GDS) library.
Algorithm Categories:
| Category | Purpose | Examples |
|---|---|---|
| Centrality | Identify important nodes | PageRank, Betweenness, Degree |
| Community Detection | Find clusters/groups | Louvain, Label Propagation |
| Pathfinding | Find routes | Dijkstra, A*, BFS, DFS |
| Similarity | Compare nodes | Jaccard, Cosine, Euclidean |
| Link Prediction | Predict new edges | Common Neighbors, Adamic-Adar |
| Node Embedding | Vector representations | Node2Vec, GraphSAGE |
123456789101112131415161718192021222324252627282930313233
// Create in-memory graph projection for algorithmsCALL gds.graph.project( 'socialGraph', // Projection name 'User', // Node labels 'FOLLOWS', // Relationship type { relationshipProperties: 'weight' }) // PageRank: identify influential nodesCALL gds.pageRank.stream('socialGraph')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS user, score AS influenceORDER BY score DESCLIMIT 10 // Betweenness Centrality: nodes bridging communitiesCALL gds.betweenness.stream('socialGraph')YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS user, score AS bridges_communitiesORDER BY score DESCLIMIT 10 // Degree Centrality: most connected nodesCALL gds.degree.stream('socialGraph', { orientation: 'REVERSE' })YIELD nodeId, scoreRETURN gds.util.asNode(nodeId).name AS user, score AS follower_countORDER BY score DESC // Write results back to graphCALL gds.pageRank.write('socialGraph', { writeProperty: 'pageRank'})YIELD nodePropertiesWritten1234567891011121314151617181920212223242526
// Louvain community detection (modularity optimization)CALL gds.louvain.stream('socialGraph')YIELD nodeId, communityIdRETURN communityId, count(*) AS members, collect(gds.util.asNode(nodeId).name)[..5] AS sample_membersORDER BY members DESC // Label Propagation (faster, less optimal)CALL gds.labelPropagation.stream('socialGraph')YIELD nodeId, communityIdWITH communityId, collect(gds.util.asNode(nodeId)) AS membersRETURN communityId, size(members) AS community_sizeORDER BY community_size DESC // Weakly Connected ComponentsCALL gds.wcc.stream('socialGraph')YIELD nodeId, componentIdRETURN componentId, count(*) AS component_sizeORDER BY component_size DESC // Write community assignments back to graphCALL gds.louvain.write('socialGraph', { writeProperty: 'community'})YIELD communityCount, modularitySimilarity and Link Prediction:
These algorithms are fundamental for recommendation systems:
12345678910111213141516171819202122232425262728293031
// Node Similarity: find users with similar follow patternsCALL gds.nodeSimilarity.stream('socialGraph')YIELD node1, node2, similarityRETURN gds.util.asNode(node1).name AS user1, gds.util.asNode(node2).name AS user2, similarityORDER BY similarity DESCLIMIT 20 // Link Prediction: predict likely future connections// Adamic-Adar: weighted common neighborsMATCH (a:User {name: 'Alice'}), (b:User)WHERE NOT (a)-[:FOLLOWS]->(b) AND a <> bCALL gds.alpha.linkprediction.adamicAdar.stream({ node1: a, node2: b, relationshipQuery: 'FOLLOWS'})YIELD node1, node2, similarityRETURN gds.util.asNode(node2).name AS recommended_follow, similarityORDER BY similarity DESCLIMIT 10 // Common Neighbors predictionMATCH (a:User {name: 'Alice'})MATCH (a)-[:FOLLOWS]->(:User)<-[:FOLLOWS]-(recommended:User)WHERE NOT (a)-[:FOLLOWS]->(recommended) AND a <> recommendedRETURN recommended.name, count(*) AS common_followeesORDER BY common_followees DESCLIMIT 10GDS algorithms create in-memory graph projections. For large graphs, memory requirements can exceed heap size. Use gds.graph.project with node/relationship filters to work on subgraphs, or upgrade to machines with more RAM for full-graph analysis.
Complex graph queries can become expensive. Here are optimization strategies for relationship-heavy operations:
1. Bound Your Traversals:
Never use unbounded variable-length paths in production:
12345678910111213141516171819
// ❌ Bad: Unbounded traversal (potential explosion)MATCH (a)-[:KNOWS*]->(b)RETURN b // ✅ Good: Bounded traversalMATCH (a)-[:KNOWS*1..4]->(b)RETURN b // ✅ Better: Bounded with LIMITMATCH (a:User {id: $userId})-[:KNOWS*1..4]->(b)RETURN DISTINCT b.nameLIMIT 100 // ✅ Best: Strategic termination with aggregationMATCH (a:User {id: $userId})-[:KNOWS*1..3]->(b)WITH b, count(*) AS relevanceORDER BY relevance DESCLIMIT 50RETURN b.name, relevance2. Filter Early, Traverse Late:
Apply WHERE clauses as early as possible to prune the search space:
1234567891011121314151617181920
// ❌ Bad: Filter after expensive traversalMATCH (a:User)-[:KNOWS*1..4]->(b:User)-[:PURCHASED]->(p:Product)WHERE a.tier = 'premium' AND p.category = 'electronics'RETURN p // ✅ Good: Filter before traversalMATCH (a:User {tier: 'premium'})MATCH (a)-[:KNOWS*1..4]->(b:User)MATCH (b)-[:PURCHASED]->(p:Product {category: 'electronics'})RETURN DISTINCT p // ✅ Better: Use WITH to stage and filterMATCH (a:User {tier: 'premium'})WITH a LIMIT 1000 // Reduce starting set if neededMATCH (a)-[:KNOWS*1..3]->(b:User)WITH DISTINCT bMATCH (b)-[:PURCHASED]->(p:Product)WHERE p.price > 100RETURN p.name, count(*) AS buyersORDER BY buyers DESC3. Use DISTINCT Strategically:
Variable-length paths can reach the same node via different routes. Use DISTINCT to eliminate duplicates early, but be aware of memory implications:
12345678910111213
// Without DISTINCT: may process same node multiple timesMATCH (a:User {id: 1})-[:KNOWS*1..4]->(friend)RETURN friend.name // Bob might appear 100 times via different paths // With DISTINCT: unique results, but evaluated at endMATCH (a:User {id: 1})-[:KNOWS*1..4]->(friend)RETURN DISTINCT friend.name // Better: DISTINCT mid-query to reduce downstream workMATCH (a:User {id: 1})-[:KNOWS*1..4]->(friend)WITH DISTINCT friend // Deduplicate earlyMATCH (friend)-[:PURCHASED]->(product)RETURN friend.name, collect(product.name) AS purchases4. Analyze with PROFILE:
Always profile complex queries to understand execution:
12345678910111213141516
// Profile to see actual executionPROFILEMATCH (a:User {email: 'alice@example.com'})-[:KNOWS*1..3]->(friend)MATCH (friend)-[:PURCHASED]->(product:Product)RETURN DISTINCT product.name, count(friend) AS recommendersORDER BY recommenders DESCLIMIT 10 // Look for in the output:// - DB HITS: Lower is better// - Estimated vs Actual rows: Big differences indicate stale statistics// - NodeByLabelScan: Should use NodeIndexSeek instead// - Expand(All) operators: Shows traversal patterns // Update statistics if estimates are offCALL db.prepareForReplanning()Neo4j caches query plans based on query structure. Use $parameters instead of literal values to benefit from plan caching. WHERE name = $name caches; WHERE name = 'Alice' creates a new plan each time.
Graph queries often require aggregating results across paths and relationships. Cypher provides powerful aggregation functions:
Key Aggregation Functions:
count(): Count rows or distinct valuescollect(): Gather values into a listsum(), avg(), min(), max(): Numeric aggregationspercentileCont(), percentileDisc(): Percentile calculationsstDev(), stDevP(): Standard deviation1234567891011121314151617181920212223242526272829303132
// Count relationships per nodeMATCH (u:User)-[r:FOLLOWS]->()RETURN u.name, count(r) AS following_countORDER BY following_count DESC // Collect into listsMATCH (u:User)-[:PURCHASED]->(p:Product)RETURN u.name, collect(p.name) AS purchased_products, sum(p.price) AS total_spent // Nested aggregations with collectMATCH (category:Category)<-[:IN_CATEGORY]-(product:Product)<-[:PURCHASED]-(user:User)WITH category, product, count(user) AS buyersRETURN category.name, collect({product: product.name, buyers: buyers})[..5] AS top_productsORDER BY category.name // Aggregating path propertiesMATCH path = (a:City)-[:ROAD*]->(b:City)WHERE a.name = 'London' AND b.name = 'Edinburgh'WITH path, reduce(total = 0, r IN relationships(path) | total + r.distance) AS total_distRETURN path, total_distORDER BY total_distLIMIT 5 // Group by path lengthMATCH (a:User {name: 'Alice'})-[path:KNOWS*1..4]->(b:User)RETURN length(path) AS degrees, count(DISTINCT b) AS reachable_peopleORDER BY degreesUsing REDUCE for Path Calculations:
The reduce function iterates over lists, accumulating a result—essential for path weight calculations:
123456789101112131415161718192021222324
// Sum weights along a pathMATCH path = (a:Station)-[:CONNECTS*]->(b:Station)WHERE a.name = 'Kings Cross' AND b.name = 'Heathrow'WITH path, reduce(time = 0, r IN relationships(path) | time + r.minutes) AS travel_timeRETURN [n IN nodes(path) | n.name] AS route, travel_timeORDER BY travel_timeLIMIT 5 // Multiply probabilities along pathMATCH path = (start:Process)-[:LEADS_TO*]->(end:Outcome)WHERE start.name = 'Initial'WITH path, reduce(prob = 1.0, r IN relationships(path) | prob * r.probability) AS total_probRETURN [n IN nodes(path) | n.name] AS steps, total_probORDER BY total_prob DESC // Collect node properties along pathMATCH path = (a:User)-[:KNOWS*2..3]->(b:User)WHERE a.name = 'Alice'RETURN [n IN nodes(path) | n.name] AS connection_chain, reduce(ages = [], n IN nodes(path) | ages + n.age) AS ages_in_chainLarge collect() results consume heap memory. For very large result sets, use pagination (SKIP/LIMIT) or streaming APIs instead of collecting everything. Consider apoc.periodic.iterate for batch processing.
Let's examine query patterns from production systems across common graph use cases:
123456789101112131415161718192021222324
// Mutual friends between two usersMATCH (a:User {id: $userId1})-[:FRIENDS]-(mutual:User)-[:FRIENDS]-(b:User {id: $userId2})RETURN mutual.name, mutual.avatar // Friend suggestions (friends of friends who aren't already friends)MATCH (me:User {id: $userId})-[:FRIENDS]-(friend)-[:FRIENDS]-(suggested)WHERE NOT (me)-[:FRIENDS]-(suggested) AND me <> suggestedRETURN suggested.name, count(friend) AS mutual_friends, collect(friend.name)[..3] AS throughORDER BY mutual_friends DESCLIMIT 10 // Engagement-weighted feedMATCH (me:User {id: $userId})-[f:FOLLOWS]->(creator:User)-[:POSTED]->(post:Post)WHERE post.timestamp > datetime() - duration('P7D')WITH post, creator, CASE WHEN f.engagement = 'high' THEN 3 WHEN f.engagement = 'medium' THEN 2 ELSE 1 END AS weightRETURN post.id, creator.name, post.contentORDER BY post.timestamp DESC, weight DESCLIMIT 50123456789101112131415161718192021222324252627
// Account sharing detection: multiple logins from same deviceMATCH (account1:Account)-[:LOGGED_IN_FROM]->(device:Device)<-[:LOGGED_IN_FROM]-(account2:Account)WHERE account1 <> account2WITH device, collect(DISTINCT account1.id) AS accountsWHERE size(accounts) > 3RETURN device.fingerprint, accounts AS suspicious_accounts // Transaction ring detectionMATCH ring = (a:Account)-[:TRANSFERRED*3..6]->(a)WHERE all(r IN relationships(ring) WHERE r.amount > 5000) AND all(r IN relationships(ring) WHERE r.timestamp > datetime() - duration('P30D'))RETURN [n IN nodes(ring) | n.id] AS ring_members, size(nodes(ring)) AS ring_size, reduce(total = 0, r IN relationships(ring) | total + r.amount) AS total_transferred // First-party fraud: synthetic identitiesMATCH (identity:Identity)OPTIONAL MATCH (identity)-[:HAS_SSN]->(ssn:SSN)<-[:HAS_SSN]-(other1:Identity)OPTIONAL MATCH (identity)-[:HAS_ADDRESS]->(addr:Address)<-[:HAS_ADDRESS]-(other2:Identity)OPTIONAL MATCH (identity)-[:HAS_PHONE]->(phone:Phone)<-[:HAS_PHONE]-(other3:Identity)WITH identity, count(DISTINCT other1) AS shared_ssn, count(DISTINCT other2) AS shared_address, count(DISTINCT other3) AS shared_phoneWHERE shared_ssn > 0 OR shared_address + shared_phone > 3RETURN identity.name, shared_ssn, shared_address, shared_phoneORDER BY shared_ssn DESC, shared_address + shared_phone DESC123456789101112131415161718
// Question answering: relationships between conceptsMATCH path = shortestPath((c1:Concept {name: $concept1})-[*..5]-(c2:Concept {name: $concept2}))RETURN path, [r IN relationships(path) | type(r)] AS relationship_chain // Semantic similarity via shared categoriesMATCH (a:Article {id: $articleId})-[:TAGGED]->(tag:Tag)<-[:TAGGED]-(similar:Article)WITH similar, count(tag) AS shared_tagsWHERE shared_tags >= 3RETURN similar.title, shared_tagsORDER BY shared_tags DESCLIMIT 10 // Inferred relationships (reasoning)MATCH (person:Person)-[:WORKS_AT]->(company:Company) -[:LOCATED_IN]->(city:City)-[:IN_COUNTRY]->(country:Country)RETURN person.name, company.name, city.name, country.name AS likely_country// Person's likely country inferred through company locationBuild a library of parameterized query templates for your domain. Common patterns—friend suggestions, shortest paths, fraud triangles—should be optimized once and reused. Document expected cardinalities and performance characteristics.
Relationship-heavy queries are where graph databases demonstrate their decisive advantage over relational systems. Let's consolidate the key patterns and principles:
*1..N syntax for multi-hop traversals. Always bound with upper limits in production to prevent explosion.shortestPath() for unweighted, apoc.algo.dijkstra() for weighted. Bidirectional search makes this efficient.$param not literals for query plan reuse.What's Next:
We'll explore real-world graph database applications—social networks, recommendation engines, fraud detection, and knowledge graphs. These case studies demonstrate how the patterns you've learned apply to production systems at scale.
You now understand the full spectrum of graph query patterns—from simple path traversals to advanced algorithms. These patterns are the tools that unlock the graph's power: finding connections, detecting patterns, and analyzing networks in ways that would be prohibitive in other database models.