Graph Databases - Learning Module

Social Networks and Recommendation Engines

Where Graphs Shine Brightest

Two application domains have propelled graph databases from academic curiosity to essential infrastructure: social networks and recommendation engines. These aren't niche applications—they power the features billions of people use daily:

Facebook's social graph connects roughly 3 billion users through trillions of relationships
LinkedIn's "People You May Know" reportedly drives more than half of connection requests
Netflix has estimated that its recommendation engine is worth over $1 billion a year through improved retention
Amazon's "Customers who bought this also bought" is widely credited with about 35% of revenue

Both domains share a common characteristic: the value lies in the connections, not just the entities. Users matter, but their friendships, follows, purchases, and ratings form the true data asset. This makes graph databases the natural fit.

What You Will Learn

This page explores social network and recommendation engine architectures in depth. You'll learn graph models for social features, feed algorithms, friend suggestion engines, collaborative filtering on graphs, content-based recommendations, and hybrid approaches. We'll examine real production patterns used by leading platforms.

The Social Network Data Model

A social network is, at its core, a graph of people connected through relationships. But real social platforms have evolved sophisticated models that capture nuanced interactions.

Core Entities (Nodes):

User: Profile, settings, status, tier
Post/Content: Text, images, videos, metadata
Group/Community: Shared interest spaces
Page/Brand: Organizational entities
Event: Time-bound gatherings
Media: Photos, videos, documents
Location: Places, check-ins

Core Relationships (Edges):

// User connections - various relationship types
(user1:User)-[:FOLLOWS {since: datetime(), muted: false}]->(user2:User)
(user1:User)-[:FRIENDS {since: datetime(), source: 'request'}]->(user2:User)
(user1:User)-[:BLOCKED {since: datetime()}]->(user2:User)
(user:User)-[:MEMBER_OF {role: 'admin', since: datetime()}]->(group:Group)
 
// Content creation and engagement
(user:User)-[:POSTED {at: datetime()}]->(post:Post)
(user:User)-[:LIKED {at: datetime()}]->(post:Post)
(user:User)-[:COMMENTED {at: datetime()}]->(comment:Comment)-[:ON]->(post:Post)
(user:User)-[:SHARED {at: datetime(), via: 'story'}]->(post:Post)
(user:User)-[:MENTIONED_IN]->(post:Post)
(user:User)-[:TAGGED_IN]->(media:Media)
 
// Content relationships
(post:Post)-[:REPLY_TO]->(parentPost:Post)  // Thread structure
(post:Post)-[:QUOTE_OF]->(originalPost:Post)
(post:Post)-[:TAGGED_WITH]->(hashtag:Hashtag)
(post:Post)-[:AT_LOCATION]->(location:Location)
(post:Post)-[:CONTAINS]->(media:Media)
 
// User attributes
(user:User)-[:WORKS_AT {title: 'Engineer', since: date()}]->(company:Company)
(user:User)-[:STUDIED_AT {degree: 'BS CS', year: 2018}]->(school:School)
(user:User)-[:LIVES_IN]->(city:City)
(user:User)-[:INTERESTED_IN]->(topic:Topic)

Symmetric vs. Asymmetric Relationships:

Relationship	Symmetry	Example
FRIENDS	Symmetric	Facebook mutual friendship
FOLLOWS	Asymmetric	Twitter, Instagram
BLOCKED	Asymmetric	One-way block
KNOWS	Often symmetric	LinkedIn connections

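For symmetric relationships, one option is to store both directions:

// When Alice friends Bob, create both directions
CREATE (alice)-[:FRIENDS {since: $date}]->(bob)
CREATE (bob)-[:FRIENDS {since: $date}]->(alice)

The other is to store a single edge and match without a direction. Pick one approach: an undirected match over doubly stored edges returns each friend twice unless you deduplicate. With a single stored edge, the query looks like this: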

// Match friends regardless of direction
MATCH (user:User {id: $userId})-[:FRIENDS]-(friend:User)
RETURN friend

Privacy and Access Control

Social platforms embed privacy controls into the graph model. Relationship properties track visibility (public/friends/custom). Queries must filter based on viewer permissions—a user's blocked list, privacy settings, and group memberships all affect what data can be returned.
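
The filtering described above can be sketched outside the database as a single permission check. This is a minimal illustration, not a production access-control layer: the `Post` class, the `can_view` function, and the three visibility values are assumptions made for the example, and visibility is stored on the post here for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class Post:
    author: str
    visibility: str  # assumed values: 'public', 'friends', 'custom'
    allowed: set = field(default_factory=set)  # only consulted for 'custom'


def can_view(viewer, post, friends, blocked):
    """Return True if `viewer` may see `post`.

    friends / blocked: dict mapping a user to a set of users.
    """
    if viewer == post.author:
        return True
    # A block in either direction hides the post
    if post.author in blocked.get(viewer, set()):
        return False
    if viewer in blocked.get(post.author, set()):
        return False
    if post.visibility == "public":
        return True
    if post.visibility == "friends":
        return viewer in friends.get(post.author, set())
    if post.visibility == "custom":
        return viewer in post.allowed
    return False  # unknown visibility value: deny by default
```

Denying by default on an unrecognized visibility value keeps a data-entry mistake from exposing content.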

Friend Suggestions: People You May Know

"People You May Know" (PYMK) is one of the most impactful features in social platforms—LinkedIn reports it drives half of all connection requests. The core insight: people with mutual connections are likely to know each other in real life.

Basic Algorithm: Mutual Friends

The simplest PYMK algorithm counts mutual connections:

// Basic PYMK: friends of friends not already connected
MATCH (me:User {id: $userId})-[:FRIENDS]-(friend:User)-[:FRIENDS]-(suggestion:User)
WHERE NOT (me)-[:FRIENDS]-(suggestion)
  AND NOT (me)-[:BLOCKED]-(suggestion)
  AND me <> suggestion
RETURN suggestion.id, suggestion.name, suggestion.avatar,
       count(DISTINCT friend) AS mutual_count,  // DISTINCT guards against doubly stored FRIENDS edges
       collect(DISTINCT friend.name)[..3] AS sample_mutuals
ORDER BY mutual_count DESC
LIMIT 20
 
// With quality scoring
MATCH (me:User {id: $userId})-[:FRIENDS]-(friend:User)-[:FRIENDS]-(suggestion:User)
WHERE NOT (me)-[:FRIENDS]-(suggestion)
  AND me <> suggestion
WITH suggestion, 
     count(DISTINCT friend) AS mutual_count,
     collect(DISTINCT friend) AS mutuals
// Weight recently active mutuals higher
WITH suggestion, mutual_count,
     size([f IN mutuals WHERE f.lastActive > datetime() - duration('P7D')]) AS active_mutuals
RETURN suggestion.id,
       suggestion.name,
       mutual_count,
       active_mutuals,
       // Combined score: mutuals + recency boost
       mutual_count + (active_mutuals * 0.5) AS score
ORDER BY score DESC
LIMIT 20
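
To make the counting logic concrete, here is the same friends-of-friends idea on an in-memory adjacency map. It is a sketch under stated assumptions: `suggest_friends` and its parameters are hypothetical names, and `friends` is assumed to be a symmetric dict of sets.

```python
from collections import Counter


def suggest_friends(friends, user, blocked=frozenset(), limit=20):
    """Rank friends-of-friends by number of mutual friends.

    friends: dict mapping each user to the set of their friends (symmetric).
    """
    direct = friends.get(user, set())
    mutual_counts = Counter()
    for friend in direct:
        for candidate in friends.get(friend, set()):
            # Skip myself, existing friends, and blocked users
            if candidate == user or candidate in direct or candidate in blocked:
                continue
            mutual_counts[candidate] += 1
    # Highest mutual count first; name breaks ties for stable output
    ranked = sorted(mutual_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


friends = {
    "ann": {"bob", "cat"},
    "bob": {"ann", "dan", "eve"},
    "cat": {"ann", "dan"},
    "dan": {"bob", "cat"},
    "eve": {"bob"},
}
print(suggest_friends(friends, "ann"))  # [('dan', 2), ('eve', 1)]
```

Because each friend is visited once and the neighbor sets contain no duplicates, every increment corresponds to exactly one distinct mutual friend.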

Enhanced Signals:

Production PYMK systems incorporate multiple signals beyond mutual friends:

// Combine multiple connection signals
MATCH (me:User {id: $userId})
 
// Signal 1: Mutual friends (strongest signal)
OPTIONAL MATCH (me)-[:FRIENDS]-(mutual)-[:FRIENDS]-(s1:User)
WHERE NOT (me)-[:FRIENDS]-(s1) AND me <> s1
WITH me, s1, count(DISTINCT mutual) AS mutual_count
 
// Signal 2: Same workplace
OPTIONAL MATCH (me)-[:WORKS_AT]->(company)<-[:WORKS_AT]-(s2:User)
WHERE NOT (me)-[:FRIENDS]-(s2) AND me <> s2
WITH me, collect(DISTINCT {user: s1, mutuals: mutual_count}) AS mutual_suggestions,
     collect(DISTINCT s2) AS coworkers
 
// Signal 3: Same school (this query does not check for overlapping years)
OPTIONAL MATCH (me)-[:STUDIED_AT]->(school)<-[:STUDIED_AT]-(s3:User)
WHERE NOT (me)-[:FRIENDS]-(s3) AND me <> s3
WITH me, mutual_suggestions, coworkers, collect(DISTINCT s3) AS classmates
 
// Signal 4: Same groups
OPTIONAL MATCH (me)-[:MEMBER_OF]->(group)<-[:MEMBER_OF]-(s4:User)
WHERE NOT (me)-[:FRIENDS]-(s4) AND me <> s4
WITH me, mutual_suggestions, coworkers, classmates, collect(DISTINCT s4) AS group_members
 
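// Combine and score all signals
// Candidates come from the mutual-friends list; the other signals only boost them
UNWIND mutual_suggestions AS ms
WITH ms, coworkers, classmates, group_members
WHERE ms.user IS NOT NULL  // OPTIONAL MATCH yields a null entry when there are no mutuals
// An alias cannot be referenced in the WITH that defines it, so test ms.user directly
WITH ms.user AS suggestion,
     ms.mutuals * 10 AS mutual_score,  // Highest weight
     CASE WHEN ms.user IN coworkers THEN 5 ELSE 0 END AS work_score,
     CASE WHEN ms.user IN classmates THEN 4 ELSE 0 END AS school_score,
     CASE WHEN ms.user IN group_members THEN 2 ELSE 0 END AS group_score
RETURN suggestion.id, suggestion.name,
       mutual_score + work_score + school_score + group_score AS total_score
ORDER BY total_score DESC
LIMIT 20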

PYMK Signal Weights (Typical)

• Mutual friends: Highest weight; each mutual adds significant signal
• Shared employer: Strong signal; people know coworkers
• Shared school + overlapping years: Strong; alumni networks
• Shared groups/interests: Medium; common context
• Location proximity: Medium; local connections valuable
• Contact book matches: Strongest when available; explicit real-world connection
• Profile viewers: Medium; if they viewed you, there's interest

Pre-compute for Scale

Real-time PYMK computation is expensive for high-connection users. Production systems pre-compute suggestions periodically (hourly/daily), storing results in materialized views. Real-time queries fetch from cache, with background jobs refreshing the suggestions.

Social Feed Algorithms

The social feed—showing users relevant content from their network—is the core product of most social platforms. Feed algorithms balance recency, relevance, and engagement to maximize user value.

Feed Types:

Feed Type	Algorithm	Use Case
Chronological	Time-sorted	Twitter's "Latest"
Ranked	ML-based scoring	Facebook's main feed
Interest-based	Topic-weighted	TikTok's "For You"
Hybrid	Recency + ranking	Most modern platforms

Basic Chronological Feed:

// Simple time-ordered feed from followed accounts
MATCH (me:User {id: $userId})-[:FOLLOWS]->(following:User)
      -[:POSTED]->(post:Post)
WHERE post.createdAt > datetime() - duration('P7D')
  AND NOT (me)-[:BLOCKED]-(following)
  AND post.visibility IN ['public', 'followers']  // IN avoids the AND/OR precedence trap
RETURN post.id, post.content, post.createdAt,
       following.name AS author,
       following.avatar AS authorAvatar
ORDER BY post.createdAt DESC
LIMIT 50
 
// With engagement counts (likes, comments, shares)
MATCH (me:User {id: $userId})-[:FOLLOWS]->(author:User)-[:POSTED]->(post:Post)
WHERE post.createdAt > datetime() - duration('P7D')
OPTIONAL MATCH (post)<-[like:LIKED]-()
OPTIONAL MATCH (post)<-[:ON]-(comment:Comment)
OPTIONAL MATCH (post)<-[share:SHARED]-()
RETURN post.id, post.content, post.createdAt,
       author.name,
       count(DISTINCT like) AS likes,
       count(DISTINCT comment) AS comments,
       count(DISTINCT share) AS shares
ORDER BY post.createdAt DESC
LIMIT 50

Ranked Feed with Engagement Scoring:

Ranked feeds use multiple signals to surface "better" content:

// Multi-factor ranked feed
MATCH (me:User {id: $userId})-[follow:FOLLOWS]->(author:User)-[:POSTED]->(post:Post)
WHERE post.createdAt > datetime() - duration('P3D')
 
// Collect engagement metrics
OPTIONAL MATCH (post)<-[:LIKED]-(liker)
OPTIONAL MATCH (post)<-[:ON]-(comment)
OPTIONAL MATCH (post)<-[:SHARED]-(sharer)
 
WITH me, author, post, follow,
     count(DISTINCT liker) AS likes,
     count(DISTINCT comment) AS comments,
     count(DISTINCT sharer) AS shares
 
// Calculate engagement score
WITH me, author, post, follow, likes, comments, shares,
     likes + (comments * 2) + (shares * 3) AS engagement_score
 
// Factor in relationship strength
OPTIONAL MATCH (me)-[interact:LIKED|COMMENTED|SHARED]->(:Post)<-[:POSTED]-(author)
WITH me, author, post, engagement_score,
     count(interact) AS interaction_history
 
// Factor in recency (decay over time)
WITH post, author, engagement_score, interaction_history,
     duration.inSeconds(post.createdAt, datetime()).seconds AS age_seconds  // creation -> now, so age is positive
 
// Combined score: engagement + relationship + a recency bonus that decays with age
WITH post, author, engagement_score, interaction_history, age_seconds,
     engagement_score * 0.4 +
     interaction_history * 2 +
     (1.0 / (1 + age_seconds / 86400.0)) * 10 AS feed_score  // Bonus halves after 24h
 
RETURN post.id, post.content, author.name,
       round(feed_score * 100) / 100 AS score
ORDER BY feed_score DESC
LIMIT 50
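
A quick way to sanity-check the weights in the query above is to run the same arithmetic on a single post. The function below mirrors the formula term by term; the name `feed_score` and the sample numbers are illustrative only, and `age_seconds` is assumed to be the non-negative time since the post was created.

```python
def feed_score(likes, comments, shares, interaction_history, age_seconds):
    """Engagement + relationship strength + a recency bonus that decays with age."""
    engagement = likes + comments * 2 + shares * 3
    # 10 points for a brand-new post, 5 after one day, 2.5 after three days
    recency_bonus = 10.0 / (1.0 + age_seconds / 86400.0)
    return engagement * 0.4 + interaction_history * 2 + recency_bonus


# 10 likes, 3 comments, 1 share -> engagement 19, contributing 7.6
# 2 past interactions with the author -> 4
# exactly one day old -> recency bonus 5
day_old = feed_score(10, 3, 1, interaction_history=2, age_seconds=86400)
fresh = feed_score(10, 3, 1, interaction_history=2, age_seconds=0)
print(round(day_old, 2), round(fresh, 2))  # 16.6 21.6
```

With these weights, a one-day age difference is worth five points, the same as roughly twelve extra likes.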

Diversification:

Pure engagement-ranked feeds create filter bubbles. Production systems inject diversity:

// Ensure variety: no more than 3 posts per author in the feed
MATCH (me:User {id: $userId})-[:FOLLOWS]->(author:User)-[:POSTED]->(post:Post)
WHERE post.createdAt > datetime() - duration('P3D')
 
WITH author, post
ORDER BY post.engagementScore DESC
 
// Collect posts per author, take top 3
WITH author, collect(post)[..3] AS author_posts
UNWIND author_posts AS post
 
// Return diversified feed
RETURN post.id, post.content, author.name
ORDER BY post.engagementScore DESC
LIMIT 50
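
The per-author cap can also be applied after ranking, in application code, which is convenient when scores come from a separate model. A minimal sketch, assuming posts arrive as `(post_id, author, score)` tuples; `diversify` is a hypothetical helper name.

```python
from collections import defaultdict


def diversify(posts, per_author=3, limit=50):
    """Walk posts from highest score down, keeping at most `per_author` per author."""
    ranked = sorted(posts, key=lambda p: p[2], reverse=True)
    taken = defaultdict(int)
    feed = []
    for post_id, author, _score in ranked:
        if taken[author] >= per_author:
            continue  # this author already fills their quota
        taken[author] += 1
        feed.append(post_id)
        if len(feed) == limit:
            break
    return feed


posts = [("p1", "a", 9), ("p2", "a", 8), ("p3", "a", 7), ("p4", "a", 6), ("p5", "b", 5)]
print(diversify(posts, per_author=2, limit=3))  # ['p1', 'p2', 'p5']
```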

Real-World Complexity

Production feed systems are far more complex—incorporating ML models trained on millions of interactions, A/B testing frameworks, real-time feature stores, and caching layers. The graph provides the relationship substrate; ML models provide personalized scoring.

Recommendation Engine Fundamentals

Recommendation engines predict what users will like based on their behavior and the behavior of similar users. Graphs provide a natural model for these relationships.

Core Approaches:

Approach	Description	Graph Pattern
Collaborative Filtering	Users with similar behavior like similar things	User-Item bipartite graph
Content-Based	Items similar to what you liked	Item-Feature graph
Knowledge-Based	Domain rules and constraints	Knowledge graph
Hybrid	Combine multiple approaches	Multi-type graph

The Bipartite User-Item Graph:

Most recommendations start with a bipartite graph—users on one side, items on the other:

// Core interaction patterns
(user:User)-[:VIEWED {at: datetime(), duration: 120}]->(item:Product)
(user:User)-[:PURCHASED {at: datetime(), price: 49.99}]->(item:Product)
(user:User)-[:RATED {score: 4.5, at: datetime()}]->(item:Movie)
(user:User)-[:ADDED_TO_CART]->(item:Product)
(user:User)-[:WISHLISTED]->(item:Product)
(user:User)-[:REVIEWED {rating: 5, text: '...'}]->(item:Restaurant)
 
// Item metadata
(item:Product)-[:IN_CATEGORY]->(category:Category)
(item:Product)-[:HAS_TAG]->(tag:Tag)
(item:Movie)-[:HAS_GENRE]->(genre:Genre)
(item:Movie)-[:STARRING]->(actor:Actor)
(item:Movie)-[:DIRECTED_BY]->(director:Director)
(item:Product)-[:MADE_BY]->(brand:Brand)
 
// User preferences
(user:User)-[:PREFERS]->(category:Category)
(user:User)-[:FOLLOWS]->(brand:Brand)
(user:User)-[:DISLIKES]->(genre:Genre)

Implicit vs. Explicit Signals:

Signal Type	Examples	Strength
Explicit	Ratings, reviews, likes	Strong but sparse
Implicit	Views, purchases, time spent	Weaker but abundant
Negative	Skip, hide, dislike	Important for filtering

Production systems weight signals appropriately:

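// Weighted interaction score
MATCH (user:User {id: $userId})-[r:PURCHASED|RATED|ADDED_TO_CART|VIEWED]->(item)
WITH user, item,
     CASE type(r)
       WHEN 'PURCHASED' THEN 10
       WHEN 'RATED' THEN r.score * 2
       WHEN 'ADDED_TO_CART' THEN 5
       WHEN 'VIEWED' THEN 1
       ELSE 0
     END AS weight
// Sum the weights when a user has several interactions with the same item
RETURN item, sum(weight) AS interaction_score
ORDER BY interaction_score DESC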

The Cold Start Problem

New users and new items lack interaction history. Solutions include: onboarding questionnaires (explicit preferences), content-based fallbacks (item features), popularity-based defaults, and demographic clustering. Graphs help by enabling transitive inference through shared attributes.
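
One way to wire these fallbacks together is a simple priority chain: use personalized results when they exist, then stated preferences, then popularity. The sketch below is illustrative; `cold_start_recommend` and the shapes of its inputs are assumptions made for the example.

```python
def cold_start_recommend(user, personalized, by_category, popular, limit=5):
    """Fill a recommendation list from the strongest available source first.

    user: dict with 'id' and optional 'preferred_categories' (e.g. from onboarding)
    personalized: dict user_id -> item ids derived from interaction history
    by_category: dict category -> item ids (content-based fallback)
    popular: globally popular item ids (default for everyone)
    """
    recs = []

    def extend(items):
        for item in items:
            if item not in recs and len(recs) < limit:
                recs.append(item)

    extend(personalized.get(user["id"], []))
    for category in user.get("preferred_categories", []):
        extend(by_category.get(category, []))
    extend(popular)
    return recs


new_user = {"id": "u9", "preferred_categories": ["books"]}
print(cold_start_recommend(new_user, {}, {"books": ["b1", "b2"]}, ["x", "b1", "y"], limit=4))
# ['b1', 'b2', 'x', 'y']
```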

Collaborative Filtering on Graphs

Collaborative filtering (CF) is based on a powerful insight: users who agreed in the past tend to agree in the future. On graphs, this translates to finding paths through shared items or shared users.

User-Based Collaborative Filtering:

"Find users similar to me, recommend what they liked that I haven't seen."

// Find users with similar purchase history
MATCH (me:User {id: $userId})-[:PURCHASED]->(shared:Product)<-[:PURCHASED]-(similar:User)
WHERE me <> similar
WITH me, similar, count(shared) AS shared_purchases  // keep me in scope for the filter below
ORDER BY shared_purchases DESC
LIMIT 50  // Top 50 similar users
 
// Get their purchases that I don't have
MATCH (similar)-[:PURCHASED]->(recommendation:Product)
WHERE NOT (me)-[:PURCHASED]->(recommendation)
WITH recommendation, count(similar) AS recommender_count
RETURN recommendation.id, recommendation.name, recommender_count
ORDER BY recommender_count DESC
LIMIT 20
 
// With weighted similarity (Jaccard)
MATCH (me:User {id: $userId})-[:PURCHASED]->(myProducts:Product)
WITH me, collect(myProducts) AS my_purchases
 
MATCH (similar:User)-[:PURCHASED]->(theirProducts:Product)
WHERE similar <> me
WITH me, my_purchases, similar, collect(theirProducts) AS their_purchases
 
// Jaccard similarity = intersection / union
WITH me, similar,
     size([p IN their_purchases WHERE p IN my_purchases]) AS intersection,
     size(my_purchases) + size(their_purchases) - 
       size([p IN their_purchases WHERE p IN my_purchases]) AS union_size  // UNION is a Cypher keyword, so avoid it as a name
WHERE intersection > 2  // Minimum overlap
WITH me, similar, intersection * 1.0 / union_size AS jaccard_similarity
ORDER BY jaccard_similarity DESC
LIMIT 30
 
// Get recommendations from similar users
MATCH (similar)-[:PURCHASED]->(rec:Product)
WHERE NOT (me)-[:PURCHASED]->(rec)
RETURN rec.id, rec.name, 
       sum(jaccard_similarity) AS weighted_score
ORDER BY weighted_score DESC
LIMIT 10
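
The Jaccard-weighted variant is easier to verify on a toy dataset. The sketch below reproduces the same steps (similarity, top-k neighbors, similarity-weighted votes) over a dict of purchase sets; the function names and the `min_overlap` default are assumptions for illustration and differ from the query's threshold.

```python
def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|, or 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def recommend_user_cf(purchases, user, min_overlap=1, top_k=30, limit=10):
    """purchases: dict mapping each user to the set of items they bought."""
    mine = purchases[user]
    neighbors = []
    for other, theirs in purchases.items():
        if other == user or len(mine & theirs) < min_overlap:
            continue
        neighbors.append((jaccard(mine, theirs), other))
    neighbors.sort(reverse=True)  # most similar users first

    scores = {}
    for similarity, other in neighbors[:top_k]:
        for item in purchases[other] - mine:  # only items I don't own yet
            scores[item] = scores.get(item, 0.0) + similarity
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:limit]


purchases = {
    "me": {"a", "b", "c"},
    "u1": {"a", "b", "d"},  # Jaccard with me: 2/4 = 0.5
    "u2": {"c", "d", "e"},  # Jaccard with me: 1/5 = 0.2
    "u3": {"z"},            # no overlap, ignored
}
recs = recommend_user_cf(purchases, "me")
```

Here item `d` is backed by both neighbors and scores about 0.7, while `e` is backed only by the weaker neighbor and scores 0.2.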

Item-Based Collaborative Filtering:

"Find items similar to what I liked, recommend those."

Item-based CF is often preferred because item similarity is more stable than user similarity (items don't change behavior like users do).

// Find items frequently co-purchased with items I bought
MATCH (me:User {id: $userId})-[:PURCHASED]->(myItem:Product)
      <-[:PURCHASED]-(other:User)-[:PURCHASED]->(coItem:Product)
WHERE NOT (me)-[:PURCHASED]->(coItem)
  AND myItem <> coItem
 
// Count co-purchase frequency
WITH coItem, count(DISTINCT other) AS co_purchase_count
RETURN coItem.id, coItem.name, 
       co_purchase_count AS frequently_bought_together
ORDER BY co_purchase_count DESC
LIMIT 10
 
// With lift calculation (co-occurrence beyond random chance)
MATCH (me:User {id: $userId})-[:PURCHASED]->(myItem:Product)
MATCH (myItem)<-[:PURCHASED]-(copurchaser:User)-[:PURCHASED]->(rec:Product)
WHERE NOT (me)-[:PURCHASED]->(rec) AND myItem <> rec
 
WITH rec, myItem, count(DISTINCT copurchaser) AS co_buyers
MATCH (anyBuyer:User)-[:PURCHASED]->(rec)
WITH rec, myItem, co_buyers, count(DISTINCT anyBuyer) AS rec_total_buyers
MATCH (anyBuyer2:User)-[:PURCHASED]->(myItem)
WITH rec, co_buyers, rec_total_buyers, count(DISTINCT anyBuyer2) AS my_item_buyers
MATCH (allUsers:User)
WITH rec, co_buyers, rec_total_buyers, my_item_buyers, count(allUsers) AS total_users
 
// Lift = P(A and B) / (P(A) * P(B))
WITH rec,
     (co_buyers * 1.0 / total_users) / 
     ((rec_total_buyers * 1.0 / total_users) * (my_item_buyers * 1.0 / total_users)) AS lift
WHERE lift > 1.5  // Lift above 1 means co-purchased more often than chance; 1.5 keeps only clear associations
RETURN rec.name, round(lift * 100) / 100 AS lift_score
ORDER BY lift_score DESC
LIMIT 10

Popularity Bias

Raw co-occurrence counts bias toward popular items—bestsellers co-occur with everything. Use lift, pointwise mutual information (PMI), or log-likelihood ratio to surface genuinely associated items, not just popular ones.
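
A small worked example shows why raw counts mislead. The numbers below are invented for illustration, and `lift` and `pmi` are hypothetical helper names; the formulas are the standard definitions, Lift = P(A and B) / (P(A) * P(B)) and PMI = log2(Lift).

```python
import math


def lift(co_buyers, buyers_a, buyers_b, total_users):
    """How much more often A and B are bought together than independence predicts."""
    p_ab = co_buyers / total_users
    p_a = buyers_a / total_users
    p_b = buyers_b / total_users
    return p_ab / (p_a * p_b)


def pmi(co_buyers, buyers_a, buyers_b, total_users):
    """Pointwise mutual information in bits: log2 of lift."""
    return math.log2(lift(co_buyers, buyers_a, buyers_b, total_users))


# 1000 users; item A has 20 buyers.
# Bestseller: 500 buyers, 12 of whom also bought A.
# Niche item: 30 buyers, 9 of whom also bought A.
bestseller_lift = lift(12, 20, 500, 1000)  # 0.012 / (0.02 * 0.5)  = 1.2
niche_lift = lift(9, 20, 30, 1000)         # 0.009 / (0.02 * 0.03) = 15.0
```

By raw co-purchase count the bestseller wins (12 vs. 9), but its lift of 1.2 is barely above chance, while the niche item's lift of 15 marks a genuine association.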

Content-Based Recommendations

Content-based filtering recommends items with similar attributes to items a user has liked. Unlike collaborative filtering, it works for new items with no interaction history (solving the item cold-start problem).

The Item Feature Graph:

// Movie with features
(movie:Movie {
  title: 'Inception',
  releaseYear: 2010,
  runtime: 148,
  budget: 160000000
})
(movie)-[:HAS_GENRE]->(genre:Genre {name: 'Sci-Fi'})
(movie)-[:HAS_GENRE]->(genre2:Genre {name: 'Thriller'})
(movie)-[:DIRECTED_BY]->(director:Person {name: 'Nolan'})
(movie)-[:STARRING]->(actor:Person {name: 'DiCaprio'})
(movie)-[:HAS_TAG]->(tag:Tag {name: 'mind-bending'})
(movie)-[:FROM_STUDIO]->(studio:Studio {name: 'Warner Bros'})
 
// Product with features
(product:Product {
  name: 'iPhone 15',
  price: 999,
  releaseDate: date('2023-09-22')
})
(product)-[:IN_CATEGORY]->(category:Category {name: 'Smartphones'})
(product)-[:MADE_BY]->(brand:Brand {name: 'Apple'})
(product)-[:HAS_FEATURE]->(feature:Feature {name: 'Face ID'})
(product)-[:HAS_SPEC {value: '48MP'}]->(spec:Spec {name: 'Camera'})

// Find movies similar to ones I've liked based on shared features
MATCH (me:User {id: $userId})-[r:RATED]->(liked:Movie)
WHERE r.score >= 4.0
 
// Extract features from liked movies
MATCH (liked)-[:HAS_GENRE]->(genre:Genre)
MATCH (liked)-[:DIRECTED_BY]->(director:Person)
MATCH (liked)-[:STARRING]->(actor:Person)
WITH me, 
     collect(DISTINCT genre) AS preferred_genres,
     collect(DISTINCT director) AS preferred_directors,
     collect(DISTINCT actor) AS preferred_actors,
     collect(DISTINCT liked) AS already_seen
 
// Find movies matching those features
MATCH (rec:Movie)
WHERE NOT rec IN already_seen
OPTIONAL MATCH (rec)-[:HAS_GENRE]->(g:Genre) WHERE g IN preferred_genres
OPTIONAL MATCH (rec)-[:DIRECTED_BY]->(d:Person) WHERE d IN preferred_directors
OPTIONAL MATCH (rec)-[:STARRING]->(a:Person) WHERE a IN preferred_actors
 
WITH rec, 
     count(DISTINCT g) AS genre_matches,
     count(DISTINCT d) AS director_matches,
     count(DISTINCT a) AS actor_matches
 
// Weighted feature matching
WITH rec,
     genre_matches * 2 + director_matches * 5 + actor_matches * 3 AS content_score
WHERE content_score > 3
RETURN rec.title, content_score
ORDER BY content_score DESC
LIMIT 20
 
// Jaccard similarity on feature sets
MATCH (me:User {id: $userId})-[:RATED {score: 5}]->(loved:Movie)
MATCH (loved)-[:HAS_GENRE|DIRECTED_BY|STARRING]->(feature)
WITH me, loved, collect(feature) AS loved_features
 
MATCH (rec:Movie)
WHERE rec <> loved
MATCH (rec)-[:HAS_GENRE|DIRECTED_BY|STARRING]->(rec_feature)
WITH loved, rec, loved_features, collect(rec_feature) AS rec_features
 
WITH rec,
     size([f IN rec_features WHERE f IN loved_features]) AS intersection,
     size(loved_features) + size(rec_features) AS union_approx
WITH rec, intersection * 1.0 / (union_approx - intersection) AS jaccard
WHERE jaccard > 0.3
RETURN rec.title, round(jaccard * 100) / 100 AS similarity
ORDER BY similarity DESC
LIMIT 10

Feature Engineering Matters

Content-based quality depends on feature richness. Netflix uses thousands of microgenres ('Mind-bending Sci-Fi', 'Visually-striking Dramas'). Rich feature graphs enable nuanced similarity. Consider: explicit features, derived tags, and embeddings-as-nodes.

Hybrid and Graph-Native Recommendations

The most effective recommendation systems combine multiple approaches. Graphs naturally enable hybrid recommendations by connecting users, items, and features in a unified model.

Hybrid Pattern: Multi-Path Recommendations

// Combine collaborative and content-based signals
MATCH (me:User {id: $userId})
 
// Path 1: Collaborative - what similar users bought
OPTIONAL MATCH (me)-[:PURCHASED]->(:Product)<-[:PURCHASED]-(similar:User)
              -[:PURCHASED]->(collab_rec:Product)
WHERE NOT (me)-[:PURCHASED]->(collab_rec)
WITH me, collab_rec, count(DISTINCT similar) AS collab_score
// Collapse each signal into one list so later matches don't multiply rows and inflate scores
WITH me, collect({item: collab_rec, score: collab_score * 3, source: 'collab'}) AS collab_recs
 
// Path 2: Content-based - similar to what I bought
OPTIONAL MATCH (me)-[:PURCHASED]->(bought:Product)-[:IN_CATEGORY]->(:Category)
              <-[:IN_CATEGORY]-(content_rec:Product)
WHERE NOT (me)-[:PURCHASED]->(content_rec)
WITH me, collab_recs, content_rec, count(DISTINCT bought) AS content_score
WITH me, collab_recs,
     collect({item: content_rec, score: content_score * 2, source: 'content'}) AS content_recs
 
// Path 3: Brand affinity - from brands I buy
OPTIONAL MATCH (me)-[:PURCHASED]->(owned:Product)-[:MADE_BY]->(:Brand)
              <-[:MADE_BY]-(brand_rec:Product)
WHERE NOT (me)-[:PURCHASED]->(brand_rec)
WITH collab_recs, content_recs, brand_rec, count(DISTINCT owned) AS brand_score
WITH collab_recs, content_recs,
     collect({item: brand_rec, score: brand_score * 1.5, source: 'brand'}) AS brand_recs
 
// Combine all recommendations
UNWIND collab_recs + content_recs + brand_recs AS rec
WITH rec
WHERE rec.item IS NOT NULL  // drop placeholders left by unmatched OPTIONAL MATCHes
 
// Aggregate scores per item
WITH rec.item AS item, sum(rec.score) AS total_score, 
     collect(DISTINCT rec.source) AS sources
RETURN item.name, total_score, sources
ORDER BY total_score DESC
LIMIT 15

Graph-Native: Leveraging Graph Algorithms

Graph algorithms provide powerful recommendation signals unavailable to other approaches:

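// 1. Node Similarity (GDS): items with similar buyer profiles
// Node Similarity compares nodes by their outgoing neighbors, so project
// PURCHASED reversed (Product -> User) to compare products by who bought them
CALL gds.graph.project(
  'productBuyers',
  ['User', 'Product'],
  {PURCHASED: {type: 'PURCHASED', orientation: 'REVERSE'}}
)
 
CALL gds.nodeSimilarity.stream('productBuyers', {topK: 10})
YIELD node1, node2, similarity
RETURN gds.util.asNode(node1).name AS product1,
       gds.util.asNode(node2).name AS product2,
       similarity
 
// Natural-orientation projection (User -> Product) used by the algorithms below
CALL gds.graph.project(
  'purchaseGraph',
  ['User', 'Product'],
  {PURCHASED: {type: 'PURCHASED', orientation: 'NATURAL'}}
)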
 
// 2. PageRank for item authority
// On a User -> Product projection, a product's score grows with its number of buyers;
// each buyer's contribution is split across everything that buyer purchased
CALL gds.pageRank.stream('purchaseGraph', {
  relationshipWeightProperty: null
})
YIELD nodeId, score
WHERE gds.util.asNode(nodeId):Product
RETURN gds.util.asNode(nodeId).name AS product, score AS authority
ORDER BY authority DESC
LIMIT 20
 
// 3. Community detection for user segments
CALL gds.louvain.stream('purchaseGraph')
YIELD nodeId, communityId
WHERE gds.util.asNode(nodeId):User
WITH communityId, gds.util.asNode(nodeId) AS u
// Expand from each community member directly instead of scanning all nodes
MATCH (u)-[:PURCHASED]->(popular:Product)
WITH communityId, popular, count(*) AS purchases
ORDER BY communityId, purchases DESC
RETURN communityId, 
       collect(popular.name)[..5] AS segment_favorites

Real-Time vs. Batch

Pre-compute expensive graph algorithms (PageRank, community detection) in batch. Store results as node properties. Real-time queries use these pre-computed features combined with live interaction data. This balance provides rich signals with low latency.

Production Considerations at Scale

Deploying social and recommendation features at scale requires careful architecture beyond just graph queries.

Caching Strategy:

┌─────────────────────────────────────────────────────────────────┐
│                        REQUEST FLOW                             │
└─────────────────────────────────┬───────────────────────────────┘
                                  │
        ┌─────────────────────────▼──────────────────────────┐
        │              APPLICATION CACHE (Redis)              │
        │    Personalized feeds, PYMK results, user prefs     │
        │    TTL: 5-15 minutes for feeds, 1hr for PYMK        │
        └─────────────────────────┬──────────────────────────┘
                                  │ Cache Miss
        ┌─────────────────────────▼──────────────────────────┐
        │           PRE-COMPUTED RESULTS (Redis/DB)           │
        │    Batch-generated recommendations, similarities    │
        │    Refreshed: hourly for active users, daily others │
        └─────────────────────────┬──────────────────────────┘
                                  │ Not Pre-computed
        ┌─────────────────────────▼──────────────────────────┐
        │              REAL-TIME GRAPH QUERY                  │
        │    Neo4j with bounded traversals, timeouts          │
        │    Fallback: popularity-based defaults              │
        └─────────────────────────────────────────────────────┘
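
The request flow in the diagram can be sketched as a chain of lookups. This is a simplified illustration rather than a reference design: an in-process `TTLCache` class stands in for Redis, `graph_query` is any callable that may raise `TimeoutError`, and all names are assumptions made for the example.

```python
import time


class TTLCache:
    """Tiny in-memory cache with per-entry expiry (stand-in for Redis)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock
        self._store = {}

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self.clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key, value):
        self._store[key] = (value, self.clock() + self.ttl)


def get_recommendations(user_id, cache, precomputed, graph_query, popular):
    """Return (items, source), trying the cheapest tier first."""
    cached = cache.get(user_id)
    if cached is not None:
        return cached, "cache"
    if user_id in precomputed:
        items, source = precomputed[user_id], "precomputed"
    else:
        try:
            items, source = graph_query(user_id), "graph"
        except TimeoutError:
            # Serve defaults but don't cache them, so the next request retries
            return popular, "fallback"
    cache.set(user_id, items)
    return items, source
```

Not caching the fallback is a deliberate choice in this sketch: a transient timeout should not pin a user to generic results for the whole TTL.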

Handling High-Degree Nodes:

Celebrities with millions of followers ("superconnectors") break naive graph algorithms. Solutions:

Supernode Strategies

• Early termination: Stop traversal after N results (LIMIT mid-query)
• Sampling: Randomly sample edges from high-degree nodes
• Degree filtering: Exclude nodes above degree threshold from certain algorithms
• Separate modeling: Handle celebrities differently (broadcast model vs. graph)
• Pre-materialization: Store aggregated data for supernodes

// Skip high-degree nodes in traversal
MATCH (me:User {id: $userId})-[:FOLLOWS]->(following:User)
WHERE following.followerCount < 1000000  // Exclude celebrities
MATCH (following)-[:FOLLOWS]->(suggestion:User)
WHERE NOT (me)-[:FOLLOWS]->(suggestion)
  AND me <> suggestion  // don't suggest the user to themselves via a follow-back
RETURN suggestion.name, count(*) AS via_count
ORDER BY via_count DESC
LIMIT 20
 
// Sample relationships from supernodes
MATCH (me:User {id: $userId})-[:FOLLOWS]->(celebrity:User)
WHERE celebrity.followerCount > 1000000
WITH me, celebrity, rand() AS r
ORDER BY r
LIMIT 10  // Sample 10 random celebrity follows
MATCH (celebrity)-[:POSTED]->(post:Post)
WHERE post.createdAt > datetime() - duration('P1D')
RETURN post.id
LIMIT 5
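
Edge sampling can also be applied in application code before a traversal fans out. The sketch below caps how many neighbors are expanded per node; `sample_neighbors`, the adjacency-list shape, and the `max_degree` default are assumptions for illustration.

```python
import random


def sample_neighbors(adjacency, node, max_degree=1000, rng=None):
    """Return all neighbors of `node`, or a uniform random sample if there are too many.

    adjacency: dict mapping a node to a list of neighbor ids.
    rng: optional random.Random instance, useful for reproducible runs.
    """
    rng = rng or random.Random()
    neighbors = list(adjacency.get(node, []))
    if len(neighbors) <= max_degree:
        return neighbors
    # Sampling without replacement bounds the fan-out from a supernode
    return rng.sample(neighbors, max_degree)


adjacency = {"celebrity": list(range(10_000)), "regular_user": [1, 2, 3]}
capped = sample_neighbors(adjacency, "celebrity", max_degree=100, rng=random.Random(42))
```

The trade-off is that sampled traversals give approximate counts, which is usually acceptable for suggestions but not for exact metrics such as follower totals.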

Typical Production SLAs

Feature	Latency Target	Refresh Rate	Fallback
Social Feed	< 100ms	Real-time + 5min cache	Chronological
PYMK	< 200ms	Every 1-6 hours	Popular users
Product Recommendations	< 150ms	Daily batch + real-time signals	Bestsellers
Similar Items	< 50ms	Daily pre-compute	Same category items

Monitor and Iterate

Track recommendation quality with metrics: click-through rate (CTR), conversion, engagement time, and diversity. A/B test algorithm changes. Use feedback loops—clicks and purchases generate signals that improve future recommendations.
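
Two of these metrics are simple enough to compute directly from logs. The helpers below are a sketch: `click_through_rate` is the usual clicks-over-impressions ratio, and `catalog_coverage` is one possible diversity measure (the share of the catalog that appears in anyone's recommendations), chosen here as an assumption rather than taken from the text.

```python
def click_through_rate(impressions, clicks):
    """Fraction of shown recommendations that were clicked."""
    return clicks / impressions if impressions else 0.0


def catalog_coverage(recommendation_lists, catalog_size):
    """Share of the catalog that appears in at least one user's recommendations."""
    if catalog_size == 0:
        return 0.0
    distinct_items = {item for recs in recommendation_lists for item in recs}
    return len(distinct_items) / catalog_size


ctr = click_through_rate(impressions=200, clicks=30)             # 30 / 200 = 0.15
coverage = catalog_coverage([["a", "b"], ["b", "c"]], 10)        # {a, b, c} of 10 items = 0.3
```

Comparing both numbers between the control and treatment arms of an A/B test shows whether a change raised clicks by narrowing what users see.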

Summary: Social Networks and Recommendations

Social networking and recommendation engines represent the flagship applications of graph databases—domains where the relationship-centric model provides decisive advantages. Let's consolidate the key insights:

Key Takeaways

• Social graphs model people and connections: users, posts, and groups as nodes; follows, likes, and memberships as edges. Relationship properties capture context (since, role, engagement level).
• PYMK leverages mutual connections: friends of friends who aren't already connected. Enhance with workplace, school, and group signals. Pre-compute for scale.
• Feeds balance recency and relevance: chronological is simple; ranked feeds use engagement and relationship strength. Diversify to avoid filter bubbles.
• Collaborative filtering finds similar users and items: co-purchase and co-rating patterns. Jaccard similarity measures overlap; lift scores counter popularity bias.
• Content-based filtering uses item features: recommend based on genres, tags, and attributes. Solves item cold-start. Rich feature graphs enable nuanced matching.
• Hybrid approaches combine signals: multi-path queries through user similarity, item similarity, and content matching. Graph algorithms (PageRank, community detection) add powerful features.
• Cache extensively: pre-compute in batch; cache personalized results. Real-time queries for cache misses with bounded complexity.
• Handle supernodes specially: celebrities and bestsellers break naive algorithms. Sample, filter, or model separately.

What's Next:

We'll complete the graph database module by examining use cases and trade-offs—when to choose graph databases, their limitations, and how they fit into polyglot persistence architectures. This will give you the decision framework for your own systems.

Social and Recommendation Mastery

You now understand how graph databases power social networks and recommendation engines—the applications that made graph databases mainstream. These patterns apply across domains wherever human connections and item preferences drive value.