Graph Databases: When Relationships Are the Data
PayPal catches fraudsters by finding accounts that share device IDs with accounts that share phone numbers with accounts linked to known bad actors — three hops deep, across 400 million users, in milliseconds. In SQL that is five self-joins on a billion-row table. In a graph database it is one pattern match. That is the difference between catching fraud and letting it walk out the door.
The Problem That Graphs Solve
Most databases treat relationships as an afterthought — foreign keys, JOIN tables, nested arrays. The relationship is something you reconstruct at query time by scanning large tables and matching IDs. Graph databases flip that model entirely: the relationship is a first-class citizen, stored explicitly as a physical pointer between two nodes. Traversal follows those pointers directly — no JOIN, no index lookup, no full table scan.
Index-Free Adjacency — The Core Superpower
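A SQL JOIN resolves each relationship at query time: for every hop, the database looks up matching IDs in an index, typically an O(log n) search whose cost grows with table size, and multi-hop self-joins on large tables compound that cost. In a native graph database such as Neo4j, each relationship is stored as a direct pointer to the next node. Traversing from A to B to C costs roughly the same whether the graph holds 1,000 nodes or 1 billion, because the database follows pointers rather than searching; the cost depends on how many relationships the traversal touches, not on the total size of the graph. (An index is still typically used once, to find the starting node.) This property is called index-free adjacency.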
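The contrast can be sketched in plain Python. This is only an illustration of the idea, not how Neo4j or any SQL engine is implemented; the names `follows_index`, `neighbours_via_index`, and `two_hops` are made up for the example.

```python
from bisect import bisect_left

# Relational-style storage: one sorted "index" of (source, target) pairs.
# Every hop has to search this index to find a node's neighbours.
follows_index = sorted([("alice", "bob"), ("bob", "carol"), ("carol", "dave")])

def neighbours_via_index(node):
    """Binary-search the index: each lookup costs O(log n) in the table size."""
    i = bisect_left(follows_index, (node, ""))
    result = []
    while i < len(follows_index) and follows_index[i][0] == node:
        result.append(follows_index[i][1])
        i += 1
    return result

# Graph-style storage: each node holds direct references to its neighbours.
class Node:
    def __init__(self, name):
        self.name = name
        self.out = []  # direct references to neighbouring Node objects

alice, bob, carol, dave = (Node(n) for n in ("alice", "bob", "carol", "dave"))
alice.out.append(bob)
bob.out.append(carol)
carol.out.append(dave)

def two_hops(start):
    """Follow references for two hops; no search over the whole data set."""
    return [second.name for first in start.out for second in first.out]

print(neighbours_via_index("alice"))  # ['bob']
print(two_hops(alice))                # ['carol']
```

Adding a million unrelated rows to `follows_index` makes every lookup a little slower, while `two_hops(alice)` still touches only the two edges it follows.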
The Core Concepts — Nodes, Edges, and Properties
Everything in a graph database is one of three things:
Nodes (Vertices)
The entities. A Person, a Product, a City, a Transaction. Each node has a label (its type) and carries any number of key-value properties like name: "Alice" or age: 29.
Edges (Relationships)
The connections. FOLLOWS, PURCHASED, LIVES_IN. Every edge has a direction, a type, and can carry its own properties — like since: 2023-01 or amount: 149.99.
Properties
Key-value data attached to either nodes or edges. What makes graphs expressive — the relationship itself can hold data, not just the entities it connects.
Graph Model — Social Network Example
Nodes hold entity data. Edges hold relationship data. Traversal follows edges as direct pointers — no JOIN tables, no index lookups.
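As a rough mental model, the three building blocks map onto a few lines of Python. The class and function names here (`Node`, `Edge`, `connect`) are invented for illustration and do not correspond to any Neo4j API.

```python
from dataclasses import dataclass, field

@dataclass(eq=False)
class Node:
    label: str                                     # the node's type, e.g. "Person"
    properties: dict = field(default_factory=dict)
    outgoing: list = field(default_factory=list)   # Edge objects leaving this node

@dataclass(eq=False)
class Edge:
    rel_type: str                                  # e.g. "FOLLOWS"
    source: Node
    target: Node
    properties: dict = field(default_factory=dict)

def connect(source, rel_type, target, **properties):
    """Create a directed edge and store it on the source node."""
    edge = Edge(rel_type, source, target, properties)
    source.outgoing.append(edge)
    return edge

alice = Node("Person", {"name": "Alice", "age": 29})
bob = Node("Person", {"name": "Bob"})
follows = connect(alice, "FOLLOWS", bob, since="2023-01")

for edge in alice.outgoing:
    print(alice.properties["name"], edge.rel_type,
          edge.target.properties["name"], edge.properties)
# Alice FOLLOWS Bob {'since': '2023-01'}
```

The `since` value lives on the edge itself, which is the part a foreign key cannot express without an extra table.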
Where Graph Databases Win Every Time
Fraud Detection
Find accounts sharing device IDs, IP addresses, or phone numbers across multiple hops. Fraud rings hide in relationship patterns that SQL aggregates miss entirely. Three hops in graph = five self-joins in SQL.
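A multi-hop search of this kind can be sketched as a breadth-first traversal. The account data below is hypothetical, and `build_links` and `accounts_within` are illustrative helpers rather than part of any fraud-detection product.

```python
from collections import defaultdict, deque

# Hypothetical sample data: the identifiers each account has used.
account_identifiers = {
    "acct_1": {"device:A", "phone:111"},
    "acct_2": {"device:A", "phone:222"},
    "acct_3": {"phone:222", "ip:10.0.0.3"},
    "acct_4": {"ip:10.0.0.3"},
    "acct_5": {"device:Z"},
}

def build_links(identifiers_by_account):
    """Link every pair of accounts that share at least one identifier."""
    by_identifier = defaultdict(set)
    for account, identifiers in identifiers_by_account.items():
        for identifier in identifiers:
            by_identifier[identifier].add(account)
    links = defaultdict(set)
    for accounts in by_identifier.values():
        for account in accounts:
            links[account] |= accounts - {account}
    return links

def accounts_within(links, start, max_hops):
    """Breadth-first search: map each reachable account to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbour in links[current]:
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    del distance[start]
    return distance

links = build_links(account_identifiers)
print(sorted(accounts_within(links, "acct_4", 3).items()))
# [('acct_1', 3), ('acct_2', 2), ('acct_3', 1)]
```

If `acct_4` is a known bad actor, `acct_1` is flagged three hops away even though the two accounts share no identifier directly.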
Recommendation Engines
"Users who bought X also bought Y" — traversing purchase graphs at the speed users expect. Collaborative filtering queries that take 6 SQL JOINs become a single 6-line Cypher pattern match.
Knowledge Graphs
Connecting entities across domains — Google's Knowledge Graph, drug interaction databases, supply chain maps. The product is the relationships themselves, not just the entities they connect.
Access Control & IAM
Who has access to what through which roles and groups? Hierarchical permission graphs that would require recursive CTEs in SQL are natural traversals in a graph database.
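The "through which roles and groups" question is a path search. A minimal sketch, assuming a made-up permission graph stored as a dictionary (`grants`) and a hypothetical helper `access_path`:

```python
# Hypothetical permission graph: each key points to the groups, roles,
# or resources it is directly connected to.
grants = {
    "user:alice": ["group:engineering"],
    "group:engineering": ["group:staff", "role:deployer"],
    "group:staff": ["role:wiki-reader"],
    "role:deployer": ["resource:prod-cluster"],
    "role:wiki-reader": ["resource:wiki"],
    "user:bob": ["group:staff"],
}

def access_path(graph, principal, resource):
    """Depth-first search: return one path from principal to resource, or None."""
    stack = [[principal]]
    seen = {principal}
    while stack:
        path = stack.pop()
        node = path[-1]
        if node == resource:
            return path
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(path + [nxt])
    return None

print(access_path(grants, "user:alice", "resource:prod-cluster"))
# ['user:alice', 'group:engineering', 'role:deployer', 'resource:prod-cluster']
print(access_path(grants, "user:bob", "resource:prod-cluster"))
# None
```

The returned path doubles as an audit trail: it shows not only that Alice has access, but which group and role grant it.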
Hands-on — Building a Streaming Recommendation Graph
The scenario: You are a backend engineer at a streaming platform. The recommendation team wants "shows watched by people like you" served in under 50ms. The current SQL approach — JOIN users → watch_history → shows → genre_tags — takes 800ms on a 5-million-user dataset. Your tech lead says it is time to prototype a graph model. You are building the proof of concept in Neo4j.
# Connect via Cypher shell (Neo4j must be running on localhost:7687)
cypher-shell -u neo4j -p yourpassword
# Or open the browser UI at:
# http://localhost:7474
Connected to Neo4j using Bolt protocol version 5.0 at neo4j://localhost:7687 as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@neo4j>
cypher-shell -u neo4j -p yourpassword
Opens the Cypher command-line shell. Cypher is Neo4j's query language, designed to look like the ASCII art you would draw to describe a graph: nodes are written in parentheses (), relationships in square brackets []. Bolt is the binary protocol that clients such as cypher-shell and the official drivers use to talk to the Neo4j server; it is generally more efficient than the HTTP API for query traffic.
The scenario continues: You create nodes for users and shows, then build WATCHED relationships between them. Cypher's ASCII-art syntax means the query literally looks like the graph pattern you are trying to describe.
// Create two users and two shows
CREATE (:User {id: 'u1', name: 'Alice', plan: 'premium'});
CREATE (:User {id: 'u2', name: 'Bob', plan: 'basic'});
CREATE (:Show {id: 's1', title: 'Cosmos', genre: 'documentary'});
CREATE (:Show {id: 's2', title: 'Dark', genre: 'thriller'});
Added 1 label, created 1 node, set 3 properties, completed in 12 ms.
Added 1 label, created 1 node, set 3 properties, completed in 3 ms.
Added 1 label, created 1 node, set 3 properties, completed in 3 ms.
Added 1 label, created 1 node, set 3 properties, completed in 2 ms.
(:User {id: 'u1', name: 'Alice', plan: 'premium'})
Parentheses define a node. :User is the label — like a type or class. Curly braces contain properties. There is no fixed schema — you could add country to one user node and leave it off another. Schema-optional, just like document stores.
Added 1 label, created 1 node, set 3 properties
Neo4j reports exactly what changed. If you ran this twice with the same id, you would get two separate nodes — CREATE always creates. To upsert (create only if not exists), use MERGE instead.
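The difference between the two behaviours can be modelled with a toy in-memory store. `TinyGraph` is a hypothetical class written for this explanation; it mimics the semantics described above and is not how Neo4j implements them.

```python
class TinyGraph:
    """Toy node store that mimics CREATE vs MERGE semantics."""

    def __init__(self):
        self.nodes = []

    def create(self, label, **properties):
        """Always add a new node, even if an identical one exists."""
        node = {"label": label, **properties}
        self.nodes.append(node)
        return node

    def merge(self, label, **properties):
        """Return a node matching the label and all given properties, else create it."""
        for node in self.nodes:
            if node["label"] == label and all(
                node.get(key) == value for key, value in properties.items()
            ):
                return node
        return self.create(label, **properties)

created = TinyGraph()
created.create("User", id="u1", name="Alice")
created.create("User", id="u1", name="Alice")

merged = TinyGraph()
merged.merge("User", id="u1")
merged.merge("User", id="u1")

print(len(created.nodes), len(merged.nodes))  # 2 1
```

Note that the match covers every property passed to `merge`. In Cypher the same applies to the pattern given to MERGE, so it is usually safest to merge on the identifying key alone, for example `MERGE (u:User {id: 'u1'})`, and set the remaining properties afterwards.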
The scenario continues: Now you connect users to the shows they have watched. Each WATCHED relationship carries a timestamp and a rating — that data lives on the edge itself, not in a separate join table.
// Create WATCHED relationships with properties on the edge
MATCH (u:User {id: 'u1'}), (s:Show {id: 's1'})
CREATE (u)-[:WATCHED {at: '2025-01-10', rating: 5}]->(s);
MATCH (u:User {id: 'u1'}), (s:Show {id: 's2'})
CREATE (u)-[:WATCHED {at: '2025-01-11', rating: 4}]->(s);
MATCH (u:User {id: 'u2'}), (s:Show {id: 's1'})
CREATE (u)-[:WATCHED {at: '2025-01-09', rating: 5}]->(s);
Created 1 relationship, set 2 properties, completed in 5 ms.
Created 1 relationship, set 2 properties, completed in 3 ms.
Created 1 relationship, set 2 properties, completed in 3 ms.
MATCH (u:User {id: 'u1'}), (s:Show {id: 's1'})
MATCH finds existing nodes — equivalent to a SELECT. It binds the found nodes to variables u and s. If either node does not exist, nothing is created and no error is raised. You need the nodes before you can connect them.
(u)-[:WATCHED {at: '2025-01-10', rating: 5}]->(s)
The ASCII-art relationship syntax. -[]-> draws a directed edge from u to s. WATCHED is the relationship type. It carries two properties directly on the edge — at and rating. In SQL, this metadata would require a separate join table row.
The scenario continues: The product manager wants "you might also like" recommendations. The logic: find shows watched by users who also watched the same shows as Alice, but that Alice has not seen yet. In SQL this is six JOINs and a NOT EXISTS subquery. In Cypher it is one pattern match.
// Recommendation: shows watched by Alice's taste-twins that Alice hasn't seen
MATCH (alice:User {name: 'Alice'})-[:WATCHED]->(show:Show)
      <-[:WATCHED]-(other:User)-[:WATCHED]->(rec:Show)
WHERE NOT (alice)-[:WATCHED]->(rec)
RETURN rec.title AS recommendation,
       count(other) AS score
ORDER BY score DESC
LIMIT 10;
+----------------------------+
| recommendation    | score |
+----------------------------+
| "Breaking Bad"    | 4     |
| "Planet Earth II" | 3     |
| "Mindhunter"      | 2     |
+----------------------------+

3 rows available after 38 ms
(alice)-[:WATCHED]->(show)<-[:WATCHED]-(other)
A two-hop traversal in a single pattern. Alice watched a show. Another user also watched that same show. The <- arrow marks an incoming edge. Neo4j expands every path that matches the pattern, so you write no loops in application code and no JOINs.
WHERE NOT (alice)-[:WATCHED]->(rec)
A negative pattern predicate — filters out shows Alice already watched. Expressing this in SQL requires a NOT EXISTS correlated subquery. In Cypher it reads exactly like an English sentence: "where Alice has not watched the recommendation."
count(other) AS score
Counts the matching paths that reach each recommendation. A user who shares two shows with Alice contributes two paths, so heavier overlap gives a stronger signal; to count each user only once, use count(DISTINCT other). This is collaborative filtering in a handful of lines of Cypher, where the SQL version needs a chain of JOINs and a NOT EXISTS subquery.
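For readers who think more easily in loops than in patterns, the same logic can be mirrored in plain Python. This is a sketch over a hypothetical `watched` dictionary, not a replacement for the Cypher query; the `distinct_users` flag switches between counting one point per matching path and one point per user.

```python
from collections import Counter

# Hypothetical watch history: user name -> set of show titles.
watched = {
    "Alice": {"Cosmos", "Dark"},
    "Bob": {"Cosmos", "Mindhunter"},
    "Carol": {"Cosmos", "Dark", "Mindhunter", "Planet Earth II"},
    "Dave": {"Breaking Bad"},
}

def recommend(watched_by_user, user, limit=10, distinct_users=False):
    """Score shows the user has not seen by overlap with other users' history."""
    seen = watched_by_user[user]
    scores = Counter()
    for other, shows in watched_by_user.items():
        shared = seen & shows
        if other == user or not shared:
            continue  # skip the user themselves and users with no overlap
        # One point per shared show (per path), or one point per user.
        weight = 1 if distinct_users else len(shared)
        for show in shows - seen:
            scores[show] += weight
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]

print(recommend(watched, "Alice"))
# [('Mindhunter', 3), ('Planet Earth II', 2)]
print(recommend(watched, "Alice", distinct_users=True))
# [('Mindhunter', 2), ('Planet Earth II', 1)]
```

Dave never appears in the scores because he shares no show with Alice, which is exactly what the two-hop pattern enforces in the graph query.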
Types of Graph Databases
Not all graph databases work the same way. Two families dominate:
| Type | Model | Examples | Best for |
|---|---|---|---|
| Labeled Property Graph | Nodes + typed edges, both with properties | Neo4j, Amazon Neptune, TigerGraph | Social, fraud, recommendations, IAM |
| RDF / Triple Store | Subject → Predicate → Object triples | Amazon Neptune (RDF), GraphDB, Stardog | Knowledge graphs, semantic web, ontologies |
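The practical difference shows up when a relationship needs its own data. The sketch below uses plain Python tuples and dictionaries rather than a real RDF library, and the `to_triples` helper and the intermediate-node approach are one common workaround chosen for illustration, not the only way a triple store can model this.

```python
# Labeled property graph: the edge itself carries the metadata.
lpg_edge = {
    "type": "PAID",
    "from": "account:1",
    "to": "account:2",
    "properties": {"amount": 149.99, "channel": "mobile"},
}

def to_triples(edge, edge_id):
    """Express one property-graph edge as plain (subject, predicate, object) triples.

    A bare triple has no slot for extra data, so the relationship becomes
    its own node (edge_id) and every detail hangs off that node.
    """
    triples = [
        (edge_id, "type", edge["type"]),
        (edge_id, "from", edge["from"]),
        (edge_id, "to", edge["to"]),
    ]
    triples += [(edge_id, key, value) for key, value in edge["properties"].items()]
    return triples

for triple in to_triples(lpg_edge, "txn:1"):
    print(triple)
```

One edge with two properties becomes five triples, which is the extra indirection that makes property graphs the more natural fit for edge-heavy metadata.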
When Graphs Are the Wrong Tool
Heavy Aggregations
Summing revenue across billions of transactions. SQL or columnar stores (BigQuery, Redshift) vastly outperform graph databases on GROUP BY at scale.
Tabular / Flat Data
Product catalogues, financial ledgers, user tables with flat attributes — if there is no meaningful relationship structure, you are fighting the model to store simple rows.
Massive Write Throughput
IoT sensor streams at millions of writes per second. Graph databases maintain complex pointer structures — high-velocity append workloads are not their strength.
Teacher's Note
In production systems, graph databases almost always live alongside relational or document stores — not instead of them. A common pattern: Postgres for transactional data, MongoDB for product content, Neo4j only for relationship-heavy features like recommendations or fraud detection. The graph covers maybe 10% of the data but 90% of the performance pain. Use it as a specialist tool, not a general replacement.
Practice Questions — You're the Engineer
Scenario:
CREATE statements are creating duplicate nodes on every run. Which single Cypher command should replace CREATE to create the node only when it does not already exist?
Scenario:
timestamp, an amount, and a channel property. A colleague suggests using an RDF triple store. You push back — because RDF edges (predicates) cannot carry multiple properties, the transaction metadata would require separate triples and extra indirection. What graph model natively supports properties on edges and is the better fit here?