NoSQL Lesson 20 – Graph Databases | Dataplexa
NoSQL Database Types · Lesson 20

Graph Databases: When Relationships Are the Data

PayPal catches fraudsters by finding accounts that share device IDs with accounts that share phone numbers with accounts linked to known bad actors — three hops deep, across 400 million users, in milliseconds. In SQL that is five self-joins on a billion-row table. In a graph database it is one pattern match. That is the difference between catching fraud and letting it walk out the door.

The Problem That Graphs Solve

Most databases treat relationships as an afterthought: foreign keys, JOIN tables, nested arrays. The relationship is something you reconstruct at query time by matching IDs across tables. Graph databases flip that model: the relationship is a first-class citizen, and native graph stores such as Neo4j keep it as a direct reference between two nodes. Once the query has found its starting node (usually through an index), traversal follows those references directly, with no JOIN and no per-hop index lookup.

Index-Free Adjacency — The Core Superpower

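A SQL JOIN resolves each relationship at query time by looking up matching IDs, usually through an index, so every hop gets more expensive as the tables grow and a multi-hop query pays that cost at each level. In a native graph database, each relationship is stored as a direct reference to the next node. Traversing from A to B to C costs roughly the same whether the graph holds 1,000 nodes or 1 billion, because the work depends on how many relationships the query touches, not on the total size of the graph. The database follows the reference; it does not search. This property is called index-free adjacency.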
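The contrast can be sketched in plain Python. This is a toy in-memory model, not Neo4j's actual storage format: `Node`, `hop`, and `hop_via_edge_table` are hypothetical names, and the edge-table version deliberately has no index, so it shows the worst case rather than a tuned relational plan.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    # Direct references to neighbouring Node objects (the "pointers").
    out: list = field(default_factory=list)


def hop(start, hops):
    """Follow outgoing references; work is proportional to the nodes visited."""
    frontier = {start.name: start}
    for _ in range(hops):
        frontier = {nxt.name: nxt for cur in frontier.values() for nxt in cur.out}
    return sorted(frontier)


def hop_via_edge_table(edges, start, hops):
    """Same traversal over a flat (src, dst) table: each hop scans every row."""
    frontier = {start}
    for _ in range(hops):
        frontier = {dst for src, dst in edges if src in frontier}
    return sorted(frontier)


a, b, c = Node("A"), Node("B"), Node("C")
a.out.append(b)
b.out.append(c)
edges = [("A", "B"), ("B", "C")]

print(hop(a, 2))                          # ['C']
print(hop_via_edge_table(edges, "A", 2))  # ['C']
```

Both functions return the same answer. The difference is that `hop` never looks at nodes outside the path, while the table version rereads the whole edge list on every hop.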

The Core Concepts — Nodes, Edges, and Properties

Everything in a graph database is one of three things:

Nodes (Vertices)

The entities. A Person, a Product, a City, a Transaction. Each node has a label (its type) and carries any number of key-value properties like name: "Alice" or age: 29.

Edges (Relationships)

The connections. FOLLOWS, PURCHASED, LIVES_IN. Every edge has a direction, a type, and can carry its own properties — like since: 2023-01 or amount: 149.99.

Properties

Key-value data attached to either nodes or edges. This is what makes graphs expressive: the relationship itself can hold data, not just the entities it connects.
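As a rough illustration, the three building blocks map onto two small Python classes. This is only a sketch of the data model: `Node`, `Edge`, and `rel_type` are hypothetical names, not a Neo4j API.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                  # the node's type, e.g. "Person"
    props: dict = field(default_factory=dict)   # key-value properties


@dataclass
class Edge:
    start: Node                                 # edges are directed: start -> end
    rel_type: str                               # e.g. "FOLLOWS"
    end: Node
    props: dict = field(default_factory=dict)   # edges carry properties too


alice = Node("Person", {"name": "Alice", "age": 29})
bob = Node("Person", {"name": "Bob"})
follows = Edge(alice, "FOLLOWS", bob, {"since": "2023-01"})

summary = (
    f"({follows.start.props['name']})"
    f"-[{follows.rel_type} since {follows.props['since']}]->"
    f"({follows.end.props['name']})"
)
print(summary)  # (Alice)-[FOLLOWS since 2023-01]->(Bob)
```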

Graph Model — Social Network Example

[Figure: example graph with nodes Person (Alice), Person (Bob), City (London), and Product (GraphDB Book), connected by FOLLOWS, LIVES_IN, PURCHASED, and LOCATED_IN edges.]

Nodes hold entity data. Edges hold relationship data. Traversal follows edges as direct pointers — no JOIN tables, no index lookups.

Where Graph Databases Win Every Time

Fraud Detection

Find accounts sharing device IDs, IP addresses, or phone numbers across multiple hops. Fraud rings hide in relationship patterns that SQL aggregates miss entirely. Three hops in graph = five self-joins in SQL.
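A bounded multi-hop search like this can be sketched as a breadth-first walk in plain Python. The account names and links below are hypothetical, and a hop here means one account-to-account link through a shared identifier. A graph database would run the equivalent traversal natively.

```python
from collections import deque

# Hypothetical account links: each pair shares a device ID, phone number, or IP.
links = [
    ("acct1", "acct2"),   # share a device ID
    ("acct2", "acct3"),   # share a phone number
    ("acct3", "bad1"),    # linked to a known bad actor
    ("acct9", "acct8"),   # unrelated pair
]

# Build an undirected adjacency map once, so each hop is a dictionary lookup.
adjacent = {}
for left, right in links:
    adjacent.setdefault(left, set()).add(right)
    adjacent.setdefault(right, set()).add(left)


def within_hops(start, max_hops):
    """Breadth-first search: map each reachable account to its hop distance."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if distance[current] == max_hops:
            continue  # do not expand past the hop limit
        for neighbour in adjacent.get(current, ()):
            if neighbour not in distance:
                distance[neighbour] = distance[current] + 1
                queue.append(neighbour)
    del distance[start]
    return distance


ring = within_hops("bad1", 3)
print(sorted(ring.items()))  # [('acct1', 3), ('acct2', 2), ('acct3', 1)]
```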

Recommendation Engines

"Users who bought X also bought Y" — traversing purchase graphs at the speed users expect. Collaborative filtering queries that take 6 SQL JOINs become a single 6-line Cypher pattern match.

Knowledge Graphs

Connecting entities across domains — Google's Knowledge Graph, drug interaction databases, supply chain maps. The product is the relationships themselves, not just the entities they connect.

Access Control & IAM

Who has access to what through which roles and groups? Hierarchical permission graphs that would require recursive CTEs in SQL are natural traversals in a graph database.
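The permission question reduces to reachability: can you get from a user to a resource by following edges? A minimal sketch, assuming a hypothetical graph of MEMBER_OF, HAS_ROLE, and CAN_ACCESS edges (the names and data are illustrative only):

```python
# Hypothetical permission graph: (source, relationship type, target).
edges = [
    ("alice", "MEMBER_OF", "backend-team"),
    ("backend-team", "MEMBER_OF", "engineering"),
    ("engineering", "HAS_ROLE", "deployer"),
    ("deployer", "CAN_ACCESS", "prod-cluster"),
    ("bob", "MEMBER_OF", "sales"),
]

outgoing = {}
for source, _rel, target in edges:
    outgoing.setdefault(source, []).append(target)


def can_reach(principal, resource):
    """Depth-first walk along outgoing edges; `seen` guards against cycles."""
    stack, seen = [principal], set()
    while stack:
        current = stack.pop()
        if current == resource:
            return True
        if current in seen:
            continue
        seen.add(current)
        stack.extend(outgoing.get(current, []))
    return False


print(can_reach("alice", "prod-cluster"))  # True
print(can_reach("bob", "prod-cluster"))    # False
```

The walk handles any nesting depth without changes, which is the part that needs a recursive CTE in SQL.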

Hands-on — Building a Streaming Recommendation Graph

The scenario: You are a backend engineer at a streaming platform. The recommendation team wants "shows watched by people like you" served in under 50ms. The current SQL approach — JOIN users → watch_history → shows → genre_tags — takes 800ms on a 5-million-user dataset. Your tech lead says it is time to prototype a graph model. You are building the proof of concept in Neo4j.

# Connect via Cypher shell (Neo4j must be running on localhost:7687)
cypher-shell -u neo4j -p yourpassword

# Or open the browser UI at:
# http://localhost:7474
Connected to Neo4j using Bolt protocol version 5.0 at neo4j://localhost:7687
as user neo4j.
Type :help for a list of available commands or :exit to exit the shell.
Note that Cypher queries must end with a semicolon.
neo4j@neo4j>
cypher-shell -u neo4j -p yourpassword

Opens the Cypher command-line shell. Cypher is Neo4j's query language, designed to look like the ASCII art you would draw to describe a graph. Nodes are written in parentheses (), relationships in square brackets []. Bolt is the binary wire protocol that Neo4j clients and drivers use to talk to the server; it is faster than HTTP for query throughput.

The scenario continues: You create nodes for users and shows, then build WATCHED relationships between them. Cypher's ASCII-art syntax means the query literally looks like the graph pattern you are trying to describe.

// Create two users and two shows
CREATE (:User {id: 'u1', name: 'Alice', plan: 'premium'});
CREATE (:User {id: 'u2', name: 'Bob',   plan: 'basic'});
CREATE (:Show {id: 's1', title: 'Cosmos', genre: 'documentary'});
CREATE (:Show {id: 's2', title: 'Dark',   genre: 'thriller'});
Added 1 label, created 1 node, set 3 properties, completed in 12 ms.
Added 1 label, created 1 node, set 3 properties, completed in 3 ms.
Added 1 label, created 1 node, set 3 properties, completed in 3 ms.
Added 1 label, created 1 node, set 3 properties, completed in 2 ms.
(:User {id: 'u1', name: 'Alice', plan: 'premium'})

Parentheses define a node. :User is the label — like a type or class. Curly braces contain properties. There is no fixed schema — you could add country to one user node and leave it off another. Schema-optional, just like document stores.

Added 1 label, created 1 node, set 3 properties

Neo4j reports exactly what changed. If you ran this twice with the same id, you would get two separate nodes — CREATE always creates. To upsert (create only if not exists), use MERGE instead.
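The difference between the two behaviours can be modelled in a few lines of plain Python. This is a toy model of the semantics only: `create`, `merge`, and the `nodes` list are hypothetical stand-ins, not Neo4j calls.

```python
nodes = []  # toy node store: a list of (label, properties) pairs


def create(label, **props):
    """Like CREATE: always appends a new node, even if an identical one exists."""
    nodes.append((label, dict(props)))


def merge(label, **props):
    """Like MERGE on a single node pattern: reuse a node whose label matches and
    whose properties include every given key-value pair; otherwise create it."""
    for existing_label, existing_props in nodes:
        if existing_label == label and all(
            key in existing_props and existing_props[key] == value
            for key, value in props.items()
        ):
            return
    create(label, **props)


create("User", id="u1", name="Alice")
create("User", id="u1", name="Alice")   # second run duplicates the node
created_count = len(nodes)

nodes.clear()
merge("User", id="u1")
merge("User", id="u1")                  # second run finds the existing node
merged_count = len(nodes)

print(created_count, merged_count)  # 2 1
```

Because MERGE matches on the whole pattern you give it, the usual practice in Cypher is to merge on the identifying key alone and set the remaining properties with ON CREATE SET or ON MATCH SET.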

The scenario continues: Now you connect users to the shows they have watched. Each WATCHED relationship carries a timestamp and a rating — that data lives on the edge itself, not in a separate join table.

// Create WATCHED relationships with properties on the edge
MATCH (u:User {id: 'u1'}), (s:Show {id: 's1'})
CREATE (u)-[:WATCHED {at: '2025-01-10', rating: 5}]->(s);

MATCH (u:User {id: 'u1'}), (s:Show {id: 's2'})
CREATE (u)-[:WATCHED {at: '2025-01-11', rating: 4}]->(s);

MATCH (u:User {id: 'u2'}), (s:Show {id: 's1'})
CREATE (u)-[:WATCHED {at: '2025-01-09', rating: 5}]->(s);
Created 1 relationship, set 2 properties, completed in 5 ms.
Created 1 relationship, set 2 properties, completed in 3 ms.
Created 1 relationship, set 2 properties, completed in 3 ms.
MATCH (u:User {id: 'u1'}), (s:Show {id: 's1'})

MATCH finds existing nodes — equivalent to a SELECT. It binds the found nodes to variables u and s. If either node does not exist, nothing is created and no error is raised. You need the nodes before you can connect them.

(u)-[:WATCHED {at: '2025-01-10', rating: 5}]->(s)

The ASCII-art relationship syntax. -[]-> draws a directed edge from u to s. WATCHED is the relationship type. It carries two properties directly on the edge — at and rating. In SQL, this metadata would require a separate join table row.

The scenario continues: The product manager wants "you might also like" recommendations. The logic: find shows watched by users who also watched the same shows as Alice, but that Alice has not seen yet. In SQL this is six JOINs and a NOT EXISTS subquery. In Cypher it is one pattern match.

// Recommendation: shows watched by Alice's taste-twins that Alice hasn't seen
MATCH (alice:User {name: 'Alice'})-[:WATCHED]->(show:Show)
      <-[:WATCHED]-(other:User)-[:WATCHED]->(rec:Show)
WHERE NOT (alice)-[:WATCHED]->(rec)
RETURN rec.title AS recommendation,
       count(other) AS score
ORDER BY score DESC
LIMIT 10;
+----------------------------------+
| recommendation     | score       |
+----------------------------------+
| "Breaking Bad"     | 4           |
| "Planet Earth II"  | 3           |
| "Mindhunter"       | 2           |
+----------------------------------+
3 rows available after 38 ms
(alice)-[:WATCHED]->(show)<-[:WATCHED]-(other)

A two-hop traversal in a single pattern. Alice watched a show. Another user also watched that same show. The <- arrow means the incoming edge. Neo4j finds all paths matching this pattern simultaneously — no loops in code, no JOINs.

WHERE NOT (alice)-[:WATCHED]->(rec)

A negative pattern predicate — filters out shows Alice already watched. Expressing this in SQL requires a NOT EXISTS correlated subquery. In Cypher it reads exactly like an English sentence: "where Alice has not watched the recommendation."

count(other) AS score

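count(other) counts the matching paths that end at this recommendation, so a user who shares two shows with Alice contributes twice; use count(DISTINCT other) to count each user once. Either way, a higher score is a stronger signal. This is collaborative filtering expressed in 6 lines of Cypher; the SQL equivalent needs several self-joins on the watch-history table plus a NOT EXISTS subquery.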
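To check the scoring logic without a database, the same pattern can be mirrored in plain Python. The watch history below is hypothetical and is not the lesson's dataset. The sketch reports both the number of matching paths, which is what count(other) returns, and the number of distinct users, which is what count(DISTINCT other) would return.

```python
# Hypothetical watch history (user -> shows); not the lesson's dataset.
watched = {
    "Alice": {"Cosmos", "Dark"},
    "Bob": {"Cosmos", "Dark", "Mindhunter"},
    "Cara": {"Cosmos", "Planet Earth II", "Mindhunter"},
    "Dev": {"Narcos"},
}


def recommend(user, limit=10):
    """Return (show, path_count, distinct_users) rows, best first."""
    seen = watched[user]
    paths, users = {}, {}
    for other, shows in watched.items():
        shared = len(seen & shows)      # shows linking `user` to `other`
        if other == user or shared == 0:
            continue                    # no common show: no matching path
        for rec in shows - seen:        # the WHERE NOT filter
            paths[rec] = paths.get(rec, 0) + shared   # like count(other)
            users[rec] = users.get(rec, 0) + 1        # like count(DISTINCT other)
    rows = [(show, paths[show], users[show]) for show in paths]
    rows.sort(key=lambda row: (-row[1], row[0]))      # ORDER BY score DESC
    return rows[:limit]


for row in recommend("Alice"):
    print(row)
# ('Mindhunter', 3, 2)
# ('Planet Earth II', 1, 1)
```

Bob shares two shows with Alice, so he contributes two paths to Mindhunter. That is why its path count (3) is higher than its distinct-user count (2).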

Types of Graph Databases

Not all graph databases work the same way. Two families dominate:

Type | Model | Examples | Best for
Labeled Property Graph | Nodes + typed edges, both with properties | Neo4j, Amazon Neptune, TigerGraph | Social, fraud, recommendations, IAM
RDF / Triple Store | Subject → Predicate → Object triples | Amazon Neptune (RDF), GraphDB, Stardog | Knowledge graphs, semantic web, ontologies
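The practical difference shows up as soon as a relationship needs its own data. Here is a rough sketch of one money transfer in both models, using plain Python tuples. All identifiers and values are hypothetical, and the RDF side assumes plain triples without extensions such as RDF-star.

```python
# One transfer with three pieces of metadata (all values hypothetical).
meta = {"timestamp": "2025-01-10T09:30:00Z", "amount": 149.99, "channel": "mobile"}

# Labeled property graph: a single edge carries the metadata directly.
lpg_edge = ("acct:A", "SENT", "acct:B", meta)

# Plain RDF triples: a predicate has no properties, so the transfer becomes
# its own node ("tx:1") and every attribute is a separate triple.
triples = [
    ("tx:1", "from", "acct:A"),
    ("tx:1", "to", "acct:B"),
] + [("tx:1", key, value) for key, value in meta.items()]

print(len(triples))  # 5
```

One edge in the property graph becomes five triples and an extra intermediate node in the triple store.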

When Graphs Are the Wrong Tool

Heavy Aggregations

Summing revenue across billions of transactions. SQL or columnar stores (BigQuery, Redshift) vastly outperform graph databases on GROUP BY at scale.

Tabular / Flat Data

Product catalogues, financial ledgers, user tables with flat attributes — if there is no meaningful relationship structure, you are fighting the model to store simple rows.

Massive Write Throughput

IoT sensor streams at millions of writes per second. Graph databases maintain complex pointer structures — high-velocity append workloads are not their strength.

Teacher's Note

In production systems, graph databases almost always live alongside relational or document stores — not instead of them. A common pattern: Postgres for transactional data, MongoDB for product content, Neo4j only for relationship-heavy features like recommendations or fraud detection. The graph covers maybe 10% of the data but 90% of the performance pain. Use it as a specialist tool, not a general replacement.

Practice Questions — You're the Engineer

Scenario:

Your colleague is benchmarking Neo4j against PostgreSQL. She finds that traversing three relationship hops in Neo4j takes 12ms regardless of whether the graph has 100,000 nodes or 100 million. She explains: in Neo4j, each relationship is a direct physical pointer to the next node — the database follows the pointer, it does not search an index. What is this property called?


Scenario:

You are loading user data into Neo4j from a CSV file. The pipeline runs daily, and the same users appear in multiple days' files. After three days, your monitoring shows the node count is three times higher than your actual user count. Your CREATE statements are creating duplicate nodes on every run. Which single Cypher command should replace CREATE to create the node only when it does not already exist?


Scenario:

You are designing a fraud detection system where nodes represent bank accounts and edges represent transactions. Each transaction edge must carry a timestamp, an amount, and a channel property. A colleague suggests using an RDF triple store. You push back: in plain RDF a predicate cannot carry properties of its own, so the transaction metadata would require separate triples and extra indirection. What graph model natively supports properties on edges and is the better fit here?


Quiz — Graph Databases in Production

Scenario:

Your engineering director asks why the graph-based recommendation query runs in 38ms against 50 million users when the equivalent SQL query — five JOINs across the watch_history, users, and shows tables — takes 12 seconds. How do you explain the performance difference?

Scenario:

Your fintech company has flagged an account for suspicious transactions. You need to surface all related accounts in a fraud ring — accounts sharing device IDs, linked phone numbers, or shared IPs up to 3 relationship hops away. The fraud analytics team asks which database approach is most appropriate. What is the correct answer?

Scenario:

A startup wants to add a "people you may know" feature to their existing app. Their core data — user profiles, posts, messages — is stored in Postgres and working well. The CTO proposes migrating everything to Neo4j since graph databases are more powerful for social features. As the senior engineer, what is the most pragmatic recommendation?

Up Next · Lesson 21

Neo4j Introduction

Go deep on the world's most popular graph database — Cypher queries, indexing, APOC procedures, and production patterns.