NoSQL
Neo4j Introduction
Neo4j is the database that put graph databases on the map. eBay, Airbnb, and NASA have all used it in production, for problems like real-time fraud detection, internal knowledge graphs, and tracing relationships across decades of engineering data. The reason such teams choose Neo4j often comes down to one thing: Cypher, a query language so readable that a pattern you draw on a whiteboard is almost identical to the query you write in code.
What Makes Neo4j Different
Neo4j is a native graph database — meaning the storage engine itself is built around nodes and relationships, not tables that simulate a graph. Every node stores a direct physical pointer to each of its relationships. Every relationship stores direct pointers to its start and end nodes. Traversal follows those pointers — no index involved, no table scan, no JOIN.
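The pointer-chasing idea is easy to model. Below is a minimal Python sketch — a toy, not Neo4j's internals — in which each node object holds direct references to its relationships, so finding neighbours never consults a global index:

```python
# Toy model of index-free adjacency. Each node keeps direct references
# to its relationship objects; traversal just follows those pointers.
class Rel:
    def __init__(self, start, end, rel_type):
        self.start, self.end, self.type = start, end, rel_type
        start.rels.append(self)  # the node stores a direct pointer to the rel
        end.rels.append(self)

class Node:
    def __init__(self, name):
        self.name = name
        self.rels = []

    def neighbors(self):
        # Follow the stored pointers; no index lookup, no table scan.
        return [r.end if r.start is self else r.start for r in self.rels]

alice, device = Node("Alice"), Node("fp_XK99")
Rel(alice, device, "USES_DEVICE")
print([n.name for n in alice.neighbors()])  # ['fp_XK99']
```

The class names here are invented for illustration; the point is that the cost of finding a node's neighbours depends only on how many relationships it has, never on the total size of the graph.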
The Cypher query language is designed around pattern matching. You describe the shape of the graph you want to find using ASCII art, and Neo4j finds it. Parentheses are nodes (), square brackets are relationships [], and arrows show direction ->.
Native Graph Storage vs Graph-on-Top-of-SQL
Some databases bolt graph querying on top of a relational or document engine, translating graph patterns into joins or lookups under the hood — SQL Server's graph extensions, for example, store nodes and edges as ordinary tables. Neo4j's storage engine was written from scratch with nodes and relationships as the fundamental unit. That is why deep traversals stay fast: no abstraction layer, no translation cost, just pointer chasing.
Neo4j Architecture — How It Works Internally
Understanding Neo4j's internals helps you write faster queries and design better schemas.
Node Store
Fixed-size records for every node. Each record stores: an in-use flag, the ID of the node's first relationship, the ID of its first property, and a labels field. Fixed size means a node's record offset is pure arithmetic, so any node can be found by its ID in a single O(1) seek.
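That O(1) seek is just a multiplication. A hedged sketch — the 15-byte record size is the commonly cited figure for Neo4j's classic record format, so treat it as illustrative:

```python
NODE_RECORD_SIZE = 15  # bytes per node record; illustrative figure

def node_record_offset(node_id: int) -> int:
    # Fixed-size records: the byte offset into the node store file
    # is a single multiplication, hence the O(1) lookup by ID.
    return node_id * NODE_RECORD_SIZE

print(node_record_offset(1_000_000))  # 15000000
```

No search, no index: node 1,000,000 lives exactly 15,000,000 bytes into the store file.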
Relationship Store
Each relationship record stores: start node ID, end node ID, relationship type, and pointers to the previous and next relationship in the chain for both nodes. This doubly-linked list is how traversal works — follow the pointer, not an index.
Property Store
Properties are stored separately from the node and relationship stores — as a linked list of property records. Each record holds up to four property key-value pairs. Long strings and arrays spill into a dynamic store.
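The chaining amounts to simple chunking. A toy Python model — `chunk_properties` is an invented helper, not a Neo4j API; the four-pairs-per-record limit is the one stated above:

```python
def chunk_properties(props: dict, per_record: int = 4) -> list:
    # Pack key-value pairs into fixed records of up to four pairs each,
    # mimicking how a property chain spreads one entity's properties
    # across several linked records.
    items = list(props.items())
    return [dict(items[i:i + per_record]) for i in range(0, len(items), per_record)]

records = chunk_properties({"a": 1, "b": 2, "c": 3, "d": 4, "e": 5})
print(len(records))  # 2: four pairs in the first record, one in the second
```

This is why entities with many properties cost more to read: each extra record in the chain is another pointer to follow.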
Label Index
An index per label that maps label → node IDs. When your query starts with MATCH (u:User), Neo4j uses this index to find all User nodes rather than scanning every node in the store.
Cypher Fundamentals — Reading the ASCII Art
Cypher has five core clauses. Everything you will ever write is a combination of these:
| Clause | What it does | SQL equivalent |
|---|---|---|
| MATCH | Find nodes/relationships matching a pattern | SELECT … FROM … JOIN |
| WHERE | Filter results by property values or patterns | WHERE |
| CREATE | Create new nodes or relationships | INSERT INTO |
| MERGE | Create if not exists, match if it does | INSERT … ON CONFLICT DO NOTHING |
| RETURN | Specify what to send back to the client | SELECT (column list) |
Hands-on — Building a Fraud Detection Graph
The scenario: You are a data engineer at a payment processor. Your fraud team has identified that fraudsters operate in rings — multiple accounts sharing the same device fingerprint or phone number, coordinating small transactions that individually look legitimate. Your existing SQL queries miss these rings because they only check one-hop relationships. You are prototyping a Neo4j model to surface connected suspicious accounts.
// Create accounts with MERGE (safe to run multiple times — no duplicates)
MERGE (a1:Account {id: 'acc_001', name: 'Alice', status: 'flagged'})
MERGE (a2:Account {id: 'acc_002', name: 'Bob', status: 'clean'})
MERGE (a3:Account {id: 'acc_003', name: 'Carol', status: 'clean'})
// Create a shared device node
MERGE (d1:Device {fingerprint: 'fp_XK99'})
// Connect all three accounts to the same device
MERGE (a1)-[:USES_DEVICE]->(d1)
MERGE (a2)-[:USES_DEVICE]->(d1)
MERGE (a3)-[:USES_DEVICE]->(d1);
Merged 1 node, set 3 properties, completed in 8 ms.
Merged 1 node, set 3 properties, completed in 2 ms.
Merged 1 node, set 3 properties, completed in 2 ms.
Merged 1 node, set 1 property, completed in 3 ms.
Created 1 relationship, completed in 2 ms.
Created 1 relationship, completed in 1 ms.
Created 1 relationship, completed in 1 ms.
MERGE (a1:Account {id: 'acc_001'...})
MERGE is an upsert — it matches the node on the property inside the braces (id) and creates it only if no match exists. If you run this script daily to load new accounts, you will never create duplicates. This is the correct default for data loading pipelines. Use CREATE only when you know the node does not already exist.
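MERGE's match-or-create contract can be mimicked with a dict-backed upsert. This is a toy sketch of the semantics only — `store` and `merge_account` are invented names, nothing like the real engine:

```python
store = {}  # stands in for the graph, keyed by the merge property (id)

def merge_account(account_id: str, **props):
    node = store.get(account_id)
    if node is None:                       # no match: create the node
        node = {"id": account_id, **props}
        store[account_id] = node
    return node                            # match: reuse the existing node

merge_account("acc_001", name="Alice", status="flagged")
merge_account("acc_001", name="Alice", status="flagged")  # re-run is a no-op
print(len(store))  # 1
```

Run the load twice and the store still holds one Alice — the same guarantee MERGE gives a daily loading pipeline.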
MERGE (a1)-[:USES_DEVICE]->(d1)
Creates the relationship only if it does not already exist between these two specific nodes. The device node d1 is referenced by the variable bound earlier in the same query — Cypher shares context across all MERGE/CREATE/MATCH clauses in one statement.
The scenario continues: Alice's account has been flagged by the risk engine. You need to find every account that shares a device with Alice — one hop away — plus all accounts that share a device with those accounts — two hops. That is the fraud ring. In SQL: three self-joins and a UNION. In Cypher: variable-length traversal with a single pattern.
// Find all accounts connected to Alice within 2 hops via shared devices
MATCH (alice:Account {name: 'Alice'})
      -[:USES_DEVICE*1..2]-(connected:Account)
WHERE connected <> alice
RETURN connected.name AS account,
       connected.status AS status,
       connected.id AS id;
+---------+--------+---------+
| account | status | id      |
+---------+--------+---------+
| "Bob"   | clean  | acc_002 |
| "Carol" | clean  | acc_003 |
+---------+--------+---------+
2 rows available after 11 ms
-[:USES_DEVICE*1..2]-
Variable-length traversal. The *1..2 means "follow this relationship type between 1 and 2 hops." Change to *1..5 and you get five hops — deep ring detection — with zero extra code. No UNION, no recursive CTE. The undirected dash (no arrow) means follow in either direction, which is correct here since the relationship points account → device.
WHERE connected <> alice
Excludes Alice herself from the results. Without this, Alice would appear in her own connected accounts list at hop 2 (Alice → device → Alice). The <> operator compares node identity, not just property values.
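Conceptually, *1..k is a breadth-first walk bounded at k hops. A minimal Python sketch over an undirected adjacency map, using a toy graph mirroring the one built above:

```python
from collections import deque

graph = {
    "Alice":   ["fp_XK99"],
    "Bob":     ["fp_XK99"],
    "Carol":   ["fp_XK99"],
    "fp_XK99": ["Alice", "Bob", "Carol"],
}

def within_hops(start: str, k: int) -> set:
    # Breadth-first walk that stops expanding once depth k is reached --
    # the same bound *1..k puts on a variable-length traversal. The start
    # node is excluded, like WHERE connected <> alice.
    seen, reached = {start}, set()
    frontier = deque([(start, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == k:
            continue
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                reached.add(nxt)
                frontier.append((nxt, depth + 1))
    return reached

print(sorted(within_hops("Alice", 2)))  # ['Bob', 'Carol', 'fp_XK99']
```

The Cypher version additionally requires the endpoint to carry the :Account label, which is why the device node does not appear in the query's results even though the walk passes through it.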
The scenario continues: The risk team wants to see the actual path between Alice and each connected account — which device is the link, and when the connection was established. Returning the path object gives them full visibility without running separate queries.
// Return the full path so investigators can see the connection chain
MATCH path = (alice:Account {name: 'Alice'})
             -[:USES_DEVICE*1..2]-(connected:Account)
WHERE connected <> alice
RETURN path,
       length(path) AS hops,
       connected.name AS connected_account
ORDER BY hops;
+-----------------------------------------------------------+------+-------------------+
| path                                                      | hops | connected_account |
+-----------------------------------------------------------+------+-------------------+
| (Alice)-[:USES_DEVICE]->(fp_XK99)<-[:USES_DEVICE]-(Bob)   | 2    | Bob               |
| (Alice)-[:USES_DEVICE]->(fp_XK99)<-[:USES_DEVICE]-(Carol) | 2    | Carol             |
+-----------------------------------------------------------+------+-------------------+
2 rows available after 9 ms
path = (alice)...(connected)
Assigning the entire traversal to the variable path captures every node and relationship in the chain. You can return it directly for graph visualization in Neo4j Browser (it draws the graph), or decompose it with functions like nodes(path) and relationships(path) to inspect individual elements.
length(path)
Returns the number of relationships in the path — 2 here because Alice → device ← Bob is two relationship hops. Useful for prioritising investigations: accounts two hops away are more suspicious than those three hops away.
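The counting rule is worth pinning down: a path with n nodes has n - 1 relationships, and length(path) counts the relationships. A one-line sketch:

```python
# Path Alice -> fp_XK99 <- Bob: three nodes, two relationships.
path_nodes = ["Alice", "fp_XK99", "Bob"]
path_rels = ["USES_DEVICE", "USES_DEVICE"]

def path_length(rels: list) -> int:
    return len(rels)  # mirrors Cypher's length(path): relationship count

print(path_length(path_rels))  # 2, i.e. len(path_nodes) - 1
```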
Indexing in Neo4j — Making Lookups Fast
Every traversal in Neo4j starts somewhere. That starting point — the first node in a MATCH — needs to be found quickly. Without an index, Neo4j scans every node with the given label. With an index, it jumps directly to matching nodes. Always index the properties you use in your starting MATCH conditions.
The scenario: Your fraud graph now has 10 million account nodes. Every fraud investigation starts with MATCH (a:Account {id: 'acc_001'}). Without an index on id, Neo4j scans all 10 million Account nodes to find one. With an index, it is a single lookup. You are adding the index before the team runs production queries.
// Create a range index on Account.id (Neo4j 5.x syntax)
CREATE INDEX account_id_idx IF NOT EXISTS
FOR (a:Account) ON (a.id);

// Verify the index was created
SHOW INDEXES
YIELD name, labelsOrTypes, properties, state
WHERE 'Account' IN labelsOrTypes;
+----------------+---------------+------------+--------+
| name           | labelsOrTypes | properties | state  |
+----------------+---------------+------------+--------+
| account_id_idx | ["Account"]   | ["id"]     | ONLINE |
+----------------+---------------+------------+--------+
1 row available after 14 ms

Index created and ONLINE — lookup on Account.id is now O(log n).
CREATE INDEX … IF NOT EXISTS
The IF NOT EXISTS guard makes this idempotent — safe to include in setup scripts that run multiple times without throwing an error if the index already exists. Omit it and re-running the script on a live cluster throws a "There already exists an index" error.
FOR (a:Account) ON (a.id)
This is a label-property index — specific to the Account label. A MATCH (a:Account {id: 'x'}) query will use it automatically. Neo4j's query planner checks available indexes at query planning time and selects the best one — you do not need query hints for straightforward lookups.
state: ONLINE
Neo4j builds indexes asynchronously. On a large existing dataset the index shows as POPULATING first, then transitions to ONLINE when the backfill is complete. Until it is ONLINE, queries will not use it. Always check state before going to production.
Connecting from Python — The Official Driver
The scenario: Your fraud detection service is written in Python. Every time the risk engine flags a transaction, it calls your service to check whether the source account is connected to any known-bad accounts within three hops. You need a production-safe Neo4j connection with a connection pool, parameterised queries to prevent injection, and proper session management.
from neo4j import GraphDatabase
# Driver maintains a connection pool — create once, reuse across requests
driver = GraphDatabase.driver(
"bolt://localhost:7687",
auth=("neo4j", "yourpassword")
)
def find_connected_accounts(account_id: str, max_hops: int = 3):
    # Cypher cannot parameterise variable-length bounds (*1..N), so the
    # hop limit is validated as an integer and spliced into the query
    # text. account_id stays a real parameter; never interpolate strings.
    if not isinstance(max_hops, int) or max_hops < 1:
        raise ValueError("max_hops must be a positive integer")
    query = f"""
        MATCH (start:Account {{id: $account_id}})
              -[:USES_DEVICE*1..{max_hops}]-(connected:Account)
        WHERE connected <> start
        RETURN connected.id AS id,
               connected.name AS name,
               connected.status AS status
    """
    with driver.session(database="neo4j") as session:
        result = session.run(query, account_id=account_id)
        return [record.data() for record in result]
>>> find_connected_accounts('acc_001', max_hops=2)
[
{'id': 'acc_002', 'name': 'Bob', 'status': 'clean'},
{'id': 'acc_003', 'name': 'Carol', 'status': 'clean'}
]
Query completed in 11 ms.

GraphDatabase.driver("bolt://...", auth=(...))
Creates the driver — which manages a connection pool internally. Call this once at application startup and store it. Creating a new driver per request is a common mistake that exhausts file descriptors under load. The driver is thread-safe and designed to be shared.
session.run(query, account_id=account_id)
Parameters are passed separately from the query string: never use f-strings to embed user-supplied strings directly into a Cypher query. Parameterised queries prevent Cypher injection, and Neo4j can cache the plan of a parameterised query, dramatically improving throughput under repeated calls. One caveat: the bounds of a variable-length pattern (*1..N) cannot be parameters, so an integer hop limit must be validated as a genuine int and spliced into the query text.
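Since a variable-length bound cannot be a Cypher parameter, it is the one value that must be interpolated, and validating it is what keeps the query safe. A hedged sketch (build_ring_query is an invented helper, not a driver API):

```python
def build_ring_query(max_hops: int) -> str:
    # The hop bound is interpolated, but only after confirming it is a
    # plain bounded integer. Any string, even one that "looks" numeric,
    # is rejected, which closes the injection path.
    if not isinstance(max_hops, int) or isinstance(max_hops, bool) \
            or not 1 <= max_hops <= 10:
        raise ValueError("max_hops must be an int between 1 and 10")
    return (
        "MATCH (start:Account {id: $account_id}) "
        f"-[:USES_DEVICE*1..{max_hops}]-(connected:Account) "
        "WHERE connected <> start "
        "RETURN connected.id AS id"
    )

print("*1..3" in build_ring_query(3))  # True
```

The account_id still travels as $account_id, so every user-controlled string stays a parameter; only the validated integer ever touches the query text.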
with driver.session(...) as session:
The context manager opens a session and closes it on exit. A session is a lightweight logical context: it borrows a Bolt connection from the driver's pool only while work is running and returns it when done. Always use sessions as context managers so connections go back to the pool even if an exception is raised.
Query Planning — Understanding EXPLAIN and PROFILE
Before pushing a query to production, you want to know whether it is using your indexes or doing a full node scan. Neo4j has two tools for this: EXPLAIN shows the query plan without running the query, and PROFILE runs it and shows actual row counts at each step.
The scenario: A junior engineer wrote a fraud query that is taking 800ms in staging. You suspect it is doing a NodeByLabelScan instead of using the account_id_idx index. You run PROFILE to confirm before pushing a fix.
-- PROFILE runs the query and reports rows processed at each step
PROFILE
MATCH (alice:Account {id: 'acc_001'})
-[:USES_DEVICE*1..2]-(connected:Account)
WHERE connected <> alice
RETURN connected.name, connected.status;
Operator             | Rows | DB Hits | Details
---------------------+------+---------+----------------------------
ProduceResults       |    2 |       0 | connected.name, status
Filter               |    2 |       4 | connected <> alice
VarLengthExpand(All) |    2 |       8 | *1..2 via USES_DEVICE
NodeIndexSeek        |    1 |       2 | :Account(id) = 'acc_001' ✓
---------------------+------+---------+----------------------------
Total DB hits: 14 | Completed in 9 ms
NodeIndexSeek
This is what you want to see. The query planner used account_id_idx to find Alice with 2 DB hits — one to locate the index entry, one to retrieve the node. If you saw NodeByLabelScan here instead, it would mean Neo4j scanned every Account node — your index is either missing, not ONLINE yet, or the query is written in a way that prevents its use.
DB Hits
The unit of work in Neo4j's query plan. Each DB hit represents one record read from the store. Lower is faster. A full label scan on 10 million nodes would show 10 million DB hits at the NodeByLabelScan step — an immediate red flag that an index is missing.
Neo4j Deployment Options
Neo4j Community
Free, open-source, single instance. No clustering, no enterprise features. Perfect for development, prototyping, and small production workloads that do not need HA.
Neo4j Enterprise
Causal clustering for HA, role-based access control, hot backups, multi-database support. The edition aimed at large-scale production deployments like the fraud graphs described in this lesson.
AuraDB (Cloud)
Neo4j's fully managed cloud service on GCP, AWS, and Azure. Free tier available. No infrastructure to manage — ideal for teams who want graph capabilities without ops overhead.
Neo4j Desktop
Local development environment with a built-in browser, project management, and one-click plugin installation. The fastest way to get started on your laptop.
Teacher's Note
The single most common Neo4j performance mistake is starting a traversal without an index — resulting in a full NodeByLabelScan on millions of nodes. Always run PROFILE on any new query before it hits production, and make sure every property you filter on in a starting MATCH clause has an index behind it.
Practice Questions — You're the Engineer
Scenario:
Your nightly loading script writes account nodes with CREATE for every row, so re-running it produces duplicates. Which Cypher command should you replace CREATE with to ensure each account is written only once regardless of how many times the script runs?
Scenario:
You run PROFILE on a fraud query that starts with MATCH (a:Account {id: $id}). The profile output shows 9,800,000 DB hits at the first operator and the query takes 2.1 seconds. You look at the operator name — it is not NodeIndexSeek. Your index exists but was created 30 seconds ago on a live cluster. What operator are you seeing, and what does it tell you?
Scenario:
A security audit finds that your service builds queries with f"MATCH (a:Account {{id: '{account_id}'}})". The auditor says this creates two problems: a Cypher injection vulnerability, and degraded performance under load because Neo4j cannot cache the query plan. What technique should replace f-string query building?
Quiz — Neo4j in Production
Scenario:
You create an index on Account.id on a production cluster with 50 million Account nodes. Five seconds later your monitoring shows that queries starting with MATCH (a:Account {id: $id}) are still taking 3 seconds and showing 50 million DB hits in PROFILE. You check SHOW INDEXES and see state: POPULATING. What is happening and what should you do?
Scenario:
On every request, your service calls GraphDatabase.driver(...) to create a fresh driver, runs the query, then calls driver.close(). Under load the service starts throwing "too many open file descriptors" errors and query latency spikes. What is the correct fix?
Up Next · Lesson 22
Choosing the Right NoSQL Database
Key-value, document, column-family, or graph — a practical decision framework for picking the right database the first time.