NoSQL Database Types · Lesson 21

Neo4j Introduction

Neo4j is the database that put graph databases on the map. eBay uses it to detect fraud rings in real time. Airbnb uses it to power identity verification across its listings. NASA uses it to manage the relationships between spacecraft components. A large part of its appeal is Cypher: a query language so readable that a pattern you draw on a whiteboard is almost identical to the query you write in code.

What Makes Neo4j Different

Neo4j is a native graph database — meaning the storage engine itself is built around nodes and relationships, not tables that simulate a graph. Every node stores a direct physical pointer to each of its relationships. Every relationship stores direct pointers to its start and end nodes. Traversal follows those pointers — no index involved, no table scan, no JOIN.

The Cypher query language is designed around pattern matching. You describe the shape of the graph you want to find using ASCII art, and Neo4j finds it. Parentheses are nodes (), square brackets are relationships [], and arrows show direction ->.

Native Graph Storage vs Graph-on-Top-of-SQL

Some databases bolt graph querying on top of a relational or document engine, so every graph pattern has to be translated into operations on tables or documents. Neo4j's storage engine was written from scratch with nodes and relationships as the fundamental unit. That is why deep traversals stay fast: no abstraction layer, no translation cost, just pointer chasing.

Neo4j Architecture — How It Works Internally

Understanding Neo4j's internals helps you write faster queries and design better schemas.

Node Store

Fixed-size records for every node. Each record stores: a flag (in use?), the first relationship ID, the first property ID, and the label store ID. Fixed size means any node can be found by its ID in a single O(1) seek.
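
The O(1) claim follows from simple arithmetic: with fixed-size records, the byte offset of a node is its ID multiplied by the record size. A minimal sketch of that idea, assuming a made-up 13-byte layout (the field widths and the write_node/read_node helpers are illustrative, not Neo4j's actual on-disk format):

```python
import struct

# Hypothetical layout: in-use flag (1 byte), then first relationship ID,
# first property ID and label store ID (4 bytes each). Illustration only.
RECORD = struct.Struct("<?iii")

def write_node(store: bytearray, node_id: int, first_rel: int,
               first_prop: int, labels: int) -> None:
    offset = node_id * RECORD.size              # position depends only on the ID
    if len(store) < offset + RECORD.size:
        store.extend(bytes(offset + RECORD.size - len(store)))
    RECORD.pack_into(store, offset, True, first_rel, first_prop, labels)

def read_node(store: bytearray, node_id: int) -> tuple:
    # One seek, no search: jump straight to node_id * record size.
    return RECORD.unpack_from(store, node_id * RECORD.size)

store = bytearray()
write_node(store, 0, first_rel=7, first_prop=3, labels=1)
write_node(store, 2, first_rel=9, first_prop=-1, labels=1)
print(read_node(store, 2))   # (True, 9, -1, 1)
print(read_node(store, 1))   # slot 1 was never written, so its in-use flag is False
```

Because every record has the same size, reading node 2 never has to look at nodes 0 or 1.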

Relationship Store

Each relationship record stores: start node ID, end node ID, relationship type, and pointers to the previous and next relationship in the chain for both nodes. This doubly-linked list is how traversal works — follow the pointer, not an index.
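
To make the pointer chasing concrete, here is a toy in-memory model of relationship chains. The ToyGraph and Rel names are hypothetical, and the sketch keeps only the "next" pointers; the "previous" pointers described above matter mostly for deletion and are omitted for brevity.

```python
from dataclasses import dataclass

NONE = -1  # sentinel meaning "no record"

@dataclass
class Rel:
    start: int
    end: int
    type: str
    start_next: int = NONE   # next relationship in the start node's chain
    end_next: int = NONE     # next relationship in the end node's chain

class ToyGraph:
    def __init__(self):
        self.first_rel = {}  # node ID -> ID of the first relationship in its chain
        self.rels = []       # relationship ID -> Rel record

    def add_rel(self, start: int, end: int, type_: str) -> int:
        rel_id = len(self.rels)
        # The new record becomes the head of both nodes' chains.
        self.rels.append(Rel(start, end, type_,
                             start_next=self.first_rel.get(start, NONE),
                             end_next=self.first_rel.get(end, NONE)))
        self.first_rel[start] = rel_id
        self.first_rel[end] = rel_id
        return rel_id

    def neighbours(self, node: int):
        """Walk the node's chain by following pointers; no index is consulted."""
        rel_id = self.first_rel.get(node, NONE)
        while rel_id != NONE:
            rel = self.rels[rel_id]
            if rel.start == node:
                yield rel.end
                rel_id = rel.start_next
            else:
                yield rel.start
                rel_id = rel.end_next

alice, bob, carol, device = 0, 1, 2, 3
g = ToyGraph()
for account in (alice, bob, carol):
    g.add_rel(account, device, "USES_DEVICE")
print(list(g.neighbours(device)))  # [2, 1, 0]
print(list(g.neighbours(alice)))   # [3]
```

The cost of visiting a node's neighbours depends on how many relationships that node has, not on the total size of the graph.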

Property Store

Properties are stored separately from the node and relationship stores — as a linked list of property records. Each record holds up to four property key-value pairs. Long strings and arrays spill into a dynamic store.
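
The chained layout has a practical consequence: reading a property means walking records until the key turns up. A small sketch under that assumption (PropertyRecord, store_properties and read_property are illustrative names, and the dynamic store for long values is left out):

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_PAIRS = 4  # per-record capacity, as described above

@dataclass
class PropertyRecord:
    pairs: list = field(default_factory=list)    # up to MAX_PAIRS (key, value) tuples
    next: Optional["PropertyRecord"] = None      # next record in the chain

def store_properties(props: dict) -> Optional[PropertyRecord]:
    """Pack a property map into a chain of records, four pairs per record."""
    items = list(props.items())
    chunks = [items[i:i + MAX_PAIRS] for i in range(0, len(items), MAX_PAIRS)]
    head = None
    for chunk in reversed(chunks):               # build back to front
        head = PropertyRecord(chunk, head)
    return head

def read_property(head: Optional[PropertyRecord], key: str):
    """Return (value, records_visited); value is None if the key is absent."""
    record, visited = head, 0
    while record is not None:
        visited += 1
        for k, v in record.pairs:
            if k == key:
                return v, visited
        record = record.next
    return None, visited

chain = store_properties({"id": "acc_001", "name": "Alice", "status": "flagged",
                          "country": "DE", "tier": "gold", "age": 41})
print(read_property(chain, "name"))  # ('Alice', 1)
print(read_property(chain, "age"))   # (41, 2)
```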

Label Index

An index per label that maps label → node IDs. When your query starts with MATCH (u:User), Neo4j uses this index to find all User nodes rather than scanning every node in the store.
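
Conceptually the label index is a map from label to node IDs. The sketch below models it with a dictionary of sets; the LabelIndex class is a hypothetical stand-in, not Neo4j's internal structure.

```python
from collections import defaultdict

class LabelIndex:
    def __init__(self):
        self._by_label = defaultdict(set)   # label -> set of node IDs

    def add(self, node_id: int, labels) -> None:
        for label in labels:
            self._by_label[label].add(node_id)

    def nodes_with(self, label: str) -> list:
        # Only nodes carrying the label are touched; the rest of the store is skipped.
        return sorted(self._by_label.get(label, ()))

index = LabelIndex()
index.add(0, ["Account"])
index.add(1, ["Device"])
index.add(2, ["Account"])
print(index.nodes_with("Account"))  # [0, 2]
print(index.nodes_with("User"))     # []
```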

Cypher Fundamentals — Reading the ASCII Art

Cypher has five core clauses. Everything you will ever write is a combination of these:

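+--------+-----------------------------------------------+---------------------------------+
| Clause | What it does                                  | SQL equivalent                  |
+--------+-----------------------------------------------+---------------------------------+
| MATCH  | Find nodes/relationships matching a pattern   | SELECT … FROM … JOIN            |
| WHERE  | Filter results by property values or patterns | WHERE                           |
| CREATE | Create new nodes or relationships             | INSERT INTO                     |
| MERGE  | Create if not exists, match if it does        | INSERT … ON CONFLICT DO NOTHING |
| RETURN | Specify what to send back to the client       | SELECT (column list)            |
+--------+-----------------------------------------------+---------------------------------+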

Hands-on — Building a Fraud Detection Graph

The scenario: You are a data engineer at a payment processor. Your fraud team has identified that fraudsters operate in rings — multiple accounts sharing the same device fingerprint or phone number, coordinating small transactions that individually look legitimate. Your existing SQL queries miss these rings because they only check one-hop relationships. You are prototyping a Neo4j model to surface connected suspicious accounts.

// Create accounts with MERGE (safe to run multiple times, no duplicates)
MERGE (a1:Account {id: 'acc_001', name: 'Alice', status: 'flagged'})
MERGE (a2:Account {id: 'acc_002', name: 'Bob',   status: 'clean'})
MERGE (a3:Account {id: 'acc_003', name: 'Carol', status: 'clean'})

// Create a shared device node
MERGE (d1:Device {fingerprint: 'fp_XK99'})

// Connect all three accounts to the same device
MERGE (a1)-[:USES_DEVICE]->(d1)
MERGE (a2)-[:USES_DEVICE]->(d1)
MERGE (a3)-[:USES_DEVICE]->(d1);
Merged 1 node, set 3 properties, completed in 8 ms.
Merged 1 node, set 3 properties, completed in 2 ms.
Merged 1 node, set 3 properties, completed in 2 ms.
Merged 1 node, set 1 property, completed in 3 ms.
Created 1 relationship, completed in 2 ms.
Created 1 relationship, completed in 1 ms.
Created 1 relationship, completed in 1 ms.
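MERGE (a1:Account {id: 'acc_001'...})

MERGE is an upsert over the whole pattern: it matches on the label and every property inside the braces (here id, name, and status) and creates the node only if no match exists. Re-running this script unchanged never creates duplicates. If a property such as status changes between runs, though, the pattern no longer matches and MERGE creates a second node. In a loading pipeline, put only the unique key inside the braces (MERGE (a:Account {id: 'acc_001'})) and assign the remaining properties with ON CREATE SET or ON MATCH SET. Use CREATE only when you know the node does not already exist.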

MERGE (a1)-[:USES_DEVICE]->(d1)

Creates the relationship only if it does not already exist between these two specific nodes. The device node d1 is referenced by the variable bound earlier in the same query — Cypher shares context across all MERGE/CREATE/MATCH clauses in one statement.
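
One detail worth checking in any load script is which properties go inside the MERGE braces, because Cypher matches on all of them. The toy merge function below imitates that rule in plain Python (it is a hypothetical model, not driver code) and compares merging on the full property map with merging on the key alone, which in Cypher would be MERGE (a:Account {id: ...}) ON CREATE SET ....

```python
def merge(nodes: list, label: str, pattern: dict, on_create: dict = None) -> dict:
    """Toy MERGE: return the node matching the label and every pattern property, else create it."""
    for node in nodes:
        if node["label"] == label and all(
            node["props"].get(k) == v for k, v in pattern.items()
        ):
            return node
    node = {"label": label, "props": {**pattern, **(on_create or {})}}
    nodes.append(node)
    return node

# Full property map in the pattern: a changed status no longer matches,
# so the second call creates a duplicate account.
full = []
merge(full, "Account", {"id": "acc_001", "name": "Alice", "status": "clean"})
merge(full, "Account", {"id": "acc_001", "name": "Alice", "status": "flagged"})

# Key only in the pattern: the second call finds the existing node.
keyed = []
merge(keyed, "Account", {"id": "acc_001"}, on_create={"name": "Alice", "status": "clean"})
merge(keyed, "Account", {"id": "acc_001"}, on_create={"name": "Alice", "status": "flagged"})

print(len(full), len(keyed))  # 2 1
```

In the keyed variant the existing node keeps its original status, which is what ON CREATE SET does; updating it on every run would be the job of ON MATCH SET.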

The scenario continues: Alice's account has been flagged by the risk engine. You need to find every account that shares a device with Alice. In this model a shared device is two relationship hops away (account → device ← account), and accounts that share a device with those accounts are four hops away. That is the fraud ring. In SQL, each extra hop means another self-join. In Cypher: variable-length traversal with a single pattern.

// Find all accounts connected to Alice within 2 hops via shared devices
MATCH (alice:Account {name: 'Alice'})
      -[:USES_DEVICE*1..2]-(connected:Account)
WHERE connected <> alice
RETURN connected.name  AS account,
       connected.status AS status,
       connected.id     AS id;
+---------+--------+---------+
| account | status | id      |
+---------+--------+---------+
| Bob     | clean  | acc_002 |
| Carol   | clean  | acc_003 |
+---------+--------+---------+
2 rows available after 11 ms

-[:USES_DEVICE*1..2]-

Variable-length traversal. The *1..2 means "follow this relationship type between 1 and 2 hops." Change to *1..5 and you get up to five hops, which is deeper ring detection with zero extra code. No UNION, no recursive CTE. The undirected dash (no arrow) means follow in either direction, which is required here: every relationship points account → device, so reaching another account means following one relationship forward and the next one backward.

WHERE connected <> alice

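Excludes Alice herself from the results. With a *1..2 range this is a safeguard rather than a necessity: Cypher never traverses the same relationship twice within one pattern match, so Alice → device → Alice cannot occur in two hops. With a wider range and more than one shared device, a path can loop back to Alice (Alice → device A → Bob → device B → Alice), and the filter removes it. The <> operator compares node identity, not just property values.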
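
For readers who want to see what a bounded, undirected expansion does step by step, here is a rough Python equivalent over an adjacency map. The expand function and the relationship IDs r1 to r3 are made up for illustration; like Cypher, the sketch never reuses a relationship within one path.

```python
def expand(adjacency: dict, start: str, min_hops: int, max_hops: int):
    """Yield (end_node, path) for every path of min_hops..max_hops relationships.

    adjacency maps node -> list of (relationship_id, neighbour). Relationships
    are followed in either direction and never reused within one path.
    """
    stack = [(start, [start], frozenset())]
    while stack:
        node, path, used = stack.pop()
        hops = len(path) - 1
        if hops >= min_hops:
            yield node, path
        if hops == max_hops:
            continue
        for rel_id, neighbour in adjacency.get(node, []):
            if rel_id not in used:
                stack.append((neighbour, path + [neighbour], used | {rel_id}))

adjacency = {
    "Alice":   [("r1", "fp_XK99")],
    "Bob":     [("r2", "fp_XK99")],
    "Carol":   [("r3", "fp_XK99")],
    "fp_XK99": [("r1", "Alice"), ("r2", "Bob"), ("r3", "Carol")],
}
accounts = {"Alice", "Bob", "Carol"}

connected = sorted({end for end, _ in expand(adjacency, "Alice", 1, 2)
                    if end in accounts and end != "Alice"})
print(connected)  # ['Bob', 'Carol']
```

Raising max_hops changes only one argument, which mirrors how changing *1..2 to *1..5 works in the Cypher pattern.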

The scenario continues: The risk team wants to see the actual path between Alice and each connected account, in particular which device is the link. Returning the path object gives them full visibility without running separate queries.

// Return the full path so investigators can see the connection chain
MATCH path = (alice:Account {name: 'Alice'})
             -[:USES_DEVICE*1..2]-(connected:Account)
WHERE connected <> alice
RETURN path,
       length(path)       AS hops,
       connected.name     AS connected_account
ORDER BY hops;
+-----------------------------------------------------------+------+-------------------+
| path                                                      | hops | connected_account |
+-----------------------------------------------------------+------+-------------------+
| (Alice)-[:USES_DEVICE]->(fp_XK99)<-[:USES_DEVICE]-(Bob)   | 2    | Bob               |
| (Alice)-[:USES_DEVICE]->(fp_XK99)<-[:USES_DEVICE]-(Carol) | 2    | Carol             |
+-----------------------------------------------------------+------+-------------------+
2 rows available after 9 ms

path = (alice)...(connected)

Assigning the entire traversal to the variable path captures every node and relationship in the chain. You can return it directly for graph visualization in Neo4j Browser (it draws the graph), or decompose it with functions like nodes(path) and relationships(path) to inspect individual elements.

length(path)

Returns the number of relationships in the path: 2 here because Alice → device ← Bob is two relationship hops. Useful for prioritising investigations: accounts two hops away (sharing a device directly) are more suspicious than those four hops away.

Indexing in Neo4j — Making Lookups Fast

Every traversal in Neo4j starts somewhere. That starting point — the first node in a MATCH — needs to be found quickly. Without an index, Neo4j scans every node with the given label. With an index, it jumps directly to matching nodes. Always index the properties you use in your starting MATCH conditions.

The scenario: Your fraud graph now has 10 million account nodes. Every fraud investigation starts with MATCH (a:Account {id: 'acc_001'}). Without an index on id, Neo4j scans all 10 million Account nodes to find one. With an index, it is a single lookup. You are adding the index before the team runs production queries.

// Create a range index on Account.id (Neo4j 5.x syntax)
CREATE INDEX account_id_idx IF NOT EXISTS
FOR (a:Account) ON (a.id);

// Verify the index was created
SHOW INDEXES
YIELD name, labelsOrTypes, properties, state
WHERE 'Account' IN labelsOrTypes;
+----------------+---------------+------------+--------+
| name           | labelsOrTypes | properties | state  |
+----------------+---------------+------------+--------+
| account_id_idx | ["Account"]   | ["id"]     | ONLINE |
+----------------+---------------+------------+--------+
1 row available after 14 ms

Index created and ONLINE: lookup on Account.id is now O(log n).

CREATE INDEX … IF NOT EXISTS

The IF NOT EXISTS guard makes this idempotent: setup scripts that run multiple times will not fail if the index already exists. Omit it and re-running the script fails with an error saying that an equivalent index already exists.

FOR (a:Account) ON (a.id)

This is a label-property index — specific to the Account label. A MATCH (a:Account {id: 'x'}) query will use it automatically. Neo4j's query planner checks available indexes at query planning time and selects the best one — you do not need query hints for straightforward lookups.

state: ONLINE

Neo4j builds indexes asynchronously. On a large existing dataset the index shows as POPULATING first, then transitions to ONLINE when the backfill is complete. Until it is ONLINE, queries will not use it. Always check state before going to production.
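
A deployment script can enforce the "check state first" rule by polling. The helper below is a sketch: wait_until_online is a hypothetical function, and fetch_state is assumed to be a callable you supply that runs SHOW INDEXES YIELD name, state for the index in question and returns its state string. The sleep argument exists only so the loop can be exercised without real delays.

```python
import time

def wait_until_online(fetch_state, timeout_s: float = 60.0,
                      poll_s: float = 1.0, sleep=time.sleep) -> None:
    """Poll fetch_state() until it returns 'ONLINE'; raise on failure or timeout."""
    deadline = time.monotonic() + timeout_s
    while True:
        state = fetch_state()
        if state == "ONLINE":
            return
        if state == "FAILED":
            raise RuntimeError("index population failed")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"index still {state} after {timeout_s}s")
        sleep(poll_s)

# Simulated backfill: two polls report POPULATING, the third reports ONLINE.
states = iter(["POPULATING", "POPULATING", "ONLINE"])
polls = []

def fake_fetch_state() -> str:
    state = next(states)
    polls.append(state)
    return state

wait_until_online(fake_fetch_state, sleep=lambda seconds: None)
print(polls)  # ['POPULATING', 'POPULATING', 'ONLINE']
```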

Connecting from Python — The Official Driver

The scenario: Your fraud detection service is written in Python. Every time the risk engine flags a transaction, it calls your service to check whether the source account is connected to any known-bad accounts within three hops. You need a production-safe Neo4j connection with a connection pool, parameterised queries to prevent injection, and proper session management.

from neo4j import GraphDatabase

# Driver maintains a connection pool — create once, reuse across requests
driver = GraphDatabase.driver(
    "bolt://localhost:7687",
    auth=("neo4j", "yourpassword")
)

def find_connected_accounts(account_id: str, max_hops: int = 3):
    # Cypher (as of Neo4j 5) does not accept a parameter as a variable-length
    # bound, so the hop limit is validated as a small integer and formatted
    # into the query text. The upper limit of 10 is an arbitrary safety cap.
    max_hops = int(max_hops)
    if not 1 <= max_hops <= 10:
        raise ValueError("max_hops must be between 1 and 10")
    query = f"""
        MATCH (start:Account {{id: $account_id}})
              -[:USES_DEVICE*1..{max_hops}]-(connected:Account)
        WHERE connected <> start
        RETURN connected.id   AS id,
               connected.name AS name,
               connected.status AS status
    """
    with driver.session(database="neo4j") as session:
        # account_id is user input, so it always travels as a parameter
        result = session.run(query, account_id=account_id)
        return [record.data() for record in result]
>>> find_connected_accounts('acc_001', max_hops=2)
[
  {'id': 'acc_002', 'name': 'Bob',   'status': 'clean'},
  {'id': 'acc_003', 'name': 'Carol', 'status': 'clean'}
]
Query completed in 11 ms.

GraphDatabase.driver("bolt://...", auth=(...))

Creates the driver, which manages a connection pool internally. Call this once at application startup and store it. Creating a new driver per request is a common mistake that exhausts file descriptors under load. The driver is thread-safe and designed to be shared.

session.run(query, account_id=account_id)

User-supplied values are passed separately from the query string; never use f-strings to embed user input directly into a Cypher query. Parameterised queries prevent Cypher injection, and Neo4j can cache the plan for a parameterised query, which improves throughput under repeated calls. The hop bound is the one exception here: Cypher does not accept a parameter in that position, so it is formatted into the text only after being validated as a bounded integer.
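
The difference is easy to see without a database by looking at the text each approach would send. The two helper functions and the hostile input below are illustrative only.

```python
def unsafe_query(account_id: str) -> str:
    # Anti-pattern: the input becomes part of the Cypher text itself.
    return f"MATCH (a:Account {{id: '{account_id}'}}) RETURN a"

def safe_query(account_id: str):
    # The text is constant; the value travels separately as a parameter.
    return "MATCH (a:Account {id: $account_id}) RETURN a", {"account_id": account_id}

hostile = "x'}) DETACH DELETE a //"

print(unsafe_query(hostile))
# MATCH (a:Account {id: 'x'}) DETACH DELETE a //'}) RETURN a

text, params = safe_query(hostile)
print(text)
# MATCH (a:Account {id: $account_id}) RETURN a
```

With the f-string, the attacker's input rewrites the statement. With a parameter, the query text is identical for every caller, which is also what lets the plan be reused; the pair would be passed on as session.run(text, params).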

with driver.session(...) as session:

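The context manager closes the session on exit and releases any connection it borrowed back to the pool. A session is a lightweight logical container for a sequence of transactions; it borrows a Bolt connection from the driver's pool when it needs one. Sessions are not thread-safe, so open one per unit of work. Always use sessions as context managers so the connection is released even if an exception is raised.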
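
One simple way to guarantee "create once, reuse everywhere" is to hide driver creation behind a cached accessor. In this sketch FakeDriver is a stand-in so the example runs without a database; in a real service the body of get_driver would call GraphDatabase.driver(...) instead.

```python
from functools import lru_cache

class FakeDriver:
    """Stand-in for the object returned by GraphDatabase.driver(...)."""
    created = 0

    def __init__(self, uri: str):
        FakeDriver.created += 1
        self.uri = uri

@lru_cache(maxsize=None)
def get_driver(uri: str = "bolt://localhost:7687") -> FakeDriver:
    # Runs once per distinct argument; later calls return the cached instance.
    return FakeDriver(uri)

def handle_request(account_id: str) -> str:
    driver = get_driver()            # reuse the shared driver, never re-create it
    return f"{account_id} via {driver.uri}"

for i in range(500):
    handle_request(f"acc_{i:03d}")

print(FakeDriver.created)  # 1
```

Five hundred requests share one driver, so the number of pools and sockets stays constant no matter how much traffic arrives.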

Query Planning — Understanding EXPLAIN and PROFILE

Before pushing a query to production, you want to know whether it is using your indexes or doing a full node scan. Neo4j has two tools for this: EXPLAIN shows the query plan without running the query, and PROFILE runs it and shows actual row counts at each step.

The scenario: A junior engineer wrote a fraud query that is taking 800ms in staging. You suspect it is doing a NodeByLabelScan instead of using the account_id_idx index. You run PROFILE to confirm before pushing a fix.

// PROFILE runs the query and reports rows processed at each step
PROFILE
MATCH (alice:Account {id: 'acc_001'})
      -[:USES_DEVICE*1..2]-(connected:Account)
WHERE connected <> alice
RETURN connected.name, connected.status;
Operator              | Rows | DB Hits | Details
----------------------+------+---------+----------------------------
ProduceResults        |    2 |       0 | connected.name, status
Filter                |    2 |       4 | connected <> alice
VarLengthExpand(All)  |    2 |       8 | *1..2 via USES_DEVICE
NodeIndexSeek         |    1 |       2 | :Account(id) = 'acc_001'  ✓
----------------------+------+---------+----------------------------
Total DB hits: 14     |  Completed in 9 ms
NodeIndexSeek

This is what you want to see. The query planner used account_id_idx to find Alice with 2 DB hits — one to locate the index entry, one to retrieve the node. If you saw NodeByLabelScan here instead, it would mean Neo4j scanned every Account node — your index is either missing, not ONLINE yet, or the query is written in a way that prevents its use.

DB Hits

The unit of work in Neo4j's query plan. Each DB hit represents one record read from the store. Lower is faster. A full label scan on 10 million nodes would show 10 million DB hits at the NodeByLabelScan step — an immediate red flag that an index is missing.
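
The gap between a scan and a seek can be reproduced with a toy hit counter. This is a simplified model: it uses a Python dictionary where Neo4j would use a range index, and it counts one hit per record touched, so treat the numbers as an illustration of scale rather than as real PROFILE output.

```python
def label_scan(nodes: list, account_id: str):
    """Read every node and filter: one hit per node in the label."""
    hits, matches = 0, []
    for node in nodes:
        hits += 1
        if node["id"] == account_id:
            matches.append(node)
    return matches, hits

def index_seek(index: dict, nodes: list, account_id: str):
    """Look the value up in the index, then fetch only the matching nodes."""
    hits = 1                                   # one hit for the index entry
    positions = index.get(account_id, [])
    hits += len(positions)                     # one hit per node fetched
    return [nodes[p] for p in positions], hits

nodes = [{"id": f"acc_{i:06d}"} for i in range(100_000)]
index = {}
for position, node in enumerate(nodes):
    index.setdefault(node["id"], []).append(position)

print(label_scan(nodes, "acc_000001")[1])         # 100000
print(index_seek(index, nodes, "acc_000001")[1])  # 2
```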

Neo4j Deployment Options

Neo4j Community

Free, open-source, single instance. No clustering, no enterprise features. Perfect for development, prototyping, and small production workloads that do not need HA.

Neo4j Enterprise

Causal clustering for HA, role-based access control, hot backups, multi-database. The version running eBay's fraud system and Airbnb's identity graph.

AuraDB (Cloud)

Neo4j's fully managed cloud service on GCP, AWS, and Azure. Free tier available. No infrastructure to manage — ideal for teams who want graph capabilities without ops overhead.

Neo4j Desktop

Local development environment with a built-in browser, project management, and one-click plugin installation. The fastest way to get started on your laptop.

Teacher's Note

The single most common Neo4j performance mistake is starting a traversal without an index — resulting in a full NodeByLabelScan on millions of nodes. Always run PROFILE on any new query before it hits production, and make sure every property you filter on in a starting MATCH clause has an index behind it.

Practice Questions — You're the Engineer

Scenario:

Your team runs a daily import script that loads new bank accounts into Neo4j from a CSV export. After three days you notice the Account node count is exactly three times the number of unique accounts in your source system. Your script uses CREATE for every row. Which Cypher command should you replace CREATE with to ensure each account is written only once regardless of how many times the script runs?


Scenario:

You run PROFILE on a fraud query that starts with MATCH (a:Account {id: $id}). The profile output shows 9,800,000 DB hits at the first operator and the query takes 2.1 seconds. You look at the operator name — it is not NodeIndexSeek. Your index exists but was created 30 seconds ago on a live cluster. What operator are you seeing, and what does it tell you?


Scenario:

A security audit flags your Neo4j Python service. The engineer who wrote it built queries using f-strings: f"MATCH (a:Account {{id: '{account_id}'}})". The auditor says this creates two problems: a Cypher injection vulnerability, and degraded performance under load because Neo4j cannot cache the query plan. What technique should replace f-string query building?


Quiz — Neo4j in Production

Scenario:

You create an index on Account.id on a production cluster with 50 million Account nodes. Five seconds later your monitoring shows that queries starting with MATCH (a:Account {id: $id}) are still taking 3 seconds and showing 50 million DB hits in PROFILE. You check SHOW INDEXES and see state: POPULATING. What is happening and what should you do?

Scenario:

Your Python fraud service is handling 500 requests per second. A new engineer joined and refactored it so that every incoming request calls GraphDatabase.driver(...) to create a fresh driver, runs the query, then calls driver.close(). Under load the service starts throwing "too many open file descriptors" errors and query latency spikes. What is the correct fix?

Scenario:

Your fraud team asks you to extend ring detection from 2 hops to 3. Your current SQL implementation uses three self-joins — adding a fourth join for the third hop doubles query time to 24 seconds. The product team says 24 seconds is unacceptable. What is the correct approach in Neo4j and why is it faster?
