NoSQL
HBase: The Hadoop Ecosystem's Database Engine
Your Spark jobs churn through petabytes of data on HDFS every night — but the moment the product team asks for a dashboard that shows any user's last 50 actions in under 100ms, batch processing falls flat. HBase is the answer to that exact problem: a distributed, strongly consistent store built directly on top of HDFS that gives you random read/write access at scale, without leaving the Hadoop ecosystem.
Where HBase Fits in the Big Data Stack
HBase was modelled after Google's Bigtable paper, published in 2006. Google needed a system to store and retrieve massive amounts of structured data — web crawl results, satellite imagery, user behaviour — with random access at any row. HBase brought that same model into the open-source Hadoop world.
You already know Cassandra from Lesson 18. Both are column-family stores. But there is a critical difference: Cassandra is purpose-built for operational, low-latency workloads from live applications. HBase is built to sit on top of HDFS and serve both batch pipelines (Spark, MapReduce) and random-access lookups from the same underlying data.
The Hadoop Ecosystem Connection
HBase stores data in HDFS. That means it inherits HDFS's fault tolerance and replication automatically. Your HBase table is a collection of files on HDFS — HBase just gives you a fast, structured way to read and write individual rows within those files without running a full MapReduce job every time.
The HBase Data Model — Rows, Families, and Cells
HBase stores data in tables, but the analogy to SQL ends immediately. Every row is identified by a unique row key. Inside each row, data is grouped into column families, and within each family you can have any number of column qualifiers. The intersection of row key + column family + column qualifier + timestamp is a cell.
HBase Data Model — Visual
| Row Key | profile:name | profile:email | activity:last_login | activity:clicks |
|---|---|---|---|---|
| user_001 | Alice | alice@x.com | 2025-01-10 | 492 |
| user_002 | Bob | bob@y.com | — | 118 |
| user_003 | Carol | — | 2025-01-09 | — |
— means the cell does not exist. No NULL, no placeholder. HBase is sparse by design.
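The sparse cell model can be sketched in a few lines of Python. This is a toy illustration of the addressing scheme (row key, family:qualifier, timestamp), not HBase's actual storage format:

```python
# Toy model of HBase cell addressing: (row, "family:qualifier") -> {timestamp: value}.
# Cells that were never written simply do not exist -- no NULLs are stored.
table = {}

def put(row, col, value, ts):
    table.setdefault((row, col), {})[ts] = value

def get(row, col):
    versions = table.get((row, col))
    if versions is None:
        return None                     # absent cell, not a stored NULL
    return versions[max(versions)]      # highest timestamp wins by default

put("user_001", "profile:name", "Alice", 1736512943210)
put("user_001", "activity:clicks", "492", 1736512943211)

print(get("user_001", "profile:name"))  # Alice
print(get("user_002", "profile:name"))  # None: user_002 never wrote this cell
```

Sparsity falls out of the model: a row with only two populated cells costs storage for exactly two cells.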
HBase Architecture — The Moving Parts
Understanding the architecture tells you exactly why HBase is fast and where its limits are. Three components do all the work.
HMaster
Coordinates the cluster. Assigns regions to RegionServers, handles table DDL (create/delete), monitors RegionServer health. Not on the hot read/write path — clients talk directly to RegionServers after their first lookup.
RegionServer
Where actual data lives. Each RegionServer hosts multiple Regions (horizontal shards of a table). Handles all reads and writes for the rows it owns. Stores writes in MemStore (RAM), then flushes to HFiles on HDFS.
ZooKeeper
Coordination layer. Tracks which RegionServers are alive, stores the META table location (index of all regions), and handles HMaster leader election on failover.
HBase Write Path — What Happens on Every Put
Every put follows the same sequence: the client locates the owning RegionServer via the META table (cached after the first lookup); the RegionServer appends the mutation to its Write-Ahead Log (WAL) on HDFS so an acknowledged write survives a crash; the mutation then lands in the in-memory MemStore and the client receives its acknowledgement. When the MemStore fills, it flushes to an immutable HFile on HDFS, and background compactions later merge small HFiles into larger ones.
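The write path's durability ordering can be sketched as a toy simulation (purely illustrative; names like FLUSH_THRESHOLD are invented for the sketch): the WAL append always precedes the MemStore update, which is why an acknowledged write can be replayed after a RegionServer crash.

```python
# Toy write path: WAL append first, then MemStore, then ack.
wal = []         # stands in for the write-ahead log on HDFS
memstore = {}    # stands in for the in-RAM sorted buffer
hfiles = []      # stands in for immutable HFiles on HDFS
FLUSH_THRESHOLD = 3

def put(row, col, value):
    wal.append((row, col, value))       # 1. durability first
    memstore[(row, col)] = value        # 2. then the in-memory write
    if len(memstore) >= FLUSH_THRESHOLD:
        hfiles.append(dict(memstore))   # 3. flush MemStore to a new HFile
        memstore.clear()
    return "ack"                        # acknowledged after WAL + MemStore

for i in range(4):
    put(f"row_{i}", "meta:duration", str(100 + i))

print(len(hfiles), len(memstore))       # one flushed HFile, one entry still buffered
```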
Row Key Design — The Most Important Decision You'll Make
In HBase, all data is physically sorted by row key. This is both a superpower and a trap. If your row key is a timestamp or an auto-incrementing integer, all your writes land on the same region — one RegionServer gets hammered while the rest sit idle.
The Hotspot Problem
Row keys like user_00001, user_00002, user_00003... are sequential — HBase places them in the same region. All writes pile onto RegionServer #1 while RegionServers #2–#10 handle nothing. Fix it with salting (a short hash prefix), field swapping (put a high-cardinality field first), or reversed timestamps (so new data sorts to the front instead of appending to one ever-hot region).
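A minimal salting sketch, assuming a cluster with roughly ten regions (the helper name and bucket count are hypothetical): the bucket is derived deterministically from the key itself, so readers can recompute the prefix at lookup time.

```python
import hashlib

NUM_BUCKETS = 10  # roughly match your region count

def salted_key(raw_key: str) -> str:
    # Deterministic bucket from the key itself, so reads can recompute it.
    bucket = int(hashlib.md5(raw_key.encode()).hexdigest(), 16) % NUM_BUCKETS
    return f"{bucket:02d}_{raw_key}"

for k in ["user_00001", "user_00002", "user_00003"]:
    print(salted_key(k))
# Sequential keys now fan out across buckets, hence across regions.
```

The trade-off: one logical range scan becomes one scan per bucket, which callers must fan out and merge.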
Hands-on — Building a Telecom CDR Store
The scenario: You are a data engineer at a telecom company. Your team needs to store call detail records — 50 million rows per day, each containing call duration, cell tower ID, and signal strength. The billing team needs to query any subscriber's full call history instantly. Spark jobs scan full days for fraud analysis. You are setting up the HBase schema and validating it end to end.
# Start HBase in standalone mode (local dev/test)
start-hbase.sh
# Open the HBase shell
hbase shell
# Create table with two column families
create 'call_records', 'meta', 'signal'
# List tables to confirm
list
hbase(main):001:0> create 'call_records', 'meta', 'signal'
Created table call_records
Took 1.4210 seconds
=> Hbase::Table - call_records
hbase(main):002:0> list
TABLE
call_records
1 row(s)
Took 0.0310 seconds
hbase shell
Opens the interactive JRuby shell. Every command — create, put, get, scan — is a Ruby method call. The shell is for admin work and prototyping. Production pipelines use the Java API, Python happybase, or a Thrift client.
create 'call_records', 'meta', 'signal'
Creates a table with two column families: meta (duration, tower ID) and signal (RSSI, quality). Column families must be declared upfront — column qualifiers within each family can be added freely at write time without any schema change.
The scenario continues: You insert test CDRs to validate the schema. Row keys are encoded as phoneNumber_reverseTimestamp — the high-cardinality phone number prefix spreads writes evenly across regions, and the reverse timestamp ensures each subscriber's most recent call sorts first.
# Row key: phoneNumber_reverseTimestamp (hotspot-safe)
put 'call_records', '447700900123_9999999987654', 'meta:duration', '142'
put 'call_records', '447700900123_9999999987654', 'meta:tower_id', 'TOWER-88'
put 'call_records', '447700900123_9999999987654', 'signal:rssi', '-72'
# Read the row back
get 'call_records', '447700900123_9999999987654'
COLUMN                        CELL
 meta:duration                timestamp=1736512943210, value=142
 meta:tower_id                timestamp=1736512943318, value=TOWER-88
 signal:rssi                  timestamp=1736512943404, value=-72
3 row(s)
Took 0.0048 seconds
put 'table', 'rowkey', 'family:qualifier', 'value'
Each put writes exactly one cell. No schema enforcement — you can write signal:new_metric at any time without an ALTER TABLE. HBase is fully schema-flexible within a column family.
'447700900123_9999999987654'
Phone number + reversed timestamp. 9999999999999 minus the actual timestamp means the most recent call sorts first for that subscriber — the reverse-timestamp pattern. Essential for "get latest N calls" without scanning the subscriber's entire history.
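The key construction can be sketched as a small helper (hypothetical, but matching the format used in the puts above):

```python
MAX_TS = 9_999_999_999_999   # wider than any millisecond epoch timestamp

def cdr_row_key(phone: str, call_ts_ms: int) -> str:
    # Newer calls get smaller suffixes, so they sort first for this subscriber.
    return f"{phone}_{MAX_TS - call_ts_ms}"

older = cdr_row_key("447700900123", 1736512000000)
newer = cdr_row_key("447700900123", 1736512943210)
assert newer < older   # lexicographic order puts the newest call first
```

Because both suffixes have the same digit width, lexicographic byte order matches numeric order, which is exactly what HBase's sorted row keys rely on.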
timestamp=1736512943210
HBase auto-records a version timestamp for every cell write. You can configure a column family to store multiple versions and retrieve historical values by timestamp. Default max versions is 1.
The scenario continues: Production pipelines will use Python. Data engineers need to bulk-load test records from scripts running on Spark workers. You demonstrate programmatic access using the happybase library with batched writes.
import happybase
# Connect via HBase Thrift server (must be running: hbase thrift start)
connection = happybase.Connection('localhost', port=9090)
connection.open()
# Reference the table
table = connection.table('call_records')
# Bulk write using batch — collapses many puts into fewer Thrift calls
with table.batch(batch_size=100) as batch:
    for i in range(3):
        row_key = f'447700900{i:03d}_9999999987{i:03d}'.encode()
        batch.put(row_key, {
            b'meta:duration': str(120 + i * 10).encode(),
            b'meta:tower_id': f'TOWER-{80 + i}'.encode(),
            b'signal:rssi': str(-70 - i).encode(),
        })
print("Batch written successfully")
Batch written successfully
happybase.Connection('localhost', port=9090)
HBase has no native Python driver. Python communicates via the HBase Thrift server — a process that translates Thrift RPC calls into HBase Java API calls. You must start it separately with hbase thrift start. Without it, every connection attempt is refused on port 9090.
table.batch(batch_size=100)
Buffers writes locally and sends them in groups of 100. Without batching, every put is a separate Thrift round trip — on 10,000 rows that is 10,000 individual calls. Batching collapses this to roughly 100. Critical for any bulk load pipeline.
b'meta:duration'
HBase stores everything as raw bytes. The b'' prefix is mandatory — keys, column names, and values must all be bytes objects in Python 3 happybase. Passing a plain str raises a TypeError, an easy way to lose twenty minutes during a live deploy.
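One way to keep the bytes requirement out of pipeline code is a small encoding shim at the boundary (a hypothetical convenience helper, not part of happybase):

```python
def to_bytes_row(data: dict) -> dict:
    # Encode every column name and value so happybase never sees a plain str.
    return {
        (k if isinstance(k, bytes) else k.encode()):
        (v if isinstance(v, bytes) else str(v).encode())
        for k, v in data.items()
    }

row = to_bytes_row({"meta:duration": 142, "signal:rssi": -72})
print(row)  # {b'meta:duration': b'142', b'signal:rssi': b'-72'}
```

Callers can then write natural Python values and let the shim handle the wire format.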
The scenario continues: The billing team runs end-of-month jobs totalling call duration per subscriber. They need all records for one phone number across a date range. Your row key prefix design makes this a clean range scan — no secondary index required.
# Scan all records for one subscriber using prefix
# row_prefix is internally translated to a start-row / stop-row range
for key, data in table.scan(
        row_prefix=b'447700900123_',  # Only this subscriber
        columns=[b'meta:duration']    # Column projection — skip signal data entirely
):
    duration = int(data[b'meta:duration'])
    print(f"  Row: {key.decode()[:30]}... Duration: {duration}s")
  Row: 447700900123_9999999987654... Duration: 142s
  Row: 447700900123_9999999987000... Duration: 198s
  Row: 447700900123_9999999986500... Duration: 67s
Scan complete: 3 rows in 4.2ms
row_prefix=b'447700900123_'
HBase translates this into a start-row / stop-row scan internally. It touches only the RegionServer(s) owning rows with this prefix. No full table scan. No secondary index. This is precisely why row key design is the most consequential HBase decision you make.
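The translation can be sketched as follows (a simplification: a production client must also handle the edge case of a prefix whose final byte is 0xFF):

```python
def prefix_to_range(prefix: bytes) -> tuple:
    # Scan starts at the prefix itself and stops just past the last
    # possible key sharing it (final byte incremented by one).
    start = prefix
    stop = prefix[:-1] + bytes([prefix[-1] + 1])
    return start, stop

start, stop = prefix_to_range(b"447700900123_")
print(start, stop)   # b'447700900123_' b'447700900123`'
```

Because rows are stored sorted, this start/stop pair bounds the scan to a contiguous slice of exactly one subscriber's keys.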
columns=[b'meta:duration']
Because HBase stores each column family in separate HDFS files, this projection means the billing job never reads signal:rssi data at all. Massive I/O savings when scanning millions of CDR rows across a month.
HBase vs Cassandra — Picking the Right Column Store
| Factor | HBase | Cassandra |
|---|---|---|
| Primary use case | Batch analytics + random access on HDFS data | Low-latency operational reads/writes |
| Ecosystem | Hadoop, Spark, HDFS, MapReduce | Standalone, integrates with Kafka, Spark |
| Consistency model | Strong (single master per region) | Tunable (eventual to strong) |
| Write latency | 1–10ms (WAL overhead) | <1ms (memtable acknowledgement) |
| Operational complexity | High — needs ZooKeeper, HDFS, HMaster | Medium — peer-to-peer, fewer moving parts |
| Best for | Existing Hadoop shop, analytical pipelines | Greenfield apps, global always-on systems |
Teacher's Note
Most teams running HBase are already deep in the Hadoop ecosystem — HDFS clusters, Spark jobs, data engineers living in YARN. If you are starting fresh without HDFS, Cassandra is almost always the easier operational path. HBase's specific power is as the serving layer for a pipeline that already produces data on HDFS — Spark writes processed data, HBase serves the real-time point lookups on that same dataset. That combination is genuinely hard to beat when you are already in that world.
Practice Questions — You're the Engineer
Scenario:
You designed an IoT ingest table with row keys of the form sensor_001_1736512943. After deploying to production, one RegionServer is handling 90% of all writes while the other nine are nearly idle. A colleague suggests encoding the timestamp as 9999999999999 minus the actual timestamp in the row key suffix instead of the raw value. What row key technique are they recommending?
Scenario:
You are modelling a product catalog table. Every product lookup needs price, sku, and weight — but shipping and review data should never be read during a product lookup. You want these groups stored in physically separate HDFS files so reads stay as narrow as possible. What HBase grouping mechanism lets you achieve this?
Scenario:
A developer's Python happybase script fails instantly with Connection refused on port 9090. You confirm HBase is running fine — the shell connects and queries work perfectly. You tell the developer to run one additional command to open a bridge between Python and the HBase Java API. What process do they need to start?
Quiz — HBase in Production
Scenario:
A Spark bulk-load pipeline writes every record with its own table.put() call. The pipeline intermittently fails with Thrift server timeouts under load. You see 50,000 individual connections per minute in the Thrift server logs. What is the most direct fix?
Up Next · Lesson 20
Graph Databases
When your data is all about relationships — friends, fraud rings, recommendations — graphs change everything.