NoSQL Lesson 19 – HBase Overview | Dataplexa
NoSQL Database Types · Lesson 19

HBase: Hadoop's Built-In Database Engine

Your Spark jobs churn through petabytes of data on HDFS every night — but the moment product asks for a dashboard that shows any user's last 50 actions in under 100ms, batch processing falls flat. HBase is the answer to that exact problem: a distributed, strongly consistent store built directly on top of HDFS that gives you random read/write access at scale, without leaving the Hadoop ecosystem.

Where HBase Fits in the Big Data Stack

HBase was modelled after Google's Bigtable paper, published in 2006. Google needed a system to store and retrieve massive amounts of structured data — web crawl results, satellite imagery, user behaviour — with random access at any row. HBase brought that same model into the open-source Hadoop world.

You already know Cassandra from Lesson 18. Both are column-family stores. But there is a critical difference: Cassandra is purpose-built for operational, low-latency workloads from live applications. HBase is built to sit on top of HDFS and serve both batch pipelines (Spark, MapReduce) and random-access lookups from the same underlying data.

The Hadoop Ecosystem Connection

HBase stores data in HDFS. That means it inherits HDFS's fault tolerance and replication automatically. Your HBase table is a collection of files on HDFS — HBase just gives you a fast, structured way to read and write individual rows within those files without running a full MapReduce job every time.

The HBase Data Model — Rows, Families, and Cells

HBase stores data in tables, but the analogy to SQL ends immediately. Every row is identified by a unique row key. Inside each row, data is grouped into column families, and within each family you can have any number of column qualifiers. The intersection of row key + column family + column qualifier + timestamp is a cell.

HBase Data Model — Visual

Row Key    | CF: profile             | CF: activity
           | name     email          | last_login    clicks
user_001   | Alice    alice@x.com    | 2025-01-10    492
user_002   | Bob      bob@y.com      | —             118
user_003   | Carol    —              | 2025-01-09    —

— means the cell does not exist. No NULL, no placeholder. HBase is sparse by design.
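That sparse, four-part cell coordinate can be mimicked with plain Python dictionaries — an illustration of the model only (the row keys and values echo the table above; real HBase stores sorted KeyValue entries in HFiles, not dicts):

```python
# Toy in-memory model of HBase's sparse cell map.
# Layout: row key -> {"family:qualifier" -> {timestamp -> value}}
TS = 1_736_500_000_000  # an arbitrary example write timestamp (ms)

table = {
    "user_001": {
        "profile:name":        {TS: "Alice"},
        "profile:email":       {TS: "alice@x.com"},
        "activity:last_login": {TS: "2025-01-10"},
        "activity:clicks":     {TS: "492"},
    },
    "user_002": {
        "profile:name":    {TS: "Bob"},
        "profile:email":   {TS: "bob@y.com"},
        "activity:clicks": {TS: "118"},  # no last_login cell: sparse, not NULL
    },
}

def get_cell(row_key, column, max_ts=None):
    """Return the newest version at or before max_ts, or None if absent."""
    versions = table.get(row_key, {}).get(column)
    if not versions:
        return None  # absent cells cost nothing -- no NULL is ever stored
    candidates = [t for t in versions if max_ts is None or t <= max_ts]
    return versions[max(candidates)] if candidates else None

print(get_cell("user_001", "profile:name"))         # Alice
print(get_cell("user_002", "activity:last_login"))  # None -- cell never written
```

Note that asking for a cell that was never written returns nothing at all — there is no placeholder to store or skip, which is exactly what makes wide, mostly-empty tables cheap in HBase.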

HBase Architecture — The Moving Parts

Understanding the architecture tells you exactly why HBase is fast and where its limits are. Three components do all the work.

HMaster

Coordinates the cluster. Assigns regions to RegionServers, handles table DDL (create/delete), monitors RegionServer health. Not on the hot read/write path — clients talk directly to RegionServers after their first lookup.

RegionServer

Where actual data lives. Each RegionServer hosts multiple Regions (horizontal shards of a table). Handles all reads and writes for the rows it owns. Stores writes in MemStore (RAM), then flushes to HFiles on HDFS.

ZooKeeper

Coordination layer. Tracks which RegionServers are alive, stores the META table location (index of all regions), and handles HMaster leader election on failover.

HBase Write Path — What Happens on Every Put

Step 1 Write appended to WAL (Write-Ahead Log) on disk — sequential, crash-safe
Step 2 Data written to MemStore (in-memory buffer) → client gets success acknowledgement
Step 3 When MemStore fills (default 128MB) → flushed to immutable HFile on HDFS
Step 4 Background compaction merges HFiles periodically → keeps reads fast over time
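The four steps can be mimicked with a toy Python simulation — the class name, the 3-edit flush threshold, and the dict-based "HFiles" are deliberately simplified stand-ins (real HBase flushes at 128MB per MemStore), but the ordering of the steps is the same:

```python
# Toy simulation of the HBase write path (illustration only).
class RegionSim:
    def __init__(self, flush_threshold=3):
        self.flush_threshold = flush_threshold
        self.wal = []        # Step 1: sequential, crash-safe write-ahead log
        self.memstore = {}   # Step 2: in-memory buffer; client is acked here
        self.hfiles = []     # Step 3: immutable flushed files on "HDFS"

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # WAL first, so a crash loses nothing
        self.memstore[row_key] = value     # then MemStore; ack returns to client
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Step 3: the MemStore becomes an immutable, sorted "HFile"
        self.hfiles.append(dict(sorted(self.memstore.items())))
        self.memstore = {}

    def compact(self):
        # Step 4: merge small HFiles into one so reads check fewer files
        merged = {}
        for hfile in self.hfiles:          # later files win on duplicate keys
            merged.update(hfile)
        self.hfiles = [merged]

region = RegionSim()
for i in range(6):
    region.put(f"row_{i}", i)
print(len(region.hfiles))   # 2 -- two flushes of three edits each
region.compact()
print(len(region.hfiles))   # 1 -- compaction merged them
```

The simulation makes the latency trade visible: a put touches only an append and a dict insert (fast), while the expensive disk work — flushing and compacting — happens off the acknowledgement path.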

Row Key Design — The Most Important Decision You'll Make

In HBase, all data is physically sorted by row key. This is both a superpower and a trap. If your row key is a timestamp or an auto-incrementing integer, all your writes land on the same region — one RegionServer gets hammered while the rest sit idle.

The Hotspot Problem

Row keys like user_00001, user_00002, user_00003... are sequential — HBase places them in the same region. All writes pile onto RegionServer #1 while RegionServers #2–#10 handle nothing. Fix it with salting (prepend a stable hash prefix), field swapping (lead with a high-cardinality field such as a user or device ID), or hashing the key outright. Reversed timestamps, used later in this lesson, solve a related but different problem: they make the newest data sort first within a prefix, and they only avoid hotspots when placed after a high-cardinality prefix.
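Salting can be sketched in a few lines of Python — the bucket count and two-digit key format here are illustrative assumptions, not HBase requirements:

```python
import hashlib

NUM_SALT_BUCKETS = 10  # assumption: one bucket per region in a 10-region table

def salted_key(raw_key: str) -> str:
    # A stable hash of the key picks a bucket; the two-digit prefix spreads
    # otherwise-sequential keys across all NUM_SALT_BUCKETS regions.
    bucket = int(hashlib.md5(raw_key.encode()).hexdigest(), 16) % NUM_SALT_BUCKETS
    return f"{bucket:02d}_{raw_key}"

# Sequential keys now land in different buckets instead of one hot region:
for k in ("user_00001", "user_00002", "user_00003"):
    print(salted_key(k))
```

The trade-off: a range scan over the original key order now needs one scan per bucket and a client-side merge, which is why salting suits write-heavy tables that are read by point lookup.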

Hands-on — Building a Telecom CDR Store

The scenario: You are a data engineer at a telecom company. Your team needs to store call detail records — 50 million rows per day, each containing call duration, cell tower ID, and signal strength. The billing team needs to query any subscriber's full call history instantly. Spark jobs scan full days for fraud analysis. You are setting up the HBase schema and validating it end to end.

# Start HBase in standalone mode (local dev/test)
start-hbase.sh

# Open the HBase shell
hbase shell

# Create table with two column families
create 'call_records', 'meta', 'signal'

# List tables to confirm
list
hbase(main):001:0> create 'call_records', 'meta', 'signal'
Created table call_records
Took 1.4210 seconds
=> Hbase::Table - call_records

hbase(main):002:0> list
TABLE
call_records
1 row(s)
Took 0.0310 seconds
hbase shell

Opens the interactive JRuby shell. Every command — create, put, get, scan — is a Ruby method call. The shell is for admin work and prototyping. Production pipelines use the Java API, Python happybase, or a Thrift client.

create 'call_records', 'meta', 'signal'

Creates a table with two column families: meta (duration, tower ID) and signal (RSSI, quality). Column families must be declared upfront — column qualifiers within each family can be added freely at write time without any schema change.

The scenario continues: You insert test CDRs to validate the schema. Row keys are encoded as phoneNumber_reverseTimestamp — the high-cardinality phone-number prefix spreads writes evenly across regions, and the reverse timestamp ensures the most recent call for each subscriber sorts first.

# Row key: phoneNumber_reverseTimestamp  (hotspot-safe)
put 'call_records', '447700900123_9999999987654', 'meta:duration', '142'
put 'call_records', '447700900123_9999999987654', 'meta:tower_id', 'TOWER-88'
put 'call_records', '447700900123_9999999987654', 'signal:rssi',   '-72'

# Read the row back
get 'call_records', '447700900123_9999999987654'
COLUMN                  CELL
 meta:duration          timestamp=1736512943210, value=142
 meta:tower_id          timestamp=1736512943318, value=TOWER-88
 signal:rssi            timestamp=1736512943404, value=-72
3 row(s)
Took 0.0048 seconds
put 'table', 'rowkey', 'family:qualifier', 'value'

Each put writes exactly one cell. No schema enforcement — you can write signal:new_metric at any time without an ALTER TABLE. HBase is fully schema-flexible within a column family.

'447700900123_9999999987654'

Phone number + reversed timestamp. 9999999999999 minus actualTimestamp means the most recent call sorts first for that subscriber — the reverse-timestamp pattern. Essential for "get latest N calls" without scanning all of that subscriber's rows.
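A small helper makes the arithmetic concrete — the constant matches the 13-digit value used in the keys above; the function name is just for illustration:

```python
MAX_TS = 9_999_999_999_999  # the 13-digit constant used in the row keys above

def cdr_row_key(phone: str, event_ts_ms: int) -> str:
    # Newer events get a SMALLER suffix, so a plain forward scan
    # returns a subscriber's calls newest-first.
    return f"{phone}_{MAX_TS - event_ts_ms}"

older = cdr_row_key("447700900123", 1_736_512_000_000)
newer = cdr_row_key("447700900123", 1_736_512_943_000)
print(newer < older)   # True: lexicographic order puts the newest call first
```

Because both suffixes stay 13 digits long, string comparison and numeric comparison agree — padding matters, or a shorter suffix would sort out of order.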

timestamp=1736512943210

HBase auto-records a version timestamp for every cell write. You can configure a column family to store multiple versions and retrieve historical values by timestamp. Default max versions is 1.

The scenario continues: Production pipelines will use Python. Data engineers need to bulk-load test records from scripts running on Spark workers. You demonstrate programmatic access using the happybase library with batched writes.

import happybase

# Connect via the HBase Thrift server (must be running: hbase thrift start)
# autoconnect=True (the default) opens the socket immediately
connection = happybase.Connection('localhost', port=9090)

# Reference the table
table = connection.table('call_records')

# Bulk write using batch — collapses many puts into fewer Thrift calls
with table.batch(batch_size=100) as batch:
    for i in range(3):
        row_key = f'447700900{i:03d}_9999999987{i:03d}'.encode()
        batch.put(row_key, {
            b'meta:duration': str(120 + i * 10).encode(),
            b'meta:tower_id': f'TOWER-{80 + i}'.encode(),
            b'signal:rssi':   str(-70 - i).encode(),
        })

print("Batch written successfully")
Batch written successfully
happybase.Connection('localhost', port=9090)

HBase has no native Python driver. Python communicates via the HBase Thrift server — a process that translates Thrift RPC calls into HBase Java API calls. You must start it separately with hbase thrift start. Without it, every connection attempt is refused on port 9090.

table.batch(batch_size=100)

Buffers writes locally and sends them in groups of 100. Without batching, every put is a separate Thrift round trip — on 10,000 rows that is 10,000 individual calls. Batching collapses this to roughly 100. Critical for any bulk load pipeline.

b'meta:duration'

HBase stores everything as raw bytes. The b'' prefix is mandatory — keys, column names, and values must all be bytes objects in Python 3 happybase. Passing a plain string raises a TypeError that will waste 20 minutes on a live deploy.

The scenario continues: The billing team runs end-of-month jobs totalling call duration per subscriber. They need all records for one phone number across a date range. Your row key prefix design makes this a clean range scan — no secondary index required.

# Scan all records for one subscriber using prefix
# row_prefix is internally translated to a start-row / stop-row range
for key, data in table.scan(
    row_prefix=b'447700900123_',  # Only this subscriber
    columns=[b'meta:duration']    # Column projection — skip signal data entirely
):
    duration = int(data[b'meta:duration'])
    print(f"  Row: {key.decode()[:30]}...  Duration: {duration}s")
  Row: 447700900123_9999999987654...  Duration: 142s
  Row: 447700900123_9999999987000...  Duration: 198s
  Row: 447700900123_9999999986500...  Duration: 67s
row_prefix=b'447700900123_'

HBase translates this into a start-row / stop-row scan internally. It touches only the RegionServer(s) owning rows with this prefix. No full table scan. No secondary index. This is precisely why row key design is the most consequential HBase decision you make.
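That translation can be approximated in pure Python — a sketch of the standard "increment the last byte" trick, not happybase's literal source:

```python
def prefix_to_range(prefix: bytes):
    """Turn a row prefix into a (start_row, stop_row) scan range.
    stop_row is the prefix with its final byte incremented; trailing
    0xff bytes are dropped first to carry the increment correctly."""
    stop = bytearray(prefix)
    while stop and stop[-1] == 0xFF:
        stop.pop()                 # carry past 0xff bytes
    if not stop:
        return prefix, None        # all-0xff prefix: scan to end of table
    stop[-1] += 1
    return prefix, bytes(stop)

start, stop = prefix_to_range(b"447700900123_")
rows = [b"447700900123_1", b"447700900124_1", b"447700900123_9"]
hits = [r for r in rows if start <= r < stop]
print(stop)   # the byte just past '_' caps the range
print(hits)   # only the two rows for subscriber 447700900123
```

Every key matching the prefix satisfies start <= key < stop, so the scan visits exactly the contiguous slice of the sorted key space that the prefix defines — no filtering pass over unrelated rows.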

columns=[b'meta:duration']

Because HBase stores each column family in separate HDFS files, this projection means the billing job never reads signal:rssi data at all. Massive I/O savings when scanning millions of CDR rows across a month.

HBase vs Cassandra — Picking the Right Column Store

Factor                  | HBase                                          | Cassandra
Primary use case        | Batch analytics + random access on HDFS data   | Low-latency operational reads/writes
Ecosystem               | Hadoop, Spark, HDFS, MapReduce                 | Standalone; integrates with Kafka, Spark
Consistency model       | Strong (one RegionServer owns each region)     | Tunable (eventual to strong)
Write latency           | 1–10 ms (WAL overhead)                         | <1 ms (memtable acknowledgement)
Operational complexity  | High — needs ZooKeeper, HDFS, HMaster          | Medium — peer-to-peer, fewer moving parts
Best for                | Existing Hadoop shop, analytical pipelines     | Greenfield apps, global always-on systems

Teacher's Note

Most teams running HBase are already deep in the Hadoop ecosystem — HDFS clusters, Spark jobs, data engineers living in YARN. If you are starting fresh without HDFS, Cassandra is almost always the easier operational path. HBase's specific power is as the serving layer for a pipeline that already produces data on HDFS — Spark writes processed data, HBase serves the real-time point lookups on that same dataset. That combination is genuinely hard to beat when you are already in that world.

Practice Questions — You're the Engineer

Scenario:

Your HBase table stores IoT sensor readings with row keys like sensor_001_1736512943. After deploying to production, one RegionServer is handling 90% of all writes while the other nine are nearly idle. A colleague suggests encoding the timestamp as 9999999999999 minus the actual timestamp in the row key suffix instead of the raw value. What row key technique are they recommending?


Scenario:

You are designing an HBase schema for an e-commerce platform. Product detail queries need price, sku, and weight — but shipping and review data should never be read during a product lookup. You want these groups stored in physically separate HDFS files so reads stay as narrow as possible. What HBase grouping mechanism lets you achieve this?


Scenario:

A Python developer on your team is trying to connect to HBase using happybase and keeps getting Connection refused on port 9090. You confirm HBase is running fine — the shell connects and queries work perfectly. You tell the developer to run one additional command to open a bridge between Python and the HBase Java API. What process do they need to start?


Quiz — HBase in Production

Scenario:

Your HBase cluster has 10 RegionServers. Monitoring shows RegionServer #1 at 94% CPU and the rest below 10%. The table uses auto-incrementing integer row keys starting from 1. Your manager asks for the root cause and the fix. What is the correct diagnosis?

Scenario:

Your Python pipeline inserts 50,000 rows per minute into HBase. Each row is written with a separate table.put() call. The pipeline intermittently fails with Thrift server timeouts under load. You see 50,000 individual connections per minute in the Thrift server logs. What is the most direct fix?

Scenario:

Your team runs nightly Spark jobs processing clickstream data stored in HDFS. Product now wants a dashboard showing any user's last 50 actions in under 100ms. A colleague proposes adding a separate Cassandra cluster just for the dashboard. Your tech lead suggests HBase instead. What is the strongest argument for HBase over a separate Cassandra cluster in this specific situation?

Up Next · Lesson 20

Graph Databases

When your data is all about relationships — friends, fraud rings, recommendations — graphs change everything.