While Apache Cassandra is optimized for availability and write performance, another wide-column store emerged from the Hadoop ecosystem with different priorities: Apache HBase. Built as an open-source implementation of Google's Bigtable, HBase provides random, real-time read/write access to massive datasets—billions of rows and millions of columns—while integrating seamlessly with the Hadoop ecosystem for batch analytics.
HBase powers some of the world's most demanding data platforms. Facebook's messaging system originally ran on HBase before moving to internal solutions. Alibaba processes petabytes of data through HBase for their e-commerce platform. Adobe, Yahoo, and countless others rely on HBase for workloads that require both random access and batch processing on the same data.
The key distinction from Cassandra lies in HBase's architectural choices: where Cassandra prioritizes availability in CAP theorem terms, HBase prioritizes consistency. HBase uses a leader-based architecture with ZooKeeper coordination, ensuring strong consistency guarantees that Cassandra cannot provide by default. This makes HBase the natural choice when you need the wide-column model with strong, single-row consistency guarantees.
By the end of this page, you will understand HBase's master-based architecture, how it leverages HDFS for storage and ZooKeeper for coordination, the RegionServer model for data distribution, and when HBase is the right choice compared to Cassandra or other wide-column stores.
HBase's design is a direct descendant of Google's Bigtable paper (2006), which described a distributed storage system for managing petabytes of data across thousands of commodity servers. Understanding this lineage explains many of HBase's architectural decisions.
The Google Bigtable Influence
Google designed Bigtable to meet specific requirements for their internal services:
Bigtable achieved this through a master-based architecture using Google's GFS (distributed file system) for storage and Chubby (distributed lock service) for coordination.
HBase as an Open-Source Clone
HBase maps Bigtable's components to the Hadoop ecosystem:
| Bigtable Component | HBase Equivalent | Function |
|---|---|---|
| GFS | HDFS | Distributed file system for storage |
| Chubby | ZooKeeper | Distributed coordination service |
| Bigtable Master | HMaster | Cluster management, region assignment |
| Tablet Server | RegionServer | Data serving, read/write operations |
| SSTable | HFile | Immutable sorted data files |

These lineage choices give HBase a markedly different profile from Cassandra:

| Aspect | HBase | Cassandra |
|---|---|---|
| CAP Position | CP (Consistent, Partition-tolerant) | AP (Available, Partition-tolerant) |
| Architecture | Master-based (HMaster + RegionServers) | Masterless (peer-to-peer) |
| Storage | HDFS (distributed file system) | Local storage per node |
| Coordination | ZooKeeper (centralized) | Gossip protocol (decentralized) |
| Consistency | Strong (row-level atomicity) | Tunable (eventual to strong) |
| Write model | Single RegionServer per row | Any node, any time |
| Failure handling | Failover with brief unavailability | No single point of failure |
HBase's Niche: Hadoop Ecosystem Integration
HBase's primary advantage is its tight integration with the Hadoop ecosystem:
This makes HBase ideal for organizations already invested in the Hadoop ecosystem who need random access to large datasets that are also processed by batch jobs.
Choose HBase when you need: (1) Strong consistency guarantees per row, (2) integration with Hadoop batch processing, (3) random access to massive datasets, or (4) column-level security and cell-level ACLs. If you don't have Hadoop or don't need these features, Cassandra's simpler operational model may be preferable.
HBase's architecture consists of several interconnected components, each with specific responsibilities. Understanding these components is essential for operations and troubleshooting.
The Master Server (HMaster)
The HMaster handles cluster coordination and metadata management:
Importantly, the HMaster is not on the data path—clients communicate directly with RegionServers for reads and writes. The HMaster handles only metadata operations, making it a less critical single point of failure than it might first appear.
```text
HBASE ARCHITECTURE

Clients (Java / Thrift / REST)
  (1) Get region location from ZooKeeper / hbase:meta
  (2) Read/write directly to the owning RegionServer

ZooKeeper Ensemble
  • Root region location      • Active HMaster election
  • RegionServer liveness     • Cluster configuration
  • Schema version            • Distributed locking

HMaster (Active)              HMaster (Standby)
  • Region assignment           Watches ZK for failover
  • DDL operations
  • Load balancing
  • NOT on the data path

RegionServer 1     RegionServer 2     RegionServer 3   ...
  Region A           Region C           Region E
  Region B           Region D           Region F
  • WAL (local)      • WAL (local)      • WAL (local)
  • MemStore         • MemStore         • MemStore
  • Block Cache      • Block Cache      • Block Cache
       │                  │                  │
       ▼                  ▼                  ▼
HDFS (shared storage layer)
  HFiles (SSTables)  – immutable sorted data files
  WAL files          – write-ahead logs for durability
  Replicated 3x by default
```

RegionServers: The Data Serving Workhorses
RegionServers handle all read and write operations. Each RegionServer:
ZooKeeper: The Coordination Layer
ZooKeeper provides distributed coordination services that HBase depends on:
HDFS: The Storage Layer
Unlike Cassandra (which uses local storage), HBase stores all data on HDFS:
HDFS is optimized for throughput, not latency. Typical HDFS read latency is 5-10ms vs. <1ms for local SSD. HBase mitigates this with aggressive caching (block cache, bucket cache) but cannot match pure local storage performance. Consider this when evaluating HBase for latency-sensitive workloads.
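To make the caching trade-off concrete, here is a back-of-envelope model of expected read latency as a function of block-cache hit rate. The latency figures are hypothetical, drawn from the rough ranges above; real numbers depend on hardware, cache sizing, and workload.

```java
// Back-of-envelope HBase read latency model (illustrative numbers only).
// expected = hitRate * cacheLatency + (1 - hitRate) * hdfsLatency
public class ReadLatencyModel {

    static double expectedLatencyMs(double cacheHitRate,
                                    double cacheLatencyMs,
                                    double hdfsLatencyMs) {
        return cacheHitRate * cacheLatencyMs + (1 - cacheHitRate) * hdfsLatencyMs;
    }

    public static void main(String[] args) {
        // Assume ~0.5ms for a block-cache hit and ~8ms for an HDFS read (hypothetical).
        for (double hit : new double[]{0.50, 0.90, 0.99}) {
            System.out.printf("hit rate %.0f%% -> ~%.2f ms expected%n",
                    hit * 100, expectedLatencyMs(hit, 0.5, 8.0));
        }
    }
}
```

At a 99% hit rate the HDFS penalty nearly disappears, which is why cache sizing dominates HBase read-latency tuning.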
HBase's data model closely follows the Bigtable column-family model we explored earlier, with some HBase-specific terminology and characteristics.
Tables, Rows, Column Families, and Cells
HBase organizes data hierarchically:
```java
// HBase table structure example: User Activity Tracking
// Table: user_activity
// Column Families: 'profile', 'sessions', 'metrics'

// Create table with column families
HTableDescriptor tableDescriptor = new HTableDescriptor(TableName.valueOf("user_activity"));

// Column family: profile (static data, rarely changes)
HColumnDescriptor profileCF = new HColumnDescriptor("profile");
profileCF.setMaxVersions(1);                  // Only keep latest version
profileCF.setTimeToLive(HConstants.FOREVER);  // Never expire
profileCF.setCompression(Compression.Algorithm.SNAPPY);
tableDescriptor.addFamily(profileCF);

// Column family: sessions (time-series, expires after 30 days)
HColumnDescriptor sessionsCF = new HColumnDescriptor("sessions");
sessionsCF.setMaxVersions(1);
sessionsCF.setTimeToLive(30 * 24 * 60 * 60);  // 30 days TTL
sessionsCF.setCompression(Compression.Algorithm.LZ4);
tableDescriptor.addFamily(sessionsCF);

// Column family: metrics (counters, keep history)
HColumnDescriptor metricsCF = new HColumnDescriptor("metrics");
metricsCF.setMaxVersions(100);                // Keep 100 versions for trend analysis
metricsCF.setTimeToLive(365 * 24 * 60 * 60);  // 1 year TTL
tableDescriptor.addFamily(metricsCF);

admin.createTable(tableDescriptor);

// Data structure after writes:
//
// Row Key: "user_12345"
// ├── Column Family: "profile"
// │   ├── "name": "Alice"                  (version: t1)
// │   ├── "email": "alice@example.com"     (version: t1)
// │   └── "created_at": "2024-01-01"       (version: t1)
// ├── Column Family: "sessions"
// │   ├── "sess_20240115_001": {ip: "1.2.3.4", device: "mobile"}   (version: t5)
// │   ├── "sess_20240115_002": {ip: "5.6.7.8", device: "desktop"}  (version: t6)
// │   └── ... (older sessions auto-deleted after 30 days)
// └── Column Family: "metrics"
//     ├── "page_views": "1523"  (versions: t7=1523, t6=1520, t5=1515, ...)
//     └── "clicks": "342"       (versions: t7=342, t6=340, ...)
```

Regions: The Unit of Distribution
A table is horizontally partitioned into regions, where each region contains a contiguous range of row keys:
The region model provides:
```text
TABLE REGION DISTRIBUTION

Table: user_activity (sorted by row key lexicographically)
Row keys: "user_00001" ... "user_99999"

Logical table view:
  user_00001 | user_00002 | ... | user_50000 | ... | user_99999

Physical region distribution:
  Region 1                Region 2                Region 3
  [user_00001,            [user_33334,            [user_66667,
   user_33333]             user_66666]             user_99999]
       │                       │                       │
       ▼                       ▼                       ▼
  RegionServer 1          RegionServer 2          RegionServer 3

Client query: GET user_45000
  1. Client asks ZK/Meta: "Which region has user_45000?"
  2. Answer: Region 2 ([user_33334, user_66666])
  3. Client caches the region location
  4. Client sends the GET directly to RegionServer 2
  5. RegionServer 2 reads from MemStore or HFiles
  6. RegionServer 2 returns the result to the client

Note: the HMaster is NOT involved in data operations!
```

New tables start with a single region, creating a hotspot for writes. Pre-split tables with expected region boundaries:
```
create 'user_data', 'cf1', SPLITS => ['user_3', 'user_6', 'user_9']
```
This immediately distributes writes across multiple RegionServers.
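The client-side region lookup described above amounts to a floor search over sorted region start keys: each region owns a contiguous key range, so the owning region is the one with the greatest start key less than or equal to the row key. A minimal sketch, with illustrative region boundaries and server names (the real client caches hbase:meta entries rather than a hand-built map):

```java
import java.util.TreeMap;

// Sketch of client-side region location. Regions hold contiguous, sorted
// row-key ranges, so finding the right region is a floorEntry lookup on
// region start keys. Boundaries and server names below are hypothetical.
public class RegionLocator {

    // Map: region start key -> RegionServer hosting it ("" = first region).
    static final TreeMap<String, String> REGIONS = new TreeMap<>();
    static {
        REGIONS.put("",           "regionserver1"); // [start, user_33334)
        REGIONS.put("user_33334", "regionserver2"); // [user_33334, user_66667)
        REGIONS.put("user_66667", "regionserver3"); // [user_66667, end)
    }

    static String serverFor(String rowKey) {
        // Greatest start key <= rowKey identifies the owning region
        return REGIONS.floorEntry(rowKey).getValue();
    }

    public static void main(String[] args) {
        System.out.println(serverFor("user_45000")); // falls in the middle region
    }
}
```

This is also why row-key design matters so much in HBase: lexicographic ordering determines both region boundaries and which server absorbs each write.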
HBase's read and write paths are optimized for different access patterns. Understanding these paths is essential for performance tuning and capacity planning.
The Write Path
When a client writes to HBase:
```text
HBASE WRITE PATH

Client
  (1) Put/Delete request for row "user_45000"
  (2) Look up region location (cached or from hbase:meta)
      │
      ▼
RegionServer 2  (hosts the region for user_45000)
  (3) Acquire row lock (ensures atomicity within the row)
  (4) Append to WAL (Write-Ahead Log)
      • Sequential write to HDFS
      • Synced to disk for durability
      • Can batch multiple edits for throughput
  (5) Write to MemStore (in-memory)
      • One MemStore per column family per region
      • Sorted data structure (ConcurrentSkipListMap)
  (6) Release row lock
  (7) Return success to client

Background operations:
  (8) MemStore flush (when size threshold reached, ~128MB)
      • Create a new HFile on HDFS
      • Clear the MemStore
      • Delete the corresponding WAL entries
  (9) Compaction (background process)
      • Minor: merge recent HFiles (reduce file count)
      • Major: merge all HFiles (remove deleted data)
```

Critical WAL behavior: by default the WAL append is synced before the write is acknowledged; syncing can be deferred to trade a small durability window for throughput (controlled by `hbase.regionserver.optionallogflushinterval`).

The Read Path
Reading is more complex because data may exist in multiple locations:
```text
HBASE READ PATH

Client
  (1) Get request for row "user_45000"
  (2) Look up region location (cached)
      │
      ▼
RegionServer 2 — data location search:
  (3) Block Cache (LRU cache of HFile blocks)
      └─ Hot data served from memory
      │  (cache miss)
      ▼
  (4) MemStore (recent writes, always checked)
      └─ O(log n) search in sorted structure
      │  (also check HFiles)
      ▼
  (5) Bloom filters (per HFile, in memory)
      └─ "Might this HFile contain row X?"
      │  (positive → read from disk)
      ▼
  (6) HFile block index (sparse index)
      └─ Find which block contains row X
      ▼
  (7) Read block from HDFS
      └─ Read data block (~64KB), add it to the Block Cache
  (8) Merge results
      └─ Combine MemStore + HFiles
      └─ Apply timestamp filtering
      └─ Return the most recent version(s)
  (9) Return result to client
```

HBase provides strong consistency within rows—a significant difference from Cassandra's eventual-consistency default. This is achieved through a single-RegionServer-per-row design and careful coordination via ZooKeeper.
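The merge step in the read path above can be sketched as follows. This is a deliberate simplification: the real implementation merges sorted scanners over the MemStore and each HFile while honoring versions and delete markers, but the core rule is the same—among candidate cells for a column, the highest timestamp wins. The `Cell` record here is illustrative, not the HBase `Cell` interface.

```java
import java.util.Comparator;
import java.util.List;

// Simplified sketch of read-path result merging: candidate cells for the same
// row/column may come from the MemStore and several HFiles; the returned
// value is the one with the highest timestamp (most recent version).
public class ReadMerge {

    record Cell(String source, long timestamp, String value) {}

    static Cell mergeNewest(List<Cell> candidates) {
        return candidates.stream()
                .max(Comparator.comparingLong(Cell::timestamp))
                .orElseThrow();
    }

    public static void main(String[] args) {
        List<Cell> candidates = List.of(
                new Cell("hfile-1",  100L, "v1"),  // older flushed version
                new Cell("hfile-2",  200L, "v2"),  // newer flushed version
                new Cell("memstore", 300L, "v3")); // unflushed, newest
        System.out.println(mergeNewest(candidates).value()); // prints v3
    }
}
```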
Row-Level Atomicity
HBase guarantees that all operations within a single row are atomic:
However, there are no cross-row transactions in core HBase (though Apache Phoenix adds this capability).
```java
// Row-level atomic operations in HBase

// Atomic multi-column put (all or nothing within a row)
Put put = new Put(Bytes.toBytes("user_12345"));
put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice"));
put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("email"), Bytes.toBytes("alice@example.com"));
put.addColumn(Bytes.toBytes("metrics"), Bytes.toBytes("last_updated"), Bytes.toBytes(System.currentTimeMillis()));
table.put(put); // Atomic: all columns written together

// Check-and-put: read-modify-write atomically
byte[] rowKey = Bytes.toBytes("user_12345");
byte[] family = Bytes.toBytes("profile");
byte[] qualifier = Bytes.toBytes("status");

// Only update if current value is "pending"
Put updatePut = new Put(rowKey);
updatePut.addColumn(family, qualifier, Bytes.toBytes("active"));

boolean success = table.checkAndPut(
    rowKey, family, qualifier,
    Bytes.toBytes("pending"),  // Expected current value
    updatePut                  // New value if condition met
);
// Returns true if the update was applied, false if the condition was not met

// Atomic increment (counter pattern)
long newValue = table.incrementColumnValue(
    Bytes.toBytes("user_12345"),
    Bytes.toBytes("metrics"),
    Bytes.toBytes("page_views"),
    1  // Increment by 1
);
// Atomic: no lost updates even under concurrent increments

// Batch operations within a row are also atomic
RowMutations mutations = new RowMutations(Bytes.toBytes("user_12345"));
mutations.add(new Put(Bytes.toBytes("user_12345"))
    .addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"), Bytes.toBytes("Alice Updated")));
mutations.add(new Delete(Bytes.toBytes("user_12345"))
    .addColumn(Bytes.toBytes("profile"), Bytes.toBytes("old_field")));
table.mutateRow(mutations); // Atomic: both put and delete applied together
```

ZooKeeper's Role in Consistency
ZooKeeper enables HBase's consistency guarantees by providing:
1. RegionServer Liveness Detection
2. HMaster Election
3. Catalog Table (hbase:meta) Location
```text
/hbase                                   (root znode)
├── /hbase/root-region-server            # Location of the hbase:meta region
│     <data>: "regionserver2:16020"
├── /hbase/master                        # Active HMaster
│     <data>: "hmaster1:16000"
├── /hbase/backup-masters                # Standby HMasters
│   ├── hmaster2:16000                   # Ephemeral znodes
│   └── hmaster3:16000
├── /hbase/rs                            # RegionServer znodes
│   ├── regionserver1,16020,1704067200   # Ephemeral: alive while RS is running
│   ├── regionserver2,16020,1704067201
│   ├── regionserver3,16020,1704067202
│   └── ...
├── /hbase/table                         # Table metadata
│   ├── user_activity                    # Table state (enabled/disabled)
│   └── orders
├── /hbase/splitWAL                      # WAL-splitting coordination
│   └── <task znodes for recovery>
└── /hbase/region-in-transition          # Regions being reassigned
    └── <transient state during moves>
```

While HBase can survive individual ZooKeeper node failures (it is a quorum-based system), losing the ZooKeeper quorum (a majority of nodes) will prevent HBase from functioning. Always deploy ZooKeeper as an ensemble of three or more nodes across different failure domains.
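The quorum arithmetic behind that warning is simple majority math: an ensemble of n nodes needs floor(n/2) + 1 nodes to form a quorum, so it tolerates n minus that many failures.

```java
// ZooKeeper ensemble quorum arithmetic: a strict majority (floor(n/2) + 1)
// must be alive, so the ensemble tolerates f = n - quorum failures.
public class QuorumMath {

    static int quorumSize(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    static int toleratedFailures(int ensembleSize) {
        return ensembleSize - quorumSize(ensembleSize);
    }

    public static void main(String[] args) {
        for (int n : new int[]{1, 3, 4, 5}) {
            System.out.printf("ensemble=%d quorum=%d tolerates=%d failure(s)%n",
                    n, quorumSize(n), toleratedFailures(n));
        }
    }
}
```

Note that a 4-node ensemble tolerates no more failures than a 3-node one (both survive a single failure), which is why odd-sized ensembles of 3 or 5 are the standard deployment.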
Both HBase and Cassandra are wide-column stores, but they serve different needs. Understanding their differences helps you choose the right tool for your use case.
Architectural Differences Summary
| Aspect | HBase | Cassandra |
|---|---|---|
| Consistency Model | Strong (row-level) | Eventual (tunable) |
| Architecture | Master-based (HMaster) | Masterless (peer-to-peer) |
| Storage Backend | HDFS | Local disk per node |
| Coordination | ZooKeeper | Gossip protocol |
| Write Latency | ~10-20ms (HDFS overhead) | ~1-5ms (local writes) |
| Read Latency | ~5-50ms (depends on cache) | ~1-10ms |
| Availability | Region unavailable during failover (~30s) | Always available |
| Scaling | Add RegionServers + wait for rebalance | Add nodes, vnodes auto-balance |
| Multi-DC | Async cluster replication or CopyTable | Built-in multi-DC replication |
| Operational Complexity | High (ZK + HDFS + HBase) | Medium (single system) |
| Hadoop Integration | Native MapReduce/Spark support | External connectors |
Choose HBase When:
Choose Cassandra When:
One practical heuristic: If you're already using Hadoop and need random access to data that's also processed by Spark/MapReduce, choose HBase. If you're starting fresh and need a distributed database with simple operations and high availability, choose Cassandra.
We've explored Apache HBase as the Hadoop-native wide-column store, understanding its architecture, data model, and trade-offs compared to Cassandra. Let's consolidate the key insights:
What's Next:
With a deep understanding of both column-family model fundamentals and two major implementations (Cassandra and HBase), we'll now explore the workload characteristics that make wide-column stores shine. The next page examines write-optimized workloads—understanding why these databases excel at high-throughput writes and how to design systems that leverage this strength.
You now understand HBase's architecture, consistency model, and how it differs from Cassandra. This knowledge enables you to evaluate HBase for your use cases, design effective data models, and make informed decisions about which wide-column store fits your requirements.