Cloud Computing (2024-2025)-SEM-II

March 01, 2025

HBase in Cloud Computing

HBase is an open-source, distributed, scalable NoSQL database modeled after Google's Bigtable. It is part of the Hadoop ecosystem and is built on top of the Hadoop Distributed File System (HDFS) to store large datasets in a fault-tolerant way.

In cloud computing, HBase is widely used because of its scalability, fault tolerance, and ability to integrate seamlessly with distributed storage systems like HDFS, which are commonly used in cloud environments. HBase is often deployed on cloud platforms such as Amazon Web Services (AWS), Google Cloud Platform (GCP), and Microsoft Azure, offering distributed data storage that can grow as demand increases.

HBase Architecture

HBase architecture is designed to provide both scalability and fault tolerance in a distributed environment. The main components of HBase architecture include:

1. HBase Tables

• HBase stores data in tables. Each table is a collection of rows, and each row is uniquely identified by a row key.

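• Data in HBase is stored as key-value pairs: each value (a cell) is addressed by its row key, column family, column qualifier, and timestamp. Each row contains one or more columns, and every column belongs to a column family.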

• Tables are split into regions based on the row key, and these regions are distributed across multiple servers.
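The data model above can be sketched as a small in-memory structure. This is a toy model only, not the HBase client API: the class name ToyTable and its methods are made up for illustration, and it assumes string keys and integer timestamps.

```python
from collections import defaultdict


class ToyTable:
    """Toy in-memory model of an HBase table (not the HBase client API)."""

    def __init__(self, column_families):
        # Column families are fixed when the table is created.
        self.column_families = set(column_families)
        # row key -> "family:qualifier" -> {timestamp: value}
        self.rows = defaultdict(lambda: defaultdict(dict))

    def put(self, row_key, family, qualifier, value, timestamp):
        if family not in self.column_families:
            raise KeyError(f"unknown column family: {family}")
        self.rows[row_key][f"{family}:{qualifier}"][timestamp] = value

    def get(self, row_key, family, qualifier):
        """Return the newest version of one cell, or None if it is absent."""
        versions = self.rows.get(row_key, {}).get(f"{family}:{qualifier}", {})
        if not versions:
            return None
        return versions[max(versions)]

    def scan(self):
        """Return row keys in sorted order, mirroring HBase's sorted row keys."""
        return sorted(self.rows)


table = ToyTable(["info"])
table.put("user#002", "info", "name", "Bob", timestamp=1)
table.put("user#001", "info", "name", "Alice", timestamp=1)
table.put("user#001", "info", "name", "Alicia", timestamp=2)
print(table.get("user#001", "info", "name"))  # the newest version is returned
print(table.scan())
```

Because row keys are kept in sorted order, a contiguous range of keys can be handed to one region, which is what makes splitting by row key possible.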

2. Regions and Region Servers

• Regions: An HBase table is horizontally partitioned into regions, where each region holds a range of rows. Initially, a table contains one region, but as more data is added, the region grows and eventually splits into two.

• Region Servers:

• Region servers are responsible for managing regions and handling client requests (read and write operations).

• Each region server hosts multiple regions.

• Region servers split regions that have become too large; the resulting regions are then assigned to region servers by HMaster.

• The MemStore (in-memory storage) stores newly written data before it's flushed to disk as HFiles (HBase's data storage format).
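Since each region holds a contiguous range of row keys, finding the region for a given row is a search over the sorted region start keys. The sketch below illustrates the idea with made-up start keys and server names (rs1, rs2, rs3); it is not how the HBase client is implemented, only the underlying lookup.

```python
import bisect

# Each region covers [its start key, the next region's start key).
# The first region starts at the empty key, so every row key falls somewhere.
region_start_keys = ["", "g", "n", "t"]
region_hosts = ["rs1", "rs2", "rs1", "rs3"]  # hypothetical region servers


def locate_region(row_key):
    """Return (region start key, hosting server) for a row key."""
    idx = bisect.bisect_right(region_start_keys, row_key) - 1
    return region_start_keys[idx], region_hosts[idx]


print(locate_region("apple"))
print(locate_region("g"))
print(locate_region("zebra"))
```

Note that one server (rs1 here) can host several regions, matching the point that each region server hosts multiple regions.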

3. HMaster

• HMaster is the primary coordinator of the HBase cluster. It manages the assignment of regions to region servers, monitors the health of region servers, and assigns the new regions produced when a region server splits a region.

• Responsibilities of HMaster:

• Region Assignment: When a region server starts up, HMaster assigns it regions to manage.

• Region Balancing: HMaster balances the distribution of regions across region servers to avoid hotspots (servers becoming overloaded).

• Schema Changes: HMaster handles requests to add or modify column families in a table.
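Region balancing can be pictured as moving regions from the busiest server to the least busy one until the counts are even. The function below is a simplified sketch that balances by region count only; the real HBase balancer weighs more factors, and the server and region names are invented for the example.

```python
def balance(assignment):
    """Even out region counts across servers.

    assignment maps a server name to its list of regions and is modified
    in place. Returns the list of (region, from_server, to_server) moves.
    """
    moves = []
    while True:
        busiest = max(assignment, key=lambda s: len(assignment[s]))
        idlest = min(assignment, key=lambda s: len(assignment[s]))
        # Stop once no server holds more than one region above another.
        if len(assignment[busiest]) - len(assignment[idlest]) <= 1:
            return moves
        region = assignment[busiest].pop()
        assignment[idlest].append(region)
        moves.append((region, busiest, idlest))


cluster = {"rs1": ["r1", "r2", "r3", "r4", "r5"], "rs2": ["r6"], "rs3": []}
moves = balance(cluster)
print(moves)
print({server: len(regions) for server, regions in cluster.items()})
```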

4. ZooKeeper

• Apache ZooKeeper is used as a coordination and synchronization service for HBase. It ensures that only one HMaster is active at any time and helps in region server failover. ZooKeeper is essential for:

• Region Server Coordination: ZooKeeper keeps track of which region servers are alive; HMaster uses this information when assigning regions to servers.

• Leader Election: ZooKeeper is responsible for electing the active HMaster (in case of HMaster failure, a new one is chosen).

• Metadata Management: It stores cluster state, such as the location of the catalog table (hbase:meta) that maps regions to region servers.
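The "only one active HMaster" guarantee can be pictured with an ephemeral node: whichever master creates the node first is active, and the node disappears when that master's session ends. The class below is a toy stand-in written for this note, not the ZooKeeper API, and the path and master names are used only as illustrative labels.

```python
class ToyZooKeeper:
    """Toy model of ephemeral nodes (not the real ZooKeeper client API)."""

    def __init__(self):
        self.ephemeral = {}  # path -> owning session

    def create_ephemeral(self, path, owner):
        """Create the node if it does not exist; return True on success."""
        if path in self.ephemeral:
            return False
        self.ephemeral[path] = owner
        return True

    def session_expired(self, owner):
        """Remove every ephemeral node owned by a dead session."""
        self.ephemeral = {p: o for p, o in self.ephemeral.items() if o != owner}

    def get(self, path):
        return self.ephemeral.get(path)


zk = ToyZooKeeper()
master_path = "/hbase/master"  # illustrative label for the active-master node

first = zk.create_ephemeral(master_path, "master-a")   # master-a becomes active
second = zk.create_ephemeral(master_path, "master-b")  # master-b stays on standby
zk.session_expired("master-a")                         # master-a fails
retry = zk.create_ephemeral(master_path, "master-b")   # master-b takes over
print(first, second, retry, zk.get(master_path))
```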

5. HDFS (Hadoop Distributed File System)

• HBase is built on top of HDFS, which provides reliable and scalable storage for HBase data.

• HBase stores data files (HFiles) and write-ahead logs (WAL) in HDFS.

• Write-Ahead Log (WAL): HBase uses WAL for durability. When data is written to HBase, it is first recorded in the WAL before being written to MemStore, ensuring data is not lost in case of a crash.

How HBase Works

1. Data Write Operation:

• When a client writes data to HBase:

• The data is first written to the Write-Ahead Log (WAL) for durability. The WAL ensures that data can be recovered in case of a failure.

• The data is also stored in MemStore (an in-memory storage).

• When the MemStore becomes full, it is flushed to HDFS as an HFile.
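The write steps above can be sketched as a toy region that logs first, updates memory second, and flushes when a threshold is reached. Everything here is an assumption made for illustration: the class ToyRegion, the tiny flush threshold of three rows, and the use of plain Python lists in place of files on HDFS.

```python
class ToyRegion:
    """Toy model of the HBase write path (WAL, MemStore, flush)."""

    def __init__(self, flush_threshold=3):
        self.wal = []            # append-only log, stands in for the WAL on HDFS
        self.memstore = {}       # row key -> value, held in memory
        self.hfiles = []         # each flush produces one sorted, immutable file
        self.flushed_upto = 0    # number of WAL entries already safe in HFiles
        self.flush_threshold = flush_threshold

    def put(self, row_key, value):
        self.wal.append((row_key, value))  # 1. record in the WAL for durability
        self.memstore[row_key] = value     # 2. store in the MemStore
        if len(self.memstore) >= self.flush_threshold:
            self.flush()                   # 3. flush when the MemStore is "full"

    def flush(self):
        self.hfiles.append(sorted(self.memstore.items()))
        self.memstore = {}
        self.flushed_upto = len(self.wal)

    def replay_wal(self):
        """Rebuild the MemStore from unflushed WAL entries after a crash."""
        self.memstore = {}
        for row_key, value in self.wal[self.flushed_upto:]:
            self.memstore[row_key] = value


region = ToyRegion(flush_threshold=3)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("d", 4)]:
    region.put(key, value)

region.memstore = {}   # simulate a crash that wipes memory
region.replay_wal()    # the unflushed write to "d" is recovered from the log
print(region.hfiles, region.memstore)
```

The example shows why the log is written before the MemStore: data that has not yet been flushed survives a crash only because it is in the WAL.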

2. Data Read Operation:

• When a client reads data from HBase:

• The system first checks the MemStore for any recent data not yet written to disk.

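• If the data is not in MemStore, it is fetched from HFiles stored in HDFS. Recently read blocks are kept in an in-memory block cache so that repeated reads do not always go to disk.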
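The read order described above can be sketched as a lookup that tries the MemStore first and then the HFiles from newest to oldest. This is a simplification under stated assumptions: the MemStore and HFiles are plain dictionaries, the function name read is made up, and timestamps and caching are left out.

```python
def read(row_key, memstore, hfiles):
    """Return the most recent value for row_key, or None if it is not found.

    hfiles is ordered from oldest to newest flush.
    """
    if row_key in memstore:            # recent data not yet written to disk
        return memstore[row_key]
    for hfile in reversed(hfiles):     # newest HFile first
        if row_key in hfile:
            return hfile[row_key]
    return None


memstore = {"row3": "c2"}
hfiles = [{"row1": "a1", "row3": "c1"}, {"row1": "a2"}]

print(read("row3", memstore, hfiles))  # found in the MemStore
print(read("row1", memstore, hfiles))  # found in the newest HFile
print(read("row9", memstore, hfiles))  # not stored anywhere
```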

3. Region Splitting:

• As the size of a region grows (due to new data writes), it eventually becomes too large.

• HBase automatically splits the region into two smaller regions, each containing roughly half of the rows.

• These new regions are assigned to region servers, allowing load to be distributed across multiple servers.
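A region split can be illustrated by cutting a sorted list of row keys at its midpoint. This is only a sketch of the idea: the function split_region is hypothetical, and real HBase derives the split point from its store files rather than by counting rows.

```python
def split_region(row_keys):
    """Split a sorted list of row keys into two daughter regions.

    Returns (first_half, second_half, split_key), where split_key is the
    first row key of the second daughter region.
    """
    mid = len(row_keys) // 2
    return row_keys[:mid], row_keys[mid:], row_keys[mid]


keys = ["a", "c", "f", "k", "p", "x"]
first_half, second_half, split_key = split_region(keys)
print(first_half, second_half, split_key)
```

After the split, the two daughter regions can be hosted by different region servers, which is how load spreads as a table grows.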

4. Compaction:

• Over time, multiple HFiles are created due to repeated MemStore flushes. HBase periodically performs compaction to merge smaller HFiles into larger ones, reducing the number of files that need to be read during a query.
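Compaction is essentially a merge of several sorted files into one. The sketch below assumes each HFile is a list of (row key, timestamp, value) tuples sorted by row key, and keeps only the newest version of each row; keeping a single version is a simplification chosen for the example.

```python
import heapq


def compact(hfiles):
    """Merge sorted HFiles into one, keeping the newest version per row key."""
    merged = []
    # heapq.merge yields tuples in ascending (row_key, timestamp) order.
    for row_key, timestamp, value in heapq.merge(*hfiles):
        if merged and merged[-1][0] == row_key:
            # Same row key with a later timestamp: replace the older version.
            merged[-1] = (row_key, timestamp, value)
        else:
            merged.append((row_key, timestamp, value))
    return merged


hfile1 = [("a", 1, "a1"), ("c", 1, "c1")]
hfile2 = [("a", 2, "a2"), ("b", 2, "b1")]
compacted = compact([hfile1, hfile2])
print(compacted)
```

After compaction, a read for row "a" has to look in one file instead of two.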

HBase Architecture: A Simple Summary to Remember

1. HMaster: Manages region servers, assigns and balances regions, and handles schema changes (table metadata).

2. Region Servers: Serve regions (a range of rows) and handle read/write requests. Manage MemStore (in-memory) and HFiles (on disk).

3. Regions: Subdivisions of HBase tables that contain data and are managed by region servers. Regions are split as they grow.

4. ZooKeeper: Tracks live region servers, elects the active HMaster, stores cluster state, and provides distributed synchronization.

5. HDFS: Underlying distributed storage system for HFiles and WALs, ensuring data durability and scalability.
