Hadoop Distributed File System (HDFS)
Features of HDFS
1. Distributed Storage:
• HDFS stores data across multiple nodes in a Hadoop cluster, distributing files into blocks and storing these blocks across several machines. This helps to ensure scalability and reliability.
2. Fault Tolerance:
• HDFS is designed with fault tolerance in mind. Data blocks are replicated across different machines in the cluster to ensure that even if a machine or a disk fails, data is not lost and can still be accessed from other replicas.
3. High Throughput:
• HDFS is optimized for high throughput rather than low latency. It is ideal for applications that require reading and writing large volumes of data, such as data analytics, machine learning, and big data processing.
4. Large File Support:
• HDFS works well with large files, as the system splits these files into blocks and stores them across the cluster. Typically, HDFS blocks are 128 MB or 256 MB in size, allowing efficient storage and management of large files.
5. Scalability:
• HDFS is highly scalable, meaning it can expand to handle petabytes of data by simply adding more nodes to the cluster. This scalability makes it suitable for handling big data workloads.
6. Cost-Efficient:
• HDFS is designed to run on commodity hardware, which makes it cost-effective compared to traditional storage systems that rely on expensive, specialized hardware.
Architecture of HDFS
HDFS has a master-slave architecture consisting of two main components: NameNode and DataNode. In addition, a Secondary NameNode is also used to support HDFS’s fault tolerance and maintainability.
1. NameNode (Master Node)
• The NameNode is the centerpiece of HDFS. It manages the file system namespace and controls access to files by clients.
• The NameNode does not store the actual data (the content of the files). Instead, it keeps the metadata about the files, such as:
• The file-to-block mapping (which block is stored where).
• The locations of data blocks on different DataNodes.
• The file and directory namespace.
• Permissions and access control.
• The NameNode is responsible for managing the replication of data blocks. It ensures that each block is replicated to multiple DataNodes to provide fault tolerance.
• The NameNode is a single point of failure (SPOF) in a classic HDFS deployment. If the NameNode fails, the whole system becomes unavailable: the data still resides on the DataNodes, but clients cannot locate or access it until the NameNode is restored (or a standby NameNode takes over in a High Availability setup).
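To make the NameNode's role concrete, here is a minimal sketch that queries file metadata and block locations through the Hadoop FileSystem client API. The NameNode URI (hdfs://namenode:9000) and the file path /data/example.txt are placeholder assumptions, not values from this post.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NameNodeMetadataExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", "hdfs://namenode:9000"); // placeholder NameNode address

        try (FileSystem fs = FileSystem.get(conf)) {
            Path file = new Path("/data/example.txt");    // hypothetical file

            // All of this metadata is answered by the NameNode, not the DataNodes
            FileStatus status = fs.getFileStatus(file);
            System.out.println("Size:        " + status.getLen());
            System.out.println("Replication: " + status.getReplication());
            System.out.println("Block size:  " + status.getBlockSize());
            System.out.println("Permissions: " + status.getPermission());

            // Block-to-DataNode mapping for the whole file
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                System.out.println("Block at offset " + block.getOffset()
                        + " stored on: " + String.join(", ", block.getHosts()));
            }
        }
    }
}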
2. DataNode (Slave Node)
• DataNodes are the worker nodes that store the actual data in the form of blocks.
• DataNodes manage the storage of data blocks on the local disk. They are responsible for:
• Storing and retrieving the actual data blocks.
• Periodically sending heartbeats to the NameNode to indicate that they are alive and functioning.
• Reporting block locations to the NameNode.
• DataNodes handle read and write requests from clients. When a client requests a file, the NameNode gives the client the locations of the blocks, and the client can then retrieve the blocks directly from the DataNodes.
• DataNodes are also responsible for creating, deleting, and replicating blocks based on commands from the NameNode.
3. Secondary NameNode
• The Secondary NameNode is not a hot standby or a direct backup of the NameNode; it helps keep the NameNode's metadata compact and consistent.
• It periodically fetches the edit log from the NameNode, merges the changes into the current file system image (FSImage), and sends the new checkpoint back to the NameNode.
4. Client
• Clients interact with HDFS through the Hadoop FileSystem client API. They communicate with the NameNode to get the block locations of a file and then communicate directly with the DataNodes to read or write data blocks.
• Clients perform file operations (such as create, read, and delete) through the HDFS interface, and the NameNode handles the metadata operations, ensuring the client gets the correct block locations.
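As a small illustration of such client-side metadata operations, the sketch below creates a directory, checks for a file, and deletes another file. It assumes the Hadoop FileSystem Java API and a hypothetical /user/demo path; fs.defaultFS is normally picked up from core-site.xml on the classpath.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientOpsExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();

        try (FileSystem fs = FileSystem.get(conf)) {
            Path dir = new Path("/user/demo");            // hypothetical directory

            fs.mkdirs(dir);                               // metadata operation handled by the NameNode
            boolean present = fs.exists(new Path(dir, "input.txt"));
            System.out.println("input.txt present? " + present);

            // Delete a file (non-recursive); also a NameNode metadata operation
            fs.delete(new Path(dir, "old.txt"), false);
        }
    }
}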
5. Block and Block Replication
• Files in HDFS are split into large chunks called blocks. The default block size is 128 MB (often configured to 256 MB), which is much larger than the block size in traditional file systems (e.g., 4 KB).
• Each block is replicated across multiple DataNodes (usually three replicas by default) to provide fault tolerance.
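For illustration, a file's block size and replication factor can be inspected and changed through the client API. This is a sketch with a hypothetical path; the actual re-replication work is scheduled by the NameNode and carried out by the DataNodes in the background.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockReplicationExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/data/example.txt");    // hypothetical file

            FileStatus status = fs.getFileStatus(file);
            System.out.println("Block size:  " + status.getBlockSize());   // e.g. 134217728 (128 MB)
            System.out.println("Replication: " + status.getReplication()); // e.g. 3

            // Request a different replication factor for this file;
            // the NameNode schedules the extra copies (or deletions) asynchronously
            fs.setReplication(file, (short) 2);
        }
    }
}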
Working of HDFS
1. File Write Operation:
• The client communicates with the NameNode to request permission to write a file.
• The NameNode checks if the file already exists and if there is enough space for the new file.
• The client splits the file into blocks (e.g., 128 MB) and sends each block to the DataNodes as instructed by the NameNode.
• Each block is replicated to multiple DataNodes, as per the replication factor.
• The DataNodes store the blocks, acknowledge the writes back through the write pipeline, and report the new blocks to the NameNode. A minimal client-side sketch of this write path follows below.
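The sketch below writes a small file using the Hadoop FileSystem API. The path, replication factor, and block size are illustrative assumptions; the block splitting and streaming to the DataNode pipeline happen inside the client library.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.nio.charset.StandardCharsets;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/data/output.txt");     // hypothetical target file

            // create(path, overwrite, bufferSize, replication, blockSize):
            // 3 replicas per block, 128 MB blocks
            try (FSDataOutputStream out =
                         fs.create(file, true, 4096, (short) 3, 128 * 1024 * 1024L)) {
                out.write("hello, HDFS".getBytes(StandardCharsets.UTF_8));
                // close() flushes the last packet and waits for the DataNode
                // pipeline acknowledgments before returning
            }
        }
    }
}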
2. File Read Operation:
• The client sends a request to the NameNode to read a file.
• The NameNode provides the block locations for the file.
• The client then contacts the DataNodes directly to read the blocks in sequence.
• The DataNodes stream the blocks to the client, and the client assembles the file.
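A corresponding read sketch, again using the FileSystem API with a placeholder path: opening the stream contacts the NameNode for block locations, and the reads themselves stream block data from the DataNodes.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        try (FileSystem fs = FileSystem.get(new Configuration())) {
            Path file = new Path("/data/output.txt");     // hypothetical file to read

            // open() asks the NameNode for the file's block locations;
            // the returned stream then reads block data directly from the DataNodes
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.copyBytes(in, System.out, 4096, false); // stream the file to stdout
            }
        }
    }
}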
Disadvantages of HDFS
1. Single Point of Failure (SPOF):
• The NameNode is a single point of failure. If it goes down, the entire HDFS system becomes unavailable. However, Hadoop HA (High Availability) can mitigate this by using a standby NameNode.
2. Not Suitable for Small Files:
• HDFS is not efficient for applications that require storing a large number of small files. The overhead of managing metadata for small files can degrade performance.
3. Write Once, Read Many Model:
• HDFS follows a write-once, read-many model, which is suitable for batch processing but not for applications that require frequent updates or random writes.
Applications of HDFS
1. Data Warehousing:
• HDFS is commonly used for storing large volumes of data in data warehousing solutions where the data is rarely updated but frequently read for analytics.
2. Big Data Analytics:
• HDFS is ideal for big data analytics, where large datasets are processed in parallel using frameworks like Apache Hadoop or Apache Spark.
3. Machine Learning:
• Large datasets for machine learning models (such as image recognition or natural language processing) can be stored and processed using HDFS.