Google File System (GFS)

GFS is a distributed file system developed by Google for large-scale data processing applications that work with massive amounts of data. It is designed to provide high availability, fault tolerance, scalability, and reliability, even when running on inexpensive commodity hardware. GFS was built as the foundational storage system for many of Google's applications, particularly web indexing, big data processing, and services such as Google Search and Google Maps.

Features of Google File System (GFS)

1.    Scalability:

GFS is built to scale horizontally, allowing it to support a large number of machines (nodes) and handle petabytes of data efficiently.

2.    Fault Tolerance:

GFS ensures data reliability by replicating data across multiple nodes. Even if some nodes or hardware components fail, the system can still function and recover lost data.

3.    Distributed Architecture:

GFS follows a distributed architecture where files are split into fixed-size chunks (usually 64 MB) that are stored on multiple machines (chunk servers). These chunks are replicated across several nodes for redundancy.
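The chunk arithmetic implied by fixed-size chunks can be sketched in a few lines. This is only an illustration: the function names `chunk_count` and `split_range` are hypothetical and not part of any GFS interface, and the 64 MB figure is the chunk size stated above.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB, the chunk size stated in the article


def chunk_count(file_size: int) -> int:
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE


def split_range(offset: int, length: int):
    """Map a byte range of a file to (chunk_index, offset_in_chunk, length) pieces."""
    pieces = []
    end = offset + length
    while offset < end:
        index = offset // CHUNK_SIZE           # which chunk this byte falls in
        within = offset % CHUNK_SIZE           # position inside that chunk
        take = min(CHUNK_SIZE - within, end - offset)
        pieces.append((index, within, take))
        offset += take
    return pieces
```

For example, a 30-byte range that starts 10 bytes before the end of the first chunk is split into two pieces, one in chunk 0 and one in chunk 1.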

4.    High Throughput:

GFS is optimized for applications that process large datasets and need high sustained throughput rather than low-latency access. Its typical workloads are large streaming reads and large sequential writes that append data to files.

5.    Support for Large Files:

GFS is designed to store very large files, often gigabytes or terabytes in size. It is highly efficient for sequential reading and writing of large blocks of data.

6.    Replication:

Data chunks are replicated across multiple chunk servers (typically three replicas by default) to ensure availability and durability. Even if one or two chunk servers fail, the data can still be accessed from the remaining replicas.

7.    Master-Worker Architecture:

GFS uses a master-slave or master-worker architecture, where a single master node manages metadata and coordinates access, while multiple chunk servers store the actual data.

8.    Designed for Commodity Hardware:

GFS runs on inexpensive, off-the-shelf hardware, making it cost-effective. It is built to handle frequent failures of hardware components, assuming failures are common rather than rare.

Architecture of Google File System (GFS)

1.   Master Node:

The master node is responsible for managing the system's metadata, including namespace management, chunk information, and access control. It does not store actual file data but rather oversees chunk locations and replication.

The master node maintains a mapping of file names to chunk locations and handles operations like file creation, deletion, and renaming.

It also coordinates data replication, making sure there are enough copies of each chunk and deciding when and where to place new replicas.

2.    Chunk Servers:

The chunk servers are responsible for storing the actual file data. Files are broken down into fixed-size chunks (64 MB), and each chunk is replicated across multiple chunk servers.

Chunk servers communicate with the master node but operate independently to handle read/write requests from clients.

Chunk servers periodically send heartbeat messages to the master so that the master knows they are still functioning and available.
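Heartbeat tracking on the master can be pictured roughly as follows. This is a minimal sketch, not Google's implementation: the class `HeartbeatMonitor`, its method names, and the timeout value are all assumptions made for illustration.

```python
import time
from typing import Dict, List, Optional


class HeartbeatMonitor:
    """Toy master-side tracker: a server is presumed dead if it stays silent too long."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self.last_seen: Dict[str, float] = {}

    def record(self, server_id: str, now: Optional[float] = None) -> None:
        """Called whenever a heartbeat arrives from a chunk server."""
        self.last_seen[server_id] = time.monotonic() if now is None else now

    def dead_servers(self, now: Optional[float] = None) -> List[str]:
        """Servers whose last heartbeat is older than the timeout."""
        now = time.monotonic() if now is None else now
        return sorted(s for s, t in self.last_seen.items() if now - t > self.timeout)
```

The optional `now` argument exists only so the sketch can be exercised with fixed timestamps instead of the real clock.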

3.    Clients:

 Clients interact with both the master node and chunk servers. They communicate with the master node to get metadata information (such as chunk locations) and then directly read or write data from/to the chunk servers.

4.    Metadata:

GFS stores three types of metadata on the master node:

File and chunk namespace: A hierarchical structure for files and directories.

Mapping from files to chunks: Information about which chunks correspond to which parts of a file.

Chunk replica locations: Information about where the replicas of each chunk are stored across chunk servers.

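The master keeps all metadata in memory so that lookups are fast. The namespace and the file-to-chunk mapping are also recorded in an operation log on disk, while replica locations are learned by asking the chunk servers which chunks they hold.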
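The three kinds of metadata can be modelled as three in-memory tables. The sketch below is an assumption-laden simplification: `MasterMetadata` and its methods are hypothetical names, the namespace is a flat set of paths rather than a real hierarchy, and chunk identifiers are plain counters.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class MasterMetadata:
    # 1. Namespace: here just a flat set of full path names.
    namespace: Set[str] = field(default_factory=set)
    # 2. File-to-chunk mapping: path -> ordered list of chunk identifiers.
    file_chunks: Dict[str, List[int]] = field(default_factory=dict)
    # 3. Replica locations: chunk identifier -> chunk servers holding a copy.
    chunk_locations: Dict[int, Set[str]] = field(default_factory=dict)
    next_handle: int = 1

    def create_file(self, path: str) -> None:
        self.namespace.add(path)
        self.file_chunks[path] = []

    def add_chunk(self, path: str, servers: Set[str]) -> int:
        """Append a new chunk to a file and remember where its replicas live."""
        handle = self.next_handle
        self.next_handle += 1
        self.file_chunks[path].append(handle)
        self.chunk_locations[handle] = set(servers)
        return handle

    def locate(self, path: str, chunk_index: int) -> Tuple[int, Set[str]]:
        """Answer a client lookup: which chunk is this, and who stores it?"""
        handle = self.file_chunks[path][chunk_index]
        return handle, self.chunk_locations[handle]
```

Because all three tables are ordinary dictionaries and sets, a lookup needs no disk access, which is the point of keeping metadata in memory.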

5.   Consistency Model:

GFS provides a relaxed consistency model rather than strict consistency. Replicas of a chunk may not be identical immediately after a write, for example when a mutation fails on some replicas, but the system works to bring them back to a consistent state.

Clients can tolerate some temporary inconsistencies (e.g., stale reads), but for most large-scale data processing tasks, this is an acceptable trade-off for better performance and availability.

Working of GFS

1.    File Creation:

When a client creates a new file, the master node records it in the namespace. As data is written, the file is divided into chunks of 64 MB; the master assigns each chunk a unique identifier and decides which chunk servers will store the initial replicas of that chunk.

2.    Reading Data:

When a client wants to read a file, it first contacts the master node with the file name and the index of the chunk it needs, which it computes from the byte offset. The master replies with the chunk identifier and the locations of its replicas. The client then reads the data directly from one of those chunk servers, so file data never passes through the master.
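The read path can be simulated with plain dictionaries standing in for the master and the chunk servers. Everything here is a toy under stated assumptions: `MASTER`, `CHUNK_SERVERS`, `master_lookup`, and `read` are hypothetical names, the chunk size is shrunk to 4 bytes so the example stays readable, and the client always picks the first replica.

```python
CHUNK_SIZE = 4  # tiny chunk size for the demo; the article states 64 MB

# Master view: path -> list of (chunk identifier, replica locations).
MASTER = {"/logs/a": [(101, ["cs1", "cs2"]), (102, ["cs2", "cs3"])]}

# Chunk servers: server id -> {chunk identifier: bytes stored}.
CHUNK_SERVERS = {
    "cs1": {101: b"abcd"},
    "cs2": {101: b"abcd", 102: b"ef"},
    "cs3": {102: b"ef"},
}


def master_lookup(path, chunk_index):
    """Metadata-only request: the master returns no file data."""
    return MASTER[path][chunk_index]


def read(path, offset, length):
    out = bytearray()
    end = offset + length
    while offset < end:
        index, within = divmod(offset, CHUNK_SIZE)
        if index >= len(MASTER[path]):
            break                                   # past the last chunk
        handle, servers = master_lookup(path, index)
        data = CHUNK_SERVERS[servers[0]][handle]    # read directly from a replica
        take = min(CHUNK_SIZE - within, end - offset)
        piece = data[within:within + take]
        out += piece
        if len(piece) < take:
            break                                   # reached end of file
        offset += take
    return bytes(out)
```

A call such as `read("/logs/a", 2, 4)` spans both chunks, so the client performs two lookups and two direct reads.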

3.    Writing Data:

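For write operations, the client asks the master for the chunk's replica locations and for the primary chunk server, which is the replica that currently holds the lease for the chunk. The client pushes the data to all replicas of the chunk and then sends the write request to the primary. The primary assigns the mutation a serial number, applies it first, and forwards the request to the secondary replicas, which apply it in the same order.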

Once all replicas have acknowledged the write, the primary informs the client that the write was successful.
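The ordering role of the primary can be sketched as below. This is a simplified model, not the real protocol: the classes `Replica` and `Primary` and the function `client_write` are hypothetical, every write is treated as an append, and failures and leases are left out.

```python
class Replica:
    def __init__(self, name):
        self.name = name
        self.buffer = {}          # data pushed by clients, keyed by a data id
        self.chunk = bytearray()  # contents of the single chunk in this demo
        self.applied = []         # serial numbers, in the order applied

    def push(self, data_id, data):
        """Step 1: receive and buffer the data; nothing is applied yet."""
        self.buffer[data_id] = data

    def apply(self, serial, data_id):
        """Apply a buffered mutation and acknowledge it."""
        self.chunk += self.buffer.pop(data_id)
        self.applied.append(serial)
        return True


class Primary(Replica):
    def __init__(self, name, secondaries):
        super().__init__(name)
        self.secondaries = secondaries
        self.next_serial = 0

    def write(self, data_id):
        """Step 2: pick a serial number, apply locally, then forward to secondaries."""
        serial = self.next_serial
        self.next_serial += 1
        self.apply(serial, data_id)
        acks = [s.apply(serial, data_id) for s in self.secondaries]
        return all(acks)          # success only if every replica acknowledged


def client_write(primary, data_id, data):
    for replica in [primary, *primary.secondaries]:
        replica.push(data_id, data)
    return primary.write(data_id)
```

Because only the primary hands out serial numbers, every replica applies mutations in the same order even when several clients write at once.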

4.    Fault Tolerance:

GFS is designed to handle frequent hardware failures. The master node continuously monitors the state of the chunk servers via heartbeat messages. If a chunk server fails, the master node schedules new copies of the chunks it held on other available servers, using the surviving replicas as the source.

The system automatically replicates chunks to maintain the desired replication factor, ensuring that data remains available even in the event of hardware failure.
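One way to picture re-replication is a planning function that compares each chunk's surviving replicas with the target count. The sketch below is illustrative only: `plan_rereplication` is a hypothetical name, and choosing new servers alphabetically is a placeholder for whatever placement policy a real system would use.

```python
REPLICATION_FACTOR = 3  # the default replica count mentioned in the article


def plan_rereplication(chunk_locations, live_servers, factor=REPLICATION_FACTOR):
    """Return {chunk id: [servers that should receive a new copy]}."""
    plan = {}
    for handle, holders in sorted(chunk_locations.items()):
        alive = holders & live_servers
        missing = factor - len(alive)
        if missing <= 0 or not alive:
            continue  # already healthy, or no surviving copy to clone from
        candidates = sorted(live_servers - alive)  # placeholder placement policy
        plan[handle] = candidates[:missing]
    return plan
```

For instance, if server `c` fails and chunk 1 was stored on `a`, `b`, and `c`, the plan asks one other live server to take a new copy of chunk 1.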




Applications of GFS

1.    Web Crawling and Indexing:

GFS was originally designed to support Google’s web crawling and indexing operations, where large amounts of data need to be stored, processed, and retrieved efficiently.

2.    Data-Intensive Applications:

Applications like big data analytics, data mining, and machine learning use GFS to store and process large datasets in a distributed manner.

3.    Log Processing:

GFS is commonly used for storing and processing logs, as it supports high-throughput, write-heavy workloads that are characteristic of log data generation.

