The Google File System

The Google File System is a distributed file system designed and used internally by Google in the early 2000s to support its large-scale data processing needs. GFS was built to handle hundreds of clients concurrently reading from and appending to very large files while maintaining high performance, availability, and reliability.

Use Cases:

GFS caters to two primary use cases:

  1. One-off Analytics: data is written to a file once and later read (often sequentially) for analysis.
  2. Streaming Reads: producers continuously append data that consumers read back in a stream-like fashion.

The predominant pattern in these applications is sequential appends to files, often from many processes writing to the same file concurrently. The read pattern mirrors this, with most reads scanning contiguous regions of a file from different clients. Notably, these applications care far more about sustained throughput than about low latency.

Assumptions:

To meet the demands of such workloads, GFS operates under several assumptions:

  • The system must handle a large scale of data, requiring thousands of machines where some might fail or lose connectivity.
  • Files are large, typically ranging from hundreds of megabytes to multiple gigabytes.
  • Reads follow two main patterns: large streaming reads (hundreds of kilobytes to megabytes) and small random reads (a few kilobytes at an arbitrary offset).
  • Large sequential writes that append data to files.
  • Many clients on different machines must be able to append to the same file concurrently, so appends need well-defined atomic semantics without client-side synchronization.
  • High throughput takes precedence over low latency for the targeted applications.

Interface for Applications:

GFS exposes a familiar hierarchical interface, organizing files into directories identified by pathnames, although it does not implement a standard API such as POSIX. Files can be created, deleted, opened, closed, read, and written, either at an arbitrary offset or as a record append at the end of the file. Additionally, GFS supports snapshots, which create a copy of a file or directory tree at low cost.
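As a rough illustration, the client-facing surface can be pictured as an interface like the sketch below; the names and signatures are hypothetical, not the real GFS client library's API.

```python
from abc import ABC, abstractmethod

class FileHandle:
    """Opaque handle returned by open(); a placeholder for this sketch."""

class GFSClient(ABC):
    """Hypothetical client-facing operations described above;
    names and signatures are illustrative only."""

    @abstractmethod
    def create(self, path: str) -> None: ...

    @abstractmethod
    def delete(self, path: str) -> None: ...

    @abstractmethod
    def open(self, path: str) -> FileHandle: ...

    @abstractmethod
    def close(self, handle: FileHandle) -> None: ...

    @abstractmethod
    def read(self, handle: FileHandle, offset: int, length: int) -> bytes: ...

    @abstractmethod
    def write(self, handle: FileHandle, offset: int, data: bytes) -> None: ...

    @abstractmethod
    def record_append(self, handle: FileHandle, data: bytes) -> int:
        """Append atomically; GFS (not the caller) picks and returns the offset."""

    @abstractmethod
    def snapshot(self, source: str, target: str) -> None: ...
```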

Architecture:

A GFS cluster comprises thousands of machines and can serve millions of files. Each cluster is deployed in isolation for a specific set of applications or use cases. A cluster consists of a single master and multiple storage nodes known as chunkservers; larger clusters contain thousands of chunkservers. The master manages namespace operations, ensures data availability through replication, performs load balancing, and oversees garbage collection. Chunkservers store chunk data and serve it on request.

Due to the substantial file sizes, GFS divides each file into 64 MB units called chunks, stored as Linux files on chunkservers. The architecture distinguishes between control flow and data flow, with the master providing information about chunk locations to applications, and chunkservers handling reads and writes. Applications initially communicate with the master to obtain chunk location information, then establish direct connections with chunkservers for data access.
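Because chunks have a fixed 64 MB size, translating a file offset into a chunk is simple arithmetic: the client sends the file name and chunk index to the master, receives the chunk handle and replica locations, and then reads the byte range directly from a chunkserver. A minimal sketch of the offset math:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks

def chunk_coordinates(byte_offset: int) -> tuple[int, int]:
    """Map a file byte offset to (chunk index, offset within that chunk).
    The client sends the file name plus the chunk index to the master,
    which returns the chunk handle and replica locations; the byte range
    itself is then read directly from one of the chunkservers."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

# Example: byte 200 MB falls 8 MB into the fourth chunk (index 3).
index, within = chunk_coordinates(200 * 1024 * 1024)
print(index, within // (1024 * 1024))  # -> 3 8
```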


Master Responsibilities:

The master plays a crucial role in maintaining the system's functionality. It manages system metadata, distributes load, ensures proper data replication, and communicates chunkserver locations to applications. To enable these functions, the master maintains mappings for file namespaces, file-to-chunk mappings, and chunk replica locations. File and chunk namespaces, along with the file-to-chunk mapping, are persisted through logging mutations to an operation log stored in the master's local disk and replicated to a remote machine. The master employs various strategies, including replication maintenance, load balancing, and garbage collection, to ensure the overall health of the system.
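A rough sketch of the shape of that metadata, with simplified, hypothetical types (in the real system the namespace is stored compactly and replica locations are rebuilt from chunkserver reports rather than persisted):

```python
from dataclasses import dataclass, field

@dataclass
class FileMeta:
    chunk_handles: list[int] = field(default_factory=list)  # ordered chunks of the file

@dataclass
class ChunkMeta:
    version: int = 0                                   # used for stale-replica detection
    replicas: list[str] = field(default_factory=list)  # chunkserver addresses (not persisted)

@dataclass
class MasterState:
    namespace: dict[str, FileMeta] = field(default_factory=dict)  # full pathname -> file metadata
    chunks: dict[int, ChunkMeta] = field(default_factory=dict)    # chunk handle -> chunk metadata
    op_log: list[str] = field(default_factory=list)               # stand-in for the replicated operation log
```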

The master's state and operation log are replicated on multiple machines so the master is not a single point of failure. If the master process dies, it can be restarted almost immediately; if its machine or disk fails, monitoring infrastructure outside GFS starts a new master elsewhere using the replicated operation log. In addition, "shadow" masters provide read-only access to the file system even while the primary master is down: each one replays a replica of the operation log, applies the changes to its own data structures exactly as the primary does, and polls chunkservers to track their status.

Chunkserver Responsibilities:

Chunkservers are the storage component of GFS: they store chunks as local files and serve read and write requests for them. Chunkservers report which chunks they hold when they start up and in regular heartbeat messages, and the master keeps this location information in memory rather than persisting it.

Writes Flow:

To maintain a consistent order of writes across replicas, the master grants a lease on each chunk to one of its chunkservers. This leaseholder, the primary, decides the order of mutations, and all other replicas follow it. As with reads, GFS separates control flow from data flow for writes, which improves performance. The client asks the master which chunkservers hold the chunk, and the master returns the primary (leaseholder) and the secondaries. The client then pushes the data to all replicas in a pipelined fashion along a chain of chunkservers chosen with network topology in mind. After receiving acknowledgments from all replicas, the client sends a write request to the primary, which assigns consecutive serial numbers to the mutations and applies them in that order. The primary then forwards the write request, including the ordering, to the secondaries. If the mutation succeeds on all replicas, the primary returns success to the client; otherwise, the client retries.
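The essential ordering idea can be sketched as a toy model (not the actual RPC protocol): the primary assigns serial numbers so that every replica applies concurrent mutations in one agreed order.

```python
import itertools

class PrimaryReplica:
    """Toy model of the leaseholder: it picks one global order for all
    concurrent mutations on a chunk and tells the secondaries to apply
    them in that same order. Names are illustrative."""

    def __init__(self, secondaries):
        self.secondaries = secondaries
        self.serial = itertools.count(1)
        self.applied = []  # (serial number, mutation) pairs, in order

    def write(self, mutation) -> bool:
        n = next(self.serial)               # assign a serial number
        self.applied.append((n, mutation))  # apply locally in serial order
        # Forward the mutation *and its serial number* to every secondary.
        ok = all(s.apply(n, mutation) for s in self.secondaries)
        return ok                           # the client retries if any replica failed

class SecondaryReplica:
    def __init__(self):
        self.applied = []

    def apply(self, serial, mutation) -> bool:
        self.applied.append((serial, mutation))
        return True

# All replicas end up with the same mutation order.
secondaries = [SecondaryReplica(), SecondaryReplica()]
primary = PrimaryReplica(secondaries)
primary.write(b"record-A")
primary.write(b"record-B")
assert secondaries[0].applied == secondaries[1].applied == primary.applied
```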

If a write is large or straddles a chunk boundary, the client breaks it into multiple write operations, which may be interleaved with writes from other clients. Because failed mutations are retried, GFS provides at-least-once semantics: the same data may be appended more than once, and applications must be prepared to detect and handle duplicates.
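The paper notes that writers can embed checksums or unique record identifiers so readers can filter out the duplicates (and padding) that at-least-once retries leave behind. A minimal reader-side sketch, with a record format invented for illustration:

```python
def filter_duplicates(records):
    """Skip records whose unique identifier has already been seen;
    duplicates arise when a client retries a failed append."""
    seen = set()
    for record_id, payload in records:
        if record_id in seen:
            continue  # duplicate produced by a retry
        seen.add(record_id)
        yield payload

appended = [("r1", b"alpha"), ("r2", b"beta"), ("r1", b"alpha")]  # r1 was retried
print(list(filter_duplicates(appended)))  # [b'alpha', b'beta']
```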


Atomic Record Appends:

GFS introduces atomic record appends to avoid the problems of traditional writes in which clients specify offsets. With record appends, the client supplies only the data, and GFS guarantees that it is written atomically as a contiguous sequence of bytes at an offset that GFS chooses, so records from different clients are never fragmented or interleaved. Execution resembles a regular write with extra logic: the client pushes the data to all replicas, and the primary checks whether appending the record would exceed the chunk's maximum size. If so, it pads the chunk to the boundary, instructs the secondaries to do the same, and asks the client to retry on the next chunk. If the record fits, the primary appends it at an offset of its choosing and tells the secondaries to write the data at exactly that offset. A successful record append therefore leaves the record at the same offset on every replica.
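A simplified sketch of the primary's decision for a record append, assuming an in-memory chunk and ignoring replication details (the real system also caps record size at roughly a quarter of the chunk size so padding stays small):

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB

class Chunk:
    def __init__(self):
        self.data = bytearray()

def record_append(chunk: Chunk, record: bytes):
    """Either the record fits in the current chunk and is written at an
    offset the primary chooses, or the chunk is padded to its boundary
    and the client must retry on the next chunk."""
    if len(chunk.data) + len(record) > CHUNK_SIZE:
        # Pad to the chunk boundary (secondaries are told to do the same).
        chunk.data.extend(b"\0" * (CHUNK_SIZE - len(chunk.data)))
        return None  # signal: retry the append on the next chunk
    offset = len(chunk.data)   # offset chosen by the primary
    chunk.data.extend(record)
    return offset              # the same offset is used on every replica

c = Chunk()
print(record_append(c, b"event-1"))  # -> 0: offset chosen by the primary
print(record_append(c, b"event-2"))  # -> 7: appended right after the first record
```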

Consistency Guarantees:

GFS provides specific consistency guarantees:

  • Successful mutations are written at least once: replicas may not be byte-for-byte identical (failed attempts and retries leave padding and duplicates behind), but if a mutation reports success, its data has been written at least once on every replica.
  • Data from a single atomic record append is always written as one contiguous run of bytes, while a single large write may end up interleaved with writes from other clients.

GFS describes the state of a file region after a mutation using two definitions:

  • A file region is consistent if all clients always see the same data, regardless of which replica they read from.
  • A region is defined after a mutation if it is consistent and clients see exactly what that mutation wrote in its entirety (no interleaving with other writes).

In these terms, summarizing the paper's table of guarantees: a serial write that succeeds leaves the region defined; concurrent successful writes leave it consistent but undefined (fragments of different writes may be mixed); a successful record append leaves the record defined, possibly interspersed with inconsistent padding or duplicates; and a failed mutation leaves the region inconsistent.


Snapshots:

The snapshot functionality in the system allows for nearly instantaneous copying of a file or directory tree while minimizing disruptions to ongoing mutations. Users utilize snapshots to create branch copies of large datasets or to checkpoint the state before making experimental changes, facilitating quicker rollbacks if needed.

When the master receives a snapshot request, it first revokes any outstanding leases on the chunks of the files to be snapshotted, so any subsequent write to those chunks must first ask the master for a leaseholder; this gives the master the opportunity to copy the chunk before further mutations. Once the leases have been revoked or have expired, the master logs the snapshot operation and then duplicates the metadata for the file or directory tree in its in-memory state. The newly created snapshot files reference the same chunks as the source files.

The first time a client wants to write to one of these shared chunks after the snapshot, the master notices that the chunk's reference count is greater than one, defers its reply, and picks a new chunk handle. It then asks each chunkserver holding the old chunk to create the new chunk locally, so the data is copied on local disks rather than over the network. Once the new chunk exists, the master grants one replica a lease on it, replies to the client, and operations proceed normally.
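The copy-on-write bookkeeping can be sketched with reference counts on chunk handles; the data structures and method names below are invented for illustration.

```python
class Master:
    """Minimal copy-on-write bookkeeping for snapshots (hypothetical shapes).
    A snapshot only copies metadata and bumps chunk reference counts; chunk
    bytes are duplicated lazily, when someone first writes to them."""

    def __init__(self):
        self.files = {}      # path -> list of chunk handles
        self.refcount = {}   # chunk handle -> number of files referencing it
        self.next_handle = 0

    def snapshot(self, src, dst):
        self.files[dst] = list(self.files[src])   # share the same chunks
        for h in self.files[dst]:
            self.refcount[h] += 1

    def prepare_write(self, path, i):
        """Called before granting a lease on the i-th chunk of `path`."""
        h = self.files[path][i]
        if self.refcount[h] > 1:                  # chunk is shared -> copy it first
            self.refcount[h] -= 1
            new_h = self.next_handle = self.next_handle + 1
            # In GFS, the chunkservers holding h copy it locally, avoiding the network.
            self.refcount[new_h] = 1
            self.files[path][i] = new_h
            return new_h
        return h

m = Master()
m.files["/logs/a"] = [0]; m.refcount[0] = 1
m.snapshot("/logs/a", "/snap/a")
print(m.prepare_write("/logs/a", 0))  # -> 1: a fresh handle; the snapshot keeps chunk 0
```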

Master Operations:

Namespace Management:

Master operations can take a considerable amount of time, so they are designed to run concurrently rather than delay one another. Operations that could conflict, however, must be properly serialized, so GFS locks regions of the namespace. Each node in the namespace tree (an absolute file or directory pathname) has an associated read-write lock; an operation typically acquires read locks on the ancestor directories of the paths it touches and a read or write lock on the full pathnames themselves. Locks are acquired in a consistent total order, first by level in the namespace tree and then lexicographically within the same level, which prevents deadlock and properly serializes conflicting operations.
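A small sketch of that locking scheme, assuming a simplified path representation: each operation takes read locks on the ancestor directories plus a read or write lock on the full pathname, and all locks are acquired in one total order.

```python
def locks_for(op_path: str, write: bool):
    """Locks an operation needs: read locks on every ancestor path,
    plus a read or write lock on the full pathname itself (simplified)."""
    parts = op_path.strip("/").split("/")
    ancestors = ["/" + "/".join(parts[:i]) for i in range(1, len(parts))]
    leaf_mode = "w" if write else "r"
    return [(p, "r") for p in ancestors] + [(op_path, leaf_mode)]

def acquire_order(lock_sets):
    """Deadlock avoidance: acquire all locks in one consistent total order,
    first by depth in the namespace tree, then lexicographically."""
    all_locks = {lock for s in lock_sets for lock in s}
    return sorted(all_locks, key=lambda lk: (lk[0].count("/"), lk[0]))

# Example: snapshotting /home/user while another client creates /home/user/foo.
snap = locks_for("/home/user", write=True)
create = locks_for("/home/user/foo", write=True)
for path, mode in acquire_order([snap, create]):
    print(mode, path)
```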

Replica Placement:

GFS clusters are distributed across hundreds or thousands of chunkservers on multiple racks, so replica placement must be chosen carefully to maximize data reliability, availability, and network bandwidth utilization. Replicas are spread across machines and across racks so that data survives not only disk and machine failures but also the loss of an entire rack (for example, a shared power circuit or network switch).

Creation, Re-replication, Rebalancing:

  • Creation: The master chooses where to place a new chunk by preferring chunkservers with below-average disk usage, spreading replicas across racks, and limiting the number of recent creations on any one chunkserver, since a new chunk usually means imminent heavy write traffic (a toy placement sketch follows this list).
  • Re-replication: A chunk is re-replicated when its number of replicas falls below the target replication factor. The master prioritizes chunks based on how far they are from that target, whether they are blocking client progress, and whether their files are live or deleted.
  • Rebalancing: The master periodically examines the replica distribution and moves replicas to even out disk usage and load. This also lets new chunkservers be filled gradually rather than being swamped with new chunks and the heavy write traffic that follows.
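A toy version of the creation heuristic; the thresholds and scoring rule here are invented, the paper only states the criteria.

```python
def choose_chunkservers(servers, replication=3, recent_creation_cap=5):
    """Prefer below-average disk use, avoid servers with many recent
    creations, and spread the chosen replicas across racks."""
    avg_use = sum(s["disk_use"] for s in servers) / len(servers)
    candidates = [s for s in servers
                  if s["disk_use"] <= avg_use and s["recent_creations"] < recent_creation_cap]
    chosen, racks = [], set()
    for s in sorted(candidates, key=lambda s: s["disk_use"]):
        if s["rack"] in racks and len(racks) < replication:
            continue                     # put replicas on different racks first
        chosen.append(s)
        racks.add(s["rack"])
        if len(chosen) == replication:
            break
    return chosen

servers = [
    {"name": "cs1", "rack": "r1", "disk_use": 0.30, "recent_creations": 1},
    {"name": "cs2", "rack": "r1", "disk_use": 0.35, "recent_creations": 0},
    {"name": "cs3", "rack": "r2", "disk_use": 0.40, "recent_creations": 2},
    {"name": "cs4", "rack": "r3", "disk_use": 0.45, "recent_creations": 0},
    {"name": "cs5", "rack": "r3", "disk_use": 0.90, "recent_creations": 0},
]
print([s["name"] for s in choose_chunkservers(servers)])  # -> ['cs1', 'cs3', 'cs4'], one per rack
```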

Garbage Collection:

After a file is deleted, its physical storage is reclaimed lazily during regular garbage collection. When deletion is requested, the master logs the deletion immediately and renames the file to a hidden name that includes the deletion timestamp; the file remains readable under that name and can be undeleted until it is actually removed, which protects against mistakes. During its regular scan of the file namespace, the master removes hidden files that have been "deleted" for more than three days; removing a file from the namespace erases its in-memory metadata and severs its links to its chunks. In a similar scan of the chunk namespace, the master identifies orphaned chunks (chunks no longer reachable from any file) and erases their metadata. In heartbeat messages, each chunkserver reports the chunks it holds, the master replies with those that no longer exist in its metadata, and the chunkserver is then free to delete them.

This mechanism is simple and robust against failures: if, for example, the deletion reply is lost, the chunkserver simply reports the chunk again in its next heartbeat and the master repeats the instruction. Garbage collection also runs in the background when the master is relatively idle.
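The heartbeat exchange that drives deletion is naturally idempotent, which is exactly what makes lost messages harmless. A minimal sketch:

```python
def heartbeat_gc(master_chunks: set[int], reported_chunks: set[int]) -> set[int]:
    """Garbage-collection half of a heartbeat (simplified): the chunkserver
    reports the chunk handles it stores, and the master replies with the
    ones that no longer exist in its metadata so the chunkserver can delete
    them. Losing a reply is harmless - the chunk is simply reported (and
    flagged for deletion) again on the next heartbeat."""
    return reported_chunks - master_chunks

# Chunk 42 was orphaned when its file's metadata was removed.
print(heartbeat_gc(master_chunks={1, 2, 3}, reported_chunks={1, 2, 3, 42}))  # {42}
```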

One issue with this approach is that the delay hinders users' efforts to fine-tune storage usage when space is tight: applications that repeatedly create and delete temporary files cannot reuse that storage right away. To address this, GFS expedites reclamation when a user explicitly deletes an already-deleted file again.

Stale Replica Detection:

To detect stale replicas, the master maintains a version number for each chunk. Whenever it grants a new lease on a chunk, the master increments the version number and informs the up-to-date replicas. If a chunkserver later reports a chunk with an older version (for example, when it restarts and reports its chunk set), the master marks that replica as stale and removes it during regular garbage collection.
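A small sketch of the version bookkeeping around a lease grant (simplified; the real master also persists the new version before notifying any client):

```python
def grant_lease(master_versions: dict[int, int], chunk: int, replica_versions: dict[str, int]):
    """Bump the chunk's version, move up-to-date replicas to the new version,
    and flag any replica still on an older version as stale."""
    master_versions[chunk] += 1
    current = master_versions[chunk]
    live = {r for r, v in replica_versions.items() if v == current - 1}
    for r in live:
        replica_versions[r] = current   # up-to-date replica records the new version
    stale = {r for r, v in replica_versions.items() if v < current}
    return live, stale                  # stale replicas are removed in garbage collection

versions = {"cs1": 4, "cs2": 4, "cs3": 3}   # cs3 missed an earlier mutation
print(grant_lease({7: 4}, 7, versions))     # cs1/cs2 move to v5, cs3 is stale
```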

Conclusion

In summary, the Google File System (GFS) was an innovative solution designed to meet Google's extensive data processing needs in the early 2000s. Its strength lay in supporting large-scale workloads on commodity hardware, treating component failures as the norm and optimizing for the efficient handling of massive files. The system ensured fault tolerance through constant monitoring, replication, and swift, automatic recovery. By separating control flow from data transfer, and thereby minimizing master involvement, GFS delivered high aggregate throughput to many concurrent readers and writers. GFS was widely used within Google but was later replaced by Colossus.
