GFS / Google File System


1. Requirements

1.1 Functional

1. The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.

2. The system stores a few million files, each typically ranging from 100 MB to multiple GB in size.

3. Reads:

  • a. Large streaming reads (hundreds of KBs, often 1 MB or more, per operation); successive reads from the same client often read through a contiguous region of a file.
  • b. Small random reads (a few KBs at some arbitrary offset).

4. Writes to files (workload is similar to reads):

  • a. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
  • b. Multiple clients concurrently append to the same file.
  • c. Appends need atomicity with minimal synchronization overhead.

2. Architecture

  • A GFS cluster consists of a single master and multiple chunkservers, and is accessed by multiple clients.

(The user's workspace holds files such as docx, xls.)

[Dropbox] GFS-Client1 <----RW----> Chunkserver1 (Linux) ----heartbeat----
                                                                         |
                                   Chunkserver2 (Linux) ----heartbeat---- GFS-Master
                                                                         |
          client2     <----RW----> Chunkserver3 (Linux) ----heartbeat----
             |
             | ------ which chunkserver to contact for file-x ------> GFS-Master
             | <---------------------- chunkserver3 -----------------
             | <------- RW (with chunkserver3) ------->

2.1 Chunks

  • Files are divided into fixed-size chunks.
  • Each chunk is identified by an immutable (non-changeable) and globally unique 64-bit chunk handle, assigned by the master at the time of chunk creation.
  • For reliability, each chunk is replicated on multiple chunkservers (by default, 3 replicas).

2.1.a Chunk size = 64MB

Advantages of a large chunk?

  • The client needs to contact the gfs-master only the first time to get the chunk-replica IP addresses; after that they are cached. //See Read operation below
  • The gfs-client can perform many operations on a single large chunk, which reduces network overhead.
  • It reduces the size of the metadata stored on the master (see the quick estimate below).
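A rough back-of-envelope illustration of the metadata saving (using the ~64 bytes of metadata per chunk mentioned in 2.2.a; the 1 PB figure is just an assumed example):

    1 PB of data / 64 MB per chunk  =  ~16 million chunks
    16 million chunks x 64 bytes    =  ~1 GB of master metadata

With a much smaller chunk size (say 1 MB), the same data would need 64x as many chunks and roughly 64 GB of metadata, which would no longer fit comfortably in the master's memory.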


2.2 GFS Master

2.2.a GFS Master Stores

  • A. All metadata
  • Keeps less than 64 bytes of metadata for each 64MB chunk.
  • File and chunk namespaces (think of something similar to C++ namespaces) and the mapping from files to chunks. //These two are kept persistent by logging mutations to the operation log
  • Current location of each chunk's replicas, and access control information.
  • Chunk locations are not persisted: the master asks each chunkserver about its chunks at startup, and stays up to date afterwards because it controls all chunk placement and monitors chunkservers through regular HeartBeat messages.
  • All metadata is kept in an in-memory database (IMDB) on the master; a rough sketch of these structures follows below.
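A minimal sketch of how the master's in-memory metadata could be organised. The field and type names here are illustrative assumptions, not structures taken from the GFS paper:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using ChunkHandle = uint64_t;                      // globally unique, assigned by the master

struct ChunkInfo {
    uint32_t version;                              // chunk version number
    std::vector<std::string> replicaAddresses;     // chunkserver IPs (rebuilt at startup, not persisted)
};

struct FileMetadata {
    std::vector<ChunkHandle> chunks;               // chunk index -> chunk handle
    uint32_t accessControlBits;                    // simplified access control information
};

struct MasterMetadata {
    // File namespace: full pathname -> file metadata (kept persistent via the operation log).
    std::unordered_map<std::string, FileMetadata> files;
    // Chunk namespace: chunk handle -> version + replica locations.
    std::unordered_map<ChunkHandle, ChunkInfo> chunks;
};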

B. Operation Logs

  • These are stored both on the gfs-master's local disk and replicated on remote machines.


Purpose?

  • The log records the creation of files and chunks and their versions, in the order the operations happen.
  • If the master crashes, replaying this log is how the master's state is recovered.
  • Checkpoints of the state (kept in a compact B-tree-like form) are written into the log stream, so that on recovery the master does not have to replay the whole log, which would take a long time; it loads the latest checkpoint and replays only the log records after it (see the sketch below this list).
  • A new checkpoint is created in a separate thread; it takes roughly a minute for a cluster containing a few million files.
  • After a new checkpoint is created, older checkpoints and log files can be deleted, though some are kept to guard against catastrophes.
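A minimal, self-contained sketch of the recovery idea above (load the latest checkpoint, then replay only the mutations logged after it). The record format, file paths and function names are assumptions for illustration:

#include <fstream>
#include <string>
#include <unordered_map>

// Very simplified "master state": file path -> comma-separated list of chunk handles.
using MasterState = std::unordered_map<std::string, std::string>;

// Apply one logged mutation of the illustrative form "path=chunk-list".
void applyMutation(MasterState& state, const std::string& record) {
    auto pos = record.find('=');
    if (pos != std::string::npos)
        state[record.substr(0, pos)] = record.substr(pos + 1);
}

// Recovery = load the latest checkpoint, then replay log records written after it.
MasterState recoverMaster(const std::string& checkpointPath, const std::string& logPath) {
    MasterState state;
    std::string line;
    std::ifstream checkpoint(checkpointPath);   // compact snapshot of the namespace
    while (std::getline(checkpoint, line)) applyMutation(state, line);
    std::ifstream log(logPath);                 // only the mutations appended after that checkpoint
    while (std::getline(log, line)) applyMutation(state, line);
    return state;
}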

Tasks Performed

1. Chunk management

  • Garbage collection of orphaned chunks, and chunk migration between chunkservers.
  • Replication decisions using global knowledge
  • Sophisticated chunk placement

Does not:

  • Participate in reads and writes of file data with clients, so that it does not become a bottleneck.


2.3 GFS Client

  • GFS client code is linked into each client application; it implements the file system API and communicates with the master and chunkservers on the application's behalf.
  • It interacts with the GFS master only to learn which chunkserver to contact; the actual RW of data is done directly with the chunkservers.


2.3.1 Read Operation: User Reading a File

  • User opens the dropbox space, opens a file and places the cursor at some index.
  • The GFS Client takes the filename and byte offset (within the file), converts the byte offset to a chunk index (offset / chunk size), and sends (filename, chunk index) to the GFS master.
  • Master returns the chunk_handle (like a file handle) and the chunkserver-replica IP addresses to the GFS Client.
  • The GFS Client caches this information, keyed by (filename, chunk index), e.g. unordered_map<pair<filename, chunk_index>, pair<chunk_handle, replica_list>>, so later reads of the same chunk need no master interaction.
  • The client connects to the nearest replica and sends (chunk_handle, byte range) to get the data. The client can request multiple chunks within one request. A code sketch of this path follows the diagram below.

Dropbox-space

 - User opens a.txt at offset=1000    //1

              GFS-CLIENT
              converts offset=1000 to chunk index=2    //2

                  ----------- a.txt, index=2 ----------->  GFS-Master
                  <---- chunk_handle=60, chunkserver-replicas ----    //3

              Caches    //4
              <key=(filename, chunk_index), value=(chunk_handle, replicas)>

              GFS-CLIENT
              connects to nearest replica
                  ---------- chunk_handle, start_offset, end_offset ------->  Replica
                  <--------- chunk data ------------------------------------
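A minimal sketch of the client-side read path shown above (byte-offset to chunk-index translation plus a metadata cache). Apart from the 64 MB chunk size, all names and the stubbed master reply are illustrative assumptions:

#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

constexpr uint64_t kChunkSize = 64ull * 1024 * 1024;        // 64 MB chunks

struct ChunkLocation {
    uint64_t chunkHandle;                                   // handle returned by the master
    std::vector<std::string> replicaAddresses;              // chunkserver IPs holding the chunk
};

// Cache: (filename, chunk index) -> chunk handle + replica locations.
std::map<std::pair<std::string, uint64_t>, ChunkLocation> locationCache;

// Stub standing in for the RPC to the master; pretend it always answers with one replica.
ChunkLocation askMaster(const std::string& /*filename*/, uint64_t /*chunkIndex*/) {
    return {60, {"chunkserver3:7070"}};
}

ChunkLocation locateChunk(const std::string& filename, uint64_t byteOffset) {
    uint64_t chunkIndex = byteOffset / kChunkSize;          // step //2 in the diagram
    auto key = std::make_pair(filename, chunkIndex);
    auto it = locationCache.find(key);
    if (it != locationCache.end())
        return it->second;                                  // cache hit: no master interaction
    ChunkLocation loc = askMaster(filename, chunkIndex);    // step //3
    locationCache[key] = loc;                               // step //4: cache the answer
    return loc;                                             // caller then reads from the nearest replica
}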

2.3.2 Write Operation: User writing to a File

  • 1. User opens the dropbox space, opens a file and writes at some index.
  • 2. The GFS Client asks the master which chunk-replica holds the ownership/lease of the current chunk. If no one holds the lease, the master grants a lease to one replica and returns all replica locations.
  • Chunk Ownership/LEASE: every chunk is owned by one of its chunk-replicas (the primary) for a specific time. This lease is assigned by the master to that chunk-replica.
  • 3. Master returns the chunkserver-replica IP addresses (primary plus secondaries) to the GFS Client.
  • 4. The GFS Client pushes the data to the nearest chunk-replica. That replica forwards the data to its nearest chunk-replica-b, which forwards it to its nearest chunk-replica, and so on. Pipelining the data along this chain saves network bandwidth and minimizes latency; roughly 1 MB can be distributed in about 80 ms. A sketch of this chained push follows the diagram below.
  • 5. Once all replicas have the data, the primary chunk-replica (the one holding the lease) assigns serial numbers to the mutations and applies them in that order.
  • 6. Any error in the write is reported back to the GFS-Client. If writing fails on some replica, that data block will not have been assigned a serial number by that chunk-replica and is possibly lost or left in an inconsistent state; the client retries the failed mutations.

Dropbox-space

 - User opens a.txt at offset=1000 and writes data-y    //1
   a.txt, offset=1000 = chunk-x

              GFS-CLIENT
              Get chunk-server holding lease    //2

                  ----------------- chunk-x ---------->  GFS-Master
                  <---- chunkserver-replicas list ----    //3

              Push data to all replicas    //4
                  ------------ data-y ----------> chunk-replica-a    //5
                  <-------- ACK ----------------
                                                  --- data-y --> chunk-replica-b
                                                  <---- ACK ----
                                                                  --- data-y --> chunk-replica-c
                                                                  <---- ACK ----
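A minimal sketch of the chained data push from step 4, where each replica stores the data and forwards it to the next-nearest replica in the chain. The chain ordering and all names are illustrative assumptions:

#include <iostream>
#include <string>
#include <vector>

// Each replica stores the data locally and forwards it to the next replica in the chain.
// In the real pipeline each hop forwards while it is still receiving; here we only show the order.
void receiveAndForward(size_t hop, const std::vector<std::string>& chain, const std::string& data) {
    std::cout << chain[hop] << " stored " << data.size() << " bytes\n";
    if (hop + 1 < chain.size())
        receiveAndForward(hop + 1, chain, data);   // forward to the nearest remaining replica
}

int main() {
    // Client pushes only to the nearest replica; the replicas pipeline the data onward.
    std::vector<std::string> chain = {"chunk-replica-a", "chunk-replica-b", "chunk-replica-c"};
    receiveAndForward(0, chain, std::string(1 << 20, 'y'));   // ~1 MB of data-y
}

Because each machine forwards to only one next hop, its full outbound bandwidth is used for the transfer instead of being split among several recipients.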

2.3.3 User creates snapshot of workspace

Snapshot? Makes a copy of a file or a directory tree (the “source”) almost instantaneously, without affecting any ongoing mutations/write operations.

  • 1. User opens the dropbox space and presses the snapshot button.
  • 2. When the gfs-master receives a snapshot request, it first revokes any outstanding leases on the chunks in the files it is about to snapshot. This ensures that any subsequent write to these chunks will require an interaction with the master to find the lease holder, which gives the gfs-master a chance to create a new copy of the chunk first.

Chunk Ownership/LEASE: every chunk is owned by one of its chunk-replicas for a specific time. This lease is assigned by the master to that chunk-replica.

  • 3. Master logs the snapshot operation to the operation log.
  • 4. The newly created snapshot files point to the same chunks as the source files (copy-on-write); no chunk data is copied at snapshot time.
  • 5. Later, when the GFS Client first writes new data to chunk-x of old file-y covered by the snapshot, it sends a request to the gfs-master to find the chunk-replica.
  • 6. The gfs-master notices the chunk is shared with the snapshot, so it creates a new copy of chunk-x on the same chunk-replicas that hold file-y's chunk. This ensures the data is copied locally (no network-based copies); the write then proceeds as in 2.3.2. See the sketch after the diagram below.

Dropbox-space/GFS-Client

  Take snapshot    //1

        -------------------------->  GFS-Master
                                   Revokes outstanding leases    //2
                                   Logs the operation to the operation log    //3
                                   Makes the newly created snapshot point to the old chunks    //4

  GFS-Client
  writes chunk-x of old file-y covered by the snapshot    //5
        ----------- Find chunk-replica ------>
                                   Creates chunk-x on the chunk-replica    //6
                                   that holds file-y
                                           ------------- Create chunk-x ---------->  chunk-replica
                                   //Then same as writing to a chunk (2.3.2)
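A minimal sketch of the copy-on-write idea behind snapshots: chunks carry a reference count, snapshotting only bumps the counts, and a shared chunk is cloned lazily on its first write. All structures and names are illustrative assumptions:

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

using ChunkHandle = uint64_t;

std::unordered_map<std::string, std::vector<ChunkHandle>> fileToChunks; // file -> its chunks
std::unordered_map<ChunkHandle, int> refCount;                          // how many files share a chunk
ChunkHandle nextHandle = 1000;

// Create a brand-new chunk for a file (reference count starts at 1).
ChunkHandle createChunk(const std::string& file) {
    ChunkHandle h = nextHandle++;
    refCount[h] = 1;
    fileToChunks[file].push_back(h);
    return h;
}

// Snapshot: the new file simply points at the source file's chunks (no data copied).
void snapshotFile(const std::string& source, const std::string& snapshotName) {
    fileToChunks[snapshotName] = fileToChunks[source];
    for (ChunkHandle h : fileToChunks[snapshotName]) ++refCount[h];
}

// First write to a shared chunk: clone it (on the same chunkservers, so the copy is local)
// and make the written file point at the new copy.
ChunkHandle chunkForWrite(const std::string& file, size_t chunkIndex) {
    ChunkHandle h = fileToChunks[file][chunkIndex];
    if (refCount[h] > 1) {                         // still shared with a snapshot
        ChunkHandle clone = nextHandle++;
        refCount[clone] = 1;
        --refCount[h];
        fileToChunks[file][chunkIndex] = clone;    // master asks replicas of h to create the clone locally
        return clone;
    }
    return h;                                      // not shared: write in place
}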

2.3.4 Deleting a file, Lazy Claiming. Distributed Garbage Collection.

  • 1. User deletes a file in their namespace.
  • 2. The delete request goes to the master. The master does not immediately go to the replicas and delete/reclaim the space; instead it just renames the deleted file to a hidden name.

The hidden file remains on the filesystem for 3 (configurable) days; until then the user can still restore it.

After 3 days, during the master's regular scan of the file namespace, the hidden file and its metadata are deleted.

During a similar regular scan of the chunk namespace, the master identifies orphaned chunks (chunks not reachable from any file) and erases their metadata; the chunkservers learn through the regular HeartBeat exchange that those chunks can be reclaimed.

This is done in the background, when the master is relatively free. A small sketch of the lazy-deletion idea follows.
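A minimal sketch of the lazy-reclamation idea (rename to a hidden name on delete, purge during a later scan). Apart from the 3-day default mentioned above, the naming convention and helpers are assumptions:

#include <chrono>
#include <string>
#include <unordered_map>

using Clock = std::chrono::system_clock;

// Hidden (logically deleted) files: hidden path -> time of deletion.
std::unordered_map<std::string, Clock::time_point> hiddenFiles;

// Logical delete: rename to an illustrative hidden name and remember when it happened.
std::string lazyDelete(const std::string& path) {
    std::string hidden = path + ".deleted";
    hiddenFiles[hidden] = Clock::now();
    return hidden;                                 // until the purge, restoring = renaming back
}

// Periodic namespace scan: purge hidden files older than the grace period (default 3 days).
void garbageCollect(std::chrono::hours gracePeriod = std::chrono::hours(72)) {
    auto now = Clock::now();
    for (auto it = hiddenFiles.begin(); it != hiddenFiles.end();) {
        if (now - it->second > gracePeriod)
            it = hiddenFiles.erase(it);            // drop metadata; its chunks become orphaned
        else
            ++it;
    }
}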


2.4 Chunk Servers

Stores chunks as plain files on the local Linux filesystem. No extra caching of file data is needed here, because the Linux buffer cache already keeps frequently accessed data in memory.


3. Caching

  • Caching file data: done neither on the client nor on the chunkserver.
  • Client caches offer little benefit because most applications stream through huge files or have working sets too large to cache.
  • File-data caching would also create cache coherence problems.
  • Caching metadata: yes, clients do cache metadata (chunk handles and replica locations).


4. Fault Tolerance

Among the hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. The system is kept highly available using these strategies.


4.1 Fast recovery

1. The GFS-master and chunkservers are designed so that, if they fail or are terminated, they can restart quickly and restore their state within a few seconds.


4.2 Chunk replication

1. Each chunk is replicated on multiple chunkservers on different racks (so that the failure of a single rack does not take down all replicas of a chunk).

Other redundancy schemes between replicas, such as parity or erasure codes, are also possible.


5. Data Integrity

Each chunkserver uses checksums (32-bit) to ensure its chunks are not corrupted. Each 64 MB chunk is broken into 64 KB blocks, and a 32-bit checksum is kept for each block. Like other metadata, checksums are kept in memory (the IMDB) and stored persistently with logging, separate from user data.

If checksums had to be fetched from (or compared against) other chunkservers, it would lead to a lot of traffic, so each chunkserver maintains its own checksum record.

If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master.

The requestor then reads from another replica, and the master clones the chunk from a good replica. A sketch of the read-side verification follows.
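A minimal sketch of per-block checksum verification on read. The 64 KB block size and 32-bit checksum width come from the description above; the actual checksum algorithm is not specified there, so a simple FNV-style hash is used as a stand-in:

#include <algorithm>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

constexpr size_t kBlockSize = 64 * 1024;            // 64 KB blocks within a 64 MB chunk

// Placeholder 32-bit checksum (FNV-1a style); a real system might use CRC32 instead.
uint32_t checksum32(const char* data, size_t len) {
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < len; ++i) { h ^= static_cast<uint8_t>(data[i]); h *= 16777619u; }
    return h;
}

struct StoredChunk {
    std::string data;                               // chunk contents (simplified; really a local file)
    std::vector<uint32_t> blockChecksums;           // one 32-bit checksum per 64 KB block
};

// Verify the checksum of every block overlapping the requested range before returning data.
std::string readRange(const StoredChunk& chunk, size_t offset, size_t length) {
    size_t firstBlock = offset / kBlockSize;
    size_t lastBlock = (offset + length - 1) / kBlockSize;
    for (size_t b = firstBlock; b <= lastBlock; ++b) {
        size_t start = b * kBlockSize;
        size_t len = std::min(kBlockSize, chunk.data.size() - start);
        if (checksum32(chunk.data.data() + start, len) != chunk.blockChecksums[b])
            throw std::runtime_error("checksum mismatch: report to master, caller reads another replica");
    }
    return chunk.data.substr(offset, length);
}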


6. BOE (Back-of-Envelope Estimates)

  • Consider the test environment as:
  • 1 master, 2 master replicas, 16 chunkservers, 16 clients. Each machine: RAM = 2GB, hard disk = 2 x 80GB.

All machines are connected to each other through the two switches below.

               |---------------- 1 Gbps link -------------------------|
            switch-1 (100 Mbps)                                   switch-2 (100 Mbps)
               |                                                       |
    |------|---------|---------|-------------|            |--------|---------|------
  master  replica-1  replica-2  16 chunkservers          client1   client2   client3 ...

6.1 Reads

Maximum read limits given the 1 Gbps inter-switch link and the 100 Mbps switches:

1 Gbps link shared by 16 readers: 10^9 / 16 = 62.5 Mbps per reader

100 Mbps switch shared by 16 readers: 100 x 10^6 / 16 = 6.25 Mbps per reader

One reader's read limit = 6.25 Mbps (the lower of the two). This drops further as the number of readers increases.

The limit can be increased by using faster links (e.g. CAT7 ethernet cable).

6.2 Appends

Let us consider all 16 clients trying to append to one file; for one writer the limit is again 6.25 Mbps.

Considering collisions on the network and packet drops, the append rate per client reduces to roughly 4.8 Mbps.
