GFS / Google File System
1. Requirements
1.1 Functional
1. The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
2. The system stores millions of files, each typically 1-100 GB in size.
3. Reads: mostly large streaming reads, plus occasional small random reads.
4. Writes to files: workload similar in character to reads; mostly large, sequential appends.
2. Architecture
   (docx, xls, ...)
   [Dropbox] GFS-Client1 <------> Chunkserver1 (Linux) ---heartbeat---+
                                                                      |
                                  Chunkserver2 (Linux) ---heartbeat--- GFS-Master
                                                                      |
             GFS-Client2 <------> Chunkserver3 (Linux) ---heartbeat---+
                  |
                  | ------ which chunkserver to contact for file-x ------> GFS-Master
                  | <------------------- chunkserver3 --------------------
                  | <---------------- R/W with chunkserver3 ------------->
2.1 Chunks
2.1.a Chunk size = 64MB
Advantages of a large chunk size (a small offset-to-chunk sketch follows below):
1. Fewer client-master interactions: reads and writes within the same chunk need only one chunk-location request.
2. Clients are more likely to reuse a persistent TCP connection to a chunkserver, since many operations target the same chunk.
3. Less metadata on the master, so all metadata can be kept in memory.
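A minimal sketch (Python, hypothetical helper name) of how a byte offset in a file maps onto a chunk index and an offset inside that chunk, given the fixed 64 MB chunk size:

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunk size

def to_chunk_coords(byte_offset: int) -> tuple[int, int]:
    """Translate an absolute byte offset in a file into
    (chunk index, offset within that chunk)."""
    return byte_offset // CHUNK_SIZE, byte_offset % CHUNK_SIZE

print(to_chunk_coords(1000))               # (0, 1000): small offsets fall in chunk 0
print(to_chunk_coords(150 * 1024 * 1024))  # (2, 23068672): a 150 MB offset falls in chunk 2
```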
2.2 GFS Master
2.2.a GFS Master Stores
A. Metadata: the file and chunk namespaces, the file-to-chunk mapping, and the locations of each chunk's replicas. All of this is kept in memory; replica locations are not persisted but re-learned from chunkservers at startup and via heartbeats.
B. Operation Logs
Purpose: the operation log is the only persistent record of metadata changes, and it also defines the serial order of concurrent operations. It is replicated on remote machines, and the master recovers its state by replaying the log from the latest checkpoint (a sketch follows at the end of this subsection).
Tasks Performed
1. Chunk management: granting chunk leases, re-replicating chunks when replicas are lost, rebalancing chunks across chunkservers, and garbage-collecting orphaned chunks.
2. Namespace management and locking.
3. Exchanging heartbeat messages with chunkservers to collect their state and send them instructions.
Does not: serve file data. Clients get only metadata from the master and then read/write chunk data directly from chunkservers, so the master never becomes a data-bandwidth bottleneck.
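A rough sketch of the state described above; the class and field names are assumptions for illustration, not GFS internals. The point is that all metadata lives in memory and every namespace mutation is appended to the operation log before it is applied:

```python
from dataclasses import dataclass, field

@dataclass
class ChunkInfo:
    handle: int                                         # globally unique chunk handle
    replicas: list[str] = field(default_factory=list)   # chunkserver addresses; not persisted,
                                                        # re-learned from heartbeats

@dataclass
class MasterState:
    files: dict[str, list[int]] = field(default_factory=dict)   # path -> ordered chunk handles
    chunks: dict[int, ChunkInfo] = field(default_factory=dict)  # handle -> chunk info
    op_log: list[str] = field(default_factory=list)             # replicated, flushed before applying

    def create_file(self, path: str) -> None:
        self.op_log.append(f"CREATE {path}")   # 1. record the mutation durably
        self.files[path] = []                  # 2. then apply it to the in-memory state

m = MasterState()
m.create_file("/dropbox/a.txt")
print(m.files, m.op_log)
```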
2.3 GFS Client
2.3.1 Read Operation: User Reading a File
Dropbox-space
  - User opens a.txt at offset=1000                                          //1

GFS-Client
  Converts offset=1000 to chunk index=0 (index = offset / 64 MB)             //2

      -------------------- a.txt, index=0 ---------------------> GFS-Master
      <-------- chunk_handle=60, chunkserver replica list -------            //3

  Caches <key=(file name, chunk index), value=(handle, replica list)>        //4

GFS-Client
  Connects to the nearest replica
      -------- chunk_handle=60, start_offset, end_offset --------> Replica
      <----------------------- chunk data -----------------------            //5
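A condensed sketch of this read path in Python. FakeMaster and FakeChunkserver are stand-ins for the real RPC endpoints (their method names are assumptions); the step numbers in the comments match the diagram above.

```python
CHUNK_SIZE = 64 * 1024 * 1024

class FakeMaster:
    """Stand-in for the GFS master: (path, chunk index) -> (chunk handle, replica list)."""
    def __init__(self, table):
        self.table = table
    def lookup(self, path, index):
        return self.table[(path, index)]

class FakeChunkserver:
    """Stand-in for a chunkserver holding chunk data keyed by chunk handle."""
    def __init__(self, chunks):
        self.chunks = chunks
    def read(self, handle, start, end):
        return self.chunks[handle][start:end]

class GFSClient:
    def __init__(self, master):
        self.master = master
        self.cache = {}   # //4: (path, chunk index) -> (handle, replicas)

    def read(self, path, offset, length):
        index, start = divmod(offset, CHUNK_SIZE)                        # //2
        if (path, index) not in self.cache:
            self.cache[(path, index)] = self.master.lookup(path, index)  # //3
        handle, replicas = self.cache[(path, index)]
        replica = replicas[0]                                            # "nearest" replica
        return replica.read(handle, start, start + length)               # //5: data bypasses master

# Demo: chunk handle 60 holds 2000 bytes of a.txt's first chunk (index 0).
cs = FakeChunkserver({60: b"x" * 2000})
client = GFSClient(FakeMaster({("a.txt", 0): (60, [cs])}))
print(client.read("a.txt", 1000, 10))   # 10 bytes starting at offset 1000
```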
2.3.2 Write Operation: User writing to a File
Dropbox-space
  - User opens a.txt at offset=1000 and writes data-y                        //1
    (a.txt, offset=1000) maps to chunk-x

GFS-Client
  Asks which chunkserver holds the lease for chunk-x                         //2

      ------------------------ chunk-x -------------------------> GFS-Master
      <------ primary (lease holder) + chunkserver replica list --           //3

  Pushes data-y to all replicas, pipelined along a chain                     //4

      ---- data-y ---> chunk-replica-a                                       //5
      <---- ACK ------
                       ---- data-y ---> chunk-replica-b
                       <---- ACK ------
                                        ---- data-y ---> chunk-replica-c
                                        <---- ACK ------

  Once every replica has ACKed the data, the client sends the write request
  to the primary; the primary assigns the mutation a serial number, applies
  it locally, and forwards the request to the secondary replicas.            //6
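A toy sketch of the data-push phase (//4-//5), with a hypothetical push() call: the client sends data-y only to the first replica, and each replica forwards it to the next one in the chain, so each machine's outbound link is used once; the ACK comes back once the whole chain holds the data.

```python
class Replica:
    """Stand-in chunk replica: buffers pushed data and forwards it down the chain."""
    def __init__(self, name):
        self.name = name
        self.buffer = {}   # data_id -> bytes, held until the primary orders the write

    def push(self, data_id, data, chain):
        self.buffer[data_id] = data
        if chain:                                   # forward to the next replica in the chain
            chain[0].push(data_id, data, chain[1:])
        return "ACK"                                # ACK flows back to the client

def push_to_replicas(data_id, data, replicas):
    """Client side of //4: send once to the closest replica, let the chain do the rest."""
    return replicas[0].push(data_id, data, replicas[1:])

a, b, c = Replica("chunk-replica-a"), Replica("chunk-replica-b"), Replica("chunk-replica-c")
print(push_to_replicas("data-y", b"payload", [a, b, c]))    # ACK
print(all("data-y" in r.buffer for r in (a, b, c)))         # True: every replica has the data
```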
2.3.3 User creates snapshot of workspace
Snapshot: makes a copy of a file or a directory tree (the “source”) almost instantaneously, without affecting any ongoing mutations/write operations.
Chunk ownership / lease: at any time, one of a chunk's replicas holds a lease on that chunk, granted by the master for a limited (renewable) period; the lease holder acts as the primary that orders mutations to the chunk.
Dropbox-space / GFS-Client
  Take snapshot                                                              //1

      -----------------------------------------> GFS-Master
                                  Revokes outstanding leases on the
                                  source's chunks                            //2
                                  Logs the operation to the operation log    //3
                                  Makes the newly created snapshot point
                                  to the old data (copy-on-write)            //4

GFS-Client
  Later writes to chunk-x of the old file-y that the snapshot shares         //5
      ------------ find chunk-replica ----------> GFS-Master
                                  Asks the chunk-replica that already holds
                                  file-y's chunk to create a local copy,
                                  chunk-x                                    //6
                                      ------ create chunk-x ------> chunk-replica
                                  // From here on, same as writing to a chunk
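A toy copy-on-write sketch of steps //4-//6; the data structures and names are illustrative, not GFS's actual ones. The snapshot shares chunk handles with the source, and a shared chunk is duplicated only when a later write touches it:

```python
import itertools

class Namespace:
    """Toy file -> chunk-handle mapping with a reference count per chunk."""
    def __init__(self):
        self.files = {}       # path -> list of chunk handles
        self.refcount = {}    # chunk handle -> number of files pointing at it
        self._handles = itertools.count(100)

    def snapshot(self, src, dst):
        # //4: the snapshot just points at the source's existing chunks
        self.files[dst] = list(self.files[src])
        for h in self.files[dst]:
            self.refcount[h] += 1

    def write_chunk(self, path, i):
        h = self.files[path][i]
        if self.refcount[h] > 1:            # //6: chunk shared with a snapshot -> copy first
            new = next(self._handles)       # the copy is made locally on the chunkserver
            self.refcount[h] -= 1           # that already holds the original chunk
            self.refcount[new] = 1
            self.files[path][i] = new
        return self.files[path][i]

ns = Namespace()
ns.files["file-y"], ns.refcount[60] = [60], 1
ns.snapshot("file-y", "snapshot/file-y")   # instant: no chunk data is copied
print(ns.write_chunk("file-y", 0))         # 100: writer gets a fresh copy, snapshot keeps 60
```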
2.3.4 Deleting a File: Lazy Reclamation, Distributed Garbage Collection
When a user deletes a file, it is renamed to a hidden name and remains on the filesystem for 3 (configurable) days; during that window the user can still restore it.
After 3 days, during the master's regular scan of the filesystem namespace, the hidden file and its metadata are deleted.
During this scan the master also looks for orphaned chunks (chunks no longer reachable from any file); it erases their metadata, and the chunkservers holding them reclaim the space.
This work is done in the background, when the master is relatively free.
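A minimal sketch of lazy deletion and the background scan, with made-up structures (files maps path -> chunk handles, chunk_owners maps handle -> chunkserver); the 3-day retention constant mirrors the note above:

```python
import time

RETENTION_SECONDS = 3 * 24 * 3600     # 3 days, configurable

def delete(files, path, now=None):
    """Deletion = rename to a hidden, timestamped name; chunk data stays where it is."""
    now = time.time() if now is None else now
    files[f".deleted/{path}@{now}"] = files.pop(path)

def gc_scan(files, chunk_owners, now=None):
    """Background scan, run when the master is relatively free."""
    now = time.time() if now is None else now
    for name in list(files):               # purge hidden files older than the retention window
        if name.startswith(".deleted/") and now - float(name.rsplit("@", 1)[1]) > RETENTION_SECONDS:
            del files[name]
    live = {h for handles in files.values() for h in handles}
    for h in list(chunk_owners):           # orphaned chunks: drop metadata; the owning
        if h not in live:                  # chunkserver reclaims the space later
            del chunk_owners[h]

files, chunk_owners = {"a.txt": [60, 61]}, {60: "chunkserver-1", 61: "chunkserver-2"}
delete(files, "a.txt", now=0)
gc_scan(files, chunk_owners, now=RETENTION_SECONDS + 1)
print(files, chunk_owners)    # {} {}
```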
2.4 Chunk Servers
Chunkservers store chunks as plain local (Linux) files. No extra caching layer is needed here; the Linux buffer cache already keeps frequently accessed data in memory.
3. Caching
Clients cache chunk-location metadata (as in the read flow above) but do not cache chunk data: working sets are too large for a data cache to help, and skipping it avoids cache-coherence problems. Chunkservers need no cache of their own either (see 2.4).
4. Fault Tolerance
Among the hundreds of servers in a GFS cluster, some are bound to be unavailable at any given time. The system is kept highly available using the following strategies.
4.1 Fast recovery
1. The GFS master and chunkservers are designed so that, if they fail or are terminated, they can restart and restore their state within a few seconds.
4.2 Chunk replication
1. Each chunk is replicated on multiple chunkservers placed on different racks, so that the failure of a single rack cannot take down all of a chunk's replicas (a placement sketch follows below).
2. Redundancy schemes other than plain copies, such as parity or erasure codes, can also be used across replicas.
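A small illustrative sketch of rack-aware replica placement (the function and data layout are assumptions, not GFS's placement policy): pick each replica from a different rack so a whole-rack failure still leaves live copies.

```python
import random

def place_replicas(servers_by_rack: dict[str, list[str]], n: int = 3) -> list[str]:
    """Pick up to n chunkservers, each from a different rack when possible."""
    racks = list(servers_by_rack)
    random.shuffle(racks)
    return [random.choice(servers_by_rack[r]) for r in racks[:n]]

servers = {"rack-1": ["cs1", "cs2"], "rack-2": ["cs3"], "rack-3": ["cs4", "cs5"]}
print(place_replicas(servers))   # e.g. ['cs2', 'cs3', 'cs4']: survives a full-rack failure
```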
5. Data Integrity
Each chunkserver uses 32-bit checksums to detect corruption in the chunks it stores. Every 64 MB chunk is broken into 64 KB blocks, and a checksum is kept for each block; the checksums are held in memory as metadata, separate from the user data.
Fetching checksums from other chunkservers on each read would generate a lot of traffic, so every chunkserver maintains its own checksum records and verifies its data independently.
If a block does not match the recorded checksum, the chunkserver returns an error to the requestor and reports the mismatch to the master.
The requestor then reads from another replica, and the master clones the chunk from a good replica (and has the corrupted copy deleted).
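A sketch of per-block checksumming, using CRC-32 from Python's zlib as a stand-in for GFS's 32-bit checksum (function names are illustrative): the chunk is split into 64 KB blocks, a checksum is recorded per block, and a read verifies every block it overlaps before returning data.

```python
import zlib

BLOCK_SIZE = 64 * 1024     # 64 KB blocks inside each 64 MB chunk

def checksum_blocks(chunk: bytes) -> list[int]:
    """Per-block 32-bit checksums, kept by the chunkserver as metadata."""
    return [zlib.crc32(chunk[i:i + BLOCK_SIZE]) for i in range(0, len(chunk), BLOCK_SIZE)]

def verified_read(chunk: bytes, checksums: list[int], start: int, end: int) -> bytes:
    """Verify every 64 KB block the requested range overlaps before returning data."""
    for b in range(start // BLOCK_SIZE, (end - 1) // BLOCK_SIZE + 1):
        if zlib.crc32(chunk[b * BLOCK_SIZE:(b + 1) * BLOCK_SIZE]) != checksums[b]:
            # a real chunkserver returns an error to the requestor and reports to the master
            raise IOError(f"checksum mismatch in block {b}")
    return chunk[start:end]

chunk = bytes(256 * 1024)                              # small stand-in chunk for the demo
sums = checksum_blocks(chunk)
print(len(verified_read(chunk, sums, 1000, 2000)))     # 1000: range verified block by block
```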
6. BOE (Back-of-the-Envelope Calculations)
All machines are connected to each other through switches.
              |--------------------- 1 Gbps link ---------------------|
        switch-1 (100 Mbps ports)                          switch-2 (100 Mbps ports)
              |                                                        |
    |---------|----------|-----------|                    |-----------|-----------|
  master  replica-1  replica-2  16 chunkservers        client1     client2     client3
6.1 Reads
Maximum read throughput given the 1 Gbps inter-switch link and the 100 Mbps switch ports:
1 Gbps link shared by 16 readers: 1000 Mbps / 16 = 62.5 Mbps per reader.
100 Mbps switch port shared by 16 readers: 100 Mbps / 16 = 6.25 Mbps per reader.
One reader's read limit = 6.25 Mbps (the lower of the two). This drops further as the number of readers grows.
The limit can be raised by using faster links, e.g. Cat 7 Ethernet cable.
6.2 Appends
Suppose all clients try to append to a single file: the theoretical limit per writer is the same 6.25 Mbps.
Accounting for collisions on the network and packet drops, the append throughput per client falls to roughly 4.8 Mbps.
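The same arithmetic as a quick calculation, assuming 16 concurrent readers share both the 1 Gbps inter-switch link and a 100 Mbps switch port:

```python
readers = 16
inter_switch_link_mbps = 1000    # the 1 Gbps link between switch-1 and switch-2
switch_port_mbps = 100           # each switch port is 100 Mbps

per_reader_link = inter_switch_link_mbps / readers   # 1000 / 16 = 62.5 Mbps
per_reader_port = switch_port_mbps / readers         # 100  / 16 = 6.25 Mbps
read_limit = min(per_reader_link, per_reader_port)   # 6.25 Mbps: the switch port is the bottleneck
print(per_reader_link, per_reader_port, read_limit)
```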