ABSTRACT. We have designed and implemented the Google File Sys- tem, a scalable distributed file system for large distributed data-intensive applications. We have designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault. monitoring, fault tolerance, auto-recovery (thousands of low-cost machines). • focus on multi-GB files. • handle appends efficiently (no random writes.
|Language:||English, Spanish, Dutch|
|Genre:||Fiction & Literature|
|PDF File Size:||13.41 MB|
|Distribution:||Free* [*Regsitration Required]|
The Google File System. By Sanjay Ghemawat, Howard Gobioff, and. Shun-Tak Leung. (Presented at the 19th ACM Symposium on. Operating Systems. 2. Outline. • File systems overview. • GFS (Google File System). • Motivations. • Architecture. • Algorithms. • HDFS (Hadoop File System). Dennis Ritchie and Ken Thompson, Bell Labs, ❚ “UNIX rose from the ashes of a multi-organizational effort in the early s to develop a dependable.
Successfully reported this slideshow. We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads.
You can change your ad preferences anytime. Upcoming SlideShare. Like this presentation? Why not share! Embed Size px. Start on. Show related SlideShares at end. WordPress Shortcode. Raghunath Nandyala. Raghunath Nandyala Problem Google as a web search engine, it requires to store hundreds of terabytes of web pages across thousands of disks on over thousands of machines and it is consistently accessed by hundreds of client machines. It is difficult to achieve this with traditional distributed platforms.
Traditional approaches like Mainframes etc will become very expensive. And also this problem requires highly scalable, fault tolerant features. The current and anticipated application workloads require a radical distributed file system design that meets the storage needs. This paper opens a window for innovative approaches to store and process data in high scale. Google's success story is linked with their solution to implement their core file system.
GFS is the backbone system for most of their projects. In Google has 10 Exabyte storage on disks Munroe, With this knowledge from Google, technology industry started thinking about BigData Solution and there are many open sources platforms evolved from community. Hadoop is an open source platform which evolved on basis of GFS.
As it is open source project, it is helping many industries to develop applications, analysis tools, services which are handling huge amount of data sets. To build such architectures, this paper is an important frame of reference. In-memory metadata, commodity hardware, fault tolerance, mutations, operation log, scalability, chunk server, heartbeat message, record append, garbage collection, lease, snapshot.
Design Implementations Overview: Google team implemented GFS by observing their application workloads. Inexpensive hardware enabled them to expand the file system with minimal cost. Large number of files distributed across the machines, in which data is been appended to the end of the file. Files are organized hierarchically in directories and identified by pathnames. They support the simple operations to create, delete, open, close, read and write files.
Along with that, there are few special file operations Snapshot and Record Append. Following are the few intriguing points on GFS architecture. This architecture involves three major machine types.
All the servers distributed across. Master, Chunk servers are the holders of the distributed files system. Files are stored in chunks of maximum size 64MB across the chunk servers. GFS maintains a central operation log, which contains a historical record of critical Meta data changes.
Master machine is a single machine which holds complete file system metadata information in-memory. Snapshot creates a copy of a file or a directory tree at low cost. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual clients append.
It is useful for implementing multi-way merge results and producer-consumer queues that many clients can simultaneously append .
Snapshot and record append are discussed further in Sections 3. GFS is a distributed system to be run on clusters. Whereas the master is primarily in charge of managing and monitoring the cluster, all data is stored on the slave machines, which are referred to as chunkservers, as shown in Figure 1. Fig 1: While the exact replication algorithms are not fully documented. This way, the risk of losing data in the event of a failure of an entire rack or even sub-network is mitigated .
To implement such distribution in an efficient manner, GFS employs a number of concepts. First and foremost, files are divided into chunks, each having a fixed size of 64 MB. A file always consists of at least one chunk, although chunks are allowed to contain a smaller payload than 64 MB.
New chunks are automatically allocated in the case of the file growing. Chunks are not only the unit of management and their role can thus be roughly compared to blocks in ordinary file systems, they are also the unit of distribution. Although client code deals with files, files are merely an abstraction provided by GFS in that a file refers to a sequence of chunks. This abstraction is primarily supported by the master, which manages the mapping between files and chunks as part of its metadata .
This system metadata also includes the namespace, access control information, and the current locations of chunks . Chunkservers in turn exclusively deal with chunks, which are identified by unique numbers. Based on this separation between files and chunks, GFS gains the flexibility of implementing file replication solely on the basis of replicating chunks. As the master server holds the metadata and manages file distribution, it is involved whenever chunks are to be read, modified or deleted.
Also, the metadata managed by the master has to contain information about each individual chunk. The size of a chunk and thus the total number of chunks is thus a key Fig 2: Choosing 64 MB as chunk size can be considered a trade-off between trying to limit resource usage and master interactions on the one hand and accepting an increased degree of internal fragmentation on the other hand. In order to safeguard against disk corruption, chunkservers have to verify the integrity of data before it is being delivered to a client by using checksums .
GFS is implemented as user mode components running on the Linux operating system. As such, it exclusively aims at providing a distributed file system while leaving the task of managing disks i. Based on this separation, chunkservers store each chunk as a file in the Linux file system. However, GFS minimizes its involvement in reads and writes so that it does not become a bottleneck.
Clients never read and write file data through the master. Instead, a client asks the master which chunkservers it should contact. It caches this information for a limited time and interacts with the chunkservers directly for many subsequent operations .
Consider the interactions for a simple read with reference to Figure 1. First, using the fixed chunks size, the client translates the file name and byte offset specified by the application into a chunk index within the file.
Then, it sends the master a request containing the file name and chunk index. The master replies with the corresponding chunk handle and locations of the replicas. The client caches this information using the file name and chunk index as the key. The client then sends a request to one of the replicas, most likely the closest one. The request specifies the chunk handle and a byte range within that chunk. Further reads of the same chunk require no more client-master interaction until the cached information expires or the file is reopened.
In fact, the client typically asks for multiple chunks in the same request and the master can also include the information for chunks immediately following those requested. This extra information sidesteps several future client-master interactions at practically no extra cost. Each chunk replica is stored as a plain Linux file on a chunkservers and is extended only as needed.
Lazy space allocation avoids wasting space due to internal fragmentation, perhaps the greatest objection against such a large chunk size. A large chunk size offers several important advantages. First, it reduces clients need to interact with the master because reads and writes on the same chunk require only one initial request to the master for chunk location information. The reduction is especially significant for GFSs workloads because applications mostly read and write large files sequentially.
Even for small random reads, the client can comfortably cache all the chunk location information for a multi-TB working set. Second, since on a large chunk, a client is more likely to perform many operations on a given chunk, it can reduce network overhead by keeping a persistent TCP connection to the chunkserver over an extended period of time. Third, it reduces the size of the metadata stored on the master .
On the other hand, a large chunk size, even with lazy space allocation, has its disadvantages. A small file consists of a small number of chunks, perhaps just one. The chunkservers storing those chunks may become hot spots if many clients are accessing the same file .
All metadata is kept in the masters memory. The first two types namespaces and file-to-chunk mapping are also kept persistent by logging mutations to an operation log stored on the masters local disk and replicated on remote machines.
Using a log allows us to update the master state simply, reliably, and without risking inconsistencies in the event of a master crash. The master does not store chunk location information persistently. Instead, it asks each chunkserver about its chunks at master startup and whenever a chunkserver joins the cluster.
Furthermore, it is easy and efficient for the master to periodically scan through its entire state in the background.
This periodic scanning is used to implement chunk garbage collection, re- replication in the presence of chunkserver failures, and chunk migration to balance load and disk space usage across chunkservers. One potential concern for this memory-only approach is that the number of chunks and hence the capacity of the whole system is limited by how much memory the master has.
This is not a serious limitation in practice. The master maintains less than 64 bytes of metadata for each 64 MB chunk. Most chunks are full because most files contain many chunks, only the last of which may be partially filled. Similarly, the file namespace data typically requires less than 64 bytes per file because it stores file names compactly using prefix compression.
If necessary to support even larger file systems, the cost of adding extra memory to the master is a small price to pay for the simplicity, reliability, performance, and flexibility that GFS gains by storing the metadata in memory .
It simply polls chunkservers for that information at startup. The master can keep itself up-to-date thereafter because it controls all chunk placement and monitors chunkserver status with regular HeartBeat messages . The operation log contains a historical record of critical metadata changes .
It is central to GFS. Not only is it the only persistent record of metadata, but it also serves as a logical time line that defines the order of concurrent operations. Files and chunks, as well as their versions, are all uniquely and eternally identified by the logical times at which they were created . If the master should fail its operation can be recovered by a back-up master which can simply replay the log to get to the same state.
However, this can be very slow, especially if the cluster has been alive for a long time and the log is very long. To help 10 with this issue, the masters state is periodically serialized to disk and then replicated so that on recovery a master may load the checkpoint into memory, replay any subsequent operations in the log, and be available again very quickly.
All metadata is held by the master in main memory this avoids latency problems caused by disk writes, as well as making scanning the entire chunk space e. Check pointing the master state is distinct from the second unusual operation that GFS supports, snapshot .
The master recovers its file system state by replaying the operation log. To minimize startup time, GFS keeps the log small . To limit the size of the log and thus also the time required to replay the log, snapshots of the metadata are taken periodically and written to disk.
After a crash, the latest snapshot is applied and the operation log is replayed, which as all modifications since the last snapshot have been logged before having being applied to the in-memory structures yields the same state as existed before the crash . In fact, if a record append operation succeeds on all but one replica and is then successfully retried, the chunks on all servers where the operation has succeeded initially will now contain a duplicate record.
Similarly, the chunk on the server that initially was unable to perform the modification now contains one record that has to be considered garbage, followed by a successfully written new record, as illustrated by record B in Figure 3 Fig 3: In order not to have these effects influence the correctness of the results generated by applications, GFS specifies a special, relaxed consistency model that supports Googles highly distributed applications well but remains relatively simple and efficient to 11 implement.
GFS classifies a file region, i. A region is consistent if all clients see the same data. A region is defined with respect to a change if it is consistent and all clients see the change in its entirety. A region is inconsistent if it is not consistent. Based on this classification, the situation discussed above yields the first record of record B in an inconsistent state, whereas the second record is considered defined.
Consistent but not defined regions can occur as a result of concurrent successful modifications on overlapping parts of a file. As a consequence, GFS requires clients to correctly cope with file regions being in any of these three states. One of the mechanisms clients can employ to attain this is to include a unique identifier in each record, so that duplicates can be identified easily. Furthermore, records can be written in a format allowing proper self-validation.
The relaxed nature of the consistency model used by GFS and the requirement that client code has to cooperate emphasizes the fact that GFS is indeed a highly specialized file system neither intended nor immediately applicable for general use outside Google. With that background, how the client, master, and chunkservers interact to implement data mutations, atomic record append, and snapshot are described as under .
Each mutation is performed at all the chunks replicas. To maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas, called primary. The primary chooses serial order for all mutations to the chunk.
All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary . Particularly, Use of leases to maintain consistent mutation order . A lease has an initial timeout of 60 seconds, thus a leas can be considered as a lock that has an expiration time.
However, as long as the chunk is being mutated, the primary can request and typically receive extensions from the master indefinitely. These extension requests and grants are piggybacked on the HeartBeat messages regularly exchanged between the master and all chunkservers. The master may sometimes try to revoke a lease before it expires e.
Even if the master loses communication with a primary, it can safely grant a new lease to another replica after the old lease expires . Figure 4, illustrates the process of control flow of a write through these numbered steps. The client asks the master which chunkserver holds the current lease for the chunk and the locations of the other replicas.
If no one has a lease, the master grants one to a replica it chooses not shown. The master replies with the identity of the primary and the locations of the other secondary replicas. The client caches this data for future mutations. It needs to 13 Fig 4: The client pushes the data to all the replicas. Data is pushed linearly along a chain of chunkservers in a pipelined fashion.
A Primary Sec. Once a chunkserver receives some data, it starts forwarding immediately 4.
Once all the replicas have acknowledged receiving the data, the client sends a write request to the primary. The request identifies the data pushed earlier to all of the replicas. The primary assigns consecutive serial numbers to all the mutations it receives, possibly from multiple clients, which provides the necessary serialization. It applies the mutation to its own local state in serial number order. The primary forwards the write request to all secondary replicas.
Each secondary replica applies mutations in the same serial number order assigned by the primary. The secondaries all reply to the primary indicating that they have completed the operation. The primary replies to the client.
Any errors encountered at any of the replicas are reported to the client. In case of errors, the write may have succeeded at the primary and an arbitrary subset of the secondary replicas. If it had failed at the primary, it would not have been assigned a serial number and forwarded.
The client request is considered to have failed, and the modified region is left in an inconsistent state. GFSs client code handles such errors by retrying the failed mutation. It will make a few attempts at steps 3 through 7 before falling back 14 to a retry from the beginning of the write. If a write by the application is large or straddles a chunk boundary, GFS client code breaks it down into multiple write operations. They all follow the control flow described above but may be interleaved with and overwritten by concurrent operations from other clients.
Therefore, the shared file region may end up containing fragments from different clients, although the replicas will be identical because the individual operations are completed successfully in the same order on all replicas.