Spanner: Google’s Globally-Distributed Database
Spanner is Google’s globally distributed, highly available database, offering users an extensive array of features. These include externally-consistent distributed transactions (the strictest consistency property for transaction-processing systems), versioning for data values, strong fault tolerance guarantees, and flexibility for users regarding data placement and replication. Users interact with Spanner through a SQL-based interface.
Consistency Model:
Spanner’s consistency model, referred to as external consistency by the authors, provides the strictest concurrency-control guarantee for transactions. It implies serializability, linearizability, strong consistency, and more. Despite being a distributed system running on machines worldwide, Spanner executes transactions and serves reads as if all operations were performed on a single machine. Achieving this level of consistency without sacrificing performance or latency is challenging, and it is what distinguishes Spanner. These guarantees also enable consistent backups and consistent operations over the data, such as MapReduce executions.
Let's examine some consistency properties and how external consistency differs from them:
Consistency Definitions:
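Briefly: serializability means the outcome of concurrent transactions is equivalent to some serial, one-at-a-time execution of those transactions, but it says nothing about how that order relates to real time. Linearizability means each individual operation appears to take effect atomically at some instant between its invocation and its response, respecting real-time order. External consistency (also called strict serializability) combines the two for transactions: the equivalent serial order must also respect real time, so if transaction T2 starts after transaction T1 commits, T2 is ordered after T1.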
Under external consistency, the system behaves as if all transactions were executed sequentially, maintaining semantic indistinguishability from a single-machine database. Spanner achieves these guarantees through the use of TrueTime, as discussed in upcoming sections.
Implementation
High-Level System Design:
Spanner, a distributed system operating on thousands of machines across data centers, organizes its deployment into universes (e.g., production and test universes). Each universe consists of zones, which are units of physical isolation: different data centers, or different sets of servers within a data center (e.g., different racks). Data is replicated across zones to enhance availability. For instance, with a replication factor of 3, the data has copies in 3 zones, so it can still be retrieved if one zone or server fails.
Each zone has a zonemaster, which assigns data to spanservers. When a client requests data, it contacts a per-zone location proxy to find the spanserver holding that data; the spanserver then serves the data to the client.
The universe includes a universe master and placement driver. The universe master displays zone status information for debugging, while the placement driver ensures proper data replication and optimal placement, communicating with spanservers and making decisions based on replication constraints or system load balancing.
SpanServer Software Stack:
Spanservers store data in tablets, where each tablet implements a mapping of the form (key: string, timestamp: int64) → string, i.e., keys map to values at particular timestamps. Tablets are replicated on machines in different zones, using the user-defined or default replication factor. To handle replication, Spanner uses Paxos, a consensus algorithm, to keep tablet replicas consistent across machines. Each tablet has its own Paxos state machine, with a Paxos leader accepting writes and a set of replicas forming the Paxos group.
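To make the mapping concrete, here is a minimal Python sketch of the tablet abstraction described above; the Tablet class and its methods are purely illustrative and assume an in-memory dictionary rather than Spanner's actual storage.

```python
# Illustrative sketch of the tablet mapping described above:
# (key: string, timestamp: int64) -> value: string.
# Names and structure are hypothetical, not Spanner's actual code.

class Tablet:
    def __init__(self):
        # {key: {timestamp: value}} keeps every version of each key.
        self._versions = {}

    def write(self, key: str, timestamp: int, value: str) -> None:
        self._versions.setdefault(key, {})[timestamp] = value

    def read(self, key: str, timestamp: int) -> str | None:
        # Return the newest version whose timestamp is <= the read timestamp.
        versions = self._versions.get(key, {})
        older = [ts for ts in versions if ts <= timestamp]
        return versions[max(older)] if older else None
```

A snapshot read at timestamp t then simply returns the newest version no later than t, which is what makes the lock-free reads discussed later possible.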
Each spanserver manages thousands of tablets, with some acting as leader nodes and others as replicas, depending on the tablet.
Spanner introduces a bucketing abstraction called a directory, representing a group of keys with a common prefix. Directories serve as a unit of data placement, ensuring keys of a specific bucket are stored close to each other. A tablet may contain multiple directories, allowing Spanner to move directories efficiently between machines or Paxos groups.
To synchronize and handle concurrency properly, each spanserver implements a lock table mapping between key ranges and lock states for operations requiring synchronization, such as transactional reads.
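As a rough illustration of the lock table idea, the sketch below maps key ranges to a lock state; the class names and the simple shared/exclusive model are assumptions made for illustration (Spanner's real lock table is richer, using wound-wait for deadlock avoidance, for example).

```python
# Hypothetical sketch of a lock table: key ranges mapped to lock state.

from dataclasses import dataclass, field

@dataclass
class LockState:
    readers: set = field(default_factory=set)   # transaction ids holding shared locks
    writer: str | None = None                   # transaction id holding the exclusive lock

class LockTable:
    def __init__(self):
        # (start_key, end_key) -> LockState
        self._locks: dict[tuple[str, str], LockState] = {}

    def try_lock_shared(self, key_range: tuple[str, str], txn_id: str) -> bool:
        state = self._locks.setdefault(key_range, LockState())
        if state.writer is None or state.writer == txn_id:
            state.readers.add(txn_id)
            return True
        return False

    def try_lock_exclusive(self, key_range: tuple[str, str], txn_id: str) -> bool:
        state = self._locks.setdefault(key_range, LockState())
        if state.writer is None and not state.readers:
            state.writer = txn_id
            return True
        return state.writer == txn_id
```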
Transactions across multiple tablets involve multiple Paxos groups, necessitating coordination between them. Spanner addresses this using a transaction manager. One participant group is chosen as the coordinator, with its Paxos leader serving as the transaction coordinator.
Data Model
Spanner provides users with a data model based on schematized semi-relational tables, queryable using a SQL-like query language, and supports transactions. Applications can create databases within the Spanner universe (production/test/dev, etc.), each containing multiple schematized tables.
Spanner tables are similar to tables in a relational database, with the added feature of being versioned. Every table must have one or more primary-key columns, which map to the values of the other columns in the row, similar to a key → value store. To control locality, the INTERLEAVE IN declaration lets users define parent-child relationships between tables, helping Spanner achieve better placement. The table at the top of a hierarchy is called the directory table: each of its rows, together with the rows it references in descendant tables, forms a directory. Operations on a directory-table row can be propagated to the interleaved tables (for example, cascading deletes).
TrueTime
To support globally consistent transactions, Spanner relies on an API called TrueTime. When now() is called, TrueTime returns an interval [earliest, latest] and guarantees that the absolute time at which the call was made lies within that interval. Spanner builds on this bounded uncertainty to assign timestamps that increase monotonically across all servers. In the next section, we will explore how this helps guarantee consistency.
To implement TrueTime, Google deploys a set of time masters with GPS receivers and a smaller set with atomic clocks. Masters regularly compare their time references against each other, cross-check against their local clocks, and evict themselves if they diverge too far. Per-machine daemons poll a variety of masters, applying Marzullo's algorithm to detect and reject liars and to synchronize the local machine clock. Each daemon advertises an uncertainty based on worst-case local clock drift, master uncertainty, and communication delay, yielding the interval of the earliest and latest possible time.
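The paper exposes TrueTime as a small API: TT.now() returning an interval, plus TT.after(t) and TT.before(t) helpers built on it. Below is a hedged Python sketch of that interface; the fixed EPSILON bound is an assumption standing in for the uncertainty the daemons actually derive from the masters.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Hedged sketch of the TrueTime interface from the paper.
    The real implementation derives the uncertainty from GPS/atomic-clock masters;
    here it is just a fixed, assumed bound."""

    EPSILON = 0.007  # assumed worst-case uncertainty in seconds (~7 ms)

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(earliest=t - self.EPSILON, latest=t + self.EPSILON)

    def after(self, t: float) -> bool:
        # True only when t is definitely in the past.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only when t is definitely in the future.
        return self.now().latest < t
```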
Concurrency Control
As seen previously, Spanner guarantees externally consistent transactions. It also supports lock-free read-only transactions and non-blocking reads in the past, ensuring that any read at timestamp t will see the effect of every transaction that has committed by timestamp t. In this section, we will explore how Spanner supports different transaction types through the use of the TrueTime API.
To understand how transactions work in Spanner, we need to differentiate between the overall transaction writes (& timestamps) and those of the Paxos state machine running a transaction. For example, if the Paxos state machine has written/committed a transaction, this doesn’t mean the transaction itself has committed yet. This is important to keep in mind to understand the upcoming sections.
Read-Write Transactions:
Read-write transactions update the state of the database and therefore require more pessimistic concurrency control, specifically two-phase locking, to prevent another operation from reading data being updated within the transaction. Spanner guarantees that if the start of transaction T2 happens after the commit of T1, then the commit timestamp of T2 is greater than that of T1.
Spanner uses the Paxos state machine to run the transactions; all read-write transactions have to go through the Paxos leader. Whenever the Paxos leader performs a write operation, it assigns a monotonically increasing timestamp to it, so every write that happens after another receives a greater timestamp. This is trivial while the leader stays the same; if the leader changes, however, Spanner must still ensure that new writes receive higher timestamps. It does so by tracking a value smax at the leader, the maximum timestamp the leader has assigned. A new leader cannot start assigning timestamps until smax is guaranteed to have passed. Recalling that TrueTime's now() returns an interval [earliest, latest]: as long as that interval could still contain smax, a new leader cannot take over, since doing so might assign a timestamp that conflicts with the old leader's transactions. Keep in mind that these are the Paxos write timestamps, not the transaction commit timestamps, which we’ll look at next.
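Here is a hedged sketch of the two behaviours just described, reusing the TrueTime sketch from the previous section: a leader never assigns a timestamp below its smax, and a prospective new leader waits until the old smax has definitely passed. All names are illustrative.

```python
import time  # for the polling sleep; TrueTime is the sketch class defined earlier

class PaxosLeader:
    def __init__(self, tt: TrueTime):
        self.tt = tt
        self.s_max = 0.0  # highest timestamp this leader has handed out

    def assign_write_timestamp(self) -> float:
        # Monotonic within a leader: never go below s_max.
        ts = max(self.tt.now().latest, self.s_max)
        self.s_max = ts
        return ts

def wait_to_become_leader(tt: TrueTime, old_s_max: float) -> None:
    # A new leader may not assign timestamps until the old leader's s_max
    # is guaranteed to be in the past, otherwise timestamps could conflict.
    while not tt.after(old_s_max):
        time.sleep(0.001)
```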
Proof for External Consistency:
Using the TrueTime API and some simple rules to prevent transactions' timestamp intervals from overlapping, Spanner models transactions as if they were running on a single machine. Each transaction takes place at its own distinct timestamp, and TrueTime's guarantees ensure that a transaction's timestamp has genuinely elapsed before the next transaction starts, producing a clear, non-overlapping temporal sequence. Let's look at how Spanner does this:
To ensure external consistency (if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1), Spanner enforces two rules:
Start Rule
The transaction coordinator leader assigns the transaction a timestamp of at least TrueTime.now().latest, so the chosen timestamp cannot be earlier than any instant of real time that may have already passed. This ensures that the timestamp chosen for a transaction does not overlap with the uncertainty intervals of older transactions, which could otherwise create ambiguity.
Example:
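As a hedged illustration with made-up numbers: suppose that when the coordinator picks a timestamp for a transaction, TrueTime.now() returns [8, 12]. The start rule forces the commit timestamp to be at least 12, the latest bound, so the timestamp cannot fall at or before any moment of real time that has possibly already occurred.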
Commit Wait Rule
This rule ensures clients cannot see any data committed by a transaction until the current time (as reported by TrueTime.now()) is guaranteed to be after the transaction's commit timestamp.
If the transaction commit timestamp is T1, commit wait delays exposing the result until the interval returned by TrueTime.now() has an earliest bound greater than T1. This guarantees that the commit timestamp is in the past on every machine and removes any ambiguity.
Example:
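Continuing the made-up numbers: suppose T1 is assigned commit timestamp 12. Commit wait means the coordinator does not acknowledge the commit until TrueTime.now().earliest is greater than 12, i.e., until every machine agrees that time 12 has definitely passed. Any transaction T2 that starts after the client observes T1's commit therefore begins, in absolute time, after 12, and the start rule then assigns it a timestamp greater than 12, preserving the ordering.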
Running a Read-Write Transaction:
During a Spanner Read-Write transaction, the client buffers writes until it’s time to commit them. The client issues reads to the leader of the Paxos group it’s trying to read from (note a transaction might read from multiple Paxos groups, so multiple leaders may be contacted).
After finishing all the reads, the client chooses a coordinator group (as we've seen, in a transaction spanning multiple Paxos groups each group has a participant leader, and one of those leaders is chosen as the coordinator for the transaction). The client sends a commit message to each participant leader, along with the identity of the chosen coordinator.

Each non-coordinator participant leader acquires the necessary locks, chooses a prepare timestamp larger than any timestamp it has assigned to previous transactions, logs a prepare record, and sends its timestamp to the coordinator. The coordinator also acquires write locks. Once it has received all prepare timestamps from the participants, it chooses a commit timestamp larger than all of them and larger than TT.now().latest (to maintain monotonicity and satisfy the start rule), then logs a commit record through its Paxos state machine.

To satisfy the commit wait rule, the coordinator cannot reply to the client until TrueTime guarantees the commit timestamp has passed (TT.now().earliest > commit timestamp). Once commit wait has elapsed, the coordinator replies to the client and sends the commit to all other participant leaders, which log the commit at the same timestamp as the coordinator and release their locks.
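To tie these steps together, here is a hedged sketch of the coordinator's commit path, again reusing the TrueTime sketch from earlier; Paxos logging, RPCs, and lock handling are reduced to placeholder functions, so this shows only the shape of the protocol, not Spanner's implementation.

```python
import time

def coordinator_commit(tt, participants, prepare_timestamps):
    """Hedged sketch of the coordinator's commit path (not Spanner's real code).
    `participants` are the non-coordinator participant leaders;
    `prepare_timestamps` are the timestamps they chose for their prepare records."""
    # Start rule: commit timestamp must be at least every prepare timestamp
    # and at least TT.now().latest at the time of choosing.
    commit_ts = max(prepare_timestamps + [tt.now().latest])

    log_to_paxos("commit", commit_ts)  # coordinator logs the commit record

    # Commit wait: do not expose the commit until commit_ts is definitely in the past.
    while not tt.after(commit_ts):
        time.sleep(0.001)

    reply_to_client(commit_ts)
    for p in participants:
        p.apply_commit(commit_ts)      # participants log the same timestamp, release locks

def log_to_paxos(kind, ts):            # placeholder for a Paxos write
    pass

def reply_to_client(ts):               # placeholder for the client RPC reply
    pass
```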
Read Transactions:
Read-only transactions in Spanner execute as snapshot reads, without requiring locks. They rely on choosing a timestamp that satisfies external consistency (by eliminating ambiguity or overlap with previous transactions) and on the fact that values in Spanner are versioned by timestamp. If you can choose a timestamp that does not overlap with previously committed transactions, you can simply read, from a sufficiently up-to-date replica, the versions whose timestamps are less than or equal to your chosen timestamp.
To serve a read while guaranteeing consistency, Spanner must determine whether a replica is sufficiently up to date. Each replica tracks a value called safe time, the maximum timestamp at which the replica is up to date. If the timestamp of a read is less than or equal to the replica's safe time, the replica can serve the read. The safe time is calculated at the leader of the Paxos group (since it hosts the transaction manager) and communicated to the replicas in that group.
Safe time is calculated as follows:
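The paper describes safe time as the minimum of two bounds: a Paxos bound (the timestamp of the highest-applied Paxos write) and a transaction-manager bound derived from transactions that are prepared but not yet committed. The sketch below restates that rule with invented parameter names.

```python
import math

def safe_time(last_applied_paxos_write_ts: float,
              prepared_but_uncommitted_ts: list[float]) -> float:
    """Hedged sketch of the safe-time rule from the paper.
    - Paxos bound: nothing newer than the last applied Paxos write can appear.
    - Transaction-manager bound: a prepared (but uncommitted) transaction may
      still commit at or after its prepare timestamp, so reads must stay below it."""
    paxos_safe = last_applied_paxos_write_ts
    if prepared_but_uncommitted_ts:
        tm_safe = min(prepared_but_uncommitted_ts) - 1
    else:
        tm_safe = math.inf
    return min(paxos_safe, tm_safe)
```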
As mentioned, a read-only transaction is assigned a timestamp and then executed as a snapshot read at that timestamp, which allows the read to occur at any sufficiently up-to-date replica without locking. To ensure the read's timestamp does not overlap with a previous uncertainty interval (which would cause ambiguity), it must be assigned a value ≥ the commit timestamps of transactions already committed across all groups involved, so that it can see the results of those transactions.
One easy way to select this timestamp is to set it to TT.now().latest, which guarantees it is larger than that of any already-committed transaction. While this simplifies things, it may delay the read if the replica being read from has not yet advanced its safe time past that value (safe time < now().latest), and it may also advance the leader's smax (used for ensuring disjointness, as discussed earlier).
During a read transaction, since the keys can be served by multiple Paxos groups, Spanner must understand the scope of the read (what keys will be read in the transaction). There are two cases Spanner handles:
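The paper distinguishes roughly two cases. If all the keys in the read are served by a single Paxos group, the client can issue the read-only transaction directly to that group's leader, which assigns the read timestamp itself; it can often use the timestamp of the last committed write at that group instead of TT.now().latest, avoiding unnecessary waiting. If the read spans multiple Paxos groups, Spanner avoids a round of negotiation among all the leaders and simply executes the read at TT.now().latest, which may require waiting for safe time to advance as described above.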
Schema-Change Transactions:
Spanner again makes use of TrueTime to run schema-change transactions efficiently. Instead of using a standard transaction for a schema change (which would require locking a huge number of groups), Spanner runs schema changes in a largely non-blocking way: it chooses a timestamp t in the future for the change and registers it in the prepare phase. Transactions with timestamps before t continue executing as normal, while any transaction with a timestamp after t must synchronize behind the registered schema change so that it respects its results.
Conclusion
Google's Spanner stands as a pioneering distributed database system, offering a wealth of advanced features and robust capabilities. With externally-consistent distributed transactions, versioned data values, strong fault tolerance, and flexible data placement, Spanner excels in providing a reliable and highly available solution for users. Its implementation, featuring a distributed architecture across multiple machines and data centers, showcases meticulous design considerations such as zones for physical isolation and strategic data replication. The schema, based on schematized semi-relational tables with versioning, allows users to control data locality effectively. The integration of TrueTime ensures global consistency in transactions, contributing to Spanner's unique strengths. Overall, Spanner emerges as a standout solution, successfully balancing the complexities of a distributed system with the demands of scalability, fault tolerance, and stringent consistency requirements.
References: