Spanner: Google’s Globally-Distributed Database
Spanner is Google’s globally distributed, highly available database, offering users an extensive array of features. These include externally-consistent distributed transactions (the strictest consistency property for transaction-processing systems), versioning for data values, strong fault tolerance guarantees, and flexibility for users regarding data placement and replication. Users interact with Spanner through a SQL-based interface.
Consistency Model:
Spanner’s consistency model, referred to as external consistency by the authors, provides the strictest concurrency-control guarantee for transactions. It implies serializability, linearizability, strong consistency, and more. Despite being a distributed system running on machines worldwide, Spanner executes transactions and serves reads as if all operations were performed on a single machine. Achieving this level of consistency without sacrificing performance or latency is challenging, and it is what distinguishes Spanner. These guarantees also enable consistent backups and consistent operations over the data, such as MapReduce executions.
Let's examine some consistency properties and how external consistency differs from them:
Consistency Definitions:
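Briefly: serializability means the outcome of concurrent transactions is equivalent to some serial, one-at-a-time execution of those transactions, but it says nothing about how that order relates to real time. Linearizability means each individual operation appears to take effect atomically at some instant between its invocation and its response, respecting real-time order. External consistency (also called strict serializability) combines the two for transactions: the equivalent serial order must also respect real time, so if transaction T2 starts after transaction T1 commits, T2 is ordered after T1.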
Under external consistency, the system behaves as if all transactions were executed sequentially, maintaining semantic indistinguishability from a single-machine database. Spanner achieves these guarantees through the use of TrueTime, as discussed in upcoming sections.
Implementation
High-Level System Design:
Spanner, a distributed system operating on thousands of machines across data centers, organizes its deployment into universes (e.g., production and test universes). Each universe consists of zones, which are units of physical isolation: different data centers, or different sets of servers within a data center (e.g., different racks). Data is replicated across zones to enhance availability. For instance, with a replication factor of 3, the data has copies in 3 zones, so it can still be retrieved if one zone or server fails.
Each zone has a zonemaster, which assigns data to spanservers. When a client requests data, it contacts a per-zone location proxy to find the spanserver holding that data; the spanserver then serves the data to the client.
The universe includes a universe master and placement driver. The universe master displays zone status information for debugging, while the placement driver ensures proper data replication and optimal placement, communicating with spanservers and making decisions based on replication constraints or system load balancing.
SpanServer Software Stack:
Spanservers store data in tablets, where each tablet implements a mapping of the form (key: string, timestamp: int64) → string, i.e., keys map to values at particular timestamps. Tablets are replicated on machines in different zones, using the user-defined or default replication factor. To handle replication, Spanner uses Paxos, a consensus algorithm, to keep tablet replicas consistent across machines. Each tablet has its own Paxos state machine, with a Paxos leader accepting writes and a set of replicas forming the Paxos group.
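To make the mapping concrete, here is a minimal Python sketch of the tablet abstraction described above; the Tablet class and its methods are purely illustrative and assume an in-memory dictionary rather than Spanner's actual storage.

```python
# Illustrative sketch of the tablet mapping described above:
# (key: string, timestamp: int64) -> value: string.
# Names and structure are hypothetical, not Spanner's actual code.

class Tablet:
    def __init__(self):
        # {key: {timestamp: value}} keeps every version of each key.
        self._versions = {}

    def write(self, key: str, timestamp: int, value: str) -> None:
        self._versions.setdefault(key, {})[timestamp] = value

    def read(self, key: str, timestamp: int) -> str | None:
        # Return the newest version whose timestamp is <= the read timestamp.
        versions = self._versions.get(key, {})
        older = [ts for ts in versions if ts <= timestamp]
        return versions[max(older)] if older else None
```

A snapshot read at timestamp t then simply returns the newest version no later than t, which is what makes the lock-free reads discussed later possible.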
Each spanserver manages thousands of tablets, with some acting as leader nodes and others as replicas, depending on the tablet.
Spanner introduces a bucketing abstraction called a directory, representing a group of keys with a common prefix. Directories serve as a unit of data placement, ensuring keys of a specific bucket are stored close to each other. A tablet may contain multiple directories, allowing Spanner to move directories efficiently between machines or Paxos groups.
To synchronize and handle concurrency properly, each spanserver implements a lock table mapping between key ranges and lock states for operations requiring synchronization, such as transactional reads.
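As a rough illustration of the lock table idea, the sketch below maps key ranges to a lock state; the class names and the simple shared/exclusive model are assumptions made for illustration (Spanner's real lock table is richer, using wound-wait for deadlock avoidance, for example).

```python
# Hypothetical sketch of a lock table: key ranges mapped to lock state.

from dataclasses import dataclass, field

@dataclass
class LockState:
    readers: set = field(default_factory=set)   # transaction ids holding shared locks
    writer: str | None = None                   # transaction id holding the exclusive lock

class LockTable:
    def __init__(self):
        # (start_key, end_key) -> LockState
        self._locks: dict[tuple[str, str], LockState] = {}

    def try_lock_shared(self, key_range: tuple[str, str], txn_id: str) -> bool:
        state = self._locks.setdefault(key_range, LockState())
        if state.writer is None or state.writer == txn_id:
            state.readers.add(txn_id)
            return True
        return False

    def try_lock_exclusive(self, key_range: tuple[str, str], txn_id: str) -> bool:
        state = self._locks.setdefault(key_range, LockState())
        if state.writer is None and not state.readers:
            state.writer = txn_id
            return True
        return state.writer == txn_id
```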
Transactions across multiple tablets involve multiple Paxos groups, necessitating coordination between them. Spanner addresses this using a transaction manager. One participant group is chosen as the coordinator, with its Paxos leader serving as the transaction coordinator.
Data Model
Spanner provides users with a data model based on schematized semi-relational tables, queryable using a SQL-like query language, and supports transactions. Applications can create databases within the Spanner universe (production/test/dev, etc.), each containing multiple schematized tables.
Spanner tables are similar to tables in a relational database, with the added feature of being versioned. Every table must have one or more primary-key columns, which map to the values of the other columns in the row, similar to a key → value store. To control locality, the INTERLEAVE IN declaration lets users define parent-child relationships between tables, helping Spanner achieve better placement. The table at the top of a hierarchy is called the directory table: each of its rows, together with the rows it references in descendant tables, forms a directory. Operations on a directory-table row can be propagated to the interleaved tables (for example, cascading deletes).
TrueTime
To support globally consistent transactions, Spanner relies on an API called TrueTime. When now() is called, TrueTime returns an interval [earliest, latest] and guarantees that the absolute time at which the call was made lies within that interval. Spanner builds on this bounded uncertainty to assign timestamps that increase monotonically across all servers. In the next section, we will explore how this helps guarantee consistency.
To implement TrueTime, Google deploys a set of time masters with GPS receivers and a smaller set with atomic clocks. Masters regularly compare their time references against each other, cross-check against their local clocks, and evict themselves if they diverge too far. Per-machine daemons poll a variety of masters, applying Marzullo's algorithm to detect and reject liars and to synchronize the local machine clock. Each daemon advertises an uncertainty based on worst-case local clock drift, master uncertainty, and communication delay, yielding the interval of the earliest and latest possible time.
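The paper exposes TrueTime as a small API: TT.now() returning an interval, plus TT.after(t) and TT.before(t) helpers built on it. Below is a hedged Python sketch of that interface; the fixed EPSILON bound is an assumption standing in for the uncertainty the daemons actually derive from the masters.

```python
import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    """Hedged sketch of the TrueTime interface from the paper.
    The real implementation derives the uncertainty from GPS/atomic-clock masters;
    here it is just a fixed, assumed bound."""

    EPSILON = 0.007  # assumed worst-case uncertainty in seconds (~7 ms)

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(earliest=t - self.EPSILON, latest=t + self.EPSILON)

    def after(self, t: float) -> bool:
        # True only when t is definitely in the past.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only when t is definitely in the future.
        return self.now().latest < t
```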
Concurrency Control
As seen previously, Spanner guarantees externally consistent transactions. It also supports lock-free read-only transactions and non-blocking reads in the past, ensuring that any read at timestamp t will see the effect of every transaction that has committed by timestamp t. In this section, we will explore how Spanner supports different transaction types through the use of the TrueTime API.
To understand how transactions work in Spanner, we need to differentiate between the overall transaction writes (& timestamps) and those of the Paxos state machine running a transaction. For example, if the Paxos state machine has written/committed a transaction, this doesn’t mean the transaction itself has committed yet. This is important to keep in mind to understand the upcoming sections.
Read-Write Transactions:
Read-write transactions update the state of the database and therefore require more pessimistic concurrency control, specifically two-phase locking, to prevent another operation from reading data being updated within the transaction. Spanner guarantees that if the start of transaction T2 happens after the commit of T1, then the commit timestamp of T2 is greater than that of T1.
Spanner uses the Paxos state machine to run the transactions; all read-write transactions have to go through the Paxos leader. Whenever the Paxos leader performs a write operation, it assigns a monotonically increasing timestamp to it, so every write that happens after another receives a greater timestamp. This is trivial while the leader stays the same; if the leader changes, however, Spanner must still ensure that new writes receive higher timestamps. It does so by tracking a value smax at the leader, the maximum timestamp the leader has assigned. A new leader cannot start assigning timestamps until smax is guaranteed to have passed. Recalling that TrueTime's now() returns an interval [earliest, latest]: as long as that interval could still contain smax, a new leader cannot take over, since doing so might assign a timestamp that conflicts with the old leader's transactions. Keep in mind that these are the Paxos write timestamps, not the transaction commit timestamps, which we’ll look at next.
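Here is a hedged sketch of the two behaviours just described, reusing the TrueTime sketch from the previous section: a leader never assigns a timestamp below its smax, and a prospective new leader waits until the old smax has definitely passed. All names are illustrative.

```python
import time  # for the polling sleep; TrueTime is the sketch class defined earlier

class PaxosLeader:
    def __init__(self, tt: TrueTime):
        self.tt = tt
        self.s_max = 0.0  # highest timestamp this leader has handed out

    def assign_write_timestamp(self) -> float:
        # Monotonic within a leader: never go below s_max.
        ts = max(self.tt.now().latest, self.s_max)
        self.s_max = ts
        return ts

def wait_to_become_leader(tt: TrueTime, old_s_max: float) -> None:
    # A new leader may not assign timestamps until the old leader's s_max
    # is guaranteed to be in the past, otherwise timestamps could conflict.
    while not tt.after(old_s_max):
        time.sleep(0.001)
```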
Proof for External Consistency:
Using the TrueTime API and some simple rules to prevent transactions' timestamp intervals from overlapping, Spanner models transactions as if they were running on a single machine. Each transaction takes place at its own distinct timestamp, and TrueTime's guarantees ensure that a transaction's timestamp has genuinely elapsed before the next transaction starts, producing a clear, non-overlapping temporal sequence. Let's look at how Spanner does this:
To ensure external consistency (if the start of a transaction T2 occurs after the commit of a transaction T1, then the commit timestamp of T2 must be greater than the commit timestamp of T1), Spanner enforces two rules:
Start Rule
The transaction coordinator leader assigns the transaction a timestamp of at least TrueTime.now().latest, so the chosen timestamp cannot be earlier than any instant of real time that may have already passed. This ensures that the timestamp chosen for a transaction does not overlap with the uncertainty intervals of older transactions, which could otherwise create ambiguity.
Example:
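As a hedged illustration with made-up numbers: suppose that when the coordinator picks a timestamp for a transaction, TrueTime.now() returns [8, 12]. The start rule forces the commit timestamp to be at least 12, the latest bound, so the timestamp cannot fall at or before any moment of real time that has possibly already occurred.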
Commit Wait Rule
This rule ensures clients cannot see any data committed by a transaction until the current time (as reported by TrueTime.now()) is guaranteed to be after the transaction's commit timestamp.
If the transaction commit timestamp is T1, commit wait delays exposing the result until the interval returned by TrueTime.now() has an earliest bound greater than T1. This guarantees that the commit timestamp is in the past on every machine and removes any ambiguity.
Example:
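Continuing the made-up numbers: suppose T1 is assigned commit timestamp 12. Commit wait means the coordinator does not acknowledge the commit until TrueTime.now().earliest is greater than 12, i.e., until every machine agrees that time 12 has definitely passed. Any transaction T2 that starts after the client observes T1's commit therefore begins, in absolute time, after 12, and the start rule then assigns it a timestamp greater than 12, preserving the ordering.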
Running a Read-Write Transaction:
During a Spanner Read-Write transaction, the client buffers writes until it’s time to commit them. The client issues reads to the leader of the Paxos group it’s trying to read from (note a transaction might read from multiple Paxos groups, so multiple leaders may be contacted).
After finishing all the reads, the client chooses a coordinator group (as we've seen, in a transaction spanning multiple Paxos groups each group has a participant leader, and one of those leaders is chosen as the coordinator for the transaction). The client sends a commit message to each participant leader, along with the identity of the chosen coordinator.

Each non-coordinator participant leader acquires the necessary locks, chooses a prepare timestamp larger than any timestamp it has assigned to previous transactions, logs a prepare record, and sends its timestamp to the coordinator. The coordinator also acquires write locks. Once it has received all prepare timestamps from the participants, it chooses a commit timestamp larger than all of them and larger than TT.now().latest (to maintain monotonicity and satisfy the start rule), then logs a commit record through its Paxos state machine.

To satisfy the commit wait rule, the coordinator cannot reply to the client until TrueTime guarantees the commit timestamp has passed (TT.now().earliest > commit timestamp). Once commit wait has elapsed, the coordinator replies to the client and sends the commit to all other participant leaders, which log the commit at the same timestamp as the coordinator and release their locks.
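To tie these steps together, here is a hedged sketch of the coordinator's commit path, again reusing the TrueTime sketch from earlier; Paxos logging, RPCs, and lock handling are reduced to placeholder functions, so this shows only the shape of the protocol, not Spanner's implementation.

```python
import time

def coordinator_commit(tt, participants, prepare_timestamps):
    """Hedged sketch of the coordinator's commit path (not Spanner's real code).
    `participants` are the non-coordinator participant leaders;
    `prepare_timestamps` are the timestamps they chose for their prepare records."""
    # Start rule: commit timestamp must be at least every prepare timestamp
    # and at least TT.now().latest at the time of choosing.
    commit_ts = max(prepare_timestamps + [tt.now().latest])

    log_to_paxos("commit", commit_ts)  # coordinator logs the commit record

    # Commit wait: do not expose the commit until commit_ts is definitely in the past.
    while not tt.after(commit_ts):
        time.sleep(0.001)

    reply_to_client(commit_ts)
    for p in participants:
        p.apply_commit(commit_ts)      # participants log the same timestamp, release locks

def log_to_paxos(kind, ts):            # placeholder for a Paxos write
    pass

def reply_to_client(ts):               # placeholder for the client RPC reply
    pass
```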
Read Transactions:
Read-only transactions in Spanner execute as snapshot reads, without requiring locks. They rely on choosing a timestamp that satisfies external consistency (by eliminating ambiguity or overlap with previous transactions) and on the fact that values in Spanner are versioned by timestamp. If you can choose a timestamp that does not overlap with previously committed transactions, you can simply read, from a sufficiently up-to-date replica, the versions whose timestamps are less than or equal to your chosen timestamp.
To serve a read while guaranteeing consistency, Spanner must determine whether a replica is sufficiently up to date. Each replica tracks a value called safe time, the maximum timestamp at which the replica is up to date. If the timestamp of a read is less than or equal to the replica's safe time, the replica can serve the read. The safe time is calculated at the leader of the Paxos group (since it hosts the transaction manager) and communicated to the replicas in that group.
Safe time is calculated as follows:
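The paper describes safe time as the minimum of two bounds: a Paxos bound (the timestamp of the highest-applied Paxos write) and a transaction-manager bound derived from transactions that are prepared but not yet committed. The sketch below restates that rule with invented parameter names.

```python
import math

def safe_time(last_applied_paxos_write_ts: float,
              prepared_but_uncommitted_ts: list[float]) -> float:
    """Hedged sketch of the safe-time rule from the paper.
    - Paxos bound: nothing newer than the last applied Paxos write can appear.
    - Transaction-manager bound: a prepared (but uncommitted) transaction may
      still commit at or after its prepare timestamp, so reads must stay below it."""
    paxos_safe = last_applied_paxos_write_ts
    if prepared_but_uncommitted_ts:
        tm_safe = min(prepared_but_uncommitted_ts) - 1
    else:
        tm_safe = math.inf
    return min(paxos_safe, tm_safe)
```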
As mentioned, a read-only transaction is assigned a timestamp and then executed as a snapshot read at that timestamp, which allows the read to occur at any sufficiently up-to-date replica without locking. To ensure the read's timestamp does not overlap with a previous uncertainty interval (which would cause ambiguity), it must be assigned a value ≥ the commit timestamps of transactions already committed across all groups involved, so that it can see the results of those transactions.
One easy way to select this timestamp is to set it to TT.now().latest, which guarantees it is larger than that of any already-committed transaction. While this simplifies things, it may delay the read if the replica being read from has not yet advanced its safe time past that value (safe time < now().latest), and it may also advance the leader's smax (used for ensuring disjointness, as discussed earlier).
During a read transaction, since the keys can be served by multiple Paxos groups, Spanner must understand the scope of the read (what keys will be read in the transaction). There are two cases Spanner handles:
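The paper distinguishes roughly two cases. If all the keys in the read are served by a single Paxos group, the client can issue the read-only transaction directly to that group's leader, which assigns the read timestamp itself; it can often use the timestamp of the last committed write at that group instead of TT.now().latest, avoiding unnecessary waiting. If the read spans multiple Paxos groups, Spanner avoids a round of negotiation among all the leaders and simply executes the read at TT.now().latest, which may require waiting for safe time to advance as described above.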
Schema-Change Transactions:
Spanner again makes use of TrueTime to run schema-change transactions efficiently. Instead of using a standard transaction for a schema change (which would require locking a huge number of groups), Spanner runs schema changes in a largely non-blocking way: it chooses a timestamp t in the future for the change and registers it in the prepare phase. Transactions with timestamps before t continue executing as normal, while any transaction with a timestamp after t must synchronize behind the registered schema change so that it respects its results.
Conclusion
Google's Spanner stands as a pioneering distributed database system, offering a wealth of advanced features and robust capabilities. With externally-consistent distributed transactions, versioned data values, strong fault tolerance, and flexible data placement, Spanner excels in providing a reliable and highly available solution for users. Its implementation, featuring a distributed architecture across multiple machines and data centers, showcases meticulous design considerations such as zones for physical isolation and strategic data replication. The schema, based on schematized semi-relational tables with versioning, allows users to control data locality effectively. The integration of TrueTime ensures global consistency in transactions, contributing to Spanner's unique strengths. Overall, Spanner emerges as a standout solution, successfully balancing the complexities of a distributed system with the demands of scalability, fault tolerance, and stringent consistency requirements.
References: