Raft Replication in Oracle Database 23ai: High Availability and Scalability made simple


In today's ever-evolving business landscape, data integrity, fault tolerance, and consistency in distributed systems are paramount. Oracle Database 23ai implements the Raft consensus algorithm for data replication and leader election, ensuring high availability and scalability for mission-critical distributed database deployments. In this article, I discuss the architecture of Raft replication in Oracle Database 23ai and how it provides a simple yet robust solution to the problems that distributed systems (including NoSQL databases) aim to address. The Raft consensus algorithm is described by Ongaro and Ousterhout in the USENIX paper “In Search of an Understandable Consensus Algorithm”. The architecture details provided below are a 1,000-foot overview; the reader is invited to read the original paper for the detailed algorithm.


Introduction to Raft Replication

Raft replication is a built-in Oracle Database 23ai capability that integrates data replication with transaction execution in a distributed Oracle Database. Raft replication enables fast automatic failover with zero data loss; if all shards are in the same data center, sub-second failover is achievable. This capability provides a uniform configuration with no primary or standby shards.


Raft Replication Terminology

The Raft consensus protocol is used to maintain consistency between the replicas in case of failures, network partitioning, message loss, or delay. When Raft replication is enabled, a sharded database contains multiple replication units.

Replication Unit (RU) - A replication unit (RU) is a set of chunks that have the same replication topology. Each RU has multiple replicas placed on different shards.

Replication Factor (RF) - determines the number of participants in a Raft group. This number includes the leader and its followers. An RU needs a majority of its replicas available in order to accept writes; the quorum arithmetic is illustrated in the sketch after the list below.

  • RF = 3: tolerates one replica failure
  • RF = 5: tolerates two replica failures
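
A minimal plain-Java sketch of that quorum arithmetic (illustrative only, not an Oracle API): a write quorum is a majority of the RF replicas, so a Raft group with RF replicas tolerates (RF - 1) / 2 failures.

```java
// Hypothetical helper illustrating Raft quorum arithmetic; not an Oracle API.
public class RaftQuorum {

    // Writes need a majority of the RF replicas to be available.
    static int quorumSize(int replicationFactor) {
        return replicationFactor / 2 + 1;
    }

    // Failures tolerated while still keeping a write quorum.
    static int toleratedFailures(int replicationFactor) {
        return (replicationFactor - 1) / 2;
    }

    public static void main(String[] args) {
        for (int rf : new int[] {3, 5}) {
            System.out.printf("RF=%d: quorum=%d, tolerates %d replica failure(s)%n",
                    rf, quorumSize(rf), toleratedFailures(rf));
        }
    }
}
```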

In Oracle Globally Distributed Database, the replication factor is specified for the entire database cluster; that is, all replication units in the database have the same RF. The number of followers is limited to two, so the replication factor is three.

Raft Group - Each replication unit contains exactly one chunk set and has a leader and a set of followers; together these are called a Raft group. The leader and its followers for a replication unit hold replicas of the same chunk set on different shards, as shown below. A shard can be the leader for some replication units and a follower for other replication units. All DMLs for a particular subset of data are executed on the leader first and then replicated to its followers.
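To illustrate this symmetric layout, the following sketch models a hypothetical RF = 3 configuration across three shards; the shard and RU names are invented for the example and are not Oracle identifiers.

```java
import java.util.List;
import java.util.Map;

// Hypothetical layout model; names are illustrative, not Oracle's internals.
public class RaftGroupLayout {

    record RaftGroup(String leader, List<String> followers) {}

    public static void main(String[] args) {
        // With RF = 3 and three shards, each shard leads one RU and follows the
        // other two, so every shard actively serves writes for some of the data.
        Map<String, RaftGroup> layout = Map.of(
                "RU1", new RaftGroup("shard1", List.of("shard2", "shard3")),
                "RU2", new RaftGroup("shard2", List.of("shard3", "shard1")),
                "RU3", new RaftGroup("shard3", List.of("shard1", "shard2")));

        layout.forEach((ru, group) -> System.out.printf(
                "%s: leader=%s followers=%s%n", ru, group.leader(), group.followers()));
    }
}
```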

Replication Role - Nodes (shards) in a Raft group can be in one of three states: Leader, Follower, or Candidate. The leader is responsible for accepting client requests and managing the replication of the log to the other servers (followers). Data flows in only one direction: from the leader to the followers.
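These role transitions follow the standard Raft protocol; the sketch below models them in plain Java and describes Raft in general, not Oracle's internal implementation.

```java
// Conceptual sketch of the Raft role transitions described above.
public class RaftRole {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    private Role role = Role.FOLLOWER;

    // A node that hears nothing from a leader before its election
    // timeout expires becomes a candidate and requests votes.
    void onElectionTimeout() {
        if (role == Role.FOLLOWER || role == Role.CANDIDATE) {
            role = Role.CANDIDATE;
        }
    }

    // A candidate that gathers votes from a majority of the Raft group wins.
    void onMajorityVotes() {
        if (role == Role.CANDIDATE) {
            role = Role.LEADER;
        }
    }

    // Any node that sees a message with a higher term steps down to follower.
    void onHigherTermSeen() {
        role = Role.FOLLOWER;
    }

    Role current() { return role; }
}
```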


High-Level Architecture

Raft replication in Oracle Database 23ai integrates data replication with transaction execution in a sharded database. This enables fast automatic failover with zero data loss. If all shards are in the same data center, it is possible to achieve sub-second failover. There is no need to configure and manage replication tools to achieve high availability.


Raft Replication Logical Architecture


Raft replication automatically reconfigures replication in case of shard host failures or when shards are added or removed from the sharded database. When Raft replication is enabled, a sharded database contains multiple replication units. A replication unit (RU) is a set of chunks that have the same replication topology. Each RU has multiple replicas placed on different shards. The Raft consensus protocol is used to maintain consistency between the replicas in case of failures, network partitioning, message loss, or delay.


Application interaction with sharded database during normal operation


Node failure and recovery are handled automatically, with minimal impact on the application. Failover completes in under three seconds when network latency between availability zones is below 10 milliseconds; this includes failure detection, shard failover, change of leadership, and the application reconnecting to the new leader and continuing its business transactions as before. The impact of the failure on the application can be further abstracted by configuring retries in the JDBC driver, so that the end-user experience is a particular request taking longer rather than returning an error.
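One way to get that behavior is an application-level retry wrapper along the following lines. This is a sketch only: the retry count and delay are illustrative values, and drivers and connection pools typically expose equivalent settings of their own.

```java
import java.sql.Connection;
import java.sql.SQLException;
import javax.sql.DataSource;

// Illustrative retry wrapper; RETRY_COUNT and RETRY_DELAY_MS are example values.
public class RetryingExecutor {
    private static final int RETRY_COUNT = 3;
    private static final long RETRY_DELAY_MS = 500;

    interface SqlWork<T> { T run(Connection c) throws SQLException; }

    static <T> T withRetry(DataSource ds, SqlWork<T> work)
            throws SQLException, InterruptedException {
        SQLException last = null;
        for (int attempt = 1; attempt <= RETRY_COUNT; attempt++) {
            try (Connection c = ds.getConnection()) {
                return work.run(c); // during failover this throws; retry on a fresh connection
            } catch (SQLException e) {
                last = e;
                Thread.sleep(RETRY_DELAY_MS); // wait out leader election before retrying
            }
        }
        throw last; // all attempts failed; surface the last error
    }
}
```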


Single-node failure scenario


As shown in the diagram above, when the leader for a replication unit becomes unavailable, the followers initiate a new leader election using the Raft protocol. As long as a majority of the nodes (a quorum) are still healthy, the Raft protocol ensures that a new leader is elected from the available nodes. Once a new leader is elected, routing clients (e.g., UCP) are notified via ONS so they can update their shard and chunk mapping and route requests to the new leader. During this failover and reconnection period, the application can be configured to wait and retry, using the retry interval and retry count settings in the JDBC driver configuration; this is very similar to instance failover configuration in Oracle RAC (Real Application Clusters). Upon connecting to the new leader, the application continues to function as before.
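A UCP pool with Fast Connection Failover enabled might be configured roughly as follows. The UCP methods shown are standard UCP APIs, but the URL, hosts, and ports are placeholders; check the Oracle Database 23ai documentation for the recommended settings for Raft-enabled sharded databases.

```java
import oracle.ucp.jdbc.PoolDataSource;
import oracle.ucp.jdbc.PoolDataSourceFactory;

// Sketch of a UCP pool wired for ONS-driven failover; connection details are placeholders.
public class UcpFailoverConfig {
    public static PoolDataSource configure() throws java.sql.SQLException {
        PoolDataSource pds = PoolDataSourceFactory.getPoolDataSource();
        pds.setConnectionFactoryClassName("oracle.jdbc.pool.OracleDataSource");
        pds.setURL("jdbc:oracle:thin:@//gsm-host:1522/sharded_service"); // placeholder URL
        pds.setUser("app_user");          // placeholder credentials
        pds.setPassword("app_password");

        // Fast Connection Failover: UCP subscribes to ONS events so dead connections
        // are removed and new requests are routed to the current leader.
        pds.setFastConnectionFailoverEnabled(true);
        pds.setONSConfiguration("nodes=ons-host1:6200,ons-host2:6200"); // placeholder nodes

        return pds;
    }
}
```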


Recovered state after single-node failure


When the original leader comes back online after a failure, it first identifies the current leader and rejoins the cluster as a follower. Once the failed shard rejoins the cluster, the recovered follower asks the leader for log entries starting from its current index in order to catch up with the leader. Raft log retention is configurable using the shardsize_per_raft_logfile_gb parameter. When the recovered follower is synchronized with the leader, it continues as a follower.
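Conceptually, the catch-up exchange looks like the sketch below, which models generic Raft log shipping rather than Oracle's wire protocol: the follower reports its last log index and the leader returns the entries it is missing.

```java
import java.util.List;

// Generic Raft log catch-up, for illustration only.
public class LogCatchUp {
    record LogEntry(long index, String payload) {}

    // Leader side: return every entry after the follower's last known index,
    // assuming those entries are still retained in the Raft log.
    static List<LogEntry> entriesAfter(List<LogEntry> leaderLog, long followerLastIndex) {
        return leaderLog.stream()
                .filter(e -> e.index() > followerLastIndex)
                .toList();
    }

    public static void main(String[] args) {
        List<LogEntry> leaderLog = List.of(
                new LogEntry(1, "dml-1"), new LogEntry(2, "dml-2"), new LogEntry(3, "dml-3"));
        // A follower that stopped at index 1 receives entries 2 and 3.
        System.out.println(entriesAfter(leaderLog, 1));
    }
}
```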

Leaders can be rebalanced among the shards by issuing the GDSCTL command SWITCHOVER RU -REBALANCE. You can also build automation to monitor replication unit status and perform failovers to maintain the desired configuration. If the current leader no longer retains enough Raft log, the follower must instead be repopulated from one of the healthy followers using the GDSCTL command COPY RU.
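As a starting point for such automation, the following sketch shells out to GDSCTL from Java. It assumes gdsctl is on the PATH; SWITCHOVER RU -REBALANCE is the command named above, while STATUS RU is assumed here for monitoring, so verify the exact syntax against your release.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative automation sketch that invokes GDSCTL as an external process.
public class RuLeaderRebalancer {

    static int runGdsctl(String... args) throws IOException, InterruptedException {
        List<String> cmd = new ArrayList<>();
        cmd.add("gdsctl");
        cmd.addAll(List.of(args));
        Process p = new ProcessBuilder(cmd)
                .inheritIO()   // forward GDSCTL output to this process's console
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        runGdsctl("status", "ru");                     // inspect current leaders and followers
        runGdsctl("switchover", "ru", "-rebalance");   // redistribute leaders across shards
    }
}
```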


Benefits of Raft Replication Architecture

  1. All the shards are symmetric and actively used, so the workload is balanced across all available hardware. Note: this is sometimes called ‘active-active’. We do not call it that because the same data is not simultaneously updated at multiple nodes.
  2. Each node can host primary and spare copies of data. When a new node is added to the pool, it automatically starts storing data and replicas without any manual intervention. This differs from the Data Guard-based approach, where you have to add a primary and Data Guard replicas when you add a shard. The result is finer-grained elasticity: you can grow using smaller units.
  3. The spares for a given shard are distributed among many servers (instead of 1:1 as with Data Guard). Distributing the spares among multiple servers reduces the overhead of re-instantiation after a shard failure, since the work is spread across multiple shards. This is possible because the replication unit is smaller than an entire shard, which is the unit of replication with Active Data Guard or Logical Standby.
  4. Supports fast failover (sub-second) with zero data loss. Customers running a popular NoSQL database that also implements the Raft consensus algorithm for leader election report 1-2 second failover (including detection time).
  5. High performance and low replication overhead, because follower writes are asynchronous (see the sketch after this list).
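The following sketch illustrates the last point generically: the leader sends a log entry to all followers in parallel and acknowledges the commit as soon as a majority of the Raft group (counting the leader itself) has persisted it, while slower followers catch up in the background. This models the Raft protocol in general, not Oracle's implementation.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.atomic.AtomicInteger;

// Generic Raft-style majority-ack commit, for illustration only.
public class MajorityAckCommit {

    interface Follower {
        CompletableFuture<Void> append(String entry); // resolves when the follower persisted the entry
    }

    static CompletableFuture<Void> replicate(List<Follower> followers, String entry) {
        int clusterSize = followers.size() + 1;   // followers plus the leader itself
        int followerAcksNeeded = clusterSize / 2; // leader's own durable write supplies the other vote
        CompletableFuture<Void> committed = new CompletableFuture<>();
        AtomicInteger acks = new AtomicInteger();
        for (Follower f : followers) {
            f.append(entry).thenRun(() -> {
                if (acks.incrementAndGet() == followerAcksNeeded) {
                    committed.complete(null);     // majority reached: the entry is committed
                }
            });
        }
        return committed;                         // remaining followers finish in the background
    }
}
```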

Business Problems Solved with Raft Replication in Oracle Database 23ai

1. Data Consistency

Consistency is the lifeblood of many business-critical applications, from financial systems to e-commerce platforms. Ensuring that all nodes in a distributed database have the same, up-to-date data is a non-negotiable requirement. Raft replication in Oracle Database 23ai guarantees data consistency, enabling businesses to operate with confidence.

2. Fault Tolerance

System outages, network partitions, and node failures are unfortunate realities in the world of distributed systems. Oracle Database 23ai leverages Raft to provide the fault tolerance needed to keep your business operations running smoothly, even in the case of the loss of an entire data center or availability zone.

3. Leader Election

In a distributed environment, coordinating writes and ensuring a linearizable order of commands is essential. Raft in Oracle Database 23ai elects a leader node that manages this coordination, preventing data conflicts and inconsistencies. Business applications that require coordinated, ordered operations benefit immensely from this feature.

4. Simplicity

Complexity can be a significant hurdle when working with distributed systems. Oracle Database 23ai's implementation of Raft offers simplicity and understandability, making it easier to design, deploy, and troubleshoot your systems. Your technical teams can spend more time innovating and less time deciphering the intricacies of managing replication tools and software.


A Glimpse into the Future

In conclusion, the incorporation of the Raft consensus algorithm in Oracle Database 23ai is not just an upgrade; it is a leap forward. It signifies a deep commitment to delivering data management solutions that empower businesses to navigate the challenges of today and seize new opportunities. Alongside Raft replication, Oracle Database 23ai also introduces other notable features such as the Directory-Based Sharding Method, Automatic Data Move on Sharding Key Update, Fine-Grained Refresh Rate Control for Duplicated Tables, and Synchronous Duplicated Tables. With Raft replication in Oracle Database 23ai, businesses can trust in the consistency, reliability, and scalability of their data, enabling them to focus on what truly matters: innovation, growth, and success.


Disclaimer

The information provided in this article is intended for informational purposes only. Views and opinions shared are solely mine and may not be incorporated into any contract. It is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions. The development, release, and timing of any features or functionality described for Oracle's products remain at the sole discretion of Oracle.


Hassan Khan

CSPO | Technical Consultant | Solution Architect | FinTech

2 weeks

Follow-up questions, if anyone cares to answer. 1. How does Oracle decide which subset of data (DML) belongs to which replication unit within a shard? Is it defined at the time of table creation? 2. Can we say that replication units are basically partitions (horizontally scaled) within a shard that hold multiple tables? 3. What if a particular replication unit becomes a hot spot? Is it possible to break the replication unit into two or more replication units (without downtime)? 4. When a dead node comes back online, how do we determine that the node has all the changes synced up from the other nodes and is ready to become operational? Is it a manual process, or is it dealt with automatically within the Oracle cluster?

Mykola Jurko

SAP Technical Expert / Oracle DBA

8 months

Hi Kehinde Otubamowo. Nice article. Could you please clarify the following: how will licensing work for Raft replication? Will it be separate from the existing licensing options?

Kirill Panov

Tech Program Manager | Reinforcing & Advancing Engineering Projects with Tech Know-How | HealthTech, Data-Obsessed, Solution Design

9 months

Hey Kehinde Otubamowo. I can't believe I'm just seeing this article now! Really great and encompassing review of Raft replication in Oracle Database 23c. I especially liked how you detailed the architecture and the logical flow of Raft replication, shedding light on its importance for data integrity, fault tolerance, and consistency in distributed systems. This is particularly pertinent to my current focus as I'm actively working on getting AWS certified, and understanding these concepts is invaluable for designing resilient cloud architectures. Thanks for sharing, Kehinde Otubamowo

Looking forward to seeing more posts on this awesome new feature Kehinde Otubamowo. Good stuff!

Kay Malcolm

Vice President, Database Product Management at Oracle | Board Advisor | Yahoo Heroes Top 100 Women Future Leaders | Speaker | Podcast Host

1 year

Would love to see this in an Oracle LiveLab so we can all try it out! Good info!
