Sprite

Raft

Raft is a protocol for implementing distributed consensus. 

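Distributed here means multiple nodes.
A single node cannot guarantee availability and scales poorly, because it is limited by its hardware.
Multiple nodes improve availability and scale well, but their data can become inconsistent, which introduces the concept of partition tolerance.
Raft is the protocol that lets multiple nodes return data they have reached consensus on.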

Node States

A node can be in one of three states:
Follower state,
Candidate state,
Leader state.
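
The three states can be written down as a small state machine. This is only a sketch: the names State, ALLOWED and can_transition are made up for illustration, and the candidate-to-follower edge comes from the Raft paper rather than from this walkthrough.

```python
from enum import Enum


class State(Enum):
    FOLLOWER = "follower"
    CANDIDATE = "candidate"
    LEADER = "leader"


# Hypothetical transition table (an illustration, not a library API).
ALLOWED = {
    # A follower that stops hearing from a leader becomes a candidate.
    State.FOLLOWER: {State.CANDIDATE},
    # A candidate wins the election, or falls back when it sees another leader.
    State.CANDIDATE: {State.LEADER, State.FOLLOWER},
    # A leader steps down when it sees a higher election term.
    State.LEADER: {State.FOLLOWER},
}


def can_transition(src: State, dst: State) -> bool:
    """Return True if a node may move directly from src to dst."""
    return dst in ALLOWED[src]
```

Note that a follower never becomes leader directly; it has to go through the candidate state and win an election first.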

Leader Election

All our nodes start in the follower state.
If followers don't hear from a leader then they can become a candidate.
The candidate then requests votes from other nodes.
Nodes will reply with their vote.
The candidate becomes the leader if it gets votes from a majority of nodes.
There are two timeout settings which control elections: the election timeout and the heartbeat timeout.

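Every node starts out as a follower.
If a follower hears nothing from the leader for long enough, it assumes the leader is down and becomes a candidate.
The candidate then asks the other nodes for their votes; if it receives a majority, it becomes the leader.
1. How long a follower waits before becoming a candidate: the election timeout.
A randomized value, typically between 150ms and 300ms.
2. How a follower is kept in the follower state: the heartbeat timeout.
The leader sends a message to its followers at the interval given by the heartbeat timeout.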
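
The vote count on the candidate side is simple arithmetic. The sketch below assumes the candidate already holds the replies from its peers as booleans; wins_election is a hypothetical helper, not part of any Raft library.

```python
from typing import Iterable


def wins_election(peer_replies: Iterable[bool], cluster_size: int) -> bool:
    """Return True if the candidate holds votes from a majority of the cluster.

    peer_replies has one boolean per peer that answered (True = vote granted).
    cluster_size counts every node, including the candidate itself.
    """
    votes = 1 + sum(1 for granted in peer_replies if granted)  # 1 = its own vote
    return 2 * votes > cluster_size  # strictly more than half
```

In a 5-node cluster two granted votes plus the candidate's own vote are enough (3 of 5), while a single granted vote is not (2 of 5).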

election timeout

The election timeout is the amount of time a follower waits before becoming a candidate.
The election timeout is randomized to be between 150ms and 300ms.
After the election timeout the follower becomes a candidate, starts a new election term,
votes for itself and sends out Request Vote messages to the other nodes.
If the receiving node hasn't voted yet in this term then it votes for the candidate
and resets its election timeout.
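
A rough sketch of the receiving side of a Request Vote message follows. The class and field names (Voter, voted_for, timeout_resets) are assumptions for illustration. Real Raft also compares how up to date the candidate's log is before granting a vote; that check is left out here.

```python
import random
from dataclasses import dataclass
from typing import Optional


def random_election_timeout_ms(rng: random.Random) -> float:
    """Pick a randomized election timeout in the 150ms to 300ms range."""
    return rng.uniform(150, 300)


@dataclass
class Voter:
    current_term: int = 0
    voted_for: Optional[str] = None
    timeout_resets: int = 0  # stands in for restarting a real timer

    def handle_request_vote(self, term: int, candidate_id: str) -> bool:
        if term < self.current_term:
            return False  # request from an old term, refuse
        if term > self.current_term:
            # A new term begins, so this node has not voted in it yet.
            self.current_term = term
            self.voted_for = None
        if self.voted_for in (None, candidate_id):
            self.voted_for = candidate_id
            self.timeout_resets += 1  # granting a vote resets the election timeout
            return True
        return False  # already voted for someone else in this term
```

The randomization matters because it makes it unlikely that two followers time out at the same moment and split the vote.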

heartbeat timeout

Once a candidate has a majority of votes it becomes leader.
The leader begins sending out Append Entries messages to its followers.
These messages are sent in intervals specified by the heartbeat timeout.
Followers then respond to each Append Entries message.
This election term will continue until a follower stops receiving heartbeats and becomes a candidate
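
The interplay between the two timeouts can be simulated with a logical clock instead of real timers. Everything below (FollowerTimer, simulate, the millisecond loop) is a made-up sketch under that assumption, not how a production implementation schedules heartbeats.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class FollowerTimer:
    election_timeout_ms: int
    last_heartbeat_ms: int = 0

    def on_append_entries(self, now_ms: int) -> None:
        # Every Append Entries message (heartbeat) restarts the countdown.
        self.last_heartbeat_ms = now_ms

    def should_start_election(self, now_ms: int) -> bool:
        return now_ms - self.last_heartbeat_ms >= self.election_timeout_ms


def simulate(heartbeat_interval_ms: int, election_timeout_ms: int,
             leader_dies_at_ms: int, end_ms: int) -> Optional[int]:
    """Return the time at which the follower becomes a candidate, or None."""
    timer = FollowerTimer(election_timeout_ms)
    for now in range(end_ms + 1):
        leader_alive = now < leader_dies_at_ms
        if leader_alive and now % heartbeat_interval_ms == 0:
            timer.on_append_entries(now)
        elif timer.should_start_election(now):
            return now
    return None
```

With a 50ms heartbeat interval and a 200ms election timeout, a leader that dies at 1000ms sent its last heartbeat at 950ms, so the follower becomes a candidate at 1150ms. As long as the leader stays alive the function returns None.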

Log Replication

Once we have a leader elected we need to replicate all changes to our system to all nodes.
This is done by using the same Append Entries message that was used for heartbeats.
All changes to the system now go through the leader.
Each change is added as an entry in the leader's log.
This log entry is initially uncommitted, so it won't update the node's value.
To commit the entry the leader first replicates it to the follower nodes,
then waits until a majority of nodes have written the entry.
The entry is now committed on the leader node.
The leader then notifies the followers that the entry is committed.
The cluster has now come to consensus about the system state.
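
The replication steps in this section can be sketched in a few lines. This is a simplified model under stated assumptions: Node, Entry and replicate are hypothetical names, messages are plain method calls, and an unreachable follower simply fails to acknowledge.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Entry:
    term: int
    command: str
    committed: bool = False


@dataclass
class Node:
    reachable: bool = True
    log: List[Entry] = field(default_factory=list)
    value: Optional[str] = None

    def append_entry(self, term: int, command: str) -> bool:
        """Write an uncommitted entry; return True as the acknowledgement."""
        if not self.reachable:
            return False
        self.log.append(Entry(term, command))
        return True

    def commit_last(self) -> None:
        """Mark the newest entry committed and apply it to the node's value."""
        self.log[-1].committed = True
        self.value = self.log[-1].command


def replicate(leader: Node, followers: List[Node], term: int, command: str) -> bool:
    leader.append_entry(term, command)  # uncommitted, value unchanged so far
    acked = [f for f in followers if f.append_entry(term, command)]
    cluster_size = len(followers) + 1
    if 2 * (len(acked) + 1) <= cluster_size:  # +1 is the leader's own copy
        return False  # no majority, the entry stays uncommitted
    leader.commit_last()
    for follower in acked:  # the leader tells followers the entry is committed
        follower.commit_last()
    return True
```

With a leader and four followers of which two are unreachable, the entry is written on 3 of 5 nodes and gets committed. With three followers unreachable, only 2 of 5 nodes have it, so the leader's value stays unchanged.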
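How does Raft stay both available and consistent? The key is that the data is kept redundantly on every node,
so if the leader goes down, a follower can be elected leader and keep serving agreed-upon data to clients.
Log replication is how the leader copies data to the followers.
First, the cluster interacts with the outside world only through the leader.
For example, when a client wants to update a value, the update is sent to the leader, where it starts out uncommitted.
The leader appends the update, as a log entry, to the other nodes.
Once a majority of nodes report that they have written the entry, the leader marks it committed and tells the client that the update succeeded.
1. A network failure can leave the cluster with more than one leader. How does the data stay consistent?
First, committing an entry needs replies from a majority of the whole cluster, so a leader stranded on the minority side cannot update any data. Clusters usually have an odd number of nodes (a convention, not something Raft requires), because an even split would leave neither side with a majority.
A majority means strictly more than half of the nodes.
After the network heals, the leaders compare election terms. The leader with the lower term steps down, rolls back its uncommitted entries, and replicates the log of the other leader.
The election term increases with every new election: a candidate increments it when it starts one. Winning takes votes from a majority of nodes, so the leader with the higher term is the one the majority elected most recently, and its log is the one the cluster follows.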
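
The majority rule also explains the usual preference for odd cluster sizes. The helpers below (quorum, tolerated_failures, side_can_commit) are hypothetical names used only to show the arithmetic.

```python
def quorum(cluster_size: int) -> int:
    """Smallest number of nodes that is strictly more than half."""
    return cluster_size // 2 + 1


def tolerated_failures(cluster_size: int) -> int:
    """How many nodes can be lost while a majority is still reachable."""
    return cluster_size - quorum(cluster_size)


def side_can_commit(side_size: int, cluster_size: int) -> bool:
    """Can one side of a network partition still commit entries?"""
    return side_size >= quorum(cluster_size)
```

A 3-node cluster needs 2 nodes and tolerates 1 failure; a 4-node cluster needs 3 and still tolerates only 1; a 5-node cluster needs 3 and tolerates 2. In a 5-node cluster split 2 against 3, only the side with 3 nodes can commit, and in a 4-node cluster split 2 against 2 neither side can.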

network partitions - unhappy path

Raft can even stay consistent in the face of network partitions.
- Let's add a partition to separate A & B from C, D & E.
Because of our partition we now have two leaders in different terms: node B from the old term, and node C from a newer one.
- Let's add another client and try to update both leaders.
1. One client will try to set the value of node B to "3"
Node B cannot replicate to a majority so its log entry stays uncommitted.
2. The other client will try to set the value of node C to "8".
This will succeed because it can replicate to a majority.
- Now let's heal the network partition.
Node B will see the higher election term and step down.
Both nodes A & B will roll back their uncommitted entries and match the new leader's log.
Our log is now consistent across our cluster.
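
What happens on node B when the partition heals can be sketched as below. This is a deliberate simplification: Peer and on_append_entries are hypothetical names, and the leader sends its whole log in one message, whereas real Raft sends entries incrementally together with the index and term of the preceding entry.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

LogEntry = Tuple[int, str]  # (term, command)


@dataclass
class Peer:
    term: int
    role: str
    log: List[LogEntry] = field(default_factory=list)
    commit_index: int = 0  # how many entries at the front of the log are committed

    def on_append_entries(self, leader_term: int, leader_log: List[LogEntry],
                          leader_commit: int) -> bool:
        if leader_term < self.term:
            return False  # message from a stale leader, reject it
        # An equal or higher term: accept the sender as leader and step down.
        self.term = leader_term
        self.role = "follower"
        # Keep the shared prefix, drop conflicting entries, copy the rest.
        keep = 0
        limit = min(len(self.log), len(leader_log))
        while keep < limit and self.log[keep] == leader_log[keep]:
            keep += 1
        self.log = self.log[:keep] + list(leader_log[keep:])
        self.commit_index = min(leader_commit, len(self.log))
        return True
```

For example, node B (term 1, leader, with the uncommitted entry (1, "set 3")) receives the log [(2, "set 8")] from node C in term 2. It steps down, discards its own entry and ends up with C's log. A message going the other way, from B's old term to C, is rejected.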

conclusion

The Secret Lives of Data - https://raft.github.io/ - https://raft.github.io/raft.pdf

Raft Distributed Consensus