A distributed transaction processing system
When we use a server to provide data on the production line service, I will meet the following two questions:
1) a server performance is not enough to provide adequate service to all network request.
2) we always afraid of our this server downtime, the service is not available or loss of data.
So we have to our server to expand, add more machines to share the performance problems, and to solve the problem of single point of failure.Usually, we will through two kinds of means to expand our data services:
Partitions: 1) data is the data block on a different server (such as: uid % 16, consistent hash, etc.).
2) data mirror: let all the servers have the same data, to provide considerable service.
For the first case, we can't solve the problem of data loss, single server out of the question, there will be a part of the loss of data.So, high availability data services can only be done through the second method, data redundant storage (industry generally recognized as safe backup number should be in 3, such as: Hadoop and chateau marmont).Add more machines, however, will make our data services is very complex, especially across server transaction processing, also is the data consistency across servers.This is a difficult problem.Let us Use the most classic Use Case: "A account to remit money to account B" to explain, familiar with RDBMS affairs all know from account to account B need six operation:
1. From A read balance in the account.
2. On A account to do subtraction operation.
3. Write the results back into A account.
4. Read the balance from the B account.
5. The B account do addition operation.
6. Write the results back to the account of B.
To the consistency of the data, the six things, either is successfully done, or is not successful, and the operation process, to A and B account other access must be locked, the lock is to rule out other read and write operations, or you will have the problem of dirty data, that is the transaction.So, are we added more machine, the things get complicated:
1) in data partition scheme: if A account and B account data is not in the same server?We need a transaction across machines.In other words, if A buckle money is
successful, but B and money is not successful, we need to give back to the operation of A roll back.This in the case of cross machine, becomes more complex.
2) in data mirroring solution: A account account remittance between A and B can be done on one machine, but don't forget we have multiple machines there is A copy of the account and B account.If send money to A account has two concurrent operation (should remit B and C), the two operations in different both servers?That is to say, in data mirror, on the same data in different server how to ensure the consistency of write operation, ensure that the data do not conflict?
At the same time, we also need to consider the factor of performance, if don't consider the performance, guarantee transaction is not difficult, the system slowly.In addition to consider performance, we also need to consider availability, in other words, a machine, the data is not lost, services can be provided by the other machine to continue.So, so a few we need to consider the following situations:
1) disaster: data is not lost, the Failover node
2) the consistency of the data: transaction processing
3) performance: throughput, response time
Said before, to solve the data is not lost, only through the method of data redundancy, even data partitioning, each district also need to manipulate the data redundancy.This is a copy of the data: when a node of the lost data can be read in copy, copy of data is the only solution to abnormal loss of data, a distributed system.So, in this article, the sake of simplicity, we only discuss the data redundancy case consider the consistency of the data and performance problems.Simply put:
1) to make data have high availability, will have to write more data.
2) write more questions can lead to data consistency.
3) the problem of data consistency and can cause performance problems
This is the software development, press the gourd ladle.
Of data consistency, simple says there are three types (of course, if the segmentation, there are a lot of consistency model, such as: sequential consistency, consistency, FIFO session consistency, single read consistency, single write consistency, but for the sake of this article easy to read, I only said three below) :
1) Weak Weak consistency: when you write a new value, a read operation on the data copy may be read, can also be read not to come out.For example: some cache system, network
game of the other players data has nothing to do with you, VOIP system, or baidu search engine (ha ha).
2) Eventually eventual consistency: when you write a new value, may read not to come out, but after a certain time window can guarantee ultimate read out.Such as DNS, E-mail, Amazon S3, Google's search engine system.
3) Strong Strong consistency: new data once written, can read in any copy of any time a new value.For example, the file system, an RDBMS, Azure Table are strong consistency.
From these three model of the same type, we can see that Weak and Eventually in general is asynchronous redundant, and Strong in general are synchronous redundant, asynchronous usually means better performance, but also means more complex state control.Synchronization means that simple, but it also means that performance degradation.Ok, let's 1, step by step to see what technology:
The first is the Master - Slave structure, for this kind of structure, general is a Master Slave backup.In this system, the general is designed as follows:
1) read and write requests shall be the responsibility of the Master.
2) write write requests to the Master, the Master synchronization to the Slave.
From Master synchronization to the Slave, you can use asynchronous, you can also use the sync, can use the Master to push, you can also use the Slave to pull.Usually is Slave to cyclical pull, therefore, is the ultimate consistency.The design problem is that if the Master pull cycle broken down, so will lead to the loss of data in the time slice.If you don't want to let the data lost, Slave can Only be Read - Only way to Master recovery, etc.
, of course, if you can tolerate the data lost, you can immediately make instead of Master Slave work (for only responsible for computing nodes, no data consistency and data loss, the Master - Slave mode can solve the problem of single point), of course, the Master Slave can also be a strong consistency, such as: when we write the Master, the Master is responsible for write oneself first, after success, and then write a Slave, after a successful return to success in both, the whole process is synchronous, if write a Slave fails, then the two methods, one is the tag Slave unavailable error and continue to service (such as Slave synchronization of Master data after recovery, you can have multiple Slave, so one less, and backup, as said before write three), the other is a rollback themselves and return to write fail.(note: don't usually write Slave, because if you write the Master himself after failing to rollback Slave even, at this time if the rollback Slave fails, will have to manually correction data) you can see that if the Master - Slave needs to make strong consistency has more complex.
Master, Master, call againMulti-masterRefers to a system, there are two or more Master every Master with the read - write service.This model is the Master - Slave enhanced, data synchronization between generally is through the asynchronous completion between the Master, so is the ultimate consistency.Master - Master is that the benefits of a Master hung, other service Master can do normal, speaking, reading and writing, he and the Master - Slave, when not to be copied to other Master data, the data will be lost.Many database support Master - Master of the Replication mechanism.
In addition, if more than one Master to modify the same data, the model of the nightmare - the conflict between the data in the merger, this is not an easy thing.Look at the design of the Vector of chateau marmont Clock (the version number of the record data and modifier) will know that it is not so simple, and chateau marmont on data conflict this matter were made to the users themselves.Just like our SVN source conflict, for a conflict with a single line of code, only to the developer to handle myself.(behind this article will discuss the chateau marmont Vector Clock)
Two/Three Phase Commit
The abbreviation of the agreement, also called 2 PC, Chinese name is a two-phase commit.In a distributed system, although each node can know their own operation success or failure, but can't know success or failure of the operation of the other nodes.When a transaction across multiple nodes, in order to keep the ACID characteristic of transaction, need to introduce a unity as coordinator component to control all the nodes (referred to as participants) operating results and eventually indicating whether the node should submit the operating results are real (for example, the updated data to disk, etc.).A two-phase commit algorithm is as follows:
1. Coordinator will ask whether all of the participants nodes, submit operations can be performed.
2. Each participant tostart transaction execution of preparation work, such as: for the resource locked,
reserved resources, write the undo/redo log...
3. Preparations for participants response coordinator, if the transaction is successful, the response can
"submit", or in response to "refused to submit.
The second stage:
; If all of the participants responded "can submit", so, the coordinator to send all of the participants
"officially submit" command.Participants completed formal submission, and release all the
resources, and then respond to "finish", the nodes coordinator to collect the "finish" response after
the Global Transaction.
; If there is a participant in response to "refused to submit", so, the coordinator to send "rollback" all
of the participants, and release all the resources, and then respond to "rollback", the coordinator to
collect each node "rollback" response, cancel the Global Transaction.
We can see that 2 PC to do that is the first stage, an algorithm of the second stage to make a decision, also can see the 2 PC this is the strong consistency of the algorithm.We discussed the Master - Slave in front of the strong consistency strategy, and 2 PC is a little similar, just 2 PC more conservative - try to submit again.2 PC use is more, in some system design, will be series of A series of calls, such as: - > C - A - > B > D, each step will allocate some resources or rewrite some of the data.For instance we B2C online shopping order needs to be done in the background will have a series of processes.If we do one step at a time, and will appear such problems, if a particular step do not bottom go to, then at every touch of the front needs to be done to reverse the assigned resources recycling them off operation, therefore, is more complicated to operate.Now a lot of process (Workflow) 2 PC will reference this algorithm, using the try - > confirm process to ensure that the entire process successfully completed.For an example of a popular western church to get married, have such pattern:
1) pastor asked the bride and groom, respectively, are you willing to...Regardless of the physical...(a)
2) when the bride and groom are answer to (lock life resources), the priest will say: I pronounce you...(transaction commit)
This is how classic a two-phase commit transaction processing.In addition, we also can see some of these problems, A) is one of the synchronous blocking operation, it will greatly affect performance.B) another major problem is that on a TimeOut, for instance,
1) if the first phase, the participants were not received inquiries, or the participant's response has been to coordinator.So, need to deal with the coordinator to do overtime, over time, can be as a failure, also can try again.
2) if the second stage, after the issue of the formal submission if some participants did not receive, or participants commit/rollback after the confirmation of no return, once the participants' response timeout, or try again, or node to eliminate the problem of the
participants markers for the whole cluster, so that we can guarantee the service node is data consistency.
3) the worst is that the second phase, if the participants did not receive the commit of the coordinator/fallback instruction, participants will be in a state of "unknown" phase, participants don't know what to do, such as: if all of the participants completed the first phase of the reply after (may all yes, no, all possible parts yes no), if the coordinator to hang out at this time.All the nodes don't know what to do (ask the other participants are not eligible).For consistency, die such as coordinator, or resend the first phase of the yes/no command.
Two paragraphs to submit one of the biggest problems is the item 3), if the first stage is completed, participants in the second order not received the decision, then the data node will into a state of "overwhelmed", the state will block the entire transaction.That is to say, the Coordinator for the Coordinator for the completion of the transaction is very important, the Coordinator of availability is a key.For some, we introduce the three submitted, submit in three sectionsWikipediaDescribed on the following, he put the second section of the submission of the first paragraph break into two sections: ask, and then locked resources.The last real submitted.Three sections of submit schematic diagram is as follows:
Submitted three core philosophy is: when asked not locking resources, unless everyone agreed, to begin to lock resources.
In theory, if return success all the nodes in the first stage, so there is reason to believe that the probability of successful submission.In this way, can reduce the probability that participants Cohorts status unknown.That is to say, once the participants received PreCommit, meaning that he knew everyone agreed to modify it.It is very important.Below we look at the 3 PCS of state transition diagram (note the dotted line in the graph, the F, T is a Failuer or
Timeout, among them: state meaning is q - Query, a - Abort, w - Wait, p - PreCommit, c - Commit)
In fact, the three sections of submission is a very complicated things, to implement very difficult, but there are also some problems.
See here, I believe you have a lot of questions, you must be thinking about 2 PC / 3 PCS in all kinds of failure scenario, you will find the Timeout is a very difficult thing, because of a Timeout on the network in a lot of time for you no matter from, you don't know each other is to do or not do.So how are you good a state machine is because the Timeout is a decoration.
A web service can have three states: 1) Success, 2) Failure, 3) a Timeout, the third is an absolute nightmare, especially when you need to maintain state.
The Two Generals Problem (Two general problems)
Two Generals ProblemThe general problem is such a thought experiment: there are two armies, they have a general leadership, respectively are now ready to attack a city built fortifications.The two armies are stationed in the city, near points of a hill.A valley separating two mountains, and there is only one communication mode two generals on both sides of their respective messenger to and from the valley.Unfortunately, the valley has been taken by the
city's defenders, and there is a likely, that any was arrested will be sent by Courier by valley.Please note that although the two generals have agree on to attack the city, but in their occupation of jockeying for position before, did not agree on attack time.Two generals must let his troops attack cities to succeed at the same time.As a result, they have to communicate with each other, to determine a time to attack, and agreed to at the time of attack.If there is only a general attack, it would be a catastrophic failure.This thought experiment including consider them how to do it.It's our thinking:
1) general first to send a message "let'sstart at 9 o 'clock in the morning attack".However, once the Courier be sent, whether he passed the valley, the first general is unclear.Any uncertainty will make the first general attack hesitate, because if the second general can't attack at the same time, the city garrison will defeat his army of attack, in his army to be destroyed.
2) knowing this, the generals will need to send a confirmation reply slip: "I received your email, and will attack at nine o 'clock."However, if the messenger was caught with a confirmation message?So the second general would hesitate to own a confirmation message if I can reach.
3) so, it seems that we have to let the first general sent a confirmation message - "I got your confirmation.However, if the messenger was caught?
4) in this way, we'll send a second generals "confirm receipt of your confirmation information.
On, then you will find that the things soon become, no matter how much we sent a confirmation message, have no way to guarantee the two generals have enough confidence own messenger is not capture the enemy.
The problem is no solution.Two general problems and its solution is proved by E.A.A first kkoyunlu, Kevin Everett sat kanadham and R.V.H uber in 1975 for "some limitations and eclectic network communication design" published the article, in this article a description in page 73 of the communications between two gangs in were elucidated.In 1978, Jim Gray of his book database operating system considerations (beginning on page 465) was named two general paradox.As the definition of two general problems and solution and the source of the proof of this reference is widely mentioned.
This experiment is intended to clarify: trying to build on a dodgy connection communication to coordinate the hidden danger of an action, and in the design of the big challenges.
In engineering, a practical method to solve the problem of two general is the use of a communication channel to bear not the reliability of the scheme, don't try to eliminate the unreliability, but will not cut to an acceptable degree of reliability.For example, the first general discharge 100 messenger and expects to have the possibility that they were caught in the very small.In this case, whether or not the second general attacks or any news, the first general attack.In addition, the first general can send a message flow, and the second general can each of them a message send a confirmation message, so that if every message has been received, the two generals will feel better.However, we can see from the proof that they are both not sure the attack can be coordinated.They don't have the algorithms available (for example, received four or more messages) to ensure that prevent only one party attack.Moreover, the first general can also for each message number, said this was the no. 1, 2...Until the number n.This method can make the second general know how reliable communication channel, and returns the appropriate amount of information to ensure that the last message was received.If channel is reliable, just a message, and the rest is of little help.Last one and the first message loss probability is equal.
Two general problems can be extended into more abnormal condition of the Byzantine Generals (Byzantine Generals Problem), the story goes like this: the Byzantine located in Istanbul, Turkey, now is the capital of the eastern Roman empire.Due to the Byzantine territory vast Roman empire, for defensive purposes, so every army separated far away, between the general and general can only pass the message on messenger.At the time of war, the Byzantine army general all necessary to reach a consensus, to decide whether to have the chance to win to fight against the enemy's camp.But, the military might be traitors and the enemy spy, the generals would disrupt the traitors around or decision-making process.At that time, in the case of members are known rebellion, the remaining loyal general under is not affected by the traitor how consistent agreement, this is the problem the Byzantine generals. Paxos algorithm
Various Paxos algorithm on WikipediaThe description is very detailed, you can go to check out.
Paxos algorithm to solve the problem is in a distributed system may be described above in how to agree on a certain value, ensure no matter happen more than any exception, have won't destroy the consistency of the resolution.A typical scenario is that in a distributed database system, if the initial state of all nodes are identical with each node to perform the same action sequences, so they can get a consistent state in the end.To ensure that each node to perform the same command sequence, need on each instruction to execute a "consistency algorithm" to ensure that each node to see instructions.A general consistency algorithm can be applied in many scenarios, and is an important problem in distributed computing.Since the 1980 s for consistency algorithm research has not stopped.
Notes: Paxos algorithm is Leslie lambert (Leslie Lamport, is in the LaTeX "La", who is now at Microsoft research) in 1990 put forward a kind of consistency algorithm based on message passing.Because the algorithm is difficult to understand at first did not arouse people's attention, make Lamport published in 1998, eight years later to the ACM the Transactions on Computer Systems (The Part-Time Parliament).Even so paxos algorithm still didn't get to 2001 Lamport think peer can not accept his sense of humor, and easy to accept the way to describe it again (Paxos Made Simple).Visible Lamport have a special liking to Paxos algorithm.Paxos algorithm is widely used in recent years that it is the important position of consistency in a distributed algorithm.2006 Google's "cloud" early three papers, including the Lock service USES Paxos Chubby as a Chubby Cell consistency in the algorithm, Paxos popularity soared from now on.(Lamport himself inIn his blogDescribes the whole he spent nine years published this algorithm)
Note: Amazon AWS, all of the cloud services are based on a ALF (Async Lock Framework), the Framework of the ALF is Paxos algorithm used.When I was in the Amazon, the internal share video, designers in the Principle of internal Talk said he reference the ZooKeeper method, but it took another than they are easier to read the way of the algorithm.
In simple terms, a Paxos purpose is to let the whole cluster nodes to agree a value changes.Paxos algorithm basically is a democratic election, the decision of the most will become the unity of the whole cluster.Any one point can be put forward proposals to modify a certain data, whether to pass the bill depends on whether the cluster have agreed to more than half of the nodes (so Paxos algorithm needs the node in the cluster is singular).
This algorithm has two stages (assuming that there are three nodes: A, B, C) :
First stage: the Prepare phase
Request to Prepare to apply for A change Request to all the nodes A, B, C.Note that Paxos algorithm will have A Sequence Number (you can be thought of as A proposal, the constantly increasing Number, and the only, that is to say, A and B may not have the same bill Number), the proposal will and change requests sent together, often refused to any node in the Prepare phase is less than the current proposal request.So, to all the nodes in A node application change request, need to bring A proposal, the new proposal, the proposal, the more is the biggest of all.