Transaction Processing in Distributed Systems
When we use a single server to provide a data service in production, we run into two problems:
1) One server's performance is not enough to serve all the network requests.
2) We always worry about this server going down, leaving the service unavailable or losing data.
So we have to scale the service out: add more machines to share the load, and to remove the single point of failure. Usually, we scale a data service in one of two ways:
1) Data partitioning: split the data into blocks stored on different servers (e.g. uid % 16, consistent hashing, etc.).
2) Data mirroring: let every server hold the same data and provide equivalent service.
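The two partition-routing schemes mentioned above can be sketched in a few lines of Python. This is a minimal illustration, not production code: the server names, the 16-server setup, and the use of MD5 as the ring hash are all assumptions for the example.

```python
import hashlib
from bisect import bisect_right

SERVERS = [f"server-{i}" for i in range(16)]  # hypothetical cluster

def route_by_mod(uid: int) -> str:
    """Naive partitioning: uid % 16 picks the server."""
    return SERVERS[uid % 16]

def _hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    """Consistent hashing: a key maps to the first server clockwise on a ring.

    Each server is placed at many virtual points so keys spread evenly, and
    removing one server only remaps the keys that pointed at it.
    """
    def __init__(self, servers, replicas=100):
        self.ring = sorted((_hash(f"{s}#{i}"), s)
                           for s in servers for i in range(replicas))
        self.keys = [h for h, _ in self.ring]

    def route(self, key: str) -> str:
        idx = bisect_right(self.keys, _hash(key)) % len(self.ring)
        return self.ring[idx][1]
```

The mod scheme is simpler, but adding or removing a server remaps almost every key; the ring only remaps a fraction, which is why it is preferred when the cluster size changes.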
The first approach by itself cannot solve the data-loss problem: if a single server fails, part of the data is gone. So a highly available data service can only be built the second way, with redundant storage (the industry generally considers three replicas a safe number, as in Hadoop and Dynamo). Adding more machines, however, makes our data service much more complex, especially transaction processing across servers, that is, data consistency across servers. This is a hard problem. Let us use the most classic use case, "account A transfers money to account B", to explain. Anyone familiar with RDBMS transactions knows that moving money from A to B takes six operations:
1. Read the balance from account A.
2. Subtract the amount on account A.
3. Write the result back to account A.
4. Read the balance from account B.
5. Add the amount on account B.
6. Write the result back to account B.
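On a single machine, the six steps can be sketched directly. This is a minimal single-process sketch with made-up `Account` and `transfer` names; a real system would also acquire the two locks in a fixed global order to avoid deadlock between opposite transfers.

```python
import threading

class Account:
    def __init__(self, balance: int):
        self.balance = balance
        self.lock = threading.Lock()

def transfer(a: Account, b: Account, amount: int) -> bool:
    """Either all six steps happen or none do; A and B stay locked throughout."""
    with a.lock, b.lock:             # exclude other reads and writes
        balance_a = a.balance        # 1. read A's balance
        balance_a -= amount          # 2. subtract on A
        if balance_a < 0:
            return False             # abort: nothing has been written yet
        a.balance = balance_a        # 3. write the result back to A
        balance_b = b.balance        # 4. read B's balance
        balance_b += amount          # 5. add on B
        b.balance = balance_b        # 6. write the result back to B
        return True
```

The point of the example is the lock scope: both accounts are held for the whole sequence, which is exactly what becomes hard once A and B live on different servers.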
For data consistency, these six operations must either all succeed or all fail, and while they are in progress, any other access to accounts A and B must be locked out; the lock excludes other reads and writes, otherwise we get dirty data. That is a transaction. Now, once we add more machines, things get complicated:
1) In the data-partitioning scheme: what if account A and account B are not on the same server? Then we need a cross-machine transaction. In other words, if debiting A succeeds but crediting B fails, we have to roll back the operation on A. Across machines, this becomes much more complex.
2) In the data-mirroring scheme: a transfer between A and B can be done on one machine, but don't forget that multiple machines hold copies of account A and account B. What if there are two concurrent operations on account A (say, transfers to B and to C), and the two operations land on two different servers? In other words, with data mirroring, how do we guarantee that writes to the same data on different servers stay consistent and do not conflict?
At the same time, we also have to consider performance. If we ignored performance, guaranteeing transactions would not be hard: just let the system be slow. Besides performance, we also have to consider availability: if one machine dies, data must not be lost, and another machine must be able to keep providing the service. So, in short, we need to weigh the following:
1) Fault tolerance: data is not lost, and nodes can fail over.
2) Data consistency: transaction processing.
3) Performance: throughput and response time.
As said before, the only way to avoid losing data is data redundancy; even with data partitioning, each partition still needs redundant copies. These are data replicas: when one node loses data, it can be read back from a replica. Data replication is the only way a distributed system can recover from abnormal data loss. So, in this article, for simplicity, we only discuss consistency and performance problems under data redundancy. Simply put:
1) To make data highly available, you have to write multiple copies.
2) Writing multiple copies creates data-consistency problems.
3) Data-consistency problems in turn cause performance problems.
This is software development: push one problem down and another pops up.
On data consistency, simply put, there are three types (of course, if you subdivide, there are many consistency models, such as sequential consistency, FIFO consistency, session consistency, single-read consistency, single-write consistency; but to keep this article readable, I will only discuss the three below):
1) Weak consistency: after you write a new value, a read on a data replica may or may not see it. For example: some cache systems; other players' data in an online game that has nothing to do with you; VoIP systems; or the Baidu search engine (heh).
2) Eventual consistency: after you write a new value, reads may not see it at first, but after some time window the new value is guaranteed to be readable. For example: DNS, email, Amazon S3, Google's search engine.
3) Strong consistency: once new data is written, any replica returns the new value at any time. For example: file systems, RDBMSes, and Azure Table are all strongly consistent.
From these three consistency models we can see that Weak and Eventual are generally replicated asynchronously, while Strong is generally replicated synchronously. Asynchronous usually means better performance, but also more complex state control; synchronous means simplicity, but also degraded performance. OK, let's look at the techniques one by one:
Master-Slave
The first is the Master-Slave structure, in which the Slaves are generally backups of the Master. Such a system is generally designed as follows:
1) Read and write requests are handled by the Master.
2) After a write lands on the Master, the Master synchronizes it to the Slaves.
Synchronization from Master to Slave can be asynchronous or synchronous; the Master can push, or the Slaves can pull. Usually the Slaves pull periodically, so this is eventual consistency. The design problem is that if the Master crashes within a pull cycle, the data from that time slice is lost. If you don't want to lose that data, the Slaves can only serve in read-only mode while waiting for the Master to recover.
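The pull-based replication described above can be sketched as follows. This is a toy in-memory model (the `Master`, `Slave`, and log-based sync are illustrative assumptions, not any particular database's protocol); its point is to show the consistency window between a write and the next pull.

```python
class Master:
    def __init__(self):
        self.log = []                  # append-only write log

    def write(self, key, value):
        self.log.append((key, value))  # the write is durable on the Master only

class Slave:
    def __init__(self, master):
        self.master = master
        self.data = {}
        self.applied = 0               # how far into the Master's log we have pulled

    def pull(self):
        """Called periodically: apply every log entry we haven't seen yet."""
        for key, value in self.master.log[self.applied:]:
            self.data[key] = value
        self.applied = len(self.master.log)
```

Between a `write` on the Master and the Slave's next `pull`, a read on the Slave misses the new value, and a Master crash in that window loses it: that window is exactly the eventual-consistency gap.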
Of course, if you can tolerate losing that data, you can immediately promote a Slave to take over the Master's work (for nodes that only do computation there is no consistency or data-loss problem, so Master-Slave fully solves the single point of failure). Master-Slave can also be made strongly consistent. For example: when we write to the Master, the Master first writes locally and then, on success, writes to a Slave; only after both succeed does it return success. The whole process is synchronous. If the write to the Slave fails, there are two options: one is to mark that Slave unavailable, report the error, and keep serving (the Slave re-syncs the Master's data after recovery; you can have multiple Slaves, so losing one still leaves backups, like the three replicas mentioned earlier); the other is to roll back locally and return write failure. (Note: it is usually Master first, not Slave first, because if the Master's own write fails after the Slave succeeded, you have to roll back the Slave, and if that rollback fails too, the data must be corrected by hand.) You can see that making Master-Slave strongly consistent gets quite complex.
Master-Master
Master-Master, also called Multi-Master, refers to a system with two or more Masters, each providing read-write service. This model is an enhanced Master-Slave: data synchronization between the Masters is generally done asynchronously, so it is eventually consistent. The benefit of Master-Master is that when one Master dies, the other Masters can still serve reads and writes normally; like Master-Slave, data that has not yet been replicated to the other Masters will be lost. Many databases support a Master-Master replication mechanism.
In addition, if multiple Masters modify the same data, this model runs into its nightmare: merging the conflicting data, which is not an easy thing. Look at the design of Dynamo's Vector Clock (which records each datum's version and its modifiers) and you will see it is not so simple; moreover, Dynamo pushes data-conflict resolution onto the user. Just like an SVN source conflict: a single conflicting line of code can only be handled by the developer. (Dynamo's Vector Clock is discussed later in this article.)
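The core of the vector-clock idea can be shown in a few lines. This is a generic sketch of the technique, not Dynamo's actual implementation: clocks are dicts mapping a node name to a counter, and two versions conflict exactly when neither clock dominates the other.

```python
def vc_merge(a: dict, b: dict) -> dict:
    """Element-wise max: the clock of a version that has seen both histories."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in a.keys() | b.keys()}

def vc_compare(a: dict, b: dict) -> str:
    """Return 'before', 'after', 'equal', or 'concurrent'.

    'concurrent' means neither version has seen the other's writes:
    that is the conflict Dynamo hands back to the user to resolve.
    """
    nodes = a.keys() | b.keys()
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"
```

If Master A and Master B each modify the same key independently, their clocks become incomparable and `vc_compare` reports "concurrent"; the system can detect the conflict, but only the application (or user) knows how to merge the two values, which is the point the paragraph above makes.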
Two/Three Phase Commit
This protocol is abbreviated 2PC; in Chinese it is called two-phase commit. In a distributed system, although each node knows whether its own operation succeeded or failed, it cannot know whether the operations on the other nodes succeeded or failed. When a transaction spans multiple nodes, to preserve the transaction's ACID properties, we need to introduce a component that acts as coordinator, collecting the operation results of all the nodes (called participants) and finally telling the nodes whether to actually commit their results (for example, flush the updated data to disk). The two-phase commit algorithm is as follows:
Phase one:
1. The coordinator asks all participant nodes whether the commit operation can be performed.
2. Each participant begins preparing to execute the transaction, e.g. locking resources, reserving resources, writing undo/redo logs...
3. The participants report their preparation results to the coordinator: if the preparation succeeded, they respond "can commit", otherwise they respond "refuse to commit".
Phase two:
1. If all participants responded "can commit", the coordinator sends a "formally commit" command to all participants. The participants complete the formal commit, release all their resources, and respond "done"; the coordinator collects the "done" responses from the nodes and completes the Global Transaction.
2. If any participant responded "refuse to commit", the coordinator sends "rollback" to all participants; they release all their resources and respond "rolled back", and the coordinator collects the "rolled back" responses from the nodes and cancels the Global Transaction.
We can see that what 2PC does is: gather information in the first phase, make the decision in the second; and we can also see that 2PC is a strong-consistency algorithm. The strongly consistent Master-Slave strategy we discussed earlier is somewhat similar to 2PC, except that 2PC is more conservative: try first, commit later. 2PC is used quite widely. Some system designs involve a chain of calls, such as A -> B -> C -> D, where each step allocates some resources or rewrites some data. For instance, placing an order on a B2C shopping site triggers a whole series of background processes. If we did them one step at a time, then whenever some step could not be completed, every preceding step would have to be reversed to reclaim the resources it allocated, which is complicated. Many workflow engines now borrow the 2PC algorithm, using a try -> confirm flow to ensure the whole process completes successfully. A popular example is a Western church wedding, which follows this pattern:
1) The pastor asks the bride and the groom in turn: do you take... for richer, for poorer... (the inquiry phase)
2) When both the bride and the groom have answered "I do" (their life resources are locked), the pastor says: I now pronounce you... (the transaction commits)
What a classic two-phase commit transaction. We can also see some of its problems: A) it is a synchronous blocking operation, which greatly hurts performance; B) the other major problem is timeouts. For instance:
1) In the first phase, if a participant never receives the inquiry, or a participant's response never reaches the coordinator, the coordinator has to handle the timeout; once it times out, it can treat the vote as a failure, or retry.
2) In the second phase, after the formal commit is issued, if some participant does not receive it, or a participant's commit/rollback confirmation never returns, then once that participant's response times out, the coordinator either retries or marks that participant as a problem node and removes it from the cluster, so that the serving nodes stay data-consistent.
3) The worst case is in the second phase: if a participant never receives the coordinator's commit/rollback instruction, it is left in an "unknown" state and does not know what to do. For example: suppose all participants have finished replying in the first phase (maybe all yes, maybe all no, maybe part yes and part no), and the coordinator hangs at that moment. Then none of the nodes know what to do (asking the other participants does not help either). For consistency, they either wait for the coordinator to recover, or resend the first p