An In-Depth Analysis of Distributed Algorithms in NoSQL Databases
System scalability is the main driving force behind the NoSQL movement, which covers distributed coordination, failover, resource management, and many other capabilities. Put that way, NoSQL sounds like a big basket into which anything can be tossed. Although the NoSQL movement has not brought radically new technology to distributed data processing, it has triggered a wave of research and practice around various protocols and algorithms, and through these attempts some effective methods for building distributed databases have gradually been distilled. In this article, I will give a systematic description of the distributed features of NoSQL databases.
We will study a number of distributed strategies, such as replication and fault detection. These strategies, marked in boldface below, are divided into three sections:
; Data consistency. A distributed NoSQL system must trade off consistency, fault tolerance, and performance, as well as low latency against high availability; in general, consistency is the property being traded, so this section focuses mainly on data replication and data recovery.
; Data placement. A database product should be able to cope with different data distributions, cluster topologies, and hardware configurations. In this section we will discuss how to distribute data, and how to adjust its distribution, so that faults can be handled in time, persistence is guaranteed, queries are served efficiently, and cluster resources (such as memory and disk space) are used evenly.
; Peer-to-peer systems. Techniques such as leader election are used in more than one database product to achieve strong fault tolerance and data consistency. However, even decentralized (leaderless) databases have to track their global state and detect failures and topology changes. This section introduces several techniques for keeping the system in a consistent state.
As is well known, distributed systems frequently experience network partitions and delays; in such cases the isolated part becomes unavailable, so maintaining high availability without sacrificing consistency is impossible. This fact is commonly referred to as the CAP theorem. Consistency in a distributed system is a very expensive thing, however, so concessions often have to be made on it, and not only in favor of availability; there are many kinds of tradeoffs. To study these tradeoffs, we observe that consistency problems in distributed systems are caused by partitions and data replication, so we start from the characteristics of replication:
; Availability. Reads and writes should still succeed in the face of network partitions.
; Read/write latency. Read and write requests should complete in a short time.
; Read/write scalability. Read and write load can be shared and balanced across multiple nodes.
; Fault tolerance. The processing of read and write requests should not depend on any single node.
; Data persistence. Node failures, under certain conditions, should not result in data loss.
; Consistency. Consistency is much more complicated than the preceding features, and we need to examine it from several different angles. We will not go deep into consistency theory and concurrency models, since that is beyond the scope of this article; instead we will use a lean system of simple characteristics.
; Read/write consistency. From the reading and writing perspective, the basic goal is to make the replica convergence time (i.e., the time it takes an update to reach all replicas) as short as possible, guaranteeing eventual consistency. Beyond that weak guarantee, there are stronger consistency properties:
; Read-after-write consistency. The effect of a write on data item X is always visible to subsequent reads of X.
; Read-after-read consistency. After one read of data item X, subsequent reads of X should return the same value or a more recent one.
; Write-write consistency. Write conflicts often happen in partitioned databases. The database should be able to handle such conflicts and guarantee that multiple write requests are not processed by different partitions. Databases provide several different consistency models here:
; Atomic writes. If the database API allows a write to be only a single atomic assignment of a value, write conflicts can be avoided by determining the latest version of each data item. This lets all nodes end up with the same version at the end of the update sequence, regardless of the order in which updates arrive; network failures and delays often cause nodes to see updates in different orders. Data versions can be expressed with timestamps or a user-specified value. This is the approach Cassandra takes.
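As an illustration, here is a minimal last-write-wins merge in Python. The function name `lww_merge` and the `(value, timestamp)` pair representation are illustrative assumptions, not Cassandra's actual API; the point is only that the merge is commutative, so replicas converge regardless of delivery order.

```python
def lww_merge(local, incoming):
    """Return the winning (value, timestamp) pair under last-write-wins.

    Each write carries a version timestamp; the copy with the higher
    timestamp wins, regardless of the order in which updates arrive.
    """
    return incoming if incoming[1] > local[1] else local

# Two replicas receive the same pair of updates in opposite order...
a, b = ("v1", 100), ("v2", 200)
replica1 = lww_merge(lww_merge(("", 0), a), b)
replica2 = lww_merge(lww_merge(("", 0), b), a)
assert replica1 == replica2 == ("v2", 200)  # ...and still converge
```

Because `lww_merge` depends only on the timestamps, applying the same set of updates in any order yields the same final state, which is exactly what the atomic-write model requires.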
; Atomic read-modify-write. Applications sometimes need a read-modify-write sequence rather than a standalone atomic write. If two clients read the same version of a data item, modify it, and write the modified data back, under the atomic-write model the later update simply overwrites the earlier one. This behavior is incorrect in some cases (for example, both clients append a new element to the same list value). Databases offer at least two solutions:
; Conflict prevention. Read-modify-write can be viewed as a special case of a transaction, so distributed locking or consensus protocols such as PAXOS can solve the problem. This supports atomic read-modify-write semantics and arbitrary transaction isolation levels. Another approach is to avoid distributed concurrent writes altogether by routing all writes for a given data item to a single node (either a global master or a per-partition master). To avoid conflicts, the database must then sacrifice availability under network partition. This approach is used in many systems that provide strong consistency guarantees (e.g., most relational databases, HBase, MongoDB).
; Conflict detection. The database tracks conflicting concurrent updates and either rolls one back or reports both versions to the client. Concurrent updates are usually tracked with vector clocks (a form of optimistic locking) or by maintaining the complete version history. This approach is used by systems such as Riak and Voldemort.
Now let's take a closer look at the commonly used replication techniques and classify them according to the characteristics just described. The first figure depicts the logical relationships between the techniques and their coordinates in the tradeoff space of consistency, scalability, availability, and latency. The second figure depicts each technique in detail.
(Figure: a replication factor of 4 is assumed; the read/write coordinator can be an external client or an internal proxy node.)
We will walk through all of these techniques, ordered from weakest to strongest consistency:
(A, anti-entropy) The weakest consistency, based on the following strategy: a write updates any single available node; at read time, if the new data has not yet reached the node being read via the background anti-entropy protocol, the read still returns stale data. (The next section covers anti-entropy protocols in detail.) The main characteristics of this approach are:
; High propagation latency makes it of limited use for data synchronization on its own, so the more typical usage is as an auxiliary mechanism that detects and repairs unplanned inconsistencies. Cassandra uses an anti-entropy algorithm to propagate database topology and other metadata between nodes.
; Weak consistency guarantees: write conflicts and read-write inconsistencies can occur even in the absence of failures.
; High availability and robustness under network partition. Batched asynchronous updates replace per-update propagation, which makes performance excellent.
; Weak persistence guarantees, because new data initially exists on only a single replica.
(B) An improvement on the model above: the node that receives an update asynchronously sends it to all available nodes. This can be thought of as targeted anti-entropy.
; Compared with pure anti-entropy, this greatly improves consistency at only a small cost in performance. Formally, however, the consistency and persistence guarantees remain the same.
; If some nodes are unavailable at the time due to network or node failures, the update will eventually reach them through the anti-entropy propagation process.
(C) In the previous model, node failures can be handled better with the hinted handoff technique. Updates intended for a failed node are recorded on an additional proxy node, together with a hint that they should be delivered once the target node becomes available. This improves consistency and shortens the replication convergence time.
(D, read-one-write-one) The node holding hints may itself fail before handing the updates off, in which case so-called read repair is needed to ensure consistency. Each read triggers an asynchronous process that requests a digest of the data (such as a signature or hash) from every node that stores it; if the digests returned by the nodes disagree, the inconsistent copies are repaired. We use the name read-one-write-one for the combination of techniques A, B, C, and D: none of them provide strict consistency guarantees, but as a best-effort approach they are already usable in practice.
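A minimal sketch of read repair, under simplifying assumptions of my own (replicas are plain dicts of key to `(value, version)`, the newest version wins as in the atomic-write model, and the repair runs inline rather than in the background):

```python
import hashlib

def digest(value):
    """Cheap stand-in for the signature/hash a replica would return."""
    return hashlib.sha256(repr(value).encode()).hexdigest()

def read_with_repair(replicas, key):
    """Answer the client from one replica, then compare digests of all
    replicas and overwrite stale copies with the newest version."""
    result = replicas[0].get(key)          # the "read one" part
    digests = {digest(r.get(key)) for r in replicas}
    if len(digests) > 1:                   # replicas disagree
        newest = max((r[key] for r in replicas if key in r),
                     key=lambda vv: vv[1])
        for r in replicas:
            r[key] = newest                # repair stale replicas
    return result

r1, r2, r3 = {"x": ("old", 1)}, {"x": ("new", 2)}, {"x": ("old", 1)}
read_with_repair([r1, r2, r3], "x")
assert r1["x"] == r2["x"] == r3["x"] == ("new", 2)
```

Note that the client in this sketch may still receive the stale value; read repair only shortens convergence time for subsequent reads, which is why technique D remains a best-effort approach.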
(E, read-quorum-write-quorum) The strategies above are heuristics for shortening replication convergence time. To guarantee stronger consistency, availability must be sacrificed to guarantee a certain overlap between reads and writes: write W replicas at a time instead of one, and read from R replicas.
; First, this lets you manage write durability by configuring W > 1.
; Second, because R + W > N, the set of nodes written and the set of nodes read are bound to overlap, so a read of multiple replicas will see at least one relatively fresh copy (in the figure above, W = 2, R = 3, N = 4). This guarantees consistency for a single user issuing reads and writes in sequence (read-after-write consistency), but not global read-after-read consistency. In the example in the figure below, with R = 2, W = 2, N = 3, because the write updating two replicas is not a transactional operation, a read that arrives before the write completes may observe two old values, two new values, or a mix of both.
; Third, different settings of R and W allow trading read latency against write latency and durability, and vice versa.
; If W <= N/2, concurrent writes can land on disjoint sets of nodes (e.g., one write updates the first N/2 nodes and another the last N/2). Setting W > N/2 guarantees that such conflicts are detected in time under the atomic read-modify-write model with rollbacks.
; Strictly speaking, this model can tolerate the failure of individual nodes, but its fault tolerance under network partition is poor. In practice, a "sloppy quorum" approach is often used, sacrificing consistency to increase availability in certain situations.
(F, read-all-write-quorum) Read consistency problems can be alleviated by reading all replicas of a data item (either the data itself or its digests). This guarantees that a new value is visible to readers as long as it exists on at least one node. Under network partition, however, this guarantee no longer holds.
(G, master-slave) The techniques above are often used to provide consistency at the atomic-write or read-modify-write-with-conflict-detection level. To achieve the conflict-prevention level, some form of centralization or locking must be used. The simplest strategy is asynchronous master-slave replication: all writes for a particular data item are routed to a central node and executed there in order. The master then becomes a bottleneck, so the data must be split into independent partitions (each with its own master) to provide scalability.
(H, transactional read-quorum-write-quorum and read-one-write-all) Quorum updates of multiple replicas can avoid write conflicts by means of transactional techniques. The well-known approach is the two-phase commit protocol, but two-phase commit is not completely reliable, because a coordinator failure can leave resources blocked. The PAXOS commit protocol is a more reliable choice, at some cost in performance. A small further step in this direction is read-one-write-all, which wraps the update of all replicas in one transaction; it provides strongly fault-tolerant consistency but loses some performance and availability.
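The blocking weakness of two-phase commit is easiest to see in a toy coordinator. The sketch below is a deliberately simplified model (class and method names are mine; real participants vote based on local constraints and log their state durably):

```python
class Participant:
    """Toy participant in a two-phase commit across all replicas."""
    def __init__(self):
        self.value, self.staged = None, None

    def prepare(self, value):   # phase 1: stage the update, vote yes/no
        self.staged = value
        return True             # this toy always votes yes

    def commit(self):           # phase 2a: make the staged update durable
        self.value, self.staged = self.staged, None

    def abort(self):            # phase 2b: discard the staged update
        self.staged = None

def two_phase_commit(participants, value):
    """Commit only if every participant votes yes in phase 1.

    If the coordinator dies between the phases, participants are left
    blocked with staged data -- the flaw that motivates PAXOS commit.
    """
    if all(p.prepare(value) for p in participants):
        for p in participants:
            p.commit()
        return True
    for p in participants:
        p.abort()
    return False

ps = [Participant(), Participant(), Participant()]
assert two_phase_commit(ps, "v1")
assert all(p.value == "v1" for p in ps)
```

Between `prepare` and `commit`, every participant holds resources on behalf of a coordinator that may never return; PAXOS commit replaces the single coordinator with a replicated one to remove that single point of blocking.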
Several of the tradeoffs in the analysis above are worth restating:
; Consistency vs. availability. The tight tradeoff here is given by the CAP theorem. Under network partition, the database must either isolate part of the data set or accept the risk of inconsistent reads and writes.
; Consistency vs. scalability. Even read/write consistency guarantees reduce the scalability of the replica set; only in the atomic-write model can writes scale in a relatively efficient way. The atomic read-modify-write model avoids conflicts by pinning data to a node or taking short-lived global locks. This shows that dependencies between data items or operations, even over a small scope or a very short time, can damage scalability. Careful data-model design, with shardable data stored separately, is therefore very important for scalability.
; Consistency vs. latency. As noted above, when a database needs to provide strong consistency or persistence, read-all/write-all replication should be preferred. But consistency is inversely proportional to request latency, so quorum techniques are a more even-handed choice.
; Failover vs. consistency/scalability/latency. Interestingly, the tradeoff between fault tolerance and consistency, scalability, and latency is not a severe one. By reasonably giving up some performance and consistency, a cluster can tolerate a large number of node failures. This kind of compromise is very evident in the difference between the two-phase commit and PAXOS protocols. Another example of it is adding specific consistency guarantees, such as strict-session "read your writes", which also increases the complexity of failover.
Anti-Entropy Protocols and Gossip Algorithms
Let's start with the following scenario:
There are many nodes, and each data entity is replicated across a number of them. Each node handles update requests independently, and each node regularly synchronizes its state with other nodes, so that after a period of time all replicas converge. How does the synchronization proceed? When does it start? How is a synchronization partner chosen? How is data exchanged? We assume the two nodes always resolve differences by overwriting the older data with the newer version, or that both versions are kept for the application layer to reconcile.
This problem arises both in data-consistency maintenance and in cluster state synchronization (such as propagating cluster membership information). Introducing a database monitor that coordinates synchronization on a schedule could solve the problem, but decentralized databases provide better fault tolerance. The main decentralized approach is to use a carefully designed epidemic protocol; such protocols are relatively simple, yet provide good convergence time and tolerate arbitrary node failures and network partitions. Although there are many kinds of epidemic algorithms, we focus on anti-entropy protocols, because NoSQL databases use them.
Anti-entropy protocols assume that synchronization is performed on a fixed schedule: each node periodically picks another node, either at random or by some rule, and exchanges data with it to eliminate the differences. There are three styles of anti-entropy protocol: push, pull, and hybrid. The principle of the push protocol is simple: select a random peer and push the local data to it. Pushing all of the data in a real application would obviously be foolish, so nodes generally work as shown in the figure.
Node A, the initiator of the synchronization, prepares a data digest containing fingerprints of its data. Node B receives the digest, compares it with its local data, and returns to A a digest of the differences. Finally, A sends the updates to B, and B applies them. The pull and hybrid protocols work similarly, as shown above.
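The three-step push exchange just described can be sketched as follows. The representation (nodes as dicts of key to `(value, version)`, SHA-256 as the fingerprint, higher version wins) is an assumption for illustration, not any particular database's wire format:

```python
import hashlib

def fingerprint(value):
    return hashlib.sha256(repr(value).encode()).hexdigest()

def push_round(node_a, node_b):
    """One push-style anti-entropy exchange, as in the figure."""
    # step 1: A -> B, a digest of fingerprints, not the data itself
    digest = {k: fingerprint(v) for k, v in node_a.items()}
    # step 2: B -> A, the keys B is missing or holds a different copy of
    diff = [k for k, fp in digest.items()
            if k not in node_b or fingerprint(node_b[k]) != fp]
    # step 3: A -> B, only the differing updates; newer version wins
    for k in diff:
        if k not in node_b or node_a[k][1] > node_b[k][1]:
            node_b[k] = node_a[k]

a = {"x": ("new", 2), "y": ("val", 1)}
b = {"x": ("old", 1)}
push_round(a, b)
assert b == {"x": ("new", 2), "y": ("val", 1)}
```

The digest in step 1 is what keeps the exchange cheap: full values cross the wire only for keys that actually differ.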
Anti-entropy protocols provide reasonably good convergence time and scalability. The figure below shows simulated results of propagating one update through a 100-node cluster. In each iteration, each node contacts one randomly selected peer.
As you can see, pull converges better than push, and this can be proved theoretically. Push also has a convergence "tail" problem: after many iterations, although almost all nodes have been reached, a few remain unaffected. The hybrid approach is more efficient than pure push or pull, so it is the one usually used in practice. Anti-entropy is scalable because the average convergence time grows as a logarithmic function of the cluster size.
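A simulation along the lines of the one in the figure is easy to write. This sketch (function name and fixed seed are mine) models push gossip only, and the logarithmic growth claim shows up as a loose bound on the round count:

```python
import math
import random

def rounds_to_converge(n, seed=0):
    """Simulate push gossip: each informed node pushes the update to
    one random peer per round; count rounds until all n nodes know."""
    random.seed(seed)
    informed = {0}                 # node 0 starts with the update
    rounds = 0
    while len(informed) < n:
        rounds += 1
        for _ in list(informed):
            informed.add(random.randrange(n))
        # a push may hit an already-informed peer; that wasted effort
        # is what produces the convergence "tail" described above
    return rounds

r = rounds_to_converge(100)
# the informed set can at most double per round, so log2(n) is a floor;
# in practice the total stays within a small multiple of log(n)
assert math.ceil(math.log2(100)) <= r <= 10 * math.ceil(math.log2(100))
```

Doubling the cluster size adds only a few rounds, which is the scalability property claimed above.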
Although the technique seems very simple, much research still focuses on the performance of anti-entropy protocols under different constraints. Some of it replaces random peer selection with a more efficient scheme that exploits the network topology. Some adjusts the transfer rate, or uses advanced rules to select which data to synchronize, when network bandwidth is limited. Digest computation also poses challenges, so databases maintain a log of recent updates to aid in computing digests.
Eventually Consistent Data Types
In the previous section we assumed that two nodes can always merge their data versions. But resolving update conflicts is not easy; making all replicas eventually converge to a semantically correct value is surprisingly difficult. A well-known example is that deleted items can resurface in Amazon's Dynamo database.