Explaining Distributed Data Storage
It's the question that drives us, Neo.
Why are there so many AWS data storage options? Which one should I use? These are common customer questions. In this three-part blog series, I will try to provide some clarity. The first part covers the basics of high availability and why redundancy is the most common way to achieve it; it also briefly mentions that redundancy at the data layer brings a new set of problems. This second part discusses some of those problems and the trade-offs you need to consider when overcoming them. The third part builds on that information to discuss specific AWS data storage options and which workloads each option is optimized for. After reading all three parts of this series, you will agree that AWS offers a rich set of data storage products, and you will know how to choose the right one for the right workload.
What's the problem with relational databases?
As many of you may already know, relational database (RDB) technology has existed since the 1970s and remained the de facto standard for structured storage until the late 1990s. RDBs have excelled at supporting strongly consistent transactional workloads for decades, and still do. Over time, this venerable technology responded to customer demand with new capabilities such as BLOB storage, XML/document storage, full-text search, code execution inside the database, star-schema data warehousing, and geospatial extensions. As long as everything fit into a relational data structure definition and onto a single machine, it could be implemented in a relational database.
Then the commercialization of the Internet happened, changed everything, and left relational databases unable to satisfy every storage need. Alongside consistency, availability, performance, and scale became just as important, and sometimes more important.
Performance had always been important, but what changed with the commercialization of the Internet was scale. It turned out that the skills and techniques used to scale performance in the pre-Internet era were no longer adequate. Relational databases are built around ACID (Atomicity, Consistency, Isolation, and Durability), and the simplest way to implement ACID is to keep everything on a single machine. The traditional way to scale an RDB is therefore vertical scaling (scale up) or, in plain language, using a bigger machine.
Uh-oh, I think I need a bigger machine
The bigger-machine approach worked well until the Internet brought loads that no single machine could handle. This forced engineers to come up with clever techniques to overcome single-machine limits. There are many different approaches, each with its own advantages and disadvantages: primary/replica (master/slave) setups, clustering, table federation and partitioning, and sharding (which can be thought of as a special case of partitioning).
Another factor in the rise of data storage options is availability. In pre-Internet systems, users usually came from within the organization, planned downtime could likely be scheduled during off-hours, and even an unplanned outage had only limited impact. The commercialization of the Internet changed this too: everyone with Internet access is now a potential user, so an unplanned outage can have a much greater impact, and the global nature of the Internet makes it difficult to identify non-working hours and schedule planned downtime.
In the first part of this blog series, I discussed the role redundancy plays in achieving high availability. However, when applied to the data storage layer, redundancy brings a set of interesting new challenges. The most common way to apply redundancy at the database layer is a primary/replica configuration.
This deceptively simple setup differs enormously from the traditional single relational database: we now have multiple machines separated by a network. When a database write occurs, we have to decide when to consider it complete: as soon as it reaches the primary, or only once it is saved on the replica (or on n replicas, if we want even higher availability; for the impact each additional machine has on availability, see the first part of this series). If we decide that reaching the primary is enough, we accept the risk of losing data if the primary fails before the data is replicated. If we decide to wait until replication completes, we accept the price of latency. In the rare case that a replica goes down, we need to decide whether to keep accepting write requests or reject them.
We have therefore moved from a world where consistency is the default into a world where consistency is a choice. In this world we can choose to accept so-called eventual consistency, in which state is replicated across multiple nodes, but not every node has a complete view of the whole state. In the configuration above, if we choose to consider a write complete once it reaches the primary (or the primary and any one replica, but not necessarily both replicas), then we have chosen eventual consistency. Eventually, every write will be replicated to every database. But if at any given moment we query one of the databases, we cannot guarantee that it contains all the writes made up to that moment.
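To make eventual consistency concrete, here is a minimal Python sketch of the primary/replica setup described above, with asynchronous replication. The class and key names are purely illustrative (this is not any real database client API): a write is acknowledged as soon as the primary has it, and a replica read in the window before replication returns stale (here, missing) data.

```python
class Primary:
    """Toy primary node: acknowledges writes immediately, replicates later."""
    def __init__(self, replicas):
        self.data = {}
        self.replicas = replicas
        self.pending = []            # writes not yet pushed to replicas

    def write(self, key, value):
        self.data[key] = value       # acknowledged as soon as the primary has it
        self.pending.append((key, value))
        return "ack"

    def replicate(self):
        """Background replication: push pending writes to every replica."""
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


class Replica:
    """Toy replica node: serves reads from whatever state it has so far."""
    def __init__(self):
        self.data = {}

    def read(self, key):
        return self.data.get(key)    # may return a stale or missing value


replica = Replica()
primary = Primary([replica])

primary.write("cart:42", "3 items")
print(replica.read("cart:42"))   # None -- the write has not propagated yet
primary.replicate()              # eventually, replication catches up
print(replica.read("cart:42"))   # "3 items" -- every node converges
```

The same skeleton with `write()` blocking until `replicate()` finishes would model synchronous replication: no stale reads, but every write pays the replication latency.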
Enter the CAP theorem
To summarize: when data is replicated and partitioned, the state of the system is spread out. This means we leave the comfortable realm of ACID and enter the brave new world of CAP. The CAP theorem was formulated by Dr. Eric Brewer of the University of California, Berkeley in 2000. In its simplest form it states: a distributed system must make trade-offs between Consistency, Availability, and Partition tolerance, and can only achieve two of the three.
The CAP theorem expanded the discussion of data storage beyond the scope of ACID and inspired the creation of many non-relational database technologies. About a decade after proposing the CAP theorem, Dr. Brewer published a clarification explaining that his original "two out of three" formulation was greatly simplified, intended to open up the discussion and help move beyond ACID. That simplification, however, caused numerous misinterpretations and misunderstandings. In the more nuanced interpretation of CAP, all three dimensions should be understood as ranges rather than Boolean values. Furthermore, a distributed system operates in non-partitioned mode most of the time, and in that mode the trade-off to be made is between consistency and performance/latency. Only in the rare case of a network partition must the system choose between consistency and availability.
Returning to our primary/replica example: if we choose to consider a write complete only once the data has been replicated everywhere (known as synchronous replication), we have chosen consistency at the expense of write latency. If, on the other hand, we choose to consider a write complete as soon as it reaches the primary and let replication happen in the background (known as asynchronous replication), we have sacrificed consistency for performance.
When a network partition forces a distributed system into partitioned mode, the trade-off is between consistency and availability. Back to our example: the replicas, after losing their connection to the primary, may continue to serve queries, sacrificing consistency for availability. Alternatively, we can decide that a primary that has lost its connection to the replicas should stop accepting write requests, choosing consistency at the expense of availability. In the age of Internet commerce, choosing consistency usually means losing revenue, so many systems choose availability. In that case, when the system returns to normal, it can enter a recovery mode in which all the accumulated inconsistencies are resolved and replicated.
While we are on the subject of recovery mode, a word about the so-called master-master (or active-active) distributed data storage configuration. In this setup, writes can be sent to multiple nodes, which then replicate to each other. In such a system, even the normal mode becomes complicated: if two updates to the same data arrive at roughly the same time on two different master nodes, how do we reconcile them? Worse still, if such a system has to recover from a partitioned state, things get really messy. Although workable master-master configurations exist, and some products make them easier, my advice is to avoid them unless absolutely necessary. There are many ways to achieve a good balance of performance and availability without shouldering the high complexity cost of a master-master configuration.
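One common (though lossy) strategy for reconciling the concurrent updates described above is "last write wins": of all conflicting versions of a record, keep the one with the latest timestamp. The record format and values below are made up for illustration; this is a sketch of the idea, not any product's reconciliation logic.

```python
def last_write_wins(versions):
    """Reconcile concurrent updates to one key: keep the value with the
    latest timestamp. Simple, but it silently discards the losing writes."""
    return max(versions, key=lambda v: v["ts"])["value"]


# Two masters accepted updates to the same key at almost the same time.
conflicting = [
    {"value": "shipped",   "ts": 1710000000.120},  # accepted by master A
    {"value": "cancelled", "ts": 1710000000.250},  # accepted by master B
]
print(last_write_wins(conflicting))  # cancelled -- master B's later write survives
```

Note the hidden cost: master A's write is dropped without any error, and the outcome depends on clocks being synchronized across nodes. This is exactly the kind of subtlety that makes master-master setups expensive to operate.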
A common pattern in many modern data stores
A common way to provide a good combination of scale/performance and availability is to compose partitioning and replication into a single configuration (or pattern). This is sometimes referred to as a partitioned replica set.
Whether it is a Hadoop, Cassandra, or MongoDB cluster, all of these essentially follow this pattern, as do many AWS data services. Let's look at some common characteristics of a partitioned replica set:
- Data is partitioned across multiple nodes (or multiple clusters of nodes); no single partition holds all of the data. A single write is sent to exactly one partition. Multiple writes may be sent to multiple partitions, and should therefore be independent of one another. Complex, transactional, multi-record writes (which may thus involve multiple partitions) should be avoided, because they can affect the whole system.
- The maximum volume a single partition can handle is a potential bottleneck. If a partition hits its throughput limit, adding more partitions and splitting the traffic across them helps solve the problem. This type of system can therefore be scaled by adding more partitions.
- A partition key is used to assign data to each partition. The partition key must be chosen carefully so that reads and writes are "spread" as evenly as possible across all partitions. If reads/writes cluster together, they may exceed the throughput of a single partition and degrade the performance of the whole system, while the other partitions sit underutilized. This is known as the "hot partition" problem.
- Data is replicated across multiple hosts. Each partition can be a completely separate replica set, or multiple replica sets can share the same set of hosts. The number of times a piece of data is replicated is commonly referred to as the replication factor.
- This configuration has built-in high availability: data is replicated to multiple hosts. In theory, as long as the number of failed hosts is smaller than the replication factor, the availability of the whole system is unaffected.
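The partition-key and hot-partition points above can be sketched in a few lines of Python. Assignment by hashing the key is one common scheme (the key names and partition count are made up for illustration, not taken from any particular product):

```python
import hashlib
from collections import Counter

NUM_PARTITIONS = 8

def partition_for(key: str) -> int:
    """Route a key to a partition by hashing. Keys spread evenly across
    partitions as long as the keys themselves are well distributed."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

# A high-cardinality key (e.g. a user ID) spreads writes across partitions...
good = Counter(partition_for(f"user-{i}") for i in range(10_000))

# ...while a low-cardinality key (e.g. a country code) concentrates them,
# creating "hot partitions" that bottleneck the whole system.
bad = Counter(partition_for(country) for country in ["US"] * 9_000 + ["DE"] * 1_000)

print(sorted(good.values()))  # roughly even counts across all 8 partitions
print(sorted(bad.values()))   # all 10,000 writes land on at most 2 partitions
```

With the user-ID key, every partition handles a similar share of the traffic; with the country-code key, nearly all traffic lands on one partition, no matter how many partitions exist, which is why adding partitions alone cannot fix a badly chosen key.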
All these benefits, the built-in scalability and high availability, come at a price: this is no longer your Swiss-army-knife single relational database management system (RDBMS). It is a complex system with many moving parts and many parameters that need fine-tuning. Expertise is required to set up, configure, and maintain these systems. In addition, monitoring and alerting infrastructure is needed to keep them running properly. You can certainly do all this yourself, but it is not easy, and you may not get it right quickly.
To give our customers highly scalable and highly available data storage without the administrative overhead, AWS provides a variety of managed data/storage services. Because there are many different optimization goals, there is no single magical data store, but rather a set of services, each optimized for a particular workload. In the next blog post, I will cover the AWS data storage options and discuss what each service is (and is not) optimized for.
A rich choice of data stores, even though it makes choosing harder, is a good thing. We need to move beyond the traditional mindset of a single data store for the whole system, and embrace a mindset in which a system uses a variety of data stores, each serving the workload it is best suited for. For example, we might use a combination of:
- A high-performance ingestion queue to receive the incoming clickstream
- A Hadoop-based clickstream processing system
- Cloud-based object storage for low-cost, long-term storage of compressed daily clickstream files
- A relational database holding metadata that we can use to enrich the clickstream data
- A data warehouse cluster for analytics
- A search cluster for natural-language queries
All of the above can be parts of a single system, such as a site analytics platform.
Conclusion
1. The commercialization of the Internet raised the demands on scale and availability, and the RDBMS Swiss army knife can no longer meet all of them.
2. Scaling out and adding redundancy at the data storage layer increase system complexity, make ACID harder to guarantee, and force us to consider trade-offs along the lines of the CAP theorem, creating many interesting opportunities for optimization and specialization.
3. Use multiple data stores in a system, each serving the workload it is best suited for.
4. Modern data stores are complex systems that require specialized knowledge and management overhead. With AWS, you can enjoy the benefits of specialized data stores without that cost.