Distributed computing platform based on Mesos and
In view of the "Internet +" in the era of business growth, the demand change speed and mass calculation, cheap, highly scalable distributed x86 cluster has become the standard solution, such as Google distributed system already deployed in the tens of millions of servers.Docker and its related technology and development, and give a large-scale cluster management brings new imagination space.How to effectively combine both?This article will introduce several people of science and technology based on the practice of Mesos and Docker distributed computing platform.
Distributed system design criteria
Distributed system must be first large-scale system, has a good Scalability.For reasons of cost, many large-scale distributed systems generally use cheap PC server, rather than a large high performance server.
No single point of failure
Cheap PC server use often encounter all sorts of problems in large-scale, PC server hardware is unlikely to be highly reliable, such as Google's data center every day there are a large number of hard disk failure, so must the hardware fault tolerance, distributed system to ensure there is no single point of failure.In this very unstable, very unreliable hardware computing environment, build a distributed system to provide highly reliable service, must through the software fault tolerance.Distributed system for there is no single point of failure requires two aspects of design consideration, is a kind of service class enterprise applications, each service instance background to have multiple copies of a two hardware failure not affect all service instance;Another application of data storage, and each data also must want to have multiple backup, ensure that even if a few hardware broken data is not lost.
In addition to a single point of failure, but also ensure the high reliability.In a distributed environment, in view of the enterprise service application, to do load balancing and service discovery to ensure high reliability;In view of the data services, in order to achieve high reliability, first of all, according to a certain algorithm to the overall data fragmentation (because a server to hold), then the same algorithm to shard lookup.
A distributed design concept is that the data of local sex again, because the network communication overhead is the bottleneck of distributed system, to reduce the network overhead, should let computing tasks to find data, and not let the data for the calculation.
Distributed system was compared with the Linux operating system
Due to the vertical development can optimize the space is too small (upper limit of the performance of the single server obviously), distributed system emphasizes the horizontal extension, lateral optimization, when distributed cluster computing resources, add to the cluster server, to constantly improve distributed cluster computing ability.Distributed system to achieve unified management of all server cluster, shielding the underlying management details, such as fault tolerance, scheduling, communication, let developers think logically distributed cluster is the server.
Although compared with stand-alone Linux operating system, a distributed system is not mature to become "distributed operating system," but it as well as stand-alone Linux to solve five operating system required functions, namely, resource allocation, process management, task scheduling, interprocess communication (IPC) and the file system, respectively by the Mesos, Docker, Marathon/Chronos, RabbitMQ and HDFS/Ceph, corresponding to the Linux under the Linux Kernel, the Linux Kernel, init. D/cron, Pipe Socket and corruption, as shown in figure 1.
Figure 1 a distributed system and the comparison of the Linux operating system
Based on the distributed computing platform of Mesos
Mesos principle of resource allocation
At present our Mesos cluster deployment on public cloud services, with more than 100 virtual unit into Mesos cluster.Mesos are not required to compute nodes is a physical server or virtual server, as long as is the Linux operating system.Mesos can understand into a distributed Kernel, only distributing cluster computing resources, is not responsible for task scheduling.Based on the above Mesos can run different distributed computing platform, such as the Spark, Storm, Hadoop, Marathon, Chronos and so on.Spark, such as Storm and Hadoop computing platform has task scheduling functions, can be used directly Mesos SDK with Mesos request resources, then computing tasks scheduling by oneself, and the hardware fault tolerance.Marathon provide task scheduling for a service-oriented distributed applications, such as corporate websites such as this kind of need the services of a long running.Web applications usually no task scheduling and fault tolerance, because a background process is unlikely to instances where shall hang up after machine back on such a complex problem.This kind of no task scheduling ability of service-oriented distributed applications, can be responsible for scheduling by Marathon.Marathon, for example, one hundred implementation of scheduling the web service background instance, if an instance to hang out, including Marathon will put this on other server instance.Chronos is distributed batch application of task scheduling, such as processing log regularly or regular offline tasks such as Hadoop.
Mesos's biggest advantage is able to do fine-grained distributed cluster resource allocation.As shown in figure 2, the left is coarse grain of allocation of resources, on the right is the fine grain of allocation of resources.
Figure 2 Mesos resource scheduling of two ways
Figure 2 on the left have three clusters, each cluster server, three are three kinds of distributed computing platform, such as the above three Hadoop, middle three is the Spark, the following three Storm, three different frameworks, respectively.On the right is the unified management 9 Mesos cluster server, all tasks from the Spark, Hadoop or Storm in 9 mixed running on the server.Mesos first raised the rate of resource redundancy.Coarse grain resources management must bring certain waste, fine grained resource to improve resource management ability.Hadoop machines very at leisure, the Spark is not installed, but Mesos can be as long as any scheduling response immediately.Last and stability data, because the unified management of all 9 units have been Mesos, if say the Hadoop, Mesos will cluster scheduling.Between the computational resources are not Shared, storage is not good to be Shared.If it ran the Spark for network data migration, is obviously affect the speed.And then the method of allocation of resources is the resource offers, it is in the window of schedulable resources to choose, Mesos Spark or Hadoop, and so on.This method, the distribution of the Mesos logic is very simple, as long as constantly reports which are available resources.Mesos resource allocation methods also have a potential faults, is no centralized distribution, so may not bring the global optimal way.But the disadvantages for the current data resources are not very serious.Now a computing center resource contribution rate is very difficult to achieve 50%, most of the computing center is idle state.
The sample of Mesos distributed resources
The following specific illustrate how to use Mesos resource allocation.As shown in figure 3, the middle one is Mesos Master, below is the Mesos Slave, the above is the Spark and Hadoop to run above the Mesos.Mesos Master report available resources to Spark or Hadoop.Assuming Hadoop has a task to run, Hadoop Mesos from the Master of the resources available to a Mesos Slave node, then this task will be in the Mesos Slave node, it is a task to complete a resource allocation, the next Mesos Master to continue their resource allocation.
Figure 3 Mesos resource distribution of the sample
Mesos only do one thing, that is, distributed cluster resource allocation, regardless of the task scheduling.Marathon and Chonos is to do the task scheduling based on the Mesos.As shown in figure 4, the mixed operation from the Marathon and Chronos Mesos cluster of different types of tasks.Marathon and Chonos doing the task scheduling based on Mesos, must be a dynamic scheduling, namely each task before execution is don't know where is it in the future a execution on the server which one port and binding.As shown in figure 5, 9 servers Mesos cluster mixed operation of Marathon scheduled tasks, the middle one server to fail, the two tasks on this server is affected, and Marathon -- the two task migrated to other servers, that is, the benefits of dynamic task scheduling is very easy to realize fault tolerance.
Figure 4 Mesos cluster running different types of tasks
Figure 5 Marathon dynamic task scheduling
In order to reduce the influence of a hardware failure on application service, try stateless in application.Stateless benefit is in do not need any recovery procedures are affected, so this program as long as scheduling again.Stateless requirements put state data storage server or a message queue, and so are the benefits of fault-tolerant recover will become very convenient.
The high reliability of the service class
For service type of task, its guaranteed service over a distributed environment of high reliability, it need to load balance and service discovery.Load balancing in a distributed environment do have a difficulty is the background of these examples have dynamic change can occur, such as a single node is broken, the instance on the node will be affected, and then migrated to other nodes.However the traditional load balancer background instance address port is static.So in a distributed environment, in order to do service load balance must be found.For example, there are four cases, before a service now the newly added two instances, you need to tell the load balancer new instance address and port.Service discovery process is composed of several modules to cooperate to complete, such as Marathon to increase new instance to a service, writes the new scheduling instance address port Zookeeper, then Bamboo Zookeeper in storing the addresses of a new instance of the service port information told the load balancer, so the load balancer will know new instance address port, completed the service discovery.
The high reliability of data classes
For service type of application, distributed system with the load balancer and service discovery to ensure high reliability of service.For the application of data types, a distributed system to ensure that the same high reliable data services.The first thing to do data fragmentation, a server to save all the data is divided into many copies to save, but to shard data must be shard, according to certain rules to find the back to follow the same rules to shard lookup, is consistency.Assume that the original plan we use Hash calculation method, on the linear space points after three, I want to be in the data is divided into three pieces of machinery to save, three machines are all full, and distributing the data no longer assigned to it when straight linear space, but it is allocated to the annular space, connect the start and end, together as a data link, as shown in figure 6, so that the corresponding data points on this piece.If you want to add a new data center in the new cut out of the ring, it is very convenient, this part cut out this part of representative data should be on the new chip, so the original child data fragmentation to embedded shards.
Figure 6 data fragmentation
May also delete data, we put yellow data on red data, this is the advantage of the ring.Practical in order to achieve high reliability, may assume that any data is mapped to the yellow part, the yellow part as long as the map to any area of a yellow exist the same machine, with a slice of the underlying machine can have multiple copies and make backups of data, this is an example of the actual data fragmentation.This is how to do the high reliability of the data.These data fragmentation, and load balancing, is for the corresponding distributed fragmentation is unreliable and hardware failure, this is we use the greatest characteristic of distributed systems.
A distributed computing platform based on Docker
We mainly use Docker for process management of distributed environment.Docker workflow as shown in figure 7, we apply Docker to production stage, not only is applied to the development phase, so every day we edit Dockerfile, enhance Docker Images and test online, hair Docker image, in our internal private Docker Regis, then transferred to the production environment we Docker cluster, there is no difference between this and other Docker workflow.
Figure 7 Docker workflow
Submit Docker task in Mesos
Because the Mesos and Docker was combined seamlessly.Through the Marathon and Chronos submitted a service-oriented application and batch type application.Marathon and Chronos submit tasks, by means of a RESTful application using JSON script sets the parameters of the number of instances of the background, application, and the path of the Docker Images and so on.
The process of communication in distributed environment
Communication between application services under distributed environment, distributed message queue is used to do, we are using the RabbitMQ.RabbitMQ is a distributed system, it is also to ensure the high reliability, the problem of fault tolerance.First the RabbitMQ cluster, as shown in figure 8, six nodes form a RabbitMQ cluster, is each backup relation between each node, any broken, five can also provide other services, through redundancy to ensure the high reliability of the RabbitMQ.
Figure 8 the RabbitMQ cluster
Second, the RabbitMQ has data fragmentation mechanism.Because the message queue is may be a long, long to all of the message can't be on a node, shard in at this moment, the long message queue is divided into sections, respectively on different nodes.As shown in figure 9 is the RabbitMQ alliance mechanism, put a message queue into two segments, a period on a segment on the downstream and upstream assumes that the downstream message queue consumed over automatically move the message in the message queue upstream to the downstream, such a message queue into a very long time will not be afraid, shard to multiple nodes.
Figure 9 message queue shard
The distributed file system
Tell me the HDFS and the Ceph distributed file system finally.HDFS Hadoop file system, as shown in figure 10, each block has three backup, must be on different servers, and three backup per rack inside put two, do also to fault tolerance.Ceph is another popular open source distributed file system.Ceph the abstract network storage device into a logical drive, and then the "Mount" to each of the distributed cluster server, in principle is very like the Linux operating system Mount a physical hard disk.As a result, a user program to access the Ceph file system just like access Linux local path, is very convenient.