The Evolution of the Hadoop MapReduce Framework, Explained

By Jeremy Cox,2015-07-19 12:27

    The classic version of MapReduce

    The so-called classic MapReduce framework was the first mature, commercially used version of Hadoop. It is simple and easy to use. Consider its architecture diagram:

    We can call the architecture in the diagram Hadoop V1.0. The idea is straightforward: each Client submits a Job to a single JobTracker; the JobTracker splits the Job into N Tasks and distributes them to the nodes (Node), which run them in parallel; each node then reports its result back to the JobTracker, which produces the final output.
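
    The V1.0 flow described above (split, distribute, collect, combine) can be sketched as a toy simulation. This is only an illustration, not Hadoop code: the names job_tracker, split_job, and run_task are hypothetical, and a thread pool stands in for the cluster nodes.

```python
from concurrent.futures import ThreadPoolExecutor


def split_job(records, num_tasks):
    """Split the job input into num_tasks chunks, one chunk per Task."""
    return [records[i::num_tasks] for i in range(num_tasks)]


def run_task(chunk):
    """The work one node performs on its Task: here, simply sum the chunk."""
    return sum(chunk)


def job_tracker(records, num_tasks=4):
    """Hypothetical JobTracker: split the Job, dispatch Tasks, collect results."""
    tasks = split_job(records, num_tasks)
    # The thread pool plays the role of the worker nodes (an assumption).
    with ThreadPoolExecutor(max_workers=num_tasks) as nodes:
        partial_results = list(nodes.map(run_task, tasks))
    # Every partial result flows back through this single coordinator.
    return sum(partial_results)


total = job_tracker(list(range(1, 101)))
print(total)  # 5050
```

    Note that every step passes through job_tracker, which is exactly why the single coordinator becomes the weak point discussed next.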

    However, this framework has its limitations. Let us briefly analyze them point by point:

    1. Single point of failure. This is the most serious problem. As the diagram shows, completing any Job depends on the JobTracker for scheduling and allocation, so if that node goes down, the entire platform is paralyzed. In practice this is usually mitigated with a standby JobTracker. Even so, concentrating the core of the computation on one machine is not an optimal design for a distributed computing framework.

    2. Scalability. As the same diagram shows, the JobTracker not only accepts the Jobs submitted by Clients and handles their distribution and scheduling; it also manages failed and restarted Jobs and monitors the resource utilization of every Node (the mechanism is heartbeat detection). As the number of Nodes grows, the JobTracker carries more and more work: it has to keep up with the status checks of all child nodes while also distributing new Jobs. For this reason the official guidance limits this framework to fewer than 4,000 nodes.
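
    As a rough sketch of the heartbeat detection mentioned above, the toy monitor below records the last heartbeat of each node and reports the nodes that have been silent for too long. The class name HeartbeatMonitor, the timeout, and the node names are all assumptions made for illustration; this is not the JobTracker's actual implementation.

```python
class HeartbeatMonitor:
    """Toy model of a master that tracks the last heartbeat of every node."""

    def __init__(self, timeout_seconds):
        self.timeout_seconds = timeout_seconds
        self.last_seen = {}  # node name -> time of the last heartbeat

    def heartbeat(self, node, now):
        """Record that a node reported in at time `now`."""
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        # The master scans every node, so this cost grows with cluster size.
        return sorted(
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_seconds
        )


monitor = HeartbeatMonitor(timeout_seconds=10)
monitor.heartbeat("node-1", now=0)
monitor.heartbeat("node-2", now=0)
monitor.heartbeat("node-1", now=8)
print(monitor.dead_nodes(now=12))  # ['node-2']
```

    Even in this toy form, one process holds the state of every node, which hints at why a single master struggles as the node count approaches the thousands.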

    3. Wasted resources. In the traditional architecture, each Job's resources are allocated by counting the tasks placed on each Node rather than by what those tasks actually consume. This kind of allocation obviously cannot achieve dynamic load balancing. For instance, two memory-hungry tasks may be scheduled on the same Node, putting that machine under heavy pressure while other nodes sit relatively idle. In distributed computing this is a huge waste of resources.
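
    The difference between counting tasks and looking at actual consumption can be illustrated with a small sketch. Both functions are hypothetical simplifications written for this article: place_by_slots fills each node's task slots in order, while place_by_memory puts each task on the node with the least memory already committed.

```python
def place_by_slots(task_memory_gb, nodes, slots_per_node):
    """Fill each node's slots in order, ignoring how much memory a task needs.

    Assumes there are enough slots for all tasks (toy model only).
    """
    placement = {node: [] for node in nodes}
    for index, memory in enumerate(task_memory_gb):
        node = nodes[index // slots_per_node]
        placement[node].append(memory)
    return placement


def place_by_memory(task_memory_gb, nodes):
    """Place the largest tasks first, each on the currently least-loaded node."""
    placement = {node: [] for node in nodes}
    for memory in sorted(task_memory_gb, reverse=True):
        least_loaded = min(nodes, key=lambda n: sum(placement[n]))
        placement[least_loaded].append(memory)
    return placement


tasks = [8, 8, 1, 1]  # memory demand of each task, in GB
nodes = ["node-1", "node-2"]
print(place_by_slots(tasks, nodes, slots_per_node=2))
# {'node-1': [8, 8], 'node-2': [1, 1]}
print(place_by_memory(tasks, nodes))
# {'node-1': [8, 1], 'node-2': [8, 1]}
```

    With slot counting, node-1 carries 16 GB of demand while node-2 carries 2 GB; the resource-aware placement spreads the same tasks evenly.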

    4. Version coupling. This is another serious flaw for a platform. In the architecture above, any change to the MapReduce framework, important or not (a bug fix, a performance improvement, or a new feature), forces a system-level upgrade. Worse, regardless of whether users agree, every client of the distributed system is forced to update as well.

    These four points are the limitations of the V1.0 framework. In summary, the problems are concentrated in the central process, the JobTracker. Solving the problems of that process essentially solves the performance, scalability, and other problems mentioned above.

    Let us now analyze the JobTracker's responsibilities in MapReduce in detail, since solving the scalability problem means decoupling those responsibilities:

    1. Managing the cluster's computing resources. This involves maintaining the list of active nodes and the list of available Map and Reduce slots, and assigning the available slots to suitable jobs and tasks according to the selected scheduling policy.

    2. Coordinating the Tasks running in the cluster. This involves instructing TaskTrackers to start Map and Reduce tasks, monitoring the running state of Tasks, restarting failed tasks, speculatively executing slow Tasks, computing the totals of job counter values, and so on.

    The JobTracker is clearly overworked. Placing all of this in a single process leads to significant scalability problems, especially in larger clusters, where the JobTracker must constantly track thousands of TaskTrackers, hundreds of jobs, and tens of thousands of Map and Reduce tasks. Here is a picture:

    The figure above shows a rather busy JobTracker handing out work.

    With the analysis done, the solution is plain to see: reduce the responsibilities of the single JobTracker!

    Reducing the JobTracker's responsibilities means handing the duties that should not belong to it to something else. From the description above, the JobTracker's duties fall into two broad parts: cluster resource management and task coordination.

    Of these two, cluster resource management is clearly the more important, because it determines the performance and scalability of the whole platform. Task coordination, by comparison, can be handed to one of the Nodes, and because it is then carried out separately for each Job a Client submits, it is short-lived and flexible.

    Put plainly: in the architecture above, the JobTracker only needs to manage the resources of the entire platform, while task allocation and coordination are delegated to its subordinates (the Nodes). It is like a company taking on a piece of work: the big Boss manages only the company's overall resources and hands the work itself to the appropriate department.

    With this analysis, the direction for optimizing the next version of the framework is basically clear. Next, we analyze the architecture of the new version of Hadoop.

    YARN: the new-generation architecture design

    First, the official definition:

    Apache Hadoop YARN (Yet Another Resource Negotiator) is a new Hadoop resource manager. It is a general-purpose resource management system that provides unified resource management and scheduling for the applications running on top of it. Its introduction brings major benefits to clusters in terms of utilization, unified resource management, and data sharing.

    The analysis in the first part established that the former JobTracker should be reduced to a main process responsible for managing and allocating the resources of the whole cluster. Given that, it is obviously no longer appropriate to keep calling this process the JobTracker.

    So in the new architecture diagram its name becomes the ResourceManager, which fits its responsibilities far better.

    Let's look at a picture:

    It is basically the same as the first-generation architecture diagram, except that the responsibilities are now clearly separated. Let us first pin down the terms used in the picture:

    1. ResourceManager (RM): the global cluster resource manager.

    2. ApplicationMaster (AM): a JobTracker dedicated to a single application. As the picture shows, the coordination duties of the JobTracker have now been moved out to the Nodes.

    3. NodeManager (NM): manages an individual child node. It replaces the former TaskTracker with similar functionality, but adds self-management of its own node and takes over a share of the RM's responsibilities.

    4. Containers: a Container hosts an Application, where Application is the new term for what used to be a MapReduce job. The purpose of this abstraction is to let more kinds of applications run on the Hadoop platform, leaving room for future expansion.

    Let us briefly walk through how the framework runs.

    In the YARN architecture, a single global ResourceManager runs mainly as a background daemon. It usually runs on a dedicated machine and arbitrates the available cluster resources among the competing applications. The ResourceManager tracks how many live nodes and resources are available in the cluster and coordinates which of the applications submitted by users should obtain these resources, and when.

    The ResourceManager is the only process that has this information, so it can make allocation (or scheduling) decisions in a shared, secure, multi-tenant manner (for example, according to application priority, queue capacity, ACLs, data locality, and so on).

    When a user submits an application, a lightweight process instance called the ApplicationMaster starts up to coordinate the execution of all tasks within that application. This includes monitoring tasks, restarting failed tasks, speculatively running slow tasks, and computing the totals of the application's counter values. These responsibilities previously belonged to the JobTracker. Now they are independent and run in resource containers controlled by the NodeManager.

    The NodeManager is a more generic and efficient version of the TaskTracker. Instead of a fixed number of Map and Reduce slots, the NodeManager has a number of dynamically created resource containers. The size of a container depends on the amount of resources it contains, such as memory, CPU, disk, and network I/O, but currently only memory and CPU are supported (YARN-3). The platform provides an interface for later extension; in the future, cgroups may be used to control disk and network I/O.
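
    To make the idea of dynamically sized containers concrete, here is a minimal sketch of a node that grants containers out of its memory and CPU budget. ToyNodeManager and its methods are hypothetical names invented for this example; the real NodeManager is far more involved.

```python
from dataclasses import dataclass, field


@dataclass
class ToyNodeManager:
    """Toy node that carves containers out of its memory and CPU budget."""

    total_memory_mb: int
    total_vcores: int
    containers: list = field(default_factory=list)  # (memory_mb, vcores) pairs

    def free(self):
        """Return the (memory_mb, vcores) not yet committed to containers."""
        used_memory = sum(c[0] for c in self.containers)
        used_vcores = sum(c[1] for c in self.containers)
        return self.total_memory_mb - used_memory, self.total_vcores - used_vcores

    def launch(self, memory_mb, vcores):
        """Create a container of the requested size if the node has room."""
        free_memory, free_vcores = self.free()
        if memory_mb > free_memory or vcores > free_vcores:
            return False
        self.containers.append((memory_mb, vcores))
        return True

    def release(self, memory_mb, vcores):
        """Give a finished container's resources back to the node."""
        self.containers.remove((memory_mb, vcores))


nm = ToyNodeManager(total_memory_mb=8192, total_vcores=4)
print(nm.launch(4096, 2))  # True
print(nm.launch(2048, 1))  # True
print(nm.launch(4096, 1))  # False: only 2048 MB is still free
nm.release(4096, 2)
print(nm.launch(4096, 1))  # True: the released memory is reusable
```

    Unlike fixed slots, the container size here is chosen per request, so a node can host a few large containers or many small ones.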

    Put simply, the NodeManager gives each node a high degree of autonomy, including over the coordination work that the JobTracker used to do on the node's behalf.

    Let's look at another picture that shows in detail how a new Job flows through the Nodes inside YARN:

    As the picture shows, there is much more interaction between nodes than in the first-generation architecture diagram. This is because the new architecture devolves the JobTracker's duties to an ApplicationMaster running under a NodeManager: the traditional Map-Reduce task distribution now happens in the ApplicationMaster, which causes interaction between different Nodes.

    Of course, the whole process is still scheduled and managed by their boss, the ResourceManager.

    In Hadoop releases, this architecture is called MRv2.

    To sum up, this architecture solves the following problems:

    1. Higher cluster utilization: resources left unused by one framework can be used by another, which avoids wasted resources.

    2. High scalability: by pushing responsibilities down, the architecture removes the first version's limit of 4,000 nodes, so the cluster can be scaled much further.

    3. In the new YARN, the ApplicationMaster is a replaceable component: users can write their own ApplicationMaster for a different programming model, so that more programming models can run on a Hadoop cluster.

    4. In the previous version of the framework, monitoring the running status of each Job's tasks was a heavy burden on the JobTracker. Now that work has been handed down to the ApplicationMaster.

    Beyond these points, let us analyze the role the ResourceManager plays in the new framework.

    Here is the last figure:

    When a Client submits an application, it first reaches the ResourceManager, which maintains the list of applications running on the cluster and the list of resources available on each live NodeManager. The ResourceManager first determines which application may run next and allocates part of the cluster's resources to it in the form of a Container. That choice is subject to many constraints, such as queue capacity, ACLs, and fairness. The next step is handled by another pluggable component, the Scheduler, which schedules the work (note: it schedules only). The Scheduler does no monitoring of application execution. It is like a secretary who passes the tasks assigned by the big Boss (RM) to the corresponding department head (ApplicationMaster).

    The ApplicationMaster is then the manager who assigns tasks to the employees (the worker nodes); this is the Map-Reduce distribution process. Throughout, the ApplicationMaster is responsible for the full life cycle of the application. While running, it can of course ask the boss (RM) for resources, specifying for example:

    1. A certain amount of hardware resources, such as an amount of memory and a share of CPU.

    2. A preferred location, such as a particular Node, usually specified by host name, rack name, and so on.

    3. The priority of the Task allocation.
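
    The three kinds of requirement listed above can be pictured as one small record per request. The sketch below is purely illustrative: ToyResourceRequest and its field names are hypothetical, and the convention that a smaller priority number is served first is an assumption of this example, not a statement about YARN's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToyResourceRequest:
    """What an ApplicationMaster might ask the ResourceManager for."""

    memory_mb: int                        # 1. amount of hardware resources
    vcores: int
    preferred_host: Optional[str] = None  # 2. preferred location
    preferred_rack: Optional[str] = None
    priority: int = 0                     # 3. priority (assumed: smaller first)


requests = [
    ToyResourceRequest(1024, 1, preferred_host="host-b", priority=20),
    ToyResourceRequest(2048, 2, preferred_rack="rack-1", priority=10),
    ToyResourceRequest(1024, 1, preferred_host="host-a", priority=5),
]

# Serve the requests in priority order under the assumed convention.
ordered = sorted(requests, key=lambda request: request.priority)
print([request.priority for request in ordered])  # [5, 10, 20]
```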

    Then, once the corresponding resources are found, the work begins and runs to completion. The batch of tasks runs on the Nodes, but each Node also has its own squad leader (the NodeManager), which is responsible for monitoring the resource usage of its Node. For example, if a task needs less than was originally allocated or finishes ahead of time, the NodeManager ends the container and releases its resources.

    During this process, the ApplicationMaster does its best to coordinate the containers and launch the tasks needed to complete its application. It monitors the application's progress, restarts failed tasks, and reports progress to the Client that submitted the application. After the application completes, the ApplicationMaster shuts itself down and releases its own container.

    Of course, if the ApplicationMaster itself fails during this process, the ResourceManager finds a new leader (by starting a new ApplicationMaster in a new container), and this continues until the entire program is complete.
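
    The recovery behaviour described above can be sketched as a simple supervision loop. The function names and the max_attempts limit are assumptions made for this example (the text above speaks of retrying until the program completes; a bounded loop is used here so the sketch always terminates).

```python
def supervise_application(run_attempt, max_attempts):
    """Toy ResourceManager loop: start a fresh ApplicationMaster until one finishes."""
    for attempt in range(1, max_attempts + 1):
        try:
            # Each attempt stands for a new ApplicationMaster in a new container.
            return attempt, run_attempt(attempt)
        except RuntimeError as crash:
            print(f"attempt {attempt} lost its ApplicationMaster: {crash}")
    raise RuntimeError("application failed: attempt limit reached")


def flaky_master(attempt):
    """Hypothetical ApplicationMaster that crashes on its first attempt only."""
    if attempt == 1:
        raise RuntimeError("container exited")
    return "SUCCEEDED"


attempts_used, final_state = supervise_application(flaky_master, max_attempts=2)
print(attempts_used, final_state)  # 2 SUCCEEDED
```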


    Hadoop is an outstanding platform for distributed architecture. Do I need to spell out its advantages for you? I don't think so: many successful cases have already shown us that the so-called big data, the so-called Internet+, and the so-called cloud of the future will all find their footing on it.
