Hadoop 2.0 on deep learning solutions
Boston,The data science teamIs using sophisticated tools and algorithms to optimize business activities, and the commercial activity is based on the profound dialysis of user data.Machine algorithm, is widely used in data science can help us to identify and use patterns in the data.Dialysis was obtained from the Internet in the large-scale data is a challenging task, therefore, can run on a large scale algorithm is a vital needs.With the explosive growth of the data and cluster tens of thousands of machines, we need to make the algorithm can adapt to running in such a distributed environment.In general machine learning algorithms of distributed computing environment with a series of challenges.
Here, we discuss how to implement and deploy a Hadoop cluster in deep learning (a state-of-the-art machine learning framework).For the algorithm is how to adapt to run in a distributed environment, we provide the specific details.We are running on the algorithm is given in the standard data sets.
The depth of the trust network
Deep trust network (Deep are Networks, DBN) is under the condition of greed and unsupervised limited by iteration and training of Boltzmann machine (the Boltzmann those, RMB) for graphic model.By means of the following can be observed dimensions x and hidden layer between hk connected distributed modeling, DBN are trained to extract the deep dialysis of training data.
The expression 1: DBN distributed
In the following figure, the relationship between input and hidden layer can be observed.From the point of high level, the first layer is training, as a matter for the original input x model.The input data is a sparse binary dimension, which indicates that the data will be classified, for example, a binary digital image.Subsequent layer in front of the passed data (samples or activations) used as the training sample.Layer can be decided by experience, in order to get a better model performance, DBN support any number of layers.
Figure 1: DBN level
The following code snippet shows into the training of the matter.In the provided to the matter of the input data, there are multiple predefined time points.Input data is divided into small batch data, calculate the weights for each layer, the activations and deltas.
After trained all the layers, the depth of the network parameter adjustment using supervised training standards.Supervised training standards, for instance, can be designed as a classification problem, depth allows the use of network to solve classification problems.More complex by the supervision standard can be used to provide such as scenarios, such as reading interesting results, for example to explain what show in the picture.
Deep learning has received the widespread attention, and not just because it can be concluded that more outstanding results than some other learning algorithms, but also because it can run on a distributed equipment, allowing processing large data sets.Deep web can be parallel in two levels, the layer level and data level .For other parallel hierarchy, many implementation using the GPU array to parallel computing layer level they activations and frequent synchronization.However, this method is not suitable for the data resides in the cluster, connected through a network of multiple machines with higher network overhead.For parallel data layer, training was conducted on data sets in parallel, is more suitable for distributed devices.Paypal, most of the data stored on a Hadoop cluster, so to be able to run the cluster algorithm is our top priority.Dedicated cluster maintenance and support is also an important factor we need to consider.However, due to the deep learning is essentially iterative, like graphs paradigm is not suitable for operation of these algorithms.But as Hadoop2.0 and resource management based on the YARN, we can write iterator, finely control program can be used by the resources at the same time.We use the IterativeReduce , Hadoop YARN inside of a user program written iterative algorithm, we can deploy it to in a running Hadoop against 2.4.1 Paypal in the cluster.
We implemented Hinton's core algorithm in reference .Because we demand is dispersed to run on multiple machines in the cluster algorithm, we use them in such an
environment.This algorithm for scattered across multiple machines, we refer to this proposed guide .Below is a detailed summary of our implementation to do:
1. The Master node initialize the weights of the matter.
2. The Master node to the Worker pushed weights and splits.
3. Worker node points in a data set time training a matter layer, in
other words, in a completely through all the Worker nodes after the
split, send updated weights to the Master.
4. At a given point in time, for all the Worker from the Master node
weights of averaging.
5. Set in the predefined time (50 in our example), repeat steps 3 to
6. Step 6 is complete, there is a layer were trained.Subsequent matter
layer also repeat these steps.
7. When all layers are training, deep web will reverse broadcasting
mechanism by using error accurately adjusted.
Below describes in run deep learning algorithms of a single data set point in time (step 3-5).We note that, this model can be used to implement a host machine learning algorithm of iteration.
Figure 2: a single data set point in time for training
The following code snippet shows the steps involved in the training of a single machine DBN.Data set is first split into multiple batches, then multiple matter layer are sequentially initialization and training.After the matter has been training, they will pass a accurate adjustment error reverse phase.
We use the IterativeReduce  the implementation of the largely to YARN tube.We made a major reform to achieve, it can be used to our deep learning algorithm.IterativeReduce implementation for Cloudera Hadoop distributed written, it is we reset the platform, in order to adapt to the standard Apache Hadoop distributed.We also override the implementation, in order to use the standard programming models described in .In particular, we need to YARN client API in the ResourceManager and communicate between the client program.We also use the AMRMClient and AMNMClient communicate between ApplicationMaster, ResourceManager and NodeManager.
We first use the YARN API to submit the application to the YARN resource manager:
After the application is submitted, YARN Master resource manager to start the application.If necessary, the application the Master is responsible for the allocation and release of Worker container.The Master program to use AMRMClient to communicate with the resource manager.
Application NMClient apis used by the Master in the container (Master node passed) run a command.
Once the application Master launched it need Worker container, it set a port for communication and application of the Master.For we deep learning, we added some methods, they need to provide original IterativeReduce interface parameters initialization signal, step by step training and accurate adjustment.IterativeReduce using Apache Avro IPC to realize the communication between the Master and the Worker.
The following code snippet shows that involves a series of Master - the Worker nodes distributed training, the Master Worker to send initial parameters, then the Worker training on the part of the data to its matter.After the Worker to complete the training, it sends the results to the Master, the Master will be integrated the results.Iteration is complete, the Master by starting reverse broadcasting precise adjustment phase to complete the process.
The results of
We assessed the use MNIST handwritten numeral recognition  to achieve the depth of the learning performance.The data set contains manually tag number from 0 to 9.Training set is composed of 60000 images, test set contains 10000 images.
In order to measure performance, DBN is first pilot training, and then in 60000 photo were accurate adjustment, through the above steps, DBN will evaluate on 10000 test images.During the period of training or evaluation, not for image preprocessing.Error rate is by the total number of images for the classification and the ratio of the total number of the picture of the test set.
When used in each matter the hidden unit 500-500-2000, at the same time using 10 nodes distributed equipment, we can achieve the best classification error rate of 1.66%.Error rate can be as reported by the author of 1.2% than the original algorithm (using the 500-500-2000 hidden units) , and similar Settings some results of .We note that the original is implemented on a single machine, and our implementation is in a distributed on the machine.Parameters of average performance slightly lower in this step, however the distribution algorithm on multiple machines is not do more harm than good.In the following table summarizes ten nodes of cluster running on each layer corresponds to the number of hidden cell changes in the error rate.
Table 1: MNIST performance evaluation
We successfully deployed a deep learning system, we believe that it is in some machine learning problems is very useful in the process.In addition, the iteration abstract distribution can be used to reduce any other suitable machine learning algorithms, able to take advantage of gm's Hadoop cluster will prove very beneficial to run on large datasets of large machine learning algorithms.We note that, we need some improvements, the current framework of the main around reducing network latency and more advanced resource management.In addition, we need to optimize the DBN framework that can reduce the communication between the internal nodes.With the accurate control of cluster resources, Hadoop YARN framework to provide us with more flexibility.