Deep learning - LeCun, Bengio and Hinton joint review
Yann LeCun, Yoshua Bengio and Geoffrey Hinton are three of the best-known figures in deep learning. To mark the 60th anniversary of artificial intelligence as a field, Nature published a special "Artificial Intelligence + Robotics" feature with a number of related papers, including "Deep Learning", the first review article the three have written together. This article is the second half of a translation of that review. It covers convolutional neural networks (CNNs), distributed representations, and recurrent neural networks (RNNs) and their applications in detail, and closes with the authors' outlook on the future of deep learning.
Convolutional neural networks
Convolutional neural networks (ConvNets) are designed to process data that come in the form of multiple arrays, for example a colour image composed of three 2D arrays containing the pixel intensities in the three colour channels. Many kinds of data take this form: 1D arrays for signals and sequences, including language; 2D arrays for images or audio spectrograms; and 3D arrays for video or volumetric images. ConvNets rely on four key ideas that take advantage of the properties of natural signals: local connections, shared weights, pooling and the use of many layers.
Figure 2: Inside a convolutional neural network
The architecture of a typical ConvNet (Figure 2) is structured as a series of stages. The first few stages are composed of two types of layers: convolutional layers and pooling layers. Units in a convolutional layer are organized in feature maps. Within a feature map, each unit is connected to a local patch in the feature maps of the previous layer through a set of weights called a filter bank, and the resulting local weighted sum is passed through a non-linearity such as a ReLU. All units in a feature map share the same filter bank, and different feature maps in a layer use different filter banks. There are two reasons for this architecture. First, in array data such as images, nearby values are often highly correlated and form distinctive local motifs that are easy to detect. Second, the local statistics of images and other signals do not depend on location: a motif that appears in one part of the image could appear anywhere else, so units at different locations can share the same weights and detect the same pattern in different parts of the array. Mathematically, the filtering operation performed by a feature map is a discrete convolution, which is where the name comes from.
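The feature-map computation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names `relu` and `feature_map`, the toy image and the hand-written `vertical_edge` filter are made up for this example (a real ConvNet learns its filters), and the kernel is not flipped, which is the convention most deep learning libraries use under the name "convolution".

```python
def relu(x):
    """Rectified linear unit: pass positive values, clamp the rest to zero."""
    return x if x > 0.0 else 0.0


def feature_map(image, kernel, bias=0.0):
    """Slide ONE shared filter over a 2D array.

    Every output unit sees only a local patch (local connection) and all
    units use the same kernel (weight sharing). Each local weighted sum is
    passed through a ReLU. No padding, stride 1, kernel not flipped.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            s = bias
            for di in range(kh):
                for dj in range(kw):
                    s += kernel[di][dj] * image[i + di][j + dj]
            row.append(relu(s))
        out.append(row)
    return out


# Toy input (assumed): dark on the left, bright on the right.
image = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Hand-written filter (assumed) that responds to a dark-to-bright vertical edge.
vertical_edge = [[-1, 1], [-1, 1]]

fmap = feature_map(image, vertical_edge)
print(fmap)  # [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
```

The same filter fires at the edge in both rows, which is the point of sharing weights across locations.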
The role of a convolutional layer is to detect local conjunctions of features from the previous layer, whereas the role of a pooling layer is to merge semantically similar features into one. Because the relative positions of the features that form a motif can vary somewhat, the motif can be detected reliably by coarse-graining the position of each feature. A typical pooling unit computes the maximum of a local patch of units in one feature map. Neighbouring pooling units read from patches that are shifted by more than one row or column, which reduces the dimension of the representation and makes it invariant to small shifts of the data. Two or three stages of convolution, non-linearity and pooling are stacked, followed by more convolutional and fully connected layers. Backpropagating gradients through a ConvNet is no different from backpropagating through an ordinary deep network, so all the weights in all the filter banks can be trained.
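As a minimal sketch of the pooling step, the following plain-Python function takes the maximum over non-overlapping 2x2 patches. The name `max_pool`, the patch size, the stride and the toy values are assumptions chosen for illustration.

```python
def max_pool(fmap, size=2, stride=2):
    """Max pooling over a 2D feature map.

    Each pooling unit outputs the maximum of a size x size patch; patches
    are `stride` rows/columns apart, so the output is smaller than the input.
    """
    out = []
    for i in range(0, len(fmap) - size + 1, stride):
        row = []
        for j in range(0, len(fmap[0]) - size + 1, stride):
            patch = [fmap[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(max(patch))
        out.append(row)
    return out


# Toy 4x4 feature map (assumed values).
fmap = [
    [1, 3, 0, 2],
    [4, 2, 1, 0],
    [0, 1, 5, 6],
    [2, 0, 7, 1],
]
pooled = max_pool(fmap)
print(pooled)  # [[4, 2], [2, 7]]
```

The 4x4 map shrinks to 2x2, and moving a strong activation within its 2x2 patch would leave the pooled output unchanged.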
Deep neural networks exploit the fact that many natural signals are compositional hierarchies, in which higher-level features are obtained by combining lower-level ones. In images, local combinations of edges form motifs, motifs assemble into parts, and parts form objects. Similar hierarchies exist in speech and text, from sounds to phones, phonemes, syllables, words and sentences. Pooling makes the representations robust when elements in the previous layer vary in position.
The convolutional and pooling layers in ConvNets are directly inspired by the simple cells and complex cells of visual neuroscience, and the overall architecture is reminiscent of the LGN-V1-V2-V4-IT hierarchy in the visual pathway. When a ConvNet and a monkey are shown the same picture, the activations of high-level units in the ConvNet explain half of the variance of random sets of 160 neurons in the monkey's inferotemporal cortex. ConvNets have their roots in the neocognitron, whose architecture was somewhat similar but which had no end-to-end supervised learning algorithm such as backpropagation. An early 1D ConvNet called the time-delay neural network was used to recognize phonemes and simple words.
ConvNets have found many applications since the early 1990s, starting with time-delay neural networks for speech recognition and document reading. The document-reading system used a ConvNet trained jointly with a probabilistic model that implemented language constraints. By the late 1990s this system was reading more than 10% of all the cheques in the United States. Microsoft later deployed a number of ConvNet-based optical character recognition and handwriting recognition systems. In the early 1990s ConvNets were also tried for object detection in natural images, including faces and hands, and for face recognition.
Image understanding with deep convolutional networks
Since the early 2000s, ConvNets have been applied with great success to the detection, segmentation and recognition of objects and regions in images. These are all tasks for which labelled data are relatively abundant, such as traffic sign recognition, the segmentation of biological images, and the detection of faces, text, pedestrians and human bodies in natural images. A major recent practical success of ConvNets is face recognition.
Importantly, images can be labelled at the pixel level, which has applications in technologies such as autonomous mobile robots and self-driving cars. Companies such as Mobileye and NVIDIA are using ConvNet-based methods in their vision systems for cars. Other applications involve natural language understanding and speech recognition.
Figure 3: From image to text
Despite these successes, ConvNets were largely neglected by the mainstream computer-vision and machine-learning communities until the ImageNet competition in 2012. When deep convolutional networks were applied to a data set of about a million images from the web covering 1,000 different classes, they achieved spectacular results, almost halving the error rates of the best competing approaches. This success came from the efficient use of GPUs, ReLUs, a new regularization technique called dropout, and techniques that generate more training examples by deforming the existing ones. It has brought about a revolution in computer vision: ConvNets are now used for almost all recognition and detection tasks. A striking recent result combines ConvNets with recurrent networks to generate image captions.
Recent ConvNet architectures have 10 to 20 layers of ReLUs, millions of weights and billions of connections between units. Two years ago, training such a network could take weeks; progress in hardware, software and algorithm parallelization has since reduced training times to a few hours.
The performance of ConvNet-based vision systems has attracted the attention of most major technology companies, including Google, Facebook, Microsoft, IBM, Yahoo!, Twitter and Adobe, as well as a fast-growing number of start-ups.
ConvNets lend themselves to efficient hardware implementations in chips or field-programmable gate arrays (FPGAs). A number of companies, such as NVIDIA, Mobileye, Intel, Qualcomm and Samsung, are developing ConvNet chips to enable real-time vision applications in smartphones, cameras, robots and self-driving cars.
Distributed representations and language processing
Deep-learning theory shows that deep networks have two different exponential advantages over classic learning algorithms that do not use distributed representations. Both advantages arise from the power of composition, and both depend on the underlying data-generating distribution having an appropriate compositional structure. First, learning distributed representations enables generalization to new combinations of the values of learned features beyond those seen during training (for example, n binary features allow 2^n possible combinations). Second, composing layers of representation in a deep network brings the potential for another exponential advantage (exponential in the depth).
The hidden layers of a multilayer neural network learn to represent the network's inputs in a way that makes it easy to predict the target outputs. A good demonstration is to train a multilayer neural network to predict the next word in a sequence from a local context of earlier words. Each word in the context is presented to the network as a one-of-N vector, that is, one component has a value of 1 and all the others are 0. In the first layer, each word creates a different pattern of activations, or word vector (Figure 4). In a language model, the remaining layers of the network learn to convert the input word vectors into an output word vector for the predicted next word, which can be used to predict the probability of any word in the vocabulary appearing as the next word. The network learns word vectors with many active components, each of which can be interpreted as a separate feature of the word, as was first demonstrated in the context of learning distributed representations for symbols. These semantic features were not explicitly present in the input. They were discovered by the learning procedure as a good way of factorizing the structured relationships between the input and output symbols into multiple "micro-rules". Learning word vectors also works very well when the word sequences come from a large corpus of real text and the individual micro-rules are unreliable. In a trained model, the vectors for words with similar meanings end up very close to each other, for example Tuesday and Wednesday, or Sweden and Norway. Such representations are called distributed representations because their elements (the features) are not mutually exclusive and their many configurations correspond to the variations seen in the observed data. The word vectors are composed of learned features that were not determined ahead of time by experts but discovered automatically by the neural network. Vector representations of words learned from text are now very widely used in natural language applications.
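To make the one-of-N input and the first layer concrete, here is a minimal sketch in plain Python. Everything in it is assumed for illustration: the four-word vocabulary, the two-dimensional `embedding` table with hand-picked values (a real network learns these weights), and the helper names `one_of_n`, `first_layer` and `cosine`.

```python
import math

vocab = ["tuesday", "wednesday", "sweden", "norway"]

# Hypothetical first-layer weights, one row per word. The values are made
# up so that related words are close; a trained network would learn them.
embedding = [
    [0.9, 0.1],  # tuesday
    [0.8, 0.2],  # wednesday
    [0.1, 0.9],  # sweden
    [0.2, 0.8],  # norway
]


def one_of_n(word):
    """One-of-N (one-hot) vector: 1 at the word's index, 0 everywhere else."""
    return [1.0 if w == word else 0.0 for w in vocab]


def first_layer(one_hot):
    """Multiply the one-hot input by the weight table.

    Because only one input component is 1, this simply selects that word's row.
    """
    dim = len(embedding[0])
    return [sum(one_hot[i] * embedding[i][d] for i in range(len(vocab)))
            for d in range(dim)]


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


tuesday = first_layer(one_of_n("tuesday"))
wednesday = first_layer(one_of_n("wednesday"))
sweden = first_layer(one_of_n("sweden"))

print(one_of_n("sweden"))  # [0.0, 0.0, 1.0, 0.0]
print(sweden)              # [0.1, 0.9]
print(cosine(tuesday, wednesday) > cosine(tuesday, sweden))  # True
```

The one-of-N vectors themselves are all equally far apart; any notion of similarity lives entirely in the learned word vectors.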
Figure 4: Visualizing learned word vectors
The question of representation lies at the heart of the debate between the logic-inspired and the neural-network-inspired views of cognition. In the logic-inspired paradigm, an instance of a symbol is something whose only property is that it is either identical or not identical to other symbol instances. It has no internal structure that is relevant to its use, and to reason with symbols they must be bound to the variables of carefully chosen rules of inference. By contrast, neural networks simply use large activity vectors, large weight matrices and scalar non-linearities to perform the kind of fast "intuitive" inference that underpins effortless commonsense reasoning.
Before neural language models were introduced, the standard approach to statistical language modelling did not use distributed representations. It was based on counting how often short symbol sequences of length up to N (called N-grams) occur. The number of possible N-grams is on the order of V^N, where V is the size of the vocabulary, so taking into account a context of more than a handful of words requires a very large training corpus. N-grams treat each word as an atomic unit, so they cannot generalize across semantically related sequences of words. Neural language models can, because they associate each word with a vector of real-valued features, and semantically related words end up close to each other in that vector space (Figure 4).
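The counting approach and the V^N growth can be illustrated with a short sketch. The helper name `ngram_counts` and the toy sentence are assumptions made for this example.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens in the sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


# Toy corpus (assumed).
tokens = "the cat sat on the mat the cat slept".split()

bigrams = ngram_counts(tokens, 2)
vocab_size = len(set(tokens))        # V
possible_bigrams = vocab_size ** 2   # V^N with N = 2

print(bigrams[("the", "cat")])   # 2
print(bigrams[("cat", "mat")])   # 0, never observed
print(vocab_size, possible_bigrams, len(bigrams))  # 6 36 7
```

Even in this tiny corpus only 7 of the 36 possible bigrams are observed, and the model has no way to relate an unseen sequence to a similar one it has seen, because each word is an opaque symbol.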
Recurrent neural networks
When backpropagation was first introduced, its most exciting use was for training recurrent neural networks (RNNs). For tasks that involve sequential inputs, such as speech and language, RNNs often give better results. An RNN processes an input sequence one element at a time, maintaining in its hidden units a state vector that implicitly contains information about the history of all the past elements of the sequence. If we treat the outputs of the hidden units at different discrete time steps as if they were the outputs of different neurons in a deep multilayer network (Figure 5, right), it becomes clear how backpropagation can be used to train RNNs.
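A minimal sketch of this recurrence, assuming a single hidden unit with scalar weights and a tanh non-linearity (the review does not prescribe these choices, and the names `rnn_forward`, `w_in` and `w_rec` are made up for the example):

```python
import math


def rnn_forward(inputs, w_in, w_rec, h0=0.0):
    """Scalar RNN: h_t = tanh(w_in * x_t + w_rec * h_{t-1}).

    The same two weights are reused at every time step. Returning the list
    of states is the "unfolded in time" view: one layer per time step.
    """
    h = h0
    states = []
    for x in inputs:
        h = math.tanh(w_in * x + w_rec * h)
        states.append(h)
    return states


# A single pulse at t=0 followed by silence (assumed toy input).
states = rnn_forward([1.0, 0.0, 0.0], w_in=0.5, w_rec=0.8)
for t, h in enumerate(states):
    print(t, round(h, 4))
```

The state stays non-zero after the input has gone back to zero, which is how the hidden units carry information about earlier elements of the sequence.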
Figure 5: A recurrent neural network
RNNs are very powerful dynamic systems, but training them has proved to be problematic: the backpropagated gradients either grow or shrink at each time step, so over many time steps they typically explode or vanish.
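The effect is easy to see in a deliberately simplified setting. The sketch below assumes a linear scalar recurrence h_t = w_rec * h_{t-1}, for which the gradient of the final state with respect to the initial state is w_rec multiplied by itself once per time step; the function name and the chosen weights are illustrative only.

```python
def backpropagated_gradient(w_rec, steps):
    """Gradient of h_T with respect to h_0 for h_t = w_rec * h_{t-1}.

    Each step backwards through time multiplies the gradient by w_rec.
    """
    grad = 1.0
    for _ in range(steps):
        grad *= w_rec
    return grad


vanishing = backpropagated_gradient(0.5, 50)
exploding = backpropagated_gradient(1.5, 50)
print(vanishing)  # about 8.9e-16
print(exploding)  # about 6.4e+08
```

Only a recurrent weight of exactly one keeps the gradient constant in this toy case, which hints at why LSTM memory cells use a self-connection with a weight of one.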
Thanks to advances in their architecture and in ways of training them, RNNs have been found to be very good at predicting the next character in a text or the next word in a sequence, and they can also be applied to more complex tasks. For example, after reading an English sentence one word at a time, an English "encoder" network can be trained so that the final state vector of its hidden units is a good representation of the thought expressed by the sentence. This "thought vector" can then be used as the initial hidden state of (or as an extra input to) a jointly trained French "decoder" network, which outputs a probability distribution for the first word of the French translation. If a particular first word is chosen from this distribution and fed back into the decoder network, it outputs a probability distribution for the second word of the translation, and so on until a full stop is chosen. Overall, this process generates sequences of French words according to a probability distribution that depends on the English sentence. This rather simple approach to machine translation has quickly become competitive with the state of the art, which raises doubts about whether understanding a sentence requires anything like internal symbolic expressions manipulated by inference rules. It is more compatible with the view that everyday reasoning involves many simultaneous analogies that each contribute plausibility to a conclusion.
Instead of translating the meaning of a French sentence into an English sentence, one can learn to "translate" the content of an image into an English sentence (Figure 3). The encoder here is a deep ConvNet that converts the pixels into an activity vector in its last hidden layer. The decoder is an RNN similar to the ones used for machine translation and neural language modelling. There has recently been a surge of interest in such systems.
RNNs, once unfolded in time (Figure 5), can be seen as very deep feedforward networks in which all the layers share the same weights. Although their main purpose is to learn long-term dependencies, theoretical and empirical evidence shows that it is difficult for them to learn to store information for very long.
To address this problem, one idea is to augment the network with an explicit memory. The first proposal of this kind was the long short-term memory (LSTM) network, which uses special hidden units whose natural behaviour is to remember inputs for a long time. A special unit called the memory cell acts like an accumulator or a gated leaky neuron: it has a connection to itself at the next time step with a weight of one, so it copies its own real-valued state and accumulates the external signal, but this self-connection is multiplicatively gated by another unit that learns to decide when to clear the content of the memory.
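The accumulator behaviour of the memory cell can be sketched as follows. This is a simplification, not a full LSTM: the gate values are supplied by hand here, whereas in an LSTM they are computed by learned units, and the name `memory_cell` and the toy signals are assumptions.

```python
def memory_cell(inputs, keep_gates, c0=0.0):
    """Gated accumulator: c_t = keep_t * c_{t-1} + x_t.

    The self-connection has weight one, so with keep_t = 1 the cell copies
    its state and adds the new input; keep_t = 0 clears the memory first.
    """
    c = c0
    trace = []
    for x, keep in zip(inputs, keep_gates):
        c = keep * c + x
        trace.append(c)
    return trace


inputs = [1.0, 0.0, 0.0, 2.0, 0.0]
keep_gates = [1.0, 1.0, 1.0, 0.0, 1.0]  # the gate closes once, at t=3

trace = memory_cell(inputs, keep_gates)
print(trace)  # [1.0, 1.0, 1.0, 2.0, 2.0]
```

The value stored at t=0 survives unchanged until the gate closes at t=3, at which point the old content is discarded and the new input takes its place.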
LSTM networks have subsequently proved to be more effective than conventional RNNs, especially when they have several layers for each time step, making possible an entire speech recognition system that goes all the way from acoustics to the sequence of characters in the transcription. LSTM networks or related forms of gated units are also used in the encoder and decoder networks that perform so well at machine translation.
In the past few years, several authors have made different proposals for augmenting RNNs with a memory module. These include the Neural Turing Machine, in which the network is augmented with a "tape-like" memory that the RNN can read from and write to, and memory networks, in which a conventional network is augmented with a kind of associative memory. Memory networks perform well on standard question-answering benchmarks; the memory is used to remember the story about which the network is later asked questions.
Beyond simple memorization, neural Turing machines and memory networks are being used for tasks that would normally require reasoning and symbol manipulation. Neural Turing machines can be taught "algorithms". For example, they can learn to output a sorted list of symbols when their input is an unsorted sequence in which each symbol is accompanied by a real value indicating its priority in the list. Memory networks can be trained to keep track of the state of the world in a setting similar to a text adventure game, and after reading a story they can answer questions that require complex inference. In one test example, the network is shown a 15-sentence version of The Lord of the Rings and correctly answers questions such as "Where is Frodo now?".
The future of deep learning
Unsupervised learning had a catalytic effect in reviving interest in deep learning, but it has since been overshadowed by the successes of purely supervised learning. Although it is not the focus of this review, we expect unsupervised learning to become far more important in the longer term. Human and animal learning is largely unsupervised: we discover the structure of the world by observing it, not by being told the name of every object.
Human vision is an active process that samples the visual scene in an intelligent, task-specific way, using a small, high-resolution fovea with a large, low-resolution surround. We expect much of the future progress in machine vision to come from systems that are trained end-to-end and combine ConvNets with RNNs that use reinforcement learning to decide where to look. Systems that combine deep learning and reinforcement learning are in their infancy, but they already outperform passive vision systems at classification tasks and produce impressive results in learning to play video games.
Natural language understanding is another area in which deep learning is set to make a large impact over the next few years. We expect systems that use RNNs to understand sentences or whole documents to become much better when they learn strategies for selectively attending to one part at a time.
Ultimately, major progress in artificial intelligence will come from systems that combine representation learning with complex reasoning. Although deep learning and simple reasoning have been used for speech and handwriting recognition for a long time, new paradigms are needed to replace rule-based manipulation of symbolic expressions with operations on large vectors.