Computer vision: letting cold machines read and understand the world
In 2010, scientists from Stanford University, Princeton University, and Columbia University launched the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) to promote the sustained development of computer visual recognition. According to The New York Times, in the 2014 challenge the accuracy of computer systems at object recognition almost doubled, and the image classification error rate was cut in half.
Building on this, a computer vision system developed by the Visual Computing Group at Microsoft Research Asia has recently achieved a breakthrough. According to a paper published by the team, the "ImageNet 1000 challenge" data set contains about 1.2 million training images, 50,000 validation images, and 100,000 test images, divided into 1,000 categories. On this data set, the system developed by the Microsoft research team reduced the recognition error rate to 4.94%, falling below the human error rate of about 5.1% for the first time. What are computer vision systems, and where are they headed? Microsoft Research Asia researcher Jian Sun walks through the past and present of computer vision:
So near, yet so far
The globally popular film "Interstellar" has inspired countless people to explore the mysteries of the vast universe, and it has left many with fond memories of TARS, the clever, lovable, humorous, and witty intelligent robot. Hollywood films about artificial intelligence have long been popular with fans: with boundless imagination and dazzling special effects, people build wonderful future worlds out of nothing, and audiences are captivated. Back in reality, however, the work of computer scientists seems to lag far behind the imagination of filmmakers. A movie is, after all, a movie. We still have a long way to go before we can build an intelligent robot that understands the world around it, understands human language, and converses fluently with people.
In "Interstellar", intelligent robots that can see, hear, and speak are popular with audiences. Image: still from "Interstellar"
For a long time, enabling computers to see, hear, and speak has been the goal that my colleagues in the computer industry and I have pursued. Through more than 10 years of work in computer vision, the idea of giving the computer a pair of eyes, so that it can understand this colorful world, has been an important force driving me forward on this challenging path. Although computers are not yet as smart as those shown in the movies, they have already achieved many amazing results.
How we see the world with our eyes
For humans, recognizing people seems to be a natural instinct: a newborn baby can imitate its parents within a few days. This instinct lets us tell people apart from very few details; even in dim light we can still recognize a friend at the end of a corridor. Yet this ability, so effortless for humans, is extremely difficult for computers. For a long time, computer vision technology was stuck. Before exploring further, let us first talk about how we see the world with our eyes.
I believe everyone learned the principle of pinhole imaging in middle school physics class. But the human eye is much more complex than a camera. When we observe an object, our eyes glance at it about three times per second, pausing once to fixate. When the photoreceptors of the retina sense the outline of a candle, a region known as the fovea actually records the shape of the candle in a distorted form.
So here is the question: why is the world we see neither distorted nor deformed? The answer is simple: humans have an all-purpose "converter" in the cerebral cortex that converts the signals captured by the optic nerve into a faithful image. This "converter" can be simplified into four areas, which biologists call V1, V2, V4, and IT. Neurons in area V1 respond only to a small part of the whole visual field; for example, some neurons become active when they detect a straight line. That line can be part of anything: the edge of a table, the floor, or a stroke of a character in this article. With every glance of the eyes, the activity of these neurons changes quickly.
The mystery lies in area IT, at the top of this hierarchy in the cerebral cortex. Biologists have found that when an object (such as a face) appears anywhere in the visual field, certain neurons stay in an active state. In other words, along the path from the retina to area IT, the nervous system progresses from recognizing subtle features to recognizing whole targets. If computer vision could have such a "converter" as well, the efficiency of computer recognition would improve greatly. The way the human visual system works thus offers inspiration for breakthroughs in computer vision technology.
Why computers always "see unclearly"
Although the mystery of human visual recognition is gradually being revealed, applying it directly to computers is not easy. We find that computer recognition is always "muddled": once conditions such as lighting or viewing angle change, the computer struggles to keep up with the environment and makes mistakes. For a computer, recognizing one person in different environments is harder than telling apart two people in the same environment. This is because researchers initially tried to treat the face as a template and used machine learning methods to learn the regularities of that template. However, although a face seems fixed, its appearance differs with angle, lighting, and makeup, so a simple template cannot match all faces.
Therefore, the core problem of face recognition is how to make the computer ignore the variations within the same person while still finding the differences between two people. In short: images of the same person should look similar, and images of different people should look different.
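The "same person similar, different people different" criterion can be illustrated with a small sketch. It assumes each face image has already been turned into a short list of numbers (a feature vector) and that two faces match when their vectors are close. The vectors, the names, and the threshold below are all invented for illustration; a real system would learn its features from data.

```python
import math

def distance(a, b):
    """Euclidean distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def same_person(a, b, threshold=0.5):
    """Decide 'same person' when the two vectors are closer than the threshold.

    The threshold value is an illustrative assumption, not a tuned number.
    """
    return distance(a, b) < threshold

# Made-up feature vectors: one person under two lighting conditions,
# and a second person under the first condition.
alice_indoor = [0.9, 0.1, 0.3]
alice_outdoor = [0.8, 0.2, 0.35]
bob_indoor = [0.2, 0.7, 0.9]

print(same_person(alice_indoor, alice_outdoor))  # small within-person distance
print(same_person(alice_indoor, bob_indoor))     # large between-person distance
```

Good features keep the within-person distance small even when lighting or angle changes, which is exactly what a fixed template fails to do.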
A computer system for recognizing people. Image: msra.cn
The introduction of artificial neural networks was the key to moving computer vision beyond template matching. But if humans have not yet fully grasped how the nervous system works, how can they guide the progress of computers? Artificial neural networks first emerged in the 1960s, and early theory was fixed on a simple model: the "input, hidden layer, output" model familiar from biology class. When explaining how nerves work, teachers usually say simply that a stimulus reaches the input neurons, the input neurons link to other neurons to form a "hidden layer", and the result is finally expressed through the output neurons. The strengths of the links between neurons differ, like notes of different pitch and intensity on a musical score, and an artificial neural network relies on these different link strengths to map an input pattern to an output.
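The input, hidden layer, output model described above can be sketched in a few lines. This is a minimal illustration rather than any specific historical network: the layer sizes, the sigmoid activation, and all weight values are assumptions chosen for the example.

```python
import math

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def layer(inputs, weights, biases):
    """One layer: every neuron sums its weighted inputs, adds a bias, and squashes."""
    return [sigmoid(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(inputs, w_hidden, b_hidden, w_out, b_out):
    """Map an input pattern to an output through a single hidden layer."""
    hidden = layer(inputs, w_hidden, b_hidden)
    return layer(hidden, w_out, b_out)

# Hypothetical link strengths: 2 inputs -> 3 hidden neurons -> 1 output.
W_HIDDEN = [[0.5, -0.4], [0.3, 0.8], [-0.6, 0.1]]
B_HIDDEN = [0.0, 0.1, -0.1]
W_OUT = [[1.0, -1.0, 0.5]]
B_OUT = [0.0]

output = forward([1.0, 0.0], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
print(output)
```

Changing any value in `W_HIDDEN` or `W_OUT` changes the output for the same input, which is the sense in which the link strengths carry what the network "knows".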
But this "score" is static, and it runs only from input to output, with no path in reverse. In other words, if a person stood perfectly still, a computer might be able to recognize them on this principle, but that is not possible in real life. In the late 1980s, the "back propagation algorithm" was invented for artificial neural networks; it can pass the error at the output units back toward the input units and remember it. This method allows an artificial neural network to learn statistical regularities from a large number of training samples and to make predictions about unknown events. But compared with the complex, hierarchical structure of the brain, a neural network with only a single hidden layer still appears insignificant.
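The idea of sending the output error backward and adjusting the link strengths can be sketched as follows. This is a minimal, assumption-laden example: a network with one hidden layer, sigmoid units, squared error, plain gradient descent, and a toy logical OR task. The function name, learning rate, and epoch count are illustrative choices only.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def train(samples, n_hidden=3, epochs=2000, lr=0.5, seed=0):
    """Train a one-hidden-layer network by back propagation.

    Returns the summed squared-error loss recorded in each epoch.
    """
    rng = random.Random(seed)
    n_in = len(samples[0][0])
    w_h = [[rng.uniform(-1, 1) for _ in range(n_in)] for _ in range(n_hidden)]
    b_h = [0.0] * n_hidden
    w_o = [rng.uniform(-1, 1) for _ in range(n_hidden)]
    b_o = 0.0
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x, target in samples:
            # Forward pass: input -> hidden -> output.
            h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
                 for row, b in zip(w_h, b_h)]
            y = sigmoid(sum(w * hi for w, hi in zip(w_o, h)) + b_o)
            total += 0.5 * (y - target) ** 2
            # Backward pass: the output error is sent back toward the inputs.
            delta_o = (y - target) * y * (1.0 - y)
            delta_h = [delta_o * w_o[j] * h[j] * (1.0 - h[j])
                       for j in range(n_hidden)]
            # Gradient-descent update of every link strength.
            for j in range(n_hidden):
                w_o[j] -= lr * delta_o * h[j]
                b_h[j] -= lr * delta_h[j]
                for i in range(n_in):
                    w_h[j][i] -= lr * delta_h[j] * x[i]
            b_o -= lr * delta_o
        losses.append(total)
    return losses

# Toy task (logical OR), chosen only for illustration.
OR_SAMPLES = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0),
              ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
losses = train(OR_SAMPLES)
print(losses[0], losses[-1])
```

Comparing the first and last recorded loss shows how far the repeated error feedback has moved the network toward the training samples.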
Deep neural networks: a "safeguard" for computers
In 2006, Professor Geoffrey Hinton of the University of Toronto made a breakthrough in the training of deep neural networks. He showed, on the one hand, that artificial neural networks with more hidden layers have a better ability to learn features, and on the other hand, that layer-by-layer initialization can overcome the training problems that had previously plagued researchers. The basic principle is to first give the network a good initialization using a large amount of unsupervised data, and then to fine-tune the initialized (pre-trained) network with supervised data.
Inspired by these findings, most research on face and image recognition today is based on CNNs (Convolutional Neural Networks). A CNN can be seen as a machine that scans layer by layer. The first layer detects edges, corners, and flat or uneven regions; this layer carries almost no semantic information. The second layer combines the detections of the first layer and passes the combinations to the next layer, and so on. Through repeated scanning, accuracy accumulates, and the computer moves toward the goal mentioned earlier: "same person similar, different people different".
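What a first layer that "detects edges" computes can be shown with a hand-written filter. The sketch below slides a 3x3 kernel over a tiny image; the Sobel-style kernel and the toy image are assumptions for illustration, whereas a trained CNN would learn its own kernels from data.

```python
def convolve2d(image, kernel):
    """Slide the kernel over the image (no padding) and take a dot product at each position.

    As in most CNN libraries, the kernel is not flipped (cross-correlation).
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[r + i][c + j] * kernel[i][j]
                 for i in range(kh) for j in range(kw))
             for c in range(out_w)]
            for r in range(out_h)]

# Sobel-style kernel that responds to dark-to-bright changes from left to right.
VERTICAL_EDGE = [[-1, 0, 1],
                 [-2, 0, 2],
                 [-1, 0, 1]]

# Toy 4x6 image: dark (0) on the left half, bright (1) on the right half.
IMAGE = [[0, 0, 0, 1, 1, 1] for _ in range(4)]

response = convolve2d(IMAGE, VERTICAL_EDGE)
for row in response:
    print(row)  # each row is [0, 4, 4, 0]
```

The response is zero over the flat regions and strong only where the window straddles the boundary, which is why such a layer carries local structure but almost no semantic information.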
The formal name of a CNN is "a neural network with a deep convolutional structure". Recognizing objects with such a network can be divided into two steps: image classification and object detection. In the first stage, the computer identifies the category of an object, such as a person, an animal, or some other item. In the second stage, the computer obtains the precise location of the object in the image. The two stages answer the questions "what is it" and "where is it", respectively. The dog breed recognition feature of Xiaoice ("Little Ice"), Microsoft's intelligent chat robot, is a typical application of CNNs. First, a deep convolutional network with several layers is built. The first layer, much like the human visual system, detects small edges or small patches; the second layer combines these small structures into larger ones, such as a dog's legs and eyes; the layers above organize these in turn until the breed of the dog can finally be identified. Second, a large number of dog pictures are fed into this deep convolutional network to train the system's recognition accuracy.
In 2013, researchers at the University of California, Berkeley, proposed an object detection method called R-CNN (Region-based CNN), which achieves very high recognition accuracy. It divides each image into multiple windows or sub-regions and applies the neural network to classify each sub-region. Its main drawback is that the algorithm is too slow for real-time detection: to detect several objects in one image, the neural network may need to be run thousands of times.
At Microsoft Research Asia, researchers in the Visual Computing Group implemented a new algorithm called "spatial pyramid pooling" (Spatial Pyramid Pooling, SPP). It runs the computation over the whole image only once and recognizes each region from the resulting internal features, instead of processing every region from scratch. With this new algorithm, object detection becomes up to one hundred times faster with no loss of accuracy. In the ImageNet Large Scale Visual Recognition Challenge 2014, the SPP-based system from Microsoft Research Asia took third place in classification and second place in detection. This technology has now been successfully integrated into OneDrive, Microsoft's cloud service. With it, OneDrive can automatically add tags to uploaded images, and users can enter keywords to search for the corresponding pictures.
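The pooling step behind SPP can be sketched on a single-channel feature map. The version below is a simplified illustration, not the published implementation: it assumes max pooling, a hypothetical pyramid of 1x1, 2x2, and 4x4 grids, and plain Python lists in place of real multi-channel convolutional features. The point it shows is that regions of different sizes, cropped from one shared feature map, all yield vectors of the same length.

```python
def ceil_div(a, b):
    """Integer division rounded up."""
    return -(-a // b)

def spatial_pyramid_pool(feature_map, levels=(1, 2, 4)):
    """Max-pool a 2D feature map into a fixed-length vector.

    For each level n, the map is split into an n x n grid of bins and the
    maximum of each bin is kept, so the output length is the sum of n * n.
    """
    height, width = len(feature_map), len(feature_map[0])
    pooled = []
    for n in levels:
        for by in range(n):
            r0, r1 = (by * height) // n, ceil_div((by + 1) * height, n)
            for bx in range(n):
                c0, c1 = (bx * width) // n, ceil_div((bx + 1) * width, n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled

def crop(feature_map, top, left, bottom, right):
    """Cut a candidate window out of the shared feature map."""
    return [row[left:right] for row in feature_map[top:bottom]]

# Hypothetical 6x8 feature map, computed once for the whole image.
FEATURES = [[r * 8 + c for c in range(8)] for r in range(6)]

whole = spatial_pyramid_pool(FEATURES)
window = spatial_pyramid_pool(crop(FEATURES, 1, 2, 4, 7))  # a 3x5 window
print(len(whole), len(window))  # both 21
```

Because every window produces a vector of the same length, the expensive convolutional layers need to run only once per image, and only the cheap pooling and classification steps are repeated per window.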
Sample images from the ImageNet Large Scale Visual Recognition Challenge 2014. Photo: image-net.org
The future: computer vision dancing with humanity
If only the face is considered, leaving aside hair and other parts of the body, human accuracy is about 97.5%, while computers can now reach more than 99%. Does this mean that computers are better than humans? Not necessarily, because we do not look only at the face: build and posture also help us recognize each other. In real environments with complex lighting, a person can choose among these cues more intelligently to support a decision, while computers are much weaker in this respect. However, when the amount of data is large, or when the faces are unfamiliar, computers are more powerful.
Humans continually invent new technologies to replace old ones, so as to complete tasks more efficiently and economically. The same is true in computer vision. We develop more convenient face recognition access control systems to replace manually typing user names and passwords; the face recognition system that Xbox One built around an infrared camera is popular with customers.
Beyond recognition tasks that humans can also do, computer vision can be applied in areas beyond human abilities and senses, and to tedious work: pressing the shutter automatically at the instant of a smile, helping drivers park, capturing body posture to interact with computer games, accurately welding parts and inspecting for defects in factories, helping warehouses sort goods during the busy shopping season, letting a robot vacuum clean the rooms while we are away from home, automatically recognizing and classifying all our digital photos...
Perhaps in the near future, electronic scales in supermarkets will identify the kinds of vegetables placed on them; access control systems will distinguish a friend carrying a gift from a burglar carrying a crowbar; wearable devices and mobile phones will help us identify any object in the lens and search for related information. Even more wonderfully, computer vision can go beyond human senses: perceiving the world through sound waves and infrared, watching the movement of clouds to predict the weather, monitoring the movement of vehicles to schedule traffic, and even breaking through our imagination to help theoretical physicists analyze the motion of objects in spaces of more than three dimensions.
Once upon a time, humans recorded a magnificent history with their own eyes. In the future, we hope to gradually open the eyes of the computer, so that it can understand this colorful world and at the same time help humans carry out their work and life more efficiently and intelligently. We look forward to a world in which computer vision dances with humanity, a world that is not only more colorful but also wiser.