
Deep Learning and Machine Vision

Artificial intelligence is one of humanity's most beautiful dreams, on a par with interstellar travel and immortality. We want to build machines that perceive the external world the way people do, for example by seeing it.

In the 1950s, the mathematician Alan Turing proposed a criterion for judging whether a machine possesses intelligence: the Turing test. A machine is placed in one room and a human tester in another; the tester chats with the occupant of the other room without knowing in advance whether it is a person or a machine. If, after the conversation, the tester cannot tell whether he has been talking to a human or to a machine, the machine passes the Turing test; that is to say, it has perception comparable to a human being's.

However, in the more than fifty years from the Turing test to the beginning of this century, scientists proposed countless machine learning algorithms in an attempt to give computers human-level intelligence, and it was not until the breakthrough of deep learning in 2006 that a glimmer of hope for a solution appeared.

Deep Learning, the Center of Attention

In many academic fields, deep learning is often 20-30% better than non-deep-learning algorithms. Many large companies have steadily begun to invest in this technology and to build their own deep learning teams. The biggest investor is Google, which disclosed the Google Brain project in June 2012. In January 2014 Google acquired DeepMind, and in March 2016 DeepMind's AlphaGo defeated the Korean 9-dan Go player Lee Sedol in a challenge match, proving that an algorithm built with deep learning can beat the world's strongest players.

On the hardware side, Nvidia started out making graphics chips, but from around 2006-2007 it has pushed its GPU chips toward general-purpose computing, which suits the huge amounts of simple, repetitive computation in deep learning especially well. At present, many people choose Nvidia's CUDA toolkit for developing deep learning software.

Since 2012, Microsoft has applied deep learning to machine translation and Chinese speech synthesis. Behind the artificial intelligence of Xiaona (the Chinese incarnation of Cortana) sits a set of data-driven algorithms for natural language processing and speech recognition.

Baidu announced the establishment of the Baidu Research Institute in 2013, its most important arm being the Institute of Deep Learning, and recruited the well-known scientist Dr. Yu Kai. Yu Kai has since left Baidu and founded another company, Horizon, which is engaged in developing deep learning algorithms.

Facebook and Twitter have also pursued deep learning research. The former, together with New York University professor Yann LeCun, built its own deep learning lab, and in 2015 Facebook announced the open-sourcing of its deep learning framework based on Torch. Twitter acquired Madbits in July 2014 to provide users with high-precision image retrieval services.

Computer Vision Before Deep Learning

Internet giants value deep learning not for academic reasons, but for the enormous market it can bring. Why, then, before deep learning appeared, could traditional algorithms not reach its accuracy?

Before deep learning came along, a visual algorithm could be roughly divided into the following five steps: feature perception; image preprocessing; feature extraction; feature selection; and inference, prediction, and recognition. In the early days of machine learning, the statistical machine learning community that dominated the field paid little attention to features.

In my opinion, computer vision is essentially the application of machine learning to the visual domain, so when computer vision adopts these machine learning methods, the first four steps have to be designed by hand.

But that is a difficult task for anyone. Traditional recognition methods separate feature extraction from classifier design and only combine the two at application time. For example, if the input is an image of a motorcycle, there must first be a feature expression or feature extraction step, and only then are the expressed features fed into a learning algorithm for classification.

Over the past 20 years, many excellent feature operators have appeared. The most famous is the SIFT operator, the so-called scale- and rotation-invariant operator; it is widely used in image matching, particularly in structure-from-motion applications, where it has seen some real success. Another is the HOG operator, which extracts fairly robust object edges and plays an important role in object detection.

Operators like these, including Textons, Spin Image, RIFT, and GLOH, dominated visual algorithms until deep learning was born, or rather until deep learning became truly popular.

A Few Successful (or Half-Successful) Examples

Combining these features with particular classifiers produced a number of successful or half-successful examples, which basically met the requirements of commercialization without ever becoming fully commercial.

First came the fingerprint recognition algorithms of the 1980s and 1990s, which are by now very mature. The usual approach is to find key points in the fingerprint pattern, points with special geometric characteristics, and then compare the key points of two fingerprints to decide whether they match.

Next came the Haar-feature-based face detection algorithm of 2001 (the Viola-Jones detector), which achieved real-time face detection under the hardware conditions of the day; the face detection in today's phone cameras is all based on it or its variants.

The third is object detection based on HOG features; combined with an SVM classifier, this gives the famous DPM algorithm, which surpassed all other object detection algorithms of its time and achieved good results.

But such successes are too few, because designing features by hand demands a great deal of experience: you need a deep understanding of the field and of the data, and the designed features then require a great deal of tuning. To put it plainly, it also takes a bit of luck.

Another difficulty is that, beyond designing features by hand, you also need a suitable classifier on top of them; designing the features and choosing the classifier at the same time, so that the two jointly reach the optimum, is nearly impossible.

Deep Learning from the Perspective of Bionics

If we neither design features by hand nor choose the classifier ourselves, is there another way? Can the features and the classifier be learned at the same time? That is: the input to the model is simply a picture, and the output is its label. For example, if you input a celebrity's photo, the output label is a 50-dimensional vector (if we are recognizing among 50 people) in which the entry corresponding to that celebrity is 1 and every other entry is 0.
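
As a minimal sketch of this label encoding (NumPy assumed; the celebrity's index is hypothetical):

    import numpy as np

    def one_hot(index, num_classes=50):
        """Label vector for recognition among `num_classes` people:
        1 at the target person's position, 0 everywhere else."""
        label = np.zeros(num_classes)
        label[index] = 1.0
        return label

    print(one_hot(6))   # e.g. the 7th of the 50 celebrities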

This setup is consistent with the findings of research on the human brain.

The 1981 Nobel Prize in Physiology or Medicine went in part to the neurobiologist David Hubel (shared with Torsten Wiesel), whose main achievement was discovering the information-processing mechanism of the visual system and proving that the brain's visual cortex is hierarchical. Two ideas stand out: human visual function is abstract, and it is iterative. Abstraction means turning very concrete image elements, that is, the raw pixel-level light information, into meaningful concepts; these concepts are then iterated upward, becoming ever more abstract concepts that humans can perceive.

A pixel itself carries no abstraction, but the human brain connects pixels into edges, and an edge is already a more abstract concept than a pixel; edges then form a sphere, and the sphere becomes a balloon. Through this process of abstraction, the brain finally knows that it is looking at a balloon.

Simulating the brain's recognition of a face is the same abstract, iterative process: from pixels at the first layer, to edges at the second layer, then to parts of the face, and finally to the whole face.

For example, when we see the motorcycle in a picture, it may take the brain only a few microseconds, yet this passes through a great many iterations of neuron-level abstraction. What the computer sees at first is not a motorcycle at all, but just the different numbers in the three RGB channels of the image.

A so-called feature, or visual feature, aggregates these values statistically or non-statistically so that they express parts of the motorcycle or the motorcycle as a whole. Before deep learning became popular, most image features were designed on this basis: aggregate pixel-level information over a region in a way that helps the classification stage that follows.

If we want to simulate the human brain fully, we also need to simulate its process of abstraction and recursive iteration, abstracting information from the trivial pixel level up to the conceptual level of "category", where it becomes humanly meaningful.

The Concept of Convolution

The convolutional neural network (CNN) commonly used in computer vision is a fairly faithful simulation of this aspect of the human brain.

What is convolution? Convolution is an operation that combines two functions to produce a new one: an integral in continuous space, a weighted sum in discrete space. In computer vision, a convolution can be regarded as an abstraction process: it statistically summarizes the information in a small region.

For example, for a picture of Einstein, I can learn n different convolution kernels and use each to summarize a region. The summarization can be done in different ways, for instance weighting the center more heavily, or weighting the surroundings, which yields a variety of summarization functions; the goal is to learn many such region summaries simultaneously.

The process from the input image to the final convolution output works as follows. First, the learned convolution kernels scan the image; each kernel then generates a scan response, called a response map or feature map. With multiple kernels there are multiple feature maps. In other words, from the initial input image (three RGB channels) we can obtain, say, a 256-channel feature map, because there are 256 kernels and each kernel represents one kind of statistical abstraction.
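
A minimal sketch of this scanning process, assuming a toy 32×32 grayscale image and one hand-picked edge kernel; in a real CNN the kernels are learned, and 256 of them would produce 256 such maps:

    import numpy as np

    def conv2d(image, kernel):
        """'Valid' 2-D convolution (cross-correlation): slide the kernel
        over the image and sum the element-wise products at each spot."""
        kh, kw = kernel.shape
        oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
        out = np.empty((oh, ow))
        for i in range(oh):
            for j in range(ow):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    img = np.random.rand(32, 32)                  # toy grayscale image
    edge_kernel = np.array([[1., 0., -1.],        # responds to vertical edges
                            [1., 0., -1.],
                            [1., 0., -1.]])
    feature_map = conv2d(img, edge_kernel)        # one 30x30 response map
    print(feature_map.shape)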

In a convolutional neural network, besides the convolution layer there is another operation called pooling. The statistical character of pooling is even clearer: it takes the average or the maximum over a small region.

As a result, if the input is a response feature map with, say, 256 channels, each of those feature maps passes through the pooling layer, and we obtain 256 feature maps that are smaller than the originals.

In the example above, the pooling layer takes the maximum over each 2×2 region and writes it to the corresponding position of the output feature map. If the input is 100×100, the output is 50×50: the feature map is halved in each dimension, and the information kept is the maximum within each original 2×2 region.
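
A sketch of 2×2 max pooling under the same NumPy assumptions, reproducing the 100×100 to 50×50 halving described above:

    import numpy as np

    def max_pool2x2(feature_map):
        """2x2 max pooling with stride 2: keep only the largest value
        in each non-overlapping 2x2 block."""
        h, w = feature_map.shape
        fm = feature_map[:h - h % 2, :w - w % 2]   # drop odd edge rows/cols
        return fm.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

    fm = np.random.rand(100, 100)
    print(max_pool2x2(fm).shape)   # (50, 50)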

A Worked Example: the LeNet Network

The "Le", as the name suggests, refers to LeCun, one of the leading figures in artificial intelligence. This network is the earliest prototype of deep learning networks: where previous networks were shallow, it was deep. LeNet was invented in 1998, when LeCun was at AT&T's lab; he used the network for letter recognition, with very good results.


How is it constructed? The input is a 32×32 grayscale image. The first convolution layer produces six 28×28 feature maps; a pooling layer then yields six 14×14 feature maps; a further convolution layer produces sixteen 10×10 feature maps; and a second pooling layer yields sixteen 5×5 feature maps.

From the final sixteen 5×5 feature maps, three fully connected layers produce the final output, i.e. the output in label space. Because the design only needs to recognize the digits 0 through 9, the output space is 10-dimensional; to recognize the 10 digits plus 26 letters in both upper and lower case, the output space would be 62-dimensional. Whichever dimension of that 62-dimensional vector holds the largest value, the corresponding letter or digit is the predicted result.
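
A sketch of this architecture in PyTorch, assuming PyTorch is available; the 120- and 84-unit fully connected layers follow the classic LeNet-5 design, while ReLU and max pooling are modern stand-ins for the original activations and subsampling:

    import torch
    import torch.nn as nn

    class LeNet(nn.Module):
        """LeNet-style network matching the sizes in the text:
        32x32 gray input -> 6@28x28 -> pool -> 6@14x14
        -> 16@10x10 -> pool -> 16@5x5 -> three FC layers -> 10 labels."""
        def __init__(self, num_classes=10):
            super().__init__()
            self.conv1 = nn.Conv2d(1, 6, kernel_size=5)   # 32x32 -> 28x28
            self.conv2 = nn.Conv2d(6, 16, kernel_size=5)  # 14x14 -> 10x10
            self.pool = nn.MaxPool2d(2)                   # halves each side
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, num_classes)
            self.act = nn.ReLU()

        def forward(self, x):
            x = self.pool(self.act(self.conv1(x)))
            x = self.pool(self.act(self.conv2(x)))
            x = x.flatten(1)
            return self.fc3(self.act(self.fc2(self.act(self.fc1(x)))))

    logits = LeNet()(torch.randn(1, 1, 32, 32))
    print(logits.shape)   # torch.Size([1, 10]); num_classes=62 for digits plus letters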

The Straw That Broke the Camel's Back

From 1998 onward, deep learning waited some 15 years for its moment; its results during that period were good, yet it remained marginalized. By 2012, deep learning algorithms had begun to excel in some fields, and AlexNet became the straw that broke the camel's back.

AlexNet, developed by several scientists at the University of Toronto, achieved outstanding results in the ImageNet competition, where its recognition accuracy surpassed all shallow methods. From then on, people realized that the era of deep learning had truly arrived: some took it to other applications, while others set about devising new network structures.

AlexNet's structure is in fact quite simple: it is essentially an enlarged LeNet. The input is a 224×224 image, which passes through several convolution layers and several pooling layers, and finally connects to two fully connected layers to reach the final label space.
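
For readers who want to inspect the structure themselves, torchvision ships an AlexNet implementation; a usage sketch, assuming torchvision is installed:

    import torch
    from torchvision.models import alexnet

    model = alexnet()                  # untrained architecture only
    x = torch.randn(1, 3, 224, 224)    # one 224x224 RGB image
    print(model(x).shape)              # torch.Size([1, 1000]), the ImageNet label space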

Last year, researchers worked out how to visualize the features learned inside such a deep network. What, then, does AlexNet learn? The first layer learns simple color blobs and edges; the middle layers begin to learn texture features; and at the higher levels close to the classifier, the shapes of whole objects can clearly be seen.

At the very last layer, the classification layer, the features correspond to whole objects in their various poses, with different objects exhibiting different poses.

It can be said that whether for recognizing faces, vehicles, elephants, or chairs, the network first learns edges, then parts of the object, and then, at a higher level, abstracts the object as a whole. The entire convolutional neural network mimics the human process of abstraction and iteration.

Why did it come back 20 years later?

We cannot help asking: the design of a convolutional neural network does not seem especially complicated, and a fairly decent prototype has existed since 1998; neither the algorithm nor its theoretical justification advanced much in the meantime. Why, then, did it take nearly 20 years for convolutional neural networks to come back and take over the mainstream?

This question has little to do with the technology of the convolutional neural network itself; I personally think it is tied to other, objective factors.

Firstly, if a convolutional neural network is too shallow, its recognition ability often falls short of ordinary shallow models such as SVMs or boosting. But making it deep requires huge amounts of training data, otherwise the overfitting familiar from machine learning is inevitable. It was from around 2006-2007 that the Internet began to generate image data of every kind in enormous quantities.

The other condition is computing power. A convolutional neural network demands huge amounts of repetitive, parallelizable computation; on a single-core, low-power CPU, training a deep CNN was impossible. Only as GPU computing power grew did training convolutional neural networks on big data become feasible.

The last factor is people. Convolutional neural networks had a band of scientists (such as LeCun) who stuck with them through the years when they sat in obscurity, drowned out by the flood of shallow methods; only because of that persistence do we now see convolutional neural networks in the mainstream.

The Application of Deep Learning in Vision

Successful applications of deep learning in computer vision include face recognition, image question answering, object detection, and object tracking.

Face recognition

Face verification in face recognition means taking one face and comparing it against the faces in a database, or taking two faces at the same time and judging whether they belong to the same person.

In this respect Professor Tang Xiaoou's group is ahead of the field; their DeepID algorithm performs excellently on LFW (the Labeled Faces in the Wild benchmark). They also use convolutional neural networks, but when comparing, features are extracted at different locations of the two faces and matched against each other to produce the final result. The latest DeepID-3 algorithm reaches 99.53% accuracy on LFW, essentially on par with human recognition.
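
The verification step ultimately reduces to comparing two feature vectors. A minimal sketch, where embed() is a hypothetical stand-in for a DeepID-style feature extractor and the threshold is illustrative, not a published value:

    import numpy as np

    def embed(face_image):
        """Stand-in for a CNN feature extractor; assumed to return an
        L2-normalized feature vector for one face crop."""
        v = np.random.rand(160)              # placeholder for real features
        return v / np.linalg.norm(v)

    def same_person(face_a, face_b, threshold=0.7):
        """Cosine similarity of embeddings; the threshold would be tuned
        on a validation set in practice."""
        sim = float(embed(face_a) @ embed(face_b))
        return sim >= threshold, sim

    match, score = same_person("face_a.jpg", "face_b.jpg")
    print(match, round(score, 3))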

Image Question Answering

This topic arose around 2014: present a picture together with a question about it, and let the computer answer. For example, given a picture of an office near the sea and the question "What's behind the desk?", the neural network's output should be "chairs and windows".

This application introduced the LSTM network, a specially designed neural unit with a degree of memory: its output at one time step is fed back as input at the next, which makes it well suited to language and other time-series settings, because when we read a sentence, our understanding of the later words builds on our memory of the earlier ones.

Image question answering combines a convolutional neural network with LSTM units. The LSTM's output should be the desired answer, and its inputs are its own output from the previous moment, the features of the image, and each successive word of the question.
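
A structural sketch of such a model, with made-up vocabulary and layer sizes and a generic CNN encoder standing in for whatever feature extractor the original systems used:

    import torch
    import torch.nn as nn

    class ImageQA(nn.Module):
        """Toy CNN+LSTM question-answering model: image features and
        question-word embeddings feed an LSTM whose final state predicts
        an answer word. All sizes are illustrative, not from a paper."""
        def __init__(self, vocab=1000, answers=100, dim=256):
            super().__init__()
            self.cnn = nn.Sequential(                 # stand-in image encoder
                nn.Conv2d(3, 16, 3, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(16, dim))
            self.embed = nn.Embedding(vocab, dim)     # question-word embeddings
            self.lstm = nn.LSTM(dim, dim, batch_first=True)
            self.out = nn.Linear(dim, answers)

        def forward(self, image, question_ids):
            img = self.cnn(image).unsqueeze(1)        # (B, 1, dim)
            words = self.embed(question_ids)          # (B, T, dim)
            seq = torch.cat([img, words], dim=1)      # image first, then words
            _, (h, _) = self.lstm(seq)
            return self.out(h[-1])                    # answer logits

    model = ImageQA()
    logits = model(torch.randn(2, 3, 64, 64), torch.randint(0, 1000, (2, 5)))
    print(logits.shape)   # torch.Size([2, 100])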

Object Detection

Region CNN (R-CNN)

Deep learning has also achieved very good results in object detection. The basic idea of the 2014 Region CNN (R-CNN) algorithm is to first extract candidate object regions from the image with a non-deep method, and then to use a deep learning algorithm to determine, for each candidate region, the object's class and precise position.

Why extract candidate regions with a non-deep method first? Because in object detection a sliding-window search must account for windows of different sizes, aspect ratios, and positions; if every window were pushed through the deep network, the time cost would be unacceptable.

So a compromise called Selective Search is used: image regions that cannot possibly be objects are discarded first, leaving only about 2,000 candidate regions for the deep network to judge. R-CNN reached an AP of 58.5, nearly double what came before. Unfortunately, R-CNN is very slow, taking 10 to 45 seconds per image.
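
In outline the pipeline looks as follows; selective_search(), crop_and_resize(), and cnn_classifier() are hypothetical stand-ins for the real components:

    def rcnn_detect(image, selective_search, crop_and_resize, cnn_classifier):
        """R-CNN-style detection sketch: propose ~2000 regions with a
        non-deep method, then let a CNN judge each cropped region."""
        detections = []
        for box in selective_search(image):        # ~2000 candidate boxes
            crop = crop_and_resize(image, box)     # fixed-size CNN input
            label, score, refined_box = cnn_classifier(crop, box)
            if label != "background":
                detections.append((label, score, refined_box))
        return detections                          # non-max suppression would follow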

Faster R-CNN Method

Then at last year's NIPS we saw Faster R-CNN, a greatly accelerated version of R-CNN that reaches seven frames per second, i.e. seven images processed per second. The trick is not to judge image crops one by one but to feed the whole image into the deep network and let the network itself decide where the objects are, what their bounding boxes are, and what classes they belong to.

The number of deep network passes drops from about 2,000 to one, so the speed improves enormously.

Faster R-CNN proposes letting the deep network itself generate the candidate object regions, and then using the same deep network to judge whether each region is background, while simultaneously classifying it and regressing its bounding box.

Faster R-CNN is both fast and accurate: its detection AP on VOC2007 reaches 73.2, while its speed is two to three hundred times that of the original R-CNN.

YOLO

The YOLO network, proposed by Facebook last year, is also used for object detection; at its fastest it reaches 155 frames per second, fully real-time. It feeds the entire image into the neural network and lets the network itself decide where objects might be and what they might be, while cutting the number of candidate regions from the roughly 2,000 of the R-CNN line down to 98.

It also does away with the RPN structure that Faster R-CNN had introduced to replace Selective Search: YOLO has no RPN at all and instead directly predicts the classes and locations of objects.

The price YOLO pays is a drop in accuracy: only 52.7 AP at 155 frames per second, rising to 63.4 at 45 frames per second.
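
A sketch of how such a grid output is decoded, assuming the paper's 7×7 grid with 2 boxes per cell (hence the 98 boxes) and the 20 VOC classes; the prediction tensor here is random, standing in for real network output:

    import numpy as np

    S, B, C = 7, 2, 20                       # grid size, boxes per cell, classes
    pred = np.random.rand(S, S, B * 5 + C)   # per cell: B x (x, y, w, h, conf) + class probs

    boxes = []
    for row in range(S):
        for col in range(S):
            cell = pred[row, col]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5: b * 5 + 5]
                score = conf * class_probs.max()   # class-specific confidence
                boxes.append((row, col, x, y, w, h, score, int(class_probs.argmax())))

    print(len(boxes))   # 98 candidate boxes, as in the text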

SSD

The latest algorithm on arXiv is called Single Shot MultiBox Detector, or SSD.

It is a heavily improved version of YOLO: it absorbs the lesson of YOLO's accuracy drop while retaining the speed, reaching 58 frames per second at an accuracy of 72.1. That is about eight times the speed of Faster R-CNN at comparable accuracy.

Object tracking

So-called tracking means locking onto an object of interest in the first frame of a video and having the computer follow it, no matter how it rotates or shakes, even when it hides behind bushes.

Deep learning also brings a marked improvement to tracking. The DeepTrack algorithm, which my colleagues and I proposed at the Australian Institute of Information Technology, was the first published work to do online tracking with deep learning, and at the time it surpassed all shallow algorithms.

More and more deep learning tracking algorithms have since been proposed. At ICCV 2015 last December, Ma Chao's Hierarchical Convolutional Features algorithm set a new record on the benchmark data. Instead of updating a deep network online, it pre-trains a large network so that the network learns what is and is not an object.

The big network is then run on the tracking video, its features at different layers are analyzed, and a mature shallow tracking algorithm performs the actual tracking. This exploits deep learning's strength in feature learning while keeping the speed advantage of shallow methods; the result runs at 10 frames per second with record-breaking accuracy.
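
Schematically, the hybrid works as below; extract_conv_features() and the shallow tracker object are hypothetical stand-ins for the pre-trained CNN and a fast shallow tracker such as a correlation filter:

    def track(video_frames, init_box, extract_conv_features, shallow_tracker):
        """Hybrid tracking sketch: a pre-trained CNN supplies features,
        a fast shallow tracker does the frame-to-frame matching."""
        shallow_tracker.init(extract_conv_features(video_frames[0], init_box))
        boxes = [init_box]
        for frame in video_frames[1:]:
            features = extract_conv_features(frame, boxes[-1])  # deep features
            boxes.append(shallow_tracker.update(features))      # shallow matching
        return boxes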

The newest tracking result builds on Hierarchical Convolutional Features: MDNet, proposed by a Korean research team, combines the two earlier deep approaches. It first learns offline, not from general object detection or ImageNet data, but from tracking videos themselves; then, when the network is actually used, part of it is updated online. In this way the network receives extensive offline training yet can adapt flexibly online.

Deep Learning on Embedded Systems

Returning to ADAS, the main field of Wisdom Eye Technology: it can make full use of deep learning algorithms, but it places high demands on the hardware platform. Putting a full computer in a car is not feasible, since power consumption is a problem and the market would be hard-pressed to accept it.

At present, deep learning computation is mainly done in the cloud: the front end takes the pictures and the back-end cloud platform processes them. For ADAS, however, such long data round-trips are unacceptable: the accident may already have happened before the result comes back from the cloud.

Could NVIDIA's embedded platform be the answer? Its computing power far exceeds that of all other mainstream embedded platforms and approaches that of mainstream top-end desktop CPUs such as the i7. The task of Wisdom Eye Technology is therefore to make deep learning algorithms run in real time within the limited resources of an embedded platform, with almost no loss of accuracy.

Firstly, the network is slimmed down, possibly at the level of network structure; since the recognition scenarios differ, the functionality can be trimmed accordingly. On top of that, the fastest deep detection algorithm is combined with the fastest deep tracking algorithm, plus some purpose-built scene-analysis algorithms; the point of combining the three is to reduce both the amount of computation and the size of the detection search space. In this way deep learning algorithms run on limited resources with very little loss of accuracy.
