LeNet, as its name implies, is named after LeCun, a leading figure in artificial intelligence. This network is the earliest prototype of a deep learning network: the networks that came before it were shallow, and this one was deep. LeNet was invented in 1998, when LeCun was at AT&T's research lab, and he used it to recognize handwritten digits, with very good results.
What does it look like? The input is a 32x32 grayscale image. The first layer produces six 28x28 feature maps through a set of convolution kernels. A pooling layer then yields six 14x14 feature maps. Another convolution layer produces sixteen 10x10 feature maps, and a further pooling layer yields sixteen 5x5 feature maps.
Starting from those last sixteen 5x5 feature maps, three fully connected layers produce the final output, which lives in label space. Since the design only needs to recognize the digits 0 through 9, the output space is 10-dimensional; if we wanted to recognize the 10 digits plus the 26 letters in both upper and lower case, the output space would be 62-dimensional. Whichever dimension of that vector carries the largest value gives the predicted digit or letter.
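To make the layer stack concrete, here is a minimal sketch in PyTorch. It follows the shapes described above but takes some liberties (ReLU activations and max pooling; the original LeNet-5 used tanh-style activations and a form of average pooling), so treat it as an illustration rather than a faithful reproduction.

```python
import torch
import torch.nn as nn

class LeNet(nn.Module):
    """Sketch of the LeNet-style stack described above (assumes ReLU and max
    pooling; the original 1998 LeNet-5 used tanh and average pooling)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 1x32x32 -> 6x28x28
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 6x14x14
            nn.Conv2d(6, 16, kernel_size=5),  # -> 16x10x10
            nn.ReLU(),
            nn.MaxPool2d(2),                  # -> 16x5x5
        )
        self.classifier = nn.Sequential(      # three fully connected layers
            nn.Linear(16 * 5 * 5, 120), nn.ReLU(),
            nn.Linear(120, 84), nn.ReLU(),
            nn.Linear(84, num_classes),       # 10 for digits, 62 for digits + letters
        )

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        return self.classifier(x)

logits = LeNet()(torch.randn(1, 1, 32, 32))   # predicted class = logits.argmax(1)
```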
The straw that broke the camel's back
For roughly fifteen years after 1998, deep learning stayed on the sidelines: it produced few headline results and remained a marginal approach. By 2012, deep learning algorithms had started to achieve good results in several areas, and AlexNet was the straw that broke the camel's back.
AlexNet, developed by a group of scientists at the University of Toronto, achieved excellent results in the ImageNet competition, beating every shallow method of the time. Since then, people have realized that the era of deep learning has finally arrived; some have applied it to new problems, while others have set about designing new network structures.
AlexNet's structure is actually very simple; it is essentially a scaled-up LeNet. The input is a 224x224 image, which passes through several convolution layers and several pooling layers and finally connects to fully connected layers that reach the final label space.
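For readers who want to poke at the network rather than re-implement it, torchvision ships a close variant of AlexNet; the snippet below is an assumption about tooling, not the authors' original code, and simply runs a 224x224 input through that variant.

```python
import torch
from torchvision import models

# torchvision's AlexNet variant (close to, but not identical to, the 2012 network).
model = models.alexnet(weights=None)   # pass pretrained ImageNet weights if desired
x = torch.randn(1, 3, 224, 224)        # stand-in for a 224x224 RGB image
logits = model(x)
print(logits.shape)                    # torch.Size([1, 1000]) -- ImageNet label space
```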
Last year, researchers worked out how to visualize the features learned by deep networks. So what does AlexNet learn? In the first layer the features are simple color patches and edges; in the middle layers the network begins to learn texture features; in the higher layers, close to the classifier, the features clearly capture the shapes of objects.
In the final classification layer, the features correspond to whole objects seen in their various poses.
Whether the task is recognizing faces, vehicles, elephants or chairs, the network first learns edges, then object parts, and then, at a higher level, an abstraction of the whole object. The entire convolutional neural network mimics the human process of abstraction and iteration.
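One simple way to see the first of these stages for yourself is to plot the first-layer convolution kernels of a pretrained network as small images; with ImageNet weights they resemble the edges and color patches described above. A sketch, assuming a recent torchvision with pretrained AlexNet weights and matplotlib available:

```python
import matplotlib.pyplot as plt
from torchvision import models

# First-layer kernels are small enough (11x11x3) to be viewed directly as RGB images.
model = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
kernels = model.features[0].weight.detach()                 # shape (64, 3, 11, 11)
kernels = (kernels - kernels.min()) / (kernels.max() - kernels.min())

fig, axes = plt.subplots(8, 8, figsize=(8, 8))
for ax, k in zip(axes.flat, kernels):
    ax.imshow(k.permute(1, 2, 0))                           # CHW -> HWC for display
    ax.axis("off")
plt.show()
```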
Why did it come back 20 years later?
We cannot help asking: the design of a convolutional neural network does not seem particularly complicated, and a fairly decent prototype already existed in 1998, while neither the algorithm nor its theoretical grounding changed much in the years that followed. So why did it take nearly twenty years for convolutional neural networks to come back and take over the mainstream?
This question has little to do with the technology of convolutional neural networks themselves; I personally think it comes down to other, more practical factors.
First, data. If a convolutional neural network is too shallow, its recognition ability is often no better than that of ordinary shallow models such as SVMs or boosting; but if it is made deep, it needs a large amount of data to train on, otherwise overfitting is inevitable. Only from around 2006-2007 did the Internet begin to generate image data in huge quantities.
The second condition is computing power. Convolutional neural networks require a large amount of repetitive, parallelizable computation, and training a deep network was simply not feasible when CPUs had only a single core and little computing power. As GPU computing power grew, training convolutional neural networks on large datasets became possible.
The last factor is people. Convolutional neural networks had a handful of scientists, LeCun among them, who kept working on them through the long quiet years when the field was drowned out by shallow methods; only because of that persistence can we now see convolutional neural networks taking over the mainstream.
Applications of Deep Learning in Vision
Successful applications of deep learning in computer vision include face recognition, image question answering, object detection and object tracking.
Face recognition
Face matching in face recognition means taking one face and comparing it against the faces in a database, or taking two faces at once and judging whether they belong to the same person.
In this area Professor Tang Xiaoou's group is ahead of the field; their DeepID algorithm performs very well on LFW. They also use convolutional neural networks, but during comparison the two faces are described by features extracted at different locations, which are then compared against each other to produce the final result. The latest DeepID-3 algorithm reaches 99.53% accuracy on LFW, essentially on par with human recognition.
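DeepID's exact pipeline is not reproduced here, but the verification step in most such systems reduces to the same recipe: embed each aligned face crop with a CNN and compare the embeddings. A generic sketch, in which `embed_net`, the preprocessing, and the threshold are all placeholders rather than DeepID's actual components:

```python
import torch
import torch.nn.functional as F

def verify(embed_net, face_a, face_b, threshold=0.7):
    """Generic face-verification sketch (not the DeepID implementation):
    embed each aligned face crop with a CNN, then compare by cosine similarity.
    `embed_net`, the crop preprocessing, and `threshold` are all assumptions."""
    with torch.no_grad():
        ea = F.normalize(embed_net(face_a.unsqueeze(0)), dim=1)
        eb = F.normalize(embed_net(face_b.unsqueeze(0)), dim=1)
    similarity = (ea * eb).sum().item()      # cosine similarity in [-1, 1]
    return similarity, similarity > threshold  # (score, same-person decision)
```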
Image Question Answering
This topic emerged around 2014: present the computer with a picture and a question about it, and have it answer. For example, given a photo of an office by the sea and the question "What is behind the desk?", the neural network should output "chairs and windows".
This application introduces the LSTM, a specially designed neural unit with a degree of memory: its output at one time step is fed back as part of the input at the next. That makes it well suited to language and other scenarios with a temporal order, because when we read a sentence, our understanding of the later words rests on our memory of the earlier ones.
Image question answering therefore combines a convolutional neural network with LSTM units. The LSTM's output should be the desired answer, and its input consists of the previous time step's output together with the image features and each word of the question.
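The wiring described above can be sketched schematically as follows; this is an illustration of the CNN-plus-LSTM idea, not any particular published model, and every dimension and name here is an assumption.

```python
import torch
import torch.nn as nn

class ImageQA(nn.Module):
    """Schematic CNN+LSTM question answering: image features plus the question
    words feed an LSTM, and the final state predicts an answer word."""
    def __init__(self, vocab_size, img_feat_dim=4096, hidden=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.img_proj = nn.Linear(img_feat_dim, hidden)   # project CNN image features
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.answer = nn.Linear(hidden, vocab_size)       # answer word over the vocabulary

    def forward(self, img_feats, question_tokens):
        img_step = self.img_proj(img_feats).unsqueeze(1)  # image as the first "word"
        word_steps = self.embed(question_tokens)          # (batch, T, hidden)
        seq = torch.cat([img_step, word_steps], dim=1)
        out, _ = self.lstm(seq)
        return self.answer(out[:, -1])                    # logits for the answer

# toy usage: 4096-dim image features and a 5-word question over a 1000-word vocabulary
logits = ImageQA(1000)(torch.randn(1, 4096), torch.randint(0, 1000, (1, 5)))
```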
Object Detection
Region CNN
Deep learning has also achieved very good results in object detection. The basic idea of the 2014 R-CNN (Region CNN) algorithm is to first extract candidate object regions from the image with a non-deep method, and then use a deep learning network to decide, for each region, what the object is and exactly where it sits.
Why extract candidate regions with a non-deep method first? Because a sliding-window detector would have to consider windows of every size, aspect ratio and position, and pushing every one of those windows through the deep network would take an unacceptable amount of time.
So a compromise is used: Selective Search. It first discards the regions that cannot possibly contain an object, leaving only about 2,000 candidate regions for the deep network to judge. With this, AP reaches 58.5, almost double what earlier methods achieved. Unfortunately, R-CNN is very slow, taking 10 to 45 seconds per image.
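Selective Search itself is available in OpenCV's contrib modules, so the proposal stage can be sketched as below (the input file name is a placeholder; each returned box would then be cropped, resized, and scored by the CNN, as R-CNN does).

```python
import cv2

# Region proposals via Selective Search (requires the opencv-contrib-python package).
img = cv2.imread("street.jpg")                      # placeholder input image
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(img)
ss.switchToSelectiveSearchFast()                    # trade some recall for speed
boxes = ss.process()                                # array of (x, y, w, h) proposals
print(len(boxes), "candidate boxes")                # typically on the order of 2000
boxes = boxes[:2000]                                # keep the top ~2000 for the CNN
```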
Faster R-CNN Method
Then, at last year's NIPS, came the Faster R-CNN method, a hugely accelerated version of R-CNN. Its speed reaches seven frames per second, that is, seven images processed per second. The trick is not to judge cropped image blocks one by one, but to feed the whole image into the deep network and let the network itself decide where the objects are, what their bounding boxes are, and what kind of objects they are.
The number of deep-network passes drops from about 2,000 to one, so the speed improves enormously.
Faster R-CNN proposes letting the deep network generate candidate object regions itself, and then using the same deep network to judge whether each region is background while simultaneously classifying it and refining its bounding box.
Faster R-CNN manages to be both fast and accurate: its detection AP on VOC2007 reaches 73.2, and its speed is also two to three hundred times higher.
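torchvision ships a re-implementation of Faster R-CNN, so the whole-image-in, detections-out behaviour described above can be tried directly; the snippet below uses that off-the-shelf model, not the original authors' code.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn

# Off-the-shelf Faster R-CNN (torchvision re-implementation; downloads weights).
model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
image = torch.rand(3, 480, 640)                     # stand-in for a real RGB image
with torch.no_grad():
    pred = model([image])[0]                        # whole image in, detections out
print(pred["boxes"].shape, pred["labels"][:5], pred["scores"][:5])
```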
YOLO
The YOLO network, proposed last year, is also used for object detection; it reaches up to 155 frames per second, fully real-time. It likewise feeds the entire image into the neural network and lets the network decide for itself where objects might be and what they might be, but it cuts the number of candidate regions down from the roughly 2,000 used by R-CNN to 98.
It also drops the RPN structure which, in Faster R-CNN, had replaced Selective Search: YOLO has no RPN and instead predicts object classes and locations directly.
The price YOLO pays is accuracy: mAP is only 52.7 at 155 frames per second, and 63.4 at 45 frames per second.
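The 98 comes from the grid formulation: a YOLO-v1-style network divides the image into a 7x7 grid and predicts two boxes per cell (7 x 7 x 2 = 98). The sketch below shows how such an output tensor could be decoded into boxes; the tensor layout and threshold are assumptions for illustration.

```python
import torch

def decode_yolo_v1(output, S=7, B=2, C=20, conf_thresh=0.2):
    """Sketch of decoding a YOLO-v1-style output tensor of shape (S, S, B*5 + C):
    per cell, B boxes (x, y, w, h, conf) followed by C class scores,
    giving S*S*B = 98 candidate boxes in total."""
    detections = []
    for i in range(S):
        for j in range(S):
            cell = output[i, j]
            class_probs = cell[B * 5:]
            for b in range(B):
                x, y, w, h, conf = cell[b * 5:(b + 1) * 5]
                score, cls = (conf * class_probs).max(dim=0)
                if score > conf_thresh:
                    # (x, y) are offsets inside cell (i, j); w, h are image-relative
                    cx, cy = (j + x) / S, (i + y) / S
                    detections.append((cx.item(), cy.item(), w.item(), h.item(),
                                       cls.item(), score.item()))
    return detections

boxes = decode_yolo_v1(torch.rand(7, 7, 30))   # toy output tensor
```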
SSD
The latest algorithm on arXiv is called Single Shot MultiBox Detector, or SSD.
It is a heavily improved version of YOLO: it addresses YOLO's loss of accuracy while keeping the speed, reaching 58 frames per second with an accuracy (mAP) of 72.1. That is about eight times faster than Faster R-CNN, yet with similar accuracy.
Object tracking
Tracking means locking onto an object of interest in the first frame of a video and having the computer follow it, no matter how it rotates or shakes, even if it hides behind bushes.
Deep learning also works remarkably well for tracking. The DeepTrack algorithm, proposed by my colleagues and me at the Australian Institute of Information Technology, was the first published work to do online tracking with deep learning, and it surpassed all the shallow algorithms of the time.
More and more deep learning tracking algorithms have been proposed since. At ICCV 2015 last December, Ma Chao's Hierarchical Convolutional Features algorithm set a new record on the benchmarks. Instead of updating a deep network online, it pre-trains a large network so that the network already knows what is and is not an object.
The big network is then run over the tracking video, its feature maps at different layers are analyzed, and a more mature shallow tracking algorithm does the actual tracking. This exploits deep learning's strength in feature learning while keeping the speed advantage of shallow methods. The result runs at 10 frames per second with record-breaking accuracy.
The latest tracking result builds on Hierarchical Convolutional Features: MDNet, proposed by a Korean research team, combines the two earlier deep approaches. It first learns offline, not from generic object detection or ImageNet, but from tracking videos themselves; then, once the network is actually in use, part of it is updated online. In this way it gets extensive training offline while staying flexible enough to adapt its network online.
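A common ingredient in these trackers is pulling feature maps from several depths of a pretrained CNN and handing them to a lightweight tracker. Below is a sketch of that feature-extraction step using forward hooks; the backbone and layer indices are illustrative choices, not the ones used in the papers.

```python
import torch
from torchvision import models

# Grab feature maps at several depths of a pretrained CNN; a shallow
# correlation-filter tracker would then run on each map.
vgg = models.vgg16(weights="DEFAULT").features.eval()   # downloads ImageNet weights
layers, feats = {3, 8, 15}, {}                          # shallow, middle, deeper blocks

def save(idx):
    return lambda module, inp, out: feats.__setitem__(idx, out.detach())

hooks = [vgg[i].register_forward_hook(save(i)) for i in layers]
patch = torch.rand(1, 3, 224, 224)                      # search region around the target
with torch.no_grad():
    vgg(patch)
for i, f in sorted(feats.items()):
    print(f"layer {i}: feature map {tuple(f.shape)}")
for h in hooks:
    h.remove()
```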
Deep Learning on Embedded Systems
Coming back to ADAS (the main field of Wisdom Eye Technology), deep learning algorithms can certainly be applied, but they place higher demands on the hardware platform. Putting a full computer in a car is not an option: power consumption is a problem, and the market would be unlikely to accept it.
At present, deep learning computation is mostly done in the cloud: the front end takes the pictures and the back-end cloud platform processes them. But for ADAS, that kind of round-trip data transmission is unacceptable; the accident could be over before the result even comes back from the cloud.
Could NVIDIA's embedded platform be the answer?
NVIDIA's embedded platform has far better computing power than the other mainstream embedded platforms, approaching that of top mainstream CPUs such as a desktop i7. So the work of Wisdom Eye Technology is to make deep learning algorithms run in real time within the limited resources of an embedded platform, with almost no loss of accuracy.
First, the network is slimmed down, possibly by changing its structure; and because the recognition scenarios differ, functionality can be trimmed accordingly. On top of that, the fastest deep detection algorithm is combined with the fastest deep tracking algorithm and with some scene-analysis algorithms. The point of combining the three is to reduce the amount of computation and shrink the detection search space. In this way the deep learning algorithm runs on limited resources with very little loss of accuracy.