1. Overview
Artificial Intelligence, or AI, is one of the most beautiful dreams of mankind, just like immortality and interstellar travel. Although computer technology has made great progress, so far no computer has generated “self” consciousness. Yes, with the help of humans and a large amount of ready-made data, computers can perform very powerfully, but without these two, they can’t even distinguish between a cat and a dog.
In his 1950 paper, Turing (everyone knows him: he is the originator of computers and artificial intelligence, corresponding to his famous “Turing machine” and “Turing test” respectively) proposed the idea of the Turing test: when you talk across a wall, you should not be able to tell whether you are talking to a person or a computer. This undoubtedly set a high expectation for computers, especially artificial intelligence. However, half a century has passed, and the progress of artificial intelligence is still far from the standard of the Turing test. This not only discourages people who have been waiting for many years, but also leads some to think that artificial intelligence is a scam and that related fields are “pseudoscience”.
However, since 2006, the field of machine learning has made breakthrough progress, and the Turing test no longer seems so out of reach. The technical means depend not only on the parallel processing ability of cloud computing for big data, but also on an algorithm. This algorithm is Deep Learning. With the help of Deep Learning, humans have finally found a way to deal with the eternal problem of “abstract concepts”.
In June 2012, the New York Times disclosed the Google Brain project, which attracted widespread public attention. The project was jointly led by Andrew Ng, a famous machine learning professor at Stanford University, and Jeff Dean, a world-leading expert in large-scale computer systems. It used a parallel computing platform with 16,000 CPU cores to train a machine learning model called a “Deep Neural Network” (DNN) with a total of 1 billion nodes. (This network is naturally not comparable to the human neural network. You know, there are more than 15 billion neurons in the human brain, and the number of interconnected nodes, that is, synapses, is as numerous as the grains of sand in the galaxy. Someone once estimated that if the axons and dendrites of all the nerve cells in a person’s brain were connected in sequence and pulled into a straight line, it could reach from the earth to the moon and back again.) The project achieved great success in fields such as speech recognition and image recognition.
Andrew, one of the project leaders, said: “We did not set the boundaries ourselves as we usually do, but directly put massive amounts of data into the algorithm and let the data speak for itself. The system will automatically learn from the data. ” Jeff, another person in charge, said: “When training, we never tell the machine: ‘This is a cat.’ The system actually invented or understood the concept of ‘cat’ by itself.”
In November 2012, Microsoft publicly demonstrated a fully automatic simultaneous interpretation system at an event in Tianjin, China. The speaker gave a speech in English, and the computer in the background automatically completed speech recognition, English-Chinese machine translation, and Chinese speech synthesis in one go, with very smooth results. It is reported that the key technology behind it is also DNN, or deep learning (DL).
In January 2013, at Baidu’s annual meeting, founder and CEO Robin Li announced the establishment of Baidu Research Institute, the first of which was the “Institute of Deep Learning” (IDL).
Why do Internet companies with big data compete to invest so many resources in the research and development of deep learning technology? It sounds like deep learning is awesome. So what is deep learning? Why is there deep learning? How did it come about? What can it do? What are the current difficulties? These questions cannot be answered in a few words, so let’s take them slowly. First, let’s understand the background of machine learning (the core of artificial intelligence).
2. Background
Machine Learning is a discipline that studies how computers can simulate or implement human learning behaviors to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their performance. Can machines have the ability to learn like humans? In 1959, Samuel in the United States designed a checkers program that could learn and improve its playing skill through continuous games. Four years later, the program defeated the designer himself. Another three years later, it defeated a U.S. champion who had been unbeaten for eight years. This program showed people the ability of machine learning and raised many thought-provoking social and philosophical issues (haha, the normal track of artificial intelligence has not developed much, but this kind of philosophical ethics is developing very fast: machines will become more and more like humans, humans will become more and more like machines, machines will turn against humans and the ATM will be the first to open fire, and so on. Human imagination is infinite).
Although machine learning has been developed for decades, there are still many problems that have not been well solved:
For example, image recognition, speech recognition, natural language understanding, weather forecasting, gene expression, content recommendation, etc. At present, the way we use machine learning to solve these problems is as follows (taking visual perception as an example):
Data is obtained from sensors (such as CMOS) at the beginning. Then it goes through preprocessing, feature extraction, feature selection, and then to reasoning, prediction or recognition. The last part, which is the machine learning part, is where most of the work is done, and there are also many papers and research.
The three parts in the middle can be summarized as feature expression. Good feature expression plays a very critical role in the accuracy of the final algorithm, and the main computation and testing work of the system is consumed in this part. However, in practice this part is usually done manually, relying on hand-crafted feature extraction.
Up to now, many excellent features have emerged (good features should be invariant (to size, scale, rotation, etc.) and distinguishable): for example, the emergence of SIFT was a milestone in the research field of local image feature descriptors. Since SIFT is invariant to image changes such as scale, rotation, certain viewpoint and illumination changes, and has strong distinguishability, it did make it possible to solve many problems. But it is not omnipotent.
However, manually selecting features is a very laborious, heuristic method that requires professional knowledge. Whether the selection is good depends largely on experience and luck, and it takes a lot of time to adjust. Since manually selecting features is not very good, can we learn some features automatically? The answer is yes! Deep Learning is used to do exactly this. Its other name, Unsupervised Feature Learning, makes the point: “Unsupervised” means that no human is involved in the feature selection process.
So how does it learn? How does it know which features are good and which are bad? We say that machine learning is a discipline that studies how computers simulate or implement human learning behavior. Well, how does our human visual system work? Why can we find another her in the vast sea of people, in the multitude of living beings, in the world of mortals (because you exist in my deep mind, in my dreams, in my heart, in my songs…). The human brain is so awesome, can we refer to it and simulate it? (It seems that the features and algorithms that are somewhat related to the human brain are all good, but I don’t know if they are artificially imposed to make their own works sacred and elegant.)
In recent decades, the development of cognitive neuroscience, biology and other disciplines has made us no longer unfamiliar with our mysterious and magical brain, and has also fueled the development of artificial intelligence.
3. Human brain visual mechanism
The 1981 Nobel Prize in Medicine was awarded to David Hubel (a Canadian-born American neurobiologist) and Torsten Wiesel, as well as Roger Sperry. The main contribution of the first two was “the discovery of information processing in the visual system”: the visual cortex is hierarchical:
Let’s see what they did. In 1958, David Hubel and Torsten Wiesel, at Johns Hopkins University, studied the correspondence between the pupil area and the neurons in the cerebral cortex. They opened a 3 mm hole in the back of a cat’s skull, inserted electrodes into the hole, and measured the activity of the neurons.
Then, they showed the kittens objects of various shapes and brightness. And when showing each object, they also changed the position and angle of the object. They hoped that this method would allow the kittens’ pupils to feel different types and strengths of stimulation.
The purpose of this experiment was to prove a hypothesis: there is a certain correspondence between the different visual neurons in the posterior cerebral cortex and the stimulation of the pupil. Once the pupil is stimulated in a certain way, a certain group of neurons in the posterior cerebral cortex becomes active. After many days of repeated, boring experiments and the sacrifice of several poor kittens, David Hubel and Torsten Wiesel discovered a type of neuron called the “Orientation Selective Cell”. When the pupil finds the edge of an object in front of it, and this edge points in a certain direction, this type of neuron becomes active.
This discovery inspired people to further think about the nervous system. The working process of nerve-center-brain may be a process of continuous iteration and abstraction.
There are two key words here, one is abstraction and the other is iteration. From the original signal, do low-level abstraction and gradually iterate to high-level abstraction. Human logical thinking often uses highly abstract concepts.
For example, it starts with the raw signal intake (the pupil takes in pixels), followed by preliminary processing (certain cells in the cerebral cortex discover edges and directions), then abstraction (the brain determines that the shape of the object in front of it is round), and then further abstraction (the brain further determines that the object is a balloon).
This physiological discovery led to the breakthrough development of computer artificial intelligence forty years later.
In general, the information processing of the human visual system is hierarchical: the low-level V1 area extracts edge features, the V2 area recognizes shapes or parts of the target, and higher areas recognize the entire target, the behavior of the target, and so on. In other words, high-level features are combinations of low-level features. The feature representations from low to high levels become more and more abstract, and can better express semantics or intentions. The higher the level of abstraction, the fewer possible guesses there are, and the easier classification becomes. For example, the correspondence between a word set and a sentence is many-to-one, the correspondence between a sentence and its semantics is many-to-one, and the correspondence between semantics and intention is still many-to-one. This is a hierarchical system.
Sensitive people will have noticed the key word: layering. Does the “deep” in Deep Learning refer to how many layers there are, that is, how deep the network is? That’s right. So how does Deep Learning borrow from this process? After all, it is handled by computers, so one of the problems is how to model this process.
Because what we want to learn is the expression of features, we need to understand features more deeply, or about this level of features. So before talking about Deep Learning, we need to talk about features again (haha, in fact, it would be a pity not to put such a good explanation of features here, so I put it here).
4. About Features
Features are the raw materials of machine learning systems, and their impact on the final model is undoubted. If the data is well expressed as features, linear models can usually achieve satisfactory accuracy. So what do we need to consider for features?
4.1. Granularity of Feature Representation
At what granularity of feature representation can learning algorithms work? For an image, pixel-level features are of no value at all. For example, the motorcycle below, from the pixel level, no information can be obtained, and it is impossible to distinguish between motorcycles and non-motorcycles. However, if the feature is structural (or meaningful), such as whether it has a handle or a wheel, it is easy to distinguish between motorcycles and non-motorcycles, and learning algorithms can work.
4.2. Primary (shallow) feature representation
Since pixel-level feature representation methods are ineffective, what kind of representation is useful?
Around 1995, two scholars, Bruno Olshausen and David Field, were working at Cornell University. They tried to study visual problems using both physiological and computer methods.
They collected a lot of black-and-white landscape photos and extracted 400 small fragments from them. The size of each fragment is 16×16 pixels. Let’s label these 400 fragments as S[i], i = 0, …, 399. Next, randomly extract another fragment from these black-and-white landscape photos, also 16×16 pixels, and label this fragment T.
The question they raised is how to select a group of fragments, S[k], from these 400 fragments, and synthesize a new fragment by superposition. This new fragment should be as similar as possible to the randomly selected target fragment T, and at the same time, the number of S[k] should be as small as possible. To describe it in mathematical language, it is:
Sum_k (a[k] * S[k]) → T, where a[k] is the weight coefficient when superimposing fragment S[k].
To solve this problem, Bruno Olshausen and David Field invented an algorithm, Sparse Coding.
Sparse coding is an iterative process, and each iteration is divided into two steps:
1) Select a set S[k] and then adjust a[k] so that Sum_k (a[k] * S[k]) is closest to T.
2) Fix a[k] and select another more suitable fragment S'[k] from the 400 fragments to replace the original S[k], so that Sum_k (a[k] * S'[k]) is closest to T .
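A minimal numpy sketch of this alternating procedure (the random fragment data, the greedy swap rule, and all variable names here are illustrative assumptions, not the authors’ original code):

```python
import numpy as np

def approximate_fragment(T, S, n_select=3, n_iters=10):
    """Greedy sketch of the two-step iteration: choose a few fragments S[k]
    and weights a[k] so that sum_k a[k] * S[k] is as close to T as possible."""
    rng = np.random.default_rng(0)
    n = len(S)
    chosen = list(rng.choice(n, size=n_select, replace=False))
    for _ in range(n_iters):
        # Step 1: with the chosen fragments fixed, fit the weights a[k] by least squares.
        A = np.stack([S[k].ravel() for k in chosen], axis=1)
        a, *_ = np.linalg.lstsq(A, T.ravel(), rcond=None)
        # Step 2: with the weights fixed, try to swap each fragment for a better one.
        for i in range(n_select):
            rest = sum(a[j] * S[chosen[j]].ravel() for j in range(n_select) if j != i)
            errors = [np.linalg.norm(T.ravel() - rest - a[i] * S[k].ravel()) for k in range(n)]
            chosen[i] = int(np.argmin(errors))
    return chosen, a

# Toy usage with random stand-ins for the 400 photo fragments and the target patch T.
S = np.random.randn(400, 16, 16)
T = np.random.randn(16, 16)
idx, weights = approximate_fragment(T, S)
```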
After several iterations, the best combination of S[k] was selected. Surprisingly, the selected S[k] are basically the edge lines of different objects in the photo. These line segments are similar in shape, but different in direction.
The algorithm results of Bruno Olshausen and David Field coincide with the physiological discoveries of David Hubel and Torsten Wiesel!
That is to say, complex graphics are often composed of some basic structures. For example, in the following figure, a graphic can be linearly represented by 64 orthogonal edges (which can be understood as orthogonal basic structures): the sample x can be composed of three of the 64 edges, with weights of 0.8, 0.3, and 0.5, while the other basic edges contribute nothing and therefore all have coefficients of 0.
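As a toy illustration (the 64 edge bases and the specific indices here are made up for the example, not taken from the figure):

```python
import numpy as np

edges = np.random.randn(64, 8, 8)        # stand-ins for 64 orthogonal edge bases
coeffs = np.zeros(64)
coeffs[[0, 41, 62]] = [0.8, 0.3, 0.5]    # only three bases contribute, the rest are 0
x = np.tensordot(coeffs, edges, axes=1)  # x = 0.8*edge_0 + 0.3*edge_41 + 0.5*edge_62
```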
In addition, the experts also found that this rule exists not only in images, but also in sounds. They found 20 basic sound structures from unlabeled sounds, and the rest of the sounds can be synthesized from these 20 basic structures.
4.3. Structural Feature Representation
Small graphics can be composed of basic edges, but how do we represent more structured, complex, conceptual graphics? This requires higher-level feature representation, such as V2 and V4. From V1’s point of view, its input is at the pixel level; from V2’s point of view, the output of V1 is its “pixel level”. This is hierarchical: the high-level expression is the combination of low-level expressions. To put it more professionally, it is about bases. The bases proposed by V1 are edges; the V2 layer combines the bases of the V1 layer, so the bases obtained in the V2 area are one level higher, the result of combining the bases of the layer below, and the layer above that combines those bases in turn. (So some experts joke that Deep Learning is just “finding bases”; since that doesn’t sound impressive, it is called Deep Learning or Unsupervised Feature Learning.)
Intuitively speaking, it is to find small patches that make sense and then combine them to obtain the features of the previous layer, and recursively learn the features upward.
When training on different objects, the edge basis obtained is very similar, but the object parts and models will be completely different (then it will be much easier for us to distinguish between cars and faces):
From the perspective of text, what does a doc mean? When we describe something, what is the most appropriate way to express it? Using words one by one? I don’t think so; words are at the pixel level. At least we should use terms; in other words, each doc is composed of terms. But is this enough to express a concept? Maybe not. We need to go one step further and reach the topic level. Once we have topics, it is reasonable to go from topics to a doc. But the number of items at each level differs greatly, for example: the concept represented by a doc → topic (thousands to tens of thousands) → term (hundreds of thousands) → word (millions).
When a person is reading a document, what he sees are words. These words are automatically segmented into terms in the brain. Then, the topics are obtained through prior learning according to the conceptual organization, and then high-level learning is carried out.
4.4. How many features are needed?
We know that we need to construct features in layers, from shallow to deep, but how many features should each layer have?
For any method, the more features there are, the more reference information is provided, and the accuracy will be improved. However, more features mean more complex calculations, a larger space for exploration, and sparse data available for training on each feature, which will bring various problems. It is not necessarily the case that the more features, the better.
OK, now we can finally talk about Deep Learning. We talked about why Deep Learning exists (letting machines automatically learn good features without the need for manual selection. Also, referring to the hierarchical visual processing system of humans), and we came to the conclusion that Deep Learning requires multiple layers to obtain more abstract feature expressions. So how many layers are appropriate? What architecture should be used for modeling? How to perform unsupervised training?
5. Basic Idea of Deep Learning
Suppose we have a system S with n layers (S1, …, Sn); its input is I and its output is O, which can be represented as: I => S1 => S2 => … => Sn => O. Suppose the output O is equal to the input I, that is, the input I has no information loss after passing through this system (haha, the experts say this is impossible. Information theory has a saying that “information is lost layer by layer” (the data processing inequality): suppose information a is processed to obtain b, and b is processed to obtain c; then it can be proved that the mutual information between a and c will not exceed the mutual information between a and b. This shows that processing does not increase information, and most processing loses information. Of course, it would be great if only useless information were lost). Then the input I has no information loss after passing through each layer Si; that is, at any layer Si, it is just another representation of the original information (the input I). Now back to our topic, Deep Learning: we need to learn features automatically. Suppose we have a bunch of inputs I (such as a bunch of images or texts), and suppose we design a system S with n layers. By adjusting the parameters in the system so that its output is still the input I, we can automatically obtain a series of hierarchical features of the input I, namely S1, …, Sn.
For deep learning, the idea is to stack multiple layers, that is, the output of one layer is used as the input of the next layer. In this way, the input information can be expressed in a hierarchical manner.
In addition, the previous assumption was that the output is strictly equal to the input. This restriction is too strict; we can relax it slightly, for example by only requiring the difference between the input and the output to be as small as possible. This relaxation leads to another class of Deep Learning methods. The above is the basic idea of Deep Learning.
6. Shallow Learning and Deep Learning
Shallow learning was the first wave of machine learning.
In the late 1980s, the invention of the back-propagation (BP) algorithm for artificial neural networks brought hope to machine learning and set off a machine learning craze based on statistical models. This craze continues to this day. People found that the BP algorithm allows an artificial neural network model to learn statistical regularities from a large number of training samples and thereby predict unknown events. This statistics-based machine learning method is superior to the previous rule-based systems in many respects. Although the artificial neural network of this period was also called a multi-layer perceptron, it was actually a shallow model with only one hidden layer of nodes.
In the 1990s, various shallow machine learning models were proposed, such as support vector machines (SVM), Boosting, maximum entropy methods (such as LR, Logistic Regression), etc. The structure of these models can basically be seen as having a layer of hidden nodes (such as SVM, Boosting), or no hidden nodes (such as LR). These models have achieved great success both in theoretical analysis and application. In contrast, due to the difficulty of theoretical analysis and the need for a lot of experience and skills in training methods, shallow artificial neural networks were relatively quiet during this period.
Deep learning is the second wave of machine learning.
In 2006, Geoffrey Hinton, a professor at the University of Toronto in Canada and a leader in the field of machine learning, and his student Ruslan Salakhutdinov published an article in Science that opened up a wave of deep learning in academia and industry. This article has two main points: 1) artificial neural networks with multiple hidden layers have excellent feature learning capabilities, and the learned features give a more essential characterization of the data, which is conducive to visualization or classification; 2) the difficulty of training deep neural networks can be effectively overcome through “layer-wise initialization” (layer-wise pre-training). In this article, layer-wise initialization is achieved through unsupervised learning.
Most current classification, regression and other learning methods are shallow structure algorithms, which are limited in their ability to represent complex functions with limited samples and computing units, and their generalization ability for complex classification problems is subject to certain constraints. Deep learning can achieve complex function approximation and characterize the distributed representation of input data by learning a deep nonlinear network structure, and demonstrates a strong ability to learn the essential characteristics of a data set from a small number of sample sets. (The advantage of multiple layers is that complex functions can be represented with fewer parameters)
The essence of deep learning is to learn more useful features by building a machine learning model with many hidden layers and a large amount of training data, thereby ultimately improving the accuracy of classification or prediction. Therefore, “deep model” is a means, and “feature learning” is the goal. Different from traditional shallow learning, deep learning is different in that: 1) it emphasizes the depth of the model structure, usually with 5, 6, or even more than 10 layers of hidden nodes; 2) it clearly highlights the importance of feature learning, that is, through layer-by-layer feature transformation, the feature representation of the sample in the original space is transformed into a new feature space, making classification or prediction easier. Compared with the method of constructing features with artificial rules, using big data to learn features can better characterize the rich intrinsic information of the data.
7. Deep learning and Neural Network
Deep learning is a new field in machine learning research. Its motivation is to build and simulate neural networks that analyze and learn like the human brain. It imitates the mechanism of the human brain to interpret data, such as images, sounds, and text. Deep learning is a type of unsupervised learning.
The concept of deep learning originates from the study of artificial neural networks. A multi-layer perceptron with multiple hidden layers is a deep learning structure. Deep learning discovers distributed feature representations of data by combining low-level features to form more abstract high-level representations of attribute categories or features.
Deep learning itself is a branch of machine learning, which can be simply understood as the development of neural networks. About 20 to 30 years ago, neural networks were a particularly hot direction in the ML field, but they later gradually faded out for the following reasons:
1) It is easy to overfit, the parameters are difficult to tune, and it requires a lot of tricks;
2) The training speed is relatively slow, and the effect is not better than other methods when the number of layers is relatively small (less than or equal to 3);
So for about 20 years, neural networks received little attention; during this period, SVM and boosting algorithms basically ruled the world. However, the devoted Hinton persisted and eventually, together with Bengio, Yann LeCun and others, proposed a practical deep learning framework.
There are similarities but also many differences between Deep learning and traditional neural networks.
The similarities between the two are that deep learning uses a hierarchical structure similar to that of a neural network. The system consists of a multi-layer network consisting of an input layer, a hidden layer (multiple layers), and an output layer. Only nodes in adjacent layers are connected, and there are no connections between nodes in the same layer or across layers. Each layer can be regarded as a logistic regression model. This hierarchical structure is closer to the structure of the human brain.
In order to overcome the problems in neural network training, DL adopts a training mechanism that is very different from neural networks. In traditional neural networks, back propagation is used. Simply put, an iterative algorithm is used to train the entire network, randomly set initial values, calculate the output of the current network, and then change the parameters of the previous layers according to the difference between the current output and the label until convergence (the overall method is a gradient descent). Deep learning is a layer-wise training mechanism as a whole. The reason for this is that if the back propagation mechanism is used, for a deep network (more than 7 layers), the residual propagated to the front layer has become too small, and the so-called gradient diffusion occurs. We will discuss this issue next.
8. Deep learning training process
8.1. Why can’t traditional neural network training methods be used in deep neural networks?
The BP algorithm is a typical algorithm for traditional training of multi-layer networks. In fact, this training method is not ideal for networks with only a few layers. The local minimum that is prevalent in the non-convex objective cost function of deep structures (involving multiple layers of nonlinear processing units) is the main source of training difficulties.
Problems with BP algorithm:
(1) The gradient becomes increasingly sparse: the error correction signal becomes smaller as you go down from the top layer;
(2) Convergence to a local minimum: especially when starting from a point far from the optimal region (random value initialization can cause this to happen);
(3) Generally, we can only use labeled data for training: but most data is unlabeled, and the brain can learn from unlabeled data;
8.2. Deep Learning Training Process
If all layers are trained at the same time, the time complexity will be too high; if one layer is trained at a time, the bias will be propagated layer by layer. This will face the opposite problem of supervised learning above, which will lead to serious underfitting (because the deep network has too many neurons and parameters).
In 2006, Hinton proposed an effective way to build a multi-layer neural network on unsupervised data. In short, it has two steps: first, train one layer of the network at a time; second, tune the network so that the high-level representation r generated from the original representation x, and the x’ generated back from r, are as consistent as possible. The method is:
1) First, build a single layer of neurons layer by layer, so that a single layer network is trained each time.
2) After all layers are trained, Hinton uses the wake-sleep algorithm for tuning.
The weights between layers other than the top layer are changed to bidirectional, so that the top layer is still a single-layer neural network, while the other layers become graph models. The upward weights are used for “cognition” and the downward weights are used for “generation”. Then use the Wake-Sleep algorithm to adjust all the weights. Make cognition and generation consistent, that is, to ensure that the generated top-level representation can restore the bottom-level nodes as accurately as possible. For example, if a node in the top layer represents a face, then all images of a face should activate this node, and the resulting image should be able to appear as a rough face image. The Wake-Sleep algorithm is divided into two parts: wake and sleep.
1) Wake phase: the cognitive process. It generates an abstract representation (node states) of each layer from external features and the upward weights (cognitive weights), and uses gradient descent to modify the downward weights (generative weights) between layers. In other words: “if reality is different from what I imagined, change my generative weights so that what I imagine becomes like this.”
2) Sleep phase: the generative process. It generates the bottom-level states from the top-level representation (the concepts learned when awake) and the downward weights, and modifies the upward weights between layers. In other words: “if the scene in the dream is not the corresponding concept in my brain, change my cognitive weights so that this scene becomes this concept in my mind.”
The deep learning training process is as follows:
1) Use bottom-up unsupervised learning (starting from the bottom and training layer by layer to the top):
Using unlabeled data (labeled data is also acceptable), train the parameters of each layer, one layer at a time. This step can be regarded as an unsupervised training process and is the biggest difference from traditional neural networks (this process can be regarded as a feature learning process):
Specifically, the first layer is trained with unlabeled data, learning the parameters of the first layer (this layer can be regarded as the hidden layer of a three-layer neural network that minimizes the difference between the output and the input). Due to the capacity limitation and sparsity constraint of the model, the obtained model can learn the structure of the data itself, thereby obtaining features with more representational power than the input. After learning the (n−1)-th layer, the output of the (n−1)-th layer is used as the input of the n-th layer to train the n-th layer, thereby obtaining the parameters of each layer in turn.
2) Top-down supervised learning (that is, training through labeled data, transferring errors from top to bottom, and fine-tuning the network):
Based on the parameters of each layer obtained in the first step, the parameters of the entire multi-layer model are further fine-tuned. This step is a supervised training process. The first step is analogous to the random initialization of a traditional neural network; but since DL’s first step is not random initialization but the result of learning the structure of the input data, this initial value is closer to the global optimum and thus leads to better results. Therefore, the good performance of deep learning is largely attributable to the feature learning process in the first step.
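As a concrete illustration, here is a minimal sketch of this two-phase procedure using a stack of autoencoders (PyTorch is assumed purely for convenience; the data, layer sizes, learning rates, and iteration counts are placeholder choices, not values from the text):

```python
import torch
import torch.nn as nn

# Placeholder data: 1000 samples of dimension 784 with 10 classes (all sizes are arbitrary).
X = torch.randn(1000, 784)
y = torch.randint(0, 10, (1000,))
sizes = [784, 256, 64]

encoders = [nn.Linear(sizes[i], sizes[i + 1]) for i in range(len(sizes) - 1)]

# Phase 1: bottom-up, unsupervised, greedy layer-wise pre-training.
inputs = X
for enc in encoders:
    dec = nn.Linear(enc.out_features, enc.in_features)           # throwaway decoder
    opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-3)
    for _ in range(50):                                           # a few full-batch steps per layer
        recon = dec(torch.sigmoid(enc(inputs)))
        loss = ((recon - inputs) ** 2).mean()                     # reconstruction error
        opt.zero_grad(); loss.backward(); opt.step()
    inputs = torch.sigmoid(enc(inputs)).detach()                  # this layer's code feeds the next layer

# Phase 2: top-down supervised fine-tuning of the whole stack plus a classifier on top.
model = nn.Sequential(*[nn.Sequential(e, nn.Sigmoid()) for e in encoders],
                      nn.Linear(sizes[-1], 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-4)
for _ in range(50):
    loss = nn.functional.cross_entropy(model(X), y)
    opt.zero_grad(); loss.backward(); opt.step()
```

The per-layer unsupervised module here is an autoencoder (Section 9.1); Restricted Boltzmann Machines (Section 9.3) can play the same role.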
9. Common models or methods of Deep Learning
9.1. AutoEncoder
The simplest Deep Learning method uses the characteristics of artificial neural networks. Artificial neural networks (ANNs) are themselves hierarchical systems. If we take a neural network, assume that its output is the same as its input, and then train it to adjust its parameters, we obtain the weights in each layer. Naturally, we get several different representations of the input I (each layer is one representation), and these representations are features. An autoencoder is a neural network that reproduces the input signal as closely as possible. To achieve this reproduction, the autoencoder must capture the most important factors that represent the input data, just as PCA finds the main components that represent the original information.
The specific process is briefly described as follows:
1) Given unlabeled data, learn features using unsupervised learning:
In our previous neural network, as shown in the first figure, the input samples are labeled, i.e., (input, target), so we change the parameters of the preceding layers according to the difference between the current output and the target (label) until convergence. But now we only have unlabeled data, as in the figure on the right. So how do we obtain this error?
As shown in the figure above, when we input the input into an encoder, we get a code, which is a representation of the input. So how do we know that the code represents the input? We add a decoder, and the decoder will output a message. If the output message is very similar to the initial input signal (ideally, it is the same), then obviously, we have reason to believe that the code is reliable. Therefore, we adjust the parameters of the encoder and decoder to minimize the reconstruction error. At this time, we get the first representation of the input signal, which is the encoded code. Because it is unlabeled data, the source of the error is obtained by directly comparing the reconstruction with the original input.
2) Generate features through the encoder and then train the next layer. This is how we train layer by layer:
So we get the code of the first layer. Since the reconstruction error is minimal, we can believe that this code is a good expression of the original input signal, or, to exaggerate a little, that it carries exactly the same information as the original signal (a different expression of the same thing). The training method of the second layer is no different from that of the first layer: we take the code output by the first layer as the input signal of the second layer, and minimize the reconstruction error in the same way. We thus obtain the parameters of the second layer and the code for the second layer’s input, which is the second expression of the original input information. The other layers are trained in the same way (when training a given layer, the parameters of the previous layers are fixed, and their decoders are no longer useful and can be discarded).
3) Supervised fine-tuning:
After the above method, we can get many layers. As for how many layers are needed (or how deep is needed, there is no scientific evaluation method for this at present), you need to experiment and adjust it yourself. Each layer will get a different expression of the original input. Of course, we think it is the more abstract, the better, just like the human visual system.
At this point, the AutoEncoder cannot be used to classify data because it has not yet learned how to connect an input and a class. It has only learned how to reconstruct or reproduce its input. In other words, it has only learned to obtain a feature that can represent the input well, and this feature can represent the original input signal to the greatest extent. Then, in order to achieve classification, we can add a classifier (such as Logistic Regression, SVM, etc.) to the top encoding layer of the AutoEncoder, and then train it through the standard multi-layer neural network supervised training method (gradient descent method).
That is to say, at this time, we need to input the feature code of the last layer into the final classifier, and fine-tune it through supervised learning with labeled samples. There are two types of this, one is to only adjust the classifier (the black part):
Another way: fine-tune the entire system through labeled samples: (If there is enough data, this is the best. End-to-end learning)
Once supervised training is complete, the network can be used for classification. The top layer of the neural network can be used as a linear classifier, which can then be replaced with a classifier with better performance.
In the research, we found that if these automatically learned features are added to the original features, the accuracy can be greatly improved, and even better than the current best classification algorithm in classification problems!
There are several variants of AutoEncoder, here are two of them:
Sparse AutoEncoder:
Of course, we can continue to add constraints to obtain new Deep Learning methods. For example, if we add an L1 regularization constraint on top of the AutoEncoder (L1 mainly constrains most of the nodes in each layer to be 0, with only a few non-zero, which is the origin of the name “Sparse”), we get the Sparse AutoEncoder method.
As shown in the figure above, it actually limits the expression code obtained each time to be as sparse as possible, because sparse expressions are often more effective than other expressions (the human brain seems to work the same way: a certain input only stimulates certain neurons, and most of the other neurons are inhibited).
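A minimal sketch of the extra constraint (PyTorch again; the layer sizes and the penalty weight are arbitrary assumptions): the loss simply gains an L1 term on the code.

```python
import torch
import torch.nn as nn

encoder, decoder = nn.Linear(784, 256), nn.Linear(256, 784)
x = torch.randn(32, 784)                     # a placeholder mini-batch

code = torch.sigmoid(encoder(x))
recon = decoder(code)
# Reconstruction error plus an L1 penalty that drives most code units toward 0.
sparsity_weight = 1e-3                       # arbitrary choice
loss = ((recon - x) ** 2).mean() + sparsity_weight * code.abs().mean()
```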
Denoising AutoEncoders:
The denoising autoencoder DA is based on the autoencoder, and the training data is added with noise, so the autoencoder must learn to remove this noise and obtain the real input that is not polluted by noise. Therefore, this forces the encoder to learn a more robust expression of the input signal, which is why its generalization ability is stronger than that of general encoders. DA can be trained by the gradient descent algorithm.
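And a matching sketch of the denoising idea (same placeholder shapes; the 30% corruption rate is an arbitrary choice): corrupt the input before encoding, but measure reconstruction against the clean signal.

```python
import torch
import torch.nn as nn

encoder, decoder = nn.Linear(784, 256), nn.Linear(256, 784)
x = torch.randn(32, 784)                      # clean placeholder mini-batch

mask = (torch.rand_like(x) > 0.3).float()     # randomly zero out about 30% of the input
code = torch.sigmoid(encoder(x * mask))       # encode the corrupted input
recon = decoder(code)
loss = ((recon - x) ** 2).mean()              # but reconstruct the *clean* input
```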
9.2. Sparse Coding
If we relax the restriction that the output must equal the input, and use the concept of a basis from linear algebra, that is, O = a_1·Φ_1 + a_2·Φ_2 + … + a_n·Φ_n, where Φ_i is a basis and a_i a coefficient, we get the following optimization problem:
Min |I – O|, where I represents input and O represents output.
By solving this optimization problem, we obtain the coefficients a_i and the bases Φ_i, which are another approximate expression of the input.
Therefore, they can be used to express the input I, and this process is also learned automatically. If we add the L1 Regularity restriction to the above formula, we get:
Min |I – O| + u*(|a_1| + |a_2| + … + |a_n|)
This method is called Sparse Coding. In layman’s terms, it represents a signal as a linear combination of a set of bases, requiring that only a few bases are needed to represent the signal. “Sparsity” is defined as: only a few non-zero elements, or only a few elements that are far greater than zero. Requiring the coefficients a_i to be sparse means that for a set of input vectors, we want as few coefficients as possible to be far greater than zero. One reason for choosing sparse components to represent our input data is that most sensory data, such as natural images, can be represented as the superposition of a small number of basic elements, which can be surfaces or lines in the image. At the same time, this also strengthens the analogy with the primary visual cortex (the human brain has a large number of neurons, but for a given image or edge only a few neurons are excited, and the others remain inhibited).
The sparse coding algorithm is an unsupervised learning method used to find a set of “overcomplete” basis vectors to represent sample data more efficiently. Although techniques such as principal component analysis (PCA) allow us to easily find a set of “complete” basis vectors, what we want here is an “overcomplete” set of basis vectors to represent the input vectors (that is, the number of basis vectors is larger than the dimension of the input vectors). The advantage of overcomplete bases is that they can capture the structures and patterns hidden in the input data more effectively. However, with overcomplete bases, the coefficients a_i are no longer uniquely determined by the input vector. Therefore, the sparse coding algorithm adds an extra criterion, “sparseness”, to resolve the degeneracy caused by overcompleteness. (For the detailed process, please refer to the UFLDL Tutorial on Sparse Coding.)
For example, at the bottom layer of feature extraction for images, we want to generate an edge detector. The job here is to randomly select some small patches from natural images and use these patches to learn a “basis” that can describe them, namely the basis of 8×8 = 64 basis patches shown on the right. Then, given a test patch, we can obtain it as a linear combination of the basis according to the formula above, where the sparse vector is a. In the figure below, a has 64 dimensions with only 3 non-zero entries, hence “sparse”.
You may ask: why is the bottom layer an edge detector, and what is in the upper layers? A simple explanation: the reason it is an edge detector is that edges in different directions can describe the whole image, so edges in different directions are naturally the basis of the image; the upper layer is then the combination of the bases of the layer below, and the layer above that combines those bases in turn (just as we said in Part 4 above).
Sparse coding is divided into two parts:
1) Training stage: Given a series of sample images [x1, x2, …], we need to learn a set of bases [Φ1, Φ2, …], that is, a dictionary.
Sparse coding is a variant of the k-means algorithm, and its training process is similar (the idea of the EM algorithm: if the objective function to be optimized contains two variables, such as L(W, B), then we can first fix W and adjust B to minimize L, and then fix B and adjust W to minimize L, and iterate in this way to continuously push L to the minimum value. The EM algorithm can be found in my blog: ” A Brief Explanation from Maximum Likelihood to EM Algorithm “).
The training process is a repeated iterative process. As mentioned above, we alternately change a and Φ to minimize the following objective function.
Each iteration is divided into two steps:
a) Fix the dictionary Φ[k], and then adjust a[k] so that the above formula, that is, the objective function, is minimized (ie, solve the LASSO problem).
b) Then fix a[k] and adjust Φ[k] so that the above equation, ie the objective function, is minimized (ie, solving the convex QP problem).
Continue iterating until convergence. In this way, you can get a set of bases that can well represent this series of x, that is, a dictionary.
2) Coding stage: Given a new image x, we use the dictionary obtained above to solve a LASSO problem and obtain a sparse vector a. This sparse vector is the sparse representation of the input vector x.
For example, a rough sketch of both stages is given below.
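This sketch assumes scikit-learn's Lasso solver for the LASSO step and plain least squares for the dictionary update; the patch data, dictionary size, sparsity weight, and iteration count are all placeholders, not values from the text.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 64))               # 500 flattened 8x8 patches (placeholder data)
D = rng.standard_normal((64, 64))                # dictionary: 64 bases of dimension 64
D /= np.linalg.norm(D, axis=1, keepdims=True)

lasso = Lasso(alpha=0.1, fit_intercept=False, max_iter=2000)

# 1) Training stage: alternate between the LASSO step (fix D, solve for the codes A)
#    and the dictionary step (fix A, solve a least-squares problem for D).
for _ in range(10):
    lasso.fit(D.T, X.T)                          # each patch approximated as D.T @ a
    A = lasso.coef_                              # codes, shape (500, 64)
    D, *_ = np.linalg.lstsq(A, X, rcond=None)    # update the bases given the codes
    D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-12

# 2) Coding stage: a new patch is encoded by solving one LASSO problem with D fixed.
x_new = rng.standard_normal(64)
lasso.fit(D.T, x_new)
a_new = lasso.coef_                              # sparse representation of x_new
```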
9.3. Restricted Boltzmann Machine (RBM)
Suppose there is a bipartite graph with no links between nodes within each layer. One layer is the visible layer, i.e., the input data layer (v), and the other is the hidden layer (h). If we assume that all nodes are random binary variables (taking values 0 or 1 only), and that the joint probability distribution p(v, h) satisfies the Boltzmann distribution, we call this model a Restricted Boltzmann Machine (RBM).
Let’s take a look at why it is a Deep Learning method. First, because the model is a bipartite graph, when v is known, all hidden nodes are conditionally independent (since there are no connections between them), that is, p(h|v) = p(h_1|v)…p(h_n|v). Similarly, when the hidden layer h is known, all visible nodes are conditionally independent. Furthermore, since all v and h satisfy the Boltzmann distribution, when v is input, the hidden layer h can be obtained through p(h|v), and from the hidden layer h, the visible layer can be obtained through p(v|h). If, by adjusting the parameters, the visible layer v1 obtained from the hidden layer is the same as the original visible layer v, then the resulting hidden layer is another expression of the visible layer. The hidden layer can therefore be used as features of the visible-layer input data, so this is a Deep Learning method.
How do we train it? That is, how do we determine the weights between the visible-layer nodes and the hidden nodes? We need a little mathematical analysis, that is, the model itself.
The energy of a joint configuration (v, h) can be expressed as: E(v, h; θ) = −Σ_i a_i·v_i − Σ_j b_j·h_j − Σ_i Σ_j v_i·W_ij·h_j, where θ = {W, a, b} are the model parameters.
The joint probability distribution of a configuration is determined by the Boltzmann distribution (and the energy of this configuration): P(v, h) = exp(−E(v, h)) / Z, where Z = Σ_{v,h} exp(−E(v, h)) is the partition function.
Because the hidden nodes are conditionally independent (there are no connections between them), we have: p(h|v) = Π_j p(h_j|v).
Then we can easily obtain (by factorizing the formula above) the probability that the j-th hidden node is 1 or 0 given the visible layer v: p(h_j = 1 | v) = σ(b_j + Σ_i v_i·W_ij), where σ(x) = 1 / (1 + e^(−x)).
Similarly, given the hidden layer h, the probability that the i-th visible node is 1 or 0 is: p(v_i = 1 | h) = σ(a_i + Σ_j W_ij·h_j).
Given a set of independent and identically distributed samples D = {v^(1), v^(2), …, v^(N)}, we need to learn the parameters θ = {W, a, b}.
We maximize the following log-likelihood function (maximum likelihood estimation: for a given probability model, we choose the parameters that maximize the probability of the observed samples): L(θ) = Σ_{n=1}^{N} log P(v^(n)).
That is to say, by taking the derivative of the log-likelihood function and maximizing it, we obtain the parameters W corresponding to the maximum of L.
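In practice, the gradient of this log-likelihood is intractable and is commonly approximated with contrastive divergence (CD-1). The following numpy sketch shows one such update loop (the binary data, layer sizes, learning rate, and epoch counts are arbitrary placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
V = (rng.random((1000, 784)) > 0.5).astype(float)    # placeholder binary training data
n_v, n_h = 784, 128
W = 0.01 * rng.standard_normal((n_v, n_h))
a = np.zeros(n_v)                                    # visible biases
b = np.zeros(n_h)                                    # hidden biases

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

lr = 0.05
for _ in range(10):                                  # a few epochs
    for i in range(0, len(V), 100):                  # mini-batches of 100
        v0 = V[i:i + 100]
        p_h0 = sigmoid(v0 @ W + b)                   # p(h=1 | v), positive phase
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        p_v1 = sigmoid(h0 @ W.T + a)                 # p(v=1 | h), one Gibbs step back
        p_h1 = sigmoid(p_v1 @ W + b)                 # negative phase
        # CD-1 update: correlations under the data minus correlations after one Gibbs step.
        W += lr * (v0.T @ p_h0 - p_v1.T @ p_h1) / len(v0)
        a += lr * (v0 - p_v1).mean(axis=0)
        b += lr * (p_h0 - p_h1).mean(axis=0)
```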
If we increase the number of hidden layers, we can get Deep Boltzmann Machine (DBM); if we use Bayesian belief network (that is, directed graph model, of course, there is still no link between nodes in the restricted layer) in the part close to the visual layer, and use Restricted Boltzmann Machine in the part farthest from the visual layer, we can get Deep Belief Net (DBN).
9.4. Deep Belief Networks
DBNs are probabilistic generative models. In contrast to traditional discriminative neural network models, a generative model establishes a joint distribution between observed data and labels and evaluates both P(Observation|Label) and P(Label|Observation), while a discriminative model only evaluates the latter, P(Label|Observation). Applying the traditional BP algorithm directly to deep neural networks runs into the following problems:
(1) A labeled sample set is required for training;
(2) The learning process is slow;
(3) Inappropriate parameter selection will cause learning to converge to a local optimal solution.
DBNs consist of multiple layers of Restricted Boltzmann Machines, and a typical type of neural network is shown in Figure 3. These networks are “restricted” to one visible layer and one hidden layer, with connections between layers but no connections between units within a layer. The hidden layer units are trained to capture the correlation of high-order data presented in the visible layer.
First, regardless of the top two layers that form an associative memory, the connections of a DBN are guided by the top-down generation of weights. RBMs are like a building block that makes it easier to learn connection weights compared to traditional and deeply layered sigmoid belief networks.
At the beginning, an unsupervised greedy layer-by-layer method is used to pre-train the weights of the generative model. Hinton showed this greedy layer-wise procedure to be effective, with each layer trained using an approximation he called contrastive divergence.
In this training phase, a vector v is presented at the visible layer and its values are passed to the hidden layer. In turn, the visible-layer input is stochastically reconstructed from the hidden units in an attempt to recover the original input signal. Finally, these new visible activations are forwarded again so that the hidden activation units h are reconstructed (during training, the visible vector values are first mapped to the hidden units; then the visible units are reconstructed from the hidden units; these new visible units are mapped to the hidden units again, yielding new hidden units. This repeated procedure is Gibbs sampling). These backward and forward steps are the familiar Gibbs sampling, and the difference in correlation between the hidden activations and the visible inputs is used as the main basis for updating the weights.
Training time is significantly reduced, as only a single step is required to approximate maximum likelihood learning. Each layer added to the network improves the log-probability of the training data, which we can interpret as getting closer and closer to the true representation of the energy. This meaningful extension, and the use of unlabeled data, is a decisive factor in any deep learning application.
In the top two layers, the weights are connected together so that the output of the lower layer will provide a reference clue or association to the top layer, so that the top layer will connect it to its memory content. What we care about most and what we want to get in the end is the discriminative performance, such as in classification tasks.
After pre-training, a DBN can use the BP algorithm with labeled data to fine-tune its discriminative performance. Here, a label set is attached to the top layer (the generalized associative memory), and a classification boundary for the network is obtained via a bottom-up pass with the learned recognition weights. This performs better than a network trained by the plain BP algorithm alone. This can be explained intuitively: the BP algorithm for DBNs only needs to perform a local search in the weight-parameter space, so it trains faster and converges in less time than a feed-forward neural network trained from scratch.
The flexibility of DBNs makes them easy to extend. One extension is Convolutional Deep Belief Networks (CDBNs). DBNs do not take the 2D structure of an image into account, because the input is simply a vectorized image matrix. CDBNs address this problem: they use the spatial relationship between neighboring pixels, via a component called convolutional RBMs, to achieve the transformation invariance of the generative model, and they scale easily to high-dimensional images. DBNs also do not explicitly deal with learning the temporal connections between observed variables, although there is research in this area, such as stacked temporal RBMs and, as a generalization, so-called temporal convolution machines for sequence learning. The application of such sequence learning brings an exciting future research direction to speech signal processing.
Current research related to DBNs includes stacked autoencoders, which replace the RBMs in traditional DBNs with autoencoders. This allows deep multi-layer neural network architectures to be trained with the same rules, but without the strict requirements on layer parameterization. Unlike DBNs, autoencoders use a discriminative model, which makes it hard for this structure to sample the input space, and thus harder for the network to capture its internal representation. However, denoising autoencoders avoid this problem well and perform better than traditional DBNs: by adding random corruption during training and stacking the layers, they achieve good generalization performance. The process of training a single denoising autoencoder is the same as that of training an RBM as a generative model.
9.5. Convolutional Neural Networks
Convolutional neural network is a kind of artificial neural network, which has become a research hotspot in the current field of speech analysis and image recognition. Its weight-sharing network structure makes it more similar to biological neural networks, reducing the complexity of the network model and the number of weights. This advantage is more obvious when the network input is a multi-dimensional image, so that the image can be directly used as the input of the network, avoiding the complex feature extraction and data reconstruction process in traditional recognition algorithms. Convolutional network is a multi-layer perceptron specially designed for recognizing two-dimensional shapes. This network structure is highly invariant to translation, scaling, tilt or other forms of deformation.
CNNs are influenced by the early time-delay neural network (TDNN), which reduces learning complexity by sharing weights in the time dimension and is suitable for processing speech and time series signals.
CNNs are the first learning algorithms that have truly successfully trained multi-layer network structures. They use spatial relationships to reduce the number of parameters that need to be learned to improve the training performance of the general forward BP algorithm. CNNs were proposed as a deep learning architecture to minimize the preprocessing requirements of data. In CNN, a small part of the image (the local receptive region) is used as the input of the lowest layer of the hierarchy, and the information is then transmitted to different layers in turn, each layer passes through a digital filter to obtain the most significant features of the observed data. This method can obtain significant features of the observed data that are invariant to translation, scaling, and rotation, because the local receptive region of the image allows neurons or processing units to access the most basic features, such as oriented edges or corners.
1) History of Convolutional Neural Networks
In 1962, Hubel and Wiesel proposed the concept of the receptive field through their research on the cat’s visual cortex. In 1980, the Japanese scholar Fukushima proposed the neocognitron based on the concept of the receptive field; it can be regarded as the first implementation of a convolutional neural network and the first application of the receptive-field concept in the field of artificial neural networks. The neocognitron decomposes a visual pattern into many sub-patterns (features) and processes them in hierarchically connected feature planes. It attempts to model the visual system so that recognition still succeeds even when the object is displaced or slightly deformed.
Usually, a neocognitron contains two types of neurons: S-cells, which are responsible for feature extraction, and C-cells, which provide tolerance to deformation. Two important parameters are involved in the S-cells: the receptive field, which determines the number of input connections, and the threshold parameter, which controls the degree of response to the feature sub-pattern. Many scholars have worked on improving the neocognitron’s performance. In the traditional neocognitron, the amount of visual blur that each S-cell’s photosensitive area causes in the C-cells is normally distributed. If the blur produced at the edge of the photosensitive area is greater than at the center, the S-cell will tolerate the larger deformation caused by this non-normal blur. What we hope to obtain is that the difference between the effects produced by the training pattern and the deformed stimulus pattern at the edge and at the center of the receptive field becomes larger and larger. To effectively produce this non-normal blur, Fukushima proposed an improved neocognitron with a double C-cell layer.
Van Ooyen and Niehuis introduced a new parameter to improve the discriminative ability of the neocognitron. This parameter effectively acts as an inhibitory signal, suppressing the excitation of neurons by repeatedly presented features. Most neural networks memorize training information in their weights; according to the Hebbian learning rule, the more often a feature is seen during training, the easier it is to detect during later recognition. Some researchers have also combined evolutionary computation with the neocognitron: by weakening the training of repeatedly excited features, the network is made to attend to the distinct features, which helps improve its discriminative ability. All of the above describes the development of the neocognitron. The convolutional neural network can be regarded as a generalized form of the neocognitron, and the neocognitron as a special case of the convolutional neural network.
2) Network structure of convolutional neural network
A convolutional neural network is a multi-layer neural network, each layer consists of multiple two-dimensional planes, and each plane consists of multiple independent neurons.
Figure: Conceptual demonstration of a convolutional neural network. The input image is convolved with three trainable filters plus additive biases. After convolution, three feature maps are produced at the C1 layer. Then each group of four pixels in a feature map is summed, weighted, and offset by a bias, and passed through a sigmoid function to obtain the three feature maps of the S2 layer. These maps are filtered again to obtain the C3 layer, and S4 is generated from C3 in the same way that S2 was generated from C1. Finally, the pixel values are rasterized into a vector and fed into a traditional neural network to produce the output.
Generally, the C layers are feature extraction layers: the input of each neuron is connected to a local receptive field of the previous layer and extracts a local feature; once a local feature has been extracted, its positional relationship to the other features is fixed as well. The S layers are feature mapping layers: each computing layer of the network consists of multiple feature maps, each feature map is a plane, and all neurons on a plane share the same weights. The feature mapping structure uses the sigmoid function, whose influence-function kernel is small, as the activation function of the convolutional network, so that the feature maps are shift-invariant.
In addition, since the neurons on one mapping plane share weights, the number of free parameters of the network is reduced, which lowers the complexity of parameter selection. Each feature extraction layer (C layer) in the convolutional neural network is followed by a computing layer (S layer) that performs local averaging and secondary extraction. This characteristic two-stage feature extraction structure gives the network a high tolerance to distortion of the input samples during recognition.
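To make the convolution step of the C layers concrete, here is a minimal Python/NumPy sketch of a single "valid" convolution producing one feature map; the function name conv2d_valid and all sizes are illustrative assumptions of mine, not something from the original text.

import numpy as np

def conv2d_valid(image, kernel, bias=0.0):
    # Slide the kernel over the image with no padding ("valid" mode),
    # producing a feature map smaller than the input.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel) + bias
    return out

image = np.random.rand(32, 32)           # toy grey-level input image
kernel = 0.01 * np.random.randn(5, 5)    # one trainable 5x5 filter
feature_map = conv2d_valid(image, kernel)
print(feature_map.shape)                 # (28, 28)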
3) About parameter reduction and weight sharing
As mentioned above, one of CNN's great strengths is that it reduces the number of parameters that must be trained through local receptive fields and weight sharing. What exactly does that mean?
Suppose we have a 1000×1000-pixel image and 1 million hidden neurons. If they are fully connected (every hidden neuron connected to every pixel of the image), there are 1000×1000×1,000,000 = 10^12 connections, i.e. 10^12 weight parameters. However, the spatial correlations in an image are local, just as a person perceives the outside world through local receptive fields: each neuron does not need to see the whole image, only a local region, and at a higher level the neurons that see different local regions are combined to obtain global information. In this way we can reduce the number of connections, i.e. the number of weights the network has to train. If the local receptive field is 10×10, each hidden neuron only needs to be connected to a 10×10 patch of the image, so 1 million hidden neurons have only 100 million connections, i.e. 10^8 parameters. That is four orders of magnitude fewer than before, so training is much less strenuous, but it still feels like a lot. Is there any other way?
We know that each hidden-layer neuron is connected to a 10×10 image region, which means each neuron has 10×10 = 100 connection weights. What if these 100 parameters were the same for every neuron, i.e. every neuron convolves the image with the same convolution kernel? How many parameters would we have then? Only 100! No matter how many neurons the hidden layer contains, the connection between the two layers needs just 100 parameters. This is weight sharing, and it is the main selling point of convolutional neural networks. You may well ask whether this is reliable and why it works; let's figure that out together as we go.
Now you might object that this is not a good way to extract features: only one feature has been extracted. Quite right. One filter, i.e. one convolution kernel, extracts one kind of feature from the image, for example an edge in a particular direction. To extract different features, we simply add more filters. So suppose we add 100 filters, each with different parameters, so that each extracts a different feature of the input image, such as edges in different directions. Each filter convolves the image and yields a projection of a different feature of the image, which we call a Feature Map. 100 convolution kernels therefore give 100 Feature Maps, and these 100 Feature Maps form one layer of neurons. How many parameters does this layer have? 100 convolution kernels × 100 shared parameters per kernel = 100×100 = 10,000. Only 10,000 parameters! In the figure below on the right, different colors represent different filters.
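The counting above is easy to verify in a few lines of Python; this is pure bookkeeping for the numbers just quoted, and every variable name below is my own.

image_pixels = 1000 * 1000            # 10^6 input pixels
hidden_neurons = 1000 * 1000          # 10^6 hidden units
field = 10 * 10                       # 10x10 local receptive field
num_filters = 100

fully_connected = image_pixels * hidden_neurons   # 10^12 weights
locally_connected = hidden_neurons * field        # 10^8 weights
shared_one_filter = field                         # 100 weights
shared_many_filters = num_filters * field         # 10,000 weights
print(fully_connected, locally_connected, shared_one_filter, shared_many_filters)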
One more point I skipped: we just said that the number of parameters in the hidden layer has nothing to do with the number of hidden neurons, only with the filter size and the number of filter types. So what determines the number of hidden neurons? It is determined by the size of the input (the original image), the filter size, and the stride with which the filter slides over the image. For example, if the image is 1000×1000 pixels and the filter is 10×10, then with non-overlapping windows (stride 10) the hidden layer has (1000×1000)/(10×10) = 100×100 neurons. With a stride of 8 the windows overlap by 2 pixels, and the count changes accordingly; the idea is the same. Note that this is the neuron count for a single type of filter, i.e. for one Feature Map; with 100 Feature Maps it is 100 times as many. Clearly, the larger the image, the larger the gap between the number of neurons and the number of weights that need to be trained.
One thing to note: the discussion above ignores the bias of each neuron, so the number of weights per filter should be increased by 1; the bias, too, is shared by all neurons that use the same filter.
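A small Python helper makes the neuron-count rule and the shared bias explicit; feature_map_size is a hypothetical name of mine, and the formula assumes no padding.

def feature_map_size(input_size, kernel_size, stride):
    # Number of positions the filter can occupy along one dimension
    # when the last window must still fit inside the image.
    return (input_size - kernel_size) // stride + 1

side = feature_map_size(1000, 10, 10)   # non-overlapping windows (stride 10)
print(side * side)                      # 100*100 = 10,000 neurons per Feature Map
side8 = feature_map_size(1000, 10, 8)   # stride 8: adjacent windows overlap by 2 pixels
print(side8 * side8)

params_per_filter = 10 * 10 + 1         # 100 shared weights plus the shared bias
print(100 * params_per_filter)          # 100 filters -> 10,100 parameters in total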
In short, the core idea of the convolutional network is to combine the three structural ideas of local receptive field, weight sharing (or weight replication), and time or space subsampling to achieve a certain degree of displacement, scale, and deformation invariance.
4) A typical example
A typical convolutional network used to recognize digits is LeNet-5 (see here for the demo and the paper). Most banks in the United States used it to recognize handwritten digits on checks; the fact that it reached that level of commercial deployment says enough about its accuracy. After all, such a combination of academia and industry is what draws the most attention at present.
Let’s use this example to illustrate.
LeNet-5 has 7 layers, not counting the input, and every layer contains trainable parameters (connection weights). The input image is 32*32, which is larger than the largest character in the MNIST database (a widely used handwriting database). The reason is that potentially distinctive features such as stroke endpoints or corner points should be able to appear at the center of the receptive field of the highest-level feature detector.
We must first make one thing clear: each layer has multiple Feature Maps, each Feature Map extracts a feature of the input through a convolution filter, and each Feature Map has multiple neurons.
The C1 layer is a convolution layer (why convolution? An important property of the convolution operation is that it enhances the features of the original signal and reduces noise), consisting of 6 feature maps. Each neuron in a feature map is connected to a 5*5 neighborhood in the input. The feature map size is 28*28, which keeps the input connections from falling outside the boundary (useful during BP feedback so that no gradient is lost, in my personal opinion). C1 has 156 trainable parameters (each filter has 5*5 = 25 weight parameters and one bias, and there are 6 filters, giving (5*5+1)*6 = 156 parameters) and 156*(28*28) = 122,304 connections.
The S2 layer is a downsampling layer (why downsampling? Using the principle of local correlation of images, subsampling the image can reduce the amount of data processing while retaining useful information), with 6 14*14 feature maps. Each unit in the feature map is connected to the 2*2 neighborhood of the corresponding feature map in C1. The 4 inputs of each unit in the S2 layer are added, multiplied by a trainable parameter, and then added with a trainable bias. The result is calculated by the sigmoid function. The trainable coefficient and bias control the degree of nonlinearity of the sigmoid function. If the coefficient is small, the operation is close to linear operation, and subsampling is equivalent to blurring the image. If the coefficient is large, subsampling can be regarded as a noisy “or” operation or a noisy “and” operation according to the size of the bias. The 2*2 receptive fields of each unit do not overlap, so the size of each feature map in S2 is 1/4 of the size of the feature map in C1 (1/2 for rows and columns). The S2 layer has 12 trainable parameters and 5880 connections.
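The parameter and connection counts quoted for C1 and S2 can be reproduced with a few lines of Python (illustrative arithmetic only).

c1_params = (5 * 5 + 1) * 6                    # 25 weights + 1 bias per filter, 6 filters = 156
c1_connections = c1_params * 28 * 28           # every output pixel reuses the same parameters: 122,304
s2_params = (1 + 1) * 6                        # one coefficient + one bias per map = 12
s2_connections = (2 * 2 + 1) * 6 * 14 * 14     # 4 inputs + 1 bias per output unit = 5,880
print(c1_params, c1_connections, s2_params, s2_connections)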
Figure: The convolution and subsampling process. Convolution: a trainable filter f_x is convolved with the input image (the first stage uses the input image; later stages use convolution feature maps), and a bias b_x is added to obtain the convolution layer C_x. Subsampling: the four pixels in each neighborhood are summed into one pixel, weighted by a scalar W_(x+1), offset by a bias b_(x+1), and passed through a sigmoid activation function to produce a feature map S_(x+1) that is roughly four times smaller.
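The subsampling step just described (sum each 2*2 neighborhood, scale by one trainable coefficient, add one trainable bias, apply the sigmoid) can be sketched in NumPy as follows; the function name subsample and the toy sizes are assumptions of mine.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def subsample(feature_map, weight, bias):
    # Sum each non-overlapping 2x2 neighborhood, scale by a single
    # trainable coefficient, add a single trainable bias, then squash.
    H, W = feature_map.shape
    pooled = (feature_map[0:H:2, 0:W:2] + feature_map[1:H:2, 0:W:2] +
              feature_map[0:H:2, 1:W:2] + feature_map[1:H:2, 1:W:2])
    return sigmoid(weight * pooled + bias)

c1 = np.random.rand(28, 28)          # a toy C1 feature map
s2 = subsample(c1, weight=0.5, bias=0.0)
print(s2.shape)                      # (14, 14)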
Therefore, the mapping from one plane to the next can be regarded as a convolution operation, and the S-layer can be regarded as a fuzzy filter, which plays a role in secondary feature extraction. The spatial resolution decreases between hidden layers, while the number of planes contained in each layer increases, which can be used to detect more feature information.
The C3 layer is also a convolutional layer. It also uses a 5×5 convolution kernel to convolve layer S2. The resulting feature map has only 10×10 neurons, but it has 16 different convolution kernels, so there are 16 feature maps. One thing to note here is that each feature map in C3 is connected to all 6 or several feature maps in S2, which means that the feature map of this layer is a different combination of the feature maps extracted in the previous layer (this approach is not unique). (See, it is a combination here, just like the human visual system we talked about before, the underlying structure constitutes a more abstract structure in the upper layer, such as edges constitute parts of shapes or objects).
I just said that each feature map in C3 is composed of all 6 or several feature maps in S2. Why not connect each feature map in S2 to each feature map in C3? There are two reasons. First, the incomplete connection mechanism keeps the number of connections within a reasonable range. Second, and most importantly, it destroys the symmetry of the network. Since different feature maps have different inputs, they are forced to extract different features (hopefully complementary).
For example, there is a way that the first 6 feature maps of C3 take 3 adjacent feature map subsets in S2 as input. The next 6 feature maps take 4 adjacent feature map subsets in S2 as input. The next 3 take 4 non-adjacent feature map subsets as input. The last one takes all feature maps in S2 as input. In this way, the C3 layer has 1516 trainable parameters and 151600 connections.
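The 1516 and 151,600 figures follow directly from this grouping; here is a quick Python check (illustrative only).

groups = [3] * 6 + [4] * 6 + [4] * 3 + [6] * 1       # S2 maps feeding each of the 16 C3 maps
c3_params = sum(n * 5 * 5 + 1 for n in groups)       # 25 weights per input map + 1 bias per C3 map
c3_connections = c3_params * 10 * 10                 # each C3 map has 10x10 units
print(c3_params, c3_connections)                     # 1516 and 151,600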
The S4 layer is a downsampling layer consisting of 16 5*5 feature maps. Each unit in a feature map is connected to the 2*2 neighborhood of the corresponding feature map in C3, just like the connection between C1 and S2. The S4 layer has 32 trainable parameters (one coefficient and one bias per feature map) and 2000 connections.
The C5 layer is a convolutional layer with 120 feature maps. Each unit is connected to the 5*5 neighborhood of all 16 units in the S4 layer. Since the size of the S4 layer feature map is also 5*5 (same as the filter), the size of the C5 feature map is 1*1: this constitutes a full connection between S4 and C5. The reason why C5 is still marked as a convolutional layer rather than a fully connected layer is that if the input of LeNet-5 becomes larger while the others remain unchanged, then the dimension of the feature map will be larger than 1*1. The C5 layer has 48120 trainable connections.
The F6 layer has 84 units (the reason for choosing this number comes from the design of the output layer) and is fully connected to the C5 layer. There are 10164 trainable parameters. Like a classic neural network, the F6 layer calculates the dot product between the input vector and the weight vector, plus a bias. It is then passed to the sigmoid function to produce a state for unit i.
Finally, the output layer consists of Euclidean Radial Basis Function units, one for each class, each with 84 inputs. In other words, each output RBF unit computes the Euclidean distance between the input vector and the parameter vector. The farther the input is from the parameter vector, the larger the RBF output. An RBF output can be understood as a penalty term that measures how well the input pattern matches a model of the class associated with the RBF. In probabilistic terms, the RBF output can be understood as the negative log-likelihood of a Gaussian distribution over the configuration space of the F6 layer. Given an input pattern, the loss function should be such that the configuration of F6 is close enough to the RBF parameter vector (i.e., the expected classification of the pattern). The parameters of these units are manually chosen and kept fixed (at least initially). The components of these parameter vectors are set to -1 or 1. Although these parameters can be chosen with equal probability to -1 and 1, or to form an error correction code, they are designed to be a 7*12 (i.e., 84) formatted image of the corresponding character class. This representation is not very useful for identifying individual digits, but is useful for identifying strings of characters in the printable ASCII set.
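A minimal NumPy sketch of such an RBF output layer: it assumes a hypothetical 10x84 matrix of +1/-1 prototype vectors (the real LeNet-5 prototypes are stylized 7*12 character bitmaps, which are not reproduced here).

import numpy as np

def rbf_outputs(f6, prototypes):
    # Squared Euclidean distance between the 84-dimensional F6 state and
    # each class's fixed +1/-1 parameter vector; smaller means a better match.
    return np.sum((f6[None, :] - prototypes) ** 2, axis=1)

f6 = np.tanh(np.random.randn(84))              # toy F6 activation
prototypes = np.sign(np.random.randn(10, 84))  # stand-in for the character bitmaps
print(np.argmin(rbf_outputs(f6, prototypes)))  # predicted class = nearest prototype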
Another reason to use this distribution encoding instead of the more common “1 of N” encoding for generating the output is that non-distributed encodings work poorly when the number of classes is large. The reason is that the output of non-distributed encodings must be 0 most of the time. This makes it difficult to achieve with sigmoid units. Another reason is that the classifier is not only used to recognize letters, but also to reject non-letters. RBFs using distribution encoding are better suited for this goal. Because unlike sigmoids, they excite in a well-constrained region of the input space, while atypical patterns are more likely to fall outside.
The RBF parameter vector plays the role of the target vector of the F6 layer. It is important to note that the components of these vectors are either +1 or -1, which is well within the range of the F6 sigmoid, thus preventing the sigmoid function from saturating. In fact, +1 and -1 are the points of maximum curvature of the sigmoid function. This makes the F6 unit operate in the maximum nonlinear range. Saturation of the sigmoid function must be avoided because it will lead to slower convergence and ill-conditioning of the loss function.
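For readers who want to experiment, here is a rough modern approximation of this architecture in PyTorch. It is a sketch under my own assumptions, not LeCun's original implementation: plain average pooling stands in for the trainable subsampling layers, C3 is connected to all six S2 maps instead of being partially connected, and an ordinary linear layer replaces the RBF output units.

import torch
import torch.nn as nn

lenet5_like = nn.Sequential(
    nn.Conv2d(1, 6, kernel_size=5),    # C1: 6 feature maps, 28x28
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S2: 6 maps, 14x14
    nn.Conv2d(6, 16, kernel_size=5),   # C3: 16 maps, 10x10 (fully connected to S2 here)
    nn.Tanh(),
    nn.AvgPool2d(2),                   # S4: 16 maps, 5x5
    nn.Conv2d(16, 120, kernel_size=5), # C5: 120 maps of size 1x1
    nn.Tanh(),
    nn.Flatten(),
    nn.Linear(120, 84),                # F6
    nn.Tanh(),
    nn.Linear(84, 10),                 # output: one unit per digit class
)

x = torch.randn(1, 1, 32, 32)          # a dummy 32x32 grey-level input
print(lenet5_like(x).shape)            # torch.Size([1, 10])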
5) Training process
The mainstream of pattern-recognition neural networks is the supervised learning network, while unsupervised learning networks are used more for cluster analysis. In supervised pattern recognition the category of every sample is known, so the samples are not partitioned in space according to their natural distribution; instead, an appropriate partition of the space, or a classification boundary, is sought based on how samples of the same class cluster together and how well samples of different classes separate, so that samples of different classes fall in different regions. This requires a long and complex learning process that keeps adjusting the position of the classification boundary so that as few samples as possible fall into regions of the wrong class.
Convolutional networks are essentially a mapping from input to output. They can learn a large number of input-output mapping relations without any precise mathematical expression between input and output; as long as the network is trained on known patterns, it acquires the mapping between input-output pairs. Convolutional networks are trained with supervision, so their training set consists of vector pairs of the form (input vector, ideal output vector). All these pairs should come from the actual "running" results of the system the network is meant to imitate, and they can be collected from a real running system. Before training starts, all weights should be initialized with different small random numbers. "Small" ensures that the network does not enter saturation because of overly large weights, which would make training fail; "different" ensures that the network can learn at all. In fact, if the weight matrix is initialized with identical values, the network is unable to learn.
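A tiny NumPy illustration of that initialization rule ("different small random numbers"); the interval of ±0.05 and the shapes are arbitrary choices of mine, not values from the text.

import numpy as np

rng = np.random.default_rng(0)
# Small, so the sigmoids do not start out saturated; different, so the
# units do not remain identical copies of each other during learning.
W = rng.uniform(-0.05, 0.05, size=(5 * 5, 6))   # e.g. 5x5 filter weights for 6 maps
b = np.zeros(6)                                 # biases may simply start at zero
print(W.shape, b.shape)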
The training algorithm is similar to the traditional BP algorithm. It mainly includes 4 steps, which are divided into two stages:
The first stage, the forward propagation stage:
a) Take a sample (X_p, Y_p) from the sample set and feed X_p into the network;
b) Compute the corresponding actual output O_p.
In this phase, information is transformed from the input layer to the output layer. This process is also the process performed when the network is running normally after training. In this process, the network performs calculations (actually, the input is multiplied by the weight matrix of each layer to obtain the final output result):
O_p = F_n( … ( F_2( F_1( X_p W^(1) ) W^(2) ) … ) W^(n) )
The second stage, the back propagation stage
a) Calculate the difference between the actual output O_p and the corresponding ideal output Y_p;
b) Back propagate and adjust the weight matrix according to the method of minimizing the error.
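The two stages can be sketched end to end for a plain fully connected network (not the convolutional case, for brevity) in a few lines of NumPy; the toy data, layer sizes, and learning rate are all illustrative assumptions.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
X = rng.random((100, 4))                                # toy input samples X_p
Y = (X.sum(axis=1, keepdims=True) > 2).astype(float)    # toy ideal outputs Y_p
W1 = rng.uniform(-0.05, 0.05, (4, 8))                   # small, different random weights
W2 = rng.uniform(-0.05, 0.05, (8, 1))
lr = 0.5

for epoch in range(1000):
    # Stage 1: forward propagation, O_p = F_2(F_1(X_p W^(1)) W^(2))
    H = sigmoid(X @ W1)
    O = sigmoid(H @ W2)
    # Stage 2: back-propagate the error (O_p - Y_p) and adjust the weights
    dO = (O - Y) * O * (1 - O)
    dH = (dO @ W2.T) * H * (1 - H)
    W2 -= lr * (H.T @ dO) / len(X)
    W1 -= lr * (X.T @ dH) / len(X)

print(float(np.mean((O > 0.5) == (Y > 0.5))))           # training accuracy after the loop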
6) Advantages of Convolutional Neural Networks
Convolutional neural networks (CNNs) are mainly used to recognize two-dimensional patterns that are invariant to displacement, scaling, and other forms of distortion. Since the feature detection layers of a CNN learn from training data, explicit feature extraction is avoided; learning from the training data is implicit. Furthermore, since the neurons on one feature map share the same weights, the network can learn in parallel, which is another major advantage of convolutional networks over networks whose neurons are all connected to one another. With their special structure of locally shared weights, convolutional neural networks have unique advantages in speech recognition and image processing; their layout is closer to that of real biological neural networks, weight sharing lowers the network's complexity, and in particular an image as a multi-dimensional input vector can be fed directly into the network, avoiding the complexity of data reconstruction during feature extraction and classification.
Mainstream classification methods are almost always based on statistical features, which means that certain features must be extracted before classification can be performed. However, explicit feature extraction is not easy and is not always reliable in some applications. Convolutional neural networks avoid explicit feature extraction and learn implicitly from the training data. This sets them clearly apart from other neural-network-based classifiers: through structural reorganization and weight reduction, the feature extraction function is folded into the multi-layer perceptron itself. A CNN can operate directly on grayscale images and can therefore be used directly for image-based classification.
Convolutional networks have the following advantages over general neural networks in image processing: a) The input image and the network’s topological structure can be well matched; b) Feature extraction and pattern classification are performed simultaneously and are generated simultaneously during training; c) Weight sharing can reduce the network’s training parameters, making the neural network structure simpler and more adaptable.
7) Summary
The close tie between the inter-layer connections of a CNN and spatial information makes it well suited to image processing and understanding, and it performs relatively well at automatically extracting the salient features of an image. In some examples, Gabor filters have been used in an initialization or preprocessing step to simulate the response of the human visual system to visual stimuli. In most current work, researchers apply CNNs to a variety of machine-learning problems, including face recognition, document analysis, and language detection. To capture coherence between frames in video, CNNs are currently trained with a temporal-coherence objective, though this idea is not unique to CNNs.
Haha, this part is too long-winded and not to the point. I have no choice but to leave it like this for now. I have not gone through this process yet, so my level is limited. I hope you can understand. I need to revise it later, haha.
10. Summary and Outlook
1) Summary of Deep Learning
Deep learning is about algorithms that automatically learn multi-layer (complex) representations of the latent (implicit) distribution of the data being modeled. In other words, a deep learning algorithm automatically extracts the low-level and high-level features needed for classification, where high-level features are features that depend hierarchically on other features. For machine vision, for example, a deep learning algorithm learns a low-level representation of the raw image, such as edge detectors or wavelet filters, then builds representations on top of those low-level ones, such as linear or nonlinear combinations of them, and repeats this process to finally obtain a high-level representation.
Deep learning can obtain features that better represent data. At the same time, because the model has many levels and parameters and sufficient capacity, the model is capable of representing large-scale data. Therefore, for problems such as images and speech whose features are not obvious (manual design is required and many have no intuitive physical meaning), it can achieve better results on large-scale training data. In addition, from the perspective of pattern recognition features and classifiers, the deep learning framework combines features and classifiers into one framework, uses data to learn features, and reduces the huge workload of manually designing features during use (this is currently the area where industrial engineers put the most effort). Therefore, not only can the effect be better, but it is also very convenient to use. Therefore, it is a set of frameworks that are worth paying attention to, and everyone who does ML should pay attention to it.
Of course, deep learning itself is not perfect, nor is it a powerful tool for solving any ML problem in the world, and it should not be magnified to the point of being omnipotent.
2) The future of deep learning
There is still a great deal of work to be done in deep learning. The current focus is on borrowing methods from machine learning that can be used in deep learning, especially for dimensionality reduction. One example is sparse coding, which uses compressed-sensing theory to reduce the dimensionality of high-dimensional data so that a vector with very few non-zero elements can accurately represent the original high-dimensional signal. Another is semi-supervised manifold learning, which projects high-dimensional data into a low-dimensional space by measuring the similarity between training samples. A further, more inspiring direction is evolutionary programming, which can perform conceptual adaptive learning and change the core architecture by minimizing an engineered energy function.
Deep learning still has many core problems to be solved:
(1) For a particular framework, for how many dimensions of input can it perform well (if it is an image, it may be millions of dimensions)?
(2) Which architecture is effective for capturing short-term or long-term temporal dependencies?
(3) How to integrate information from multiple perceptions for a given deep learning architecture?
(4) What are the correct mechanisms to enhance a given deep learning architecture to improve its robustness and invariance to distortions and data loss?
(5) In terms of models, are there other more effective and theoretically based deep model learning algorithms?
Exploring new feature-extraction models is a topic worth studying in depth. Effective parallel training algorithms are also a direction worth pursuing: current mini-batch stochastic gradient optimization algorithms are hard to parallelize across multiple machines, and the usual approach is to accelerate learning with graphics processing units, but a single machine's GPU is not suited to large-scale recognition tasks or data sets of similar size. On the application side, how to use deep learning reasonably and fully to improve the performance of traditional learning algorithms remains a research focus in many fields.
11. References and Resources for Learning Deep Learning (Continuously Updated…)
First, the Weibo of the big guns in the field of machine learning: @余凯_西二旗民工; @老师木; @梁斌penny; @张栋_机器; @邓坎; @大数据皮东; @djvu9…
(1) Deep Learning
http://deeplearning.net/
(2) Deep Learning Methods for Vision
http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/
(3) Neural Network for Recognition of Handwritten Digits [Project]
http://www.codeproject.com/Articles/16650/Neural-Network-for-Recognition-of-Handwritten-Digi
(4) Training a deep autoencoder or a classifier on MNIST digits
http://www.cs.toronto.edu/~hinton/MatlabForSciencePaper.html
(5) Ersatz: deep neural networks in the cloud
http://www.ersatz1.com/
(6) Deep Learning
http://www.cs.nyu.edu/~yann/research/deep/
(7) Invited talk “A Tutorial on Deep Learning” by Dr. Kai Yu
http://vipl.ict.ac.cn/News/academic-report-tutorial-deep-learning-dr-kai-yu
(8) CNN – Convolutional neural network class
http://www.mathworks.cn/matlabcentral/fileexchange/24291
(9) Yann LeCun’s Publications
http://yann.lecun.com/exdb/publis/index.html#lecun-98
(10) LeNet-5, convolutional neural networks
http://yann.lecun.com/exdb/lenet/index.html
(11) Deep Learning expert Geoffrey E. Hinton’s HomePage
http://www.cs.toronto.edu/~hinton/
(12) Sparse coding simulation software [Project]
http://redwood.berkeley.edu/bruno/sparsenet/
(13) Andrew Ng’s homepage
http://robotics.stanford.edu/~ang/
(14)stanford deep learning tutorial
http://deeplearning.stanford.edu/wiki/index.php/UFLDL_Tutorial
(15) How does a deep neural network work?
http://www.zhihu.com/question/19833708?group_id=15019075#1657279
(16) A shallow understanding on deep learning
http://blog.sina.com.cn/s/blog_6ae183910101dw2z.html
(17)Bengio’s Learning Deep Architectures for AI
http://www.iro.umontreal.ca/~bengioy/papers/ftml_book.pdf
(18) andrew ng’s talk video:
http://techtalks.tv/talks/machine-learning-and-ai-via-brain-simulations/57862/
(19) CVPR 2012 tutorial:
http://cs.nyu.edu/~fergus/tutorials/deep_learning_cvpr12/tutorial_p2_nnets_ranzato_short.pdf
(20) My thoughts after listening to Andrew ng’s report at Tsinghua University
http://blog.sina.com.cn/s/blog_593af2a70101bqyo.html
(21) Kai Yu: CVPR12 Tutorial on Deep Learning Sparse Coding
(22) Honglak Lee: Deep Learning Methods for Vision
(23) Andrew Ng: Machine Learning and AI via Brain simulations
(24) Deep Learning [2,3]
http://blog.sina.com.cn/s/blog_46d0a3930101gs5h.html
(25) The little thing called deep learning…
http://blog.sina.com.cn/s/blog_67fcf49e0101etab.html
(26) Yoshua Bengio, U. Montreal: Learning Deep Architectures
(27) Kai Yu: A Tutorial on Deep Learning
(28) Marc’Aurelio Ranzato: NEURAL NETS FOR VISION
(29) Unsupervised feature learning and deep learning
http://blog.csdn.net/abcjennifer/article/details/7804962
(30) Hot Topics in Machine Learning – Deep Learning
http://elevencitys.com/?p=1854
(31) Machine Learning — Deep Learning
http://blog.csdn.net/abcjennifer/article/details/7826917
(32) Convolutional Neural Networks
http://wenku.baidu.com/view/cd16fb8302d276a200292e22.html
(33) A brief discussion on the basic ideas and methods of Deep Learning
http://blog.csdn.net/xianlingmao/article/details/8478562
(34) Deep Neural Networks
http://blog.csdn.net/txdb/article/details/6766373
(35) Google’s cat face recognition: a new breakthrough in artificial intelligence
http://www.36kr.com/p/122132.html
(36) Yu Kai, Deep Learning: The New Wave of Machine Learning, Technical News
http://blog.csdn.net/datoubo/article/details/8577366
(37) Geoffrey Hinton: UCL Tutorial on: Deep Belief Nets
(38) Learning Deep Boltzmann Machines
http://web.mit.edu/~rsalakhu/www/DBM.html
(39) Efficient Sparse Coding Algorithm
http://blog.sina.com.cn/s/blog_62af19190100gux1.html
(40) Itamar Arel, Derek C. Rose, and Thomas P. Karnowski: Deep Machine Learning—A New Frontier in Artificial Intelligence Research
(41) Francis Quintal Lauzon: An introduction to deep learning
(42) Tutorial on Deep Learning and Applications
(43) Boltzmann neural network model and learning algorithm
http://wenku.baidu.com/view/490dcf748e9951e79b892785.html
(44) Deep Learning and Knowledge Graph ignite the big data revolution
http://blog.sina.com.cn/s/blog_46d0a3930101fswl.html