These are the landmark convolutional neural network models from the ImageNet challenge.
AlexNet
- ReLU Nonlinearity
- trains much faster and mitigates the vanishing gradient problem
- Overlapping Pooling
- Reducing Overfitting
- Data Augmentation
- Dropout
we first rescaled the image such that the shorter side was of length 256, and then cropped out the central 256×256 patch from the resulting image. We did not pre-process the images in any other way, except for subtracting the mean activity over the training set from each pixel.
The authors rescale and crop the images and subtract the training-set mean; this kind of preprocessing has since become standard practice.
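A minimal sketch of this preprocessing, assuming a precomputed per-pixel mean image over the training set (the function and variable names are illustrative, not from the paper):

```python
import numpy as np
from PIL import Image

def preprocess(path, mean_image, size=256):
    """Rescale so the shorter side is 256, centre-crop 256x256,
    then subtract the per-pixel training-set mean (mean_image: 256x256x3)."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    scale = size / min(w, h)
    img = img.resize((round(w * scale), round(h * scale)), Image.BILINEAR)
    w, h = img.size
    left, top = (w - size) // 2, (h - size) // 2
    img = img.crop((left, top, left + size, top + size))
    return np.asarray(img, dtype=np.float32) - mean_image
```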
It contains eight learned layers — five convolutional and three fully-connected.
At that time, a network of this depth was unprecedented.
In terms of training time with gradient descent, these saturating nonlinearities are much slower than the non-saturating nonlinearity f(x) = max(0, x).
Using ReLU in place of the traditional saturating activation functions makes training faster.
The parallelization scheme that we employ essentially puts half of the kernels (or neurons) on each GPU, with one additional trick: the GPUs communicate only in certain layers
Two GPUs are used to speed up training, and the GPUs only exchange data with each other at layers 3/6/7/8.
However, we still find that the following local normalization scheme aids generalization
Local Response Normalization is used to improve the model's generalization. The paper reports that it reduces the top-1 and top-5 error rates by 1.4% and 1.2% respectively. The scheme gives the (unbounded) ReLU activations a decaying, lateral-inhibition-like behaviour across neighbouring kernels.
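As a sketch, PyTorch's `nn.LocalResponseNorm` implements this kind of normalization; the hyper-parameters below are the ones reported in the paper (n = 5, α = 1e-4, β = 0.75, k = 2), and the tensor shape is only illustrative:

```python
import torch
import torch.nn as nn

# Local Response Normalization applied after the ReLU activations.
lrn = nn.LocalResponseNorm(size=5, alpha=1e-4, beta=0.75, k=2.0)
activations = torch.relu(torch.randn(8, 96, 55, 55))   # (batch, channels, H, W)
normalized = lrn(activations)
```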
We generally observe during training that models with overlapping pooling find it slightly more difficult to overfit
Overlapping pooling is likewise used to reduce the error rate. The paper does not explain why it helps; it is an empirical finding. Perhaps the overlap retains more information.
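For reference, overlapping pooling simply means the pooling window z is larger than the stride s (the paper uses z = 3, s = 2); a quick comparison with an illustrative input shape:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 96, 55, 55)
overlapping = nn.MaxPool2d(kernel_size=3, stride=2)      # z=3 > s=2: windows overlap
non_overlapping = nn.MaxPool2d(kernel_size=2, stride=2)  # z=s: traditional pooling
print(overlapping(x).shape, non_overlapping(x).shape)    # both give 27x27 outputs here
```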
The first form of data augmentation consists of generating image translations and horizontal reflections. The second form of data augmentation consists of altering the intensities of the RGB channels in training images.
Two forms of data augmentation are used. Random 224×224 patches, plus their horizontal reflections, are extracted from the 256×256 images. For the colour augmentation, PCA is run on the RGB pixel values of the whole training set, and multiples of the principal components, scaled by the corresponding eigenvalues times random numbers drawn from a zero-mean Gaussian with σ = 0.1, are added to each training image.
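A sketch of the second (colour) augmentation, often called "fancy PCA"; here `eigvals` and `eigvecs` are assumed to come from a PCA over all RGB pixel values in the training set, and the image is assumed to be a float array:

```python
import numpy as np

def pca_color_augment(image, eigvals, eigvecs, sigma=0.1, rng=None):
    """image: H x W x 3 float array (assumed scaled to [0, 1]);
    eigvals: 3 eigenvalues, eigvecs: 3x3 matrix with eigenvectors as columns."""
    rng = np.random.default_rng() if rng is None else rng
    alphas = rng.normal(0.0, sigma, size=3)      # one draw per image, per the paper
    delta = eigvecs @ (alphas * eigvals)         # [p1 p2 p3][a1*l1, a2*l2, a3*l3]^T
    return np.clip(image + delta, 0.0, 1.0)
```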
The recently-introduced technique, called “dropout”, consists of setting to zero the output of each hidden neuron with probability 0.5.
Dropout is introduced to reduce overfitting.
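A minimal sketch of the mechanism described in the quote; note that it uses the now-common "inverted" scaling at training time, whereas the paper instead multiplies the outputs by 0.5 at test time:

```python
import torch

def dropout(h, p=0.5, training=True):
    """Zero each hidden activation with probability p during training."""
    if not training:
        return h
    mask = (torch.rand_like(h) > p).float()
    return h * mask / (1 - p)   # inverted scaling keeps test-time code unchanged

out = dropout(torch.randn(4, 4096))
```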
VGG
- Small Convolution Filters
- 16–19 layers
- stack of 3x3 conv
- 1x1 conv
The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity).
A stack of several 3×3 convolutions is used in place of a single large filter; besides greatly reducing the number of parameters, this also extracts features better, since each extra layer adds a nonlinearity. The 1×1 convolutions perform a linear transformation of the input channels and, combined with a nonlinear function (e.g. ReLU), increase the expressiveness of the features; they can also be used to control the output dimensionality. For an RGB input, a 1×1 kernel is actually 1×1×3, and the number of kernels determines the output dimensionality: 10 kernels produce a 10-channel output. For a grayscale input, a 1×1 kernel is just 1×1×1.
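A quick illustration of the parameter saving (the channel count C is illustrative): three stacked 3×3 convolutions cover the same 7×7 receptive field as a single 7×7 convolution, with 27C² weights instead of 49C²:

```python
import torch.nn as nn

C = 64
stacked = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
single = nn.Conv2d(C, C, kernel_size=7, padding=3)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(stacked), count(single))   # ~111k vs ~201k parameters (with biases)
```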
GoogLeNet
- Auxiliary Classifiers
- Sparse Connection
- 1x1 Convolutions
- Convolutions of Different Sizes
- Global Average Pooling
The most straightforward way of improving the performance of deep neural networks is by increasing their size
Judging by the trend this is indeed the case: increasing a model's depth and width both improve accuracy, but they also increase the computational cost and the hardware requirements (more memory and faster compute), and they make training harder; for example, by the time backpropagation reaches the early layers the gradient signal is very weak. Too many parameters also cause overfitting.
A fundamental way of solving both of these issues would be to introduce sparsity and replace the fully connected layers by the sparse ones, even inside the convolutions.
However, sparse structures are unfriendly to today's hardware: they cannot exploit its highly efficient dense computation.
In order to avoid patch-alignment issues, current incarnations of the Inception architecture are restricted to filter sizes 1×1, 3×3 and 5×5
This led to GoogLeNet's distinctive Inception module, shown in the figure above: convolutions of several different sizes are applied in parallel and their outputs concatenated, fusing features at different scales. 1×1 convolutions are placed before the 3×3 and 5×5 convolutions to reduce the channel dimension and thus the amount of computation. This is also one way of approximating a sparse structure.
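A minimal sketch of such a block in PyTorch; the branch channel counts are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class InceptionBlock(nn.Module):
    """Inception-v1 style block: parallel branches concatenated on channels."""
    def __init__(self, in_ch):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, 64, kernel_size=1)
        self.branch3 = nn.Sequential(                    # 1x1 reduce, then 3x3
            nn.Conv2d(in_ch, 96, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(96, 128, kernel_size=3, padding=1))
        self.branch5 = nn.Sequential(                    # 1x1 reduce, then 5x5
            nn.Conv2d(in_ch, 16, kernel_size=1), nn.ReLU(inplace=True),
            nn.Conv2d(16, 32, kernel_size=5, padding=2))
        self.branch_pool = nn.Sequential(                # pool, then 1x1 project
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),
            nn.Conv2d(in_ch, 32, kernel_size=1))

    def forward(self, x):
        # Concatenate the four branches along the channel dimension.
        return torch.cat([self.branch1(x), self.branch3(x),
                          self.branch5(x), self.branch_pool(x)], dim=1)
```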
We found that a move from fully connected layers to average pooling improved the top-1 accuracy by about 0.6%, however the use of dropout remained essential even after removing the fully connected layers
The idea comes from Network in Network: each feature map in the final layer is averaged over its whole spatial extent to produce a single value, and these values form the final feature vector that is fed into the softmax. This avoids fully-connected layers, reducing the number of parameters and alleviating overfitting.
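A small sketch of global average pooling feeding a single linear classifier (the shapes are illustrative):

```python
import torch
import torch.nn as nn

# Each of the C final feature maps is averaged to one value, giving a
# C-dimensional vector that goes straight into the classifier.
feature_maps = torch.randn(8, 1024, 7, 7)        # (batch, channels, H, W)
gap = nn.AdaptiveAvgPool2d(1)
vec = gap(feature_maps).flatten(1)               # shape: (8, 1024)
logits = nn.Linear(1024, 1000)(vec)              # then softmax for class scores
```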
By adding auxiliary classifiers connected to these intermediate layers, discrimination in the lower stages in the classifier was expected.
To counter vanishing gradients, two auxiliary classifiers are attached to intermediate layers so that gradients also flow in from the middle of the network. Their losses, multiplied by a discount factor, are added to the loss propagated back from the later layers to assist training.
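A sketch of how the auxiliary losses could be combined during training (the paper weights them by 0.3); the logits here are random stand-ins for the network's three classifier heads:

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()
target = torch.randint(0, 1000, (8,))
main_logits, aux1_logits, aux2_logits = (torch.randn(8, 1000, requires_grad=True)
                                         for _ in range(3))

# Discounted auxiliary losses are added to the main loss during training;
# at inference time the auxiliary heads are discarded.
loss = (criterion(main_logits, target)
        + 0.3 * criterion(aux1_logits, target)
        + 0.3 * criterion(aux2_logits, target))
loss.backward()
```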
ResNet
- Depth 152
- Shortcut Connection
with the network depth increasing, accuracy gets saturated (which might be unsurprising) and then degrades rapidly
It was noted earlier that deeper networks give higher accuracy, but in practice a plain network deeper than about 20 layers does not perform better than one with only a dozen or so layers.
Instead of hoping each few stacked layers directly fit a desired underlying mapping, we explicitly let these layers fit a residual mapping. Formally, denoting the desired underlying mapping as H(x), we let the stacked nonlinear layers fit another mapping of F(x) := H(x)−x. The original mapping is recast into F(x)+x. We hypothesize that it is easier to optimize the residual mapping than to optimize the original, unreferenced mapping
This is the core idea of the paper: adding a residual structure. As shown by the dashed lines in the figure above, shortcut connections are added between layers that are not directly adjacent; they provide the following layers with an extra identity mapping, h(x) = x. The reasoning goes as follows: let the mapping we want to compute be H(x); write it in residual form F(x) := H(x) − x, and assume that fitting F(x) is easier than fitting H(x) directly. H(x) is then recovered simply as F(x) + x. Why should this be easier? The paper does not explain the underlying theory; the hypothesis is supported by the experimental results. When computing the residual, F(x) and x may have different dimensions, so the authors propose two ways to match them: zero padding and the 1×1 convolutions mentioned earlier.
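A minimal sketch of a residual block with an identity shortcut (when the dimensions of F(x) and x differ, the shortcut would instead use zero padding or a 1×1 convolution, as discussed above):

```python
import torch
import torch.nn as nn

class BasicResidualBlock(nn.Module):
    """The stacked layers fit F(x); the shortcut adds x back, so the
    block outputs F(x) + x."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        residual = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(residual + x)   # identity shortcut: F(x) + x
```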
DenseNet
- Dense Connectivity
- Combine Features by Concatenating
… we combine features by concatenating them… To further improve the information flow between layers we propose a different connectivity pattern: we introduce direct connections from any layer to all subsequent layers.
DenseNet borrows ResNet's idea, but the output features of different layers are concatenated directly, whereas in ResNet they are added. Every layer receives the feature maps of all preceding layers directly, which helps reduce the information lost along the way.
Between dense blocks, transition layers use 1×1 convolutions and average pooling to reduce the feature dimensions.
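A sketch of one dense layer and one transition layer (the growth rate and channel counts are illustrative):

```python
import torch
import torch.nn as nn

class DenseLayer(nn.Module):
    """One layer inside a dense block: its output (growth_rate new channels)
    is concatenated with all the features it received."""
    def __init__(self, in_ch, growth_rate=32):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, growth_rate, kernel_size=3, padding=1, bias=False))

    def forward(self, x):
        return torch.cat([x, self.body(x)], dim=1)   # concatenate, not add

class Transition(nn.Module):
    """Between dense blocks: 1x1 convolution plus average pooling to shrink
    both the channel count and the spatial resolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.BatchNorm2d(in_ch), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False),
            nn.AvgPool2d(kernel_size=2, stride=2))

    def forward(self, x):
        return self.body(x)
```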