  • VERY DEEP CONVOLUTIONAL NETWORKS FOR LARGE-SCALE IMAGE RECOGNITION (paper translation)
    Autonomous Driving 2023. 7. 5. 23:38

     

    The page below has a good summary of this paper.

    https://medium.com/@msmapark2/vgg16-%EB%85%BC%EB%AC%B8-%EB%A6%AC%EB%B7%B0-very-deep-convolutional-networks-for-large-scale-image-recognition-6f748235242a

     

    VGG16 Paper Review — Very Deep Convolutional Networks for Large-Scale Image Recognition

    The VGG-16 model achieved a top-5 test accuracy of 92.7% on the ImageNet Challenge, establishing itself in 2014 as one of the landmark deep learning studies for computer vision.

    medium.com

     

    Still, for anyone who wants to see the paper's actual content, the text below is worth reading.

    The reason for reading this paper is that it is the natural point of comparison for ResNet, and it is very helpful for people who are new to deep learning.

    What filter size should you use? Isn't that a question you have wondered about?

    This paper explains why using only 3x3 filters still works so well.

    In short, it argues that stacking several small filters is better than using a single large filter, both in terms of non-linearity and in terms of parameter count, as the short sketch below illustrates.
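
    A minimal sketch (PyTorch, not from the paper) of that argument: three stacked 3x3 convolutions cover the same 7x7 receptive field as a single 7x7 convolution, but use fewer parameters and interleave three non-linearities instead of one. The channel count C is an arbitrary choice for illustration.

```python
import torch.nn as nn

C = 512  # assume C input and C output channels, as in the paper's discussion

stacked_3x3 = nn.Sequential(
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    nn.Conv2d(C, C, kernel_size=3, padding=1), nn.ReLU(inplace=True),
)
single_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3)

def n_params(m):
    return sum(p.numel() for p in m.parameters())

print(n_params(stacked_3x3))  # 3 * (3*3*C*C + C), roughly 27*C^2
print(n_params(single_7x7))   # 7*7*C*C + C, roughly 49*C^2
```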

     

    ABSTRACT
    In this work we investigate the effect of the convolutional network depth on its accuracy in the large-scale image recognition setting. Our main contribution is a thorough evaluation of networks of increasing depth using an architecture with very small (3 × 3) convolution filters, which shows that a significant improvement on the prior-art configurations can be achieved by pushing the depth to 16–19 weight layers. These findings were the basis of our ImageNet Challenge 2014 submission, where our team secured the first and the second places in the localisation and classification tracks respectively. We also show that our representations generalise well to other datasets, where they achieve state-of-the-art results. We have made our two best-performing ConvNet models publicly available to facilitate further research on the use of deep visual representations in computer vision.
    1 INTRODUCTION

    Convolutional networks (ConvNets) have recently enjoyed a great success in large-scale image and video recognition (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014; Simonyan & Zisserman, 2014) which has become possible due to the large public image repositories, such as ImageNet (Deng et al., 2009), and high-performance computing systems, such as GPUs or large-scale distributed clusters (Dean et al., 2012). In particular, an important role in the advance of deep visual recognition architectures has been played by the ImageNet Large-Scale Visual Recognition Challenge (ILSVRC) (Russakovsky et al., 2014), which has served as a testbed for a few generations of large-scale image classification systems, from high-dimensional shallow feature encodings (Perronnin et al., 2010) (the winner of ILSVRC-2011) to deep ConvNets (Krizhevsky et al., 2012) (the winner of ILSVRC-2012).

    With ConvNets becoming more of a commodity in the computer vision field, a number of attempts have been made to improve the original architecture of Krizhevsky et al. (2012) in a bid to achieve better accuracy. For instance, the best-performing submissions to the ILSVRC-2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014) utilised smaller receptive window size and smaller stride of the first convolutional layer. Another line of improvements dealt with training and testing the networks densely over the whole image and over multiple scales (Sermanet et al., 2014; Howard, 2014). In this paper, we address another important aspect of ConvNet architecture design – its depth. To this end, we fix other parameters of the architecture, and steadily increase the depth of the network by adding more convolutional layers, which is feasible due to the use of very small (3 × 3) convolution filters in all layers.

    As a result, we come up with significantly more accurate ConvNet architectures, which not only achieve the state-of-the-art accuracy on ILSVRC classification and localisation tasks, but are also applicable to other image recognition datasets, where they achieve excellent performance even when used as a part of relatively simple pipelines (e.g. deep features classified by a linear SVM without fine-tuning). We have released our two best-performing models to facilitate further research. The rest of the paper is organised as follows. In Sect. 2, we describe our ConvNet configurations. The details of the image classification training and evaluation are then presented in Sect. 3, and the configurations are compared on the ILSVRC classification task in Sect. 4. Sect. 5 concludes the paper. For completeness, we also describe and assess our ILSVRC-2014 object localisation system in Appendix A, and discuss the generalisation of very deep features to other datasets in Appendix B. Finally, Appendix C contains the list of major paper revisions.
    2 CONVNET CONFIGURATIONS

    To measure the improvement brought by the increased ConvNet depth in a fair setting, all our ConvNet layer configurations are designed using the same principles, inspired by Ciresan et al. (2011); Krizhevsky et al. (2012). In this section, we first describe a generic layout of our ConvNet configurations (Sect. 2.1) and then detail the specific configurations used in the evaluation (Sect. 2.2). Our design choices are then discussed and compared to the prior art in Sect. 2.3.

     
    2.1 ARCHITECTURE
    During training, the input to our ConvNets is a fixed-size 224 × 224 RGB image. The only pre-processing we do is subtracting the mean RGB value, computed on the training set, from each pixel. The image is passed through a stack of convolutional (conv.) layers, where we use filters with a very small receptive field: 3 × 3 (which is the smallest size to capture the notion of left/right, up/down, center). In one of the configurations we also utilise 1 × 1 convolution filters, which can be seen as a linear transformation of the input channels (followed by non-linearity). The convolution stride is fixed to 1 pixel; the spatial padding of conv. layer input is such that the spatial resolution is preserved after convolution, i.e. the padding is 1 pixel for 3 × 3 conv. layers. Spatial pooling is carried out by five max-pooling layers, which follow some of the conv. layers (not all the conv. layers are followed by max-pooling). Max-pooling is performed over a 2 × 2 pixel window, with stride 2.
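
    A small shape check (PyTorch; illustrative only): with stride 1 and 1-pixel padding, a 3 × 3 convolution preserves the spatial resolution, and a 2 × 2 max-pooling with stride 2 halves it, exactly as described above.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)          # a fixed-size 224x224 RGB input
conv = nn.Conv2d(3, 64, kernel_size=3, stride=1, padding=1)
pool = nn.MaxPool2d(kernel_size=2, stride=2)

print(conv(x).shape)        # torch.Size([1, 64, 224, 224]) - resolution preserved
print(pool(conv(x)).shape)  # torch.Size([1, 64, 112, 112]) - halved by pooling
```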

     
    A stack of convolutional layers (which has a different depth in different architectures) is followed by three Fully-Connected (FC) layers: the first two have 4096 channels each, the third performs 1000-way ILSVRC classification and thus contains 1000 channels (one for each class). The final layer is the soft-max layer. The configuration of the fully connected layers is the same in all networks. All hidden layers are equipped with the rectification (ReLU (Krizhevsky et al., 2012)) non-linearity. We note that none of our networks (except for one) contain Local Response Normalisation (LRN) normalisation (Krizhevsky et al., 2012): as will be shown in Sect. 4, such normalisation does not improve the performance on the ILSVRC dataset, but leads to increased memory consumption and computation time. Where applicable, the parameters for the LRN layer are those of (Krizhevsky et al., 2012).

    2.2 CONFIGURATIONS
    The ConvNet configurations, evaluated in this paper, are outlined in Table 1, one per column. In the following we will refer to the nets by their names (A–E). All configurations follow the generic design presented in Sect. 2.1, and differ only in the depth: from 11 weight layers in the network A (8 conv. and 3 FC layers) to 19 weight layers in the network E (16 conv. and 3 FC layers). The width of conv. layers (the number of channels) is rather small, starting from 64 in the first layer and then increasing by a factor of 2 after each max-pooling layer, until it reaches 512. In Table 2 we report the number of parameters for each configuration. In spite of a large depth, the number of weights in our nets is not greater than the number of weights in a more shallow net with larger conv. layer widths and receptive fields (144M weights in (Sermanet et al., 2014)).
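
    A sketch of how these configurations can be expressed in code (PyTorch). Table 1 is not reproduced in this post, so the layer lists below are the widely known VGG layouts for configurations A (11 weight layers), D (16) and E (19); numbers are channel widths and 'M' marks a 2 × 2 max-pooling layer. This is an illustration, not the authors' implementation.

```python
import torch.nn as nn

cfgs = {
    "A": [64, "M", 128, "M", 256, 256, "M", 512, 512, "M", 512, 512, "M"],
    "D": [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
          512, 512, 512, "M", 512, 512, 512, "M"],
    "E": [64, 64, "M", 128, 128, "M", 256, 256, 256, 256, "M",
          512, 512, 512, 512, "M", 512, 512, 512, 512, "M"],
}

def make_vgg(cfg, num_classes=1000):
    layers, in_ch = [], 3
    for v in cfg:
        if v == "M":
            layers.append(nn.MaxPool2d(2, stride=2))
        else:
            layers += [nn.Conv2d(in_ch, v, 3, padding=1), nn.ReLU(inplace=True)]
            in_ch = v
    classifier = nn.Sequential(
        nn.Flatten(),
        nn.Linear(512 * 7 * 7, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, 4096), nn.ReLU(inplace=True), nn.Dropout(0.5),
        nn.Linear(4096, num_classes),  # 1000-way ILSVRC classification
    )
    return nn.Sequential(*layers, classifier)

vgg16 = make_vgg(cfgs["D"])
print(sum(p.numel() for p in vgg16.parameters()) / 1e6)  # roughly 138M weights
```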
    3 CLASSIFICATION FRAMEWORK
    In the previous section we presented the details of our network configurations. In this section, we describe the details of classification ConvNet training and evaluation.

    3.1 TRAINING
    The ConvNet training procedure generally follows Krizhevsky et al. (2012) (except for sampling the input crops from multi-scale training images, as explained later).
    Namely, the training is carried out by optimising the multinomial logistic regression objective using mini-batch gradient descent (based on back-propagation (LeCun et al., 1989)) with momentum.
    The batch size was set to 256, momentum to 0.9. The training was regularised by weight decay (the L2 penalty multiplier set to 5 · 10^-4) and dropout regularisation for the first two fully-connected layers (dropout ratio set to 0.5). The learning rate was initially set to 10^-2, and then decreased by a factor of 10 when the validation set accuracy stopped improving.
    In total, the learning rate was decreased 3 times, and the learning was stopped after 370K iterations (74 epochs). We conjecture that in spite of the larger number of parameters and the greater depth of our nets compared to (Krizhevsky et al., 2012), the nets required fewer epochs to converge due to (a) implicit regularisation imposed by greater depth and smaller conv. filter sizes; (b) pre-initialisation of certain layers.
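
    A minimal training-loop sketch under the hyper-parameters stated above (PyTorch; `model`, `train_loader` and `val_accuracy()` are assumed to exist and are not defined in the paper or this post).

```python
import torch

optimizer = torch.optim.SGD(model.parameters(), lr=1e-2,
                            momentum=0.9, weight_decay=5e-4)
criterion = torch.nn.CrossEntropyLoss()   # multinomial logistic regression objective
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1)    # divide the lr by 10 when accuracy plateaus

for epoch in range(74):                   # the paper reports 74 epochs / 370K iterations
    for images, labels in train_loader:   # mini-batches of 256 images
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
    scheduler.step(val_accuracy(model))   # the lr was dropped 3 times in total
```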

    The initialisation of the network weights is important, since bad initialisation can stall learning due to the instability of gradient in deep nets. To circumvent this problem, we began with training the configuration A (Table 1), shallow enough to be trained with random initialisation. Then, when training deeper architectures, we initialised the first four convolutional layers and the last three fully-connected layers with the layers of net A (the intermediate layers were initialised randomly). We did not decrease the learning rate for the pre-initialised layers, allowing them to change during learning. For random initialisation (where applicable), we sampled the weights from a normal distribution with the zero mean and 10^-2 variance. The biases were initialised with zero. It is worth noting that after the paper submission we found that it is possible to initialise the weights without pre-training by using the random initialisation procedure of Glorot & Bengio (2010).

    To obtain the fixed-size 224 × 224 ConvNet input images, they were randomly cropped from rescaled training images (one crop per image per SGD iteration). To further augment the training set, the crops underwent random horizontal flipping and random RGB colour shift (Krizhevsky et al., 2012). Training image rescaling is explained below.

    Training image size. Let S be the smallest side of an isotropically-rescaled training image, from which the ConvNet input is cropped (we also refer to S as the training scale). While the crop size is fixed to 224 × 224, in principle S can take on any value not less than 224: for S = 224 the crop will capture whole-image statistics, completely spanning the smallest side of a training image; for S ≫ 224 the crop will correspond to a small part of the image, containing a small object or an object part.

    We consider two approaches for setting the training scale S. The first is to fix S, which corresponds to single-scale training (note that image content within the sampled crops can still represent multi-scale image statistics). In our experiments, we evaluated models trained at two fixed scales: S = 256 (which has been widely used in the prior art (Krizhevsky et al., 2012; Zeiler & Fergus, 2013; Sermanet et al., 2014)) and S = 384. Given a ConvNet configuration, we first trained the network using S = 256. To speed-up training of the S = 384 network, it was initialised with the weights pre-trained with S = 256, and we used a smaller initial learning rate of 10^-3. The second approach to setting S is multi-scale training, where each training image is individually rescaled by randomly sampling S from a certain range [Smin, Smax] (we used Smin = 256 and Smax = 512). Since objects in images can be of different size, it is beneficial to take this into account during training. This can also be seen as training set augmentation by scale jittering, where a single model is trained to recognise objects over a wide range of scales. For speed reasons, we trained multi-scale models by fine-tuning all layers of a single-scale model with the same configuration, pre-trained with fixed S = 384.
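
    A sketch of the scale-jittered crop sampling described above (torchvision; illustrative, not the authors' code): the smallest image side S is drawn from [256, 512], the image is rescaled isotropically so its smallest side equals S, and a random 224 × 224 crop with a random horizontal flip is taken.

```python
import random
import torchvision.transforms as T

def scale_jitter_transform(s_min=256, s_max=512):
    S = random.randint(s_min, s_max)   # training scale sampled for this image
    return T.Compose([
        T.Resize(S),                   # isotropic rescale: smallest side = S
        T.RandomCrop(224),
        T.RandomHorizontalFlip(),
        T.ToTensor(),
        # the random RGB colour shift of Krizhevsky et al. (2012) would go here
    ])
```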
    3.2 TESTING

    At test time, given a trained ConvNet and an input image, it is classified in the following way. First, it is isotropically rescaled to a pre-defined smallest image side, denoted as Q (we also refer to it as the test scale). We note that Q is not necessarily equal to the training scale S (as we will show in Sect. 4, using several values of Q for each S leads to improved performance). Then, the network is applied densely over the rescaled test image in a way similar to (Sermanet et al., 2014). Namely, the fully-connected layers are first converted to convolutional layers (the first FC layer to a 7 × 7 conv. layer, the last two FC layers to 1 × 1 conv. layers). The resulting fully-convolutional net is then applied to the whole (uncropped) image. The result is a class score map with the number of channels equal to the number of classes, and a variable spatial resolution, dependent on the input image size. Finally, to obtain a fixed-size vector of class scores for the image, the class score map is spatially averaged (sum-pooled). We also augment the test set by horizontal flipping of the images; the soft-max class posteriors of the original and flipped images are averaged to obtain the final scores for the image.
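
    A sketch of the dense-evaluation conversion described above (PyTorch; illustrative). It assumes a trained net split into its convolutional part `features` and three FC layers `fc1`, `fc2`, `fc3` with the flatten ordering used earlier in this post: the FC weights are copied into 7 × 7 and 1 × 1 convolutions so the net can be slid over an uncropped image, and the resulting class score map is averaged spatially.

```python
import torch
import torch.nn as nn

def fc_to_conv(fc, kernel_size):
    # fc.weight has shape (out_features, in_features); reshape into a conv kernel
    out_f, in_f = fc.weight.shape
    in_ch = in_f // (kernel_size * kernel_size)
    conv = nn.Conv2d(in_ch, out_f, kernel_size)
    conv.weight.data.copy_(fc.weight.view(out_f, in_ch, kernel_size, kernel_size))
    conv.bias.data.copy_(fc.bias)
    return conv

def dense_predict(features, fc1, fc2, fc3, image):
    head = nn.Sequential(fc_to_conv(fc1, 7), nn.ReLU(inplace=True),
                         fc_to_conv(fc2, 1), nn.ReLU(inplace=True),
                         fc_to_conv(fc3, 1))
    score_map = head(features(image))   # (1, 1000, H', W') class score map
    return score_map.mean(dim=(2, 3))   # spatially averaged class scores
```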

    Since the fully-convolutional network is applied over the whole image, there is no need to sample multiple crops at test time (Krizhevsky et al., 2012), which is less efficient as it requires network re-computation for each crop. At the same time, using a large set of crops, as done by Szegedy et al. (2014), can lead to improved accuracy, as it results in a finer sampling of the input image compared to the fully-convolutional net. Also, multi-crop evaluation is complementary to dense evaluation due to different convolution boundary conditions: when applying a ConvNet to a crop, the convolved feature maps are padded with zeros, while in the case of dense evaluation the padding for the same crop naturally comes from the neighbouring parts of an image (due to both the convolutions and spatial pooling), which substantially increases the overall network receptive field, so more context is captured. While we believe that in practice the increased computation time of multiple crops does not justify the potential gains in accuracy, for reference we also evaluate our networks using 50 crops per scale (5 × 5 regular grid with 2 flips), for a total of 150 crops over 3 scales, which is comparable to 144 crops over 4 scales used by Szegedy et al. (2014)
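
    A small sketch of the multi-crop sampling mentioned above (illustrative assumption of how such a grid could be built): a regular 5 × 5 grid of 224 × 224 crops plus horizontal flips gives 50 crops per scale, i.e. 150 crops over 3 test scales.

```python
import torch

def grid_crops(image, grid=5, size=224):
    # image: (3, H, W) tensor, already rescaled to one of the test scales (H, W >= 224)
    _, H, W = image.shape
    ys = torch.linspace(0, H - size, grid).long()
    xs = torch.linspace(0, W - size, grid).long()
    crops = []
    for y in ys:
        for x in xs:
            crop = image[:, int(y):int(y) + size, int(x):int(x) + size]
            crops.append(crop)
            crops.append(torch.flip(crop, dims=[2]))  # horizontal flip
    return torch.stack(crops)  # (50, 3, 224, 224)
```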


    3.3 IMPLEMENTATION DETAILS

    Our implementation is derived from the publicly available C++ Caffe toolbox (Jia, 2013) (branched out in December 2013), but contains a number of significant modifications, allowing us to perform training and evaluation on multiple GPUs installed in a single system, as well as train and evaluate on full-size (uncropped) images at multiple scales (as described above). Multi-GPU training exploits data parallelism, and is carried out by splitting each batch of training images into several GPU batches, processed in parallel on each GPU. After the GPU batch gradients are computed, they are averaged to obtain the gradient of the full batch. Gradient computation is synchronous across the GPUs, so the result is exactly the same as when training on a single GPU. While more sophisticated methods of speeding up ConvNet training have been recently proposed (Krizhevsky, 2014), which employ model and data parallelism for different layers of the net, we have found that our conceptually much simpler scheme already provides a speedup of 3.75 times on an off-the-shelf 4-GPU system, as compared to using a single GPU. On a system equipped with four NVIDIA Titan Black GPUs, training a single net took 2–3 weeks depending on the architecture.
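
    A conceptual sketch of the synchronous data parallelism described above (PyTorch; heavily simplified, single-process, equal-size shards assumed): the batch is split across GPUs, each replica computes gradients on its shard, and the per-GPU gradients are averaged so the result matches single-GPU training.

```python
import copy
import torch

def parallel_step(model, images, labels, criterion, devices):
    shards_x = images.chunk(len(devices))
    shards_y = labels.chunk(len(devices))
    replicas = [copy.deepcopy(model).to(d) for d in devices]
    for rep, x, y, d in zip(replicas, shards_x, shards_y, devices):
        loss = criterion(rep(x.to(d)), y.to(d))
        loss.backward()                      # gradients accumulate in this replica
    # average the per-GPU gradients to obtain the gradient of the full batch
    for params in zip(model.parameters(), *[r.parameters() for r in replicas]):
        master, rest = params[0], params[1:]
        master.grad = torch.stack([p.grad.to(master.device) for p in rest]).mean(dim=0)
    # an optimizer.step() on `model` would then apply the averaged gradient
```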
    4 CLASSIFICATION EXPERIMENTS
    Dataset.
    In this section, we present the image classification results achieved by the described ConvNet architectures on the ILSVRC-2012 dataset (which was used for ILSVRC 2012–2014 challenges). The dataset includes images of 1000 classes, and is split into three sets: training (1.3M images), validation (50K images), and testing (100K images with held-out class labels). The classification performance is evaluated using two measures: the top-1 and top-5 error. The former is a multi-class classification error, i.e. the proportion of incorrectly classified images; the latter is the main evaluation criterion used in ILSVRC, and is computed as the proportion of images such that the ground-truth category is outside the top-5 predicted categories. For the majority of experiments, we used the validation set as the test set. Certain experiments were also carried out on the test set and submitted to the official ILSVRC server as a “VGG” team entry to the ILSVRC-2014 competition (Russakovsky et al., 2014).

    4.1 SINGLE SCALE EVALUATION
    We begin with evaluating the performance of individual ConvNet models at a single scale with the layer configurations described in Sect. 2.2. The test image size was set as follows: Q = S for fixed S, and Q = 0.5(Smin + Smax) for jittered S ∈ [Smin, Smax]. The results are shown in Table 3. First, we note that using local response normalisation (A-LRN network) does not improve on the model A without any normalisation layers. We thus do not employ normalisation in the deeper architectures (B–E). Second, we observe that the classification error decreases with the increased ConvNet depth: from 11 layers in A to 19 layers in E. Notably, in spite of the same depth, the configuration C (which contains three 1 × 1 conv. layers), performs worse than the configuration D, which uses 3 × 3 conv. layers throughout the network.

     
    This indicates that while the additional non-linearity does help (C is better than B), it is also important to capture spatial context by using conv. filters with non-trivial receptive fields (D is better than C). The error rate of our architecture saturates when the depth reaches 19 layers, but even deeper models might be beneficial for larger datasets. We also compared the net B with a shallow net with five 5 × 5 conv. layers, which was derived from B by replacing each pair of 3 × 3 conv. layers with a single 5 × 5 conv. layer (which has the same receptive field as explained in Sect. 2.3). The top-1 error of the shallow net was measured to be 7% higher than that of B (on a center crop), which confirms that a deep net with small filters outperforms a shallow net with larger filters. Finally, scale jittering at training time (S ∈ [256; 512]) leads to significantly better results than training on images with fixed smallest side (S = 256 or S = 384), even though a single scale is used at test time. This confirms that training set augmentation by scale jittering is indeed helpful for capturing multi-scale image statistics.

    4.2 MULTI-SCALE EVALUATION
    Having evaluated the ConvNet models at a single scale, we now assess the effect of scale jittering at test time. It consists of running a model over several rescaled versions of a test image (corresponding to different values of Q), followed by averaging the resulting class posteriors. Considering that a large discrepancy between training and testing scales leads to a drop in performance, the models trained with fixed S were evaluated over three test image sizes, close to the training one: Q = {S − 32, S, S + 32}. At the same time, scale jittering at training time allows the network to be applied to a wider range of scales at test time, so the model trained with variable S ∈ [Smin; Smax] was evaluated over a larger range of sizes Q = {Smin, 0.5(Smin + Smax), Smax}. The results, presented in Table 4, indicate that scale jittering at test time leads to better performance (as compared to evaluating the same model at a single scale, shown in Table 3). As before, the deepest configurations (D and E) perform the best, and scale jittering is better than training with a fixed smallest side S. Our best single-network performance on the validation set is 24.8%/7.5% top-1/top-5 error (highlighted in bold in Table 4). On the test set, the configuration E achieves 7.3% top-5 error.
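
    A sketch of this test-scale jittering (illustrative): the soft-max posteriors obtained at several test scales Q are averaged. `dense_predict_at_scale` is a hypothetical helper that rescales the image so its smallest side equals Q and runs the fully-convolutional net of Sect. 3.2.

```python
import torch

def multi_scale_posterior(model, image, S_min=256, S_max=512):
    Qs = [S_min, (S_min + S_max) // 2, S_max]   # e.g. {256, 384, 512} for jittered training
    probs = [torch.softmax(dense_predict_at_scale(model, image, Q), dim=1) for Q in Qs]
    return torch.stack(probs).mean(dim=0)       # averaged class posteriors
```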

    4.3 MULTI-CROP EVALUATION
    In Table 5 we compare dense ConvNet evaluation with multi-crop evaluation (see Sect. 3.2 for details). We also assess the complementarity of the two evaluation techniques by averaging their soft-max outputs. As can be seen, using multiple crops performs slightly better than dense evaluation, and the two approaches are indeed complementary, as their combination outperforms each of them. As noted above, we hypothesize that this is due to a different treatment of convolution boundary conditions.

    4.4 CONVNET FUSION
    Up until now, we evaluated the performance of individual ConvNet models. In this part of the experiments, we combine the outputs of several models by averaging their soft-max class posteriors. This improves the performance due to complementarity of the models, and was used in the top ILSVRC submissions in 2012 (Krizhevsky et al., 2012) and 2013 (Zeiler & Fergus, 2013; Sermanet et al., 2014). The results are shown in Table 6. By the time of ILSVRC submission we had only trained the single-scale networks, as well as a multi-scale model D (by fine-tuning only the fully-connected layers rather than all layers). The resulting ensemble of 7 networks has 7.3% ILSVRC test error. After the submission, we considered an ensemble of only two best-performing multi-scale models (configurations D and E), which reduced the test error to 7.0% using dense evaluation and 6.8% using combined dense and multi-crop evaluation. For reference, our best-performing single model achieves 7.1% error (model E, Table 5).
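
    A sketch of this fusion (illustrative): soft-max class posteriors of several models are averaged; `models` would hold, for example, the multi-scale D and E networks, which are assumed to exist here.

```python
import torch

def ensemble_posterior(models, image):
    probs = [torch.softmax(m(image), dim=1) for m in models]
    return torch.stack(probs).mean(dim=0)

# top-5 predictions of a hypothetical two-model ensemble:
# top5 = ensemble_posterior([net_D, net_E], image).topk(5, dim=1).indices
```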

    4.5 COMPARISON WITH THE STATE OF THE ART
    Finally, we compare our results with the state of the art in Table 7. In the classification task of ILSVRC-2014 challenge (Russakovsky et al., 2014), our “VGG” team secured the 2nd place with 7.3% test error using an ensemble of 7 models. After the submission, we decreased the error rate to 6.8% using an ensemble of 2 models. As can be seen from Table 7, our very deep ConvNets significantly outperform the previous generation of models, which achieved the best results in the ILSVRC-2012 and ILSVRC-2013 competitions. Our result is also competitive with respect to the classification task winner (GoogLeNet with 6.7% error) and substantially outperforms the ILSVRC-2013 winning submission Clarifai, which achieved 11.2% with outside training data and 11.7% without it. This is remarkable, considering that our best result is achieved by combining just two models – significantly less than used in most ILSVRC submissions. In terms of the single-net performance, our architecture achieves the best result (7.0% test error), outperforming a single GoogLeNet by 0.9%. Notably, we did not depart from the classical ConvNet architecture of LeCun et al. (1989), but improved it by substantially increasing the depth.
    5 CONCLUSION
    In this work we evaluated very deep convolutional networks (up to 19 weight layers) for large-scale image classification. It was demonstrated that the representation depth is beneficial for the classification accuracy, and that state-of-the-art performance on the ImageNet challenge dataset can be achieved using a conventional ConvNet architecture (LeCun et al., 1989; Krizhevsky et al., 2012) with substantially increased depth. In the appendix, we also show that our models generalise well to a wide range of tasks and datasets, matching or outperforming more complex recognition pipelines built around less deep image representations. Our results yet again confirm the importance of depth in visual representations.

    ACKNOWLEDGEMENTS
    This work was supported by ERC grant VisRec no. 228180. We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GPUs used for this research.

     


     
    A LOCALISATION
     
    In the main body of the paper we have considered the classification task of the ILSVRC challenge, and performed a thorough evaluation of ConvNet architectures of different depth. In this section, we turn to the localisation task of the challenge, which we have won in 2014 with 25.3% error. It can be seen as a special case of object detection, where a single object bounding box should be predicted for each of the top-5 classes, irrespective of the actual number of objects of the class. For this we adopt the approach of Sermanet et al. (2014), the winners of the ILSVRC-2013 localisation challenge, with a few modifications. Our method is described in Sect. A.1 and evaluated in Sect. A.2.
     
    A.1 LOCALISATION CONVNET
    To perform object localisation, we use a very deep ConvNet, where the last fully connected layer predicts the bounding box location instead of the class scores. A bounding box is represented by a 4-D vector storing its center coordinates, width, and height. There is a choice of whether the bounding box prediction is shared across all classes (single-class regression, SCR (Sermanet et al., 2014)) or is class-specific (per-class regression, PCR). In the former case, the last layer is 4-D, while in the latter it is 4000-D (since there are 1000 classes in the dataset). Apart from the last bounding box prediction layer, we use the ConvNet architecture D (Table 1), which contains 16 weight layers and was found to be the best-performing in the classification task (Sect. 4).

    Training. Training of localisation ConvNets is similar to that of the classification ConvNets (Sect. 3.1). The main difference is that we replace the logistic regression objective with a Euclidean loss, which penalises the deviation of the predicted bounding box parameters from the ground-truth. We trained two localisation models, each on a single scale: S = 256 and S = 384 (due to the time constraints, we did not use training scale jittering for our ILSVRC-2014 submission). Training was initialised with the corresponding classification models (trained on the same scales), and the initial learning rate was set to 10^-3. We explored both fine-tuning all layers and fine-tuning only the first two fully-connected layers, as done in (Sermanet et al., 2014). The last fully-connected layer was initialised randomly and trained from scratch.

    Testing. We consider two testing protocols. The first is used for comparing different network modifications on the validation set, and considers only the bounding box prediction for the ground truth class (to factor out the classification errors). The bounding box is obtained by applying the network only to the central crop of the image. The second, fully-fledged, testing procedure is based on the dense application of the localisation ConvNet to the whole image, similarly to the classification task (Sect. 3.2). The difference is that instead of the class score map, the output of the last fully-connected layer is a set of bounding box predictions. To come up with the final prediction, we utilise the greedy merging procedure of Sermanet et al. (2014), which first merges spatially close predictions (by averaging their coordinates), and then rates them based on the class scores, obtained from the classification ConvNet. When several localisation ConvNets are used, we first take the union of their sets of bounding box predictions, and then run the merging procedure on the union. We did not use the multiple pooling offsets technique of Sermanet et al. (2014), which increases the spatial resolution of the bounding box predictions and can further improve the results.
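
    A sketch of the localisation head described above (PyTorch; illustrative): the last FC layer of configuration D is replaced so that it regresses bounding-box parameters, and training uses a Euclidean (squared) loss. The 4096-D penultimate activations are assumed as input; the rest of net D is not shown.

```python
import torch.nn as nn

num_classes = 1000

class LocalisationHead(nn.Module):
    def __init__(self, per_class_regression=True):
        super().__init__()
        out_dim = 4 * num_classes if per_class_regression else 4  # PCR (4000-D) vs SCR (4-D)
        self.fc = nn.Linear(4096, out_dim)   # predicts (cx, cy, w, h) per box

    def forward(self, penultimate_features):
        return self.fc(penultimate_features)

euclidean_loss = nn.MSELoss(reduction="sum")  # penalises deviation from the ground-truth box
```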
     
    A.2 LOCALISATION EXPERIMENTS
    In this section we first determine the best-performing localisation setting (using the first test protocol), and then evaluate it in a fully-fledged scenario (the second protocol). The localisation error is measured according to the ILSVRC criterion (Russakovsky et al., 2014), i.e. the bounding box prediction is deemed correct if its intersection over union ratio with the ground-truth bounding box is above 0.5.

    Settings comparison. As can be seen from Table 8, per-class regression (PCR) outperforms the class-agnostic single-class regression (SCR), which differs from the findings of Sermanet et al. (2014), where PCR was outperformed by SCR. We also note that fine-tuning all layers for the localisation task leads to noticeably better results than fine-tuning only the fully-connected layers (as done in (Sermanet et al., 2014)). In these experiments, the smallest image side was set to S = 384; the results with S = 256 exhibit the same behaviour and are not shown for brevity.

    Fully-fledged evaluation. Having determined the best localisation setting (PCR, fine-tuning of all layers), we now apply it in the fully-fledged scenario, where the top-5 class labels are predicted using our best-performing classification system (Sect. 4.5), and multiple densely-computed bounding box predictions are merged using the method of Sermanet et al. (2014). As can be seen from Table 9, application of the localisation ConvNet to the whole image substantially improves the results compared to using a center crop (Table 8), despite using the top-5 predicted class labels instead of the ground truth. Similarly to the classification task (Sect. 4), testing at several scales and combining the predictions of multiple networks further improves the performance.

    Comparison with the state of the art. We compare our best localisation result with the state of the art in Table 10. With 25.3% test error, our “VGG” team won the localisation challenge of ILSVRC-2014 (Russakovsky et al., 2014). Notably, our results are considerably better than those of the ILSVRC-2013 winner Overfeat (Sermanet et al., 2014), even though we used fewer scales and did not employ their resolution enhancement technique. We envisage that better localisation performance can be achieved if this technique is incorporated into our method. This indicates the performance advancement brought by our very deep ConvNets – we got better results with a simpler localisation method, but a more powerful representation.
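
    A small helper for the ILSVRC localisation criterion mentioned above (illustrative): a predicted box counts as correct if its intersection-over-union with the ground-truth box exceeds 0.5. Boxes are taken as (x1, y1, x2, y2) corner coordinates for simplicity.

```python
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def is_correct(pred_box, gt_box):
    return iou(pred_box, gt_box) > 0.5
```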
     
    B GENERALISATION OF VERY DEEP FEATURES
    In the previous sections we have discussed training and evaluation of very deep ConvNets on the ILSVRC dataset. In this section, we evaluate our ConvNets, pre-trained on ILSVRC, as feature extractors on other, smaller, datasets, where training large models from scratch is not feasible due to over-fitting. Recently, there has been a lot of interest in such a use case (Zeiler & Fergus, 2013; Donahue et al., 2013; Razavian et al., 2014; Chatfield et al., 2014), as it turns out that deep image representations, learnt on ILSVRC, generalise well to other datasets, where they have outperformed hand-crafted representations by a large margin. Following that line of work, we investigate if our models lead to better performance than more shallow models utilised in the state-of-the-art methods. In this evaluation, we consider two models with the best classification performance on ILSVRC (Sect. 4) – configurations “Net-D” and “Net-E” (which we made publicly available).

    To utilise the ConvNets, pre-trained on ILSVRC, for image classification on other datasets, we remove the last fully-connected layer (which performs 1000-way ILSVRC classification), and use 4096-D activations of the penultimate layer as image features, which are aggregated across multiple locations and scales. The resulting image descriptor is L2-normalised and combined with a linear SVM classifier, trained on the target dataset. For simplicity, pre-trained ConvNet weights are kept fixed (no fine-tuning is performed).

    Aggregation of features is carried out in a similar manner to our ILSVRC evaluation procedure (Sect. 3.2). Namely, an image is first rescaled so that its smallest side equals Q, and then the network is densely applied over the image plane (which is possible when all weight layers are treated as convolutional). We then perform global average pooling on the resulting feature map, which produces a 4096-D image descriptor. The descriptor is then averaged with the descriptor of a horizontally flipped image. As was shown in Sect. 4.2, evaluation over multiple scales is beneficial, so we extract features over several scales Q. The resulting multi-scale features can be either stacked or pooled across scales. Stacking allows a subsequent classifier to learn how to optimally combine image statistics over a range of scales; this, however, comes at the cost of the increased descriptor dimensionality. We return to the discussion of this design choice in the experiments below. We also assess late fusion of features, computed using two networks, which is performed by stacking their respective image descriptors.

    Image Classification on VOC-2007 and VOC-2012. We begin with the evaluation on the image classification task of PASCAL VOC-2007 and VOC-2012 benchmarks (Everingham et al., 2015). These datasets contain 10K and 22.5K images respectively, and each image is annotated with one or several labels, corresponding to 20 object categories. The VOC organisers provide a pre-defined split into training, validation, and test data (the test data for VOC-2012 is not publicly available; instead, an official evaluation server is provided). Recognition performance is measured using mean average precision (mAP) across classes. Notably, by examining the performance on the validation sets of VOC-2007 and VOC-2012, we found that aggregating image descriptors, computed at multiple scales, by averaging performs similarly to the aggregation by stacking.
    We hypothesize that this is due to the fact that in the VOC dataset the objects appear over a variety of scales, so there is no particular scale-specific semantics which a classifier could exploit. Since averaging has a benefit of not inflating the descriptor dimensionality, we were able to aggregate image descriptors over a wide range of scales: Q ∈ {256, 384, 512, 640, 768}. It is worth noting though that the improvement over a smaller range of {256, 384, 512} was rather marginal (0.3%).

    The test set performance is reported and compared with other approaches in Table 11. Our networks “Net-D” and “Net-E” exhibit identical performance on VOC datasets, and their combination slightly improves the results. Our methods set the new state of the art across image representations, pre-trained on the ILSVRC dataset, outperforming the previous best result of Chatfield et al. (2014) by more than 6%. It should be noted that the method of Wei et al. (2014), which achieves 1% better mAP on VOC-2012, is pre-trained on an extended 2000-class ILSVRC dataset, which includes additional 1000 categories, semantically close to those in VOC datasets. It also benefits from the fusion with an object detection-assisted classification pipeline.

    Image Classification on Caltech-101 and Caltech-256. In this section we evaluate very deep features on Caltech-101 (Fei-Fei et al., 2004) and Caltech-256 (Griffin et al., 2007) image classification benchmarks. Caltech-101 contains 9K images labelled into 102 classes (101 object categories and a background class), while Caltech-256 is larger with 31K images and 257 classes. A standard evaluation protocol on these datasets is to generate several random splits into training and test data and report the average recognition performance across the splits, which is measured by the mean class recall (which compensates for a different number of test images per class). Following Chatfield et al. (2014); Zeiler & Fergus (2013); He et al. (2014), on Caltech-101 we generated 3 random splits into training and test data, so that each split contains 30 training images per class, and up to 50 test images per class. On Caltech-256 we also generated 3 splits, each of which contains 60 training images per class (and the rest is used for testing). In each split, 20% of training images were used as a validation set for hyper-parameter selection. We found that unlike VOC, on Caltech datasets the stacking of descriptors, computed over multiple scales, performs better than averaging or max-pooling. This can be explained by the fact that in Caltech images objects typically occupy the whole image, so multi-scale image features are semantically different (capturing the whole object vs. object parts), and stacking allows a classifier to exploit such scale-specific representations. We used three scales Q ∈ {256, 384, 512}. Our models are compared to each other and the state of the art in Table 11. As can be seen, the deeper 19-layer Net-E performs better than the 16-layer Net-D, and their combination further improves the performance. On Caltech-101, our representations are competitive with the approach of He et al. (2014), which, however, performs significantly worse than our nets on VOC-2007. On Caltech-256, our features outperform the state of the art (Chatfield et al., 2014) by a large margin (8.6%).

    Action Classification on VOC-2012. We also evaluated our best-performing image representation (the stacking of Net-D and Net-E features) on the PASCAL VOC-2012 action classification task (Everingham et al., 2015), which consists in predicting an action class from a single image, given a bounding box of the person performing the action. The dataset contains 4.6K training images, labelled into 11 classes. Similarly to the VOC-2012 object classification task, the performance is measured using the mAP. We considered two training settings: (i) computing the ConvNet features on the whole image and ignoring the provided bounding box; (ii) computing the features on the whole image and on the provided bounding box, and stacking them to obtain the final representation. The results are compared to other approaches in Table 12. Our representation achieves the state of the art on the VOC action classification task even without using the provided bounding boxes, and the results are further improved when using both images and bounding boxes. Unlike other approaches, we did not incorporate any task-specific heuristics, but relied on the representation power of very deep convolutional features.

    Other Recognition Tasks. Since the public release of our models, they have been actively used by the research community for a wide range of image recognition tasks, consistently outperforming more shallow representations. For instance, Girshick et al. (2014) achieve state-of-the-art object detection results by replacing the ConvNet of Krizhevsky et al. (2012) with our 16-layer model. Similar gains over a more shallow architecture of Krizhevsky et al. (2012) have been observed in semantic segmentation (Long et al., 2014), image caption generation (Kiros et al., 2014; Karpathy & Fei-Fei, 2014), texture and material recognition (Cimpoi et al., 2014; Bell et al., 2014).
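
    A sketch of the feature-extraction pipeline of this appendix (illustrative; scikit-learn's LinearSVC stands in for the linear SVM, and `dense_features_at_scale` is a hypothetical helper returning the 4096-channel feature map of the net with its last FC layer removed, applied densely at test scale Q): features are globally average-pooled, aggregated across scales, L2-normalised and fed to a linear classifier trained on the target dataset while the ConvNet weights stay fixed.

```python
import numpy as np
from sklearn.svm import LinearSVC

def extract_descriptor(net, image, scales=(256, 384, 512)):
    feats = []
    for Q in scales:
        fmap = dense_features_at_scale(net, image, Q)   # assumed: (4096, H', W') array
        feats.append(fmap.mean(axis=(1, 2)))            # global average pooling -> 4096-D
    d = np.mean(feats, axis=0)                          # average across scales (or stack them)
    return d / np.linalg.norm(d)                        # L2 normalisation

# X = np.stack([extract_descriptor(net_D, img) for img in train_images])
# clf = LinearSVC().fit(X, train_labels)                # pre-trained ConvNet weights stay fixed
```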
     
    C PAPER REVISIONS
    Here we present the list of major paper revisions, outlining the substantial changes for the convenience of the reader.
    v1 Initial version. Presents the experiments carried out before the ILSVRC submission.
    v2 Adds post-submission ILSVRC experiments with training set augmentation using scale jittering, which improves the performance.
    v3 Adds generalisation experiments (Appendix B) on PASCAL VOC and Caltech image classification datasets. The models used for these experiments are publicly available.
    v4 The paper is converted to ICLR-2015 submission format. Also adds experiments with multiple crops for classification.
    v6 Camera-ready ICLR-2015 conference paper. Adds a comparison of the net B with a shallow net and the results on PASCAL VOC action classification benchmark.