[X:AI] SimCLR Paper Review

A Simple Framework for Contrastive Learning of Visual Representations

Original paper: https://arxiv.org/abs/2002.05709

1. Abstract & Introduction

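Learning visual representations without human supervision (manually assigned labels) is a long-standing open problem.
Most mainstream approaches are either generative or discriminative.

Generative

Pixel-level generation is computationally expensive.

Discriminative

Defining pretext tasks relies on heuristics, which limits the generality of the learned representations.

This paper introduces a discriminative approach based on contrastive learning.
To that end, it proposes a simple framework called SimCLR.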

2. Method

2.1 The Contrastive Learning Framework

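SimCLR is a framework that augments each image in different ways and then trains the model to recognize that the augmented views come from the same original image.

Four major components

[1]

Apply two separately sampled augmentations to each image.
Three augmentation types are used: random cropping followed by resizing back to the original size, random color distortion, and random Gaussian blur.
The two resulting views are treated as a positive pair.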

[2]

A ResNet encoder extracts a representation vector from each augmented image.

[3]

The representation vector is mapped to a new vector that is better suited for computing the contrastive loss.
An MLP with one hidden layer (ReLU activation) is used for this mapping.

[4]

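The model is trained to identify the positive pair; the other augmented images in the minibatch are treated as negatives.
Cosine similarity measures how similar the two augmented views (the positive pair) are.
Training raises the similarity of the positive pair and lowers the similarity to the negatives, using the NT-Xent (normalized temperature-scaled cross entropy) loss.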
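To make step [4] concrete, here is a minimal sketch of the NT-Xent loss in plain Python. It assumes the 2N projection vectors are stored so that indices 2k and 2k+1 are the two views of the same image; the function names and the default temperature of 0.5 are illustrative choices, not taken from the official code.

```python
import math

def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_loss(z, temperature=0.5):
    """z holds 2N projection vectors; z[2k] and z[2k+1] form a positive pair (assumed layout)."""
    n = len(z)
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive partner of view i
        positive_logit = cosine_similarity(z[i], z[j]) / temperature
        # The denominator runs over every other view in the batch (k != i),
        # i.e. the positive plus all 2N - 2 negatives.
        denominator = sum(
            math.exp(cosine_similarity(z[i], z[k]) / temperature)
            for k in range(n) if k != i
        )
        total += -positive_logit + math.log(denominator)
    return total / n  # average over all 2N anchors

# Toy check: views that agree within a pair give a lower loss than views that do not.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(nt_xent_loss(aligned) < nt_xent_loss(misaligned))  # True
```

A real implementation would compute the full similarity matrix with a tensor library instead of nested loops; the loops are kept here only to mirror the definition.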

2.2 Training with Large Batch Size

No memory bank is used; instead, the batch size is varied from 256 to 8192.
A larger batch provides more negative examples per positive pair.
memory bank: a store that holds a diverse set of negative examples drawn from a large amount of data
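The relation between batch size and the number of negatives is simple arithmetic. The sketch below assumes, as in the framework above, that each of the N images in a batch yields two views and that all other views in the batch act as negatives; the function name is made up for illustration.

```python
def negatives_per_positive_pair(batch_size):
    # N images give 2N views. For one anchor view, exclude the anchor itself
    # and its positive partner: 2N - 2 = 2(N - 1) views remain as negatives.
    return 2 * (batch_size - 1)

for n in (256, 8192):
    print(n, negatives_per_positive_pair(n))
# 256 -> 510, 8192 -> 16382
```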

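Training with a large batch size can be unstable with standard optimizers.
To address this, the LARS optimizer is used, which helps keep training stable at large batch sizes.
Models are trained on Cloud TPUs, using 32 to 128 cores depending on the batch size.
LARS optimizer: adjusts the learning rate of each layer based on the magnitudes of that layer's parameters and gradients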
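The layer-wise scaling idea behind LARS can be sketched in a few lines. This is a simplified single step without momentum, and the names (`lars_step`, `trust_coef`) and default values are illustrative assumptions rather than the settings used in the paper.

```python
import math

def l2_norm(values):
    return math.sqrt(sum(v * v for v in values))

def lars_step(weights, grads, base_lr=1.0, trust_coef=0.001, weight_decay=0.0):
    """One simplified LARS update for a single layer (no momentum)."""
    w_norm = l2_norm(weights)
    g_norm = l2_norm(grads)
    if w_norm > 0 and g_norm > 0:
        # Layer-wise "trust ratio": large gradients relative to the weights
        # shrink the step, small gradients enlarge it.
        local_lr = trust_coef * w_norm / (g_norm + weight_decay * w_norm)
    else:
        local_lr = 1.0
    return [w - base_lr * local_lr * (g + weight_decay * w)
            for w, g in zip(weights, grads)]

# Two layers with the same weights but gradients that differ by 100x
# end up taking the same-sized step.
weights = [3.0, 4.0]
small = lars_step(weights, [0.3, 0.4])
large = lars_step(weights, [30.0, 40.0])
print(small, large)
```

Because the step size is tied to the weight norm of each layer, a single global learning rate can be scaled up with the batch size without some layers diverging.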

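Standard ResNets use batch normalization (BN), but in distributed data-parallel training the BN mean and variance are typically computed locally on each device (GPU or TPU).
Because both views of a positive pair are processed on the same device, the model can exploit the information leaked through these per-device statistics to do well on the contrastive task.
In other words, contrastive accuracy goes up without the representations actually getting better.
To prevent this, the BN mean and variance are aggregated over all devices, so every device normalizes with the same values.
Alternatives include shuffling data examples across devices or replacing BN with layer normalization.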
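The difference between per-device and aggregated BN statistics can be illustrated with a toy example. The sketch assumes each device only needs to share its count, sum, and sum of squares (what an all-reduce would combine in a real system); the helper names and the data are made up.

```python
def mean_and_var(values):
    mean = sum(values) / len(values)
    var = sum((x - mean) ** 2 for x in values) / len(values)
    return mean, var

def global_mean_and_var(per_device_values):
    # Combine per-device (count, sum, sum of squares) into one set of statistics.
    count = sum(len(v) for v in per_device_values)
    total = sum(sum(v) for v in per_device_values)
    total_sq = sum(sum(x * x for x in v) for v in per_device_values)
    mean = total / count
    var = total_sq / count - mean ** 2
    return mean, var

# Activations of one feature on two devices (toy numbers).
devices = [[1.0, 2.0, 3.0], [10.0, 11.0, 12.0]]
print([mean_and_var(v) for v in devices])  # local statistics differ per device
print(global_mean_and_var(devices))        # one shared set; the mean is 6.5
```

With local statistics, each device normalizes its examples differently, which is the signal the model can latch onto; with the shared statistics every example is normalized the same way regardless of where it was processed.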

3. Data Augmentation for Contrastive Representation Learning

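Data augmentation defines the predictive task.
It therefore plays an important role in contrastive learning.
Many existing methods (shown in the figure in the original post) define the contrastive prediction task by changing the network architecture.
This paper shows that this complexity can be avoided by simply applying random cropping (with resizing).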

3.1 Composition of data augmentation operations is crucial for learning good representations

To study the impact of data augmentation systematically, the paper considers several common augmentations.
spatial/geometric transformation: cropping & resizing (with horizontal flipping), rotation, cutout
appearance transformation: color distortion, Gaussian blur, Sobel filtering

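Each augmentation is tested individually or in pairs.
Images are always randomly cropped and resized first, because ImageNet images come in different sizes.
The augmentation under study is then applied to only one of the two views.
This asymmetric setup is expected to lower performance, but it keeps the comparison between augmentations fair.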

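The study finds that no single augmentation is enough to learn good representations.
Composing augmentations makes the contrastive prediction task harder, but the resulting representations are much better.
Random cropping combined with random color distortion is a particularly effective composition.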
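As a rough illustration of composing random cropping with color distortion, the sketch below builds a positive pair from a tiny image stored as nested lists of RGB tuples in [0, 1]. Everything here is a simplified stand-in: the color distortion is only a random brightness change plus an occasional grayscale conversion, and the function names, crop scale, and probabilities are assumptions, not the paper's exact pipeline.

```python
import random

def random_crop_resize(image, out_size, rng, min_scale=0.5):
    h, w = len(image), len(image[0])
    crop_h = rng.randint(max(1, int(h * min_scale)), h)
    crop_w = rng.randint(max(1, int(w * min_scale)), w)
    top = rng.randint(0, h - crop_h)
    left = rng.randint(0, w - crop_w)
    # Nearest-neighbor resize of the crop to out_size x out_size.
    return [[image[top + (y * crop_h) // out_size][left + (x * crop_w) // out_size]
             for x in range(out_size)]
            for y in range(out_size)]

def color_distort(image, rng, strength=1.0):
    factor = 1.0 + rng.uniform(-0.8 * strength, 0.8 * strength)  # brightness change
    to_gray = rng.random() < 0.2                                 # occasional color dropping
    distorted = []
    for row in image:
        new_row = []
        for r, g, b in row:
            if to_gray:
                r = g = b = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append(tuple(min(1.0, max(0.0, c * factor)) for c in (r, g, b)))
        distorted.append(new_row)
    return distorted

def make_positive_pair(image, out_size=4, seed=0):
    rng = random.Random(seed)

    def augment(img):
        # Composition: crop and resize first, then distort the colors.
        return color_distort(random_crop_resize(img, out_size, rng), rng)

    return augment(image), augment(image)

image = [[(x / 7, y / 7, 0.5) for x in range(8)] for y in range(8)]
view_a, view_b = make_positive_pair(image)
print(len(view_a), len(view_a[0]))  # 4 4
```

The `strength` parameter is the knob that section 3.2 is about: raising it makes the two views differ more in color while still coming from the same image.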

3.2 Contrastive learning needs stronger data augmentation than supervised learning

Stronger color augmentation substantially improves the performance of the unsupervised model.
Simple cropping with (stronger) color distortion outperforms AutoAugment, a more sophisticated augmentation policy.
In supervised learning, stronger color distortion does not improve performance and can even hurt it.
This shows that unsupervised contrastive learning benefits more from strong (color) data augmentation than supervised learning does.

4. Architectures for Encoder and Head

4.1 Unsupervised contrastive learning benefits (more) from bigger models

Performance under unsupervised learning improves more as the model grows.
Supervised learning shows a similar trend, but unsupervised learning benefits more from larger models.

4.2 A nonlinear projection head improves the representation quality of the layer before it

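Using a projection head improves model performance.
In particular, a nonlinear projection head outperforms a linear one.

The paper conjectures that the contrastive loss induces a loss of information: the projection output is trained to be invariant to the data transformations, so it may discard information (such as color or orientation) that is useful downstream.
With a nonlinear projection head, this loss is absorbed by the head, and the layer before it retains more of that information.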
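The nonlinear projection head is just a small MLP, z = W2 · ReLU(W1 · h), applied to the encoder output h. The sketch below uses toy dimensions and random weights purely for illustration; the helper names are made up.

```python
import random

def linear(weights, x):
    # Matrix-vector product; each row of `weights` produces one output value.
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def relu(x):
    return [max(0.0, v) for v in x]

def projection_head(h, w1, w2):
    # One hidden layer with ReLU, then a linear output layer.
    return linear(w2, relu(linear(w1, h)))

def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
h = [rng.gauss(0.0, 1.0) for _ in range(8)]  # encoder representation (toy size)
w1 = random_matrix(8, 8, rng)                # hidden layer weights
w2 = random_matrix(4, 8, rng)                # output layer weights
z = projection_head(h, w1, w2)               # vector fed to the contrastive loss
print(len(h), len(z))  # 8 4
```

The contrastive loss is computed on z, while downstream tasks use h, the layer before the head; after pretraining the head is discarded.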

5. Loss Functions and Batch Size

5.1 Normalized cross entropy loss with adjustable temperature works better than alternatives

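Table 2 compares the logistic loss and margin loss, which are commonly used contrastive losses, with the NT-Xent loss.
The NT-Xent loss uses ℓ2 normalization (i.e. cosine similarity) together with an adjustable temperature.
An appropriate temperature helps the model learn from hard negatives.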
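One way to see the role of the temperature is to look at the relative weight each negative gets inside the softmax. The sketch below is only an illustration with made-up similarity values; `negative_weights` is a hypothetical helper, not part of any library.

```python
import math

def negative_weights(similarities, temperature):
    """Softmax over temperature-scaled similarities (numerically stabilized)."""
    scaled = [s / temperature for s in similarities]
    max_scaled = max(scaled)
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Cosine similarities of an anchor to a hard, a medium, and an easy negative.
similarities = [0.9, 0.5, 0.1]
for tau in (1.0, 0.1):
    weights = negative_weights(similarities, tau)
    print(tau, [round(w, 3) for w in weights])
```

With a lower temperature, almost all of the weight lands on the hardest negative (the one most similar to the anchor), whereas a temperature of 1.0 spreads it more evenly. This is the sense in which the loss weights negatives by hardness without explicit mining.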

The other loss functions do not weight negatives by their relative hardness, so they need semi-hard negative mining.
semi-hard negative mining: instead of using every negative sample during training, focus on a specific subset of hard negatives

Under the same conditions (all using ℓ2 normalization), the NT-Xent loss outperforms the other loss functions.
Without appropriate temperature scaling, the performance of the NT-Xent loss drops significantly.
Without ℓ2 normalization, accuracy on the contrastive task is higher, but the linear evaluation results are worse.

5.2 Contrastive learning benefits (more) from larger batch sizes and longer training

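Early in training (100 epochs): larger batch sizes perform better.
A larger batch processes more data at once and provides more negative examples, so the model converges faster and performs better.

With longer training: the gap between batch sizes shrinks or disappears.
This is because, with longer training, the model has learned enough to be less affected by the batch size.

Benefit of larger batch sizes: unlike in supervised learning, a larger batch in contrastive learning contains more negative examples, which makes training more effective. This reduces the number of epochs and training steps needed to reach a given accuracy.

Benefit of longer training: training longer also provides more negative samples, which further improves performance.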

6. Comparison with State-of-the-art

Linear evaluation

Method: experiments use ResNet-50 with width multipliers of 1×, 2×, and 4×.
Training: 1000 epochs

Results

SimCLR outperforms previous work; in particular, the ResNet-50 (4×) model performs on par with a supervised ResNet-50.

Semi-supervised learning

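Method: train using only 1% or 10% of the dataset's labels.
Training: fine-tune the model on the labeled data.

Results

SimCLR outperforms other state-of-the-art methods with both 1% and 10% of the labels.
Fine-tuning the pretrained ResNet-50 (2×, 4×) models on the full dataset also performs better than training from scratch.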

Transfer learning

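Method: evaluate the model on 12 natural image datasets.

Linear evaluation: use the features of the pretrained model as they are (frozen) and add only a simple linear classifier on top.
Fine-tuning: continue training the pretrained model so that it is optimized for the target dataset.

Results

When fine-tuned, the self-supervised model outperforms the supervised learning baseline on 5 datasets.
The supervised learning baseline is better on only 2 datasets (Pets and Flowers).
On the remaining 5 datasets, performance is similar.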

7. Related Work

Handcrafted pretext task

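The recent revival of self-supervised learning began with artificially designed pretext tasks.
Examples include patch prediction, solving jigsaw puzzles, colorization, and rotation prediction.
These tasks can achieve good results with larger networks and longer training, but they rely on somewhat ad hoc heuristics, which can limit the generality of the learned representations.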

Contrastive visual representation learning

An approach that learns representations by contrasting positive samples against negative samples.
Starting from early work by Hadsell et al. (2006), Dosovitskiy et al. (2014) proposed representing each instance by a feature vector.
Wu et al. (2018) proposed using a memory bank.
Other work has explored using in-batch samples instead of a memory bank.

Our contributions

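Recent work has tried to explain why these methods perform well.
One explanation is the maximization of mutual information between latent representations.
However, it is unclear whether the success of contrastive learning is due to mutual information or to the specific form of the loss function.

Almost every individual component of the proposed framework has appeared in previous work, although the specific implementations may differ.
The framework outperforms previous work not because of any single design choice but because of the combination of these components.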

8. Conclusion

This paper proposes SimCLR, a simple framework for contrastive visual representation learning.
With this framework, the results on self-supervised learning, semi-supervised learning, and transfer learning improve considerably over previous methods.
The approach differs from standard supervised learning on ImageNet only in the data augmentation, the nonlinear head at the end of the network, and the loss function.
The strength of this simple framework suggests that, despite the recent surge of interest, self-supervised learning remains undervalued.

Reference

https://dongwoo-im.github.io/papers/review/2022-11-12-SimCLR/
https://github.com/google-research/simclr

hyeon827
