[X:AI] SegNet Paper Review

SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation

Original paper: https://arxiv.org/abs/1511.00561

Abstract

SegNet consists of an encoder network, a corresponding decoder network, and a pixel-wise classification layer.
The encoder network uses the VGG16 architecture with the FC layers removed.

The decoder network upsamples the encoder's low-resolution feature maps to the original image size so that each pixel can be assigned to a class.
In particular, the decoder uses the pooling indices computed in the encoder's max pooling steps to perform non-linear upsampling (so the upsampling itself does not need to be learned).

Through comparison with FCN, DeepLab-LargeFOV, and DeconvNet, the paper shows how SegNet balances memory usage against accuracy, a trade-off that matters in image segmentation.
SegNet performs well on road scene and SUN RGB-D indoor scene segmentation tasks.

1. Introduction

Max pooling and sub-sampling reduce feature map resolution.
max pooling: selects the largest value in a given region of the feature map
sub-sampling: extracts samples from the feature map at regular intervals to reduce data dimensionality
These steps lower computation cost and help prevent overfitting, but they can also discard important detail and spatial resolution (coarse feature maps) -> poor output for pixel-wise prediction.
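
As a rough illustration of how quickly resolution is lost, the sketch below tracks the feature map size through repeated 2x2, stride-2 max pooling. The 360x480 input size, the five pooling stages (as in a VGG16-style encoder), the no-padding floor rounding, and the helper name `pooled_sizes` are all assumptions made for this example, not details taken from the post.

```python
def pooled_sizes(height, width, num_stages, window=2, stride=2):
    """Return the (height, width) after each non-overlapping max pooling stage."""
    sizes = []
    for _ in range(num_stages):
        # No padding: output size = floor((size - window) / stride) + 1
        height = (height - window) // stride + 1
        width = (width - window) // stride + 1
        sizes.append((height, width))
    return sizes


sizes = pooled_sizes(360, 480, num_stages=5)
for stage, (h, w) in enumerate(sizes, start=1):
    print(f"after pooling stage {stage}: {h}x{w}")
```

Under these assumptions the final feature map is 11x15, so each coarse location has to account for roughly a 32x32 patch of input pixels. That is the gap the decoder has to close.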

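SegNet aims for an architecture that accurately maps low-resolution features back to the input image resolution so that each pixel can be classified precisely.
It is designed to be especially useful for road scene understanding -> the model needs to understand appearance (road, building), shape (cars, pedestrians), and spatial relationships (context between classes such as road and side-walk).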

SegNet's encoder network uses the VGG16 architecture with the FC layers removed.
Removing the FC layers gives SegNet an advantage in computation cost.
The decoder network mirrors the encoder.
It also uses max pooling indices to perform non-linear upsampling.

3. Architecture

SegNet = encoder + decoder + final pixelwise classification layer

Encoder Network

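The encoder consists of the 13 convolutional layers of VGG16.
This means pre-trained weights can be used for the encoder.
In addition, the FC layers are removed to retain higher-resolution feature maps at the encoder output, which greatly reduces the number of parameters compared with other architectures (134M -> 14.7M).
Each encoder applies Conv -> Batch Normalisation -> ReLU, followed by 2x2 max pooling (stride 2).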
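
The 14.7M figure can be checked by counting the weights and biases of the 13 3x3 conv layers of VGG16. The sketch below does that. It also shows that adding the first two FC layers of VGG16 (for a 224x224 input) gives about 134M. Reading the 134M figure this way is my assumption, not something the post states. Batch normalisation parameters are left out.

```python
# (in_channels, out_channels) for the 13 3x3 conv layers of VGG16
VGG16_CONV_CHANNELS = [
    (3, 64), (64, 64),
    (64, 128), (128, 128),
    (128, 256), (256, 256), (256, 256),
    (256, 512), (512, 512), (512, 512),
    (512, 512), (512, 512), (512, 512),
]


def conv_params(in_ch, out_ch, kernel=3):
    """Weights plus one bias per output channel."""
    return in_ch * out_ch * kernel * kernel + out_ch


def fc_params(in_features, out_features):
    """Weights plus one bias per output unit."""
    return in_features * out_features + out_features


encoder_params = sum(conv_params(i, o) for i, o in VGG16_CONV_CHANNELS)
# Assumption: the first two FC layers of VGG16 (7x7x512 -> 4096 -> 4096)
fc_hidden_params = fc_params(7 * 7 * 512, 4096) + fc_params(4096, 4096)

print(f"13 conv layers:       {encoder_params / 1e6:.1f}M")
print(f"conv + two FC layers: {(encoder_params + fc_hidden_params) / 1e6:.1f}M")
```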

Decoder Network

Each encoder has a corresponding decoder (mirrored).
Upsampling, then Conv -> Batch Normalisation -> ReLU.

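Some way of limiting the loss of feature map information caused by pooling and Conv operations is needed.
One option is to store the encoder feature maps and pass them to the decoder at upsampling time.
However, that approach requires a lot of memory.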
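
To get a feel for that memory cost, here is a back-of-the-envelope comparison between storing a full float32 feature map and storing only the position of the maximum in each 2x2 pooling window. The 2 bits per window figure follows the SegNet paper. The 64-channel, 360x480 feature map and the function names are assumptions chosen for illustration.

```python
def feature_map_bytes(channels, height, width, bytes_per_value=4):
    """Memory for storing a full float32 feature map."""
    return channels * height * width * bytes_per_value


def pooling_index_bytes(channels, height, width, bits_per_window=2):
    """Memory for storing one argmax location per 2x2 pooling window."""
    windows = channels * (height // 2) * (width // 2)
    return windows * bits_per_window / 8


full = feature_map_bytes(64, 360, 480)
indices = pooling_index_bytes(64, 360, 480)
print(f"full feature map: {full / 2**20:.1f} MiB")
print(f"pooling indices:  {indices / 2**20:.2f} MiB")
print(f"ratio: {full / indices:.0f}x")
```

The ratio is 64x whatever size is chosen, because each 2x2 window holds four 32-bit values (128 bits) against a 2-bit index.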

So the SegNet decoder stores only the max-pooling indices and uses them when upsampling.
This reduces the number of trainable parameters (transposed convolution requires additional trainable parameters).
With fewer trainable parameters, the whole model can be trained end-to-end.
Max unpooling can also be applied to other encoder-decoder networks -> SegNet's structural idea can be adapted or reused for other problems.
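
A minimal pure-Python sketch of the idea, on a single 4x4 channel: pooling remembers where each maximum came from, and unpooling writes the value back to that position and leaves zeros elsewhere. The function names and the toy feature map are made up for this example. A real implementation would work on batched tensors.

```python
def max_pool_with_indices(fmap):
    """2x2, stride-2 max pooling; also returns the (row, col) of each maximum."""
    pooled, indices = [], []
    for r in range(0, len(fmap) - 1, 2):
        pooled_row, index_row = [], []
        for c in range(0, len(fmap[0]) - 1, 2):
            window = [(fmap[r + dr][c + dc], (r + dr, c + dc))
                      for dr in (0, 1) for dc in (0, 1)]
            value, position = max(window, key=lambda item: item[0])
            pooled_row.append(value)
            index_row.append(position)
        pooled.append(pooled_row)
        indices.append(index_row)
    return pooled, indices


def max_unpool(pooled, indices, height, width):
    """Place each value back at its remembered position; everything else is 0."""
    output = [[0] * width for _ in range(height)]
    for pooled_row, index_row in zip(pooled, indices):
        for value, (r, c) in zip(pooled_row, index_row):
            output[r][c] = value
    return output


feature_map = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [6, 2, 3, 1],
]
pooled, indices = max_pool_with_indices(feature_map)
unpooled = max_unpool(pooled, indices, height=4, width=4)
print(pooled)    # [[4, 5], [6, 7]]
for row in unpooled:
    print(row)
```

The unpooled map is sparse. In SegNet the decoder's trainable convolutions that follow the upsampling turn it into a dense feature map. The unpooling step itself has no parameters.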

Softmax classifier

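The decoder's final output, with K channels (K = number of classes), is fed to a K-class softmax classifier, which computes class probabilities independently for each pixel.
The class with the highest probability at each pixel ==> final segmentation.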
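
The per-pixel classification step can be sketched as follows: apply a softmax across the K channel scores at each pixel, then take the most probable class. The function names and the score values are made up for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def segment(score_maps):
    """score_maps[k][r][c] is the class-k score at pixel (r, c).

    Returns a label map holding the most probable class per pixel.
    """
    num_classes = len(score_maps)
    height, width = len(score_maps[0]), len(score_maps[0][0])
    labels = [[0] * width for _ in range(height)]
    for r in range(height):
        for c in range(width):
            probs = softmax([score_maps[k][r][c] for k in range(num_classes)])
            labels[r][c] = max(range(num_classes), key=probs.__getitem__)
    return labels


# K = 3 classes on a 2x2 image (made-up scores)
scores = [
    [[2.0, 0.1], [0.3, 0.2]],   # class 0
    [[0.5, 1.5], [0.1, 0.9]],   # class 1
    [[0.1, 0.2], [2.2, 0.4]],   # class 2
]
labels = segment(scores)
print(labels)  # [[0, 1], [2, 1]]
```

Because softmax is monotonic, the argmax of the probabilities equals the argmax of the raw scores. The probabilities matter mainly for the training loss.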

3.1 Decoder Variants

The paper experiments with a range of configurations to find a good SegNet design.

SegNet-Basic

A smaller version of SegNet with 4 encoders and 4 decoders.
Uses 7x7 Conv kernels (to capture wide context).
No bias or ReLU in the decoder.

FCN-Basic

Shares the same encoder as SegNet-Basic, but every decoder uses the FCN decoding technique.

SegNet-Basic-SingleChannelDecoder

Similar to SegNet-Basic, but every decoder filter has a single channel.
The goal is to reduce the number of trainable parameters and shorten inference time.
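
A rough count shows why single-channel filters save so much. The sketch compares the weights of one 7x7 decoder layer whose filters span all input channels with one whose filters each see only their own input channel. The 64-channel width is an assumption for illustration, and the function names are hypothetical.

```python
def dense_conv_weights(in_ch, out_ch, kernel):
    """Every filter spans all input channels."""
    return in_ch * out_ch * kernel * kernel


def single_channel_conv_weights(channels, kernel):
    """Each filter sees only its own input channel."""
    return channels * kernel * kernel


CHANNELS = 64  # assumed number of feature maps per layer
KERNEL = 7     # 7x7 kernels, as in SegNet-Basic

dense = dense_conv_weights(CHANNELS, CHANNELS, KERNEL)
single = single_channel_conv_weights(CHANNELS, KERNEL)
print(f"multi-channel filters:  {dense} weights")
print(f"single-channel filters: {single} weights")
print(f"reduction: {dense // single}x")
```

With these assumed sizes the layer shrinks by a factor equal to the channel count.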

FCN-Basic-NoAddition

The FCN model with the skip connections removed (encoder feature maps are not added in the decoder).

Bilinear-Interpolation

FCN-Basic-NoAddition, but with fixed bilinear interpolation weights instead of FCN's learned upsampling.
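
For reference, fixed bilinear upsampling needs no learned weights at all: each output value is a distance-weighted average of the four nearest input values. The sketch below uses corner-aligned sampling, which is only one of several common conventions. The function name and the toy input are assumptions for illustration.

```python
def bilinear_upsample(fmap, out_h, out_w):
    """Resize a 2D map with fixed bilinear weights (corner-aligned sampling)."""
    in_h, in_w = len(fmap), len(fmap[0])
    out = [[0.0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        # Map the output row back to a fractional input row
        src_r = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        r0 = int(src_r)
        r1 = min(r0 + 1, in_h - 1)
        fr = src_r - r0
        for c in range(out_w):
            src_c = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            c0 = int(src_c)
            c1 = min(c0 + 1, in_w - 1)
            fc = src_c - c0
            # Interpolate along columns first, then along rows
            top = fmap[r0][c0] * (1 - fc) + fmap[r0][c1] * fc
            bottom = fmap[r1][c0] * (1 - fc) + fmap[r1][c1] * fc
            out[r][c] = top * (1 - fr) + bottom * fr
    return out


coarse = [[0.0, 2.0],
          [4.0, 6.0]]
fine = bilinear_upsample(coarse, 3, 3)
for row in fine:
    print(row)
```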

SegNet-Basic-EncoderAddition

SegNet-Basic with the encoder feature maps of every layer added element-wise to the corresponding decoder feature maps.

FCN-Basic-NoDimReduction

The FCN model without dimensionality reduction of the encoder feature maps.

3.3 Analysis

Using the full encoder feature maps gives the best performance.
When memory is limited, use dimensionality reduction or max unpooling (max-pooling indices) instead.
Larger decoders perform better.

4. Benchmarking

SegNet is evaluated on two scene segmentation datasets:
road scene segmentation
indoor scene segmentation

It is compared with several deep learning architectures, including FCN, DeepLab-LargeFOV, and DeconvNet.

4.1 Road Scene Segmentation

SegNet segments small classes well and produces smooth, natural-looking segmentation across the whole image.
DeepLab-LargeFOV is rated the most efficient model, but it does not segment small classes well.
FCN with learned deconvolution layers gives better results than FCN with fixed bilinear upsampling.
DeconvNet is the largest model and the least efficient to train, and it fails to segment small classes.

4.2 SUN RGB-D Indoor Scenes

An indoor scene dataset with 5,285 training images and 5,050 test images.
37 indoor scene classes, including wall, floor, ceiling, table, chair, and sofa.
Object shape, size, and pose vary widely, and different classes in each test image are often partially occluded, which makes the task hard.

The study ignores the depth information and uses only the RGB channels.
Depth could improve segmentation accuracy, but using it would require modifying the existing architecture or developing a new one.

Even with RGB-only input, SegNet segments large objects such as the legs of chairs and tables well.
However, segmentation quality is lower than on outdoor scenes and drops markedly as scene complexity increases.

6. Conclusion

SegNet is a deep convolutional network for segmenting road and indoor scenes efficiently in terms of memory and computation time.
By storing only the max pooling indices of the feature maps, SegNet uses less memory at inference time while still performing well.
SegNet shows competitive performance on large, well-known datasets and does particularly well on road scene segmentation.

