
Hello Potato World

[Potato Paper Review] Learning Deep Features for Discriminative Localization


Heosuab 2021. 8. 8. 04:32

 

⋆ ｡ ˚ ☁︎ ˚ ｡ ⋆ ｡ ˚ ☽ ˚ ｡ ⋆

[XAI paper review]

 

I became curious about Grad-CAM after seeing it introduced in the book Interpretable Machine Learning. While looking into it, I decided to first review the paper on CAM (Class Activation Maps), which is the foundation for related methods such as Grad-CAM, Grad-CAM++, and Guided Grad-CAM.

 

 


 Global Average Pooling(GAP) vs Global Max Pooling(GMP)


 

 

 Let us start with Global Average Pooling, the most important concept in this paper. A pooling layer shrinks the feature maps produced by the many convolution layers (filters) in a CNN. This reduces the number of parameters needed downstream and helps prevent the overfitting that too many parameters would cause.

 Among pooling methods, Max Pooling reduces the feature map by selecting the largest value within each local region. Global Max Pooling considers the entire (global) region at once, drastically reducing a 3-D feature map of shape (height, width, channel) to a 1-D vector of shape (channel,).

 The method that takes the largest value over the entire region is called Global Max Pooling (GMP), whereas the method that takes the average of all values is called Global Average Pooling (GAP).
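As a minimal sketch of the difference, the snippet below applies both poolings to a small feature map using NumPy. The array values and the (2, 2, 3) shape are made up for illustration.

```python
import numpy as np

# Hypothetical feature map with shape (height, width, channel) = (2, 2, 3).
feature_map = np.array([
    [[1.0, 0.0, 2.0], [3.0, 4.0, 2.0]],
    [[5.0, 0.0, 2.0], [7.0, 8.0, 2.0]],
])

# Global pooling collapses both spatial axes, leaving one value per channel.
gap = feature_map.mean(axis=(0, 1))  # Global Average Pooling
gmp = feature_map.max(axis=(0, 1))   # Global Max Pooling

print(gap.shape, gap)  # (3,) [4. 3. 2.]
print(gmp.shape, gmp)  # (3,) [7. 8. 2.]
```

GAP reflects every position of a channel, while GMP keeps only the single strongest response.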

 A typical CNN uses FC layers as its final layers. FC layers make the number of parameters very large, which can increase the risk of overfitting, and they discard the location information of the objects present in the feature map (before pooling). This paper replaces the final FC layers of the CNN with Global Average Pooling, which acts as a regularizer against overfitting and keeps the location information from being lost.
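To get a feel for the parameter savings, here is a rough count under assumed shapes. The 7x7x512 feature map, the 4096-unit FC layer, and the 1000 classes are illustrative choices, not numbers taken from the paper.

```python
# Hypothetical shapes, chosen only for illustration.
height, width, channels = 7, 7, 512
fc_units = 4096      # width of a hidden FC layer (assumption)
num_classes = 1000   # number of output classes (assumption)

# Flatten + FC: every spatial position of every channel gets its own weight per unit.
fc_weights = height * width * channels * fc_units

# GAP + one linear layer: one weight per (channel, class) pair.
gap_weights = channels * num_classes

print(fc_weights)   # 102760448
print(gap_weights)  # 512000
```

Under these assumptions, the first FC layer alone needs about 200 times as many weights as the whole GAP-based classifier head.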

 

 


 Learning Deep Features for Discriminative Localization


 The two key points of the CAM (Class Activation Maps) proposed in this paper are as follows.

 

  •  Weakly-supervised object localization
  •  Visualizing CNNs

 As mentioned above, using Global Average Pooling instead of FC layers keeps the location information from being lost, so several tasks can be performed with a single forward pass. For example, a CNN trained only for object classification can not only classify an image but also perform localization. In other words, given only the label of each image, the model can predict localization information that was never provided. (Weakly supervised learning: the information to be predicted is more detailed than the information given during training.)

 The figure below visualizes CAMs obtained with Global Average Pooling. It shows that the model classifies each image while also finding the regions where the objects are located.

 

 


 Class Activation Mapping


 As in the figure above, CAM is an algorithm that visualizes the regions a CNN considers important for deciding on a class when it makes a prediction for an input image. The structure is as follows.

1. Let $f_k(x,y)$ be the feature map of the last convolution layer. GAP is performed on each unit $k$, producing one value per unit. (The result of GAP is $F_k$.)

2. For each class c, the weighted sum of the $F_k$ with the class weights $w_k^c$ is computed to give $S_c$. This output $S_c$ is used as the input to the softmax.

3. The softmax operation yields the result $P_c$ for each class c. The bias is assumed to have almost no effect on classification performance, so it is set to 0 in the computation.

4. The CAM for class c is defined as $M_c$, and it is obtained by rearranging the expression for $S_c$.
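The equations behind steps 1 to 4 can be written out as follows, using the notation of the paper [1]. Note that the GAP result is written here as a plain sum over spatial positions; the normalizing constant of the average is omitted, and it only rescales the map.

$$F_k = \sum_{x,y} f_k(x,y)$$

$$S_c = \sum_k w_k^c F_k$$

$$P_c = \frac{\exp(S_c)}{\sum_{c'} \exp(S_{c'})}$$

Substituting $F_k$ into $S_c$ and swapping the order of summation gives

$$S_c = \sum_k w_k^c \sum_{x,y} f_k(x,y) = \sum_{x,y} \sum_k w_k^c f_k(x,y)$$

$$M_c(x,y) = \sum_k w_k^c f_k(x,y), \qquad S_c = \sum_{x,y} M_c(x,y)$$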

Therefore each CAM value $M_c(x,y)$ indicates the importance of the activation at spatial coordinate (x,y), that is, how much it contributes to the image being classified as class c.
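The computation can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' implementation: the function name `class_activation_map`, the random inputs, and the (H, W, K) layout are all assumptions.

```python
import numpy as np

def class_activation_map(features, weights, class_idx):
    """Return M_c(x, y) = sum_k w_k^c * f_k(x, y).

    features: (H, W, K) array, the last conv feature maps f_k(x, y)
    weights:  (K, C) array, w_k^c of the layer after GAP (bias set to 0)
    """
    return features @ weights[:, class_idx]

rng = np.random.default_rng(0)
features = rng.random((7, 7, 4))        # hypothetical H=7, W=7, K=4
weights = rng.standard_normal((4, 3))   # hypothetical K=4, C=3

# Class scores, with GAP written as a plain spatial sum.
pooled = features.sum(axis=(0, 1))      # F_k, shape (K,)
scores = pooled @ weights               # S_c, shape (C,)
probs = np.exp(scores - scores.max())
probs /= probs.sum()                    # P_c via softmax

c = int(np.argmax(probs))
cam = class_activation_map(features, weights, c)

# Summing the map over all positions recovers the class score S_c.
assert np.isclose(cam.sum(), scores[c])
```

The final assertion checks the identity $S_c = \sum_{x,y} M_c(x,y)$, which is what lets the classifier weights be reused as a localization map.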

Because the CAM is computed at the last convolution layer, its resolution is lower than that of the input. Upsampling the final CAM to the size of the original input image shows which regions of the input image are related to class c. In the figure above, the CAM highlights the regions that lead the input image to be classified as "Australian terrier", and the region where the dog is located is localized at the same time.
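A minimal sketch of the upsampling step is shown below. It uses nearest-neighbor repetition with an integer scale factor purely for brevity; a smoother interpolation such as bilinear is a common alternative. The helper names and the 2x2 input are hypothetical.

```python
import numpy as np

def upsample_nearest(cam, factor):
    """Enlarge a 2-D map by an integer factor using nearest-neighbor repetition."""
    return np.repeat(np.repeat(cam, factor, axis=0), factor, axis=1)

def normalize(cam):
    """Rescale a map to [0, 1] so it can be drawn as a heatmap."""
    cam = cam - cam.min()
    peak = cam.max()
    return cam / peak if peak > 0 else cam

cam = np.array([[0.0, 1.0],
                [2.0, 4.0]])           # hypothetical 2x2 CAM
heatmap = normalize(upsample_nearest(cam, 3))

print(heatmap.shape)                   # (6, 6)
print(heatmap[0, 0], heatmap[5, 5])    # 0.0 1.0
```

The normalized map can then be overlaid on the input image as a heatmap.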

 

 


 Results


 

CAMs visualized for four classes from ILSVRC.

In the first and second images, the head of the "briard" and the "hen" had a large influence on the prediction. In the third image the plates of the "barbell", and in the fourth image the bell of the "bell cote", contributed the most.

Next is a visualization of the CAMs predicted for several classes on a single image.

The actual ground truth label is "dome", and the CAMs are visualized for the top-5 classes with the highest probability. When the CAMs of several images were compared for a single class, they consistently highlighted the features that represent that class. When the CAMs of several classes are compared on one image, as in this figure, each class highlights a different region.
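Producing one CAM per top-5 class only requires picking different weight columns for the same feature maps. The sketch below assumes random features and weights with hypothetical shapes. Since softmax preserves the ordering of the scores, ranking by $S_c$ gives the same top-5 as ranking by $P_c$.

```python
import numpy as np

rng = np.random.default_rng(1)
features = rng.random((7, 7, 8))         # hypothetical (H, W, K)
weights = rng.standard_normal((8, 10))   # hypothetical (K, C) with C = 10 classes

scores = features.sum(axis=(0, 1)) @ weights   # S_c for every class
top5 = np.argsort(scores)[::-1][:5]            # indices of the 5 highest scores

# One CAM per class: all_cams[c] is M_c, with shape (H, W).
all_cams = np.einsum("hwk,kc->chw", features, weights)
top5_cams = all_cams[top5]

print(top5_cams.shape)   # (5, 7, 7)
```

All five maps come from the same forward pass; only the class weights differ.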

Localization was performed by generating a bounding box from the segmentation map obtained from the CAM. In the paper [1], the CAM is thresholded at 20% of its maximum value, and the box that covers the largest connected component of the resulting map is used. In each of a) and b), the upper images are the results obtained with GoogLeNet-GAP and the lower images are the results using AlexNet.

In each image, the ground truth is the bounding box drawn in green, and the box predicted with CAM is drawn in red.
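The box extraction can be sketched as follows. This is a simplified version: it thresholds the map at 20% of its maximum, as reported in the paper [1], but then boxes every above-threshold pixel rather than only the largest connected component. The function name `cam_to_bbox` and the toy map are hypothetical, and the map is assumed to have a positive maximum.

```python
import numpy as np

def cam_to_bbox(cam, ratio=0.2):
    """Return (x_min, y_min, x_max, y_max) of pixels above ratio * max(cam).

    Simplification: boxes every above-threshold pixel instead of only the
    largest connected component.
    """
    mask = cam > ratio * cam.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

cam = np.zeros((6, 6))
cam[2:5, 1:4] = 1.0     # hypothetical high-activation region
cam[0, 5] = 0.1         # weak response, below the 20% threshold

print(cam_to_bbox(cam))   # (1, 2, 3, 4)
```

With several separated activation blobs, this simplified version would return one box spanning all of them, which is why selecting the largest connected component matters in practice.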

 

 


 References


[1] Zhou et al., Learning Deep Features for Discriminative Localization, 2016

 

 

 

 
