Hello Potato World

[Potato Paper Review] Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization

Paper Review🥔/XAI

Heosuab 2021. 8. 8. 20:47

 

⋆ ｡ ˚ ☁︎ ˚ ｡ ⋆ ｡ ˚ ☽ ˚ ｡ ⋆

[XAI paper review]

 

 


 Generalization to CAM


When interpreting a model, there is a tradeoff between accuracy and simplicity (interpretability). The simpler the model, the easier it is to interpret; the more complex the model, the harder interpretation becomes. To interpret a model without giving up its accuracy, it is important to find the right balance between the two.

 CAM, reviewed in the previous post, replaces the FC layers at the end of a CNN with Global Average Pooling (GAP). This reduces overfitting and lets the model be visualized by performing a task it was never trained on (weakly-supervised object localization). However, swapping in GAP ultimately reduces the model's complexity for the sake of interpretation, and CAM can only be applied to models whose last convolutional feature maps pass through GAP directly into the final classification layer.

 Grad-CAM, proposed in this paper, can be applied to a model without any change to its architecture or complexity. It can therefore be seen as a generalization of CAM. Because it applies to the broader range of CNN models listed below, it is used for a wider set of tasks such as Image Classification, Localization, Captioning, and VQA (Visual Question Answering).

  • CNNs with FC layers (e.g. VGG)
  • CNNs that produce structured outputs, as in Image Captioning
  • CNNs that take multi-modal inputs, as in VQA
  • Reinforcement learning

 

 


 Grad-CAM (Gradient-weighted Class Activation Mapping)


 

Overall structure of CAM

  With the Global Average Pooling used in CAM, the global average of each feature map of the last convolutional layer is computed, and a weighted sum is then taken using weights that express the importance of each unit $k$. As noted above, CAM cannot be used unless these weights are available, so it is limited to CNN models that use Global Average Pooling. This paper instead obtains the weights by backpropagating gradients for each class $c$, so that $\alpha_k^c$ takes the place of $w_k^c$ ($\alpha_k^c=w_k^c$). How $\alpha$ is computed is described below.
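As a minimal sketch of the CAM computation described above, assume the last convolutional feature maps are available as a NumPy array of shape `(K, H, W)` and the weights for one class as a length-`K` vector. The names `feature_maps`, `class_weights`, `cam`, and `class_score` are illustrative and do not come from the paper.

```python
import numpy as np

def class_score(feature_maps, class_weights):
    """Score for one class in a GAP-based network: GAP, then weighted sum."""
    pooled = feature_maps.mean(axis=(1, 2))      # one average per feature map, shape (K,)
    return float(pooled @ class_weights)         # weighted sum over the K units

def cam(feature_maps, class_weights):
    """Class activation map: the same weights applied before pooling."""
    # Contract the K axis: sum_k w_k^c * A^k  ->  shape (H, W)
    return np.tensordot(class_weights, feature_maps, axes=1)

# Toy example with K=2 feature maps of size 2x2 (made-up numbers).
A = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[1.0, 1.0], [1.0, 1.0]]])
w = np.array([2.0, -1.0])
heatmap = cam(A, w)
score = class_score(A, w)
```

Because GAP and the weighted sum are both linear, the class score equals the spatial mean of the map, which is why the same weights can be reused to draw it.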

 

The figure above is an example of a model that uses FC layers rather than Global Average Pooling. Let $y^c$ be the output before the softmax layer (the score for each class $c$). To measure the influence of $A^k$, a feature map of the last convolutional layer, the gradient of the class score with respect to $A^k$ is computed.

The gradient is computed for every neuron $(i,j)$ in $A^k$ and then global-average-pooled, which gives a single importance weight $\alpha_k^c$ for each $A^k$ ($Z$ is the number of spatial positions in the feature map, i.e. its width times its height, used for the average). Because the backpropagation is done from the class output $y^c$ to the feature map $A^k$, each $\alpha_k^c$ expresses the influence (importance) of feature map $k$ for the target class $c$.
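Written out in the notation used above, the weight is the spatial average of the gradient, as defined in the Grad-CAM paper [1]:

$$\alpha_k^c = \frac{1}{Z} \sum_{i} \sum_{j} \frac{\partial y^c}{\partial A_{ij}^k}$$

For a CAM-style network, where $y^c = \sum_k w_k^c \cdot \frac{1}{Z} \sum_{i} \sum_{j} A_{ij}^k$, every partial derivative $\partial y^c / \partial A_{ij}^k$ equals $w_k^c / Z$. In that case $\alpha_k^c$ matches $w_k^c$ up to the constant factor $1/Z$, which disappears when the map is normalized for display. This is the sense in which Grad-CAM generalizes CAM.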

Map generation process, as seen in the CAM formulation

 As with CAM, the Grad-CAM map is obtained by taking a linear combination of the weights (importances) $\alpha_k^c$ and the feature maps $A^k$. A ReLU is then applied to the result so that only features with a positive influence on the class $c$ being analyzed are shown. (A positive influence means that increasing the intensity of those features increases $y^c$.) By the nature of ReLU, locations where the combined value is negative are set to 0.
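In the paper's notation the map is $L^c_{\text{Grad-CAM}} = \mathrm{ReLU}\left(\sum_k \alpha_k^c A^k\right)$. A minimal NumPy sketch of the two steps follows. It assumes the activations $A^k$ and the gradients $\partial y^c / \partial A^k$ have already been extracted from a framework with automatic differentiation, and the function name `grad_cam` and the toy numbers are illustrative only.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Grad-CAM map from activations A^k and gradients dy^c/dA^k.

    Both arrays are assumed to have shape (K, H, W).
    """
    # Step 1: global-average-pool the gradients -> one weight alpha_k^c per map.
    alpha = gradients.mean(axis=(1, 2))                  # shape (K,)
    # Step 2: linear combination of the feature maps, then ReLU.
    combined = np.tensordot(alpha, activations, axes=1)  # shape (H, W)
    return np.maximum(combined, 0.0)

# Toy example: feature map 0 supports the class, feature map 1 opposes it.
A = np.array([[[1.0, 2.0], [3.0, 4.0]],
              [[1.0, 1.0], [1.0, 1.0]]])
G = np.array([[[0.5, 0.5], [0.5, 0.5]],
              [[-1.0, -1.0], [-1.0, -1.0]]])
heatmap = grad_cam(A, G)
```

In this toy case the weights are 0.5 and -1, so the top row of the combined map is negative or zero and is clipped by the ReLU, leaving only the locations that push $y^c$ up.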

 

 

 


 Guided Grad-CAM (Guided Backpropagation + Grad-CAM)


 The figure above visualizes the analysis of a single image for two classes, "Tiger cat" (top row) and "Dog" (bottom row).

 Pixel-space gradient visualizations such as Guided Backpropagation and Deconvolution highlight fine-grained details of the image, but they do not produce class-discriminative results. For example, (b) is the visualization for "Cat" and (h) is the visualization for "Dog". Both are detailed down to the pixel level, yet the cat and dog regions are highlighted in both, so there is little difference between the two images.

 In contrast, localization approaches such as CAM and Grad-CAM produce class-discriminative results but do not capture fine details. In (c) and (i) of the figure, the cat and dog regions that influence each class are highlighted only coarsely, as a heatmap.

 

 Guided Grad-CAM was designed to fuse the strengths of these two approaches and provide visualizations that are both class-discriminative and high-resolution. In (d) and (j) of the figure, the analysis is detailed down to the pixel level while only the regions relevant to each class are highlighted. With a class-discriminative algorithm like this, even when the model's prediction is wrong, it is possible to find the reasoning behind the wrong prediction.
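The paper combines the two visualizations by upsampling the Grad-CAM heatmap to the input resolution and multiplying it element-wise with the Guided Backpropagation map. A rough sketch is shown below, assuming both maps are already available as NumPy arrays. The helper names `upsample_nearest` and `guided_grad_cam` are hypothetical, and nearest-neighbor repetition stands in for the bilinear interpolation used in the paper purely to keep the example short.

```python
import numpy as np

def upsample_nearest(cam, factor):
    """Enlarge an (H, W) heatmap by an integer factor using pixel repetition."""
    return np.repeat(np.repeat(cam, factor, axis=0), factor, axis=1)

def guided_grad_cam(guided_backprop, cam, factor):
    """Element-wise product of Guided Backpropagation and upsampled Grad-CAM.

    guided_backprop: (H*factor, W*factor) or (H*factor, W*factor, C)
    cam:             (H, W) Grad-CAM heatmap
    """
    heat = upsample_nearest(cam, factor)
    if guided_backprop.ndim == 3:
        heat = heat[..., None]        # broadcast over the channel axis
    return guided_backprop * heat

# Toy example: a 2x2 heatmap gating a 4x4 pixel-space map.
cam = np.array([[0.0, 1.0], [0.5, 0.0]])
gb = np.ones((4, 4))
out = guided_grad_cam(gb, cam, 2)
```

The coarse heatmap acts as a class-specific mask: pixel-level detail survives only where Grad-CAM assigns the class a non-zero value.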

 

 


 Application Tasks


 This figure shows the overall structure of Grad-CAM. The $y^c$ used for backpropagation earlier can be the class score in Image Classification, but it can also take several other forms.

 If it is the activation of a structured output caption, Grad-CAM can be applied to Image Captioning; if it is the activation of the answer to a question, it can be applied to VQA.

 

 In addition, CAM could only visualize the feature maps of the last convolutional layer, whereas Grad-CAM computes its weights through backpropagation and can therefore also visualize other, intermediate convolutional layers.

 

 

 


 Results


 The figure above applies the algorithms to VQA for the question "What Color is the firehydrant?". When visualized for the three classes "red", "yellow", and "yellow and red", Guided Backpropagation produced similar images for all of them, while Grad-CAM and Guided Grad-CAM correctly highlighted the regions that most influence each class.

According to the paper, it can provide interpretable explanations for more complex models as well.

 The figure above shows the result of applying it to an Image Captioning model. The regions that strongly influenced the predicted caption are highlighted.

 This figure visualizes Guided Grad-CAM on images that VGG-16 failed to classify. It is hard for a person to tell just by looking why the model made a wrong prediction, but with this algorithm the reason can be examined.

 For example, the true label of image (b) is "volcano", but the surrounding window frame is emphasized more than the volcano itself, which suggests why the model made the wrong prediction "car mirror".

 

 

 


 References


[1] Selvaraju et al., "Grad-CAM: Visual Explanations from Deep Networks via Gradient-based Localization", 2017

 

 

 
