
[ํฌํ…Œ์ดํ†  ๋…ผ๋ฌธ ๋ฆฌ๋ทฐ] Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning

Heosuab 2023. 1. 8. 18:02

⋆ ｡ ˚ ☁︎ ˚ ｡ ⋆ ｡ ˚ ☽ ˚ ｡ ⋆

[Zero-shot learning paper review]

 

 

 


 Compositional Zero-shot Learning (CZSL)


  ์ธ๊ฐ„์€ ์ด๋ฏธ ์•Œ๊ณ  ์žˆ๋Š” ๊ฐœ์ฒด์— ๋Œ€ํ•œ ์ •๋ณด๋“ค์„ ์กฐํ•ฉํ•˜๊ณ  ๊ตฌ์„ฑํ•˜์—ฌ ์ƒˆ๋กœ์šด ๊ฐœ์ฒด์— ์ผ๋ฐ˜ํ™”ํ•˜๋Š” ๋Šฅ๋ ฅ์„ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค. ๋‹ค์‹œ ๋งํ•ด์„œ, ์ด๋ฏธ ์•Œ๊ณ  ์žˆ๋Š” "whole apple"๊ณผ "sliced banana"์˜ ์ •๋ณด๋ฅผ ์กฐํ•ฉํ•ด์„œ ์ƒˆ๋กœ์šด ๊ฐœ์ฒด์ธ "sliced apple" ๋˜๋Š” "whole banana"๋ฅผ ์ƒ๊ฐํ•ด๋‚ผ ์ˆ˜ ์žˆ๋‹ค. ์ด ๋Šฅ๋ ฅ์„ AI ์‹œ์Šคํ…œ์—์„œ ๋ชจ๋ฐฉํ•˜๊ณ ์ž ํ•˜๋Š” task๋ฅผ Compositional Zero-shot  Learning (CZSL)์ด๋ผ๊ณ  ํ•œ๋‹ค.

[figure 01] Compositional Zero-shot

  CZSL์—์„œ๋Š” ๊ฐ๊ฐ์˜ ๊ฐœ์ฒด(composition)์„ ๋‘ ๊ฐ€์ง€์˜ ๊ตฌ์„ฑ ์š”์†Œ, state์™€ object๋กœ ๋ถ„ํ•ดํ•œ๋‹ค. "whole", "sliced"์ฒ˜๋Ÿผ ๊ฐœ์ฒด์˜ ์ƒํƒœ๋ฅผ ํ‘œํ˜„ํ•˜๋Š” ์š”์†Œ๋Š” state, "apple", "banana"์ฒ˜๋Ÿผ ๊ฐœ์ฒด์˜ ํ˜•ํƒœ์™€ ์ข…๋ฅ˜๋ฅผ ๊ตฌ๋ถ„์ง“๋Š” ์š”์†Œ๋Š” object๋กœ ์ •์˜๋œ๋‹ค. CZSL์˜ ๋ชฉํ‘œ๋Š” Training data์™€ test data๊ฐ€ ๊ณตํ†ต ์›์†Œ๋ฅผ ๊ฐ€์ง€์ง€ ์•Š๊ณ  ๋ถ„๋ฆฌ๋˜์–ด ์žˆ๋‹ค๋Š” ๊ฐ€์ • ํ•˜์—, ์ƒˆ๋กœ์šด(unseen) test composition์„ ์‹๋ณ„ํ•˜๋Š” ๊ฒƒ์ด๋‹ค. 

  Representative approaches that have been used in CZSL to recognize both the state and the object are:

  • ๋‘ ๊ฐ€์ง€์˜ classifier๋ฅผ ๋”ฐ๋กœ ๋‘์–ด state์™€ object๋ฅผ ๋”ฐ๋กœ ํ•™์Šตํ•˜๋Š” ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค. ํ•˜์ง€๋งŒ ์ด ๋ฐฉ๋ฒ•์€ state-object ์‚ฌ์ด์˜ ์ƒํ˜ธ ์ž‘์šฉ ๋˜๋Š” entanglement๋ฅผ ๋ฌด์‹œํ•˜๊ฒŒ ๋œ๋‹ค.
  • ๋˜ ๋ชจ๋“  composition๊ณผ visual feature๋“ค์ด ํ•œ๋ฒˆ์— ํˆฌ์˜๋  ์ˆ˜ ์žˆ๋Š” ๊ณตํ†ต embedding space๋ฅผ ํ•™์Šตํ•˜์—ฌ embedding ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋ฅผ ๊ณ„์‚ฐํ•˜์—ฌ ํ™œ์šฉํ•˜๋Š” ๋ฐฉ๋ฒ•์ด ์žˆ๋Š”๋ฐ, training๊ณผ test composition ์‚ฌ์ด์˜ ์ฐจ์ด๋ฅผ ๋ฌด์‹œํ•˜์—ฌ ๋น„์Šทํ•œ ๊ฐœ์ฒด๋“ค์„ ํ˜ผ๋™ํ•  ์ˆ˜ ์žˆ๋‹ค. (e.g., young cat and young tiger

  ๋”ฐ๋ผ์„œ ํ•ด๋‹น ๋…ผ๋ฌธ์—์„œ๋Š” state์™€ object ๊ฐ๊ฐ์˜ ๊ตฌ๋ถ„์ ์ธ prototype๋ฅผ ํ™œ์šฉํ•˜์ง€๋งŒ ๋‘˜ ์‚ฌ์ด์˜ joint representation๋„ ํ•จ๊ป˜ ํ•™์Šตํ•˜๋Š” Siamese Contrastive Embedding Network (SCEN)์„ ์ œ์•ˆํ•œ๋‹ค.

 

 

 


 Siamese Contrastive Embedding Network (SCEN)


  States์˜ ์ง‘ํ•ฉ์„ $A$, objects์˜ ์ง‘ํ•ฉ์„ $O$๋ผ๊ณ  ํ•˜๋ฉด, state-object์˜ ์Œ์œผ๋กœ ๊ตฌ์„ฑ๋˜๋Š” components์˜ ์ง‘ํ•ฉ $C$๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„๋  ์ˆ˜ ์žˆ์œผ๋ฉฐ,

\[ C = A \times O = \{(a,o) \mid a \in A, o \in O\} \]

  Training dataset์—์„œ์˜ images ์ง‘ํ•ฉ์„ $I^s$, ๊ทธ์— ๋Œ€์‘ํ•˜๋Š” component๋ฅผ $C^s$ ($C^s \subset C$) ๋ผ๊ณ  ํ•˜๋ฉด, image-component์˜ ์Œ์œผ๋กœ ๊ตฌ์„ฑ๋˜๋Š” training dataset $D_{tr}$์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ํ‘œํ˜„๋œ๋‹ค.

\[ D_{tr} = \{(i,c) \mid i \in I^s, c \in C^s\} \]

  CZSL task์˜ ์ •์˜์— ๋”ฐ๋ผ training data์™€ test data๋Š” ๊ณตํ†ต ์›์†Œ๋ฅผ ๊ฐ€์ง€์ง€ ์•Š์•„์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, training data์˜ composition์„ $C^s$, test data์˜ composition์„ $C^u$๋ผ๊ณ  ํ•œ๋‹ค๋ฉด $C^s \cap C^u = \emptyset$์„ ๋งŒ์กฑํ•ด์•ผ ํ•œ๋‹ค. ๋˜ํ•œ ์ƒˆ๋กœ์šด ์ด๋ฏธ์ง€๋ฅผ seen, unseen composition ์ค‘์—์„œ ์˜ˆ์ธกํ•ด์•ผ ํ•˜๊ธฐ ๋•Œ๋ฌธ์—, $\{I^s, C^s\}$๋กœ ํ•™์Šต๋œ mapping function $I \to C^s \cup C^u$๋ฅผ ํ•™์Šตํ•˜๋Š” ๊ฒƒ์„ ๋ชฉํ‘œ๋กœ ํ•œ๋‹ค. 

 

 

[figure 02] SCEN framework

  The SCEN architecture can be summarized as three modules:

  1. Encoding : encode the state and the object separately
  2. Contrastive learning : learn state/object prototypes in their respective contrastive spaces
  3. Augmentation : generate virtual compositions with the State Transition Module (STM)

 

- Module 1. Encoding

  The visual feature $x$ obtained by passing an image through the feature extractor is encoded into two embeddings so that it can be decomposed into its state and object components. The State-specific Encoder $E_s$ is trained to represent the state well, and the Object-specific Encoder $E_o$ to represent the object well (a sketch of the encoders follows the equations below).

\[ h_s = E_s(x) \]

\[ h_o = E_o(x) \]
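A minimal PyTorch-style sketch of the two specific encoders; the feature/embedding dimensions and MLP depth are assumptions for illustration, not values taken from the paper:

```python
import torch
import torch.nn as nn

class SpecificEncoder(nn.Module):
    """Projects a backbone visual feature x into one embedding space."""
    def __init__(self, feat_dim=512, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, emb_dim),
            nn.ReLU(),
            nn.Linear(emb_dim, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

E_s = SpecificEncoder()    # state-specific encoder
E_o = SpecificEncoder()    # object-specific encoder

x = torch.randn(4, 512)    # a batch of visual features from the feature extractor
h_s, h_o = E_s(x), E_o(x)  # state and object embeddings
```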

  State-object์˜ ๋‹ค์–‘ํ•œ ์กฐํ•ฉ์„ ํ†ตํ•ด ์—ฌ๋Ÿฌ composition์„ ๊ตฌ์„ฑํ•  ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ์ด๋Ÿฌํ•œ joint representation์„ ํ•™์Šตํ•˜๊ธฐ ์œ„ํ•ด ์„ธ ๊ฐ€์ง€์˜ ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค๋ฅผ ์ •์˜ํ•œ๋‹ค.

  • ๊ณ ์ •๋œ state์— ๋‹ค์–‘ํ•œ objects๋ฅผ ์กฐํ•ฉํ•˜๋Š” State-constant database $D_s$
  • ๊ณ ์ •๋œ object์— ๋‹ค์–‘ํ•œ states๋ฅผ ์กฐํ•ฉํ•˜๋Š” Object-constant database $D_o$
  • ๋‹ค์–‘ํ•œ objects-states ์กฐํ•ฉ ์ค‘์—์„œ input instance์™€ ๊ด€๋ จ์ด ์—†๋Š” Irrelevant database $D_{ir}$

Given an input instance $x=(\hat{a},\hat{o}) \in I^s$ with state $\hat{a}$ and object $\hat{o}$, the three databases are constructed as follows (a small sketch in code follows the equations).

\[ D_s = \{(a,o) \mid a=\hat{a}, (a,o) \in C^s\} \]

\[ D_o = \{(a,o) \mid o=\hat{o}, (a,o) \in C^s\} \]

\[ D_{ir} = \{(a,o) \mid a \ne \hat{a}, o \ne \hat{o}, (a,o) \in C^s \} \]
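A small Python sketch of how these databases could be assembled from the seen compositions $C^s$ for one input instance (the helper and variable names are hypothetical):

```python
def build_databases(anchor, seen_compositions):
    """anchor = (a_hat, o_hat); seen_compositions = C^s as (state, object) pairs."""
    a_hat, o_hat = anchor
    D_s  = [(a, o) for (a, o) in seen_compositions if a == a_hat]                  # state-constant
    D_o  = [(a, o) for (a, o) in seen_compositions if o == o_hat]                  # object-constant
    D_ir = [(a, o) for (a, o) in seen_compositions if a != a_hat and o != o_hat]   # irrelevant
    return D_s, D_o, D_ir

D_s, D_o, D_ir = build_databases(
    ("sliced", "apple"),
    [("sliced", "apple"), ("sliced", "banana"), ("whole", "apple"), ("red", "banana")],
)
```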

 

 

- Module 2. Contrastive learning

  The embeddings $h_s, h_o$, projected by $E_s, E_o$ into two independent embedding spaces (the Siamese contrastive spaces), are trained by contrastive learning into prototypes that best represent the state and the object, respectively. However, training the state and the object separately with a standard contrastive loss ignores the state-object interaction, so new losses are defined using the three databases above.


  • State-based contrastive loss $\mathcal{L}_{scl}$

      input $x$์˜ state encoding $h_s$๊ฐ€ state-based contrastive space์˜ anchor๋กœ ์„ค์ •๋œ๋‹ค. input $x$์™€ ๋™์ผํ•œ state๋ฅผ ๊ฐ€์ง€๋Š” ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค $D_s$๋กœ๋ถ€ํ„ฐ positive sample $h_s^{ss}$์„ ์ถ”์ถœํ•˜๋ฉฐ, ๋™์ผํ•˜์ง€ ์•Š์€ state๋ฅผ ๊ฐ€์ง€๋Š” ๋ฐ์ดํ„ฐ๋ฒ ์ด์Šค $D_{ir}$๋กœ๋ถ€ํ„ฐ $k$๊ฐœ์˜ negative samples $\{ h_{s_1}^{ir}, ..., h_{s_k}^{ir} \}$๋ฅผ ์ถ”์ถœํ•œ๋‹ค.
      anchor์™€ positive ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋Š” ๊ฐ€๊นŒ์›Œ์ง€๋„๋ก, anchor์™€ negative ์‚ฌ์ด์˜ ๊ฑฐ๋ฆฌ๋Š” ๋ฉ€์–ด์ง€๋„๋ก ํ•™์Šตํ•˜๋Š” contrastive loss๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ์ •์˜๋œ๋‹ค. ($\tau_s > 0$ : temperature parameter)
    \[ \mathcal{L}_{scl} = -\log \frac{\exp((h_s)^{\top} h_s^{ss} / \tau_s)}{\exp((h_s)^{\top} h_s^{ss}/\tau_s) + \sum\nolimits_{i=1}^K \exp((h_s)^{\top} h_{s_i}^{ir}/ \tau_s)} \]

  • Object-based contrastive loss $\mathcal{L}_{ocl}$

      The anchor in the object-based contrastive space is $h_o$. A positive sample $h_o^{os}$ is drawn from the database $D_o$, which shares the object, and $K$ negative samples $\{ h_{o_1}^{ir}, ..., h_{o_K}^{ir} \}$ are drawn from $D_{ir}$.
      To account for the state-object interaction, the negative samples drawn from $D_{ir}$ are the same instances in both losses ($\tau_o > 0$ : temperature parameter):
    \[ \mathcal{L}_{ocl} = -\log \frac{\exp((h_o)^{\top} h_o^{os} / \tau_o)}{\exp((h_o)^{\top} h_o^{os}/\tau_o) + \sum\nolimits_{j=1}^K \exp((h_o)^{\top} h_{o_j}^{ir}/ \tau_o)} \]

  • Classification Loss $\mathcal{L}_{cls}$

      So that the classifiers can discriminate using the state and object prototypes, a classification loss is computed independently in each space. With $C_a$ and $C_o$ denoting the fully connected layers that classify the state and the object respectively, the overall classification loss is
    \[ \mathcal{L}_{cls} = C_a(h_s, a) + C_o(h_o, o) \]

 ์œ„์—์„œ ์ •์˜ํ•œ ์„ธ ๊ฐ€์ง€์˜ loss๋ฅผ ํ†ตํ•ด  Siamese Contrastive Space์˜ ์ „์ฒด loss $L_{cts}$๊ฐ€ ์ •์˜๋œ๋‹ค.

\[ \mathcal{L}_{cts} = \mathcal{L}_{scl} + \mathcal{L}_{ocl} + \mathcal{L}_{cls} \]
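A minimal PyTorch-style sketch of $\mathcal{L}_{scl}$, $\mathcal{L}_{ocl}$, and $\mathcal{L}_{cls}$, assuming the anchor, positive, and negative embeddings and the classifier logits are already computed; the function names and shapes are illustrative, not from the paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce(anchor, positive, negatives, tau):
    """anchor, positive: (d,); negatives: (K, d). Single-positive contrastive loss."""
    pos = torch.exp(anchor @ positive / tau)
    neg = torch.exp(negatives @ anchor / tau).sum()
    return -torch.log(pos / (pos + neg))

def cts_loss(h_s, h_o,            # anchors from E_s, E_o
             pos_s, pos_o,        # positives drawn from D_s and D_o
             neg_s, neg_o,        # state/object embeddings of the SAME D_ir samples
             logits_a, logits_o,  # classifier outputs of C_a, C_o, shape (1, #classes)
             a_label, o_label,    # ground-truth state/object indices, shape (1,)
             tau_s=0.1, tau_o=0.1):
    L_scl = info_nce(h_s, pos_s, neg_s, tau_s)   # state-based contrastive loss
    L_ocl = info_nce(h_o, pos_o, neg_o, tau_o)   # object-based contrastive loss
    L_cls = F.cross_entropy(logits_a, a_label) + F.cross_entropy(logits_o, o_label)
    return L_scl + L_ocl + L_cls                 # L_cts
```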

 

 

- Module 3. Augmentation

  To reduce the gap between training and test by improving generalization to unseen compositions that never appear in the training data, the paper proposes the State Transition Module (STM), which generates virtual compositions. Given two training compositions such as "sliced apple" and "red fox", STM aims to generate "red apple", which is absent from the data but plausible in reality, while filtering out "sliced fox", which is absent from the data and implausible.

[figure 03] STM framework

  1. To combine the object of the input image with various states from other images, the prototypes are extracted separately. The object-specific encoder extracts the object prototype $h_o$ of the input $x$, and the state-specific encoder extracts the state prototypes $ h_{\tilde{s}} = \{ h_{s_1}, h_{s_2}, ..., h_{s_n} \}$ of other samples $\{ s_1, s_2, ..., s_n \}$.

  2. The generator $G$ combines the extracted prototypes into a virtual composition.
    \[ G(h_{\tilde{s}}, h_o) = \hat{x}_{\tilde{s},o} \]

  3. The discriminator $D$ identifies, among the generated virtual compositions, those unlikely to exist in reality (irrational compositions).
    \[ \max_{D} \min_{G,E_s,E_o} V(G, D) = \mathbb{E}_{a,o} [\log D(x_{a,o})] + \mathbb{E}_{h_{\tilde{s}},h_o} [\log(1-D(G(h_{\tilde{s}},h_o)))] \]

  4. The aim is to improve $E_s$ and $E_o$ with the newly generated data, but since the generated images have no labels, they are re-encoded. The state/object prototypes extracted in this second encoding pass define a re-classification loss.
    \[ \mathcal{L}_{cls_{re}} = C_a(E_s(G(h_{\tilde{s}}, h_o)), \tilde{a}) + C_o(E_o(G(h_{\tilde{s}}, h_o)), o) \]

์œ„์—์„œ ์ •์˜ํ•œ ๋‘ ๊ฐ€์ง€์˜ loss๋ฅผ ํ†ตํ•ด State Transition Module (STM)์˜ ์ „์ฒด loss $\mathcal{L}_{stm}$๊ฐ€ ์ •์˜๋œ๋‹ค.

\[ \mathcal{L}_{stm} = \max_{D} \min_{G,E_s,E_o} V(G, D) + \mathcal{L}_{cls_{re}} \]
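A rough PyTorch-style sketch of one STM update in the spirit of the objective above; the generator/discriminator architectures, dimensions, and the binary cross-entropy form of the adversarial terms are assumptions for illustration (in practice $D$ is maximized while $G, E_s, E_o$ are minimized in alternating steps):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

emb_dim, feat_dim = 128, 512
G = nn.Sequential(nn.Linear(2 * emb_dim, 256), nn.ReLU(), nn.Linear(256, feat_dim))  # generator
D = nn.Sequential(nn.Linear(feat_dim, 128), nn.ReLU(), nn.Linear(128, 1))            # discriminator

def stm_losses(x_real, h_s_other, h_o, E_s, E_o, C_a, C_o, a_tilde, o_label):
    # 2. generate a virtual composition feature from (borrowed state prototype, own object prototype)
    x_fake = G(torch.cat([h_s_other, h_o], dim=-1))
    # 3. adversarial terms: D should score real features high and irrational fakes low
    real_logit, fake_logit = D(x_real), D(x_fake)
    adv = F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit)) + \
          F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    # 4. re-encode the generated feature and classify it with the labels used to compose it
    L_cls_re = F.cross_entropy(C_a(E_s(x_fake)), a_tilde) + \
               F.cross_entropy(C_o(E_o(x_fake)), o_label)
    return adv, L_cls_re   # together they form L_stm
```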

 

๋ณธ ๋…ผ๋ฌธ์—์„œ ์ œ์•ˆํ•œ SCEN framework์˜ final loss๋Š” $\mathcal{L}_{cts}, \mathcal{L}_{stm}$์˜ weighted sum์œผ๋กœ ์ •์˜๋œ๋‹ค.

\[ \mathcal{L}_{total} = \alpha \mathcal{L}_{cts} + \beta \mathcal{L}_{stm} \]
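As a one-line illustration (the weights below are placeholders, not the paper's values):

```python
def total_loss(L_cts, L_stm, alpha=1.0, beta=1.0):
    # Weighted sum of the Siamese contrastive loss and the STM loss
    return alpha * L_cts + beta * L_stm
```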

 

 

 

 


 Results


  CZSL์˜ ๋Œ€ํ‘œ์ ์ธ benchmark dataset ์„ธ ๊ฐ€์ง€์—์„œ ์‹คํ—˜ ๊ฒฐ๊ณผ๋ฅผ ๋ณด์ธ๋‹ค : MIT-States, UT-Zappos, C-GQA

[table 01] SCEN in MIT-States, UT-Zappos results

  On the MIT-States dataset, SCEN achieves a test AUC of 5.3%, surpassing the previous SOTA of 5.1% (+0.2%), and a Harmonic Mean (HM) score of 18.4% (+1.2%). It also achieves the best state and object prediction accuracies, 28.2% (+0.3%) and 32.2% (+0.4%). SCEN likewise sets SOTA performance on UT-Zappos.

[table 02] SCEN in C-GQA results

  On the most recently released C-GQA dataset, SCEN also improves AUC, HM, and state/object accuracy across the board.

 

 

 


References


[1] Li, Xiangyu, et al. "Siamese Contrastive Embedding Network for Compositional Zero-Shot Learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022.
