This post summarizes several topics and papers related to conditional image generation.

Preliminary

CFG scale

CFG (Classifier-Free Guidance) scale is a scaling parameter used to control condition-based sampling in diffusion models. It adjusts the relative weight of the unconditional and conditional samples, which controls the quality of the model's output.

$$\tilde{\boldsymbol{\epsilon}}_\theta\left(\mathbf{z}_\lambda, \mathbf{c}\right)=(1+w)\, \boldsymbol{\epsilon}_\theta\left(\mathbf{z}_\lambda, \mathbf{c}\right)-w\, \boldsymbol{\epsilon}_\theta\left(\mathbf{z}_\lambda\right)$$

  • Small values (e.g., 1.0 or below): the model generates more natural, generic samples. At w = -1 the prompt has no influence at all

  • Large values (e.g., 7.0 or above): the model tends to follow the condition more strongly

  • In general, a CFG scale between 7 and 11 tends to give the best results
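As a minimal sketch of the guidance formula above, the following NumPy snippet combines two noise predictions and checks the limiting cases. The arrays are made-up stand-ins for a real denoiser's outputs, and `cfg_combine` is a hypothetical helper name, not part of any library.

```python
import numpy as np


def cfg_combine(eps_cond, eps_uncond, w):
    """Combine noise predictions as (1 + w) * eps_cond - w * eps_uncond."""
    return (1.0 + w) * eps_cond - w * eps_uncond


# Made-up noise predictions standing in for a real denoiser's outputs.
rng = np.random.default_rng(0)
eps_cond = rng.normal(size=4)
eps_uncond = rng.normal(size=4)

only_uncond = cfg_combine(eps_cond, eps_uncond, w=-1.0)  # prompt is ignored
only_cond = cfg_combine(eps_cond, eps_uncond, w=0.0)     # plain conditional prediction
guided = cfg_combine(eps_cond, eps_uncond, w=6.0)        # pushed further toward the condition

print(np.allclose(only_uncond, eps_uncond))
print(np.allclose(only_cond, eps_cond))
```

The same expression can be rewritten as `eps_uncond + (1 + w) * (eps_cond - eps_uncond)`. Some toolkits expose the factor `1 + w` as their guidance scale, so it is worth checking which convention a given tool uses before comparing numbers.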

Negative Prompt

Looking at the denoising step in the CFG scale formula above, it takes the form 'conditional sampling - unconditional sampling'.

A negative prompt is implemented by hijacking the unconditional sampling in that scheme. In other words, the step takes the form **'conditional sampling - unconditional sampling with negative prompt'**. The linked resource is a good visual explanation of this.
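The substitution described above can be sketched as follows: the same guidance formula is used, but the prediction conditioned on the negative prompt takes the place of the unconditional one. The three arrays are made-up values, and `guided_eps` is a hypothetical helper used only for illustration.

```python
import numpy as np


def guided_eps(eps_prompt, eps_baseline, w):
    """CFG update with an arbitrary baseline in the 'unconditional' slot."""
    return (1.0 + w) * eps_prompt - w * eps_baseline


# Made-up predictions; a real pipeline would get these from the denoiser.
eps_prompt = np.array([1.0, 0.0, 0.5])    # conditioned on the prompt
eps_empty = np.array([0.2, 0.1, 0.4])     # conditioned on the empty prompt
eps_negative = np.array([0.0, 1.0, 0.5])  # conditioned on the negative prompt

w = 6.0
plain_cfg = guided_eps(eps_prompt, eps_empty, w)         # ordinary CFG
with_negative = guided_eps(eps_prompt, eps_negative, w)  # negative prompt in the baseline slot

# The result is pushed away from the negative-prompt prediction.
shift = with_negative - eps_prompt
print(np.allclose(shift, w * (eps_prompt - eps_negative)))
```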

CLIP Skip

CLIP skip means skipping some of the CLIP layers (the last few) during image generation. It can improve speed and, in some cases, image quality as well.

The deeper layers of CLIP relate to fine-grained descriptions in the text, and for certain models this fine-grained understanding can add unnecessary noise to image generation. CLIP skip is therefore useful for reducing excessive noise and using GPU time efficiently.

  • Low CLIP skip: reflects the text description in fine detail
  • High CLIP skip: sacrifices some of the detail in the text
  • A CLIP skip of 1 to 2 is generally suitable.
  • However, when using techniques such as LoRA or negative prompts, check carefully for issues such as whether the LoRA is distorted by the CLIP setting
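To make the idea concrete, here is a toy sketch of skipping the last few layers of a text encoder. It assumes a convention in which `clip_skip=1` uses every layer and `clip_skip=2` stops one layer early; the 12-layer "encoder" is a made-up stack of functions, not a real CLIP model, and real pipelines typically add details this toy omits.

```python
import numpy as np


def encode_with_clip_skip(token_embeddings, layers, clip_skip=1):
    """Run a layer stack, dropping the last (clip_skip - 1) layers.

    Assumed convention: clip_skip=1 uses every layer, clip_skip=2 stops
    one layer early, and so on.
    """
    if not 1 <= clip_skip <= len(layers):
        raise ValueError("clip_skip must be between 1 and the number of layers")
    layers_used = len(layers) - (clip_skip - 1)
    hidden = token_embeddings
    for layer in layers[:layers_used]:
        hidden = layer(hidden)
    return hidden, layers_used


# Toy 12-layer "encoder": each layer just adds 1 so the depth is visible.
toy_layers = [lambda h: h + 1.0 for _ in range(12)]
tokens = np.zeros((3, 4))  # 3 tokens, 4 features

full_output, full_depth = encode_with_clip_skip(tokens, toy_layers, clip_skip=1)
skip_output, skip_depth = encode_with_clip_skip(tokens, toy_layers, clip_skip=2)
print(full_depth, skip_depth)
```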

BF16 / FP16 / FP32

BF16 (Brain Float 16) consists of a **1-bit sign, 8-bit exponent, and 7-bit mantissa**.

  • The small mantissa (significand) means lower precision, but it uses the same number of exponent bits as FP32, so the dynamic range is wide
  • Provides a dynamic range close to FP32 while being fast to compute and light on memory
  • Frequently used in neural network training and inference, and especially useful where precision is not critical (e.g., gradient computation)

FP16 consists of a **1-bit sign, 5-bit exponent, and 10-bit mantissa**.

  • Like BF16, fast to compute and light on memory
  • The narrow dynamic range means very large or very small values may not be represented accurately (risk of overflow/underflow)

FP32 consists of a **1-bit sign, 8-bit exponent, and 23-bit mantissa**.

  • Provides high precision and reduces overflow/underflow problems
  • In exchange, memory usage and compute cost are higher
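The trade-off between range and precision follows directly from the bit layouts. The sketch below derives the largest finite value, the smallest normal value, and the spacing just above 1.0 from the exponent and mantissa widths, assuming IEEE-style encoding with an implicit leading bit. `float_format_stats` is a hypothetical helper written for this note.

```python
def float_format_stats(exponent_bits, mantissa_bits):
    """Derive range and precision from an IEEE-style bit layout."""
    bias = 2 ** (exponent_bits - 1) - 1
    return {
        "max_finite": (2 - 2.0 ** -mantissa_bits) * 2.0 ** bias,
        "min_normal": 2.0 ** (1 - bias),
        "epsilon": 2.0 ** -mantissa_bits,  # gap between 1.0 and the next value
    }


formats = {"BF16": (8, 7), "FP16": (5, 10), "FP32": (8, 23)}
stats = {name: float_format_stats(e, m) for name, (e, m) in formats.items()}

for name, s in stats.items():
    print(f"{name}: max={s['max_finite']:.4g}, "
          f"min_normal={s['min_normal']:.4g}, eps={s['epsilon']:.4g}")
```

BF16 and FP32 share the same exponent width, so their ranges are nearly identical, while FP16 tops out at 65504 but resolves values near 1.0 more finely than BF16.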

LoRA Alpha

In LoRA, alpha is a scaling factor that controls how strongly the LoRA affects the model. The LoRA weight scale also plays a role in controlling that strength. LoRA is typically computed as follows (dim below denotes the LoRA dimension, i.e., its rank):

output = original_weight + (lora_A @ lora_B) * (alpha / dim)

In the expression above, alpha / dim is the part that controls the LoRA's strength. For example, with dim = 64 and alpha = 32, alpha / dim is 0.5. In other words, with alpha=32 and dim=64, setting the LoRA weight scale to 0.5 at inference has the same effect as the alpha-to-dim ratio that was set when the LoRA was trained.
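The arithmetic can be checked with a small NumPy sketch that follows the expression above literally. The matrix shapes and random values are made up for illustration, and `merge_lora` is a hypothetical helper rather than a function from any LoRA library.

```python
import numpy as np


def merge_lora(original_weight, lora_A, lora_B, alpha, dim):
    """Apply the low-rank update scaled by alpha / dim."""
    return original_weight + (lora_A @ lora_B) * (alpha / dim)


rng = np.random.default_rng(0)
dim = 64  # LoRA dimension used in the example above
original_weight = rng.normal(size=(128, 96))
lora_A = rng.normal(size=(128, dim))
lora_B = rng.normal(size=(dim, 96))

# alpha=32 with dim=64 scales the low-rank update by 0.5.
merged = merge_lora(original_weight, lora_A, lora_B, alpha=32, dim=dim)
half_strength = original_weight + 0.5 * (lora_A @ lora_B)
print(np.allclose(merged, half_strength))
```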

ControlNet

The UNet layers are copied as-is, and zero convolution layers, whose weights and biases are initialized to 0, are placed at the input and output. The trainable copy and the convolution layers are then tuned.
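A toy sketch of why zero initialization matters: with both zero convolutions at their initial values, the control branch contributes nothing, so the combined output starts out identical to the frozen model's output. Everything here is a simplification under stated assumptions: `toy_block` is a hypothetical stand-in for a UNet block, and the 1x1 convolution is written as a per-pixel linear map.

```python
import numpy as np


class ZeroConv1x1:
    """1x1 convolution whose weight and bias start at zero."""

    def __init__(self, channels):
        self.weight = np.zeros((channels, channels))
        self.bias = np.zeros(channels)

    def __call__(self, x):
        # x has shape (H, W, C); a 1x1 conv is a per-pixel linear map.
        return x @ self.weight.T + self.bias


def toy_block(x, weight):
    """Hypothetical stand-in for one UNet block."""
    return np.tanh(x @ weight.T)


rng = np.random.default_rng(0)
channels = 8
x = rng.normal(size=(4, 4, channels))          # feature map
condition = rng.normal(size=(4, 4, channels))  # e.g. an encoded edge map

frozen_weight = rng.normal(size=(channels, channels))
copy_weight = frozen_weight.copy()             # trainable copy starts identical
zero_in, zero_out = ZeroConv1x1(channels), ZeroConv1x1(channels)

frozen_output = toy_block(x, frozen_weight)
control_branch = zero_out(toy_block(x + zero_in(condition), copy_weight))
controlled_output = frozen_output + control_branch

print(np.allclose(controlled_output, frozen_output))
```

Once training moves the zero convolution weights away from zero, the control branch starts to influence the output gradually instead of disturbing the pretrained model from the first step.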

img

The result below shows that the zero-initialized convolution was important.

img

Training uses the same loss as the original Stable Diffusion. The difference is that 50% of the text prompts were randomly replaced with empty strings. The authors also report a 'sudden convergence phenomenon': during training the results are poor for a while, and then at a certain step they suddenly become good.
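The prompt-dropping step can be sketched in a few lines. This is only an illustration of replacing each prompt with an empty string at a 50% rate; `drop_prompts` and the synthetic captions are hypothetical and are not taken from the ControlNet training code.

```python
import random


def drop_prompts(prompts, drop_prob=0.5, seed=None):
    """Replace each prompt with an empty string with probability drop_prob."""
    rng = random.Random(seed)
    return ["" if rng.random() < drop_prob else p for p in prompts]


prompts = [f"caption {i}" for i in range(10_000)]
dropped = drop_prompts(prompts, drop_prob=0.5, seed=0)
empty_fraction = sum(p == "" for p in dropped) / len(dropped)
print(f"{empty_fraction:.2%} of prompts were emptied")
```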

At inference time, multiple conditions can also be applied. All it takes is attaching one more ControlNet.

img

IP-Adapter

Several methods already accept image prompts, but they had the following problems.

  • They perform worse than image prompt models obtained by fine-tuning the entire model
  • Other methods do not make appropriate use of the diffusion model's cross-attention module. As a result, the mechanism that merges image features and text features amounts to merely aligning the image features to the text. Image-specific features are lost, and only coarse image properties (e.g., style) remain controllable

IP-Adapter aims to address these problems with a decoupled cross-attention mechanism. The overall architecture is shown below.

img

  1. Extract image features from the image prompt.
  2. Decoupled cross-attention: run cross-attention with the image features and cross-attention with the text features separately, then add the results.
  3. Use the combined features in the UNet.

With the equation below, the text prompt and the image prompt each go through their own cross-attention and are then combined.

$$
\begin{array}{r}
\mathbf{Z}^{\text{new}}=\operatorname{Softmax}\left(\frac{\mathbf{Q} \mathbf{K}^{\top}}{\sqrt{d}}\right) \mathbf{V}+\operatorname{Softmax}\left(\frac{\mathbf{Q}\left(\mathbf{K}^{\prime}\right)^{\top}}{\sqrt{d}}\right) \mathbf{V}^{\prime} \\
\text{where } \mathbf{Q}=\mathbf{Z} \mathbf{W}_q,\ \mathbf{K}=\boldsymbol{c}_t \mathbf{W}_k,\ \mathbf{V}=\boldsymbol{c}_t \mathbf{W}_v,\ \mathbf{K}^{\prime}=\boldsymbol{c}_i \mathbf{W}_k^{\prime},\ \mathbf{V}^{\prime}=\boldsymbol{c}_i \mathbf{W}_v^{\prime}
\end{array}
$$

At the inference stage, the strength of the image prompt can be adjusted through the weight λ.

$$\mathbf{Z}^{\text{new}}=\operatorname{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V})+\lambda \cdot \operatorname{Attention}\left(\mathbf{Q}, \mathbf{K}^{\prime}, \mathbf{V}^{\prime}\right)$$
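The two equations above can be sketched as a single-head NumPy function. The token counts, feature sizes, and random weights are made up for illustration, and the names `W_k_img` and `W_v_img` stand in for the primed matrices of the image branch. This is a simplified sketch, not the released implementation.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(Q, K, V):
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V


def decoupled_cross_attention(Z, c_t, c_i, W_q, W_k, W_v, W_k_img, W_v_img, lam=1.0):
    """Z_new = Attention(Q, K, V) + lam * Attention(Q, K', V')."""
    Q = Z @ W_q  # queries are shared by both branches
    text_out = attention(Q, c_t @ W_k, c_t @ W_v)
    image_out = attention(Q, c_i @ W_k_img, c_i @ W_v_img)
    return text_out + lam * image_out


rng = np.random.default_rng(0)
d_model, d_cond, d = 16, 12, 8
Z = rng.normal(size=(5, d_model))   # 5 query (latent) tokens
c_t = rng.normal(size=(7, d_cond))  # 7 text tokens
c_i = rng.normal(size=(4, d_cond))  # 4 image tokens
W_q = rng.normal(size=(d_model, d))
W_k, W_v, W_k_img, W_v_img = (rng.normal(size=(d_cond, d)) for _ in range(4))

Z_new = decoupled_cross_attention(Z, c_t, c_i, W_q, W_k, W_v, W_k_img, W_v_img, lam=0.5)
text_only = decoupled_cross_attention(Z, c_t, c_i, W_q, W_k, W_v, W_k_img, W_v_img, lam=0.0)
print(Z_new.shape)
```

Setting `lam=0.0` removes the image branch entirely, which recovers ordinary text-only cross-attention.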

IP-Adapter's training logic differs slightly from its inference logic. At inference it works by feeding in a reference image, but training does not create separate reference images. Instead, it uses existing (text, image) pairs as they are: the image is given as the input to IP-Adapter, and the objective is to reconstruct that same image well.

IP-Adapter is useful for capturing the overall color composition and style while leaving the pose free, whereas ControlNet is suited to enforcing a specific pose. Using IP-Adapter and ControlNet together therefore makes it possible, as in the example below, to control structural features while preserving fine image details and style (e.g., set the composition with ControlNet, then fine-tune the style with IP-Adapter).

img

There is a good hands-on example for IP-Adapter, shared below.

If you want to generate an image in a specific style that (1) reflects the features of a specific reference image and (2) is constrained to a specific structure or shape, you can use IP-Adapter to reflect the reference image's features, ControlNet to control the structure, and a LoRA trained on the desired style.

However, plain IP-Adapter often fails to carry the fine style or details of the reference image into the result. In such cases, IP-Adapter Plus is worth trying.

IP-Adapter Plus

IP-Adapter Plus is an extended version of IP-Adapter that reflects the fine-grained style and layout of the reference image more precisely. It does not have a separate paper; the same authors added it when they updated the paper.

Because it can reflect fine details and not just the overall image concept, if you are considering IP-Adapter, testing IP-Adapter Plus alongside it is recommended.

IP-Adapter Plus reflects details better than IP-Adapter for the following reason.

  1. Both the global embedding (CLS token) and the local patch embeddings are extracted. (The original IP-Adapter uses only the global embedding.)
  2. These are processed by IP-Adapter's MLP, and the processed vectors are passed to cross-attention.

In short, IP-Adapter uses only a single condensed vector representing the whole image, whereas IP-Adapter Plus also uses the local vector of every patch, so it can reflect more detailed, fine-grained image information.
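The difference in how much image information reaches cross-attention can be illustrated with a shape-level sketch. It mirrors only the description above (embeddings passed through an MLP, then handed to cross-attention). All sizes, weights, and the `mlp` helper are made up, and the released checkpoints may use different projection modules.

```python
import numpy as np


def mlp(x, w1, w2):
    """Two-layer MLP with ReLU, applied to each row (token) independently."""
    return np.maximum(x @ w1, 0.0) @ w2


rng = np.random.default_rng(0)
embed_dim, hidden_dim, cross_attn_dim = 32, 64, 48  # made-up sizes
num_patches = 16

# Stand-ins for image-encoder outputs: one global vector, one vector per patch.
global_embedding = rng.normal(size=(1, embed_dim))
patch_embeddings = rng.normal(size=(num_patches, embed_dim))

w1 = rng.normal(size=(embed_dim, hidden_dim))
w2 = rng.normal(size=(hidden_dim, cross_attn_dim))

# Global embedding only: a single image token reaches cross-attention.
basic_tokens = mlp(global_embedding, w1, w2)

# Global + local patch embeddings: one token per patch plus the global one.
plus_tokens = mlp(np.concatenate([global_embedding, patch_embeddings]), w1, w2)

print(basic_tokens.shape, plus_tokens.shape)
```

With more image tokens available as keys and values, each query position in the UNet can attend to a specific region of the reference image instead of a single summary vector.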

InstantID

An IP-Adapter structure captures the details of the face region, and a ControlNet structure captures where the face's structural region lies in the image. Using these two pieces of information, the method generates a new image while preserving the ID.

img

The comparison with other methods is shown below.

img

Compared with InsightFace, the most widely used tool for face swapping, both produce natural, high-quality results.

img

There is a good hands-on example, shared below.

As for ID-preserving methods, the recently published PuLID also shows strong performance, but it will be covered in a later write-up.

Reference