This post walks through the architecture of Black Forest Labs' FLUX.2 [dev] 32B model step by step, comparing its text encoder, Transformer structure, VAE, and sampling process with those of FLUX.1. The post is based on an analysis of the FLUX.2 [dev] structure that Claude Code was asked to perform, and the text itself was also drafted and organized with Claude Code.

Introduction

FLUX.1 is a flow-matching-based text-to-image model released in 2024 by Black Forest Labs, a company that spun out of Stability AI.[^1] The general background of FLUX.1 and of flow matching is covered in a previous post, which is worth reading first.

With 12B parameters, FLUX.1 delivered excellent image quality for its time, but it had several limitations.[^1][^2]

  • Text encoder: two separate models, CLIP-L and T5-XXL, had to be loaded at the same time, which made the memory footprint large.[^3]
  • Image editing: there was no native editing capability, so a separate model such as InstructPix2Pix was needed.[^3]
  • Input modality: only text could be given as input, so multimodal generation that references images was not possible.[^17][^18]
  • Model variety: only a single model size was offered, which made it hard to cover different compute environments.[^1]

FLUX.2 [dev] appears to have been designed to address these limitations.[^22] It consolidates the text encoders into a single multimodal LLM[^4] and grows the model to 32B parameters while adopting a more efficient block layout.[^6] It also supports native image editing and multimodal input[^4], and offers lighter options through the Klein variants (9B and 4B).[^21] The rest of this post goes through the FLUX.2 architecture piece by piece, comparing it with FLUX.1.

Overall Pipeline

The overall FLUX.2 inference process can be divided into five broad stages.[^4]

flowchart LR
    A["Text Encoding<br>(Mistral-Small-3.2)"] --> C["Denoising Loop<br>(Flux2 Transformer)"]
    B["VAE Encoding<br>(Image to Latent)"] --> C
    C --> D["VAE Decoding<br>(Latent to Image)"]
    D --> E["Post-processing<br>(C2PA, Watermark)"]

The text prompt is converted into embeddings by the Mistral-based text encoder[^4], and in image-editing mode the input image is encoded into latent space by the VAE.[^4] These two pieces of information enter the denoising loop of the Flux2 Transformer, which progressively generates an image from noise, and finally the VAE decoder turns the latent back into an image.[^4]

The pipeline differences from FLUX.1 are summarized below.[^2][^3][^4][^6]

| Item | FLUX.1 [dev] 12B | FLUX.2 [dev] 32B |
| --- | --- | --- |
| Text Encoder | CLIP-L + T5-XXL (2 models) | Mistral-Small-3.2-24B (1 model) |
| Input modality | Text only | Text + image (multimodal) |
| Image editing | Not supported | Natively supported (single/multi-ref) |
| VAE Scale Factor | 8 | 16 |
| Transformer | 12B (Double 19 + Single 38) | 32B (Double 8 + Single 48) |

FLUX.2 also supports an image-editing mode.[^4] The image to be edited is passed through the VAE encoder to obtain a latent, which is fed to the Transformer together with the text prompt; this makes it possible to generate an image that is modified according to the prompt while preserving the structure of the original.[^4] Several reference images can also be supplied so that generation follows their style or composition.[^4] In that case the reference images are center-cropped to at most 768×768 and then processed by the text encoder (Mistral).[^4]

Text Encoder

One of the most visible changes from FLUX.1 to FLUX.2 is the wholesale replacement of the text encoder, from CLIP + T5 to Mistral.[^3][^4]

FLUX.1's Approach: A Combination of Two Models

FLUX.1 used two text encoders, CLIP-L (768 dimensions) and T5-XXL (4,096 dimensions).[^2] CLIP-L provides a compact embedding specialized for image-text alignment, while T5-XXL provides rich semantic embeddings for long text.[^17][^18] The CLIP-L output served mainly as a global conditioning vector, and the T5-XXL output was used as the sequence-form text embedding that the image tokens attend to.[^3]

This approach had its limits, though. Both models had to be loaded at the same time, so the memory footprint was large[^3], and because their output dimensions differ, each needed its own projection.[^2] In addition, the CLIP text encoder and T5 can only process text, so an image could not be used as a reference condition.[^17][^18]

FLUX.2's Approach: Unifying into a Single Multimodal LLM

FLUX.2 handles text encoding with a single model, Mistral-Small-3.2-24B.[^4] Rather than using only the output of the last layer, it extracts the hidden states at Layers 10, 20, and 30.[^4] Mistral's hidden size is 5,120, so concatenating the outputs of the three layers gives a text embedding of $5{,}120 \times 3 = 15{,}360$ dimensions.[^6][^20]
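The concatenation described above can be sketched in a few lines. This is only an illustration under stated assumptions: the hidden states are random stand-ins rather than real Mistral activations, and `concat_layers` is a hypothetical helper name, not a function from the FLUX.2 code.

```python
import random

HIDDEN_SIZE = 5120      # Mistral hidden size quoted in the text
LAYERS = (10, 20, 30)   # layers whose hidden states are extracted


def concat_layers(hidden_states, layers):
    """Join the per-token features of the selected layers end to end.

    hidden_states maps a layer index to a list of token vectors.
    """
    num_tokens = len(hidden_states[layers[0]])
    return [
        [value for layer in layers for value in hidden_states[layer][tok]]
        for tok in range(num_tokens)
    ]


# Toy stand-in for encoder outputs: 4 tokens of random numbers,
# not real Mistral activations.
rng = random.Random(0)
NUM_TOKENS = 4
hidden_states = {
    layer: [[rng.random() for _ in range(HIDDEN_SIZE)] for _ in range(NUM_TOKENS)]
    for layer in LAYERS
}

text_emb = concat_layers(hidden_states, LAYERS)
print(len(text_emb), len(text_emb[0]))  # tokens, features per token
```

Each token keeps its position in the sequence; only the feature axis grows, which is why the downstream projection has to accept 15,360 input features.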

Much like photographing a scene from several heights, this can be understood as extracting surface-level syntactic information from the shallow layer (Layer 10), semantic information from the middle layer (Layer 20), and higher-level reasoning information from the deep layer (Layer 30).[^22] The idea resembles the DeepStack mechanism in Qwen3-VL.[^19] DeepStack likewise extracts visual features not only from the last ViT layer but also from intermediate layers and injects them into different layers of the LLM, so that everything from low-level texture to high-level semantics can be used.[^19]

On the input side, the prompt is passed to Mistral with a chat template and a system message applied.[^4] Up to 512 tokens can be processed[^4], and when an image is supplied, resolutions up to 768×768 are supported.[^4] NSFW filtering (threshold 0.85) is also applied to screen out inappropriate content.[^4]

| Item | FLUX.1 | FLUX.2 |
| --- | --- | --- |
| Model | CLIP-L + T5-XXL | Mistral-Small-3.2-24B |
| Number of models | 2 | 1 |
| Output dimension | 768 + 4,096 | 15,360 (5,120 × 3) |
| Extraction | Last layer | Multi-layer (L10, L20, L30) |
| Image input | Not possible | Possible (768×768) |

So why make this change? First, using a multimodal LLM means image inputs can be handled by the same encoder as text.[^20] This is what enables generation conditioned on reference images.[^4] Second, replacing two separate models with one simplifies the pipeline.[^22] Third, the 15,360-dimensional embedding space plausibly allows finer-grained control over the text condition.[^22]

Transformer

Like FLUX.1, the FLUX.2 Transformer is built from a combination of Double-Stream Blocks and Single-Stream Blocks, but their ratio and scale have changed considerably.[^2][^6]

| Item | FLUX.1 [dev] 12B | FLUX.2 [dev] 32B |
| --- | --- | --- |
| Hidden Size | 3,072 (24 heads × 128d) | 6,144 (48 heads × 128d) |
| Double-Stream Blocks | 19 | 8 |
| Single-Stream Blocks | 38 | 48 |
| in_channels | 64 | 128 |
| FFN Activation | GELU | SwiGLU |
| joint_attention_dim | 4,096 | 15,360 |

FLUX.1 had 19 Double Blocks and 38 Single Blocks, whereas FLUX.2 cuts the Double Blocks sharply to 8 and increases the Single Blocks to 48.[^2][^6] So why reduce the Double Blocks and add Single Blocks?

A Double Block processes image and text as separate streams that interact through joint attention.[^5] Its role is to establish the alignment between the two modalities early on.[^22] With the hidden size doubled to 6,144, each block has considerably more capacity, so 8 blocks are presumably enough to achieve that alignment.[^22]

A Double Block keeps separate QKV and FFN parameters for image and text, so its parameter efficiency is low.[^5] A Single Block, by contrast, merges image and text into one sequence and processes it with a single set of parameters, so it gets more processing per parameter.[^5] The strategy appears to be to add more of these blocks in order to raise fine-grained generation quality.[^22]

FLUX.1 also used GELU as the FFN activation, while FLUX.2 switches to SwiGLU.[^5][^6] SwiGLU has the form $\text{SwiGLU}(x) = \text{SiLU}(xW_1) \otimes (xW_2)$: the input is projected into two branches, SiLU (Swish) is applied to one, and the result is multiplied element-wise with the other.[^12] The added gating mechanism gives finer control over information flow, and gated variants of this kind have been reported to outperform GELU-style FFNs and have become a common default in recent LLMs.[^12]
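As a minimal sketch of the SwiGLU formula above, the snippet below implements it in plain Python on a tiny hand-picked example. The helper names (`silu`, `matvec`, `swiglu`) and the weight values are illustrative assumptions; the real FFN uses learned weights and also applies an output projection afterwards.

```python
import math


def silu(x):
    """SiLU / Swish: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(x, w):
    """Row vector x (length n) times matrix w (n rows, m columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def swiglu(x, w1, w2):
    """SwiGLU(x) = SiLU(x W1) * (x W2), multiplied element-wise."""
    gate = [silu(v) for v in matvec(x, w1)]
    value = matvec(x, w2)
    return [g * v for g, v in zip(gate, value)]


# Toy weights chosen by hand for readability (not learned values).
x = [1.0, -2.0]
w1 = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
w2 = [[1.0, 1.0, 1.0], [0.0, 1.0, 0.0]]
out = swiglu(x, w1, w2)
print(out)
```

The gate branch decides, per feature, how much of the value branch passes through, which is the "gating" the paragraph refers to.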

Input Embeddings

Before entering the Transformer, each input passes through its own embedding layer.[^6]

  • x_embedder: projects the image latent (128 dimensions) with Linear(128 → 6144).[^6]
  • context_embedder: projects the text embedding (15,360 dimensions) with Linear(15360 → 6144).[^6]
  • timestep_embedding: converts the current timestep $t$ via a 256-dimensional sinusoidal embedding → MLP → 6,144 dimensions.[^6]
  • guidance_embedding: a separate MLP layer that converts the guidance scale $\gamma$ in the same way as the timestep (256-dimensional sinusoidal embedding → MLP → 6,144 dimensions).[^6] When the user specifies a float such as guidance_scale=3.5 at inference time, this scalar is processed as × 1000 scaling → sinusoidal embedding (256d) → MLPEmbedder(256 → 6144).[^6] The resulting guidance embedding is combined with the timestep embedding by element-wise addition to form a single conditioning vector.[^6]
    • In traditional CFG, the model has to be run twice per step (conditional + unconditional) and the results combined outside the model as $v = v_{\text{uncond}} + \gamma (v_{\text{cond}} - v_{\text{uncond}})$.[^15] The guidance embedding instead injects the value of $\gamma$ itself into the model as conditioning, so the model can produce the guided output directly in one forward pass.[^6] This roughly halves the per-step inference cost relative to two-pass CFG.[^22]
    • The Klein models have no guidance embedding, so they operate without CFG.[^21]

This conditioning vector is passed to each block through AdaLN (Adaptive Layer Normalization) and adjusts the block's behavior to the current timestep and guidance strength.[^6]
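The path from two scalars to one conditioning vector can be sketched as follows. Several things here are assumptions made for illustration: the cosine-then-sine layout and `max_period=10000` of the sinusoidal embedding, the use of a single bias-free linear layer as a stand-in for each MLP, the toy output width of 8 instead of 6,144, and the × 1000 scaling of the timestep (the text states it only for guidance).

```python
import math
import random


def sinusoidal_embedding(value, dim=256, max_period=10000.0):
    """Map a scalar to `dim` features: cosines first, then sines."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.cos(value * f) for f in freqs] + [math.sin(value * f) for f in freqs]


def linear(x, weight):
    """Apply a bias-free linear layer; weight holds one row per output feature."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]


def random_weight(out_dim, in_dim, rng):
    return [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]


rng = random.Random(0)
OUT_DIM = 8  # toy width; the text gives 6144 for the real model
time_mlp = random_weight(OUT_DIM, 256, rng)      # hypothetical stand-in for the timestep MLP
guidance_mlp = random_weight(OUT_DIM, 256, rng)  # hypothetical stand-in for the guidance MLP

t = 0.7
guidance_scale = 3.5

# Both scalars are scaled by 1000 before the sinusoidal embedding
# (stated for guidance in the text, assumed here for the timestep).
t_emb = linear(sinusoidal_embedding(t * 1000.0), time_mlp)
g_emb = linear(sinusoidal_embedding(guidance_scale * 1000.0), guidance_mlp)

# Element-wise sum gives the single conditioning vector passed to AdaLN.
vec = [a + b for a, b in zip(t_emb, g_emb)]
print(len(vec))
```

Because the two embeddings are simply added, every block sees timestep and guidance through the same vector and cannot tell them apart structurally; it is up to the learned MLPs to keep the two signals separable.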

The overall Transformer flow is shown in the diagram below.

flowchart TB
    subgraph Embed["Embedding"]
        TS["Timestep + Guidance<br>→ vec (6144d)"]
        TXT_P["Text (15360d)<br>→ context_embedder<br>→ 6144d"]
        IMG_P["Latent (128d)<br>→ x_embedder<br>→ 6144d"]
        ROPE["RoPE Position<br>(32,32,32,32)"]
    end

    DB["Double-Stream Blocks ×8<br>(img ↔ txt processed separately, Joint Attention)"]
    CONCAT["img + txt concat"]
    SB["Single-Stream Blocks ×48<br>(unified processing, Fused QKV+MLP)"]
    SPLIT["extract img portion"]
    FL["Final Layer<br>(AdaLN → Linear → 128d)"]

    TS --> DB
    TS --> SB
    TS --> FL
    TXT_P --> DB
    IMG_P --> DB
    ROPE --> DB
    ROPE --> SB
    DB --> CONCAT --> SB --> SPLIT --> FL

Double-Stream Block

A Double-Stream Block processes image and text as independent streams and merges them only at the attention step so that they can exchange information.[^5] It is a bit like two people each organizing their own notes while occasionally looking at each other's notes and conferring.[^22]

Each stream (image/text) goes through the following steps.

  1. AdaLN-Zero Modulation: the conditioning vector $\text{vec}$ is passed through Linear(6144 → 6144×6) to produce six values, namely shift, scale, and gate, twice over.[^5][^6] Three of them (shift, scale, gate) are for attention and the other three are for the FFN.[^5] These values modulate the output of Layer Normalization so that the block's behavior adapts to the current timestep and guidance.[^5]
  2. QKV Projection: produces Query, Key, and Value. Linear(6144 → 6144×3) generates Q, K, and V in one shot, which are then split into 48 heads (128 dimensions per head).[^6]
  3. QK-Norm: normalizes Q and K with RMSNorm(128).[^5][^6] This prevents attention logits from growing so large during high-resolution generation that training diverges.[^9] The Stable Diffusion 3 paper also emphasizes the importance of this technique.[^9]
  4. RoPE: applies positional encoding to Q and K.[^5][^6]
  5. Joint Attention: concatenates the image and text QKV, performs a single SDPA (Scaled Dot-Product Attention), and splits the result again.[^5]
  6. FFN (SwiGLU): each stream applies Linear(6144 → 36864) → SwiGLU → Linear(18432 → 6144).[^6] Since mlp_ratio is 3.0, the intermediate dimension is $6144 \times 3 = 18432$; the first projection is twice that wide (36,864) because SwiGLU needs both a gate branch and a value branch.[^6]
  7. Gated Residual: the output, weighted by the gate value, is added to the original input. Separate gates are applied to the attention output and to the FFN output.[^5]

Joint Attention is the key step. The Q, K, V of the image and the Q, K, V of the text are concatenated along the sequence dimension and a single attention is computed.[^5] This lets image tokens attend to text tokens and text tokens attend to image tokens.[^5] After attention, the result is split back into image and text parts and returned to the respective streams.[^5]
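The concat, attend, split pattern of joint attention can be sketched as below. This is a single-head toy in plain Python under stated assumptions: QK-Norm and RoPE are omitted, the Q/K/V values are random, the sizes are tiny, and placing text tokens before image tokens is an arbitrary choice of this sketch rather than something the text specifies.

```python
import math
import random


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(q, k, v):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for q_i in q:
        weights = softmax([scale * sum(a * b for a, b in zip(q_i, k_j)) for k_j in k])
        out.append([sum(w * v_j[d] for w, v_j in zip(weights, v)) for d in range(len(v[0]))])
    return out


def joint_attention(img_qkv, txt_qkv):
    """Concatenate both streams along the sequence axis, attend once, split again."""
    (img_q, img_k, img_v), (txt_q, txt_k, txt_v) = img_qkv, txt_qkv
    # Text-first ordering is an assumption of this sketch.
    out = attention(txt_q + img_q, txt_k + img_k, txt_v + img_v)
    n_txt = len(txt_q)
    return out[n_txt:], out[:n_txt]  # (image tokens, text tokens)


def random_tokens(n, dim, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
HEAD_DIM, N_IMG, N_TXT = 8, 6, 3  # toy sizes; the real model uses 128-dim heads
img_qkv = tuple(random_tokens(N_IMG, HEAD_DIM, rng) for _ in range(3))
txt_qkv = tuple(random_tokens(N_TXT, HEAD_DIM, rng) for _ in range(3))

img_out, txt_out = joint_attention(img_qkv, txt_qkv)
print(len(img_out), len(txt_out))
```

Since every query sees every key, changing only the text values changes the image outputs as well, which is the bidirectional flow that distinguishes this from one-way cross-attention.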

flowchart TB
    subgraph DoubleBlock["Double-Stream Block"]
        IMG["Image Stream x"] --> IMG_ADALN["AdaLN → QKV → QK-Norm"]
        TXT["Text Stream y"] --> TXT_ADALN["AdaLN → QKV → QK-Norm"]
        IMG_ADALN --> ROPE_A["Apply RoPE"]
        TXT_ADALN --> ROPE_A
        ROPE_A --> JA["Joint Attention<br>(concat → SDPA → split)"]
        JA --> IMG_FFN["Image FFN (SwiGLU)<br>+ gated residual"]
        JA --> TXT_FFN["Text FFN (SwiGLU)<br>+ gated residual"]
        IMG_FFN --> IMG_OUT["x' (updated image)"]
        TXT_FFN --> TXT_OUT["y' (updated text)"]
    end

The advantage of this structure is that the two modalities can consult each other's information while each keeps its own representation space.[^22] In ordinary cross-attention only one side (usually the image) attends to the other (the text), whereas in joint attention information flows in both directions.[^9] The design follows the MM-DiT structure from Stable Diffusion 3 that FLUX.1 built on, and FLUX.2 keeps the same principle.[^9][^5][^6]

Single-Stream Block

After the Double-Stream Blocks, image and text are concatenated along the sequence dimension into one unified sequence.[^6] For example, when generating a 1024×1024 image, 4,096 image tokens and 512 text tokens are combined into a sequence of 4,608 tokens.[^22] The 48 Single-Stream Blocks then process this unified sequence.[^6]

Fused QKV + MLP

The defining feature of the Single-Stream Block is that the QKV projection and the MLP input projection are fused into a single linear layer.[^5]

$$\text{linear1}: \text{Linear}(6144 \rightarrow 3 \times 6144 + 2 \times 18432 = 55296)$$

One projection produces Q, K, V (6,144 dimensions each) and the MLP input (two branches for SwiGLU, 18,432 dimensions each) at the same time.[^5] This replaces two separate projections with one large matrix multiplication, which improves GPU utilization.[^23] GPUs are generally more efficient at one large matrix multiplication than at several small ones.[^23]

QK-Norm (RMSNorm) and RoPE are applied to the resulting Q and K, which are split into 48 heads for SDPA.[^5][^6] The MLP input goes through the SwiGLU activation.[^6] The attention output (6,144 dimensions) and the MLP output (18,432 dimensions) are then concatenated again and passed through a single output projection.[^5]

$$\text{linear2}: \text{Linear}(6144 + 18432 = 24576 \rightarrow 6144)$$
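The width arithmetic of the two fused layers, and how a fused output row could be sliced back into its parts, can be checked with a short sketch. The function names and the chunk order (Q, K, V, then the two SwiGLU branches) are assumptions for illustration; the actual layout inside the weight matrix is not stated in the text.

```python
def fused_dims(hidden, mlp_ratio):
    """Return (mlp_hidden, linear1_out, linear2_in) for one Single-Stream Block."""
    mlp_hidden = int(hidden * mlp_ratio)
    linear1_out = 3 * hidden + 2 * mlp_hidden   # Q, K, V + two SwiGLU branches
    linear2_in = hidden + mlp_hidden            # attention output + MLP output
    return mlp_hidden, linear1_out, linear2_in


def split_fused(row, hidden, mlp_hidden):
    """Slice one fused linear1 output row into its five parts."""
    q = row[:hidden]
    k = row[hidden:2 * hidden]
    v = row[2 * hidden:3 * hidden]
    gate_branch = row[3 * hidden:3 * hidden + mlp_hidden]
    value_branch = row[3 * hidden + mlp_hidden:]
    return q, k, v, gate_branch, value_branch


mlp_hidden, linear1_out, linear2_in = fused_dims(6144, 3.0)
parts = split_fused(list(range(linear1_out)), 6144, mlp_hidden)
print(linear1_out, linear2_in, [len(p) for p in parts])
```

With hidden size 6,144 and mlp_ratio 3.0 this reproduces the 55,296 and 24,576 figures in the two formulas above.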

Gated Residual

The final output is multiplied by the gate produced by the AdaLN modulation and then added to the input through the residual connection.[^5] Whereas the Double Block has separate gates for attention and FFN, the Single Block uses a single gate to control the whole fused output.[^5] Through training, the gate determines how strongly each block transforms its input.[^10] Under AdaLN-Zero-style initialization the gate starts near 0, so the residual path dominates early in training and the block's contribution grows gradually as training progresses; whether FLUX.2 actually uses zero initialization was not verified in the code.[^10][^22]
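A small sketch makes the role of shift, scale, and gate concrete. It assumes the common DiT-style convention `(1 + scale) * norm(x) + shift` for modulation, which the text does not spell out, and it uses a hypothetical `toy_block` in place of the real attention/MLP computation.

```python
import math


def layer_norm(x, eps=1e-6):
    """Normalize a vector to zero mean and unit variance (no learned affine)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def modulate(x, shift, scale):
    """AdaLN-style modulation; the (1 + scale) form is an assumed convention."""
    return [(1.0 + s) * v + b for v, s, b in zip(layer_norm(x), scale, shift)]


def gated_residual(x, block_out, gate):
    """Add the gate-weighted block output back onto the input."""
    return [v + g * o for v, o, g in zip(x, block_out, gate)]


def toy_block(x):
    """Hypothetical stand-in for the attention/MLP computation."""
    return [2.0 * v + 1.0 for v in x]


x = [0.5, -1.0, 2.0, 0.0]
shift = [0.1] * 4
scale = [0.2] * 4

h = toy_block(modulate(x, shift, scale))
closed = gated_residual(x, h, gate=[0.0] * 4)  # gate at zero: block is an identity
opened = gated_residual(x, h, gate=[1.0] * 4)  # gate at one: full contribution
print(closed, opened)
```

With the gate at zero the block returns its input unchanged, which is the identity-at-initialization behavior that AdaLN-Zero is designed to provide.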

flowchart TB
    subgraph SingleBlock["Single-Stream Block"]
        X["Combined x (img+txt)"]
        X --> ADALN_S["AdaLN → LayerNorm"]
        ADALN_S --> FUSED["Fused Projection<br>(→ QKV + MLP input)"]
        FUSED --> |"Q,K,V"| ATTN["RoPE → SDPA (48 heads)"]
        FUSED --> |"MLP input"| MLP_S["SwiGLU Activation"]
        ATTN --> CAT["Concat [attn, mlp]"]
        MLP_S --> CAT
        CAT --> PROJ["Output Projection"]
        PROJ --> GATE_S["Gate × output + residual"]
    end

After passing through all 48 Single-Stream Blocks, **only the image portion is kept (the text portion is dropped)**, and the Final Layer (AdaLN modulation → Linear(6144 → 128)) produces the 128-dimensional per-token output, i.e. the velocity prediction consumed by the sampler.[^6]

Positional Encoding

Both FLUX.1 and FLUX.2 use Rotary Position Embedding (RoPE), but the configuration has changed considerably.[^5][^6]

| Item | FLUX.1 | FLUX.2 |
| --- | --- | --- |
| axes_dims | (16, 56, 56) | (32, 32, 32, 32) |
| theta | 10,000 | 2,000 |
| Total dimensions | 128 (16+56+56) | 128 (32+32+32+32) |

FLUX.1 encoded position along 3 axes (time 16 + height 56 + width 56).[^5] Most of the dimensions go to height and width, with only a few assigned to the time axis.[^5] In this configuration the height and width information is rich, but the time axis has comparatively little capacity.[^22]

FLUX.2 changes this to 4 axes (32 + 32 + 32 + 32).[^6] Every axis receives the same number of dimensions.[^6] As was also observed with Interleaved MRoPE in Qwen3-VL, this change seems intended to fix the imbalance in which frequencies are concentrated on particular axes.[^22] Qwen3-VL interleaves the time (t), height (h), and width (w) components across the embedding dimensions so that each axis is spread evenly over the low- and high-frequency bands.[^19]

The fourth axis was presumably added because image-editing mode needs extra positional information to distinguish reference images from the image being generated.[^22] Even in generation-only mode, this fourth axis could be used to encode additional structural information within the sequence.[^22]

The theta value has also been lowered from 10,000 to 2,000.[^5][^6] In RoPE, theta determines the base of the rotation frequencies.[^11] The lower theta is, the shorter the rotation periods become, so differences between nearby positions are captured more sensitively.[^11] A higher theta can encode relationships between distant positions, but discriminates less well between positions that are close together.[^11] Lowering theta in FLUX.2 can be read as a choice aimed at capturing fine spatial relationships in the more compressed latent space that comes with scale factor 16.[^22]
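The effect of the base can be made concrete with the RoPE frequency formula $\theta_i = \text{base}^{-2i/d}$. The sketch below compares the longest wavelength of a 32-dimensional axis under both bases and checks the relative-position property. Rotating consecutive pairs `(x[2i], x[2i+1])` is an assumed pairing convention; implementations differ on this detail.

```python
import math


def rope_freqs(dim, theta):
    """theta_i = theta ** (-2i / dim) for i = 0 .. dim/2 - 1."""
    return [theta ** (-2.0 * i / dim) for i in range(dim // 2)]


def apply_rope(x, pos, theta):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by the angle pos * theta_i."""
    out = []
    for i, freq in enumerate(rope_freqs(len(x), theta)):
        a, b = x[2 * i], x[2 * i + 1]
        c, s = math.cos(pos * freq), math.sin(pos * freq)
        out += [a * c - b * s, a * s + b * c]
    return out


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


AXIS_DIM = 32  # one FLUX.2 RoPE axis
for theta in (10000.0, 2000.0):
    slowest = rope_freqs(AXIS_DIM, theta)[-1]
    print(theta, "longest wavelength (positions):", 2 * math.pi / slowest)

# Attention scores depend only on the relative offset between positions.
q = [0.3 * (i + 1) for i in range(AXIS_DIM)]
k = [0.1 * (AXIS_DIM - i) for i in range(AXIS_DIM)]
score_a = dot(apply_rope(q, 5, 2000.0), apply_rope(k, 3, 2000.0))
score_b = dot(apply_rope(q, 12, 2000.0), apply_rope(k, 10, 2000.0))
print(score_a, score_b)
```

The lower base shortens the slowest wavelength by several times, so more of the rotation budget is spent on distinguishing nearby positions, which matches the interpretation given above.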

VAE

The VAE (Variational Autoencoder) compresses an image into latent space and reconstructs it again.[^23] The biggest change in the FLUX.2 VAE is that the scale factor grows from 8 to 16.[^8][^7]

Scale Factor 8 vs 16

The scale factor is the ratio between the spatial resolution of the input image and that of the latent fed to the Transformer.[^23]

| Item | FLUX.1 | FLUX.2 |
| --- | --- | --- |
| VAE downsample | 3 times (8×) | 3 times (8×) |
| Patchification | Done in the pipeline | Integrated into the VAE |
| Effective Scale Factor | 8 (VAE) × 2 (patch) = 16 | 16 (built into the VAE) |
| latent_channels | 16 | 32 |
| Transformer in_channels | 64 (16×4) | 128 (32×4) |

In fact, both models use the same block_out_channels=[128, 256, 512, 512] and 3 downsampling stages.[^7][^8] The key differences are that FLUX.2 integrates the 2×2 patchification into the VAE (patch_size=(2,2)) and that latent_channels doubles from 16 to 32.[^7] Increasing the channel count preserves more information at the same spatial compression.[^22] It is like using a chest of drawers instead of a wide desk: the footprint stays small and the information is stored in depth (channels).[^22]

As a result, the sequence length the Transformer has to process stays the same (4,096 tokens for 1024×1024), while the width of each token doubles from 64 to 128 dimensions.[^22]
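The token count and per-token width follow directly from the numbers in the table. The helper below, `latent_geometry`, is a hypothetical name used only for this arithmetic check.

```python
def latent_geometry(height, width, vae_downsample, patch, latent_channels):
    """Return (number of image tokens, features per token) for one image."""
    grid_h = height // (vae_downsample * patch)
    grid_w = width // (vae_downsample * patch)
    tokens = grid_h * grid_w
    token_dim = latent_channels * patch * patch
    return tokens, token_dim


# 8x VAE downsampling and 2x2 patches in both cases; only the channel count differs.
flux1 = latent_geometry(1024, 1024, vae_downsample=8, patch=2, latent_channels=16)
flux2 = latent_geometry(1024, 1024, vae_downsample=8, patch=2, latent_channels=32)
print("FLUX.1:", flux1)
print("FLUX.2:", flux2)
```

Both models end up with a 64×64 grid of tokens for a 1024×1024 image; the difference is entirely in how many features each token carries.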

Encoder Structure

The Encoder reduces the spatial resolution to $1/8$ through 3 downsampling stages, and an additional patch rearrangement (folding each 2×2 patch into channels) brings the final scale factor to $1/16$.[^7] The output has 64 channels (mean + logvar), of which only the mean is taken to give a 32-channel latent.[^7]

flowchart TB
    subgraph Encoder["VAE Encoder"]
        E_IN["Input Image<br>(3ch, 1024×1024)"]
        E_CONV["Conv2d(3→128)"]
        E_D1["ResBlock×2 → Downsample<br>(128ch, 512×512)"]
        E_D2["ResBlock×2 → Downsample<br>(256ch, 256×256)"]
        E_D3["ResBlock×2 → Downsample<br>(512ch, 128×128)"]
        E_D4["ResBlock×2<br>(512ch, 128×128, no downsample)"]
        E_MID["Mid Block<br>(ResBlock → AttnBlock → ResBlock)"]
        E_OUT["GroupNorm → Swish<br>→ Conv2d(512→64ch)"]
        E_MEAN["take mean → 32ch"]
        E_PATCH["Patch Rearrange (2×2)<br>(32ch, 128×128) → (128ch, 64×64)"]
        E_NORM["BatchNorm Normalize"]

        E_IN --> E_CONV --> E_D1 --> E_D2 --> E_D3 --> E_D4 --> E_MID --> E_OUT --> E_MEAN --> E_PATCH --> E_NORM
    end

To recap, the FLUX.1 and FLUX.2 VAEs share the same block_out_channels=[128, 256, 512, 512] and 3-stage downsampling structure.[^7][^8] The key difference is that latent_channels doubles from 16 to 32.[^7][^8] In FLUX.1 patchification was performed at the pipeline level, whereas in FLUX.2 it is integrated into the VAE as patch_size=(2,2), which makes the effective scale factor of the VAE itself 16.[^3][^7]
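The 2×2 patch rearrangement is a space-to-depth operation, and a toy version shows how (32ch, H, W) becomes (128ch, H/2, W/2) without losing any values. The channel ordering used here (channel, then row offset, then column offset) is an assumption of this sketch; the real implementation may order the folded channels differently. An 8×8 grid stands in for the 128×128 latent to keep the example fast.

```python
def patchify(latent, patch=2):
    """(C, H, W) nested lists -> (C * patch * patch, H / patch, W / patch)."""
    channels, height, width = len(latent), len(latent[0]), len(latent[0][0])
    packed = []
    for c in range(channels):
        for dy in range(patch):
            for dx in range(patch):
                packed.append([
                    [latent[c][y * patch + dy][x * patch + dx] for x in range(width // patch)]
                    for y in range(height // patch)
                ])
    return packed


def unpatchify(packed, patch=2):
    """Inverse of patchify: unfold channels back into 2x2 spatial patches."""
    channels = len(packed) // (patch * patch)
    h, w = len(packed[0]), len(packed[0][0])
    latent = [[[0] * (w * patch) for _ in range(h * patch)] for _ in range(channels)]
    index = 0
    for c in range(channels):
        for dy in range(patch):
            for dx in range(patch):
                for y in range(h):
                    for x in range(w):
                        latent[c][y * patch + dy][x * patch + dx] = packed[index][y][x]
                index += 1
    return latent


# Toy latent: 32 channels on an 8x8 grid, each entry a unique integer.
C, H, W = 32, 8, 8
latent = [[[c * H * W + y * W + x for x in range(W)] for y in range(H)] for c in range(C)]

packed = patchify(latent)
# Flatten the packed grid into a token sequence for the Transformer.
tokens = [
    [packed[ch][y][x] for ch in range(len(packed))]
    for y in range(len(packed[0]))
    for x in range(len(packed[0][0]))
]
print(len(packed), len(packed[0]), len(packed[0][0]), len(tokens), len(tokens[0]))
```

The round trip through `unpatchify` restores the original tensor exactly, which is why the decoder can start with the inverse rearrangement before any convolution runs.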

Decoder Structure

The Decoder is the inverse of the Encoder.[^23] It takes the 32-channel latent and reconstructs an image at the original resolution.[^7] Notably, each level of the Decoder uses 3 ResBlocks (the Encoder uses 2).[^7] This is presumably because reconstruction is harder than compression, so more parameters are allocated to improve the quality of recovered detail.[^22]

flowchart TB
    subgraph Decoder["VAE Decoder"]
        D_DENORM["Denormalize (inverse of BatchNorm)"]
        D_UNPATCH["Unpatch (2×2)<br>(128ch, 64×64) → (32ch, 128×128)"]
        D_CONV["Conv2d(32→512)"]
        D_MID["Mid Block<br>(ResBlock → AttnBlock → ResBlock)"]
        D_U4["ResBlock×3<br>(512ch, 128×128)"]
        D_U3["ResBlock×3 → Upsample<br>(512ch, 256×256)"]
        D_U2["ResBlock×3 → Upsample<br>(256ch, 512×512)"]
        D_U1["ResBlock×3 → Upsample<br>(128ch, 1024×1024)"]
        D_OUT["GroupNorm → Swish<br>→ Conv2d(128→3ch RGB)"]
        D_IMG["Output Image<br>(3ch, 1024×1024)"]

        D_DENORM --> D_UNPATCH --> D_CONV --> D_MID --> D_U4 --> D_U3 --> D_U2 --> D_U1 --> D_OUT --> D_IMG
    end

Tracing Tensor Shapes (1024×1024 Generation Example)

Tracing how tensor shapes change over the whole inference process gives the following picture.[^22]

flowchart TB
    subgraph TextPath["Text Encoding Path"]
        T1["Text Prompt<br>'a cat on the moon'"]
        T2["Extract Mistral Layers 10, 20, 30<br>(B, 512, 5120) × 3"]
        T3["concat → (B, 512, 15360)"]
        T4["context_embedder<br>Linear(15360→6144)<br>→ (B, 512, 6144)"]
        T1 --> T2 --> T3 --> T4
    end

    subgraph NoisePath["Noise Initialization"]
        N1["x_T ~ N(0, I)<br>(B, 4096, 128)<br>← 64×64 latent patches"]
        N2["x_embedder<br>Linear(128→6144)<br>→ (B, 4096, 6144)"]
        N1 --> N2
    end

    subgraph TransformerPath["Transformer Processing"]
        DB["Double Blocks ×8<br>img: (B, 4096, 6144)<br>txt: (B, 512, 6144)"]
        CAT["concat → (B, 4608, 6144)"]
        SB["Single Blocks ×48<br>(B, 4608, 6144)"]
        SP["split (img only)<br>(B, 4096, 6144)"]
        FL["Final Layer<br>→ (B, 4096, 128)"]
        DB --> CAT --> SB --> SP --> FL
    end

    subgraph DecodePath["VAE Decoding"]
        RS["reshape + unpatch<br>→ (B, 32, 128, 128)"]
        UP["upsample ×3<br>→ (B, 3, 1024, 1024)"]
        OUT["Output Image<br>1024ร—1024 RGB"]
        RS --> UP --> OUT
    end

    T4 --> DB
    N2 --> DB
    FL --> RS

Sampling

Like FLUX.1, FLUX.2 samples with velocity prediction based on Rectified Flow.[^4][^13]

Rectified Flow Velocity Prediction

Earlier diffusion models such as DDPM predict the noise $\epsilon$ that was added at each timestep.[^16] Rectified flow instead connects the data distribution and the noise distribution with a straight line and predicts the velocity along that line.[^13][^14] The data at time $t$ is defined as follows.

$$z_t = (1-t)x_0 + t\epsilon$$

At $t=0$ this is the original data $x_0$, and at $t=1$ it is pure noise $\epsilon$.[^13] The model predicts the velocity $v$ at this $z_t$, and one denoising step is carried out as follows.

$$x_{t-1} = x_t + (t_{\text{prev}} - t_{\text{curr}}) \times v$$

Here $x_t$ is the current latent (the $z_t$ above at time $t_{\text{curr}}$) and $x_{t-1}$ is the latent at the next timestep in the schedule, $t_{\text{prev}}$. Because $t$ decreases from 1 (pure noise) to 0 (clean image), $t_{\text{prev}} - t_{\text{curr}}$ is negative, so the update moves against the velocity direction, that is, toward the clean image.[^22] Since the path is a straight line, high-quality images can be generated in fewer steps than with diffusion models that follow curved paths.[^14]
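The update rule can be exercised with a toy Euler loop. To keep the example checkable, the trained Transformer is replaced by a hypothetical `oracle_velocity` that returns the exact straight-line velocity $\epsilon - x_0$ for a known target; a real model only approximates this, so its samples would not land on a fixed $x_0$.

```python
import random


def euler_sample(noise, velocity_fn, timesteps):
    """Integrate from t=1 to t=0 with x <- x + (t_prev - t_curr) * v."""
    x = list(noise)
    for t_curr, t_prev in zip(timesteps[:-1], timesteps[1:]):
        v = velocity_fn(x, t_curr)
        x = [xi + (t_prev - t_curr) * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x0 = [rng.gauss(0.0, 1.0) for _ in range(4)]   # toy "clean" data
eps = [rng.gauss(0.0, 1.0) for _ in range(4)]  # starting noise


def oracle_velocity(x, t):
    """Stand-in for the model: derivative of z_t = (1 - t) * x0 + t * eps w.r.t. t."""
    return [e - d for e, d in zip(eps, x0)]


num_steps = 4
timesteps = [1.0 - i / num_steps for i in range(num_steps + 1)]  # 1.0 ... 0.0
sample = euler_sample(eps, oracle_velocity, timesteps)
print(max(abs(a - b) for a, b in zip(sample, x0)))
```

With a perfectly straight path the Euler integrator is exact regardless of the step count, which is the intuition behind the claim that straighter paths need fewer steps.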

Schedule Generation

The timestep schedule divides $[1.0, \ldots, 0.0]$ uniformly into N steps and then applies an SNR (signal-to-noise ratio) shift.[^4] The amount of shift is determined from the image sequence length as $\mu = a \times \text{seq\_len} + b$.[^4] As a result, the schedule is adjusted for higher-resolution images, i.e. longer sequences.[^22]
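A schedule of this shape can be sketched as below. The text gives only the linear form of $\mu$, so everything else is an assumption: the shift function `exp(mu) / (exp(mu) + (1/t - 1))` and the anchor points (0.5 at 256 tokens, 1.15 at 4,096 tokens) are borrowed from the FLUX.1 reference sampler purely for illustration, and FLUX.2 may use different coefficients or a different shift function.

```python
import math


def linear_mu(seq_len, x1=256, y1=0.5, x2=4096, y2=1.15):
    """mu = a * seq_len + b, fitted through two illustrative anchor points."""
    a = (y2 - y1) / (x2 - x1)
    b = y1 - a * x1
    return a * seq_len + b


def time_shift(mu, t, sigma=1.0):
    """Shift a timestep in [0, 1]; larger mu pushes values toward 1 (more noise)."""
    if t <= 0.0:
        return 0.0
    return math.exp(mu) / (math.exp(mu) + (1.0 / t - 1.0) ** sigma)


def get_schedule(num_steps, seq_len):
    """Uniform grid from 1.0 to 0.0, then shifted according to sequence length."""
    mu = linear_mu(seq_len)
    uniform = [1.0 - i / num_steps for i in range(num_steps + 1)]
    return [time_shift(mu, t) for t in uniform]


short = get_schedule(4, seq_len=256)
long = get_schedule(4, seq_len=4096)
print([round(t, 3) for t in short])
print([round(t, 3) for t in long])
```

Under these assumed coefficients, longer sequences keep the intermediate timesteps closer to 1, so more of the step budget is spent at high noise levels for larger images.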

Guidance

In traditional CFG, the model is run twice per denoising step (conditional + unconditional) and the two outputs are combined outside the model.[^15]

$$v = v_{\text{uncond}} + \gamma \times (v_{\text{cond}} - v_{\text{uncond}})$$

FLUX.2 [dev] instead injects the guidance scale $\gamma$ into the model as an embedding.[^6] Because the model is aware of $\gamma$ and produces the guided output directly in a single forward pass, the inference cost is half that of traditional two-pass CFG.[^6][^22] This behavior comes from distillation: during training the model learns to output the CFG-applied result directly for a range of $\gamma$ values.[^22]
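The cost difference is easy to see by counting forward passes. In the sketch below, `CountingModel` is a hypothetical toy that only tracks how often it is called; it ignores the `guidance` argument and does not model what a guidance-distilled network actually computes.

```python
def cfg_combine(v_uncond, v_cond, gamma):
    """Classic CFG: v = v_uncond + gamma * (v_cond - v_uncond)."""
    return [u + gamma * (c - u) for u, c in zip(v_uncond, v_cond)]


class CountingModel:
    """Toy velocity model that only counts forward passes."""

    def __init__(self):
        self.calls = 0

    def __call__(self, x, cond, guidance=None):
        self.calls += 1
        bias = 0.0 if cond is None else 1.0  # made-up effect of the prompt
        return [xi + bias for xi in x]


x = [0.0, 1.0]

# Two-pass CFG: one unconditional and one conditional evaluation per step.
two_pass = CountingModel()
v_cfg = cfg_combine(two_pass(x, None), two_pass(x, "prompt"), gamma=3.5)

# Guidance embedding: gamma is an input, so one evaluation per step suffices.
one_pass = CountingModel()
v_embedded = one_pass(x, "prompt", guidance=3.5)

print(two_pass.calls, one_pass.calls, v_cfg)
```

Setting `gamma` to 1 in `cfg_combine` returns the conditional prediction unchanged, and larger values extrapolate away from the unconditional one.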

flowchart TB
    NOISE["x_T ~ N(0, I)<br>(pure noise)"]
    LOOP{"Denoising Step<br>t: 1.0 → 0.0"}
    MODEL["Flux2 Transformer<br>(x_t, t, txt_emb, guidance)"]
    PRED["predict velocity v"]
    UPDATE["x(t-1) = x(t) + Δt × v"]
    CLEAN["x_0 (clean latent)"]

    NOISE --> LOOP
    LOOP -->|"each step"| MODEL --> PRED --> UPDATE
    UPDATE -->|"next step"| LOOP
    LOOP -->|"done"| CLEAN

FLUX.2 Model Family

Besides the 32B model, FLUX.2 also offers the lighter Klein variants.[^21] They reduce the hidden size and the number of blocks so that the model can be used in a wider range of compute environments.[^21]

| Item | FLUX.2 [dev] 32B | Klein 9B | Klein 4B |
| --- | --- | --- | --- |
| Hidden Size | 6,144 (48 heads × 128d) | 4,096 (32 heads × 128d) | 3,072 (24 heads × 128d) |
| Double Blocks | 8 | 8 | 5 |
| Single Blocks | 48 | 24 | 20 |
| Guidance Embedding | Yes | No | No |

The Klein models have no guidance embedding, so they operate without CFG, and their smaller number of Single Blocks allows faster inference.[^21] Interestingly, Klein 4B has a hidden size of 3,072, the same as FLUX.1 [dev] 12B.[^2][^21] However, with Double 5 + Single 20 it has far fewer blocks than FLUX.1's Double 19 + Single 38, so its parameter count stays at about 4B.[^2][^21] In other words, it keeps a block width similar to FLUX.1 while reducing depth to make the model lighter.[^22]
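The parameter counts in the model names can be roughly reproduced from the table alone. The estimate below counts only the large linear layers described earlier (QKV, attention output, SwiGLU FFN, and the fused Single Block projections) and ignores biases, norms, modulation layers, and the input/output embedders. It also assumes that mlp_ratio=3.0 applies to the Klein variants, which the text does not confirm.

```python
def block_params(hidden, mlp_ratio=3.0):
    """Weight counts for one Double Block (both streams) and one Single Block."""
    mlp_hidden = int(hidden * mlp_ratio)
    # Single block: fused linear1 (QKV + two SwiGLU branches) and fused linear2.
    single = hidden * (3 * hidden + 2 * mlp_hidden) + (hidden + mlp_hidden) * hidden
    # One stream of a double block: QKV, attention output, FFN in (2 branches), FFN out.
    per_stream = (
        hidden * 3 * hidden
        + hidden * hidden
        + hidden * 2 * mlp_hidden
        + mlp_hidden * hidden
    )
    return 2 * per_stream, single


def estimate_params(hidden, num_double, num_single, mlp_ratio=3.0):
    double, single = block_params(hidden, mlp_ratio)
    return num_double * double + num_single * single


# (hidden size, double blocks, single blocks) taken from the table above.
MODELS = {
    "FLUX.2 [dev] 32B": (6144, 8, 48),
    "Klein 9B": (4096, 8, 24),
    "Klein 4B": (3072, 5, 20),
}
estimates = {name: estimate_params(*cfg) for name, cfg in MODELS.items()}
for name, count in estimates.items():
    print(f"{name}: {count / 1e9:.1f}B")
```

Under these simplifications the three estimates come out at roughly 31.4B, 8.7B, and 3.7B, close to the nominal sizes; the remainder would be accounted for by the layers left out of the count.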

All Klein variants share the same VAE as the 32B model (scale factor 16, 128 in_channels), so the expressiveness of the latent space stays the same.[^22] The differences arise only in the processing capacity inside the Transformer.[^22]

Conclusion

Finally, the main differences between FLUX.1 and FLUX.2 are collected in a single table.[^2][^5][^6][^7][^8]

| Item | FLUX.1 [dev] 12B | FLUX.2 [dev] 32B |
| --- | --- | --- |
| Text Encoder | CLIP-L + T5-XXL (2 separate models) | Mistral-Small-3.2-24B (1 model) |
| Text Embed Dim | 4,096 (T5) + 768 (CLIP) | 15,360 (5,120 × 3 layers) |
| in_channels | 64 | 128 |
| Hidden Size | 3,072 (24 heads × 128d) | 6,144 (48 heads × 128d) |
| Double Blocks | 19 | 8 |
| Single Blocks | 38 | 48 |
| RoPE axes | (16, 56, 56) | (32, 32, 32, 32) |
| RoPE theta | 10,000 | 2,000 |
| FFN Activation | GELU | SwiGLU |
| VAE latent_channels | 16 | 32 |
| VAE Scale Factor | 8 (+ pipeline patchify) | 16 (patchify built into the VAE) |
| Image Editing | Not supported | Natively supported |
| Multimodal Input | Text only | Vision-Language |
| Model Family | Single model | dev 32B + Klein 9B/4B |

The key changes can be summarized as follows.

  • Unified text encoder: moves from two separate models (CLIP-L + T5-XXL) to a single multimodal LLM (Mistral-Small-3.2-24B). Multi-layer extraction (Layers 10, 20, 30) yields a richer 15,360-dimensional embedding.[^4][^6]
  • Redesigned Transformer: doubles the hidden size from 3,072 to 6,144, reduces the Double Blocks from 19 to 8, and increases the Single Blocks from 38 to 48 to improve the balance between efficiency and quality. The FFN switches from GELU to SwiGLU.[^2][^6]
  • Deeper VAE compression: increases latent_channels from 16 to 32 and in_channels from 64 to 128 for a richer latent representation. Patchification is integrated into the VAE.[^7][^8]
  • Evenly split RoPE: moves from an uneven 3-axis split (16,56,56) to an even 4-axis split (32,32,32,32). theta is lowered from 10,000 to 2,000 to sharpen fine-grained position awareness.[^5][^6][^22]
  • Native multimodality: image editing and reference-image-based generation are supported out of the box, without a separate model.[^4]
  • Model family: the lighter Klein 9B and Klein 4B variants cover a range of compute environments.[^21]

FLUX.2 is not simply a scaled-up model; it redesigns each component to arrive at a more unified and efficient architecture.[^22] Adopting a multimodal LLM as the text encoder is a meaningful change that points in a new direction for image generation models, a field where the CLIP/T5 combination had been the standard.[^22] It seems likely that other image generation models will evolve in a similar direction.[^22]

Reference


  1. BFL official announcements: Black Forest Labs' releases of FLUX.1 and FLUX.2
  2. FLUX.1-dev HuggingFace config.json: {attention_head_dim:128, guidance_embeds:true, in_channels:64, joint_attention_dim:4096, num_attention_heads:24, num_layers:19, num_single_layers:38, pooled_projection_dim:768}
  3. diffusers pipeline_flux.py: FLUX.1 pipeline implementation
  4. diffusers pipeline_flux2.py: FLUX.2 pipeline implementation (Mistral3ForConditionalGeneration, editing mode, NSFW filtering, denoising loop, schedule generation, etc.)
  5. diffusers transformer_flux.py: FLUX.1 Transformer implementation. FluxTransformerBlock, FluxSingleTransformerBlock, Modulation classes. defaults: axes_dims_rope=(16,56,56), theta=10000, GELU activation
  6. diffusers transformer_flux2.py: FLUX.2 Transformer implementation. defaults: num_attention_heads=48, attention_head_dim=128, num_layers=8, num_single_layers=48, in_channels=128, joint_attention_dim=15360, mlp_ratio=3.0, axes_dims_rope=(32,32,32,32), rope_theta=2000, timestep_guidance_channels=256, guidance_embeds=True. MLPEmbedder(256→6144), SwiGLU activation
  7. diffusers autoencoder_kl_flux2.py: FLUX.2 VAE implementation. latent_channels=32, patch_size=(2,2), block_out_channels=(128,256,512,512)
  8. FLUX.1 VAE config: latent_channels:16, scaling_factor:0.3611, block_out_channels:[128,256,512,512]
  9. Stable Diffusion 3 (Esser et al., 2024): MM-DiT, Joint Attention, QK-Norm
  10. DiT (Peebles & Xie, 2023): AdaLN-Zero, where zero initialization of the gate makes each block start as an identity function. Whether FLUX.2 actually applies zero init was not verified directly in the code
  11. RoPE (Su et al., 2021): Rotary Position Embedding. In $\theta_i = \text{base}^{-2i/d}$, the base (theta) determines the frequency characteristics
  12. SwiGLU (Shazeer, 2020): GLU Variants Improve Transformer
  13. Flow Matching (Lipman et al., 2022): linear interpolation path, velocity prediction
  14. Rectified Flow (Liu et al., 2022): advantages of straight paths
  15. CFG (Ho & Salimans, 2022): Classifier-Free Diffusion Guidance
  16. DDPM (Ho et al., 2020): noise-prediction formulation
  17. CLIP (Radford et al., 2021): text-only encoder
  18. T5 (Raffel et al., 2020): text-only encoder
  19. Qwen3-VL technical report: DeepStack, Interleaved MRoPE
  20. Mistral AI official: Mistral-Small-3.2-24B (vision-language model, hidden_size=5120)
  21. HuggingFace FLUX.2-dev-Klein repo config: Klein 9B/4B model specs
  22. Claude's opinion: interpretations, analogies, calculations, and outlook written directly by Claude on the basis of code analysis (since no official BFL paper has been published, this includes content inferred without official backing)