
Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

Han
Weight initialization governs signal propagation and gradient flow at the start of training. This paper offers a theory-grounded and empirically validated study across two regimes: compact ReLU multilayer perceptrons and GPT-2-style transformers. First, a logarithmic sweep of the initial standard deviation maps vanishing and exploding regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that Kaiming (fan-in) initialization converges faster and more stably than Xavier under ReLU, consistent with variance-preserving theory. Third, in a from-scratch 12-layer GPT-2-style model, this paper tracks layerwise Q/K/V weight variance through pretraining and observes depth-dependent equilibration into narrow bands: shallow layers expand rapidly while deeper layers change more gradually. Together, these results connect classic initialization principles with modern transformer behavior and yield simple, practical recipes for robust training.

Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models

Basic Information

  • Paper ID: 2510.09423
  • Title: Weight Initialization and Variance Dynamics in Deep Neural Networks and Large Language Models
  • Author: Yankun Han (University of Florida)
  • Category: cs.LG (Machine Learning)
  • Published: October 10, 2025 (arXiv preprint)
  • Paper link: https://arxiv.org/abs/2510.09423

Abstract

Weight initialization governs signal propagation and gradient flow at the start of training. This paper presents a theory-grounded, empirically validated study spanning two regimes: compact ReLU multilayer perceptrons and GPT-2-style Transformers. First, a logarithmic sweep of the initial standard deviation maps the vanishing- and exploding-gradient regimes and identifies a broad stability band with standard deviations between 1e-2 and 1e-1. Second, a controlled comparison shows that under ReLU activations Kaiming (fan-in) initialization converges faster and more stably than Xavier initialization, consistent with variance-preservation theory. Third, in a from-scratch 12-layer GPT-2-style model, the Q/K/V weight variance of each layer is tracked throughout pretraining, revealing depth-dependent equilibration: shallow layers expand rapidly while deeper layers change more gradually.

Research Background and Motivation

Problem Definition

The core question this study addresses is how weight initialization affects training stability and convergence in deep neural networks and large language models. Specifically:

  1. Sensitivity to initialization scale: how do different initial standard deviations affect training stability?
  2. Activation-function specificity: do activation functions such as ReLU and GELU require particular initialization strategies?
  3. Variance dynamics in modern Transformers: does variance stabilization still occur in large Transformer models?

Importance

Weight initialization is a key factor in whether deep learning training succeeds. Poor initialization leads to:

  • Vanishing gradients: the signal shrinks layer by layer as it passes through a deep network
  • Exploding gradients: the signal grows exponentially as it propagates
  • Training instability: oscillation and divergence during optimization

Limitations of Existing Methods

The classical initialization schemes (LeCun, Xavier/Glorot, He/Kaiming) rest on the theoretical intuition of variance preservation, but open issues remain in practice:

  1. Sensitivity to deviations from the ideal scale has not been adequately quantified
  2. The mechanism by which specific activation functions (e.g., ReLU, GELU) affect training is unclear
  3. Systematic studies of how these schemes perform in large Transformers are lacking

Core Contributions

  1. Unified variance-analysis framework: derives the forward and backward variance-propagation conditions for common activation functions (ReLU, GELU), explaining how fan-in scaling preserves signal magnitude and where the factor of 2 for ReLU comes from.
  2. Quantified scale sensitivity: a logarithmic sweep over 25 standard-deviation values maps the vanishing- and exploding-gradient regimes and identifies a stable training band σ ∈ [10⁻², 10⁻¹].
  3. Validation of activation-aware initialization: in controlled ReLU MLP training, Kaiming normal (fan-in) converges faster than Xavier normal and shows lower loss variance.
  4. Analysis of Transformer variance dynamics: a from-scratch 12-layer GPT-2-style model shows a clear depth-dependent pattern: weight standard deviations in shallow layers expand rapidly, deeper layers change more gradually, and all layers eventually settle into a narrow variance band.

Method Details

Theoretical Framework

Forward-Propagation Variance Analysis

For a linear mapping:

Var[z_l] = n_in σ²_W Var[x_{l-1}]

After the nonlinear activation:

Var[x_l] ≈ c_φ n_in σ²_W Var[x_{l-1}]

where c_φ = E[φ(z)²]/Var[z] is a constant that depends on the activation function.

To keep activations from vanishing or exploding, choose σ²_W ≈ 1/(c_φ n_in):

  • ReLU: c_φ ≈ 1/2, so σ²_W ≈ 2/n_in (He/Kaiming)
  • GELU: c_φ ≈ 0.45-0.5, slightly smaller than for ReLU
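
The constant c_φ can be checked numerically. The sketch below is a minimal Monte Carlo estimate under the assumption of a standard-normal pre-activation z; the function names and the sample count are illustrative choices, not taken from the paper. Because GELU is not positively homogeneous, its estimate depends on the assumed scale of z, so only the ReLU value should be read as scale-independent.

```python
import math
import random


def relu(z):
    return max(z, 0.0)


def gelu(z):
    # Exact GELU: z * Phi(z), with Phi the standard normal CDF.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def second_moment_ratio(phi, n_samples=200_000, seed=0):
    """Monte Carlo estimate of c_phi = E[phi(z)^2] / Var[z] for z ~ N(0, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        total += phi(rng.gauss(0.0, 1.0)) ** 2
    return total / n_samples  # Var[z] = 1 for a standard normal


def weight_variance(c_phi, n_in):
    """Variance-preserving choice: sigma_W^2 = 1 / (c_phi * n_in)."""
    return 1.0 / (c_phi * n_in)


c_relu = second_moment_ratio(relu)
c_gelu = second_moment_ratio(gelu)
print(f"c_relu ~ {c_relu:.3f}, c_gelu ~ {c_gelu:.3f}")
print(f"ReLU sigma_W^2 for n_in=784: {weight_variance(c_relu, 784):.5f}")
```

With c_φ = 1/2 the last function reduces to 2/n_in, which is the He/Kaiming rule stated above.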

Backward-Propagation Variance Analysis

Backpropagation gives:

Var[δ_{l-1}] ≈ n_out σ²_W d_φ Var[δ_l]

where d_φ = E[φ'(z)²]. For ReLU, d_φ = 1/2, so balancing the gradient variance requires σ²_W ≈ 2/n_out.
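
To make the two conditions concrete, the sketch below evaluates the per-layer forward factor c_φ n_in σ²_W and backward factor d_φ n_out σ²_W for a single ReLU layer. The layer shape (784 inputs, 64 outputs) and the function names are assumptions chosen for illustration.

```python
def forward_gain(sigma_w_sq, n_in, c_phi=0.5):
    """Per-layer factor on Var[x]: c_phi * n_in * sigma_W^2."""
    return c_phi * n_in * sigma_w_sq


def backward_gain(sigma_w_sq, n_out, d_phi=0.5):
    """Per-layer factor on Var[delta]: d_phi * n_out * sigma_W^2."""
    return d_phi * n_out * sigma_w_sq


n_in, n_out = 784, 64        # illustrative layer shape (assumption)
fan_in_var = 2.0 / n_in      # forward-preserving choice for ReLU
fan_out_var = 2.0 / n_out    # backward-preserving choice for ReLU

for name, var in [("fan-in", fan_in_var), ("fan-out", fan_out_var)]:
    fwd = forward_gain(var, n_in)
    bwd = backward_gain(var, n_out)
    print(f"{name:8s} forward gain={fwd:.4f}  backward gain={bwd:.4f}")
```

A factor of 1 means the variance is preserved through that layer. For an unbalanced layer like this one, each choice fixes one direction at 1 and leaves the other far from it.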

Balance and Practical Choice

The forward and backward preservation conditions generally cannot both hold at once, except when n_in ≈ n_out and c_φ ≈ d_φ. In practice, keeping the forward signal stable usually matters more, which explains why fan-in He/Kaiming converges faster than Xavier.

Experimental Design

Experiment E1: Standard-Deviation Sweep

  • Network architecture: ReLU MLP, 784→64→32→32→10
  • Dataset: MNIST
  • Sweep range: 25 log-spaced standard-deviation values from 10⁻⁴ to 10
  • Evaluation metrics: loss trajectory, classification accuracy
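
The sweep grid and its effect on the forward signal can be sketched without any training. The code below builds 25 log-spaced values between 10⁻⁴ and 10 and pushes one random input through an untrained 784→64→32→32→10 ReLU MLP with zero biases. The random stand-in input (not MNIST), the helper names, and the use of output RMS as a diagnostic are assumptions for illustration, not the paper's protocol.

```python
import math
import random


def log_spaced(lo_exp, hi_exp, count):
    """`count` values spaced evenly in log10 between 10**lo_exp and 10**hi_exp."""
    step = (hi_exp - lo_exp) / (count - 1)
    return [10.0 ** (lo_exp + i * step) for i in range(count)]


def output_rms(std, sizes, rng):
    """RMS of the last-layer pre-activations for zero-bias N(0, std^2) weights."""
    x = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]  # stand-in input, not MNIST
    for depth, (n_in, n_out) in enumerate(zip(sizes, sizes[1:])):
        # Each output unit draws a fresh row of n_in Gaussian weights.
        z = [sum(rng.gauss(0.0, std) * xi for xi in x) for _ in range(n_out)]
        last = depth == len(sizes) - 2
        x = z if last else [max(v, 0.0) for v in z]  # ReLU on hidden layers only
    return math.sqrt(sum(v * v for v in x) / len(x))


sizes = [784, 64, 32, 32, 10]
stds = log_spaced(-4, 1, 25)
rng = random.Random(0)
for std in stds[::6]:  # print every sixth grid point to keep the output short
    print(f"std={std:.1e}  output RMS={output_rms(std, sizes, rng):.3e}")
```

The output magnitude spans many orders of magnitude across the grid, which is the forward-pass counterpart of the vanishing and exploding regimes the sweep is designed to map.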

Experiment E2: Xavier vs. Kaiming Comparison

  • Network architecture: ReLU network, 11→16→32→32→1
  • Dataset: UCI Wine, binary classification task
  • Schemes compared: Xavier normal vs. Kaiming uniform
  • Statistical validation: 10 random runs, paired t-test
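
For reference, the sketch below computes the scales these schemes assign to each layer of the 11→16→32→32→1 network. The formulas follow the widely used convention (the one documented for PyTorch's torch.nn.init, with gain 1 for Xavier and the ReLU gain √2 for Kaiming). The summary does not say which framework or gain settings the paper used, so treat these choices as assumptions.

```python
import math


def xavier_normal_std(fan_in, fan_out):
    """Xavier/Glorot normal with gain 1: std = sqrt(2 / (fan_in + fan_out))."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def kaiming_normal_std(fan_in):
    """Kaiming normal, fan-in mode, ReLU gain sqrt(2): std = sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / fan_in)


def kaiming_uniform_bound(fan_in):
    """Kaiming uniform U(-b, b) with the same variance: b = sqrt(6 / fan_in)."""
    return math.sqrt(6.0 / fan_in)


sizes = [11, 16, 32, 32, 1]
for fan_in, fan_out in zip(sizes, sizes[1:]):
    print(
        f"{fan_in:>3}->{fan_out:<3}"
        f" xavier std={xavier_normal_std(fan_in, fan_out):.4f}"
        f" kaiming std={kaiming_normal_std(fan_in):.4f}"
        f" kaiming uniform bound={kaiming_uniform_bound(fan_in):.4f}"
    )
```

The uniform bound is √3 times the normal standard deviation, so both Kaiming variants give the weights the same variance, 2/fan_in.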

Experiment E3: GPT-2 Variance Dynamics

  • Model size: 12-layer GPT-2-style Transformer
  • Initialization: standard configuration (std=0.02 for most modules, xavier normal for the embedding layer)
  • Optimizer: AdamW, learning rate 1×10⁻⁴, batch size 16
  • Tracked quantity: standard deviation of the Q/K/V projection weights in every layer
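
The tracked quantity is simple to compute from a snapshot of the weights. The sketch below assumes a hypothetical layout in which the parameters are stored as a dictionary keyed by (layer index, projection name); the toy matrices, the width of 16, and the names are placeholders, not the paper's model or code.

```python
import random
import statistics


def qkv_std_by_layer(params):
    """Map (layer, projection) -> population std of that weight matrix's entries."""
    stats = {}
    for (layer, proj), matrix in params.items():
        flat = [w for row in matrix for w in row]
        stats[(layer, proj)] = statistics.pstdev(flat)
    return stats


# Toy stand-in for a 12-layer model: small square matrices drawn with std=0.02.
rng = random.Random(0)
d_model, n_layers = 16, 12
params = {
    (layer, proj): [[rng.gauss(0.0, 0.02) for _ in range(d_model)] for _ in range(d_model)]
    for layer in range(n_layers)
    for proj in ("q", "k", "v")
}

# Calling the function every N optimizer steps yields one curve per (layer, projection).
history = {0: qkv_std_by_layer(params)}  # training step -> snapshot
for layer in (0, n_layers - 1):
    print(layer, {p: round(history[0][(layer, p)], 4) for p in ("q", "k", "v")})
```

At step 0 every entry sits near the initialization value of 0.02. The depth-dependent behavior described in this summary would appear as these curves separating by layer during training.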

Experimental Results

E1: Standard-Deviation Sweep Results

  • Stable region: for σ ∈ [10⁻², 10⁻¹] training is smooth, gradients are well behaved, and accuracy peaks within this interval.
  • Vanishing gradients: very small scales (σ ≲ 10⁻³) cause updates to vanish and accuracy to drop.
  • Exploding gradients: very large scales (σ ≳ 1) produce unstable losses and intermittent divergence.

E2: Comparison of Initialization Methods

Kaiming initialization consistently outperforms Xavier on several measures:

  • Convergence speed: fewer median epochs to reach the target, and a steeper early drop in loss.
  • Accuracy: final validation accuracy matches or slightly exceeds that of Xavier.
  • Statistical significance: paired t-tests show that the differences in loss and training accuracy are significant (p < 0.05).
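
A paired t-test on per-run results can be sketched as follows. The loss values are made-up placeholders that only show the mechanics; they are not the paper's measurements. The pairing (run i of one scheme against run i of the other) is an assumption about how the 10 runs were matched. The statistic is compared with 2.262, the two-sided 5% critical value of Student's t distribution with 9 degrees of freedom.

```python
import math
import statistics


def paired_t(a, b):
    """Paired t statistic and degrees of freedom for matched samples a and b."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std of the differences
    return statistics.mean(diffs) / std_err, n - 1


# Placeholder final losses for 10 matched runs (illustrative only, not real results).
xavier_loss = [0.52, 0.49, 0.55, 0.51, 0.50, 0.53, 0.48, 0.54, 0.52, 0.50]
kaiming_loss = [0.47, 0.46, 0.50, 0.49, 0.45, 0.49, 0.46, 0.50, 0.47, 0.48]

t_stat, dof = paired_t(xavier_loss, kaiming_loss)
T_CRIT_DF9 = 2.262  # two-sided alpha = 0.05, 9 degrees of freedom
print(f"t={t_stat:.2f}, dof={dof}, significant={abs(t_stat) > T_CRIT_DF9}")
```

Pairing removes run-to-run variation shared by both schemes, so the test is sensitive to a consistent gap even when individual losses vary.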

E3: Transformer Variance Dynamics Findings

  • Depth-dependent pattern: shallow layers show a fast, pronounced expansion of the weight standard deviation early in training, whereas deep layers expand more slowly and smoothly.
  • Variance equilibration: all layers eventually stabilize within a narrow variance band.
  • Distribution sparsification: after training, the weight distribution becomes sparser; many entries stay near zero and barely change, while a small number of large weights dominate.

Theoretical Insights and Practical Implications

Depth-Dependent Variance Equilibration Mechanism

The paper reveals a progressive equilibration pattern in Transformers:

  1. Fast adaptation in shallow layers: layers close to the input receive gradients with a high signal-to-noise ratio, which encourages aggressive rescaling early on.
  2. Gradual adjustment in deep layers: residual path length and pre-normalization limit the effective step size of deep layers.
  3. Implicit constraints: attention softmax saturation and AdamW weight decay keep parameter scales from growing large.

Practical Guidelines

  1. ReLU/GELU MLPs: start from fan-in He/Kaiming. If highly unbalanced layers cause gradient drift, shift slightly toward a fan-average choice.
  2. Deep residual stacks: residual scaling (e.g., 1/√L) or normalization helps prevent variance drift with depth.
  3. Transformer projections: use a small-standard-deviation initialization (e.g., 0.02) and monitor the per-layer standard deviation and gradient range.
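
The effect of 1/√L residual scaling can be illustrated with a toy model. It assumes that each block computes x + α·f(x), that the branch output f(x) is uncorrelated with x, and that the branch has unit gain (Var[f(x)] = Var[x]). Under those assumptions, which are a simplification rather than the paper's analysis, the variance grows by a factor of 1 + α² per block.

```python
import math


def residual_variance_growth(num_layers, alpha):
    """Var[x_L] / Var[x_0] when every block adds an uncorrelated unit-gain branch scaled by alpha."""
    return (1.0 + alpha ** 2) ** num_layers


L = 12  # depth matching the 12-layer model discussed above
unscaled = residual_variance_growth(L, 1.0)
scaled = residual_variance_growth(L, 1.0 / math.sqrt(L))
print(f"alpha=1:         variance x{unscaled:.0f}")
print(f"alpha=1/sqrt(L): variance x{scaled:.3f}")
```

With α = 1/√L the growth is (1 + 1/L)^L, which stays below e for any depth. Without scaling it doubles at every block.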

Comparison with Related Work

Foundational Initialization Strategies

  • LeCun method: variance-preserving rule for linear activations
  • Glorot/Xavier: fan-based scaling for tanh/sigmoid
  • He/Kaiming: activation-aware scaling that compensates for the halving of the second moment under ReLU

Modern Developments

  • Fixup initialization: removes the need for normalization in extremely deep networks through carefully chosen initialization and residual scaling
  • DeepNet: proposes principled depth-scaling rules that make training at the thousand-layer scale possible
  • Advantage of pre-normalization: improves optimization stability relative to post-normalization through smoother gradient flow

Conclusions and Discussion

Main Conclusions

  1. A stability band exists: within σ ∈ [10⁻², 10⁻¹] there is a broad but sensitive stability band.
  2. Activation-function specificity matters: in practice, Kaiming initialization outperforms Xavier in ReLU networks.
  3. Depth-dependent dynamics: Transformers exhibit depth-dependent variance equilibration, with shallow layers adapting quickly and deep layers adjusting gradually.

Limitations

  1. Experimental scale: the GPT-2 experiment is relatively small (12 layers); larger models may behave differently.
  2. Range of activation functions: the focus is mainly on ReLU and GELU, with limited analysis of other activation functions.
  3. Optimizer dependence: the results may be sensitive to the specific optimizer (AdamW) and hyperparameter settings.

Future Directions

  1. Adaptive depth-aware initialization: learn per-layer or per-head scales so that shallow layers start closer to their final variance level.
  2. Coupling of optimizer and schedule: jointly tune warmup length, weight decay, and gradient clipping.
  3. Depth and width scaling: assess whether depth-dependent equilibration persists in larger models.

In-Depth Evaluation

Strengths

  1. Combines theory and practice: ties classical variance-propagation theory to the behavior of modern Transformers.
  2. Systematic experimental design: validation proceeds step by step from simple MLPs to more complex Transformers.
  3. High practical value: offers concrete initialization recommendations and diagnostic methods.
  4. Statistical rigor: uses statistical methods such as paired t-tests to verify that results are significant.

Weaknesses

  1. Limited theoretical depth: a deeper theoretical explanation of the depth-dependent phenomena is missing.
  2. Constrained experimental scale: limited compute meant the findings were not validated on truly large-scale models.
  3. Generalization: the results rest mainly on specific architectures and tasks, and how well they generalize needs further validation.

Impact Assessment

  1. Academic contribution: brings a modern perspective to initialization theory, linking classical theory with current practice.
  2. Practical value: gives practitioners clear initialization strategies and diagnostic tools.
  3. Reproducibility: the experimental design is clear, and the code and parameter settings are detailed enough to make reproduction straightforward.

Applicable Scenarios

  1. Deep network training: particularly suited to deep networks with ReLU/GELU activations.
  2. Transformer optimization: provides initialization guidance for training large language models.
  3. Research tool: gives researchers a methodological framework for analyzing weight dynamics.

References

The paper cites the core literature on initialization, including the foundational work of LeCun, Glorot, and He, as well as recent advances in Transformer optimization, which gives the study a solid theoretical foundation.