The Curse of Depth in Large Language Models
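Summary
The paper reports that in mainstream model families such as Llama, Mistral, DeepSeek, and Qwen, nearly half of the layers are less effective than expected. Its analysis attributes this to Pre-Layer Normalization (Pre-LN): the output variance grows exponentially with depth, so the derivative of deep Transformer blocks approaches the identity matrix and those blocks contribute little to training. As a remedy it proposes LayerNorm Scaling (LNS), which scales the variance of the LayerNorm output inversely by the square root of the layer's depth to curb the explosion. In experiments on models from 130M to 7B parameters, LNS consistently outperforms earlier normalization and scaling techniques in pre-training, and the gains carry over to supervised fine-tuning.
Why read this
When training or fine-tuning mainstream models such as Llama or Qwen, watch for deep-layer parameters rendered ineffective by Pre-LN, and consider LayerNorm Scaling (LNS) as a way to recover the compute and model capacity that would otherwise be wasted.
Original text (from the paper's abstract)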
In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling (LNS), which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Across a wide range of model sizes (130M to 7B), our experiments show that LNS consistently outperforms previous normalization and scaling techniques in enhancing LLM pre-training performance. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training. Our code is available at https://github.com/lmsdss/LayerNorm-Scaling.
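The scaling step described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is one reading of the abstract, assuming that the LayerNorm output of layer l is multiplied by 1/sqrt(l) and that l is a 1-based layer index. The function names `layer_norm`, `layer_norm_scaled`, and `variance` are hypothetical, and learned gain and bias are omitted for brevity.

```python
import math


def layer_norm(x, eps=1e-5):
    """Plain layer normalization over a list of floats (no learned gain/bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def layer_norm_scaled(x, layer_index, eps=1e-5):
    """LayerNorm output multiplied by 1/sqrt(layer_index).

    Assumption: layer_index is 1-based, so the first layer is left unscaled.
    """
    if layer_index < 1:
        raise ValueError("layer_index must be >= 1")
    scale = 1.0 / math.sqrt(layer_index)
    return [scale * v for v in layer_norm(x, eps)]


def variance(x):
    """Population variance, used here only to inspect the effect of scaling."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x) / len(x)


if __name__ == "__main__":
    hidden = [0.5, -1.2, 3.0, 0.7, -2.1, 1.4]
    for layer in (1, 4, 16):
        out = layer_norm_scaled(hidden, layer)
        print(layer, round(variance(out), 4))
```

Under this reading, multiplying the output by 1/sqrt(l) shrinks its variance by a factor of 1/l, so deeper layers feed progressively smaller-variance activations into the residual stream.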