DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

After the initial pre-training of DeepSeek-V2, we employ YaRN (Peng et al., 2023) to extend the default context window length from 4K to 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}^{R}_{t}$, as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from the original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy.
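The sketch below illustrates the YaRN-style frequency interpolation described above, using the stated hyperparameters (scale $s = 40$, $\alpha = 1$, $\beta = 32$, 4K pre-training context). The head dimension, RoPE base, and the length-scaling coefficient are assumptions for illustration; in particular, `yarn_mscale` uses the generic YaRN form of the length scaling factor, whereas the paper notes DeepSeek-V2 adjusts this factor for its attention mechanism. This is a minimal sketch, not the authors' implementation.

```python
import math
import numpy as np


def yarn_scaled_inv_freq(
    head_dim: int = 64,        # dimension of the decoupled RoPE key (assumed value)
    rope_base: float = 10000.0,  # assumed RoPE base
    orig_ctx: int = 4096,      # pre-training context window (4K)
    scale: float = 40.0,       # YaRN scale s
    alpha: float = 1.0,        # ramp lower bound: full interpolation below it
    beta: float = 32.0,        # ramp upper bound: no interpolation above it
) -> np.ndarray:
    """Per-dimension inverse frequencies after YaRN interpolation."""
    dims = np.arange(0, head_dim, 2, dtype=np.float64)
    inv_freq = rope_base ** (-dims / head_dim)      # theta_d
    wavelength = 2.0 * math.pi / inv_freq           # lambda_d
    rotations = orig_ctx / wavelength               # rotations of dim d within 4K
    # Ramp: 0 -> fully interpolate (divide frequency by s), 1 -> keep original.
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return inv_freq / scale * (1.0 - ramp) + inv_freq * ramp


def yarn_mscale(scale: float = 40.0, coeff: float = 0.1) -> float:
    """Generic YaRN length-scaling factor (placeholder; DeepSeek-V2 adjusts it)."""
    return 1.0 if scale <= 1.0 else coeff * math.log(scale) + 1.0


if __name__ == "__main__":
    print("scaled inverse frequencies:", yarn_scaled_inv_freq()[:4], "...")
    print("length scaling factor:", yarn_mscale())
```

In this scheme, high-frequency dimensions (those completing more than $\beta$ rotations within the original 4K window) keep their frequencies, low-frequency dimensions (fewer than $\alpha$ rotations) are interpolated by $1/s$, and dimensions in between are blended linearly; the length scaling factor then compensates for the entropy change in attention over the longer context.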













