DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

After the initial pre-training of DeepSeek-V2, we employ YaRN (Peng et al., 2023) to extend the default context window length from 4K to 128K. YaRN was specifically applied to the decoupled shared key $\mathbf{k}^{R}_{t}$, as it is responsible for carrying RoPE (Su et al., 2024). For YaRN, we set the scale $s$ to 40, $\alpha$ to 1, $\beta$ to 32, and the target maximum context length to 160K. Under these settings, we can expect the model to respond well for a context length of 128K. Slightly diverging from the original YaRN, due to our distinct attention mechanism, we adjust the length scaling factor to modulate the attention entropy.
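The sketch below illustrates the YaRN-style frequency interpolation described above, using the stated hyperparameters (scale $s = 40$, $\alpha = 1$, $\beta = 32$, 4K pre-training context). The head dimension, RoPE base, and the length-scaling coefficient are assumptions for illustration; in particular, `yarn_mscale` uses the generic YaRN form of the length scaling factor, whereas the paper notes DeepSeek-V2 adjusts this factor for its attention mechanism. This is a minimal sketch, not the authors' implementation.

```python
import math
import numpy as np


def yarn_scaled_inv_freq(
    head_dim: int = 64,        # dimension of the decoupled RoPE key (assumed value)
    rope_base: float = 10000.0,  # assumed RoPE base
    orig_ctx: int = 4096,      # pre-training context window (4K)
    scale: float = 40.0,       # YaRN scale s
    alpha: float = 1.0,        # ramp lower bound: full interpolation below it
    beta: float = 32.0,        # ramp upper bound: no interpolation above it
) -> np.ndarray:
    """Per-dimension inverse frequencies after YaRN interpolation."""
    dims = np.arange(0, head_dim, 2, dtype=np.float64)
    inv_freq = rope_base ** (-dims / head_dim)      # theta_d
    wavelength = 2.0 * math.pi / inv_freq           # lambda_d
    rotations = orig_ctx / wavelength               # rotations of dim d within 4K
    # Ramp: 0 -> fully interpolate (divide frequency by s), 1 -> keep original.
    ramp = np.clip((rotations - alpha) / (beta - alpha), 0.0, 1.0)
    return inv_freq / scale * (1.0 - ramp) + inv_freq * ramp


def yarn_mscale(scale: float = 40.0, coeff: float = 0.1) -> float:
    """Generic YaRN length-scaling factor (placeholder; DeepSeek-V2 adjusts it)."""
    return 1.0 if scale <= 1.0 else coeff * math.log(scale) + 1.0


if __name__ == "__main__":
    print("scaled inverse frequencies:", yarn_scaled_inv_freq()[:4], "...")
    print("length scaling factor:", yarn_mscale())
```

In this scheme, high-frequency dimensions (those completing more than $\beta$ rotations within the original 4K window) keep their frequencies, low-frequency dimensions (fewer than $\alpha$ rotations) are interpolated by $1/s$, and dimensions in between are blended linearly; the length scaling factor then compensates for the entropy change in attention over the longer context.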













