DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
…E Discussion About Pre-Training Data Debiasing … F Additional Evaluations on Math and Code … G Evaluation Formats … 1. Introduction: In the past few years, Large Language Models … pre-training corpus. Then, we collect 1.5M conversational sessions, which encompass various domains such as math, code, writing, reasoning, safety, and more, to perform Supervised Fine-Tuning (SFT) for DeepSeek-V2 … datasets include CHID (Zheng et al., 2019) and CCPM (Li et al., 2021). Math datasets include GSM8K (Cobbe et al., 2021), MATH (Hendrycks et al., 2021), and CMath (Wei et al., 2023). Code datasets include …
0 码力 | 52 pages | 1.23 MB | 1 year ago
Trends Artificial Intelligence
…benchmark evaluates a language model's performance across 57 academic and professional subjects, such as math, law, medicine, and history. It measures both factual recall and reasoning ability, making it a standard … unified approach also creates a more seamless experience for users… …we've optimized somewhat less for math and computer science competition problems, and instead shifted focus towards real-world tasks that AI… … Performance on MATH Level 5 Test, Open vs. Closed LLMs by Year Released – 6/23-4/25, per Epoch AI. Note: MATH Level 5 pass@1 refers to the accuracy of an AI model on the MATH benchmark, a dataset …
0 码力 | 340 pages | 12.14 MB | 4 months ago
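The snippet's note on pass@1 is cut off mid-definition. For readers unfamiliar with the metric, a commonly used formalization is the unbiased pass@k estimator of Chen et al. (2021), of which pass@1 is the k = 1 case; this is a general definition and not necessarily the exact protocol Epoch AI used for the chart referenced above.

```latex
% Unbiased pass@k estimator (Chen et al., 2021); pass@1 is the k = 1 case.
% n = samples drawn per problem, c = samples that pass the grader.
\[
\text{pass@}k \;=\; \mathbb{E}_{\text{problems}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right],
\qquad
\text{pass@}1 \;=\; \mathbb{E}_{\text{problems}}\!\left[\, \frac{c}{n} \,\right].
\]
```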
OpenAI 《A practical guide to building agents》
…enforcement, or safety classification. For example, the agent above processes a math question input optimistically until the math_homework_tripwire guardrail identifies a violation and raises an exception …
0 码力 | 34 pages | 7.00 MB | 6 months ago
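The excerpt describes the guide's guardrail ("tripwire") pattern: a check runs alongside the agent and halts it by raising an exception when a policy such as "no math homework" is violated. Below is a minimal, hypothetical Python sketch of that pattern; the names (GuardrailTripwireTriggered, looks_like_homework, math_homework_tripwire, run_agent) are illustrative assumptions, not the guide's code or the OpenAI SDK's API.

```python
# Minimal sketch of an input-guardrail "tripwire" around a plain agent loop.
# All names are illustrative; they are not taken from the OpenAI Agents SDK.

class GuardrailTripwireTriggered(Exception):
    """Raised when a guardrail decides the agent must stop processing the input."""


def looks_like_homework(text: str) -> bool:
    # Stand-in classifier; a real system would call a small, cheap model here.
    keywords = ("solve for x", "show your work", "homework", "due tomorrow")
    return any(k in text.lower() for k in keywords)


def math_homework_tripwire(user_input: str) -> None:
    """Guardrail: optimistic processing continues unless this raises."""
    if looks_like_homework(user_input):
        raise GuardrailTripwireTriggered("Refusing to do math homework for the user.")


def run_agent(user_input: str) -> str:
    math_homework_tripwire(user_input)   # guardrail runs before/alongside the agent
    return f"Working on: {user_input}"   # placeholder for the actual agent call


if __name__ == "__main__":
    try:
        print(run_agent("Solve for x: 2x + 3 = 11, it's my homework"))
    except GuardrailTripwireTriggered as exc:
        print(f"Guardrail tripped: {exc}")
```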
Google 《Prompt Engineering v7》
….9, and top-K of 20. Finally, if your task always has a single correct answer (e.g., answering a math problem), start with a temperature of 0. NOTE: With more freedom (higher temperature, top-K, top-P … simple as multiplying two numbers. This is because they are trained on large volumes of text and math may require a different approach. So let's see if intermediate reasoning steps will improve the …
0 码力 | 68 pages | 6.50 MB | 6 months ago
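The excerpt recommends choosing decoding settings by task type: a looser configuration (temperature around .9 with top-K 20, per the snippet) for open-ended work, versus temperature 0 when there is a single correct answer such as a math problem. The Python sketch below illustrates that decision as a plain configuration helper; GenerationConfig and choose_config are hypothetical, not Google's API, and the top-P values are illustrative defaults rather than numbers from the guide.

```python
# Hypothetical helper mirroring the guide's advice on decoding settings.
# GenerationConfig / choose_config are illustrative, not a real vendor API.
from dataclasses import dataclass


@dataclass
class GenerationConfig:
    temperature: float
    top_k: int
    top_p: float


def choose_config(single_correct_answer: bool) -> GenerationConfig:
    if single_correct_answer:
        # e.g. a math problem: make decoding (near-)deterministic.
        return GenerationConfig(temperature=0.0, top_k=1, top_p=1.0)
    # Looser starting point for open-ended tasks (values from the excerpt; top_p assumed).
    return GenerationConfig(temperature=0.9, top_k=20, top_p=0.95)


print(choose_config(single_correct_answer=True))
print(choose_config(single_correct_answer=False))
```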
清华大学 (Tsinghua University), DeepSeek + DeepResearch: Making Research as Easy as Chatting
…OpenAI-4o and other closed-source models. • Mathematical reasoning on par with top-tier models: on the AIME 2024 benchmark, DeepSeek R1 scores 79.8% (pass@1), slightly ahead of OpenAI-o1-1217; on MATH-500 it reaches 97.3%, on par with OpenAI-o1-1217 and well ahead of other models. • Expert-level code generation: on programming tasks, DeepSeek R1 reaches an Elo rating of 2029, surpassing …
0 码力 | 85 pages | 8.31 MB | 8 months ago
5 results in total (page 1)













