DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
…intuitive overview of these benchmarks, we additionally provide our evaluation formats for each benchmark in Appendix G. 3.2.2. Evaluation Results: In Table 2, we compare DeepSeek-V2 with several representative … code, and math benchmarks. As for Chinese benchmarks, Qwen1.5 72B shows better performance on multiple-choice tasks while DeepSeek-V2 is comparable or better on others. Note that for the CHID benchmark, the tokenizer of Qwen1.5 72B will encounter errors in our evaluation framework, so we leave the…
[Table 2 header: Benchmark (Metric) | # Shots | DeepSeek 67B | Qwen1.5 72B | Mixtral 8x22B | LLaMA 3 70B | DeepSeek-V2]
52 pages | 1.23 MB | 1 year ago

Trends Artificial Intelligence
…"notable" language models shown (per Epoch AI, includes state of the art improvement on a recognized benchmark, >1K citations, historically relevant, with significant use). Source: Epoch AI (5/25) … AI System Performance on MMLU Benchmark Test – 2019-2024, per Stanford HAI. Note: The MMLU (Massive Multitask Language Understanding) benchmark evaluates a language model's performance across…
340 pages | 12.14 MB | 4 months ago

TVM Meetup Nov. 16th - Linaro
…for more flexibility with the runtime plugins? ○ Integrate TVM codegen into Arm NN? ● CI and benchmark testing for TVM on member hardware platforms ○ Shall we maintain a list of Arm platforms supported…
7 pages | 1.23 MB | 5 months ago

XDNN TVM - Nov 2019
…oo (embedded, i.e. ZC104/Ultra96) https://github.com/Xilinx/ml-suite/blob/master/examples/caffe/Benchmark_README.md ˃ Two measurements we track: Latency & Throughput ˃ ML pipeline contains multiple stages…
16 pages | 3.35 MB | 5 months ago

OpenAI - AI in the Enterprise
…Evals are built around tasks that measure the quality of the output of a model against a benchmark—is it more accurate? More compliant? Safer? Your key metrics will depend on what matters most…
25 pages | 9.48 MB | 5 months ago

5 results in total