Bring Your Own Codegen to TVM (19 pages, 504.69 KB, 5 months ago)
Excerpt: ... build_extern(mod, "dnnl")
4. Run the inference:
    exe = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0))
    data = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")
    out = exe.evaluate()(data, **params)
Relay Runtime (VM, Graph Runtime, Interpreter) / Your Dispatcher / Target Device / General Devices (CPU/GPU/FPGA)
Mark supported operators or subgraphs: 1. Implement an operator-level annotator, OR 2. Implement ...
Options:
Op-level annotation
● Simple and easy to implement 👍
● One op per subgraph results in overhead 👎 (working on an algorithm to merge annotated ops)
Graph-level annotation
● High flexibility ...

Dynamic Model in TVM (24 pages, 417.46 KB, 5 months ago)
Excerpt: ... [dispatch diagram labels: function, CPU strategy func, GPU strategy func, OpStrategy, default implement, specialized implement 1, specialized implement 2 (e.g., winograd), kernel_size <= 3, b < 8, "cpu", "gpu"]
How to register a strategy?
    @conv2d_strategy.register("cpu")
    def conv2d_strategy_cpu(attrs, inputs, out_type, target):
        strategy = OpStrategy()
        layout = attrs.data_layout
Why do we need a graph dispatcher?
1. Minimal overhead: only one dispatching operation is required for each inference.
2. Fit for operators such as conv2d_NCHWc ...

Facebook -- TVM AWS Meetup Talk (11 pages, 3.08 MB, 5 months ago)
Excerpt: ... (baseline), 40us (target) - 85x speedup - Uh oh.
Enter TVM and model co-design:
- PyTorch operator overhead makes the interpreter infeasible
- Reduce FLOPs with block-sparsified weight matrices - not a new ...
- ~10 lines of Relay IR, a few days of work
- TVM sampling model running in 30us on a single server CPU core
- Beat hand-written, highly optimized baselines (https://github.com/mozilla/LPCNet) by ~40%

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (52 pages, 1.23 MB, 1 year ago)
Excerpt: ... can be trained without the necessity of tensor parallelism, thereby decreasing the communication overhead. Moreover, in order to further improve the training efficiency, we overlap the computation of shared ... parameters, it can be completed offline at once. Through this optimization, we avoid the computational overhead for recomputing k_t^C and v_t^C during inference. D. Ablation of Attention Mechanisms; D.1. Ablation ...

Mixture-of-Experts Language Modelcan be trained without the necessity of tensor parallelism, thereby decreasing the communication overhead. Moreover, in order to further improve the training efficiency, we overlap the computation of shared parameters, it can be completed offline at once. Through this optimization, we avoid the computational overhead for recomputing k? ? and v? ? during inference. D. Ablation of Attention Mechanisms D.1. Ablation0 码力 | 52 页 | 1.23 MB | 1 年前3
 PAI & TVM Meetup - Shanghai 20191116on warp level schedule Motivation 全各 “The overhead of writing warp-level schedule for TensorCore 。Work at the scheduling level: the less the better0 码力 | 26 页 | 5.82 MB | 5 月前3 PAI & TVM Meetup - Shanghai 20191116on warp level schedule Motivation 全各 “The overhead of writing warp-level schedule for TensorCore 。Work at the scheduling level: the less the better0 码力 | 26 页 | 5.82 MB | 5 月前3
 OpenAI 《A practical guide to building agents》agents can provide intuitive separation of concepts, but can introduce additional complexity and overhead, so often a single agent with tools is sufficient. For many complex workflows, splitting up0 码力 | 34 页 | 7.00 MB | 6 月前3 OpenAI 《A practical guide to building agents》agents can provide intuitive separation of concepts, but can introduce additional complexity and overhead, so often a single agent with tools is sufficient. For many complex workflows, splitting up0 码力 | 34 页 | 7.00 MB | 6 月前3
 XDNN TVM - Nov 201920% 40% 60% 80% 100% VGG16 ResNet-50 GoogleNet-V3 Aristotle on 7020 FPGA Iphone8plus Kirin 970 CPU MEM CONTROLLER BUS Data Mover IMG WR SCHEDULER WEIGHTS WR SCHEDULER SMART MEM FABRIC IMG RD Efficiency > 50% for mainstream neural networks >> 4© Copyright 2018 Xilinx Inference Flow >> 5 MxNet CPU Layers FPGA Layers Runtime Image Model Weights Calibration Set Quantizer Compiler Tensor Graph TVM Partitioning >> 7 Subgraph 1 Parallel Subgraphs Post-Processing Pre-Processing FPGA or CPU FPGA CPU CPU FPGA - More than supported/not supported, pattern matching graph colorization - Choices how0 码力 | 16 页 | 3.35 MB | 5 月前3 XDNN TVM - Nov 201920% 40% 60% 80% 100% VGG16 ResNet-50 GoogleNet-V3 Aristotle on 7020 FPGA Iphone8plus Kirin 970 CPU MEM CONTROLLER BUS Data Mover IMG WR SCHEDULER WEIGHTS WR SCHEDULER SMART MEM FABRIC IMG RD Efficiency > 50% for mainstream neural networks >> 4© Copyright 2018 Xilinx Inference Flow >> 5 MxNet CPU Layers FPGA Layers Runtime Image Model Weights Calibration Set Quantizer Compiler Tensor Graph TVM Partitioning >> 7 Subgraph 1 Parallel Subgraphs Post-Processing Pre-Processing FPGA or CPU FPGA CPU CPU FPGA - More than supported/not supported, pattern matching graph colorization - Choices how0 码力 | 16 页 | 3.35 MB | 5 月前3
 TVM@AliOSTVMQ@Alios AIOS ! 驱动万物智能 PRESENTATION AGENDA 人 人 e 人 e@ TVM Q@ AliOs Overview TVM @ AliOs ARM CPU TVM @ AliOos Hexagon DSP TVM @ Alios Intel GPU Misc /NiiOS ! 驱动万物智能 PART ONE TVM Q@ AliOs Overview Multimodal Interection CPU (ARM、Intel) 1驱动万物智能 Accelerated Op Library / Others Inference Engine DSP (Qualcomm) PART TWO Alios TVM @ ARM CPU AiOS 1驱动万物智能 Alios TVMQOARM CPU 。 Support TFLite ( Open Open Source and Upstream Master ) 。, Optimize on INT8 & FP32 AiiOS ! 驱动万物智能 Alios TVM @ ARM CPU INT8 * Cache 芍四 Data FO Data FOData … QNNPACK Convolution 。,NHWC layout Cach, 浆百0 码力 | 27 页 | 4.86 MB | 5 月前3 TVM@AliOSTVMQ@Alios AIOS ! 驱动万物智能 PRESENTATION AGENDA 人 人 e 人 e@ TVM Q@ AliOs Overview TVM @ AliOs ARM CPU TVM @ AliOos Hexagon DSP TVM @ Alios Intel GPU Misc /NiiOS ! 驱动万物智能 PART ONE TVM Q@ AliOs Overview Multimodal Interection CPU (ARM、Intel) 1驱动万物智能 Accelerated Op Library / Others Inference Engine DSP (Qualcomm) PART TWO Alios TVM @ ARM CPU AiOS 1驱动万物智能 Alios TVMQOARM CPU 。 Support TFLite ( Open Open Source and Upstream Master ) 。, Optimize on INT8 & FP32 AiiOS ! 驱动万物智能 Alios TVM @ ARM CPU INT8 * Cache 芍四 Data FO Data FOData … QNNPACK Convolution 。,NHWC layout Cach, 浆百0 码力 | 27 页 | 4.86 MB | 5 月前3
 Trends Artificial Intelligence
additional high-cost layers: research, data acquisition and hosting, and a mix of salaries, general overhead, and go-to-market operations. Even as the cost to train models climbs, a growing share of total0 码力 | 340 页 | 12.14 MB | 4 月前3 Trends Artificial Intelligence
additional high-cost layers: research, data acquisition and hosting, and a mix of salaries, general overhead, and go-to-market operations. Even as the cost to train models climbs, a growing share of total0 码力 | 340 页 | 12.14 MB | 4 月前3
 TVM Meetup: QuantizationTarget-independent Relay passes Target-optimized graph Target-dependent Relay passes Intel x86 ARM CPU Nvidia GPU ARM GPU Schedule templates written in TVM Tensor IR .. More targets AutoTVM – Tuning passes Target-independent Relay passes Target-optimized Int8 Relay Graph Intel x86 schedule ARM CPU schedule Nvidia GPU schedule ARM GPU schedule Relay Int8 Graph Target-dependent Relay layout opt© passes Target-independent Relay passes Target-optimized Int8 Relay Graph Intel x86 schedule ARM CPU schedule Nvidia GPU schedule ARM GPU schedule Relay Int8 Graph Target-dependent Relay layout opt©0 码力 | 19 页 | 489.50 KB | 5 月前3 TVM Meetup: QuantizationTarget-independent Relay passes Target-optimized graph Target-dependent Relay passes Intel x86 ARM CPU Nvidia GPU ARM GPU Schedule templates written in TVM Tensor IR .. More targets AutoTVM – Tuning passes Target-independent Relay passes Target-optimized Int8 Relay Graph Intel x86 schedule ARM CPU schedule Nvidia GPU schedule ARM GPU schedule Relay Int8 Graph Target-dependent Relay layout opt© passes Target-independent Relay passes Target-optimized Int8 Relay Graph Intel x86 schedule ARM CPU schedule Nvidia GPU schedule ARM GPU schedule Relay Int8 Graph Target-dependent Relay layout opt©0 码力 | 19 页 | 489.50 KB | 5 月前3
15 results in total. Pages: 1, 2.













