TVM@AliOS
AliOS: driving intelligence into everything. Presentation agenda: TVM @ AliOS Overview; TVM @ AliOS ARM CPU; TVM @ AliOS Hexagon DSP; TVM @ AliOS Intel GPU; Misc. Part One, TVM @ AliOS Overview: multimodal interaction served by an inference engine with an accelerated op library and other components, running on CPU (ARM, Intel) and DSP (Qualcomm). Part Two, TVM @ ARM CPU: supports TFLite (open source, upstreamed to master); optimized for both INT8 and FP32; the INT8 path uses QNNPACK-style convolution with NHWC layout and cache-friendly data access. (Slide diagram: cache-blocked data flow.)
0 credits | 27 pages | 4.86 MB | 5 months ago
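The TFLite support mentioned above maps to Relay's TFLite frontend. A minimal sketch, assuming a model file named mobilenet_v2.tflite and the `tflite` flatbuffers package; the input name and NHWC shape are illustrative, not taken from the deck:

```python
import tvm
from tvm import relay
import tflite  # flatbuffers-generated TFLite schema package

# Load a TFLite flatbuffer ("mobilenet_v2.tflite" is a placeholder name).
with open("mobilenet_v2.tflite", "rb") as f:
    tflite_model = tflite.Model.GetRootAsModel(f.read(), 0)

# Import into Relay; NHWC matches the convolution layout the slide
# mentions for the INT8 path.
mod, params = relay.frontend.from_tflite(
    tflite_model,
    shape_dict={"input": (1, 224, 224, 3)},
    dtype_dict={"input": "float32"},
)
```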
TVM Meetup Nov. 16th - Linaro2019
Bringing together the Arm ecosystem. Linaro AI Initiative: provide best-in-class deep learning performance by leveraging neural-network acceleration in IP and SoCs from the Arm ecosystem, through collaborative engineering. An internal Jira project, restricted to Linaro members, tracks three sub-projects: Arm Compute Library, Arm NN, and Android NN Driver. The Arm Compute Library has been integrated by MATLAB Coder and ONNX Runtime. Upstream TVM target support (IP / target / hardware / options / codegen): CPU maps to target arm_cpu; hardware includes Pixel 2 (Snapdragon 835), Mate 10/Mate 10 Pro (Kirin 970), and P20/P20 Pro (Kirin 970); options -target=arm64-linux-android -mattr=+neon; codegen llvm. Also Firefly RK3399 ...
0 credits | 7 pages | 1.23 MB | 5 months ago
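For reference, a hedged sketch of compiling a toy Relay module with the arm64 Android options from that table; recent TVM spells the triple as -mtriple rather than -target, so adjust for your version, and the conv shapes are illustrative:

```python
import tvm
from tvm import relay

# Toy conv2d module to demonstrate targeting an arm64 Android CPU.
data = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
weight = relay.var("weight", shape=(16, 3, 3, 3), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.conv2d(data, weight, padding=(1, 1)))

# Mirrors the table's options: -target=arm64-linux-android -mattr=+neon.
target = "llvm -mtriple=arm64-linux-android -mattr=+neon"
with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(mod, target=target)
```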
TVM Meetup: Quantization
Compilation flow: a Relay Int8 graph passes through target-independent Relay passes, then target-dependent passes (including layout optimization), yielding a target-optimized Int8 graph. Schedule templates written in the TVM tensor IR exist per backend (Intel x86, ARM CPU, NVIDIA GPU, ARM GPU, and more targets), and AutoTVM tunes the ...
0 credits | 19 pages | 489.50 KB | 5 months ago
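A minimal sketch of that Int8 flow using Relay's built-in quantization pass; the toy conv and the qconfig values (global-scale calibration) are assumptions for illustration, not the talk's settings:

```python
import numpy as np
import tvm
from tvm import relay

# Small FP32 conv standing in for an imported model.
data = relay.var("data", shape=(1, 3, 56, 56), dtype="float32")
weight = relay.const(np.random.randn(8, 3, 3, 3).astype("float32"))
mod = tvm.IRModule.from_expr(relay.nn.conv2d(data, weight, padding=(1, 1)))

# Quantize to Int8, then build; the per-target schedule is selected
# by the target string at build time.
with relay.quantize.qconfig(calibrate_mode="global_scale", global_scale=8.0):
    qmod = relay.quantize.quantize(mod)

with tvm.transform.PassContext(opt_level=3):
    lib = relay.build(qmod, target="llvm")
```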
TVM@Alibaba AI Labs
Alibaba AI Labs & TVM. Part 1: ARM32 CPU; Part 2: HiFi4 DSP; Part 3: PowerVR GPU. The ARM32 CPU work covers input resolution, quantization, and kernel optimization on AliOS; the quantized kernels compute int8 × int8 products widened to int16 and accumulate into int32. Benchmark setup: MTK8167S (ARM32 Cortex-A35 @ 1.5 GHz) running MobileNetV2_1.0_224. (Slide chart: latency comparison.)
0 credits | 12 pages | 1.94 MB | 5 months ago
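The widening arithmetic named on the slide can be illustrated with NumPy; a toy dot product, not Alibaba's actual kernel:

```python
import numpy as np

# int8 x int8 widened to int16, accumulated in int32, so products
# never overflow the 8-bit inputs during the reduction.
a = np.random.randint(-128, 128, size=64, dtype=np.int8)
b = np.random.randint(-128, 128, size=64, dtype=np.int8)
acc = np.sum(a.astype(np.int16) * b.astype(np.int16), dtype=np.int32)
print(acc)
```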
Yealink TVM Deployment (亿联TVM部署)
... performance gain from autotuning. 3. TVM supports many kinds of hardware platforms: Intel/Arm CPU, NVIDIA/Arm GPU, VTA, and more. Deployment workflow: 1. Get a .log file from AutoTVM on Ubuntu. 2. Use the .log ...
0 credits | 6 pages | 1.96 MB | 5 months ago
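Step 2 typically means wrapping the build in apply_history_best; a sketch with a placeholder log name and a toy module standing in for the real network:

```python
import tvm
from tvm import relay, autotvm

# Toy module in place of the deployed model.
x = relay.var("x", shape=(1, 3, 224, 224), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.relu(x))

# "tuned_ops.log" is a placeholder for the AutoTVM log from step 1.
with autotvm.apply_history_best("tuned_ops.log"):
    with tvm.transform.PassContext(opt_level=3):
        lib = relay.build(mod, target="llvm")
lib.export_library("deploy.so")  # ship this library to the device
```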
TVM: Where Are We Going
Haichen Shen et al. µTVM brings TVM to bare-metal devices: it supports bare-metal JTAG-attached devices (Arm Cortex-M, RISC-V) with no OS needed (credit: Logan Weber). Upcoming for µTVM: a self-hosted runtime (credit: Logan Weber). A designed runtime JIT-compiles accelerator microcode, supports heterogeneous devices (10x better than the CPU on the same board), and moves hardware complexity into software: a HW-SW blueprint for flexible deep learning.
0 credits | 31 pages | 22.64 MB | 5 months ago
XDNN TVM - Nov 2019
(Slide chart: efficiency of VGG16, ResNet-50, and GoogleNet-V3 on the Aristotle 7020 FPGA vs. iPhone 8 Plus and Kirin 970; block diagram: memory controller, bus, data mover, image/weights write schedulers, smart memory fabric, image reader.) Efficiency > 50% for mainstream neural networks. Inference flow: an MxNet model is split between CPU layers and FPGA layers by the runtime; the image, model weights, and a calibration set feed a quantizer and compiler that emit a tensor graph. TVM partitioning: parallel subgraphs with pre- and post-processing placed on FPGA or CPU; partitioning is more than a supported/not-supported split, with pattern matching and graph colorization driving the choices of how ...
0 credits | 16 pages | 3.35 MB | 5 months ago
Bring Your Own Codegen to TVM
Build with the external codegen: mod = build_extern(mod, "dnnl"). 4. Run the inference: exe = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0)); data = np.random.uniform(size=(1, 3, 224, 224)).astype("float32"); out = exe.evaluate()(data, **params). Architecture: the Relay runtime (VM, graph runtime, or interpreter) calls your dispatcher, which routes subgraphs to your target device or to general devices (CPU/GPU/FPGA). Mark supported operators or subgraphs: 1. implement an operator-level annotator, OR 2. implement ...; then either 1. implement extern operator functions, OR 2. implement ...
0 credits | 19 pages | 504.69 KB | 5 months ago
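A self-contained variant of the slide's step 4, skipping the deck's build_extern("dnnl") step (that helper is specific to the BYOC demo) and running a plain Relay module on CPU; note that newer TVM renames ctx to device:

```python
import numpy as np
import tvm
from tvm import relay

# Plain Relay module in place of the externally-compiled one.
data_var = relay.var("data", shape=(1, 3, 224, 224), dtype="float32")
mod = tvm.IRModule.from_expr(relay.nn.relu(data_var))

# Step 4 from the slide: run inference through the Relay VM.
exe = relay.create_executor("vm", mod=mod, ctx=tvm.cpu(0))
data = np.random.uniform(size=(1, 3, 224, 224)).astype("float32")
out = exe.evaluate()(data)
```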
Dynamic Model in TVM
(Slide diagram: a generic function dispatches to a CPU strategy function or a GPU strategy function, keyed on "cpu" / "gpu"; each returns an OpStrategy holding a default implementation plus specialized implementations, e.g. winograd guarded by kernel_size <= 3, or a batch guard b < 8.) How to register a strategy? @conv2d_strategy.register("cpu") def conv2d_strategy_cpu(attrs, inputs, out_type, target): strategy = OpStrategy(); layout = attrs.data_layout ...
0 credits | 24 pages | 417.46 KB | 5 months ago
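A hedged completion of the truncated registration snippet, adding one NCHW implementation; the wrapper and topi names follow TVM's strategy module, but treat this as a sketch rather than the talk's exact code:

```python
from tvm import topi
from tvm.relay.op import OpStrategy
from tvm.relay.op.strategy.generic import (
    conv2d_strategy,
    wrap_compute_conv2d,
    wrap_topi_schedule,
)

@conv2d_strategy.register("cpu")
def conv2d_strategy_cpu(attrs, inputs, out_type, target):
    strategy = OpStrategy()
    layout = attrs.data_layout
    if layout == "NCHW":
        # Default x86 implementation; specialized ones (e.g. winograd)
        # would be added with extra guards, as in the slide's diagram.
        strategy.add_implementation(
            wrap_compute_conv2d(topi.x86.conv2d_nchw),
            wrap_topi_schedule(topi.x86.schedule_conv2d_nchw),
            name="conv2d_nchw.x86",
        )
    return strategy
```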
Facebook -- TVM AWS Meetup Talk
(~10 lines of Relay IR); a few days of work; the TVM sampling model runs in 30 µs on a single server CPU core; beats hand-written, highly optimized baselines (https://github.com/mozilla/LPCNet) by ~40%.
0 credits | 11 pages | 3.08 MB | 5 months ago
11 results in total













