Facebook -- TVM AWS Meetup Talk

... GRU units and FC layers
- 24kHz sampling frequency requires 40us sampling net runtime
- First PyTorch model used a 3,400us sampling net runtime
Image from LPCNet

Exit, Pursued By A Bear
- 3,400us (baseline), 40us (target)
- 85x speedup
- Uh oh

Enter, TVM and model co-design
- PyTorch operator overhead makes interpreter infeasible
- Reduce FLOPs with block-sparsified weight matrices
- not ... trades off icache/dcache
- also available today in FBGEMM

PyTorch and TVM
- Lots of opportunity in PyTorch
- Graph optimization
- Existing fusion infrastructure fairly limited (CUDA-only, injective-only)
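The runtime figures in the slides can be checked with a little arithmetic. At 24kHz, one sample must be produced roughly every 1/24,000 of a second, which is about 41.7us; the slides appear to round this to a 40us target. The sketch below is illustrative only: the function names are hypothetical and not from the talk, and the inputs are the numbers quoted in the slides.

```python
# Minimal sketch of the latency-budget arithmetic quoted in the slides.
# Function names are hypothetical; the inputs (24 kHz, 3,400 us, 40 us)
# are taken from the slide text.

def per_sample_budget_us(sample_rate_hz: float) -> float:
    """Time available to generate one audio sample, in microseconds."""
    return 1_000_000.0 / sample_rate_hz


def required_speedup(baseline_us: float, target_us: float) -> float:
    """Factor by which the baseline runtime must shrink to hit the target."""
    return baseline_us / target_us


budget = per_sample_budget_us(24_000)   # real-time budget at 24 kHz
speedup = required_speedup(3400, 40)    # first PyTorch model vs. target

print(f"budget per sample: {budget:.2f} us")  # budget per sample: 41.67 us
print(f"required speedup: {speedup:.0f}x")    # required speedup: 85x
```

The 40us target sits slightly under the raw 41.7us budget, and dividing the 3,400us baseline by it gives the 85x figure on the slide.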













