GPU - IT文库 (IT Library)


Go on GPU

Changkun Ou. 2023. Go on GPU. GopherChina 2023. Session "Foundational Toolchains". changkun.de/s/gogpu. 2023 June 10. Agenda ● Basic knowledge for interacting with GPUs ● Accelerate Go programs using GPUs ● Challenges in Go when using outlooks. Basic knowledge for interacting with GPUs ○ Motivation ○ GPU Driver and Standards ○ Render and

57 pages | 4.62 MB | 1 year ago
Bridging the Gap: Writing Portable Programs for CPU and GPU

Bridging the Gap: Writing Portable Programs for CPU and GPU using CUDA. Thomas Mejstrik, Sebastian Woblistin. Content: 1 Motivation: Audience etc., Cuda crash course, Quiz time. 2 Patterns: Oldschool. Motivation, Patterns, The dark path, Cuda proposal, Thank you. Why write programs for CPU and GPU. Difference CPU/GPU: algorithms are designed differently; latency/throughput; memory bandwidth; number of cores. Why it makes sense? Library/Framework developers; embarrassingly parallel algorithms; User

124 pages | 4.10 MB | 6 months ago
Hardware Acceleration and Optimization of FFmpeg on Intel GPUs (FFmpeg在Intel GPU上的硬件加速与优化)

Hardware Acceleration and Optimization of FFmpeg on Intel GPUs. Zhao Jun (赵军), DCG/NPG @ Intel. Introducing FFmpeg VAAPI • Media pipeline review • What is FFmpeg VAAPI • Why we need FFmpeg VAAPI • Current status • Further plans • Appendix. A typical media pipeline: File Device Network Stream radeon, nouveau (?), freedreno, … • Deprecated API bridges • vdpau-va bridge • powervr-va bridge • … Introduction to Intel GPUs • Gfx Label • Gen3: Pinetrail (Pineview) • Gen4: G965 • Gen5: G4X, Ironlake (Piketon, Calpella) Kabylake • … • Intel® Processor Graphics • 3D rendering (OpenGL & Vulkan) • Media • Display and compute (CUDA & OpenCL). Intel GPU media hardware programming model: slice Ring buffer FFmpeg MSDK i965/iHD OS scheduler com1 KMD com2 com3 Batch

26 pages | 964.83 KB | 1 year ago
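High-Performance Parallel Programming and Optimization in C++ - Slides - 08 GPU Programming with CUDA (C++高性能并行编程与优化 - 课件 - 08 CUDA 开启的 GPU 编程)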

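GPU Programming with CUDA, by 彭于斌 (@archibate). Past recordings: https://www.bilibili.com/video/BV1fa411r7zp Course slides and code: https://github.com/parallel101/course Prerequisites • Have studied C/C++ programming. • Understand concepts such as malloc/free. • Be familiar with STL containers, function templates, and so on. cannot be done. Writing code that runs on the GPU • Define a function kernel and prefix it with the __global__ qualifier so that it executes on the GPU. • However, kernel cannot be called directly as kernel(); it must be called with the triple-angle-bracket syntax kernel<<<1, 1>>>(). Why? What are the two 1s for? This is explained later. • Once run, printf executes on the GPU. The kernel function executes on the GPU and is called a kernel function (核函数); any function marked __global__ is a kernel function. No output? Synchronize! • However, compiling and running that code as written does not print Hello, world!. • This is because communication between the GPU and the CPU is asynchronous for efficiency. That is, after the CPU calls kernel<<<1, 1>>>(), it does not wait for the GPU to finish executing before returning. In fact it only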

142 pages | 13.52 MB | 1 year ago
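The asynchronous launch described in the CUDA course excerpt above can be modeled on the CPU alone. The following is a minimal C++ sketch, not CUDA code: `ToyDevice`, `launch`, `synchronize`, and `output` are hypothetical names invented for illustration, and the model assumes that a launch only records work and returns immediately. A real GPU runs the kernel concurrently with the host rather than deferring it, and in CUDA the host would wait with a call such as `cudaDeviceSynchronize()`.

```cpp
#include <functional>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Toy stand-in for a device (hypothetical, for illustration only).
// launch() records work and returns; nothing has run at that point.
class ToyDevice {
public:
    using Kernel = std::function<void(std::vector<std::string>&)>;

    // Returns right after enqueuing, like an asynchronous kernel launch.
    void launch(Kernel kernel) { pending_.push(std::move(kernel)); }

    // Runs all enqueued work; the caller does not continue until it is done.
    void synchronize() {
        while (!pending_.empty()) {
            pending_.front()(output_);
            pending_.pop();
        }
    }

    // Text "printed" by kernels that have actually executed so far.
    const std::vector<std::string>& output() const { return output_; }

private:
    std::queue<Kernel> pending_;
    std::vector<std::string> output_;
};
```

In this model, a program that launches a kernel writing "Hello, world!" and then exits without calling `synchronize()` never observes the output, which mirrors the missing print described in the excerpt.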
POCOAS in C++: A Portable Abstraction for Distributed Data Structures

CPU vFast GPU vvFast PCI Bus (or other fabric). GPUs as a First-Class Computing Resource: CPU GPU PCI Bus (or other fabric) NIC Data - Historically, network comm. was CPU-centric 1) Direct GPU access to Infiniband allows GPU-to-GPU network transfers 2) Fast in-node fabrics like NVLink, Infinity Fabric allow very fast intra-node transfers

128 pages | 2.03 MB | 6 months ago
Taro: Task graph-based Asynchronous Programming Using C++ Coroutine

Existing TGPSs on Heterogeneous Computing - Challenge. Task graph: A C D B! B" (B! : CPU operation, B" : GPU operation). task_b = sched.emplace([](&){ // CPU code; // GPU code; }); // CPU thread blocks until GPU finishes operation. Atomic execution per task. Assume one CPU and one GPU. Runtime: CPU A B! C Idle GPU D B"

84 pages | 8.82 MB | 6 months ago
Heterogeneous Modern C++ with SYCL 2020

http://wongmichael.com/about ● C++11 book in Chinese: https://www.amazon.cn/dp/B00ETOV2OQ We build GPU compilers for some of the most powerful supercomputers in the world. Nevin ":-)" Liber nliber@anl Attribution 4.0 International License. SYCL Single Source C++ Parallel Programming: Standard C++ Application Code, C++ Libraries, ML Frameworks; backends: GPU, CPU, FPGA, DSP, Custom Hardware, AI/Tensor HW, Other Backends. give better performance on complex apps and libs than hand-coding. SYCL 2020 is here! Open Standard for

114 pages | 7.94 MB | 6 months ago
Bringing Existing Code to CUDA Using constexpr and std::pmr

cudaFree(x); cudaFree(y); } An Even Easier Introduction to CUDA: __global__ void add_gpu(int n, float* x, float* y) { for (int i = 0; i < n; i++) y[i] = x[i] + y[i]; } TEST_CASE("cppcon-1", "[CUDA]") { // … } TEST_CASE("cppcon-1" 20; float* x; float* y; // … add_gpu<<<1, 1>>>(N, x, y); // … }

51 pages | 3.68 MB | 6 months ago
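The `add_gpu` kernel in the excerpt above is launched with `<<<1, 1>>>`, so a single GPU thread runs the whole loop. Its result can be checked against a plain C++ function with the same loop body. The sketch below is one such host-side reference under stated assumptions: `add_cpu` and `max_error_after_add` are hypothetical names, and the fill values 1.0f and 2.0f are chosen only for illustration; none of them come from the slides.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Host-side reference with the same loop body as the kernel in the excerpt:
// y[i] = x[i] + y[i] for every i in [0, n).
void add_cpu(int n, const float* x, float* y) {
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

// Hypothetical check: fills x with 1.0f and y with 2.0f, runs add_cpu, and
// returns the largest absolute deviation of y from the expected 3.0f.
// n is assumed to be non-negative.
float max_error_after_add(int n) {
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    add_cpu(n, x.data(), y.data());
    float max_err = 0.0f;
    for (float v : y)
        max_err = std::max(max_err, std::fabs(v - 3.0f));
    return max_err;
}
```

A reference like this gives a test something to compare the device result against once the arrays have been copied back to the host.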
Meituan 2022 Technical Year-End Collection (2022年美团技术年货合辑)

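Contents: Replication (Part 1): Common Replication Models & Distributed-System Challenges; Replication (Part 2): Transactions, Consistency, and Consensus; GPU Training Optimization for TensorFlow in Meituan Waimai Recommendation Scenarios; CompletableFuture Principles and Practice: Making the Waimai Merchant-Side API Asynchronous. Engineering efficiency: Building a CI/CD Pipeline Engine; Meituan Waimai Search Based on Elasticsearch SQL Analysis and Audit System Performance Optimization Journey; Intelligent Analysis and Diagnosis of Database Anomalies; Exploration and Practice of Intelligent Computing Power for Meituan Waimai Advertising (Part 2); Calling C++ Across Languages on Linux in Practice; Applying GPUs to Fine-Ranking Model Inference in Waimai Scenarios; Cloud-Native Practice of Meituan's Cluster Scheduling System; Exploration and Practice of Advertising Platformization | Meituan Waimai Advertising Engineering Practice Series. Data: Kafka AP, with inference speed on a T4 reaching 1242 FPS; YOLOv6-s reaches 43.1% AP on COCO, with inference speed on a T4 reaching 520 FPS. For deployment, YOLOv6 supports GPU (TensorRT), CPU (OPENVINO), ARM (MNN, TNN, NCNN), and other platforms, which greatly simplifies adaptation work during engineering deployment. The project is now open source on GitHub; link: YOLOv6. Anyone who needs it is welcome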

1356 pages | 45.90 MB | 1 year ago
Distributed Ranges: A Model for Building Distributed Data Structures, Algorithms, and Views

involve experimental prototypes and early research. Problem: writing parallel programs is hard - Multi-GPU, multi-CPU systems require partitioning data - Users must manually split up data amongst GPUs / execution necessary. Multi-GPU Systems - NUMA regions: - 4+ GPUs - 2+ CPUs more memory domains - Software needed to reduce complexity. Diagram labels: CPU, NIC, GPU (Tile 0, Tile 1), Xe Link. Project Goals - Offer high-level, standard C++

127 pages | 2.06 MB | 6 months ago
