Designing an ultra low-overhead multithreading runtime for Nim
from? ◇ 3 years ago: started writing a tensor library in Nim. ◇ 2 threading APIs at the time: OpenMP and simple threadpool ◇ 1 year ago: complete refactoring of the internals 3 Agenda ◇ Understanding - Userland, lightweight context switches - Cannot use hardware threads Preemptive: - PThreads (OpenMP, TBB, Cilk, …) - Scheduled by the OS, heavier context switches - Need synchronization primitives: bus, caches, ... 14 Data parallelism Parallel for loop - Same instructions on multiple data - OpenMP - Use-cases - Vectors, matrices, multi-dimensional arrays and tensors - Challenges: - Nested
37 pages | 556.64 KB | 1 year ago

Modern C++ for Parallelism in High Performance Computing
what extent we achieve scaling for different parallelization strategies: C-style programming with OpenMP, native mechanisms in modern C++, as well as through Kokkos and Sycl. Discussion An important corner distributed memory, and OpenMP (Open Multi-Processing) for shared memory. In this project we focus mostly on the shared memory aspect and use OpenMP as the performance baseline. (1) OpenMP has standard bindings ‘C-style’ implementation based on simple loops and linear vectors for floating point data storage. (2) OpenMP also has bindings to C++, where it can exploit a random access iterator. This means that we reimplement
3 pages | 91.16 KB | 6 months ago

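Design and Application of Vector Databases in the Era of Large Models
• Faiss OpenMP threading rework • Parse the source with LLVM to find every OpenMP directive statement • Convert them into calls to a custom thread pool with lambda expressions • Replace shared variables and add concurrency protection PieCloudVector • Faiss OpenMP threading rework • Control the global thread count • Reduce thread lock contention • Reduce memory usage PieCloudVector • Faiss OpenMP threading rework • Avoid idle threads PieCloudVector • Faiss OpenMP threading rework • Large increase in QPS PieCloudVector • Faiss OpenMP threading rework • Large reduction in memory footprint PieCloudVector • Integrating Faiss with the postgres kernel - special path for GPU search • Avoid concurrent GPU calls • Submit query requests in batches from a single thread PieCloudVector • Compatible with domestic hardware and operating systems
28 pages | 1.69 MB | 1 year ago
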
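High-Performance Parallel Programming and Optimization in C++ - Slides - 01 Learning C++ Starting from CMake
memory management 3. Advanced modern C++: template metaprogramming and functional programming 4. How compilers optimize automatically: C++ from the assembly perspective 5. Multithreaded programming since C++11: from mutex to lock-free parallelism 6. Common parallel programming frameworks: OpenMP and Intel TBB 7. The overlooked memory-access optimization: memory bandwidth and CPU cache mechanisms 8. GPU topics: warp scheduling, shared memory, barrier 9. Parallel algorithms in practice: reduce, scan rules; arguments prepared for g++ may not apply to MSVC. • CMake can automatically detect the current compiler and which flags need to be added. For OpenMP, for example, it is enough to specify target_link_libraries(a.out OpenMP::OpenMP_CXX) in CMakeLists.txt. The output executable The multiple input source files Command-line invocation of CMake • Reads the CMakeLists in the current directory range-v3::range-v3 4. TBB::tbb 5. OpenVDB::openvdb 6. Boost::iostreams 7. Eigen3::Eigen 8. OpenMP::OpenMP_CXX • Packages often depend on one another, and the scripts that package-manager authors write for find_package (for example /usr/lib/cmake/TBB/TBBConfig.cmake) can automatically find all dependencies and make use of the just-mentioned …
32 pages | 11.40 MB | 1 year ago
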
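High-Performance Parallel Programming and Optimization in C++ - Slides - 04 Compiler Optimization from the Assembly Perspective
memory management 3. Advanced modern C++: template metaprogramming and functional programming 4. How compilers optimize automatically: C++ from the assembly perspective 5. Multithreaded programming since C++11: from mutex to lock-free parallelism 6. Common parallel programming frameworks: OpenMP and Intel TBB 7. The overlooked memory-access optimization: memory bandwidth and CPU cache mechanisms 8. GPU topics: warp scheduling, shared memory, barrier 9. Parallel algorithms in practice: reduce, scan the __restrict keyword dispels the compiler's worries! Now only one SIMD version needs to be generated, with no more anxiety about checking for overlap at run time. SIMD version Vectorization in loops: forced vectorization with OpenMP Besides using __restrict to let the compiler apply SIMD optimization with confidence, you can also use this OpenMP directive: […] to force the compiler to ignore the pointer-aliasing problem and enable SIMD optimization. You do have to turn on the -fopenmp option for the compiler, though. Vectorization in loops: hinting the compiler to ignore pointer aliasing Test results The SOA + unroll approach is 5 times faster than before optimization! In the parallel case the fastest is also SOA. Single-threaded SOA + unroll even slightly outperforms the parallel AOS version! Clearly OpenMP is no cure-all; a carefully optimized single-threaded program can still beat mindless parallelization. Conclusion: SOA is the most efficient data layout for this case Chapter 7: STL containers std::vector: also has the pointer-aliasing problem
108 pages | 9.47 MB | 1 year ago
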
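The two vectorization hints mentioned in the snippet above can be sketched roughly as follows. This is a minimal illustration, not the slides' own code: the directive itself did not survive in the snippet, so `#pragma omp simd` (the standard OpenMP SIMD directive) is an assumption, and the function names `scale_restrict` and `scale_omp_simd` are hypothetical.

```cpp
#include <cstddef>

// Hypothetical helper: dst[i] = src[i] * k.
// __restrict is a compiler extension (GCC, Clang, MSVC) promising that
// dst and src never overlap, so no run-time overlap check is needed.
void scale_restrict(float *__restrict dst, const float *__restrict src,
                    std::size_t n, float k) {
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * k;
    }
}

// Same loop, but the hint comes from OpenMP instead of the pointer types.
// Assumption: the directive meant in the slides is `#pragma omp simd`.
// It only takes effect with -fopenmp (or -fopenmp-simd on GCC/Clang);
// without such a flag the pragma is ignored and the loop stays correct.
void scale_omp_simd(float *dst, const float *src, std::size_t n, float k) {
#pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = src[i] * k;
    }
}
```

Both functions compute the same result when `dst` and `src` do not overlap; passing overlapping ranges breaks the promise made to the compiler in either variant.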
Khronos APIs for Heterogeneous Compute and Safety: SYCL and SYCL SC
64); parallel_for_each(e, [=](index<2> idx) restrict(amp) { ptr[idx] *= 2.0f; }); Here we’re using OpenMP as an example float *h_a = { … }, d_a; cudaMalloc((void **)&d_a, size); cudaMemcpy(d_a, h_a, size 64>>>(a, b, c); cudaMemcpy(d_a, h_a, size, cudaMemcpyDeviceToHost); Examples: - OpenCL, CUDA, OpenMP, SYCL 2020 Implementation: - Data is moved to the device via explicit copy APIs Here we’re using
82 pages | 3.35 MB | 6 months ago

Heterogeneous Modern C++ with SYCL 2020
Chair of SYCL Heterogeneous Programming Language ● ISO C++ Directions Group past Chair ● Past CEO OpenMP ● ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong ● michael@codeplay.com Application uses SYCL, Kokkos, Raja SYCL in HPC/Supercomputers CUDA/pthreads/ OpenACC/OpenCL OpenMP for C and Fortran Need Languages that allow control of these Data Issues Set Data affinity, Data
114 pages | 7.94 MB | 6 months ago

cppcon 2021 safety guidelines for C parallel and concurrency
Chair of SYCL Heterogeneous Programming Language ● ISO C++ Directions Group past Chair ● Past CEO OpenMP ● ISOCPP.org Director, VP http://isocpp.org/wiki/faq/wg21#michael-wong ● michael@codeplay.com ●
52 pages | 3.14 MB | 6 months ago

Interesting Upcoming Features from Low Latency, Parallelism and Concurrency
collection, and optimization processes. Useful for: ● Lock-free data structures ● Parallel reductions (OpenMP) ● Optimization algorithms ● Statistics collection Proposed interface namespace std { template
56 pages | 514.85 KB | 6 months ago

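High-Performance Parallel Programming and Optimization in C++ - Slides - 10 From Sparse Data Structures to Quantized Data Types
sizeof(tbb::spin_mutex) = 1 byte…… Teacher Xiaopeng's (小彭老师) solution: the accessor pattern Caching the addresses of blocks that have already been written avoids the cost of looking up the global table repeatedly. The cache lives in a member map of the accessor. I mark the accessor object as firstprivate with OpenMP, which means the map is thread-local, so accessing it needs no locking and is faster. Applied to the SNode system from earlier std::unordered_map does not support omp
102 pages | 9.50 MB | 1 year ago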













