JVM Memory Model (JVM 内存模型) | 1 page | 48.42 KB | 1 year ago
C++20: An (Almost) Complete Overview | 85 pages | 512.18 KB | 6 months ago
Excerpt: constexpr functions can now use dynamic_cast() and typeid, perform dynamic memory allocation with new/delete, and contain try/catch blocks, but still cannot throw exceptions. constexpr string & vector: std::string and std::vector are now usable in constexpr contexts. The [[nodiscard]] attribute can now include a reason, for example: [[nodiscard("Ignoring the return value will result in memory leaks.")]] void* GetData() { /* ... */ }. Bit operations header: a set of global …
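Below is a minimal sketch, not taken from the slides, of two of the C++20 features this excerpt lists: dynamic allocation and try/catch inside a constexpr function, and [[nodiscard]] with a reason string. The function names are invented for illustration.

```cpp
#include <cstddef>
#include <new>

// C++20: constexpr functions may allocate with new/delete and contain
// try/catch blocks, but still cannot actually throw during constant evaluation.
constexpr std::size_t sum_first(std::size_t n) {
    std::size_t* buf = new std::size_t[n];   // transient constexpr allocation
    std::size_t total = 0;
    try {
        for (std::size_t i = 0; i < n; ++i) {
            buf[i] = i;
            total += buf[i];
        }
    } catch (...) {
        // allowed to appear; reaching a throw would end constant evaluation
    }
    delete[] buf;                            // must be freed before returning
    return total;
}
static_assert(sum_first(10) == 45);

// C++20: [[nodiscard]] can carry a reason that compilers print in the warning.
[[nodiscard("Ignoring the return value will result in memory leaks.")]]
void* GetData() { return ::operator new(64); }

int main() {
    void* p = GetData();       // discarding the result would trigger the message above
    ::operator delete(p);
}
```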
Working with Asynchrony Generically: A Tour of C++ Executors | 121 pages | 7.73 MB | 6 months ago
Excerpt: The operation state notifies the receiver by calling one of these [completion functions] exactly once. Conceptual building blocks of P2300: concept scheduler: schedule(scheduler) -> sender; concept sender: connect(sender, …
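To make the scheduler/sender/receiver vocabulary concrete, here is a hand-rolled sketch of the connect/start handshake the excerpt refers to. It is not the real P2300 (std::execution) API; the types only mimic the protocol, and every identifier is invented.

```cpp
#include <iostream>
#include <utility>

// A receiver: the operation state completes it by calling exactly one of its
// completion functions (only set_value is shown here, for brevity).
struct print_receiver {
    void set_value(int v) { std::cout << "got " << v << "\n"; }
};

// The operation state produced by connecting a sender to a receiver.
template <class Receiver>
struct just_operation {
    Receiver rcvr;
    int value;
    void start() { rcvr.set_value(value); }  // notify the receiver exactly once
};

// A sender describing work that immediately delivers a value.
struct just_sender {
    int value;
    template <class Receiver>
    just_operation<Receiver> connect(Receiver r) {
        return {std::move(r), value};
    }
};

int main() {
    // connect(sender, receiver) yields an operation state; start() runs it.
    auto op = just_sender{42}.connect(print_receiver{});
    op.start();
}
```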
Bringing Existing Code to CUDA Using constexpr and std::pmr | 51 pages | 3.68 MB | 6 months ago
Excerpt: Outline: Introduction, Memory, Host vs Device Functions, Return on Investment, Concluding Remarks. "I work in the RiskLab team at CSIRO on applied mathematics for financial risk. The aim of …" … add_gpu<<<…>>>(N, x, y); // … } OK, about the kernel parameters. Memory: "In a typical PC or cluster node today, the memories of the CPU and GPU are physically distinct" (https://developer.nvidia.com/blog/unified-memory-in-cuda-6/). CPU vs GPU memory: system memory vs GPU memory. "Unified Memory creates a pool of managed memory that is shared between the CPU and GPU, …"
Making Libraries Consumable for Non-C++ Developers | 29 pages | 1.21 MB | 6 months ago
Excerpt: On Windows, sizeof(wchar_t) == 2; on non-Windows, sizeof(wchar_t) == 4, so std::basic_string<wchar_t> has memory implications (more on that later). What assumptions are being made? void get_size(size_t dev, long* … … returns the struct in registers, but the get_data_from() member function returns in caller-provided memory. This is often unexpected, but occurs with the MSVC compiler for x86 with stdcall (callee cleanup) or cdecl (caller cleanup); for non-MSVC compilers, data_t is always returned in caller-provided memory. What else isn't being declared? struct data_t { int a; int b; }; /* Get data from device 'dev' …
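Not from the slides, but a sketch of one common way to sidestep the pitfalls this excerpt describes: use fixed-width integer types and let the caller provide the output object, so neither wchar_t's platform-dependent size nor the compiler's struct-return convention reaches the ABI boundary. All names are illustrative.

```cpp
#include <cstdint>

extern "C" {

// Plain data with explicitly sized fields: the layout is the same for every
// compiler and for every consuming language.
struct data_t {
    std::int32_t a;
    std::int32_t b;
};

// The caller supplies the output object, so no compiler-specific
// "struct in registers vs. hidden pointer" return convention applies.
// Returns 0 on success, nonzero on failure.
std::int32_t get_data_from(std::uint32_t dev, data_t* out) {
    if (out == nullptr) {
        return 1;
    }
    out->a = static_cast<std::int32_t>(dev);  // placeholder payload
    out->b = 0;
    return 0;
}

}  // extern "C"
```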
A Kubernetes Runtime Built on rust-vmm (基于 Rust-vmm 实现 Kubernetes 运行时) | 27 pages | 34.17 MB | 1 year ago
Excerpt: The docker cp vulnerability; pod isolation challenges: noisy neighbors impact performance on CPU, memory, bandwidth, buffer I/O, PIDs, and file descriptors (# kubectl run --rm -it bb --image=busybox sh, then / # …). [rust-vmm] abstracts the common virtualization components used to implement a Rust-based VMM. Written in Rust, a memory-safe language; secure, with minimal hardware emulation; flexible, easy to customize to fit various … … focus on correctness and performance; compiled to native code offering performance similar to C; memory management without garbage collection; designed for systems programming. Rust is a multi-paradigm …
C++ High-Performance Parallel Programming and Optimization, Lecture Slides 08: GPU Programming with CUDA (C++ 高性能并行编程与优化 - 课件 - 08 CUDA 开启的 GPU 编程) | 142 pages | 13.52 MB | 1 year ago
Excerpt: … performs the synchronization automatically, i.e. it is equivalent to cudaDeviceSynchronize(), so the earlier cudaDeviceSynchronize() can in fact be removed. Unified Memory: newer GPUs support managed (unified) memory; simply replace cudaMalloc with cudaMallocManaged, and freeing is still done with cudaFree. … [A GPU] is composed of multiple streaming multiprocessors (SMs); each SM can process one or more blocks. An SM in turn consists of multiple streaming processors (SPs), and each SP can handle one or more threads. Each SM has its own shared memory, which behaves like a CPU cache: small compared to main memory but fast, used to buffer temporary data (it also has some special properties, discussed later). The number of blocks is usually larger than the number of SMs, in which case the NVIDIA driver cycles the blocks across the SMs … there are no data dependencies inside the loop, so it can be parallelized (on a CPU that would be SIMD and instruction-level parallelism; the GPU does not have those, but the code is rewritten this way to introduce shared memory). Block shared memory: having written a dependency-free, parallelizable for loop, how do we actually make it parallel? That is what blocks are for: promote the previous threads to blocks and the for loop to threads, then take the previous local_sum …
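A host-side C++ sketch, not taken from the slides, of the cudaMalloc-to-cudaMallocManaged swap the excerpt describes. It assumes the CUDA runtime headers and library are available; the kernel launch itself is left as a comment so the block stays plain host C++.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    float* x = nullptr;

    // Unified (managed) memory: replace cudaMalloc with cudaMallocManaged and
    // the same pointer becomes usable from both the CPU and the GPU.
    cudaMallocManaged(&x, n * sizeof(float));

    for (int i = 0; i < n; ++i) x[i] = 1.0f;   // write directly from the host

    // A kernel would be launched here, e.g.  scale<<<grid, block>>>(n, x);
    // followed by cudaDeviceSynchronize() before reading results on the host.

    std::printf("x[0] = %f\n", x[0]);
    cudaFree(x);                                // managed memory is still freed with cudaFree
    return 0;
}
```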
Accelerating Tokio with Hardware (使用硬件加速 Tokio), Dai Xiang (戴翔) | 17 pages | 1.66 MB | 1 year ago
Excerpt: … Fly.io, and Embark), and it even has paid contributors! Tokio is good enough: Tokio's APIs are memory-safe, thread-safe, and misuse-resistant, which helps prevent common bugs such as unbounded queues. Software enqueue (diagram: one producer feeding multiple consumers): synchronization latency, memory/cache latency, CPU-cycle latency. DLB (Dynamic Load Balancer): enqueue logic, head and tail pointers, load balancer (diagram: multiple producers and consumers): no synchronization latency, no memory/cache latency, no CPU cycles spent. DLB-assisted channel intro (diagram: hardware-assisted senders and receiver) …
Chen Dong (陈东): Reshaping Mobile App Development with Rust (利用 Rust 重塑移动应用开发), 2023-06-18 | 22 pages | 2.10 MB | 1 year ago
Excerpt: Why Rust? Cross-platform, performance, thread safety, memory safety, and love. Rust's applications in mobile app development … PhoTto / image … [Rust on] mobile platforms is increasingly gaining attention from developers. With its impressive performance, memory safety, and concurrency features, Rust has become an ideal choice for building high-performance …
C++ High-Performance Parallel Programming and Optimization, Lecture Slides 07: A Gentle Deep Dive into Memory Access Optimization (C++ 高性能并行编程与优化 - 课件 - 07 深入浅出访存优化) | 147 pages | 18.88 MB | 1 year ago
Excerpt: Why is filling an int array with 1 twice as slow as filling it with 0? Chapter 1, memory bandwidth: CPU-bound vs memory-bound. In general, parallelism only speeds up the computation, not the memory reads and writes. So for a loop like fill, which does no computation and is pure memory access, parallelizing gives no speedup; this is a memory bottleneck (memory-bound). A loop like sine, whose body does a lot of computation per iteration (a Taylor expansion), … Case study: small-kernel convolution; unrolling the innermost loop helps. Glossary of English terms: cacheline; L1 cache, L2 cache, …; memory bandwidth; cache hit, cache miss; false sharing; prefetching; streaming (stores) … array-of-struct-of-array, Morton ordering on the Intel Xeon Phi. Modern server architectures rely on memory locality for optimal performance; data needs to be organized in a way that allows the CPUs to process the …
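A small C++/OpenMP sketch, not from the slides, of the fill-versus-sine contrast in the excerpt: the pure-memory fill loop gains little from extra threads because it is limited by memory bandwidth, while the compute-heavy sine loop scales with cores. The array size and the use of OpenMP are arbitrary choices for illustration.

```cpp
#include <cmath>
#include <vector>

// Pure memory traffic: no arithmetic per element, so extra threads mostly
// compete for the same memory bandwidth (memory-bound).
void fill(std::vector<float>& a) {
#pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        a[i] = 1.0f;
}

// Heavy arithmetic per element: the cores, not the memory bus, are the
// bottleneck, so this loop does benefit from parallelization (CPU-bound).
void sine(std::vector<float>& a) {
#pragma omp parallel for
    for (long i = 0; i < static_cast<long>(a.size()); ++i)
        a[i] = std::sin(static_cast<float>(i));
}

int main() {
    std::vector<float> a(1 << 26);  // ~64M floats, well beyond cache capacity
    fill(a);
    sine(a);
    return static_cast<int>(a[1]);  // observe the data so the loops are not optimized away
}
```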