CUDA C++ Programming Guide (Overview)

Date: 2023-07-09

CUDA C++ Programming Guide

The programming guide to the CUDA model and interface.

Changes from Version 10.0

Use CUDA C++ instead of CUDA C to clarify that CUDA C++ is a C++ language extension, not a C language.
General wording improvements throughout the guide.
Fixed minor typos in code examples.
Updated From Graphics Processing to General Purpose Parallel Computing.
Added reference to NVRTC in Just-in-Time Compilation.
Clarified linear memory address space size in Device Memory.
Clarified usage of CUDA_API_PER_THREAD_DEFAULT_STREAM when compiling with nvcc in Default Stream.
Updated Host Functions (Callbacks) to use cudaLaunchHostFunc instead of the deprecated cudaStreamAddCallback.
Clarified that 8 GPU peer limit only applies to non-NVSwitch enabled systems in Peer-to-Peer Memory Access.
Added section IOMMU on Linux.
Added reference to CUDA Compatibility in Versioning and Compatibility.
Extended list of types supported by __ldg() function in Read-Only Data Cache Load Function.
Documented support for unsigned short int with atomicCAS().
Added section Address Space Predicate Functions.
Added removal notice for deprecated warp vote functions on devices with compute capability 7.x or higher in Warp Vote Functions.
Added documentation for __nv_aligned_device_malloc() in Dynamic Global Memory Allocation and Operations.
Added documentation of cudaLimitStackSize in CUDA Dynamic Parallelism Configuration Options.
Added synchronization performance guideline to CUDA Dynamic Parallelism Synchronization.
Documented performance improvement of roundf(), round() and updated Maximum ULP Error Table for Mathematical Standard Functions.
Updated Performance Guidelines Multiprocessor Level for devices of compute capability 7.x.
Clarified Shared Memory carve out description in Compute Capability 7.x Shared Memory.
Added missing Stream CUStream Mapping to Driver API.
Added remark about driver and runtime API inter-operability, highlighting cuDevicePrimaryCtxRetain() in Driver API Context.
Updated default value of CUDA_CACHE_MAXSIZE and removed no longer supported environment variables from CUDA Environment Variables.
Added new Unified Memory sections: System Allocator, Hardware Coherency, Access Counters.
Added section External Resource Interoperability.

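1. Introduction

1.1. From Graphics Processing to General Purpose Parallel Computing

Driven by the insatiable market demand for real-time, high-definition 3D graphics, the programmable Graphics Processing Unit (GPU) has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational power and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.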

Figure 1. Floating-Point Operations per Second for the CPU and GPU

Figure 2. Memory Bandwidth for the CPU and GPU

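The reason for the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for highly parallel computation, which is exactly what graphics rendering requires. It is therefore designed so that more transistors are devoted to data processing rather than to data caching and flow control, as illustrated schematically by Figure 3.

Figure 3. The GPU Devotes More Transistors to Data Processing

This conceptually works for highly parallel computations because the GPU can hide memory access latencies with computation, instead of avoiding memory access latencies through large data caches and flow control.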

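Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up their computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications, such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition, can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.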
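As a rough illustration of mapping data elements to threads, the sketch below uses plain host-side C++ so that it runs without a GPU. It emulates the flat index that a CUDA thread typically computes from its block and thread coordinates. The function names `scale_element` and `scale_all` are hypothetical and chosen only for this example. In actual CUDA C++ the loop nest would disappear: the body would become a `__global__` kernel, and each thread would read `blockIdx.x`, `blockDim.x`, and `threadIdx.x` itself.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Work done by one logical "thread": it handles exactly one data element.
// The index formula mirrors the usual CUDA expression
// blockIdx.x * blockDim.x + threadIdx.x.
void scale_element(std::size_t block_idx, std::size_t block_dim,
                   std::size_t thread_idx, float factor,
                   std::vector<float>& data) {
    std::size_t i = block_idx * block_dim + thread_idx;
    if (i < data.size()) {  // guard: the last block may be only partly used
        data[i] *= factor;
    }
}

// Sequential stand-in for a kernel launch: visits every (block, thread) pair.
// On a GPU these iterations would run as parallel threads.
void scale_all(std::vector<float>& data, float factor, std::size_t block_dim) {
    assert(block_dim > 0);
    // Round up so that every element is covered by some block.
    std::size_t grid_dim = (data.size() + block_dim - 1) / block_dim;
    for (std::size_t b = 0; b < grid_dim; ++b) {
        for (std::size_t t = 0; t < block_dim; ++t) {
            scale_element(b, block_dim, t, factor, data);
        }
    }
}
```

Calling `scale_all(v, 2.0f, 4)` on a five-element vector uses two blocks of four logical threads; the three surplus threads in the second block fall outside the data and do nothing, which is why the bounds check is needed.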

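1.2. CUDA: A General-Purpose Parallel Computing Platform and Programming Model

In November 2006, NVIDIA introduced CUDA®, a general-purpose parallel computing platform and programming model that uses the parallel compute engine in NVIDIA GPUs to solve many complex computational problems more efficiently than on a CPU.

CUDA comes with a software environment that allows developers to use C++ as a high-level programming language. As illustrated by Figure 4, other languages, application programming interfaces, and directives-based approaches are also supported, such as FORTRAN, DirectCompute, and OpenACC.

Figure 4. GPU Computing Applications. CUDA is designed to support various languages and application programming interfaces.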

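1.3. A Scalable Programming Model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. The challenge is to develop application software that transparently scales its parallelism to take advantage of the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while keeping the learning curve low for programmers familiar with standard programming languages such as C.

At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization. They are exposed to the programmer simply as a minimal set of language extensions.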

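These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block. This decomposition preserves language expressivity by allowing threads to cooperate when solving each sub-problem, and at the same time enables automatic scalability. Indeed, each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially, so that a compiled CUDA program can execute on any number of multiprocessors, as illustrated by Figure 5, and only the runtime system needs to know the physical multiprocessor count.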
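The independence of thread blocks can be sketched on the host as follows. This is a simplified C++ emulation under stated assumptions, not CUDA code: `reduce_block`, `run_blocks`, and `total` are hypothetical names, the local accumulator only stands in for per-block shared memory, and the blocks run one after another in a caller-supplied order. The point it tries to show is that, because no block depends on another block's result, any execution order yields the same partial sums.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// One "thread block": reduces its own slice of the input.
// shared_sum stands in for per-block shared memory; on a GPU the threads of
// the block would combine their values there, separated by barriers.
long long reduce_block(const std::vector<int>& data, std::size_t block_idx,
                       std::size_t block_dim) {
    std::size_t begin = block_idx * block_dim;
    std::size_t end = std::min(begin + block_dim, data.size());
    long long shared_sum = 0;
    for (std::size_t i = begin; i < end; ++i) {
        shared_sum += data[i];
    }
    return shared_sum;
}

// Runs the blocks in the given order. Each block writes only its own slot,
// so any permutation of block indices produces the same partial sums.
std::vector<long long> run_blocks(const std::vector<int>& data,
                                  std::size_t block_dim,
                                  const std::vector<std::size_t>& order) {
    assert(block_dim > 0);
    std::size_t grid_dim = (data.size() + block_dim - 1) / block_dim;
    assert(order.size() == grid_dim);
    std::vector<long long> partial(grid_dim, 0);
    for (std::size_t b : order) {
        assert(b < grid_dim);
        partial[b] = reduce_block(data, b, block_dim);
    }
    return partial;
}

// Final combination step over the per-block results.
long long total(const std::vector<long long>& partial) {
    return std::accumulate(partial.begin(), partial.end(), 0LL);
}
```

For ten input values and a block size of four, running the three blocks in the order 0, 1, 2 or in the order 2, 0, 1 fills the same three slots, so the combined total does not depend on how the blocks were scheduled.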

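This scalable programming model allows the GPU architecture to span a wide market range by simply scaling the number of multiprocessors and memory partitions: from the high-performance enthusiast GeForce GPUs and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs (see CUDA-Enabled GPUs for a list of all CUDA-enabled GPUs).

Figure 5. Automatic Scalability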

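Note: A GPU is built around an array of Streaming Multiprocessors (SMs) (see Hardware Implementation for more details). A multithreaded program is partitioned into blocks of threads that execute independently from each other, so that a GPU with more multiprocessors will automatically execute the program in less time than a GPU with fewer multiprocessors.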
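The scaling claim in the note can be made concrete with a deliberately idealized model. The sketch below assumes that each multiprocessor runs one block at a time and that all blocks take equally long; real hardware can keep several blocks resident per multiprocessor, so this is only a back-of-the-envelope estimate. The function name `waves` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Number of scheduling rounds ("waves") needed to run all blocks, assuming
// one block per multiprocessor at a time and equal block durations.
std::size_t waves(std::size_t num_blocks, std::size_t num_sms) {
    assert(num_sms > 0);
    return (num_blocks + num_sms - 1) / num_sms;  // ceiling division
}
```

Under these assumptions, a program with 8 blocks needs 4 rounds on a GPU with 2 multiprocessors and 2 rounds on a GPU with 4, without any change to the program itself.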

1.4. Document Structure

This document is organized into the following chapters:

Chapter Introduction is a general introduction to CUDA.
Chapter Programming Model outlines the CUDA programming model.
Chapter Programming Interface describes the programming interface.
Chapter Hardware Implementation describes the hardware implementation.
Chapter Performance Guidelines gives some guidance on how to achieve maximum performance.
Appendix CUDA-Enabled GPUs lists all CUDA-enabled devices.
Appendix C++ Language Extensions is a detailed description of all extensions to the C++ language.
Appendix Cooperative Groups describes synchronization primitives for various groups of CUDA threads.
Appendix CUDA Dynamic Parallelism describes how to launch and synchronize one kernel from another.
Appendix Mathematical Functions lists the mathematical functions supported in CUDA.
Appendix C++ Language Support lists the C++ features supported in device code.
Appendix Texture Fetching gives more details on texture fetching.
Appendix Compute Capabilities gives the technical specifications of various devices, as well as more architectural details.
Appendix Driver API introduces the low-level driver API.
Appendix CUDA Environment Variables lists all the CUDA environment variables.
Appendix Unified Memory Programming introduces the Unified Memory programming model.