元鉴 Yuanjian

中文内容

已翻译official company source英文原文2026-06-05

开发者现在可以在大型现有 C++ GPU 代码库中使用 NVIDIA CUDA Tile 编程，借助基于 tile 的抽象来开发高度优化的 GPU 内核。

随 NVIDIA CUDA 13.1 推出的 NVIDIA CUDA Tile 为 GPU 引入了基于 tile 的编程。CUDA Tile 设计了一个顶层语言层，以及另一个可供任何高级编程语言作为目标的中间层，能够自动利用 NVIDIA 硬件的先进能力——包括张量核心、共享内存和张量内存加速器——而无需应用程序直接以它们为目标。

Python 是首个支持基于 tile 的 GPU 应用程序的语言。最新发布的 CUDA 13.3 增加了对使用 C++ 编写 tile 内核的支持，使开发者能够构建高度优化的 GPU 内核。

什么是 CUDA Tile C++？

CUDA Tile C++ 是 CUDA Tile 编程模型在 C++ 中的一种表达，构建于 CUDA Tile IR 规范之上。它使开发者能够使用 C++ 编写 tile 内核，并采用基于 tile 的模型来表达 GPU 内核，而不是仅使用单指令多线程（SIMT）模型，或在其基础上结合使用。

作为回顾，在 tile 模型中：

多维数组是主要的数据存储形式。
Tiles 是数组中供内核运行处理的部分。
核函数是由线程块并行执行的函数。
线程块是 GPU 的子集；对图块的操作会在每个线程块中的所有线程之间并行化。

CUDA Tile C++ 会自动处理线程块内的并行性，以及异步、内存移动和 GPU 编程的其他底层细节。CUDA Tile C++ 可在不同的 NVIDIA GPU 架构之间移植，使开发者能够使用最新的硬件特性，而无需重写代码。

CUDA Tile C++ 向量加法示例

熟悉用于 SIMT 的 CUDA C++ 的开发者很可能见过经典的向量加法核函数。假设数据已经在 GPU 上，CUDA SIMT 中的向量加法核函数会接收两个向量，并将它们逐元素相加以生成第三个向量。这是最简单的 CUDA 核函数之一。其形式如下。

__global__ void vecAdd(float* A, float* B, float* C, int vectorLength)
{
 /* calculate my thread index */
 int workIndex = threadIdx.x + blockIdx.x*blockDim.x;

 if(workIndex < vectorLength)
 {
  /* perform the vector addition */
  C[workIndex] = A[workIndex] + B[workIndex];
 }
}

在这个核函数中，每个线程的工作都被明确指定，而程序员在启动该核函数时，会指定要启动的块数和线程数。

来看用 CUDA Tile C++ 编写的等效代码，无需指定每个线程要做什么。只需将数据划分为图块，并指定这些图块的数学运算。其余一切都会被处理。

CUDA Tile C++ 核函数如下所示：

#include "cuda_tile.h"
__tile_global__ void vectorAdd(float* a, float* b, float* out, size_t n) {

/* set up the namespace */
  namespace ct = cuda::tiles;
  using namespace ct::literals;

/* attach shape to raw pointers */
  auto aSpan = ct::tensor_span{a,   ct::extents{n}};
  auto bSpan = ct::tensor_span{b,   ct::extents{n}};
  auto oSpan = ct::tensor_span{out, ct::extents{n}};

/* partition each span into tiles of size 8 */
  auto aView = ct::partition_view{aSpan, ct::shape{8_ic}};
  auto bView = ct::partition_view{bSpan, ct::shape{8_ic}};
  auto oView = ct::partition_view{oSpan, ct::shape{8_ic}};

/* load the a and b tiles from global memory */
  int bx = ct::bid().x;
  auto aTile = aView.load(bx);          // load bx-th tile
  auto bTile = bView.load(bx);

/* add the two tiles together, elementwise */
  auto oTile = aTile + bTile;

/* store the result tile to the output partition. */
  oView.store(oTile, bx); 
}

对于一个简单的 vectorAdd 内核来说，这看起来像是大量代码。不要惊慌。这个过于冗长的内核用于按顺序展示所有步骤。后面会给出一个用更少代码行完成相同工作的简化版本。

第一个区别是使用 __tile_global__ 向编译器表明这是一个 tile 内核。数组指针和数组大小作为参数传入，就像在 SIMT 内核中一样。

__tile_global__ void vectorAdd(float* a, float* b, float* out, std::size_t n) {

然后，为 cuda::tiles 和 ct::literals 设置命名空间。

  namespace ct = cuda::tiles;  
  using namespace ct::literals;

使用这段代码 ct::tensor_span 为三个数组分别创建一个张量跨度。张量跨度本质上是指向内存中多维数组的指针，类似于 C++23 的 std::mdspan。张量跨度携带有关数组形状（范围）以及数组元素布局（例如行优先或列优先）的信息。ct::extents{} 会告诉张量跨度数组的维度是什么。1D 数组使用 n。

auto aSpan = ct::tensor_span{a,   ct::extents{n}};
auto bSpan = ct::tensor_span{b,   ct::extents{n}};
auto oSpan = ct::tensor_span{out, ct::extents{n}}

现在，从张量跨度和瓦片形状创建一个分区视图。分区视图是张量跨度的包装器，它将数组呈现为一系列不重叠、固定大小的分区。每个分区的大小由 shape 参数指定，该参数必须是编译期参数。在此示例中，8_ic 是一个整数常量，由 ct::literals 定义。ct::shape<8>{} 和 ct::shape{8_ic} 在此上下文中是等价的。创建的分区视图本质上就是原始数组，被切分成大小为 8 的块，这也是瓦片大小。

  auto aView = ct::partition_view{aSpan, ct::shape{8_ic}};
  auto bView = ct::partition_view{bSpan, ct::shape{8_ic}};
  auto oView = ct::partition_view{oSpan, ct::shape{8_ic}};

通过使用 ct::bid().x 获取 X 维度上的块索引来加载输入瓦片。如果使用多维块，也要使用 Y 和 Z 维度。然后加载 a 和 b 瓦片。为方便起见使用 auto，但明确来说，aTile 和 bTile 的类型是 ct::tile<float, ct::shape<8>>>。它们是大小为 8 的一维瓦片，元素类型为 float。借助分区视图，可以轻松传入块索引。load 函数会自动获取数组中正确的块，并将其加载到瓦片中。

int bx = ct::bid().x;
auto aTile = aView.load(bx);         
auto bTile = bView.load(bx);

进行加法并存储结果。这一行代码对输入瓦片执行逐元素加法，并将结果存储到输出瓦片中。将该输出瓦片存储到 oView 分区视图中，使用 X 维度上的相同块索引 bx 进行索引。

/* add the two tiles together, elementwise. */
auto oTile = aTile + bTile;

/* store the result tile to the output partition. */
oView.store(oTile, bx);

完整的向量加法示例

以下示例展示了如何通过一段完整、可运行的 C++ 代码调用这个向量加法 kernel。

有几点需要注意，以帮助编译器进行优化。

首先，为获得最佳性能，在 kernel 运行期间，输入数组和输出数组应仅通过各自对应的指针访问。当满足这一条件时，这些数组不存在别名——即不会通过其他指针或符号访问这些数组。使用 __restrict__ 修饰符标记数组指针，可以将这一点传达给编译器。

使用基指针对齐到 16 字节边界的数组，有助于编译器生成更高效的内存访问模式。通过对每个 kernel 参数调用 ct::assume_aligned<16>，告知编译器这些指针已对齐。使用这些调用的返回值，以便编译器利用这种对齐。由 cudaMalloc 或类似 CUDA API 返回的指针始终满足这一要求，因为它们具有 256 字节对齐。

最后，使用远大于 8 的 tile 大小。对下面可运行的代码进行这些调整，并添加 load_masked 和 store_masked 的使用，它们用于处理可能无法被 tile 大小整除的数据。

以下是完整代码，包括 kernel 和 main 函数。请注意其中应用的优化以及降低的冗余度。

#include <cstdio>
#include <cstdlib>
#include "cuda_tile.h"

__tile_global__ void vectorAdd(float* __restrict__ a, float* __restrict__ b, float* __restrict__ out, size_t n) {
  namespace ct = cuda::tiles;
  using namespace ct::literals;

  a   = ct::assume_aligned(a,   16_ic);
  b   = ct::assume_aligned(b,   16_ic);
  out = ct::assume_aligned(out, 16_ic);  

  int bx     = ct::bid().x;
  
/* create partition views for the input tiles and load them */
  auto aTile = ct::partition_view{ct::tensor_span{a,   ct::extents{n}}, ct::shape{1024_ic}}.load_masked(bx);
  auto bTile = ct::partition_view{ct::tensor_span{b,   ct::extents{n}}, ct::shape{1024_ic}}.load_masked(bx);
  
/* add the two tiles together, elementwise. */
  auto oTile = aTile + bTile;

/* create the partition view for the output tile and then store the output tile*/  
  auto oView = ct::partition_view{ct::tensor_span{out, ct::extents{n}}, ct::shape{1024_ic}};  
  oView.store_masked(oTile, bx);
}  

/* define a macro to check for CUDA errors */
#define checkCudaError(X) do {\
  auto ret = X;\
  if (ret != cudaSuccess) {\
    printf("\n error on line %d, CUDART error string : %s", __LINE__, cudaGetErrorString(ret));\
    exit(1);\
  }\
} while (0)

int main() {
  constexpr size_t N = 2ULL << 25;
  constexpr int TILE_SIZE = 1024;
  constexpr int BLOCKS = (N + TILE_SIZE - 1) / TILE_SIZE;

/* declare and allocate the host arrays */
  float* h_a   = (float*)malloc(sizeof(float) * N);
  float* h_b   = (float*)malloc(sizeof(float) * N);
  float* h_out = (float*)malloc(sizeof(float) * N);

/* initialize the host arrays */
  for (size_t idx = 0; idx < N; ++idx) {
    h_a[idx] = (float)rand() / RAND_MAX;
    h_b[idx] = (float)rand() / RAND_MAX;
    h_out[idx] = -1.0f;
  }

/* declare the device arrays */
  float* d_a{nullptr};
  float* d_b{nullptr};
  float* d_out{nullptr};

/* allocate the device arrays */
  checkCudaError(cudaMalloc(&d_a, sizeof(float) * N));
  checkCudaError(cudaMalloc(&d_b, sizeof(float) * N));
  checkCudaError(cudaMalloc(&d_out, sizeof(float) * N));

/* copy the host arrays to the device arrays */
  checkCudaError(cudaMemcpy(d_a, h_a, sizeof(float) * N, cudaMemcpyHostToDevice));
  checkCudaError(cudaMemcpy(d_b, h_b, sizeof(float) * N, cudaMemcpyHostToDevice));

/* initialize the device output array to 0 */
  checkCudaError(cudaMemset(d_out, -1, sizeof(float) * N));

/* launch the kernel */
  vectorAdd<<<BLOCKS, 1>>>(d_a, d_b, d_out, N);

/* synchronize the device and check for errors */
  checkCudaError(cudaDeviceSynchronize());

/* copy the device array out back to the host */
  checkCudaError(cudaMemcpy(h_out, d_out, sizeof(float) * N, cudaMemcpyDeviceToHost));

/* compare the results to host results */

  float max_err = 0.0f;
  for (size_t idx = 0; idx < N; ++idx) {
    float expected = h_a[idx] + h_b[idx];
    max_err = fmaxf(max_err, fabsf(h_out[idx] - expected));
  }

  printf("N: %zu\n", N);
  printf("Max error: %e\n", max_err);

  checkCudaError(cudaFree(d_a));
  checkCudaError(cudaFree(d_b));
  checkCudaError(cudaFree(d_out));

  free(h_a);
  free(h_b);
  free(h_out);
}

如果熟悉启动 SIMT kernel，这个过程类似，但需要一项特定修改。该 kernel 的启动方式如下：

vectorAdd<<<BLOCKS, 1>>>(d_a, d_b, d_out, N);

启动 tile kernel 时，<<<>>> 中的第一个参数是 tile block 的数量（在 SIMT 中，这将是 thread block 的数量）。第二个参数必须为 1。用于执行 kernel 的线程数量由编译器决定；启动 tile kernel 时，始终将该参数设为 1。

在采用 NVIDIA Ampere 架构或更新 GPU、计算能力为 8.0 的设备上运行 CUDA 13.3 或更高版本时，这些命令会生成以下输出。

调整 -arch sm_120 命令以匹配相应架构；在使用 cuda_tile.h 时包含 -std=c++20；并添加 --enable-tile 选项以编译 tile kernels。

$ nvcc -std=c++20 --enable-tile -arch sm_120 -o vectorAdd vectorAdd.cu
$ ./vectorAdd
N: 67108864
Max error: 0.000000e+00

这就完成了第一个 CUDA Tile C++ 程序。

开发者工具

Tile C++ 内核可以像 SIMT 内核一样使用 NVIDIA Nsight Compute 进行分析。以下命令展示了如何使用 Nsight Compute 创建分析文件。

$ ncu -o VecAddProfile --set detailed ./vectorAdd

创建完成并使用图形界面版本的 Nsight Compute 打开后：

从下拉菜单中选择 vectorAdd 内核。
选择 Details 选项卡
展开 Tile Statistics 报告部分

图 1 显示了由 Nsight Compute 生成的性能分析结果。

图 1. 由 Nsight Compute 生成的性能分析结果，显示 vectorAdd kernel 的 tile 统计信息

请注意，Tile Statistics 报告部分包括指定的 tile block 数量、block 大小（由编译器选择）以及其他 tile 相关信息。

源代码页面也支持在源代码行级别显示 tile 内核和性能指标，就像 CUDA C++ 内核一样。

矩阵乘法

前面的示例展示了 vectorAdd，并详细说明了如何加载和存储分区视图。这个矩阵乘法示例说明了如何用非常简单的代码来表达矩阵乘法。

该内核执行一个 MxK 与 KxN 的矩阵乘法，以计算一个 MxN 矩阵。在该内核中，M=8，N=16，K 可以是可变的，只要它是 8 的倍数。设置 K=24。这些非常小的尺寸仅用于说明概念。

完整的内核如下，并附有要点概述。

#include "cuda_tile.h"

/* this kernel multiplies MxK and KxN matrices, where M=8 and N=16.  K is variable but must be divisible by 8.*/
__tile_global__ void kernel(float* __restrict__ a, float* __restrict__ b, size_t length, float* __restrict__ c) {
    namespace ct = cuda::tiles;
    using namespace ct::literals;

    a = ct::assume_aligned(a, 16_ic);
    b = ct::assume_aligned(b, 16_ic);
    c = ct::assume_aligned(c, 16_ic);

    auto aShape = ct::extents{8_ic, length};
    auto bShape = ct::extents{length, 16_ic};
    auto cShape = ct::extents{8_ic, 16_ic};

    auto aSpan = ct::tensor_span{a, aShape};
    auto bSpan = ct::tensor_span{b, bShape};
    auto cSpan = ct::tensor_span{c, cShape};

    auto aView = ct::partition_view{aSpan, ct::shape{4_ic, 8_ic}};
    auto bView = ct::partition_view{bSpan, ct::shape{8_ic, 4_ic}};
    auto cView = ct::partition_view{cSpan, ct::shape{4_ic, 4_ic}};
    
    using f32x4x4 = ct::tile<float, ct::shape<4, 4>>;
    auto accTile = ct::full<f32x4x4>(0);

    auto [xBlock, yBlock, dummy] = ct::bid();
    for (auto idx : ct::irange(0, 1 + int(length - 1) / 8)) {
        auto aTile = aView.load_masked(xBlock, idx);
        auto bTile = bView.load_masked(idx, yBlock);
        accTile = ct::mma(aTile, bTile, accTile);
    }

    cView.store_masked(accTile, xBlock, yBlock);
}

使用 ct::extents 对象为 a、b 和 c 矩阵创建范围。可使用编译期值或运行时值。M=8，N=16，但 K 是可变的。这些范围将在下一步用于创建张量跨度。

 auto aShape = ct::extents{8_ic, length};
 auto bShape = ct::extents{length, 16_ic};
 auto cShape = ct::extents{8_ic, 16_ic};

创建张量跨度。这会携带有关 a、b 和 c 的信息，用于创建分区视图。

    auto aSpan = ct::tensor_span{a, aShape};
    auto bSpan = ct::tensor_span{b, bShape};
    auto cSpan = ct::tensor_span{c, cShape};

创建 a、b 和 c 的分区视图，其中 a 被分区为 4×8，视图 b 为 8×4 视图。只要能正确整除 a 和 b 的值，就可以进行调整。这些维度还决定了 c 视图为 4×4。

  auto aView = ct::partition_view{aSpan, ct::shape{4_ic, 8_ic}};
  auto bView = ct::partition_view{bSpan, ct::shape{8_ic, 4_ic}};
  auto cView = ct::partition_view{cSpan, ct::shape{4_ic, 4_ic}};

二维分区在两个维度上进行索引。矩阵 a 的大小为 8×24，分区视图为 4×8，如图 2 所示。

aView 和 bView 的分区视图大小也决定了 accTile 的形状，accTile 是矩阵乘法期间用于累加结果的 tile。在此示例中，accTile 是一个 4×4 tile，与 cView 的形状相匹配。

A two-dimensional partition view of 8 x 24 partitioned into views of 4 x 8, and indexed by two-dimensional coordinates x and y. The first partition is (0,0), and the last partition is (1,2). — 图 2. 一个 8 x 24 矩阵的二维分区视图，分区为大小为 4 x 8 的视图

    using f32x4x4 = ct::tile<float, ct::shape<4, 4>>;
    auto accTile = ct::full<f32x4x4>(0);

使用 ct::bid 执行循环，以获取三个维度中的块索引。循环从 0 迭代到 length / 8，对应于总体 K 维度除以 8。除以 8 是因为 aView 和 bView 的 K 维度为 8。在循环内部，使用 load_masked 从 a 和 b 加载 tile，并调用 ct::mma 执行矩阵乘法，将结果累加到 accTile 中。

    auto [xBlock, yBlock, dummy] = ct::bid();

    for (auto idx : ct::irange(0, int(length / 8))) {
        auto aTile = aView.load_masked(xBlock, idx);
        auto bTile = bView.load_masked(idx, yBlock);
        accTile = ct::mma(aTile, bTile, accTile);
    }

将 accTile 的值存储到 c 的分区视图 cView 中。就这样。大部分内核代码都涉及为数据设置视图以及加载/存储数据。内核的计算部分很简单。

  cView.store_masked(accTile, xBlock, yBlock);

启动内核。由于 cView 的维度，使用 dim3(2,4)。cView 是 4×4，这意味着每个块都在计算 C 矩阵的一个 4×4 分块。由于 C 是 8×16，因此将 cView 的维度除进 C 矩阵的维度中。因为 8/4=2，且 16/4=4，所以使用 dim3(2,4) 启动内核。

  kernel<<<dim3(2, 4), 1>>>(d_a, d_b, K, d_c);

立即开始使用 CUDA Tile C++

运行 CUDA Tile C++ 程序需要以下内容：

计算能力为 8.x 或更新版本的 GPU。
NVIDIA Driver R580 或更高版本。如果 tile kernel 需要 JIT 编译，则 NVIDIA 驱动程序版本必须等同于或高于用于生成代码的 CUDA Toolkit 所关联的版本。例如，CUDA Toolkit 13.3 需要 R610 驱动程序或更高版本。
正文：CUDA Toolkit 13.3

基于 tile 的编程能力现已面向 C++ 开发者开放。立即查看文档、API 参考手册和 CUDA Toolkit 13.3，开始编写 tile C++ kernel，体验加速计算的新标准。

致谢

感谢 NVIDIA 贡献者 Jaydeep Marathe 和 Ezra Stein。

使用 NVIDIA CUDA Tile 在 C++ 中开发高性能 GPU 内核