In the previous chapter, Chapter 8: Core CuTe and Type Tests, we verified our "bricks" (custom types like FP8) and our "cranes" (TMA data movement). We proved that the fundamental components function correctly.
Now, it is time to build the skyscraper.
In this chapter, we explore Blackwell Dense GEMM Tests. We will combine the new hardware features of the NVIDIA Blackwell architecture (SM100/SM120) with the flexibility of CuTe to create high-performance Matrix Multiplication kernels.
As AI models grow larger, moving 32-bit (float) or even 16-bit (half) numbers takes too long and uses too much memory.
The Solution: Narrow Precision.
Blackwell is designed to crunch these tiny numbers at incredible speeds. However, writing a kernel for Blackwell is complex. You have to coordinate the TMA (memory mover) with the Warp Specialized math engines.
Central Use Case: We want to verify a Matrix Multiplication ($D = A \times B + C$) where:

- Inputs $A$ and $B$ use narrow-precision types (FP8 on SM100, down to FP4 on SM120).
- Accumulation happens in 32-bit float for accuracy.
- The result $D$ is written out as bfloat16.
Standard C++ doesn't have 4-bit types. CUTLASS defines them for us.
- `cutlass::float_e4m3_t`: an 8-bit float (range approx. ±448).
- `cutlass::float_e2m1_t`: a 4-bit float (SM120 only).

The CollectiveBuilder

In older CUTLASS versions, you had to manually plug pipeline stages together. In CUTLASS 3.x (and for Blackwell), we use the Collective Builder.
Think of the Builder as a General Contractor. You tell it:

- The target architecture (e.g., `Sm100`).
- The data types and layouts of A, B, and the output.
- The tile and cluster geometry.

The Builder figures out the best blueprint (pipeline stages, kernel schedule) for you.
Blackwell introduces Clusters: groups of Streaming Multiprocessors (SMs) that work together.
Let's look at how a test is constructed in test/unit/gemm/device/sm100_tensorop_gemm/.
First, we define the shape of the problem and the specific data types.
```cpp
// 1. Define the shapes
// The Tile size for the math instruction
using MmaTileShape = Shape<_64, _8, _128>;

// The Cluster size (4 Thread Blocks working together)
using ClusterShape = Shape<_4, _1, _1>;

// 2. Define Types (Input A is FP8)
using ElementA = cutlass::float_e4m3_t;
using LayoutA  = cutlass::layout::RowMajor;

// Input B is also FP8
using ElementB = cutlass::float_e4m3_t;
using LayoutB  = cutlass::layout::ColumnMajor;
```
Explanation: We are setting up a math operation with 8-bit inputs. `Shape<_64, _8, _128>` means each MMA tile covers 64 rows of A and 8 columns of B, reducing over 128 elements along K.
The Epilogue handles writing the result to memory. We use the Builder to create it.
```cpp
using CollectiveEpilogue = typename cutlass::epilogue::collective::CollectiveBuilder<
    cutlass::arch::Sm100,                            // Target Architecture
    cutlass::arch::OpClassTensorOp,                  // Use Tensor Cores
    MmaTileShape, ClusterShape,                      // Geometry
    cutlass::epilogue::collective::EpilogueTileAuto, // Auto-size output tile
    float, float,                                    // Accumulator, Compute types
    void, LayoutC, 0,                                // Source C (void means no C operand)
    cutlass::bfloat16_t, LayoutD, 16,                // Output D (BF16, 16-element alignment)
    cutlass::epilogue::collective::EpilogueScheduleAuto // Auto-select epilogue schedule
>::CollectiveOp;
```
Explanation: This massive template asks the builder: "Give me an Epilogue optimized for SM100 that takes float results, converts them to BF16, and stores them."
The Mainloop handles reading data and doing the math ($A \times B$).
```cpp
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm100, cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA, 16,                           // Input A: type, layout, alignment
    ElementB, LayoutB, 16,                           // Input B: type, layout, alignment
    float,                                           // Accumulator type
    MmaTileShape, ClusterShape,
    // Auto-calculate pipeline stages from shared memory, minus the epilogue's share
    cutlass::gemm::collective::StageCountAutoCarveout<
        static_cast<int>(sizeof(typename CollectiveEpilogue::SharedStorage))>,
    cutlass::gemm::KernelTmaWarpSpecialized1SmSm100  // Schedule Policy
>::CollectiveOp;
```
Explanation:

- `OpClassTensorOp`: we are using the specialized Tensor Cores.
- `KernelTmaWarpSpecialized1SmSm100`: this is the secret sauce. It selects the Blackwell pipeline in which the TMA engine feeds the Tensor Core MMA instructions (the Blackwell successor to Hopper's WGMMA).

Finally, we package these two "Collectives" into a kernel and run the test using the helpers we learned about in Chapter 7: Reference GEMM Implementations.
```cpp
// Combine Mainloop and Epilogue into a Kernel
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
    Shape<int, int, int, int>,  // Problem size type (M, N, K, L)
    CollectiveMainloop,
    CollectiveEpilogue
>;

// Wrap it for the testbed
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;

// Run the test!
TEST(SM100, FP8_GEMM) {
  bool passed = test::gemm::device::TestAll<Gemm>();
  EXPECT_TRUE(passed);
}
```
How does the CollectiveBuilder know what to do?
It relies on Partial Template Specialization. Inside the library, there are hundreds of specializations matching arch::Sm100.
For the absolute bleeding edge (SM120), we look at test/unit/gemm/device/sm120_tensorop_gemm/. The pattern is the same, but the types are smaller.
```cpp
// Inside sm120_gemm_f4_f4_f32_tensor_op.cu
// 4-bit Floating Point!
using ElementA = cutlass::float_e2m1_t;
using ElementB = cutlass::float_e2m1_t;

// We now target the SM120 architecture
using CollectiveMainloop = typename cutlass::gemm::collective::CollectiveBuilder<
    cutlass::arch::Sm120,  // <-- Specific SM120 Tag
    cutlass::arch::OpClassTensorOp,
    ElementA, LayoutA, AlignmentA,
    // ... same as before
>::CollectiveOp;
```
Why this is cool: `float_e2m1_t` has only 16 possible values, so every tensor element fits in half a byte.

When running these tests, you might encounter issues:
- You see `cudaErrorNoKernelImageForDevice`: the test binary was not compiled for your GPU's compute capability, so no matching kernel image exists on the device.
- `sm100_tensorop_gemm` takes a long time: the `TestAll<Gemm>` helper instantiates the kernel for many different problem sizes (128x128, 256x256, etc.). Use `TestSmall` for faster iteration.

In this chapter, we learned:
- How Blackwell accelerates narrow-precision types like FP8 and FP4.
- How the CollectiveBuilder assembles the Mainloop and Epilogue for us.
- How to verify the resulting kernels with the `TestAll` helpers.

Now that we can run dense matrix multiplications, what if we want to squeeze even more precision out of small numbers? We can use Block Scaling.
Next Chapter: Block Scaled GEMM Tests