In the previous chapter, Chapter 7: Reference GEMM Implementations, we built the "Gold Standard" to verify our math. We learned how to check if a matrix multiplication result is correct.
But before we can build a skyscraper (a massive GEMM kernel), we need to test the bricks and the cranes.
How do we know the machine handles our numbers correctly when they are stored as Half (16-bit) or FP8 (8-bit) in C++? This chapter explores the unit tests that verify these fundamental building blocks.
You might think, "It's a computer; surely it knows how to add numbers!"
In High-Performance Computing (HPC) and AI, we often use "weird" number formats to save memory.
- Half precision (half_t): Standard C++ doesn't always support this natively on the CPU, so CUTLASS has to write a custom class for it.
- FP8 (float_e4m3_t): a tiny 8-bit float. It behaves very differently from a standard float.
If our custom class for FP8 is broken, our massive AI model will output garbage, even if the matrix math logic is perfect.
Furthermore, on Hopper (SM90) GPUs, we don't move memory with simple loops anymore. We use the Tensor Memory Accelerator (TMA). This is a complex hardware engine. We need to verify we are giving it the correct instructions.
CUTLASS provides C++ classes that mimic built-in types.
C++ has float (32-bit). It doesn't natively have cutlass::float_e4m3_t (8-bit). We need to ensure that FP8 + FP8 yields the correct result, and that casting int -> FP8 works.
The tests in test/unit/core/ (like half.cu and float8.cu) perform basic sanity checks.
- Can we convert an int (7) into an FP8 and back?
- Does 2.0_fp8 + 2.0_fp8 equal 4.0?
Here is a simplified snippet from the float8.cu test. It verifies that we can define an FP8 number using a user-defined literal (like 7_fe4m3) and cast it to an integer.
// From test/unit/core/float8.cu
TEST(float_e4m3_t, host_conversion) {
using FP8 = cutlass::float_e4m3_t;
// 1. Initialize using a literal suffix
FP8 val = 7_fe4m3;
// 2. Cast back to standard integer
int result = static_cast<int>(val);
// 3. Verify
EXPECT_TRUE(result == 7);
}
Explanation:
_fe4m3 is a custom literal suffix. It tells C++ "treat this number as an FP8 type." If the round trip through static_cast works correctly, our "brick" is solid.
CuTe is the layout engine inside CUTLASS. It describes where data lives.
Imagine a chess board: each square has a (row, column) coordinate, but the squares are stored in one flat line of memory. A layout is the rule that maps a coordinate to a memory position.
In CuTe, we define a layout using Shape (dimensions) and Stride (step size).
// Shape: (8 rows, 8 columns)
// Stride: (1, 8) -> Column Major (contiguous in memory)
auto layout = make_layout(Shape<_8, _8>{}, Stride<_1, _8>{});
The tests verify that these layouts correctly calculate memory addresses before we try to use them in a kernel.
The Tensor Memory Accelerator (TMA) is a hardware feature on SM90+ (Hopper/Blackwell).
The unit tests in test/unit/cute/hopper/tma_load.cu verify this machinery.
We want to ensure the TMA can load a Tile of data correctly.
The Test:
This simplified snippet shows how the test is structured using the generic test_tma_load helper.
// From test/unit/cute/hopper/tma_load.cu
TEST(SM90_CuTe_Hopper, Tma_Load_32x32_Col) {
// 1. Define the Shared Memory (Destination) Layout
// Shape: 32x32, Column Major
auto smem_layout = Layout<Shape<_32,_32>, Stride<_1,_32>>{};
// 2. Define the Global Memory (Source) Layout
auto gmem_layout = smem_layout; // Same shape/stride
// 3. Run the test for different data types
test_tma_load<half_t>(gmem_layout, smem_layout);
test_tma_load<float>(gmem_layout, smem_layout);
}
Explanation:
- We don't write the raw TMA machinery in each test; the shared helper test_tma_load does that.
- The same layouts are exercised for several data types (here float and half_t).
What happens inside test_tma_load? It bridges the gap between the host (CPU) setup and the device (GPU) execution.
One of the most complex parts of TMA is Swizzling. To avoid "Bank Conflicts" in Shared Memory (where multiple threads try to access the same memory bank simultaneously), we shuffle (swizzle) the data layout.
The test file explicitly checks different "Swizzle Atoms."
// From test/unit/cute/hopper/tma_load.cu
template <class T, template <typename> typename SWIZZLE_ATOM>
void test_tma_load_swizzle_atom_mn() {
// Create a layout that uses a specific Swizzle Pattern (e.g., 128 bytes)
auto smem_layout = SWIZZLE_ATOM<T>{};
// Create a large Global Memory source
auto gmem_layout = make_layout(
make_shape(2 * size<0>(smem_layout), 2 * size<1>(smem_layout)),
GenColMajor{}
);
// Verify TMA works even with this complex shuffled destination
test_tma_load<T>(gmem_layout, smem_layout);
}
Why this matters:
If you get the swizzle wrong, the TMA might write data to Address X but your math kernel reads from Address Y. The data is there, but scrambled. This test proves that the reader and writer agree on the scramble pattern.
The file tma_load.cu also tests "Internal Types." This checks if we can trick the TMA.
Example: Loading complex<double>.
The TMA doesn't know what a complex number is. But a complex<double> is just 128 bits (16 bytes). We can tell the TMA to move raw unsigned-integer chunks instead, then cast them back to complex numbers in software.
// Loading Complex Doubles by pretending they are uint64_t
test_tma_load<complex<double>, uint64_t>(gmem_layout, smem_layout);
This flexibility allows CUTLASS to support future data types without waiting for new hardware.
In this chapter, we learned:
- Type tests verify that custom numeric types like half_t and float_e4m3_t behave like normal numbers (core/half.cu, core/float8.cu).
- TMA tests verify Hopper's data-movement engine, including swizzled layouts and internal types (cute/hopper/tma_load.cu).
These tests form the foundation. If the types work and the data movement works, we are ready to build the complex matrix multiplication logic on top of them.
In the next chapter, we will see how these components come together to test the full Dense GEMM on the new Blackwell architecture.
Next Chapter: Blackwell Dense GEMM Tests