In the previous chapter, Chapter 12: Sparse Compressor Test, we explored the cutting-edge world of structured sparsity on the newest hardware.
But the world isn't made entirely of the latest GPUs. Millions of systems still run on Ampere (A100), Volta (V100), or even older Maxwell/Pascal cards.
CUTLASS is backward compatible, however, and this chapter covers Legacy Architecture Tests: how CUTLASS tests the "classic" kernels that powered the AI revolution before the arrival of the CollectiveBuilder and Hopper TMA.
If Chapter 9: Blackwell Dense GEMM Tests was about flying a spaceship, this chapter is about driving a reliable muscle car.
The logic for older GPUs is different:
Central Use Case: You want to verify a standard FP16 Matrix Multiplication on an NVIDIA A100 (Sm80) or a GTX 1080 (Sm61/50).
Gemm Template
In modern CUTLASS (3.x+), we use a Builder. In legacy CUTLASS (2.x), we use the device::Gemm template. It takes many arguments to fully define the "engine."
using Gemm = cutlass::gemm::device::Gemm<
ElementA, LayoutA, // Input A
ElementB, LayoutB, // Input B
ElementC, LayoutC, // Output C
ElementAccumulator, // Math Type (e.g., float)
OpClass, // TensorOp vs SIMT
ArchTag, // SM80, SM70, SM50
ThreadblockShape, // Tile Size (128x128x32)
WarpShape, // Warp Size (64x64x32)
InstructionShape, // Tensor Core Instruction
EpilogueOp, // Alpha/Beta logic
Swizzle, // Grid Organization
Stages // Pipeline Depth
>;
It looks scary, but it gives you total control over the legacy hardware pipeline.
The OpClass parameter is the most important distinction in legacy tests:
OpClassTensorOp: uses Tensor Cores.
OpClassSimt: uses ordinary CUDA cores.
Let's look at a test file like gemm_f16n_f16t_f32t_tensor_op_f32_sm80.cu. This tests a standard Half-precision GEMM on an A100.
We manually define the hierarchy of tiles.
// 1. How big is the block computed by the whole Thread Block?
using ThreadblockShape = cutlass::gemm::GemmShape<128, 256, 64>;
// 2. How big is the block computed by one Warp (32 threads)?
using WarpShape = cutlass::gemm::GemmShape<64, 64, 64>;
// 3. What is the size of the hardware instruction (Tensor Core)?
using InstructionShape = cutlass::gemm::GemmShape<16, 8, 16>;
We plug these shapes into the main template. Note OpClassTensorOp and Sm80.
using Gemm = cutlass::gemm::device::Gemm<
cutlass::half_t, cutlass::layout::ColumnMajor, // A
cutlass::half_t, cutlass::layout::RowMajor, // B
float, cutlass::layout::RowMajor, // C
float, // Accumulator
cutlass::arch::OpClassTensorOp, // Use Tensor Cores
cutlass::arch::Sm80, // Ampere Arch
ThreadblockShape, WarpShape, InstructionShape, // Shapes
EpilogueOp, // Epilogue
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
3 // 3 Pipeline Stages
>;
We use the TestAllGemm helper, which works exactly like the one in Chapter 9.
// Run the test across various problem sizes
EXPECT_TRUE(test::gemm::device::TestAllGemm<Gemm>());
Now let's look at simt_sgemm_nn_sm50.cu. This targets GPUs from ~2014.
Notice the differences:
OpClassSimt instead of OpClassTensorOp.
Sm50 instead of Sm80.
An instruction shape of 1x1x1: because scalar cores process one element at a time, there is no "matrix instruction size."
using Gemm = cutlass::gemm::device::Gemm<
float, cutlass::layout::ColumnMajor,
float, cutlass::layout::ColumnMajor,
float, cutlass::layout::RowMajor,
float,
cutlass::arch::OpClassSimt, // <--- Key Difference: Scalar Math
cutlass::arch::Sm50, // <--- Key Difference: Old Arch
ThreadblockShape, WarpShape,
cutlass::gemm::GemmShape<1, 1, 1>, // <--- Instruction is scalar
EpilogueOutputOp,
cutlass::gemm::threadblock::GemmIdentityThreadblockSwizzle<>,
2
>;
In scientific computing, complex numbers ($a + bi$) are common.
The file gemm_planar_complex...sm80.cu tests this specific memory layout.
// A universal kernel that handles Planar Complex math
using GemmKernel = cutlass::gemm::kernel::DefaultGemmPlanarComplexUniversal<
cutlass::half_t, cutlass::layout::RowMajor, // Real Part Layout
cutlass::ComplexTransform::kNone, // Transform (e.g., Conjugate)
8, // Alignment
// ... (Repeats for Imaginary Part) ...
cutlass::arch::OpClassTensorOp,
cutlass::arch::Sm80
>::GemmKernel;
Explanation: This kernel performs four internal real matrix multiplications (real x real, imag x imag, real x imag, imag x real) and combines them to produce the correct complex result ($D = A \times B$).
Convolution (used in Vision Models like ResNet) can be calculated as a matrix multiplication. The file conv2d_fprop...sm80.cu tests this.
It uses Implicit GEMM. Instead of physically reshaping the image into a matrix (which is slow and uses memory), the kernel calculates the addresses on-the-fly to pretend the image is a matrix.
using Conv2dFpropKernel = typename cutlass::conv::kernel::DefaultConv2dFprop<
cutlass::half_t, cutlass::layout::TensorNHWC, // Activation (Image)
cutlass::half_t, cutlass::layout::TensorNHWC, // Filter (Weights)
// ...
cutlass::arch::OpClassTensorOp,
cutlass::arch::Sm80,
// ...
cutlass::conv::IteratorAlgorithm::kOptimized // <--- Magic Iterator
>::Kernel;
Explanation: kOptimized tells the iterator to pre-compute indices to traverse the image data as if it were a column in a matrix.
How does the legacy engine work compared to the modern Builder?
The Legacy architecture uses a strict hierarchy of Core Components.
DefaultMma
In the legacy world, you don't build the pipeline yourself. You rely on DefaultMma. This meta-template looks at your architecture (Sm80) and your types (float16) and picks the best hand-tuned pipeline.
This logic is often found implicitly in the test instantiation:
// From gemm/threadblock/mma_pipelined_sm80.cu
using MmaCore = typename cutlass::gemm::threadblock::DefaultMmaCore<
ThreadblockShape, WarpShape, InstructionShape,
ElementA, LayoutA,
ElementB, LayoutB,
ElementC, LayoutC,
cutlass::arch::OpClassTensorOp // <--- The key selector
>;
If you change OpClassTensorOp to OpClassSimt, DefaultMmaCore swaps the entire internal engine from Warp-synchronous Tensor operations to thread-level FMA (Fused Multiply Add) instructions.
In this chapter, we learned:
device::Gemm: the monolithic template used for legacy kernels.
OpClassTensorOp (Tensor Cores) vs. OpClassSimt (CUDA Cores).
We have now covered the full spectrum of matrix multiplication: from the newest Blackwell Block-Scaled Sparse kernels down to the legacy Maxwell scalar kernels.
But how do we create these thousands of kernel variants without writing thousands of C++ files by hand? We use Code Generation.
Next Chapter: C++ Code Generators