In the previous chapter, Chapter 10: Block Scaled GEMM Tests, we learned how to compress data values into tiny formats (like 4-bit) using Block Scaling.
Now we will explore two advanced techniques to squeeze even more performance out of the Blackwell architecture. This chapter covers Sparse and Stream-K Tests, the cutting edge of efficiency in CUTLASS.
Imagine a matrix that is 50% zeros. A dense kernel still performs every multiplication, even pointless ones like 0 * 5; it wastes energy calculating zero. NVIDIA GPUs (Ampere and newer) support 2:4 Structured Sparsity: in every block of 4 numbers, at least 2 must be zero. If your data follows this rule, the hardware can run the math 2x faster.
Imagine you have 100 slices of pizza (Work Tiles) and 8 people (SMs/Cores). If each person can only take whole slices, the last round leaves some people still eating while others sit idle. Stream-K lets people split a slice, ensuring "perfect load balancing," which is critical when running odd-shaped matrix multiplications on massive GPUs.
OpClassSparseTensorOp: The magic tag in the Builder. It tells the compiler: "I am providing compressed data. Please use the Sparse Tensor Cores."
Metadata: Since we removed the zeros, the remaining data is packed tight. The GPU needs a "map" (the Metadata) to know where the original values belonged.
Stream-K: A scheduling algorithm that decouples the grid of thread blocks from the geometry of the matrix. It allows "partial tiles," where two different SMs work on the same output tile to balance the load.
We want to build the ultimate efficient kernel for Blackwell (SM100) that combines:
1. 2:4 Structured Sparsity (skip the zeros).
2. Stream-K scheduling (keep every SM busy).
3. Block Scaled 4-bit data (nv_float4).
We will walk through how to define this in a test.
We will look at the file sm100_bssp_gemm_nvf4_nvf4_f32_f16_nvf4_o_tnt_streamk.cu.
We need to define our inputs. We use nv_float4_t, which we learned about in Chapter 10, but now we will use it in a sparse context.
// Define the 4-bit float type
using ElementData = cutlass::float_e2m1_t;
// Define the Block Scaled Pair (Data + Scale)
// The "nv_float4_t" handles the 4-bit packing
using ElementPairA = cutlass::nv_float4_t<ElementData>;
using ElementPairB = cutlass::nv_float4_t<ElementData>;
Explanation: This sets up the data types. nv_float4_t is the NVIDIA-specific format for 4-bit block-scaled numbers.
This is the most critical step. We must choose the correct Operation Class.
// Target Blackwell (SM100)
using ArchTag = cutlass::arch::Sm100;
// Tell the builder to use Block Scaled + Sparse hardware
using OpClassTag = cutlass::arch::OpClassBlockScaledSparseTensorOp;
Explanation: If we used OpClassTensorOp, it would run a dense kernel. OpClassBlockScaledSparseTensorOp enables both the scaling logic and the sparsity logic simultaneously.
We use the CollectiveBuilder to generate the code that loads the data.
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
ArchTag,
OpClassTag, // <--- Passed in here
ElementPairA, LayoutA, 64, // Alignment is critical!
ElementPairB, LayoutB, 32,
float, // Accumulator
Shape<_128, _128, _256>, // Tile Shape
Shape<_1, _1, _1>, // Cluster Shape
// ... stage counts ...
// Use the specialized NVF4 Sparse Kernel Schedule
cutlass::gemm::KernelSparseTmaWarpSpecialized1SmNvf4Sm100
>::CollectiveOp;
Explanation: The builder takes the OpClassTag and the ElementPair types. From these it automatically generates a mainloop pipeline that loads the compressed data, the metadata, and the block scale factors, and feeds them to the Sparse Tensor Cores.
Finally, when defining the Kernel, we swap the standard scheduler for Stream-K.
using GemmKernel = cutlass::gemm::kernel::GemmUniversal<
cute::Shape<int,int,int,int>,
CollectiveMainloop,
CollectiveEpilogue,
// THE SCHEDULER:
cutlass::gemm::StreamKScheduler
>;
Explanation: By passing StreamKScheduler as the last template argument, we change how the work is distributed to the GPU cores. The math (Mainloop/Epilogue) stays the same, but the orchestration changes.
How does the GPU handle "missing" data?
Inside the CUTLASS library, the OpClassSparseTensorOp triggers specific template specializations.
In the file sm100_sparse_tensorop_gemm/sm100_sp_gemm...cu, you will see:
namespace cutlass3x_sm100_sptensorop... {
// ...
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
cutlass::arch::Sm100,
cutlass::arch::OpClassSparseTensorOp, // <--- The Trigger
// ...
cutlass::gemm::KernelSparseTmaWarpSpecialized1SmSm100
>::CollectiveOp;
}
This builder selects a "Sparse TMA" pipeline (KernelSparseTma...). This pipeline allocates extra shared memory specifically to hold the Metadata (often called E or Meta).
When StreamKScheduler is used, the kernel launch logic changes: instead of launching one thread block per output tile, the kernel launches a persistent grid sized to the GPU's SM count, and each block loops over work units (including partial tiles) until all the work is done.
Running these tests requires specific hardware (SM100/SM120) because older GPUs do not understand the float_e2m1 types or the specific Blackwell sparse instructions.
// Example Test Execution (from the file)
TEST(SM100, Sparse_BlockScaled_StreamK) {
// Define the GEMM type we built above
using Gemm = cutlass::gemm::device::GemmUniversalAdapter<GemmKernel>;
// Use the TestSmall helper
// It generates random compressed data and metadata for you!
EXPECT_TRUE(test::gemm::device::TestSmall<Gemm>(...));
}
Note: The TestSmall helper is very smart. For sparse tests, it automatically generates a valid 2:4 sparse matrix on the CPU, compresses it, generates the metadata, and sends it to the GPU for the test.
In this chapter, we learned:
1. 2:4 Structured Sparsity lets the hardware skip zeros for up to 2x speed, and the Builder needs OpClassSparseTensorOp to enable it.
2. Stream-K decouples the thread-block grid from the matrix geometry for perfect load balancing, enabled by passing StreamKScheduler to GemmUniversal.
3. On Blackwell (SM100), OpClassBlockScaledSparseTensorOp combines sparsity with the nv_float4 block-scaled types from Chapter 10.
You might be wondering: "The test helper generated the metadata for me. But in a real application, how do I take a dense matrix and turn it into the compressed format and metadata required by these kernels?"
That requires a Sparse Compressor.
Next Chapter: Sparse Compressor Test
Generated by Code IQ