In the previous chapter, Chapter 11: Sparse and Stream-K Tests, we explored the cutting-edge feature of structured sparsity. We learned that if a matrix has enough zeros in the right pattern, the hardware can skip the math for them and run up to 2x faster.
However, we skipped a crucial step. We assumed the data was already in the "Magical Compressed Format" that the GPU expects.
But real data starts as a normal dense matrix.
This chapter introduces the Sparse Compressor Test. This tool tests the machinery that takes a normal matrix (with zeros) and "vacuum packs" it into the compressed format required by NVIDIA's Blackwell (SM100) Sparse Tensor Cores.
Imagine you are packing clothes for a trip.
If you try to shove the messy pile of clothes on the bed (the dense matrix) straight into the suitcase, it won't fit. You need a Compressor.
Central Use Case: You have a trained AI model with weights in FP4 (4-bit float). You pruned it so 50% of the weights are zero. You now need to convert this into the Compressed Values and Metadata arrays so the Blackwell GPU can execute it efficiently.
The hardware doesn't just support random zeros. It requires structure. Usually, in every group of 4 numbers, 2 must be zero and 2 must be non-zero. The compressor analyzes the dense data to ensure it fits this rule (or selects the best values) and packs them.
The compressor takes one input (the dense matrix) and produces two outputs: the Compressed Values array (the surviving non-zeros) and the Metadata array (which records where each survivor came from).
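Before looking at the real CUTLASS kernel, here is a minimal host-side sketch of the idea (my own illustration, not CUTLASS code): for each aligned group of four dense values, keep the two with the largest magnitude and record their positions, producing the two outputs described above.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative 2:4 compressor (not the CUTLASS kernel): for each group of
// four dense values, keep the two with the largest magnitude and record
// their positions. Output #1: compressed values (half the dense size).
// Output #2: metadata (one position per survivor, stored unpacked here).
struct CompressedTwoFour {
    std::vector<float>   values;   // 2 survivors per group of 4
    std::vector<uint8_t> indices;  // their positions (0..3) within the group
};

CompressedTwoFour compress_2_4(const std::vector<float>& dense) {
    CompressedTwoFour out;
    for (size_t g = 0; g + 4 <= dense.size(); g += 4) {
        // Rank the 4 positions by |value| and keep the top 2, in order.
        std::array<uint8_t, 4> pos = {0, 1, 2, 3};
        std::sort(pos.begin(), pos.end(), [&](uint8_t a, uint8_t b) {
            return std::fabs(dense[g + a]) > std::fabs(dense[g + b]);
        });
        std::array<uint8_t, 2> keep = {pos[0], pos[1]};
        std::sort(keep.begin(), keep.end());
        for (uint8_t p : keep) {
            out.values.push_back(dense[g + p]);
            out.indices.push_back(p);
        }
    }
    return out;
}
```

Note that the output is always exactly half the dense size: the "suitcase" has a fixed shape, which is what lets the Tensor Cores address it efficiently.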
Sm1xxGemmSparseConfig: This is the "instruction manual" for the compressor. It tells CUTLASS exactly how the bits are arranged in the hardware registers for the SM100 architecture.
We will look at how to set up a test for compressing 4-bit Floating Point (FP4) data on an SM100 GPU.
First, we define what our dense data looks like. Here we use float_e2m1_t, which is the technical name for FP4.
// Define the data type: 4-bit float (e2m1)
using ElementA = cutlass::float_e2m1_t;
// Define the layout: Row Major
using LayoutATag = cutlass::layout::RowMajor;
Explanation: This is the format of the "clothes on the bed" before compression.
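To make FP4 concrete, here is an illustrative decoder for the e2m1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit, exponent bias 1) that cutlass::float_e2m1_t represents. This is my own sketch of the bit layout, not CUTLASS code:

```cpp
#include <cstdint>

// Decode a 4-bit e2m1 code (bit 3: sign, bits 2-1: exponent, bit 0: mantissa).
// The representable magnitudes are: 0, 0.5, 1, 1.5, 2, 3, 4, 6.
float decode_e2m1(uint8_t code) {
    float sign = (code & 0x8) ? -1.0f : 1.0f;
    int   exp  = (code >> 1) & 0x3;   // 2 exponent bits, bias 1
    int   man  = code & 0x1;          // 1 mantissa bit
    if (exp == 0) return sign * man * 0.5f;               // subnormal: 0 or 0.5
    return sign * (1.0f + man * 0.5f) * (1 << (exp - 1)); // normal values
}
```

With only 16 possible codes, FP4 is extremely coarse, which is exactly why pruning half the weights to zero (and compressing them away) costs so little extra accuracy on top of the quantization.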
This is where we map our data to the hardware's internal storage units.
// Hardware Internal Types (Cute DSL)
// "4 bits per element, stored in uint8_t container"
using ElementAMma = cute::sparse_elem<4, uint8_t>;
// Metadata type: packs the indices for 16 elements into one byte
using ElementEMma = cute::sparse_elem<16, uint8_t>;
// The Configuration Object
using Sm1xxSparseConfig = cutlass::Sm1xxGemmSparseConfig<
ElementAMma, LayoutATag, ElementEMma
>;
Explanation: Sm1xxSparseConfig connects the dots. It says: "We are taking 4-bit inputs and generating metadata that packs 16 indices into a single byte."
Now we define the actual GPU kernel that performs the work.
using CompressorKernel = cutlass::transform::kernel::StructuredSparseCompressor<
cute::Shape<int, int, int, int>, // Problem Size (Dynamic)
ElementA, // Input Type
LayoutATag, // Layout
Sm1xxSparseConfig, // The Hardware Config
cutlass::arch::Sm100 // Architecture
>;
Explanation: StructuredSparseCompressor is the worker. It reads ElementA and uses Sm1xxSparseConfig to write the compressed outputs.
Just like in previous chapters, we wrap the kernel in an adapter and use a testbed to verify it.
// Wrap the kernel for easy launching
using Compressor = cutlass::transform::device::
TransformUniversalAdapter<CompressorKernel>;
// Run the test
TEST(SM100_Compressor, FP4_Test) {
  test::transform::device::TestbedSparseGemmCompressor<Compressor> testbed;
  // The testbed generates random sparse data, compresses it,
  // and verifies the output matches the 2:4 rule.
  EXPECT_TRUE(testbed.run_auto());
}
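The core invariant the testbed checks can be expressed in a few lines of host code. The following is a simplified sketch of that check (the real reference verification in CUTLASS is more involved):

```cpp
#include <cstddef>
#include <vector>

// Host-side sketch of the structural invariant: after pruning, every
// aligned group of 4 elements may contain at most 2 non-zeros.
bool satisfies_2_4(const std::vector<float>& pruned) {
    if (pruned.size() % 4 != 0) return false;  // must be whole groups
    for (size_t g = 0; g < pruned.size(); g += 4) {
        int nonzeros = 0;
        for (size_t i = 0; i < 4; ++i) {
            if (pruned[g + i] != 0.0f) ++nonzeros;
        }
        if (nonzeros > 2) return false;
    }
    return true;
}
```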
What happens inside the StructuredSparseCompressor? It acts as a parallel filter.
The compressor logic is heavily templated to handle different bit-widths (4-bit, 8-bit, 16-bit).
1. Reordering (Swizzling)
The raw compressed data isn't just written linearly. The SM100 Tensor Cores expect data in a specific "Swizzled" pattern to maximize memory bandwidth. The Sm1xxSparseConfig handles this calculation.
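As a toy illustration of what a swizzle does (the actual SM100 pattern is encapsulated inside Sm1xxSparseConfig and is more elaborate), XOR-based swizzles permute the column index by bits of the row index, so consecutive rows land in different memory banks while the mapping stays trivially invertible:

```cpp
#include <cstdint>

// Toy XOR swizzle, for intuition only. Each row's columns are permuted by
// XOR-ing with the row index; XOR is its own inverse, so the original
// column is recovered by applying the same XOR again.
uint32_t swizzle_offset(uint32_t row, uint32_t col, uint32_t row_width) {
    uint32_t swizzled_col = col ^ (row % row_width);  // permute within the row
    return row * row_width + swizzled_col;
}
```

Because every row is a permutation of the same columns, no data is lost; the layout just changes so that hardware memory accesses stay conflict-free.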
2. Dynamic Types
You might notice a test case for cutlass::type_erased_dynamic_float4_t.
// From sm100_sparse_gemm_compressor_f4_omma.cu
using ElementA = cutlass::type_erased_dynamic_float4_t;
This allows the compressor to handle data types that are only known at runtime (e.g., loaded from a Python script), rather than fixed at compile time. This is essential for frameworks like PyTorch.
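A minimal sketch of this dispatch pattern is shown below. The enum, struct, and function names here are hypothetical illustrations, not CUTLASS APIs; the point is that a runtime tag is mapped onto a statically-typed, compile-time instantiation:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical runtime dtype tag, e.g. received from a Python frontend.
enum class RuntimeDtype { kFloat4E2M1, kFloat4Other };

// Hypothetical stand-ins for statically-typed element traits.
struct E2M1  { static const char* name() { return "e2m1"; } };
struct Other { static const char* name() { return "other"; } };

// Stand-in for launching the statically-typed compressor kernel.
template <typename Element>
std::string run_compressor() {
    return std::string("compressed as ") + Element::name();
}

// The dispatcher: the only place that inspects the runtime tag.
std::string compress_dynamic(RuntimeDtype dtype) {
    switch (dtype) {
        case RuntimeDtype::kFloat4E2M1: return run_compressor<E2M1>();
        case RuntimeDtype::kFloat4Other: return run_compressor<Other>();
    }
    throw std::runtime_error("unsupported dtype");
}
```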
In Chapter 9: Blackwell Dense GEMM Tests, the kernels did Math (Multiply-Add).
Here, the Compressor does Transform (Rearrange-Pack).
This operation is usually run offline (once, before training/inference starts) to prepare the weights. That is why it is in the cutlass/transform directory, not cutlass/gemm.
In this chapter, we learned how the Sparse Compressor Test exercises the machinery that converts a dense matrix into the compressed values and metadata arrays required by Blackwell's Sparse Tensor Cores: the FP4 (e2m1) element types, the Sm1xxGemmSparseConfig hardware description, the StructuredSparseCompressor kernel, and the testbed that verifies the 2:4 structure.
Now we have covered the entire modern pipeline: building, defining, profiling, and running advanced Sparse/Dense kernels on the latest hardware.
But what about the older hardware? CUTLASS is backwards compatible. The next chapter explores how to test architectures that came before Blackwell and Hopper.
Next Chapter: Legacy Architecture Tests