In the previous chapter, Chapter 13: Legacy Architecture Tests, we looked at how to manually write C++ templates for older GPUs. You might have noticed that defining a single kernel required 20-30 lines of complex C++ code.
Now imagine you need to support:
That is $5 \times 4 \times 10 \times 3 = 600$ kernels. Writing 600 C++ files by hand is impossible to maintain.
This chapter introduces the C++ Code Generators. These are Python scripts that write the C++ code for us.
Do you remember the game "Mad Libs"? You have a story with blanks, and you fill in the blanks with words to create a sentence.
${color} fox jumps over the ${adjective} dog."CUTLASS uses this exact same logic.
${placeholders}..cu file that can be compiled.Central Use Case: We want to generate the C++ code for a Blackwell (SM100) GEMM using the Epilogue Visitor Tree (EVT) without manually writing the confusing nested templates.
GemmOperation ObjectThis is a Python class that acts as the "Blueprint." It holds all the metadata about a specific kernel but contains no C++ code itself.
EmitGemm...)
This is the "Printer." It takes a GemmOperation blueprint and knows which C++ template string to use. It performs the text substitution.
For the new Blackwell architecture, the Epilogue (what happens after the matrix multiply) is very complex. The Sm100Emitter specializes in generating the C++ code for fused operations like ReLU(Bias + (A*B)).
Let's see how we use Python to generate the C++ code for a specific operation.
We create an instance of GemmOperationUniversal. This is purely Python code.
# Conceptual Python code (simplified)
from cutlass_cppgen.backend import GemmOperationUniversal
# 1. Define the Blueprint
operation = GemmOperationUniversal(
arch=100, # Target SM100 (Blackwell)
tile_description=...,
A=TensorDescription(element=DataType.f16, layout=LayoutType.RowMajor),
B=TensorDescription(element=DataType.f16, layout=LayoutType.ColumnMajor),
C=TensorDescription(element=DataType.f32, layout=LayoutType.ColumnMajor),
epilogue_functor=...
)
Based on the architecture (SM100) and API version (3.x), the system selects the correct emitter class.
# Inside the backend logic
from cutlass_cppgen.backend.gemm_operation import EmitGemmUniversalInstance3x
# Create the emitter for CUTLASS 3.x
emitter = EmitGemmUniversalInstance3x(operation_suffix="_sm100")
We call the emit method. This functions like the "Print" button.
# Generate the string
cpp_code = emitter.emit(operation)
# Print it to see the result
print(cpp_code)
The Output (Simplified C++):
// Auto-generated C++ code
using cutlass_sm100_tensorop_f16_s128x128_row_col_align8 =
cutlass::gemm::kernel::GemmUniversal<
Shape<int,int,int,int>,
CollectiveMainloop, // ... configured for f16 ...
CollectiveEpilogue // ... configured for f16 ...
>;
How does the text substitution actually work? It uses a dictionary of values mapping key names to C++ snippets.
In python/cutlass_cppgen/backend/gemm_operation.py, the class EmitGemmUniversalInstance3x defines the raw C++ template.
# From python/cutlass_cppgen/backend/gemm_operation.py
self.gemm_template_kernel = """
using CollectiveMainloop =
typename cutlass::gemm::collective::CollectiveBuilder<
${arch}, ${opcode_class},
${element_a}, ${layout_a}, ${align_a},
${element_b}, ${layout_b}, ${align_b},
// ...
>::CollectiveOp;
"""
Explanation: Notice the ${...} markers. These are the blanks in our "Mad Libs" game.
The emit method creates a Python dictionary mapping those markers to actual C++ types.
# Inside emit() method
values = {
"arch": "cutlass::arch::Sm%d" % operation.arch,
"element_a": DataTypeTag[operation.A.element], # e.g., "cutlass::half_t"
"layout_a": LayoutTag[operation.A.layout], # e.g., "cutlass::layout::RowMajor"
# ...
}
# The magic function that replaces the text
return SubstituteTemplate(self.gemm_template_kernel, values)
For the Blackwell architecture (SM100), the Epilogue is treated differently. We use a Visitor Tree pattern (covered in Chapter 2 via DSL, but here is the generator backend).
The file python/cutlass_cppgen/backend/evt/backend/sm100_emitter.py handles this.
Blackwell supports advanced fusion (e.g., GELU(LinearCombination)). The generator must analyze the graph of operations and emit a C++ descriptor.
# From python/cutlass_cppgen/backend/evt/backend/sm100_emitter.py
class Sm100CollectiveEpilogue:
def emit(self):
# ... logic to calculate stride strings ...
# Returns a C++ type definition
return f"""
using EpilogueDescriptor =
cutlass::epilogue::collective::detail::Sm100EpilogueDescriptor<
{OpcodeClassTag[self.opclass]},
{self.CtaTileMNK},
{self.EpilogueTileType},
// ...
>;
"""
Explanation:
CtaTileMNK: The Python script calculates the thread block shape (e.g., Shape<_128,_128,_64>) and injects it into the string.Sm100EpilogueDescriptor which connects the generated kernel to the hardware features we learned about in Chapter 9: Blackwell Dense GEMM Tests.In this chapter, we learned:
${placeholders}.EmitGemmUniversalInstance3x) handle the logic of translating Python objects into C++ syntax.Sm100Emitter handles the complex generation required for new Epilogue features.This system allows CUTLASS to publish thousands of highly optimized kernels without a human having to type them out.
Now that we understand how the library generates the underlying C++ code, we can move up a level of abstraction. What if you want to write your own kernels using a high-level Python syntax?
Next Chapter: CuTe DSL Pipelines
Generated by Code IQ