Generated by Code IQ · v1.0

cutlass
Knowledge Tutorial

A chapter-by-chapter walkthrough of cutlass, generated from its source code and tutorial markdown.

16 Chapters
System Architecture

How the pieces fit

cutlass is organized as connected concepts and components. Start broad, then drill down chapter by chapter.

Build Configuration
Documentation
Library Definitions
Operation Wrappers
Profiler Tool
Host Tensor Utility
Reference GEMM Implementations
Core CuTe and Type Tests
Blackwell Dense GEMM Tests
Block Scaled GEMM Tests
Sparse and Stream-K Tests
Sparse Compressor Test
Repository Overview

Intro and Architecture Diagram

CUTLASS is a high-performance CUDA C++ template library for implementing matrix-matrix multiplication (GEMM) and related linear algebra primitives. It provides hierarchical abstractions to efficiently target NVIDIA GPUs, from Volta to the latest Blackwell architecture, handling complex operations like sparse GEMM, block scaling, and Stream-K scheduling. The project includes a Profiler for benchmarking, extensive unit tests, and a Python-based CuTe DSL for developing kernels with rapid iteration.
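For readers new to the term, GEMM is commonly written as D = alpha * (A x B) + beta * C. The block below is a minimal, unoptimized Python sketch of that arithmetic only. It is not CUTLASS code and uses no CUTLASS API; the function name `naive_gemm` and the list-of-lists matrix layout are assumptions made here for illustration.

```python
def naive_gemm(alpha, a, b, beta, c):
    """Return alpha * (a x b) + beta * c for row-major lists of lists.

    Hypothetical helper for illustration; not part of CUTLASS.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k or len(c) != m or any(len(row) != n for row in c):
        raise ValueError("expected a (m x k), b (k x n), c (m x n)")

    d = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            # Accumulate the dot product of row i of a and column j of b.
            acc = 0.0
            for p in range(k):
                acc += a[i][p] * b[p][j]
            # Apply the scaling factors (the "epilogue" of the GEMM).
            d[i][j] = alpha * acc + beta * c[i][j]
    return d


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
c = [[1.0, 1.0], [1.0, 1.0]]
result = naive_gemm(2.0, a, b, 0.5, c)
print(result)
```

A triple loop like this is the kind of computation that a tuned GPU kernel reorganizes into tiles; the sketch only fixes the meaning of the operation.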

Source Repository: https://github.com/NVIDIA/cutlass

flowchart TD
    A0["Build Configuration"]
    A1["Documentation"]
    A2["Host Tensor Utility"]
    A3["Reference GEMM Implementations"]
    A4["Library Definitions"]
    A5["Operation Wrappers"]
    A6["Profiler Tool"]
    A7["Core CuTe and Type Tests"]
    A8["Legacy Architecture Tests"]
    A9["Blackwell Dense GEMM Tests"]
    A10["Block Scaled GEMM Tests"]
    A11["Sparse and Stream-K Tests"]
    A12["Sparse Compressor Test"]
    A13["C++ Code Generators"]
    A14["CuTe DSL Pipelines"]
    A15["DSL Infrastructure"]
    A0 -->|"Configures build"| A6
    A0 -->|"Compiles"| A8
    A1 -->|"Describes"| A15
    A2 -->|"Supports data management"| A3
    A3 -->|"Verifies correctness"| A8
    A4 -->|"Defines API configuration"| A5
    A5 -->|"Uses types from"| A4
    A6 -->|"Profiles"| A5
    A7 -->|"Uses"| A2
    A9 -->|"Validates against"| A3
    A10 -->|"Validates against"| A3
    A11 -->|"Validates against"| A3
    A12 -->|"Uses"| A2
    A13 -->|"Generates code for"| A5
    A14 -->|"Uses"| A15
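The relationships in the diagram can also be explored programmatically. The sketch below copies the diagram's edges into a Python list and answers one question: which components validate against the Reference GEMM Implementations. The tuple layout and the names `EDGES` and `sources_of` are hypothetical, chosen here for illustration; only the component names and edge labels come from the diagram.

```python
# (source, label, target) triples transcribed from the architecture diagram.
EDGES = [
    ("Build Configuration", "Configures build", "Profiler Tool"),
    ("Build Configuration", "Compiles", "Legacy Architecture Tests"),
    ("Documentation", "Describes", "DSL Infrastructure"),
    ("Host Tensor Utility", "Supports data management", "Reference GEMM Implementations"),
    ("Reference GEMM Implementations", "Verifies correctness", "Legacy Architecture Tests"),
    ("Library Definitions", "Defines API configuration", "Operation Wrappers"),
    ("Operation Wrappers", "Uses types from", "Library Definitions"),
    ("Profiler Tool", "Profiles", "Operation Wrappers"),
    ("Core CuTe and Type Tests", "Uses", "Host Tensor Utility"),
    ("Blackwell Dense GEMM Tests", "Validates against", "Reference GEMM Implementations"),
    ("Block Scaled GEMM Tests", "Validates against", "Reference GEMM Implementations"),
    ("Sparse and Stream-K Tests", "Validates against", "Reference GEMM Implementations"),
    ("Sparse Compressor Test", "Uses", "Host Tensor Utility"),
    ("C++ Code Generators", "Generates code for", "Operation Wrappers"),
    ("CuTe DSL Pipelines", "Uses", "DSL Infrastructure"),
]


def sources_of(target, label):
    """Return the sorted components that point at `target` with edge `label`."""
    return sorted(src for src, lbl, dst in EDGES if dst == target and lbl == label)


validators = sources_of("Reference GEMM Implementations", "Validates against")
print(validators)
```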
Tutorial Chapters

All 16 chapters

Follow sequentially or jump to any topic. Start with Build Configuration.

Ch.01 CORE
Build Configuration
Welcome to the first chapter of the CUTLASS tutorial! Before we can write high-performance matrix multiplication kernels, we need to set up…
Ch.02 CORE
Documentation
In the previous chapter, Chapter 1: Build Configuration, we set up our "kitchen" by configuring the build system for specific architectures…
Ch.03 CORE
Library Definitions
In Chapter 2: Documentation, we learned how to read the recipes and find the features available in CUTLASS, such as Blackwell support and C…
Ch.04 CORE
Operation Wrappers
In the previous chapter, Chapter 3: Library Definitions, we learned how to fill out the "forms" (Configurations and Arguments) to describe…
Ch.05 TOOLS
Profiler Tool
In the previous chapter, Chapter 4: Operation Wrappers, we learned how to wrap rigid C++ templates into flexible objects. We created a "Men…
Ch.06 CORE
Host Tensor Utility
In the previous chapter, Chapter 5: Profiler Tool, we learned how to use the automated profiler to benchmark kernels. But what if you want…
Ch.07 CORE
Reference GEMM Implementations
In the previous chapter, Chapter 6: Host Tensor Utility, we learned how to easily manage memory on both the CPU and GPU using HostTensor. W…
Ch.08 CORE
Core CuTe and Type Tests
In the previous chapter, Chapter 7: Reference GEMM Implementations, we built the "Gold Standard" to verify our math. We learned how to chec…
Ch.09 CORE
Blackwell Dense GEMM Tests
In the previous chapter, Chapter 8: Core CuTe and Type Tests, we verified our "bricks" (custom types like FP8) and our "cranes" (TMA data m…
Ch.10 CORE
Block Scaled GEMM Tests
In the previous chapter, Chapter 9: Blackwell Dense GEMM Tests, we successfully ran high-performance matrix multiplications on the Blackwel…
Ch.11 CORE
Sparse and Stream-K Tests
In the previous chapter, Chapter 10: Block Scaled GEMM Tests, we learned how to compress data values into tiny formats (like 4-bit) using B…
Ch.12 CORE
Sparse Compressor Test
In the previous chapter, Chapter 11: Sparse and Stream-K Tests, we explored the cutting-edge feature of Sparsity. We learned that if a matr…
Ch.13 CORE
Legacy Architecture Tests
In the previous chapter, Chapter 12: Sparse Compressor Test, we explored the cutting-edge world of structured sparsity on the newest hardwa…
Ch.14 CORE
C++ Code Generators
In the previous chapter, Chapter 13: Legacy Architecture Tests, we looked at how to manually write C++ templates for older GPUs. You might…
Ch.15 CORE
CuTe DSL Pipelines
In the previous chapter, Chapter 14: C++ Code Generators, we learned how to use Python to fill in "Mad Libs" style templates to generate C+…
Ch.16 CORE
DSL Infrastructure
In the previous chapter, Chapter 15: CuTe DSL Pipelines, we learned how to define high-level traffic control logic for data movement using…
About This Project

Generated by Code IQ

This tutorial was automatically generated by Code IQ and rendered with the shared tutorial site builder. It can be produced for any repository tutorial folder that follows the numbered markdown chapter layout.

python build_site.py '/home/runner/work/Code-IQ/Code-IQ/output/cutlass'

// → 16 chapters
// → source: NVIDIA/cutlass
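
The "numbered markdown chapter layout" mentioned above can be illustrated with a short Python sketch that picks out and orders chapter files by their numeric prefix. This is not the logic of `build_site.py`, whose source is not shown here; the naming scheme (`NN_title.md`), the pattern `CHAPTER_PATTERN`, the function `numbered_chapters`, and the sample file names are all assumptions made for illustration.

```python
import re

# Assumed naming scheme: a numeric prefix, an underscore, a title, then ".md".
CHAPTER_PATTERN = re.compile(r"^(\d+)_(.+)\.md$")


def numbered_chapters(filenames):
    """Return the file names that look like numbered chapters, in chapter order."""
    chapters = []
    for name in filenames:
        match = CHAPTER_PATTERN.match(name)
        if match:
            # Sort on the integer value so that 10 comes after 2.
            chapters.append((int(match.group(1)), name))
    return [name for _, name in sorted(chapters)]


# Hypothetical file names, not taken from the real output folder.
sample = [
    "02_documentation.md",
    "index.md",
    "01_build_configuration.md",
    "10_block_scaled_gemm_tests.md",
]
ordered = numbered_chapters(sample)
print(ordered)
```

Sorting on the parsed integer rather than the raw string keeps the order correct even if a folder mixes padded and unpadded prefixes.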