Generated by Code IQ · v1.0

cutlass
Knowledge Tutorial

A chapter-by-chapter walkthrough of cutlass, generated from its source code and tutorial markdown.

16 Chapters



System Architecture

How the pieces fit

cutlass is organized as connected concepts and components. Start broad, then drill down chapter by chapter.

Build Configuration
Documentation
Library Definitions
Operation Wrappers
Profiler Tool
Host Tensor Utility
Reference GEMM Implementations
Core CuTe and Type Tests
Blackwell Dense GEMM Tests
Block Scaled GEMM Tests
Sparse and Stream-K Tests
Sparse Compressor Test


Repository Overview

Intro and Architecture Diagram

CUTLASS is a high-performance CUDA C++ template library for implementing matrix-matrix multiplication (GEMM) and related linear algebra primitives. It provides hierarchical abstractions to efficiently target NVIDIA GPUs, from Volta to the latest Blackwell architecture, handling complex operations like sparse GEMM, block scaling, and Stream-K scheduling. The project includes a Profiler for benchmarking, extensive unit tests, and a Python-based CuTe DSL for developing kernels with rapid iteration.
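As a point of reference for the term GEMM used above, the sketch below computes the conventional form D = alpha * (A × B) + beta * C with plain nested loops. It is only an illustration of the arithmetic, not CUTLASS code: the function name `reference_gemm` and the list-of-lists matrix representation are assumptions made for this example, and nothing here reflects how CUTLASS tiles or schedules the work on a GPU.

```python
def reference_gemm(alpha, a, b, beta, c):
    """Naive GEMM sketch: returns D = alpha * (A @ B) + beta * C.

    a is M x K, b is K x N, c is M x N, all as lists of rows.
    Hypothetical helper for illustration; not part of CUTLASS.
    """
    m, k = len(a), len(a[0])
    n = len(b[0])
    if len(b) != k:
        raise ValueError("inner dimensions of A and B do not match")
    if len(c) != m or any(len(row) != n for row in c):
        raise ValueError("C must have shape M x N")

    d = [[0.0] * n for _ in range(m)]
    for i in range(m):          # rows of A and D
        for j in range(n):      # columns of B and D
            acc = 0.0
            for p in range(k):  # reduction over the shared K dimension
                acc += a[i][p] * b[p][j]
            # Scale the accumulated product and blend in the source matrix C.
            d[i][j] = alpha * acc + beta * c[i][j]
    return d


if __name__ == "__main__":
    a = [[1.0, 2.0], [3.0, 4.0]]
    b = [[5.0, 6.0], [7.0, 8.0]]
    c = [[1.0, 1.0], [1.0, 1.0]]
    print(reference_gemm(2.0, a, b, 3.0, c))
```

A slow loop nest like this is useful mainly as a correctness baseline: an optimized kernel can be run on the same inputs and its output compared element by element against this result.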

Source Repository: https://github.com/NVIDIA/cutlass

flowchart TD
    A0["Build Configuration"]
    A1["Documentation"]
    A2["Host Tensor Utility"]
    A3["Reference GEMM Implementations"]
    A4["Library Definitions"]
    A5["Operation Wrappers"]
    A6["Profiler Tool"]
    A7["Core CuTe and Type Tests"]
    A8["Legacy Architecture Tests"]
    A9["Blackwell Dense GEMM Tests"]
    A10["Block Scaled GEMM Tests"]
    A11["Sparse and Stream-K Tests"]
    A12["Sparse Compressor Test"]
    A13["C++ Code Generators"]
    A14["CuTe DSL Pipelines"]
    A15["DSL Infrastructure"]
    A0 -->|"Configures build"| A6
    A0 -->|"Compiles"| A8
    A1 -->|"Describes"| A15
    A2 -->|"Supports data management"| A3
    A3 -->|"Verifies correctness"| A8
    A4 -->|"Defines API configuration"| A5
    A5 -->|"Uses types from"| A4
    A6 -->|"Profiles"| A5
    A7 -->|"Uses"| A2
    A9 -->|"Validates against"| A3
    A10 -->|"Validates against"| A3
    A11 -->|"Validates against"| A3
    A12 -->|"Uses"| A2
    A13 -->|"Generates code for"| A5
    A14 -->|"Uses"| A15

Tutorial Chapters

All 16 chapters

Follow sequentially or jump to any topic. Start with Build Configuration.

Build Configuration

Welcome to the first chapter of the CUTLASS tutorial! Before we can write high-performance matrix multiplication kernels, we need to set up…

Documentation

In the previous chapter, Chapter 1: Build Configuration, we set up our "kitchen" by configuring the build system for specific architectures…

Library Definitions

In Chapter 2: Documentation, we learned how to read the recipes and find the features available in CUTLASS, such as Blackwell support and C…

Operation Wrappers

In the previous chapter, Chapter 3: Library Definitions, we learned how to fill out the "forms" (Configurations and Arguments) to describe…

Profiler Tool

In the previous chapter, Chapter 4: Operation Wrappers, we learned how to wrap rigid C++ templates into flexible objects. We created a "Men…

Host Tensor Utility

In the previous chapter, Chapter 5: Profiler Tool, we learned how to use the automated profiler to benchmark kernels. But what if you want…

Reference GEMM Implementations

In the previous chapter, Chapter 6: Host Tensor Utility, we learned how to easily manage memory on both the CPU and GPU using HostTensor. W…

Core CuTe and Type Tests

In the previous chapter, Chapter 7: Reference GEMM Implementations, we built the "Gold Standard" to verify our math. We learned how to chec…

Blackwell Dense GEMM Tests

In the previous chapter, Chapter 8: Core CuTe and Type Tests, we verified our "bricks" (custom types like FP8) and our "cranes" (TMA data m…

Block Scaled GEMM Tests

In the previous chapter, Chapter 9: Blackwell Dense GEMM Tests, we successfully ran high-performance matrix multiplications on the Blackwel…

Sparse and Stream-K Tests

In the previous chapter, Chapter 10: Block Scaled GEMM Tests, we learned how to compress data values into tiny formats (like 4-bit) using B…

Sparse Compressor Test

In the previous chapter, Chapter 11: Sparse and Stream-K Tests, we explored the cutting-edge feature of Sparsity. We learned that if a matr…

Legacy Architecture Tests

In the previous chapter, Chapter 12: Sparse Compressor Test, we explored the cutting-edge world of structured sparsity on the newest hardwa…

C++ Code Generators

In the previous chapter, Chapter 13: Legacy Architecture Tests, we looked at how to manually write C++ templates for older GPUs. You might…

CuTe DSL Pipelines

In the previous chapter, Chapter 14: C++ Code Generators, we learned how to use Python to fill in "Mad Libs" style templates to generate C+…

DSL Infrastructure

In the previous chapter, Chapter 15: CuTe DSL Pipelines, we learned how to define high-level traffic control logic for data movement using…

About This Project

Generated by Code IQ

This tutorial was automatically generated by Code IQ and rendered with the shared tutorial site builder. It can be produced for any repository tutorial folder that follows the numbered markdown chapter layout.


python build_site.py '/home/runner/work/Code-IQ/Code-IQ/output/cutlass'

// → 16 chapters
// → source: NVIDIA/cutlass