In the previous chapter, Chapter 5: Profiler Tool, we learned how to use the automated profiler to benchmark kernels. But what if you want to write your own test or integration?
To run a kernel on a GPU, you need data. This creates a logistical problem:
Managing these memory allocations and copies manually (using cudaMalloc, cudaMemcpy, etc.) is tedious and error-prone.
This chapter introduces the Host Tensor Utility, a smart container that handles this "logistics" for you.
Think of the Host Tensor as a Cloud-Synced Folder (like Dropbox or Google Drive) for your data.
Without Host Tensor: You have to manually upload (copy to device) and download (copy to host) every time you change something.
With Host Tensor: You create one object. It manages both the local copy and the cloud copy. You simply say "Sync," and it handles the movement.
We want to verify a Matrix Multiplication.
The cutlass::HostTensor class owns two blocks of memory:
std::vector<Element> host_: A standard C++ vector living in system RAM.device_memory::allocation<Element> device_: A raw pointer to GPU VRAM.Just like the Definitions we learned in Chapter 3: Library Definitions, a tensor needs to know its shape (Extent) and how it is arranged in memory (Layout).
{128, 64} (Rows, Columns).The class provides simple methods to keep the two memory banks in sync:
sync_device(): Copies data Host -> Device.sync_host(): Copies data Device -> Host.Let's walk through the standard workflow for setting up a test.
First, we define the type of tensor we want. In this example, a Matrix of floats.
#include "cutlass/util/host_tensor.h"
#include "cutlass/layout/matrix.h"
// Define a Tensor for float elements, Column Major layout
using MyTensor = cutlass::HostTensor<float, cutlass::layout::ColumnMajor>;
// Define the size (M=128, N=64)
auto size = cutlass::MatrixCoord(128, 64);
We instantiate the object. The constructor automatically allocates memory on both the CPU and the GPU.
// Create the tensor
// This calls malloc() on CPU and cudaMalloc() on GPU automatically
MyTensor tensor_A(size);
// You can access the Host memory easily
tensor_A.host_data()[0] = 1.0f;
tensor_A.host_data()[1] = 2.0f;
Explanation: tensor_A now exists in two places. Right now, the data is only set on the CPU side.
We usually fill the host memory with data using a loop or a helper function.
// Simple loop to fill data on the Host
for (int i = 0; i < tensor_A.capacity(); ++i) {
tensor_A.host_data()[i] = static_cast<float>(i);
}
// At this point, GPU memory contains garbage/zeros!
Now that the "Local Folder" (CPU) is ready, we sync to the "Cloud" (GPU).
// Copy Host -> Device
tensor_A.sync_device();
// Now we can get a pointer to the GPU memory to pass to a kernel
float* gpu_ptr = tensor_A.device_data();
Explanation: sync_device() performs the cudaMemcpy. You don't need to calculate bytes or offsets.
After running a kernel (which writes to the GPU memory), we bring the data back.
// ... Imagine a Kernel ran here and modified tensor_A on the GPU ...
// Copy Device -> Host
tensor_A.sync_host();
// Now we can print the results
printf("Result: %f\n", tensor_A.host_data()[0]);
How does HostTensor manage this magic? It uses C++ templates to wrap standard allocation and CUDA API calls.
Here is the lifecycle of a HostTensor object.
host_tensor.h
The file tools/util/include/cutlass/util/host_tensor.h contains the implementation.
1. The Class Members The class holds the layout info and the two distinct storage containers.
// cutlass/util/host_tensor.h
template <typename Element_, typename Layout_>
class HostTensor {
private:
// Describes the shape (M, N, K)
TensorCoord extent_;
Layout layout_;
// CPU Memory
std::vector<StorageUnit> host_;
// GPU Memory (Wrapper around cudaMalloc)
device_memory::allocation<StorageUnit> device_;
};
2. The Constructor / Reserve
When you create the tensor, it calculates how much memory is needed based on the extent and layout.
// Simplified view of reserve()
void reserve(size_t count, bool device_backed) {
// 1. Resize the std::vector (CPU)
host_.resize(count);
// 2. Allocate on GPU if requested
if (device_backed) {
// device_memory::allocate wraps cudaMalloc
StorageUnit* ptr = device_memory::allocate<StorageUnit>(count);
device_.reset(ptr);
}
}
3. The Synchronization
The sync functions simply wrap the CUDA memory copy commands. They check device_backed() to ensure GPU memory actually exists before trying to copy.
// Simplified view of sync_device()
void sync_device() {
if (device_backed()) {
// Wraps cudaMemcpy(dst, src, bytes, cudaMemcpyHostToDevice)
device_memory::copy_to_device(
device_.get(), // Destination (GPU)
host_.data(), // Source (CPU)
host_.size() // Count
);
}
}
A powerful feature of HostTensor is that it can return TensorRef or TensorView objects. These are lightweight structs containing a pointer and a stride.
These "Views" are exactly what the Reference Implementations (CPU math) expect as input.
// Getting a view of the host data
auto host_view = tensor.host_view();
// host_view now acts like a lightweight matrix object
// You can pass this to a CPU reference GEMM
ReferenceGemm::run(host_view);
In this chapter, we learned:
sync_device(), compute, then sync_host().Now that we have a way to manage our test data, we need something to compare our results against. We need a "Gold Standard" to verify correctness.
Next Chapter: Reference GEMM Implementations
Generated by Code IQ