CUDA Fundamentals: Tiled Matrix Multiply & Bitonic Sort
Introduction: Running Code on Thousands of Threads
The first time you write a CUDA kernel that actually works, where the GPU computes the right answer across thousands of simultaneous threads, is a genuinely satisfying moment. The second feeling, right after, is usually: why is it not as fast as I expected?
Projects 1 and 2 of Georgia Tech’s CS 8803: GPU Hardware and Software cover this arc. Project 1 is tiled matrix multiplication: a problem every GPU programmer learns as the canonical application of shared memory. Project 2 is bitonic sort: a parallel sorting algorithm that tests your ability to think about memory hierarchy, occupancy, and data transfer, and then optimize all three simultaneously under profiler scrutiny on an NVIDIA H100.
Together they teach the mental model that shows up across most high-performance GPU patterns covered in this course: understand where your bottleneck is (compute, global memory, or transfer), and choose the data layout and memory tier accordingly.
I keep this write-up centered on the concepts and design tradeoffs behind Projects 1 and 2. For the full course arc, see the companion post: GPU Hardware and Software: A Retrospective.
The CUDA Programming Model
Before diving into the projects, a quick grounding in the abstractions that make GPU programming possible.
One useful way to place CUDA is through the Module 2 parallel-programming lens. The workload here is mostly data decomposition (split a large array or matrix into independent chunks), but the synchronization story looks like a shared-memory model within a block (__syncthreads) and an explicit-communication model across blocks (global-memory round trips between kernel launches). In that sense, CUDA sits between the OpenMP and MPI mental models from the notes: cheap collaboration within a local team, explicit coordination at larger scope.
Thread Hierarchy
CUDA gives you three nested levels of parallelism. A kernel is a function that runs on the GPU. When you launch it, you specify a grid of thread blocks, and each block contains a number of threads. On the hardware side, threads are grouped into warps of 32; that's the granularity at which the GPU schedules and executes instructions.
The standard pattern for indexing into a 1D array:
__global__ void kernel(float* data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        // process data[i]
    }
}
The if (i < N) guard is there because the grid dimensions are rounded up to multiples of the block size, so some threads in the last block may not correspond to valid array elements.
Memory Spaces
GPU performance almost always comes down to memory. CUDA exposes several distinct memory spaces, and choosing the right one for the right access pattern is the central skill of GPU optimization.
- Registers: Fastest, private to each thread. Spilling to local memory (when you run out) is expensive.
- Shared memory (__shared__): Fast on-chip scratchpad, scoped to a thread block. The primary tool for data reuse. It usually has much lower latency and higher bandwidth than global memory, but the exact gain depends on architecture and access pattern.
- Global memory: Large (tens of GBs) but high latency (often hundreds of cycles when a load goes to DRAM on a miss). Every thread in the grid can access it. This is the dominant cost to manage.
- Constant and texture memory: Read-only spaces with specialized caches: constant for broadcast patterns, texture for spatial locality.
These are order-of-magnitude heuristics, not fixed constants. In practice, Nsight Compute and the CUDA C Best Practices Guide are the ground truth for your specific kernel and GPU generation.
The compute kernel pattern that many high-performance kernels follow:
__global__ void Kernel() {
    // 1. Load from global memory into shared memory
    __syncthreads(); // wait for all threads in block to finish loading

    // 2. Compute using shared memory
    __syncthreads(); // wait before writing results

    // 3. Write results back to global memory
}
The __syncthreads() calls are critical. Without them, some threads might start computing before others have finished loading their portion of the shared data: a race condition that produces wrong results.
On occupancy: how many warps keep an SM busy
Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can support. Higher occupancy generally means better latency hiding: when one warp stalls on a memory access, the scheduler has more warps available to switch to. But occupancy isn't everything. A kernel with 50% occupancy but perfect memory coalescing can easily outperform one with 100% occupancy and scattered memory accesses. The real goal is keeping the execution units busy, which requires both enough warps to hide latency and good enough memory access patterns to keep throughput high.
Project 1: Tiled Matrix Multiplication
Why Naive Matrix Multiply is Slow on a GPU
Matrix multiplication is deceptively simple: for C = A × B, each element C[i][j] is the dot product of row i of A and column j of B. A straightforward parallel implementation assigns one thread to each output element.
The problem: each output element requires reading an entire row of A and an entire column of B from global memory. For an N×N matrix, that’s 2N global memory reads per output element, and O(N²) output elements, so O(N³) total global memory reads. For N=1024, that’s over a billion memory accesses, each potentially incurring the full ~500-cycle global memory latency.
This is the memory-bound case on the Roofline model. The arithmetic intensity (FLOPs per byte loaded) of naive matrix multiply is around 0.25 FLOP/byte for typical data types, well below the ridge point of a modern GPU. The hardware arithmetic units are sitting idle waiting for data.
The Tiling Solution
The insight behind tiled matrix multiplication: if the threads in a block all need portions of the same rows and columns, why not load those portions into shared memory once and reuse them?
The algorithm divides the matrices into square tiles of size TILE_WIDTH × TILE_WIDTH. For each output tile:
- Each thread cooperatively loads one element of the A-tile and one element of the B-tile into shared memory.
- __syncthreads(): wait for the entire tile to be loaded.
- All threads compute their partial dot products using the shared-memory tiles.
- __syncthreads(): wait before the next iteration overwrites the tiles.
- Repeat for the next pair of input tiles. Accumulate results.
- Write the final accumulated value to global memory.
With tiling, each element loaded from global memory into shared memory is reused TILE_WIDTH times. A tile width of 16 means 16× reuse: 16× fewer global memory accesses and 16× higher arithmetic intensity. This pushes the workload from memory-bound toward compute-bound, which is where you want to be.
// Conceptual tiled matrix multiply: illustrates the pattern,
// not the project-specific implementation.
// Assumes N is a multiple of TILE_WIDTH; boundary handling is discussed below.
__global__ void tiledMatMul(float* A, float* B, float* C, int N) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_WIDTH; t++) {
        // Cooperatively load one tile of A and one tile of B into shared memory
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE_WIDTH + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();

        // Compute the partial dot product contributed by this tile pair
        for (int k = 0; k < TILE_WIDTH; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }
    C[row * N + col] = sum;
}
This is the textbook version from Chapter 5 of Programming Massively Parallel Processors. The actual project worked with cudaMallocManaged (Unified Memory) and required verifying correctness against a CPU reference implementation. The performance story is the classic GPU win: as matrix size grows, the GPU’s advantage over the CPU grows dramatically.
Design Questions This Raised
Working through this project, the questions that kept coming up were architectural rather than algorithmic:
- What’s the right tile size? Larger tiles mean more reuse per global memory load, but also more shared memory per block, which reduces occupancy. There’s a sweet spot depending on the GPU’s shared memory capacity per SM and the register pressure of the kernel.
- How do you handle matrices whose dimensions aren’t multiples of TILE_WIDTH? The naive tile loop breaks at the boundary. This requires either padding the matrices or adding guard conditions in the tile load, each with different performance implications.
- How much faster is the tiled version? The right framing is arithmetic intensity: from ~0.25 FLOP/byte to ~TILE_WIDTH/4 FLOP/byte. At TILE_WIDTH=16, that's roughly 4 FLOP/byte: a 16× improvement in data efficiency.
Project 2: Bitonic Sort
Why Sorting on a GPU Is a Different Problem
Sorting a billion integers in parallel is not the same problem as sorting a billion integers sequentially. Comparison-based sequential algorithms (quicksort, mergesort) rely on data-dependent branches and pointer chasing, exactly the patterns GPUs handle badly. What GPUs need is a sorting algorithm built out of sorting networks: fixed, data-independent sequences of compare-and-swap operations where every comparison is predetermined regardless of input.
Bitonic sort is the classic example.
The Bitonic Sort Algorithm
A bitonic sequence is one that monotonically increases and then monotonically decreases (allowing equal values), or a circular shift of such a sequence. A useful property is that you can sort it by repeatedly splitting it into smaller bitonic sequences, a “bitonic split”, until you’re left with sorted pairs, then merging up.
The algorithm has O(log² n) stages, and within each stage all comparisons are independent of each other, so they can all execute simultaneously. That's the GPU-friendly property. In pseudocode:
for i = 1 to log(n):                  // stage: builds sorted sequences of length 2^i
    for j = i-1 down to 0:            // sub-stage: one bitonic split pass
        for k = 0 to n-1 (parallel):  // each thread handles one compare-and-swap
            partner = k XOR 2^j
            if partner > k:
                if (2^i & k) == 0: Compare_Exchange_ascending(arr[k], arr[partner])
                else:              Compare_Exchange_descending(arr[k], arr[partner])
The XOR operation that finds the partner index is what makes the structure work. It computes the index that each position should compare-swap with at each sub-stage. Because those indices are determined by loop variables, not by data values, the comparator network stays fixed and avoids data-dependent branch divergence.
The Performance Challenge
The naive CUDA implementation is conceptually direct: map the innermost loop (all the compare-swaps for one sub-stage) to a kernel launch, with one thread per element. This works and produces correct results.
The profiler tells a harsh story. Launching a separate kernel for each of the O(log² n) sub-stages on a 100M element array means thousands of kernel launches. Each launch has overhead. Between launches, data sits in global memory. Each thread loads two elements, does one comparison and conditional swap, and writes two elements back: an arithmetic intensity so low that the memory subsystem is the bottleneck by a wide margin.
Nsight Compute will tell you, plainly:
“This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device.”
The grading target for this project was 900+ million elements per second on 100M elements, with memory throughput ≥ 80% and achieved occupancy ≥ 70%. Getting there requires taking the profiler’s feedback seriously and addressing the bottlenecks it identifies.
Optimization Concepts
The course points toward a two-phase design as the path to high performance.
A practical optimization loop that matched the course notes was: (1) form a bottleneck hypothesis, (2) validate it in Nsight Compute, (3) pick a kernel/data-layout change that targets that bottleneck, and (4) re-measure. The metric categories that mattered most were memory throughput, achieved occupancy, and stall reasons (memory dependency, long scoreboard, and barrier-related stalls). Treating these as a closed loop was more reliable than trying to “reason out” the best kernel shape in one shot.
Phase 1: Shared memory phase. For any sub-stage where the stride between compared elements is small enough that the relevant data fits in a block’s shared memory, do all those sub-stages inside a single kernel launch using shared memory. This collapses many kernel launches into one and keeps data on-chip between sub-stages.
Phase 2: Global memory phase. For the larger strides (when the compared elements are farther apart than fits in shared memory), revert to global memory accesses. But even here, there are optimizations: loading multiple elements per thread (coarsening) reduces the number of threads needed, which can improve occupancy and reduce scheduling overhead. And even the global-memory kernel can take a shared-memory fast path for the tail sub-stages of a stage, once the stride shrinks enough to fit.
Several other techniques are relevant:
- Pinned (page-locked) memory (cudaMallocHost or cudaHostRegister): Allocating host memory as pinned removes the intermediate staging copy otherwise required when transferring from pageable host memory. Throughput is often higher, but the gain varies with transfer size, interconnect, and system configuration.
- Async memory copies (cudaMemcpyAsync): Lets the CPU continue executing while the DMA engine handles the transfer. Combined with CUDA streams, this enables overlapping transfer and computation.
- CUDA streams: Multiple streams run concurrently on the GPU. If you can divide the work into chunks, you can overlap the transfer of chunk N+1 with the computation on chunk N.
- Items-per-thread coarsening: Instead of one thread per element, assign 2 or 4 elements per thread. This reduces the total number of threads launched (potentially improving occupancy on large arrays), increases the arithmetic-to-indexing ratio, and can improve register usage.
Design Questions This Raised
The interesting engineering questions in this project were about policy rather than correctness:
- Where is the shared-memory/global-memory crossover point? This depends on the GPU’s shared memory capacity, the block size, the data type size, and the number of elements. Finding the right crossover point requires measuring, not just reasoning.
- How do you handle non-power-of-2 array sizes? Bitonic sort requires power-of-2 lengths. Padding with sentinel values (INT_MAX for ascending sort) is the standard approach, but the padding adds memory traffic for large inputs.
- How many items per thread? The sweet spot depends on the hardware. Too few: too many threads, scheduling overhead dominates. Too many: registers spill, occupancy drops.
- When is pinned memory actually faster? For small transfers the overhead of pinning can outweigh the benefit. For larger transfers and overlap-heavy pipelines, pinned memory is often a win, but you still need to measure on your target platform.
The Nsight Compute feedback loop was central to this project: run, profile, isolate the dominant bottleneck (achieved occupancy, memory throughput, or divergent branches), make one targeted change, and re-measure. Iterating that loop was far more reliable than one-shot "theoretical" tuning.
What These Projects Teach
These two projects establish the mental model that the rest of the course builds on.
From tiled matrix multiply:
- The shared memory tiling pattern is the foundational GPU optimization. Understand it cold. Almost every high-performance kernel is a variation.
- Arithmetic intensity is the right framing for “why is this kernel slow?” Low intensity = memory bound = need more reuse.
- __syncthreads() is not optional when threads are cooperating through shared memory.
From bitonic sort:
- The profiler is not optional. You cannot reason your way to 900 million elements per second on an H100. You profile, identify the bottleneck, fix it, and repeat.
- Kernel launch overhead is real. When the algorithm has O(log² n) stages and n is 100M, you feel every unnecessary kernel launch.
- The two-phase pattern (shared memory for small strides, global memory for large strides) is a general strategy for parallel sorting algorithms on GPUs, not just bitonic sort.
- Data transfer is part of the performance budget. Pinned memory and async transfers are not micro-optimizations. For 100M elements, they are often the difference between meeting the target and missing it.
The broader pattern: Both projects teach the same underlying skill: start from the hardware’s perspective. What is the memory access pattern? Is it coalesced? How much data reuse is there? What does occupancy look like? The answers to those questions determine your optimization strategy.
Additional Resources
- CUDA C Best Practices Guide - the authoritative reference for optimization, especially the chapters on memory and execution configuration
- Nsight Compute documentation - essential for understanding what the profiler metrics mean
- Improved GPU Sorting (Harris, Sengupta, Owens) - the GPU Gems 2 chapter on sorting is still one of the best resources on the topic

Programming Massively Parallel Processors: A Hands-on Approach
David B. Kirk and Wen-mei W. Hwu
Chapter 5 covers tiled matrix multiplication in depth, the definitive treatment of the pattern. Chapter 7 covers parallel patterns including sorting. The 4th edition adds material on tensor cores and modern optimization techniques.

CUDA by Example: An Introduction to General-Purpose GPU Programming
Jason Sanders and Edward Kandrot
A gentler starting point than PMPP. Good for building comfort with the CUDA API before diving into the optimization-heavy material. Available free from NVIDIA.
In accordance with Georgia Tech’s academic integrity policy and the license for course materials, the source code for these projects is kept in a private repository. This post follows Dean Joyner’s advice on sharing projects with a focus not on any particular solution but on an abstract overview of the problem and the underlying concepts involved.
I would be happy to discuss implementation details, architecture choices, or profiling results in an interview. Please feel free to reach out to request private access to the repository.