
CUDA Fundamentals: Tiled Matrix Multiply & Bitonic Sort

Writing real GPU kernels, exploring shared memory tiling, parallel sorting algorithms, and performance optimization on an NVIDIA H100.

Introduction: Running Code on Thousands of Threads

The first time you write a CUDA kernel that actually works, where the GPU computes the right answer across thousands of simultaneous threads, is a genuinely satisfying moment. The second feeling, right after, is usually: why is it not as fast as I expected?

Projects 1 and 2 of Georgia Tech’s CS 8803: GPU Hardware and Software cover this arc. Project 1 is tiled matrix multiplication: a problem every GPU programmer learns as the canonical application of shared memory. Project 2 is bitonic sort: a parallel sorting algorithm that tests your ability to think about memory hierarchy, occupancy, and data transfer, and then optimize all three simultaneously under profiler scrutiny on an NVIDIA H100.

Together they teach the mental model that shows up across most high-performance GPU patterns covered in this course: understand where your bottleneck is (compute, global memory, or transfer), and choose the data layout and memory tier accordingly.

Note

I keep this write-up centered on the concepts and design tradeoffs behind Projects 1 and 2. For the full course arc, see the companion post: GPU Hardware and Software: A Retrospective.

The CUDA Programming Model

Before diving into the projects, a quick grounding in the abstractions that make GPU programming possible.

One useful way to place CUDA is through the Module 2 parallel-programming lens. The workload here is mostly data decomposition (split a large array or matrix into independent chunks), but the synchronization story looks like a shared-memory model within a block (__syncthreads) and an explicit-communication model across blocks (global-memory round trips between kernel launches). In that sense, CUDA sits between the OpenMP and MPI mental models from the notes: cheap collaboration within a local team, explicit coordination at larger scope.

Thread Hierarchy

CUDA gives you three nested levels of parallelism. A kernel is a function that runs on the GPU. When you launch it, you specify a grid of thread blocks, and each block contains a number of threads. On the hardware side, threads are grouped into warps of 32; that’s the granularity at which the GPU schedules and executes instructions.

CUDA thread indexing
Figure 1: Every CUDA thread has a unique identity via built-in variables: threadIdx (position within its block) and blockIdx (position of the block within the grid). Together they let each thread compute a unique index into the data it should process.

The standard pattern for indexing into a 1D array:

__global__ void kernel(float* data, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        // process data[i]
    }
}

The if (i < N) guard is there because the grid dimensions are rounded up to multiples of the block size, so some threads in the last block may not correspond to valid array elements.
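The usual companion to this guard is ceil-division grid sizing on the host. A minimal sketch in plain C++ (numBlocks is an illustrative helper, not a CUDA API):

```cpp
// Grid sizing by ceil division: launch enough blocks to cover all N
// elements. The last block may overshoot, which is exactly what the
// i < N guard in the kernel absorbs.
int numBlocks(int n, int blockSize) {
    return (n + blockSize - 1) / blockSize;
}
```

For N = 1000 and 256-thread blocks this launches 4 blocks (1024 threads); the last 24 threads fail the guard and do nothing.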

CUDA data indexing pattern
Figure 2: Dividing 16 elements across 4 blocks of 4 threads each. The global index formula (blockIdx.x * blockDim.x + threadIdx.x) gives each thread a unique element to process.

Memory Spaces

GPU performance almost always comes down to memory. CUDA exposes several distinct memory spaces, and choosing the right one for the right access pattern is the central skill of GPU optimization.

CUDA memory hierarchy
Figure 3: The CUDA memory hierarchy. Registers and shared memory live on-chip and are fast. Global memory (DRAM) is large but high-latency. A common pattern is: load data from global into shared, compute in shared, then write results back to global.
  • Registers: Fastest, private to each thread. Spilling to local memory (when you run out) is expensive.
  • Shared memory (__shared__): Fast on-chip scratchpad, scoped to a thread block. The primary tool for data reuse. It usually has much lower latency and higher bandwidth than global memory, but the exact gain depends on architecture and access pattern.
  • Global memory: Large (tens of GBs) but high latency (often hundreds of cycles when a load goes to DRAM on a miss). Every thread in the grid can access it. This is the dominant cost to manage.
  • Constant and texture memory: Read-only spaces with specialized caches; constant suits broadcast patterns, texture suits spatial locality.

These are order-of-magnitude heuristics, not fixed constants. In practice, Nsight Compute and the CUDA C Best Practices Guide are the ground truth for your specific kernel and GPU generation.

The load-compute-store pattern that many high-performance kernels follow:

__global__ void Kernel() {
    // 1. Load from global memory into shared memory
    __syncthreads();  // wait for all threads in block to finish loading
    // 2. Compute using shared memory
    __syncthreads();  // wait before writing results
    // 3. Write results back to global memory
}

The __syncthreads() calls are critical. Without them, some threads might start computing before others have finished loading their portion of the shared data: a race condition that produces wrong results.

On occupancy: how many warps keep an SM busy

Occupancy is the ratio of active warps on an SM to the maximum number of warps the SM can support. Higher occupancy generally means better latency hiding: when one warp stalls on a memory access, the scheduler has more warps available to switch to. But occupancy isn’t everything. A kernel with 50% occupancy but perfect memory coalescing can easily outperform one with 100% occupancy and scattered memory accesses. The real goal is keeping the execution units busy, which requires both enough warps to hide latency and memory access patterns good enough to sustain throughput.
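The occupancy tradeoff can be made concrete with back-of-envelope arithmetic. This host-side sketch shows how per-block shared memory can cap resident warps; all the limits here are illustrative assumptions, not H100 specifications:

```cpp
#include <algorithm>

// Illustrative per-SM resource limits (assumed values, not any
// specific GPU's datasheet numbers).
struct SmLimits {
    int maxWarps;    // max resident warps per SM
    int smemBytes;   // shared memory capacity per SM
    int maxBlocks;   // max resident blocks per SM
};

// Theoretical occupancy (percent) for a kernel described by its
// per-block warp count and shared-memory footprint. Whichever
// resource runs out first limits how many blocks can be resident.
int occupancyPct(SmLimits sm, int warpsPerBlock, int smemPerBlock) {
    int bySmem  = smemPerBlock ? sm.smemBytes / smemPerBlock : sm.maxBlocks;
    int byWarps = sm.maxWarps / warpsPerBlock;
    int blocks  = std::min({sm.maxBlocks, bySmem, byWarps});
    return 100 * blocks * warpsPerBlock / sm.maxWarps;
}
```

With 64 warps and 100 KB of shared memory per SM, a 256-thread block (8 warps) using 48 KB of shared memory fits only twice per SM, so occupancy lands at 25%; shrink the footprint to 8 KB and the same kernel reaches 100%. That is the tile-size tension the matrix-multiply project runs into directly.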

Project 1: Tiled Matrix Multiplication

Why Naive Matrix Multiply is Slow on a GPU

Matrix multiplication is deceptively simple: for C = A × B, each element C[i][j] is the dot product of row i of A and column j of B. A straightforward parallel implementation assigns one thread to each output element.

The problem: each output element requires reading an entire row of A and an entire column of B from global memory. For an N×N matrix, that’s 2N global memory reads per output element, and O(N²) output elements, so O(N³) total global memory reads. For N=1024, that’s over a billion memory accesses, each potentially incurring the full ~500-cycle global memory latency.

This is the memory-bound case on the Roofline model. The arithmetic intensity (FLOPs per byte loaded) of naive matrix multiply is around 0.25 FLOP/byte for typical data types, well below the ridge point of a modern GPU. The hardware arithmetic units are sitting idle waiting for data.
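The ~0.25 FLOP/byte figure is simple accounting: the naive inner step does one multiply and one add per two 4-byte float loads. A quick sketch of that arithmetic (assuming float data and no cache reuse):

```cpp
// FLOPs per byte for the naive matmul inner step: 2 FLOPs
// (one multiply, one add) per two 4-byte float loads.
double naiveIntensity() {
    return 2.0 / (2.0 * sizeof(float));  // 0.25 FLOP/byte
}

// Tiling reuses each loaded element TILE_WIDTH times, scaling
// arithmetic intensity by the same factor.
double tiledIntensity(int tileWidth) {
    return tileWidth * naiveIntensity();
}
```

At a tile width of 16 this gives 4 FLOP/byte, matching the 16× improvement discussed below under the tiling solution.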

Managing global memory in tiled matrix multiply
Figure 4: Tiling divides the output matrix into tiles. Each thread block computes one output tile by iterating over the corresponding input tiles, loading each into shared memory before computing.

The Tiling Solution

The insight behind tiled matrix multiplication: if the threads in a block all need portions of the same rows and columns, why not load those portions into shared memory once and reuse them?

The algorithm divides the matrices into square tiles of size TILE_WIDTH × TILE_WIDTH. For each output tile:

  1. Each thread cooperatively loads one element of the A-tile and one element of the B-tile into shared memory.
  2. __syncthreads(): wait for the entire tile to be loaded.
  3. All threads compute their partial dot products using the shared-memory tiles.
  4. __syncthreads(): wait before the next iteration overwrites the tiles.
  5. Repeat for the next pair of input tiles. Accumulate results.
  6. Write the final accumulated value to global memory.

With tiling, each element loaded from global memory into shared is reused TILE_WIDTH times. A tile width of 16 means 16× reuse, 16× fewer global memory accesses, 16× higher arithmetic intensity. This pushes the workload from memory-bound toward compute-bound, which is where you want to be.

// Conceptual tiled matrix multiply; illustrates the pattern,
// not the project-specific implementation. Assumes N is a
// multiple of TILE_WIDTH.
__global__ void tiledMatMul(float* A, float* B, float* C, int N) {
    __shared__ float As[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Bs[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < N / TILE_WIDTH; t++) {
        // Cooperatively load one tile of A and B into shared memory
        As[threadIdx.y][threadIdx.x] = A[row * N + (t * TILE_WIDTH + threadIdx.x)];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE_WIDTH + threadIdx.y) * N + col];
        __syncthreads();

        // Compute partial dot product from this tile
        for (int k = 0; k < TILE_WIDTH; k++) {
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        }
        __syncthreads();
    }

    C[row * N + col] = sum;
}

This is the textbook version from Chapter 5 of Programming Massively Parallel Processors. The actual project worked with cudaMallocManaged (Unified Memory) and required verifying correctness against a CPU reference implementation. The performance story is the classic GPU win: as matrix size grows, the GPU’s advantage over the CPU grows dramatically.

Design Questions This Raised

Working through this project, the questions that kept coming up were architectural rather than algorithmic:

  • What’s the right tile size? Larger tiles mean more reuse per global memory load, but also more shared memory per block, which reduces occupancy. There’s a sweet spot depending on the GPU’s shared memory capacity per SM and the register pressure of the kernel.
  • How do you handle matrices whose dimensions aren’t multiples of TILE_WIDTH? The naive tile loop breaks at the boundary. This requires either padding the matrices or adding guard conditions in the tile load, each with different performance implications.
  • How much faster is the tiled version? The right framing is arithmetic intensity: from ~0.25 FLOP/byte to ~TILE_WIDTH/4 FLOP/byte. At TILE_WIDTH=16, that’s roughly 4 FLOP/byte, a 16× improvement in data efficiency.

Project 2: Bitonic Sort

Why Sorting on a GPU Is a Different Problem

Sorting a billion integers in parallel is not the same problem as sorting a billion integers sequentially. Comparison-based sequential algorithms (quicksort, mergesort) rely on data-dependent branches and pointer chasing, exactly the patterns GPUs handle badly. What GPUs need is a sorting algorithm built out of sorting networks: fixed, data-independent sequences of compare-and-swap operations where every comparison is predetermined regardless of input.

Bitonic sort is the classic example.

The Bitonic Sort Algorithm

A bitonic sequence is one that monotonically increases and then monotonically decreases (allowing equal values), or a circular shift of such a sequence. A useful property is that you can sort it by repeatedly splitting it into smaller bitonic sequences, a “bitonic split”, until you’re left with sorted pairs, then merging up.

Bitonic sort comparator network
Figure 5: The bitonic sort network for 16 elements. Each horizontal line is a comparator: the two endpoints are compared and swapped if out of order. The direction of the arrow indicates which end takes the smaller value. Every comparator at a given stage can fire in parallel; this is what makes it GPU-friendly.

The algorithm has O(log² n) stages, and within each stage all comparisons are independent of each other, so they can all execute simultaneously. That’s the GPU-friendly property. In pseudocode:

for i = 1 to log(n):           // stage: builds sorted sequences of length 2^i
  for j = i-1 down to 0:       // sub-stage: one bitonic split pass
    for k = 0 to n-1 (parallel): // each thread handles one compare-and-swap
      partner = k XOR 2^j
      if partner > k:
        if (2^i & k) == 0: Compare_Exchange_ascending(arr[k], arr[partner])
        else:              Compare_Exchange_descending(arr[k], arr[partner])

The XOR operation that finds the partner index is what makes the structure work. It computes the index that each position should compare-swap with at each sub-stage. Because those indices are determined by loop variables, not by data values, the comparator network stays fixed and avoids data-dependent branch divergence.

The Performance Challenge

The naive CUDA implementation is conceptually direct: map the innermost loop (all the compare-swaps for one sub-stage) to a kernel launch, with one thread per element. This works and produces correct results.

The profiler tells a harsh story. Launching a separate kernel for each of the O(log² n) sub-stages on a 100M element array means thousands of kernel launches. Each launch has overhead. Between launches, data sits in global memory. Each thread loads two elements, does one comparison and conditional swap, and writes two elements back, an arithmetic intensity so low that the memory subsystem is the bottleneck by a wide margin.

Nsight Compute will tell you, plainly:

“This kernel exhibits low compute throughput and memory bandwidth utilization relative to the peak performance of this device.”

The grading target for this project was 900+ million elements per second on 100M elements, with memory throughput ≥ 80% and achieved occupancy ≥ 70%. Getting there requires taking the profiler’s feedback seriously and addressing the bottlenecks it identifies.

Optimization Concepts

The course points toward a two-phase design as the path to high performance.

A practical optimization loop that matched the course notes was: (1) form a bottleneck hypothesis, (2) validate it in Nsight Compute, (3) pick a kernel/data-layout change that targets that bottleneck, and (4) re-measure. The metric categories that mattered most were memory throughput, achieved occupancy, and stall reasons (memory dependency, long scoreboard, and barrier-related stalls). Treating these as a closed loop was more reliable than trying to “reason out” the best kernel shape in one shot.

Phase 1: Shared memory phase. For any sub-stage where the stride between compared elements is small enough that the relevant data fits in a block’s shared memory, do all those sub-stages inside a single kernel launch using shared memory. This collapses many kernel launches into one and keeps data on-chip between sub-stages.

Phase 2: Global memory phase. For the larger strides (when the compared elements sit farther apart than shared memory can hold), the kernel must work out of global memory. But even here there are optimizations: loading multiple elements per thread (coarsening) reduces the number of threads needed, which can improve occupancy and reduce scheduling overhead. And once a stage’s stride shrinks back into shared-memory range, the kernel can hand off to the shared-memory fast path.
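The crossover between the two phases can be reasoned about directly: if a block stages a contiguous power-of-two chunk of E elements in shared memory, any comparator stride below E keeps both endpoints inside the chunk, because XOR with the stride only flips bits below log2 E. A host-side sketch, with capacities as illustrative parameters rather than H100 values:

```cpp
// Largest bitonic stride a block can handle entirely in shared
// memory, assuming it stages a contiguous power-of-two chunk.
// smemBytes and elemBytes are illustrative parameters.
int maxSharedStride(int smemBytes, int elemBytes) {
    int elems = smemBytes / elemBytes;
    int chunk = 1;
    while (chunk * 2 <= elems) chunk *= 2;  // round down to a power of two
    return chunk / 2;  // strides up to chunk/2 stay inside the chunk
}
```

For example, 48 KB of shared memory holding 4-byte keys gives a chunk of 8192 elements, so sub-stages with strides up to 4096 can run on-chip; larger strides fall to the global-memory phase. The real crossover still has to be measured, as the design-questions list below notes.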

Overlapping data transfer and computation with async copies
Figure 6: cudaMemcpyAsync overlaps host-to-device data transfer with kernel execution. Instead of waiting for the full transfer to complete before launching any kernels, the GPU can begin processing the first batch of data while the second batch is still transferring.
Pipelining data transfer with CUDA streams
Figure 7: CUDA streams take this further: multiple streams can run concurrently, so you can pipeline transfer, compute, and transfer-back across multiple chunks of data simultaneously.

Several other techniques are relevant:

  • Pinned (page-locked) memory (cudaMallocHost or cudaHostRegister): The DMA engine can only read from page-locked host memory, so transfers from pageable allocations force the driver to stage data through an internal pinned buffer first. Allocating host memory as pinned removes that staging copy. Throughput is often higher, but the gain varies with transfer size, interconnect, and system configuration.
  • Async memory copies (cudaMemcpyAsync): Lets the CPU continue executing while the DMA engine handles the transfer. Combined with CUDA streams, this enables overlapping transfer and computation.
  • CUDA streams: Multiple streams run concurrently on the GPU. If you can divide the work into chunks, you can overlap the transfer of chunk N+1 with the computation on chunk N.
  • Items-per-thread coarsening: Instead of one thread per element, assign 2 or 4 elements per thread. This reduces the total number of threads launched (potentially improving occupancy on large arrays), increases the arithmetic-to-indexing ratio, and can improve register usage.

Design Questions This Raised

The interesting engineering questions in this project were about policy rather than correctness:

  • Where is the shared-memory/global-memory crossover point? This depends on the GPU’s shared memory capacity, the block size, the data type size, and the number of elements. Finding the right crossover point requires measuring, not just reasoning.
  • How do you handle non-power-of-2 array sizes? Bitonic sort requires power-of-2 lengths. Padding with sentinel values (INT_MAX for ascending sort) is the standard approach, but the padding adds memory traffic for large inputs.
  • How many items per thread? The sweet spot depends on the hardware. Too few: too many threads, scheduling overhead dominates. Too many: registers spill, occupancy drops.
  • When is pinned memory actually faster? For small transfers the overhead of pinning can outweigh the benefit. For larger transfers and overlap-heavy pipelines, pinned memory is often a win, but you still need to measure on your target platform.
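The padding approach from the list above is simple to sketch: round up to the next power of two and fill with INT_MAX sentinels, so that after an ascending sort the padding sits at the tail and can be dropped. Helper names here are illustrative:

```cpp
#include <climits>
#include <cstddef>
#include <vector>

// Round n up to the next power of two (bitonic sort needs 2^k lengths).
std::size_t nextPow2(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Pad with INT_MAX so, after an ascending sort, the sentinels end up
// at the back and the first in.size() elements are the real output.
std::vector<int> padForBitonic(const std::vector<int>& in) {
    std::vector<int> out = in;
    out.resize(nextPow2(in.size()), INT_MAX);
    return out;
}
```

At 100M elements the worst case nearly doubles the array (100M pads to 128M), which is why the extra memory traffic from padding shows up in the performance budget.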

The Nsight Compute feedback loop was central to this project: run, profile, isolate the dominant bottleneck (achieved occupancy, memory throughput, or divergent branches), make one targeted change, and re-measure. Iterating that loop was far more reliable than one-shot “theoretical” tuning.

Shared memory bank conflicts
Figure 8: Bank conflicts occur when multiple threads in a warp access different addresses in the same shared memory bank; those accesses serialize instead of happening in parallel. For bitonic sort, the access stride pattern must be designed carefully to avoid conflicts during the shared-memory phase.

What These Projects Teach

These two projects establish the mental model that the rest of the course builds on.

From tiled matrix multiply:

  • The shared memory tiling pattern is the foundational GPU optimization. Understand it cold. Almost every high-performance kernel is a variation.
  • Arithmetic intensity is the right framing for “why is this kernel slow?” Low intensity = memory bound = need more reuse.
  • __syncthreads() is not optional when threads are cooperating through shared memory.

From bitonic sort:

  • The profiler is not optional. You cannot reason your way to 900 million elements per second on an H100. You profile, identify the bottleneck, fix it, and repeat.
  • Kernel launch overhead is real. When the algorithm has O(log² n) stages and n is 100M, you feel every unnecessary kernel launch.
  • The two-phase pattern (shared memory for small strides, global memory for large strides) is a general strategy for parallel sorting on GPUs, not just bitonic sort.
  • Data transfer is part of the performance budget. Pinned memory and async transfers are not micro-optimizations. For 100M elements, they are often the difference between meeting the target and missing it.

The broader pattern: Both projects teach the same underlying skill: start from the hardware’s perspective. What is the memory access pattern? Is it coalesced? How much data reuse is there? What does occupancy look like? The answers to those questions determine your optimization strategy.

Additional Resources

Programming Massively Parallel Processors: A Hands-on Approach


David B. Kirk and Wen-mei W. Hwu

Chapter 5 covers tiled matrix multiplication in depth, the definitive treatment of the pattern. Chapter 7 covers parallel patterns including sorting. The 4th edition adds material on tensor cores and modern optimization techniques.

CUDA by Example: An Introduction to General-Purpose GPU Programming


Jason Sanders and Edward Kandrot

A gentler starting point than PMPP. Good for building comfort with the CUDA API before diving into the optimization-heavy material. Available free from NVIDIA.

A Note on Code Availability

In accordance with Georgia Tech’s academic integrity policy and the license for course materials, the source code for these projects is kept in a private repository. This post follows Dean Joyner’s advice on sharing projects with a focus not on any particular solution but on an abstract overview of the problem and the underlying concepts involved.

I would be happy to discuss implementation details, architecture choices, or profiling results in an interview. Please feel free to reach out to request private access to the repository.