GPU Hardware and Software
Introduction
This article is part personal reflection, part technical summary. I recently completed Georgia Tech’s CS 8803: GPU Hardware and Software, and I want to take the time to solidify what I’ve learned, both for myself, in the spirit of the Feynman technique, and for anyone else curious about the course or the material it covers.
I came into this course after several OMSCS classes that helped frame it: High-Performance Computer Architecture (HPCA), Graduate Intro to Operating Systems (GIOS), and Advanced Operating Systems (AOS) on the systems side, plus Machine Learning (ML), Deep Learning (DL), and High-Dimensional Data Analytics (HDDA) on the machine learning side, while taking Reinforcement Learning concurrently. That mix made GPU Architecture feel like a reasonable next step, the hardware layer connecting systems and modern ML, and given how central GPUs are to deep learning, high-performance computing, and real-time graphics, it felt like time well spent. While those courses were great context, none are strict prerequisites, though GIOS, HPCA, and DL were particularly useful.
There’s always a tension when writing a post like this. I want to be accurate about what I’ve learned while acknowledging that I’m still building intuition in many of these areas. I’m sharing this in the spirit of learning in public.
I wrote deeper dives on each of the projects:
- Projects 1 & 2: CUDA Fundamentals: Tiled Matrix Multiply & Bitonic Sort
- Projects 3 & 4: GPU Simulation: Warp Scheduling & Compute/Tensor Cores
- Project 5: Static Analysis: Detecting Branch Divergence
- Extra Project: FlashAttention & LLM Inference on GPUs
GPU Fundamentals
Why GPUs? The CPU vs. GPU Philosophy
The central question behind GPU design is: what do you optimize for? CPUs are designed for latency, getting a single task done as fast as possible. They have large caches, sophisticated branch predictors, and out-of-order execution units all in service of minimizing the time any one instruction takes to complete. GPUs take the opposite philosophy: optimize for throughput. Instead of one powerful core, pack in thousands of simpler ones and run them all simultaneously.
This difference is rooted in the kinds of problems each is solving. A CPU excels at tasks with complex branching, irregular memory access, and sequential dependencies, which make up much of general-purpose software. GPUs excel at tasks that are embarrassingly parallel: the same operation applied to millions of independent data elements. Matrix multiplication, image filtering, physics simulation, and neural network training all decompose beautifully into thousands of independent sub-problems.
The historical path is interesting too. GPUs emerged from the graphics pipeline, where they needed to shade millions of pixels per frame. The realization that this “shade every pixel” model was essentially a massively parallel compute primitive is what gave rise to GPGPU (General Purpose GPU) computing, culminating in CUDA and the modern HPC/ML GPU ecosystem.
Flynn’s taxonomy gives us a useful vocabulary here:
- SISD: One instruction, one data element at a time. Classic single-core CPU.
- SIMD: One instruction applied to multiple data elements simultaneously. CPU vector units (SSE, AVX).
- SPMD: Single Program, Multiple Data. The GPU model, many threads run the same program but on different data, and they can diverge.
GPUs implement SPMD through the SIMT (Single Instruction, Multiple Threads) execution model. A group of 32 threads called a warp executes together in lockstep. As long as all 32 threads follow the same code path, the hardware is fully utilized. When they diverge (some take the if branch and some take the else), the hardware has to serialize execution by running the if threads while masking the else threads, then vice versa. This branch divergence is one of the central costs to manage when writing GPU code.
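To make the serialization cost concrete, here is a toy Python model of a warp executing an if/else in lockstep. The cycle counts and helper names are illustrative, not a real hardware model:

```python
# Toy model of SIMT divergence: a warp executes an if/else in lockstep,
# so the hardware runs each side with a mask, and the total cycle cost
# is the sum of every path that at least one thread needs.

WARP_SIZE = 32

def warp_cycles(take_if, if_cost, else_cost):
    """Cycles to execute a divergent if/else for one warp.

    take_if: list of booleans, one per thread; True means the thread
             takes the if-side of the branch.
    """
    any_if = any(take_if)
    any_else = not all(take_if)
    # Each side executes once for the whole warp if any thread needs it;
    # threads on the other side are masked but still burn the cycles.
    return (if_cost if any_if else 0) + (else_cost if any_else else 0)

# Uniform warp: everyone takes the if-side -> only one path executes.
uniform = [True] * WARP_SIZE
# Divergent warp: half and half -> both paths execute serially.
divergent = [i % 2 == 0 for i in range(WARP_SIZE)]

print(warp_cycles(uniform, 10, 10))    # 10
print(warp_cycles(divergent, 10, 10))  # 20 (the 2x worst case)
```

Note that the cost depends only on whether *any* thread needs a path, not how many: a single stray thread taking the else-side costs as much as sixteen.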
Parallel Decomposition Patterns
Before CUDA details, one question helps: what exactly are we parallelizing? In practice, GPU workloads usually fall into a few recurring patterns:
- Data decomposition: apply the same operation to many independent elements (vector add, matrix multiply, attention tiles). This is the dominant GPU pattern.
- Task decomposition: split work into stages with different responsibilities (for example, preprocessing, compute, postprocessing) and run those stages separately.
- Pipeline decomposition: overlap transfer and compute so different chunks are in different stages at the same time.
Most high-performance kernels in this article are data decomposition in an SPMD style: many threads run the same kernel body on different indices. Tiled GEMM, bitonic sort phases, and FlashAttention block loops all fit this shape.
A useful coordination mental model is local vs global communication. Inside a block, threads cooperate through shared memory and barriers (fast, local coordination). Across blocks, coordination is usually explicit via global memory and kernel boundaries (slower, but scalable). This distinction is why some algorithms fit cleanly into one kernel while others naturally become multi-kernel pipelines.
The CUDA Programming Model
CUDA (Compute Unified Device Architecture) is NVIDIA’s programming model for GPUs. It extends C/C++ with a set of abstractions that map directly onto the GPU hardware.
The central abstraction is the kernel, a function that runs on the GPU. You launch a kernel with a grid of thread blocks, and each block contains a set of threads. Every thread knows its position in the grid via built-in variables.
```cuda
// A minimal CUDA kernel: each thread adds one pair of elements
__global__ void vectorAdd(const float* A, const float* B, float* C, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        C[i] = A[i] + B[i];
    }
}

// Host code to launch it
int main() {
    int N = 1 << 20; // 1M elements
    float *d_A, *d_B, *d_C;
    cudaMalloc(&d_A, N * sizeof(float));
    cudaMalloc(&d_B, N * sizeof(float));
    cudaMalloc(&d_C, N * sizeof(float));

    int threadsPerBlock = 256;
    int blocksPerGrid = (N + threadsPerBlock - 1) / threadsPerBlock;
    vectorAdd<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, d_C, N);
    cudaDeviceSynchronize();
    // ...
}
```
The <<<blocksPerGrid, threadsPerBlock>>> syntax is the kernel launch configuration. Each call to vectorAdd spins up millions of threads, each computing one addition. The if (i < N) guard handles the case where N isn’t a multiple of the block size.
The GPU Memory Hierarchy
Understanding memory is the key to writing fast GPU code. GPUs have a rich memory hierarchy, and knowing which tier to use for a given access pattern is often the difference between a fast kernel and a slow one.
- Registers: The fastest memory, private to each thread. The compiler allocates these automatically. Running out forces “register spilling” to local memory (which is slow).
- Shared memory (__shared__): A fast, explicitly managed scratchpad memory shared by all threads in a block. This is the GPU programmer’s main tool for data reuse. It usually has much lower latency and higher bandwidth than global memory, but exact gains depend on architecture and access pattern.
- L1 and L2 cache: Hardware-managed caches that transparently accelerate global memory accesses.
- Global memory (DRAM): The main GPU memory, large (tens of GB on modern cards) but high latency. All threads can access it, but you want to minimize how often you do.
- Constant and texture memory: Read-only memory spaces optimized for specific access patterns (broadcast and spatial locality respectively).
A recurring pattern in high-performance CUDA kernels is: load from global into shared, compute, then write back to global. The tiled matrix multiplication in Project 1 is the canonical example.
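The structure of that pattern can be sketched in plain Python, with nested lists standing in for global memory and a small buffer standing in for shared memory. The tile size and matrices below are illustrative, and the real kernel is of course CUDA with one thread per output element:

```python
# Minimal sketch of the tiled-GEMM pattern: copy a tile of A and B into a
# fast "shared" buffer, do all the arithmetic that reuses it, then move on.

TILE = 2  # illustrative tile size; real kernels use 16 or 32

def tiled_matmul(A, B, n):
    C = [[0.0] * n for _ in range(n)]
    for ti in range(0, n, TILE):
        for tj in range(0, n, TILE):
            for tk in range(0, n, TILE):
                # "Load" TILE x TILE tiles of A and B (stand-in for the
                # global -> shared memory copy done by a thread block).
                a_tile = [[A[ti + i][tk + k] for k in range(TILE)]
                          for i in range(TILE)]
                b_tile = [[B[tk + k][tj + j] for j in range(TILE)]
                          for k in range(TILE)]
                # Compute: each loaded element is reused TILE times.
                for i in range(TILE):
                    for j in range(TILE):
                        for k in range(TILE):
                            C[ti + i][tj + j] += a_tile[i][k] * b_tile[k][j]
    return C

A = [[1, 2, 0, 0], [3, 4, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]
I = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
print(tiled_matmul(A, I, 4)[0][:2])  # [1.0, 2.0]  (A @ I == A)
```

The payoff is in the reuse: each element copied into the tile is read TILE times from the fast buffer instead of TILE times from slow global memory.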
Related CUDA Programming Guide sections (1-3) also cover:
- The full thread hierarchy (threads, blocks, grids) and how it maps to hardware.
- Memory model semantics and synchronization (__syncthreads()).
- How to choose block dimensions and grid dimensions for a given problem.
GPU Architecture
The Streaming Multiprocessor
The fundamental compute unit of an NVIDIA GPU is the Streaming Multiprocessor (SM). Each SM contains:
- A set of CUDA cores (ALUs) for integer and floating-point arithmetic
- A set of special function units (SFUs) for transcendentals (sin, cos, sqrt)
- A warp scheduler and dispatch units
- A register file (large, on the order of 256KB per SM on modern hardware)
- Shared memory / L1 cache (configurable partition)
A GPU might have dozens to hundreds of SMs, each capable of running many warps simultaneously. The H100 SXM (the card targeted in several projects in this course) has 132 SMs (the H100 PCIe variant has 114).
Warp Scheduling and Latency Hiding
This is where GPUs hide memory latency. When a warp issues a global memory load, it stalls for hundreds of cycles waiting for the data to arrive. Instead of sitting idle, the SM’s warp scheduler switches to a different ready warp and starts executing its instructions. By the time the scheduler cycles back to the first warp, the memory access has (hopefully) completed.
This technique is called latency hiding through multithreading, and it’s the core reason why GPUs can keep their arithmetic units busy despite having slow memory. It works because the GPU maintains a large pool of warps in flight simultaneously, far more than on a CPU. The occupancy (ratio of active warps to the maximum possible) is a key metric for whether latency hiding is working well.
Scoreboarding and Register Dependencies
GPUs use a scoreboard to track register dependencies between instructions in a warp. Before an instruction is issued, the scoreboard checks whether its source registers are ready (i.e., the previous instruction writing those registers has completed). If not, the warp stalls at the issue stage. The warp scheduler then moves on to another ready warp.
This is directly analogous to the scoreboarding in CPUs, but simpler in some ways because GPUs don’t do out-of-order execution within a warp. Instructions in a single warp always issue in program order.
Shared Memory and Bank Conflicts
Shared memory is organized into banks, parallel memory modules that can each serve one request per cycle. On current NVIDIA hardware, there are 32 banks, each 4 bytes wide, and consecutive 4-byte words map to consecutive banks.
A useful property is this: if all 32 threads in a warp access 32 different banks, all 32 accesses happen simultaneously in one memory transaction. If multiple threads access the same bank (a bank conflict), those accesses are serialized, costing extra cycles.
The special case is a broadcast: if all threads access the exact same address, the hardware broadcasts the value for free, no conflict. The only problematic case is when multiple (but not all) threads access different addresses in the same bank.
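These rules are easy to check mechanically. A small Python sketch, assuming the standard 32-bank, 4-byte-word layout described above, maps each thread's word index to a bank and reports the worst-case serialization degree:

```python
# Toy bank-conflict checker: map each thread's 4-byte word index to one
# of 32 banks and report the worst-case serialization degree for a warp.

NUM_BANKS = 32

def conflict_degree(word_indices):
    """Max number of *distinct* addresses hitting any one bank.
    Identical addresses broadcast, so they don't count as conflicts."""
    per_bank = {}
    for w in word_indices:
        per_bank.setdefault(w % NUM_BANKS, set()).add(w)
    return max(len(addrs) for addrs in per_bank.values())

# Stride-1: thread t reads word t -> 32 different banks, no conflict.
print(conflict_degree([t for t in range(32)]))      # 1
# Stride-2: words 0,2,4,... -> only 16 banks used, 2-way conflict.
print(conflict_degree([2 * t for t in range(32)]))  # 2
# Broadcast: every thread reads word 0 -> one bank, but no conflict.
print(conflict_degree([0] * 32))                    # 1
```

The stride-2 case is the classic trap: words 0 and 32 land in the same bank, so half the banks serve two requests each and the access takes twice as long.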
Memory Coalescing
For global memory, the key performance concern is coalescing. When threads in a warp issue memory requests, the hardware tries to combine them into as few DRAM transactions as possible. If 32 threads each access consecutive 4-byte words (addresses 0, 4, 8, …, 124), that’s one 128-byte cache line transaction. If they access random addresses scattered across memory, that’s 32 separate transactions, a 32x slowdown.
In practice, adjacent threads should access adjacent memory locations. This is why the “row-major thread indexing” pattern is the default in most CUDA kernels.
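A minimal model makes the transaction count tangible: bucket each thread's byte address into 128-byte segments and count the distinct segments. This ignores real-hardware details like sector granularity, but captures the first-order effect:

```python
# Toy coalescing model: count how many 128-byte transactions a warp's
# 4-byte loads generate by bucketing addresses into 128-byte segments.

def num_transactions(addresses):
    return len({addr // 128 for addr in addresses})

# Coalesced: thread t loads address 4*t -> one 128-byte transaction.
print(num_transactions([4 * t for t in range(32)]))    # 1
# Strided by 128 bytes: each thread hits its own segment -> 32 transactions.
print(num_transactions([128 * t for t in range(32)]))  # 32
```

The strided case is exactly the 32x slowdown described above: same amount of useful data, 32 times the memory traffic.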
Module 4 of the course also covered:
- The full SM microarchitecture, including the register file design and shared memory organization.
- How the GPU pipeline stages map onto the warp scheduling loop.
- Occupancy analysis and the limits imposed by register count, shared memory usage, and block size.
- DRAM architecture (GDDR, HBM) and its impact on memory bandwidth.
Architecture Optimizations
Branch Divergence and Reconvergence
Branch divergence is the performance cliff for GPU code. When threads within a warp take different paths through an if/else, the hardware must execute both paths sequentially with masking. The worst case is a 2x slowdown for a balanced if/else. For deeply nested branches or large switch statements, it gets much worse.
Modern GPUs handle this with architecture-dependent control-flow machinery. Older SIMT-stack explanations use explicit IPDOM-style reconvergence stacks, while Volta-and-later independent thread scheduling changes how reconvergence is managed in hardware. The practical performance takeaway is unchanged: divergence still reduces effective throughput when paths serialize.
Predicated execution is the optimization for small divergent branches. Instead of taking a branch at all, the compiler emits the instructions for both sides with predicate registers. Threads that shouldn’t execute a predicated instruction have it masked out, no branch, no reconvergence overhead. The cost is executing both sides of the branch (discarding results for threads where the predicate is false), which is only worthwhile when the branch body is very short.
Module 6 of the course also covered:
- Dynamic warp formation: the concept of regrouping threads from different warps that happen to be at the same PC to form a new full warp (a research technique used in some architectures).
- Register file virtualization for handling the case where the logical register file exceeds the physical one.
- More advanced reconvergence schemes and their interaction with the thread scheduler.
Virtual Memory and TLBs at GPU Scale
GPUs need virtual memory for the same reasons CPUs do: process isolation, demand paging, and unified virtual addressing. But the scale is dramatically different. Where a CPU has a handful of threads, a GPU has thousands. A TLB miss for one warp blocks that warp’s memory access, but since the GPU is constantly switching between warps anyway, the latency is easier to hide.
The challenge is TLB reach. With thousands of concurrent threads accessing large data sets, the GPU’s TLB needs to cover more address space than a CPU’s. Modern GPU architectures address this through multi-level TLBs, large page support, and careful page table design.
Unified Virtual Addressing (UVA) and cudaMallocManaged take this further by giving the CPU and GPU a shared virtual address space. Pointers are valid on both sides, and the driver and hardware manage migrating pages to where they’re needed. This is enormously convenient for programmers but requires careful thought about access patterns to avoid excessive page migration overhead.
Module 7 of the course also covered:
- The four-level TLB hierarchy in modern NVIDIA GPUs and the cost of TLB misses at scale.
- How page faults are handled on the GPU (the hardware must halt the faulting warp, issue a page fault exception, and wait for the OS to resolve it).
- Prefetching strategies for Unified Memory to reduce page fault overhead.
Warp Scheduling Policies
With dozens of warps competing for issue slots each cycle, the warp scheduler’s policy has a significant effect on performance. Three policies covered in this course:
- Round Robin (RR): The simplest policy, cycle through all ready warps in order. Fair, predictable, but doesn’t take advantage of data that’s already in cache.
- Greedy-Then-Oldest (GTO): Stay with the currently executing warp until it stalls, then switch to the oldest ready warp. In many simulator studies this improves locality and stall behavior, though production scheduler policies are vendor- and architecture-specific.
- Cache-Conscious Wavefront Scheduling (CCWS): A smarter policy that dynamically throttles the number of active warps based on L1 cache pressure. When the cache is being thrashed, many warps evicting each other’s lines, CCWS reduces the number of active warps to improve reuse. When there’s spare cache capacity, it lets more warps run.
The CCWS insight is that latency hiding (which benefits from more concurrent warps) and cache locality (which benefits from fewer) are in fundamental tension. The right balance is workload-dependent. Project 3 involved implementing all three of these policies in a cycle-level GPU simulator.
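The difference between RR and GTO fits in a few lines. Here is a sketch of just the issue-selection step, with warp ids and readiness as stand-ins for real scheduler state (CCWS needs cache-pressure feedback and is omitted):

```python
# Toy issue-policy sketch: pick the next warp to issue under Round Robin
# vs Greedy-Then-Oldest. Warps and readiness are illustrative stand-ins.

def round_robin(ready, last_issued, num_warps):
    """Cycle through warp ids starting just after the last issued one."""
    for off in range(1, num_warps + 1):
        w = (last_issued + off) % num_warps
        if w in ready:
            return w
    return None  # no warp ready: the SM idles this cycle

def gto(ready, current, oldest_first):
    """Stay on the current warp until it stalls, then take the oldest."""
    if current in ready:
        return current  # greedy: keep its cached data hot
    for w in oldest_first:  # warp ids ordered oldest -> youngest
        if w in ready:
            return w
    return None

ready = {0, 2, 3}  # warp 1 is stalled on a memory load
print(round_robin(ready, last_issued=0, num_warps=4))     # 2
print(gto(ready, current=2, oldest_first=[0, 1, 2, 3]))   # 2 (stays greedy)
print(gto(ready, current=1, oldest_first=[0, 1, 2, 3]))   # 0 (oldest ready)
```

RR rotates fairly regardless of locality; GTO keeps draining one warp's instruction stream, which tends to keep that warp's working set in the L1.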
GPU Simulation
Why Simulate?
Real GPU hardware gives you performance counters and profilers, but they only tell you what happened, not why, and not what would have happened with a different microarchitectural decision. GPU simulators let you ask counterfactual questions: what if the cache were twice as large? What if the warp scheduler used a different policy? What if tensor cores had half the latency?
This kind of analysis is how GPU architects design and validate new hardware before it’s built. Academic researchers use simulators to prototype and evaluate new architectural ideas. Projects 3 and 4 used Macsim, a trace-driven cycle-level simulator developed at Georgia Tech’s HPArch lab.
A useful way to categorize simulator choices:
| Method | Accuracy | Runtime cost | Best use case |
|---|---|---|---|
| Cycle-level (trace or execution-driven) | High | High | Microarchitectural policy work (warp schedulers, cache behavior, buffer stalls). |
| Event-driven / queue-based | Medium-High | Medium | Pipeline studies where stage interactions matter but cycle-by-cycle replay can be abstracted. |
| Sampling / interval-based | Medium | Low | Fast design-space exploration over many configurations. |
Macsim sits in the first bucket. The queue-based view is still central: requests and warps move through fetch/issue/memory/execution queues with per-stage latency and width constraints, and those queue interactions produce the stall behavior seen in final metrics.
The Roofline Model
Before diving into simulation, it’s worth having a mental model for where a GPU workload can be optimized. The Roofline Model plots workload performance (FLOPS) against arithmetic intensity (FLOPS per byte of memory traffic). The two bounding constraints are:
- Compute bound: the workload is limited by the GPU’s peak FLOP/s rate. More arithmetic per byte means this is more likely.
- Memory bound: the workload is limited by memory bandwidth. Low arithmetic intensity means each byte fetched from DRAM is barely used before more data is needed.
Tiled matrix multiplication (Project 1) is a classic example of using shared memory to increase arithmetic intensity and push a workload from memory-bound to compute-bound. The FlashAttention algorithm (Extra Project) uses the same principle at much larger scale.
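The roofline itself is one formula: attainable throughput is the minimum of the compute roof and bandwidth times arithmetic intensity. A sketch with made-up peak numbers:

```python
# Roofline in one line: attainable FLOP/s is the min of the compute roof
# and bandwidth * arithmetic intensity. Peak numbers here are illustrative.

def attainable_flops(ai, peak_flops, peak_bw):
    """ai: arithmetic intensity in FLOPs per byte of DRAM traffic."""
    return min(peak_flops, peak_bw * ai)

PEAK = 100e12  # 100 TFLOP/s (made-up compute roof)
BW = 2e12      # 2 TB/s (made-up memory bandwidth)

ridge = PEAK / BW  # AI at which a workload stops being memory bound
print(ridge)                            # 50.0 FLOPs/byte
print(attainable_flops(1, PEAK, BW))    # memory bound: 2 TFLOP/s
print(attainable_flops(100, PEAK, BW))  # compute bound: 100 TFLOP/s
```

The ridge point is the whole game: optimizations like tiling work by moving a kernel's arithmetic intensity rightward, past the ridge, so the compute roof rather than the bandwidth slope limits it.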
Module 8 of the course also covered:
- Cycle-level simulation methodology: how to model pipeline stages, caches, and memory controllers.
- Trace-driven vs. execution-driven simulation and their respective trade-offs.
- Event-driven simulation and queue-based modeling.
- How to validate a simulator against real hardware results.
Multi-GPU Computing
At some point, a single GPU isn’t enough. Modern ML training runs span hundreds or thousands of GPUs. The hardware and software infrastructure to make this work is its own deep topic.
NVLink is NVIDIA’s high-bandwidth GPU interconnect, providing much higher bandwidth than PCIe for GPU-to-GPU communication. NVSwitch scales this to many-GPU topologies by acting as a high-bandwidth switch fabric. Depending on GPU generation and topology, this can deliver substantially higher peer bandwidth than PCIe alone, which is critical for collective operations (all-reduce, all-gather) in distributed training.
GPUDirect RDMA allows GPUs to directly transfer data to remote GPUs over InfiniBand or other network fabrics, bypassing the CPU entirely. This is how large GPU clusters achieve the inter-node bandwidth needed for training runs with thousands of GPUs.
The software side involves NCCL (NVIDIA Collective Communications Library), which implements the standard collective operations (broadcast, reduce, all-reduce, etc.) efficiently across GPU clusters.
Module 9 of the course also covered:
- The hardware topology of multi-GPU systems, including PCIe switches, NVLink bridges, and NVSwitch fabrics.
- NUMA memory effects when GPUs on different sockets access each other’s memory.
- Multi-Process Service (MPS) for sharing a GPU across multiple processes.
- Thread block scheduling across multiple GPUs.
Compiler Analysis for GPUs
The GPU Compilation Pipeline
CUDA code goes through a multi-stage compilation process before running on hardware:
- NVCC splits the CUDA source into host (CPU) and device (GPU) portions.
- The device code is compiled to PTX (Parallel Thread eXecution), NVIDIA’s virtual ISA. PTX is architecture-independent and stable across GPU generations.
- The PTX assembler (ptxas) or driver compiles PTX to SASS (Streaming ASSembler), the actual hardware instructions for a specific GPU architecture.
This two-stage design is what lets CUDA programs run on future GPU architectures without recompilation: the PTX can be recompiled by the driver at runtime.
This post is CUDA-centric because that’s what the projects used, but the course framing also compares CUDA with OpenCL as a portability baseline. CUDA optimizes for NVIDIA-specific toolchains and libraries, while OpenCL emphasizes cross-vendor portability with a more generic runtime model. The tradeoff is usually convenience and peak-performance ecosystem depth (CUDA) versus portability and vendor neutrality (OpenCL).
Dataflow Analysis and Divergence Detection
The compiler needs to understand which branches will be divergent (executed differently by different threads in a warp) and which won’t. A branch that depends on thread ID is almost certainly divergent. A branch that depends only on uniform values (the same across all threads) is safe.
This analysis is a form of dataflow analysis, specifically, tracking which values in a program are “thread-varying” (different per thread) vs. “uniform” (same for all threads in a warp). The source of thread variation is the thread ID itself (the threadIdx and blockIdx built-in variables in CUDA). Any value computed from thread ID may be varying, while any value that never depends on thread ID is uniform.
The analysis works on the Control Flow Graph (CFG) of the program, computing def-use chains and propagating taint through use-def relationships. For programs with loops, this becomes an iterative fixed-point computation: you keep propagating taint until nothing changes. Module 11 covers the details of liveness analysis and dataflow equations, and Project 5 involved implementing branch divergence detection in exactly this way.
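The core of that fixed-point computation is compact. Here is a sketch over a tiny hypothetical def-use graph (the value names and dependencies are invented for illustration, not taken from Project 5):

```python
# Sketch of thread-ID taint propagation: values derived from threadIdx are
# "varying"; everything else is uniform. Iterate def-use edges to a fixed
# point. The tiny "program" below is a hypothetical def-use graph.

def propagate_taint(defs, seeds):
    """defs: {value: set of values it is computed from}
    seeds: values that are inherently thread-varying (e.g. threadIdx)."""
    varying = set(seeds)
    changed = True
    while changed:  # fixed-point iteration: stop when nothing new taints
        changed = False
        for v, operands in defs.items():
            if v not in varying and operands & varying:
                varying.add(v)
                changed = True
    return varying

# i = threadIdx.x * 4; addr = base + i; n is a kernel parameter (uniform).
defs = {
    "i":    {"threadIdx.x"},
    "addr": {"base", "i"},
    "cond": {"i", "n"},  # branch on cond -> divergent
    "flag": {"n"},       # branch on flag -> uniform, safe
}
varying = propagate_taint(defs, seeds={"threadIdx.x"})
print("cond" in varying, "flag" in varying)  # True False
```

Loops are what force the iteration: a value defined inside a loop can become tainted on a later pass once one of its loop-carried operands does, so a single sweep over the CFG is not enough.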
Modules 10 and 11 of the course also covered:
- The LLVM IR and how CUDA maps onto it.
- Single Static Assignment (SSA) form and why compilers prefer it for analysis.
- Reaching definitions, liveness analysis, and their applications to register allocation.
- The connection between GPU divergence analysis and classic compiler dataflow techniques.
ML Acceleration on GPUs
Why GPUs Dominate Machine Learning
The operations at the core of deep neural networks, including linear layers, convolutions, and attention, are all forms of General Matrix Multiply (GEMM). A GEMM takes two matrices and produces their product, and the structure is completely regular, perfectly parallel, and requires many floating-point operations per byte of data loaded. This is almost exactly the kind of workload GPUs were built for.
For a matrix multiply of two N×N matrices, there are O(N³) multiply-add operations and O(N²) bytes of data. As N grows, arithmetic intensity grows as O(N), so computation scales faster than memory traffic. This is why large batch sizes and large model dimensions are more GPU-efficient: they push arithmetic intensity higher, keeping the GPU’s ALUs busy relative to memory bandwidth.
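That scaling argument reduces to simple arithmetic. A sketch assuming FP32 elements and perfect reuse (each matrix read or written exactly once):

```python
# Arithmetic intensity of an N x N GEMM: 2N^3 FLOPs over 3N^2 matrix
# elements (read A, read B, write C), assuming FP32 and perfect reuse.

def gemm_ai(n, bytes_per_elem=4):
    flops = 2 * n**3                     # N^3 multiply-adds = 2N^3 FLOPs
    traffic = 3 * n**2 * bytes_per_elem  # A + B + C, each touched once
    return flops / traffic

print(gemm_ai(64))    # ~10.7 FLOPs/byte
print(gemm_ai(4096))  # ~682.7 FLOPs/byte: AI grows linearly with N
```

The ratio simplifies to N/6 for FP32, which is the O(N) growth in the text: doubling the matrix dimension doubles the work done per byte moved.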
Tensor Cores and Mixed Precision
Tensor cores are a specialized execution unit introduced in Volta (2017) and significantly expanded in each subsequent architecture. They perform small dense matrix multiply-accumulate operations (e.g., 4×8 × 8×4 = 4×4 at the hardware micro-tile level) in a single instruction. This is different from regular CUDA cores, which do one multiply-add at a time. At the programmer level, WMMA exposes larger tile shapes such as 16×16×16.
The catch: tensor cores operate on reduced-precision inputs, typically FP16 (half precision) or BF16. The accumulation happens in FP32. This mixed-precision approach, FP16 computation with FP32 accumulation, is now the default training paradigm for neural networks. FP16 inputs need half the memory and half the memory bandwidth of FP32 and unlock the full tensor core throughput, while FP32 accumulation prevents precision loss from degrading model quality.
The HMMA (Half-precision Matrix Multiply-Accumulate) instruction that tensor cores expose is accessed in CUDA via the wmma API or through libraries like cuBLAS. The compiler can also automatically use tensor cores when it detects compatible GEMM patterns.
Quantization extends the same throughput logic. Moving from FP16/BF16 to INT8 or FP8 reduces bytes moved per operation, which can shift workloads toward higher effective arithmetic intensity and larger feasible batch sizes at fixed memory bandwidth. But quantization is not free: scale factors, dequant/requant steps, and outlier handling can add overhead, and numerically sensitive paths (normalization, reductions, logits) often still need higher-precision accumulators.
Sparse operations add another tradeoff layer. If sparsity is structured enough for hardware support, you can reduce compute and memory traffic substantially. If sparsity is unstructured, metadata overhead and irregular access can erase the theoretical gains. That is why sparse acceleration is highly workload- and format-dependent in real systems.
Module 12 of the course also covered:
- The full landscape of DNN operations: elementwise, reduction, GEMM, and specialized ops.
- Arithmetic intensity analysis for different neural network layer types and batch sizes.
- Quantization: going below FP16 to INT8 or even INT4 for inference.
- Sparse matrix operations and their (limited) effectiveness on current hardware.
Course Overview
Projects
The projects were the heart of the course, and each one reinforced a different layer of the GPU stack.
- P1 (Tiled Matrix Multiply) and P2 (Bitonic Sort): Writing real CUDA kernels, managing shared memory, optimizing for memory coalescing and occupancy. A hands-on introduction to the hardware concepts from the first half of the course. Full write-up.
- P3 (Warp Scheduling Simulator) and P4 (Compute/Tensor Core Simulation): Implementing warp scheduling policies and execution modeling inside a cycle-level simulator. Made the microarchitectural concepts concrete in a way that reading never fully does. Full write-up.
- P5 (Branch Divergence Detection): A static analysis pass that detects divergent branches by propagating thread-ID taint through a SASS control flow graph. Tied together the compiler and architecture topics from the second half of the course. Full write-up.
- Extra Project (FlashAttention): A progressive implementation of self-attention, FlashAttention-2, and a KV-cached decode kernel inside a real GPT-2 inference pipeline. The capstone. Full write-up.
Grading
Here’s the grading breakdown from when I took the course in Spring 2024:
| Component | Weight | Description |
|---|---|---|
| Project 1 | 5% | Tiled Matrix Multiplication |
| Project 2 | 20% | Bitonic Sort |
| Project 3 | 15% | Warp Scheduling Simulator |
| Project 4 | 15% | Compute & Tensor Core Simulation |
| Project 5 | 12% | Branch Divergence Detection |
| Extra Project | 0% | FlashAttention & LLM Inference |
| Homework | 15% | Multiple-Choice Quizzes |
| Final Exam | 10% | Covers all course topics |
| Participation | 3% | Discussion forum |
The projects were weighted heavily, about 70% of the grade combined. The exams covered the lecture material and required understanding the concepts at a level deeper than just memorizing definitions.
Books
The required textbook for the course is Programming Massively Parallel Processors (PMPP). It’s dense but excellent, essentially the definitive CUDA programming reference, written by two of the people who helped develop the CUDA programming model.

Programming Massively Parallel Processors: A Hands-on Approach
David B. Kirk and Wen-mei W. Hwu
The primary textbook for the course. Covers the CUDA programming model, memory hierarchy, tiling, and optimization in rigorous detail. The 4th edition adds chapters on tensor cores and ML acceleration. Essential reading for anyone serious about GPU programming.

Computer Architecture: A Quantitative Approach
John L. Hennessy and David A. Patterson
The gold standard reference for computer architecture. Relevant for the GPU pipeline, memory hierarchy, and performance analysis sections of the course. Appendix C on pipelining and Chapter 4 on data-level parallelism are particularly applicable.

CUDA by Example: An Introduction to General-Purpose GPU Programming
Jason Sanders and Edward Kandrot
A gentler introduction to CUDA than PMPP. Good for getting comfortable with the programming model before diving into the optimization details. Available free from NVIDIA.
Other Resources
- CUDA Programming Guide - the authoritative reference, surprisingly readable
- Horace He’s Making Deep Learning Go Brrrr - a great intuitive explanation of compute/memory boundedness and arithmetic intensity
- FlashAttention paper (Dao et al., 2022) - required reading for the Extra Project
- FlashAttention-2 paper (Dao, 2023) - the improved version used in the project
- OMSCS Central reviews - useful for workload estimates before starting
My Experience
I loved the broad set of topics this course covered and how it united the systems foundations of courses like HPCA, HPC, GIOS, with ML courses like DL. It changed how I think about performance at the hardware level, not just “is this fast?” but “why is this fast, and what’s the limiting factor?”
The course has a clear arc: start with programming (what can GPUs do?), move to architecture (how do they do it?), then simulation (how do we model and study them?), then compilers (how do we analyze code for GPUs?), and finally ML acceleration (what are they doing in production?). Each section builds on the last. By the time you hit the Extra Project and are writing FlashAttention kernels, you understand not just what FlashAttention does but why the specific tiling and online-softmax choices map well onto GPU memory hierarchy.
If you’re interested in ML systems, GPU programming, or computer architecture, I’d strongly recommend this course. It fills in a layer that most ML practitioners never see, and that ignorance shows up as real performance gaps when you try to optimize serious workloads.
Thanks for reading.
References
- Kirk, D. B., & Hwu, W. W. (2022). Programming Massively Parallel Processors: A Hands-on Approach (4th ed.).
- Hennessy, J. L., & Patterson, D. A. (2017). Computer Architecture: A Quantitative Approach (6th ed.).
- Dao, T., Fu, D. Y., Ermon, S., Rudra, A., & Ré, C. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. NeurIPS 2022.
- Dao, T. (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. ICLR 2024.
- Lakshminarayana, N. B., & Kim, H. (2010). Effect of warp divergence in CUDA. Workshop on General Purpose Processing on Graphics Processing Units (GPGPU).
- Jog, A., Kayiran, O., Nachiappan, N. C., Mishra, A. K., Kandemir, M. T., Mutlu, O., … & Das, C. R. (2013). OWL: Cooperative thread array aware scheduling techniques for improving GPGPU performance. ASPLOS.