Graduate Introduction to Operating Systems
Introduction
This article is part personal reflection, part technical summary. I recently completed Georgia Tech's CS 6200: Graduate Introduction to Operating Systems (GIOS), and I wanted to take the time to solidify what I've learned, both for myself, in the spirit of the Feynman technique, and for others who might be curious about the course or about me. This includes future OMSCS students, hiring managers, and peers who want a better sense of what this course covers and what I now know (and don't know yet).
There's always a tension when writing something like this. I want to be honest about what I've learned and the depth of my understanding, while acknowledging that I'm still learning. This course was challenging and rewarding, and this write-up is my way of digesting the material and sharing it in a format that feels productive.
Operating Systems Overview
At a high level, an operating system (OS) "is a special piece of software that abstracts and arbitrates the use of a computer system." It sits between user applications and hardware, providing the illusion of a cohesive, manageable system despite the complexity and diversity of hardware components.

This layered architecture separates user space, where applications run, from kernel space, where the OS performs core functions like managing hardware and system resources. Applications interact with the OS through standard libraries and system calls, enabling functionality like file operations or memory allocation without direct hardware access. The OS, in turn, controls and coordinates the hardware layer, which includes the CPU, memory, storage, displays, network devices, and input peripherals.
The OS fulfills several critical roles:
- Controls access to critical resources like the CPU, memory, and I/O devices. Without this, applications could interfere with each other or with the hardware directly.
- Provides abstractions such as files, processes, and virtual memory. These simplify development and improve portability by hiding hardware-specific details behind a consistent interface exposed via system calls.
- Enforces policies to manage limited resources: deciding which processes run when, how memory is allocated, and which users or programs can access specific files or devices. These policies balance competing goals like fairness, responsiveness, efficiency, and security.
- Virtualizes everything but time, enabling multiple processes (and potentially multiple users) to share the same hardware as if each had their own isolated system. For instance, CPU virtualization creates the illusion of multiple CPUs; memory virtualization ensures processes operate within isolated memory spaces.
In short, the OS is the software layer that turns raw, unreliable hardware into a stable, usable, and secure programming environment.
For more details, see the Introduction to Operating Systems lecture which covers:
- Separation of Mechanism and Policy: This principle emphasizes designing flexible mechanisms that can support various policies, allowing the OS to adapt to different scenarios.
- Operating system structures such as monolithic, microkernel, and hybrid kernels, and the practical design choices each architecture reflects.
- Examples of major operating systems (Windows, Linux, macOS, Android, and iOS) that illustrate the diversity of OS implementations.
CPU Virtualization
Processes, Threads, and Context Switching
One of the central responsibilities of the operating system is to virtualize the CPU, creating the illusion that each process has its own dedicated processor. This abstraction is achieved through process management and scheduling, enabling multitasking and isolation across programs.
A process is an instance of a running program, and the OS is responsible for managing its entire lifecycle: from creation and execution to suspension and termination. Each process is given its own private address space and a set of resources, such as open file descriptors and execution context, ensuring that it operates independently of other processes. The Process Control Block (PCB) is the data structure the OS uses to manage the process's state and metadata.
- A process is an instance of a program in execution.
- The OS manages the lifecycle of processes: creation, execution, suspension, and termination.
- Each process has its own memory space, file descriptors, and state.

To enable more granular concurrency within a process, modern systems use threads. A thread is the smallest unit of execution and shares the same address space with other threads in the same process. This shared memory model allows for lightweight context switches and efficient communication but also introduces complexity: threads must coordinate access to shared resources to avoid data races and inconsistencies.
- A thread is a unit of execution within a process.
- Multiple threads in a single process share the same memory space.
- Threads introduce concurrency and require careful synchronization.

A context switch is the process of switching the CPU from one process or thread to another. It involves saving the current state of the running process to its PCB and then loading the state of the next process to be executed from its PCB. This allows multiple processes to share a single CPU, enabling multitasking. The operating system code (in kernel mode) manages the context switch by interrupting the current process and switching to the next process.
To execute a system call, like read or write, the current process must trap into kernel mode, a privileged mode in which the operating system can execute instructions and access resources that user processes cannot. A system call is the mechanism by which a user process requests a service from the operating system. A minimal example follows the bullets below.
- A system call is how a user process requests a service from the operating system.
- Kernel mode is a privileged mode in which the OS can execute code and access hardware that user processes cannot.
- User mode is the unprivileged mode in which application code runs; privileged operations require a trap into the kernel.
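As a concrete sketch (the file path is just an assumption; any readable file works), here is user code invoking system calls. Each call to open, read, and write traps into kernel mode, the kernel does the privileged work, and control then returns to user mode.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[128];

    // open() traps into the kernel, which checks permissions and returns a file descriptor
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    // read() asks the kernel to copy up to 127 bytes into our buffer
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n == -1) { perror("read"); return 1; }

    // write() asks the kernel to send those bytes to stdout (file descriptor 1)
    write(STDOUT_FILENO, buf, (size_t)n);
    close(fd);
    return 0;
}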
For more details, see the Processes and Process Management lecture, which covers:
- The memory layout of a process including the stack, heap, and data segments.
- The process address space and memory management.
- What's in the process control block (PCB) and how it's used to manage the process's state and metadata.
- The execution states of a process and the life cycle as it transitions between them (new, ready, running, waiting, terminated).
- The paper "An Introduction to Programming with Threads" (Birrell, 1989)
Concurrency and Synchronization
The ability to run multiple threads concurrently unlocks significant performance gains. Even on a single CPU, multithreading improves responsiveness by allowing a program to perform other work while one thread is blocked on a long-running operation like disk I/O. On multi-core systems, it enables true parallelism, where multiple threads execute simultaneously. However, this concurrency introduces challenges, chiefly data races, where multiple threads access shared data and at least one is writing, leading to unpredictable outcomes.
To prevent such issues, we use synchronization constructs:
- Mutexes (Mutual Exclusion) are the most basic synchronization primitive. They act as a lock, ensuring that only one thread can access a critical section of code at a time. A thread must acquire the mutex before entering the critical section and release it upon exiting.
- Condition Variables allow threads to wait for a specific condition to become true. They are used with a mutex to block a thread until another thread signals that the condition is met, avoiding inefficient busy-waiting.
These constructs are critical to avoid common pitfalls like deadlocks, where two or more threads are stuck waiting for each other to release a resource, and spurious wake-ups, where a waiting thread is awakened without the condition it was waiting for being met.
For more details, see the Threads and Concurrency lecture which covers:
- Synchronization primitives like the Readers/Writer lock.
- Common pitfalls like spurious wake-ups and deadlocks in more detail.
- Kernel vs. User-Level Threads and different multithreading models (One-to-One, Many-to-One, Many-to-Many).
- Multithreading patterns like the Boss/Workers Pattern, Pipeline Pattern, and Layered Pattern are covered in the Thread Design & Performance Considerations section.
These mechanisms enable the operating system to support concurrent applications running on shared hardware, while maintaining isolation, fairness, and control.
PThreads
POSIX Threads (PThreads) are the standard C API for creating and synchronizing threads on Linux and other Unix-like operating systems such as macOS. The following producer-consumer example demonstrates thread creation, mutexes, and condition variables in a single, concise program. I included it here to illustrate how little code is needed to work with the abstractions of threads and synchronization.
Feel free to skip the code block below and move on to the next section.
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

// Shared resources protected by a mutex
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int message_ready = 0;

// Consumer thread waits for a condition
void* consumer(void* arg) {
    pthread_mutex_lock(&lock);
    // Use a while loop to protect against spurious wakeups
    while (message_ready == 0) {
        printf("Consumer: Waiting for message...\n");
        // Unlocks the mutex and waits; re-acquires lock upon wakeup
        pthread_cond_wait(&cond, &lock);
    }
    printf("Consumer: Message received!\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

// Producer thread signals the condition
void* producer(void* arg) {
    // sleep(1) helps ensure the consumer runs first and waits
    sleep(1);
    pthread_mutex_lock(&lock);
    printf("Producer: Preparing message...\n");
    message_ready = 1;
    // Wakes up one waiting thread
    pthread_cond_signal(&cond);
    printf("Producer: Message sent!\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t producer_thread, consumer_thread;

    // Create the two threads
    pthread_create(&consumer_thread, NULL, consumer, NULL);
    pthread_create(&producer_thread, NULL, producer, NULL);

    // Wait for both threads to complete
    pthread_join(producer_thread, NULL);
    pthread_join(consumer_thread, NULL);

    printf("Producer and consumer have finished.\n");
    return 0;
}
This C code demonstrates a basic producer-consumer pattern using POSIX Threads (PThreads). The producer thread prepares a "message" and signals the consumer thread when it's ready. The consumer thread waits for this signal before processing the message, ensuring proper synchronization between the two threads using a mutex and a condition variable.
A classic real-world scenario for the producer-consumer pattern is a message queue or a task processing system. Imagine a web server (the producer) that receives many user requests. Instead of processing each request immediately, which could overwhelm the server if there's a sudden surge, it adds these requests as "messages" to a queue. A separate pool of worker processes or threads (the consumers) then pick up these messages from the queue one by one, process them, and perform the necessary tasks (e.g., database operations, sending emails, generating reports). This allows the server to handle incoming requests efficiently by decoupling the request reception from its actual processing, ensuring smooth operation even under heavy load.
For more details, see the Threads Case Study - PThreads lecture which covers:
- Thread attributes (pthread_attr_t) for customizing thread behavior (e.g., stack size, detached state).
- Safety tips for using mutexes and condition variables to avoid common errors.
- A complete, classic Producer-Consumer problem implementation using PThreads.
Thread Design Considerations
This section is a bit more abstract. It delves into the tradeoffs that operating system designers make when designing their threading models.
Beyond choosing a concurrency pattern, building a robust multithreaded system requires careful consideration of how threads are managed by the operating system and how they interact with system events. While most developers are unlikely to design their own threading models, understanding these tradeoffs reveals how threads work under the hood.
- User-level vs. Kernel-level Threads: This is a core design choice. User-level threads are managed by a library within a process, making them fast to create and switch. However, the OS kernel is unaware of them; if one user thread blocks on I/O, the entire process is blocked. Kernel-level threads are managed by the OS, which can schedule them across multiple CPU cores and handle blocking independently. This comes at the cost of slower creation and context switching due to the need for system calls.
- Threading Models: To balance these trade-offs, systems use different models. The one-to-one model (e.g., modern Linux, Windows) maps each user thread to a kernel thread, offering good performance and simplicity. The many-to-one model maps many user threads to a single kernel thread, which is efficient but suffers from the blocking problem. The many-to-many model offers a hybrid approach, but adds significant complexity.
- Handling Interrupts and Signals: Figure 4 illustrates the interrupt handling cycle. Interrupts are asynchronous signals sent to the CPU from software, I/O devices, or internal processor exceptions. When an interrupt is issued, the CPU halts the currently running code, saves its state, and executes a special function called an interrupt handler to address the event. Once the handler finishes, the CPU restores the original state and resumes execution. This mechanism is fundamental to multitasking and allows the OS to respond to events efficiently. In a multithreaded environment, handling these asynchronous events is particularly complex. A classic problem arises when a signal handler, which runs in the context of the interrupted thread, tries to acquire a mutex that the thread already holds, causing an instant deadlock. A common, simple solution is to disable signals around critical sections. More advanced OS designs may treat the main work of an interrupt handler as a separate, lower-priority thread (a "bottom-half" handler) to avoid such issues.
- Threading in Practice: Linux: Modern Linux provides a concrete example of these designs. It uses a one-to-one threading model called the Native POSIX Thread Library (NPTL). Internally, Linux doesn't strongly distinguish between a process and a thread; both are simply "tasks" created by the clone() system call. The flags passed to clone() determine how much of the parent's context (memory, file descriptors, etc.) is shared, making a "thread" just a task that shares almost everything. A minimal sketch using clone() follows this list.
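The Linux-specific sketch below is illustrative only: the flag combination and the one-megabyte stack size are my assumptions, not the exact flags NPTL uses. Sharing CLONE_VM and CLONE_FILES makes the new task behave much more like a thread than a separate process.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Entry point for the new task created by clone()
static int child_fn(void *arg) {
    printf("child task running, pid=%d\n", getpid());
    return 0;
}

int main(void) {
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    // CLONE_VM and CLONE_FILES share the address space and file descriptor
    // table with the parent; SIGCHLD lets the parent wait for the child.
    int flags = CLONE_VM | CLONE_FILES | SIGCHLD;

    // The stack grows downward on x86, so pass the top of the allocation.
    pid_t pid = clone(child_fn, stack + stack_size, flags, NULL);
    if (pid == -1) { perror("clone"); free(stack); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    printf("parent done, child task %d finished\n", (int)pid);
    return 0;
}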
For more details, see the Thread Design Considerations lecture which covers:
- Data structures used by the OS (like in Solaris) to manage user and kernel threads, including the concept of a Lightweight Process (LWP).
- The difference between "hard" and "light" process state and how it optimizes context switching.
- The mechanics of signal masks and interrupt handling on multicore systems in greater detail.
- The papers "Beyond Multiprocessing: Multithreading the SunOS Kernel" (Eykholt et al., 1992) and "Implementing Lightweight Threads" (Stein & Shah, 1992)
Thread Performance Considerations
Evaluating the performance of a multithreaded system is not just about measuring raw speed; it's about choosing the right metrics for the right goals. Different applications have different needs. A scientific computing task might prioritize minimum execution time, while a web server is likely more concerned with throughput (requests per second) and response time. As the lecture notes highlight, a design that excels in one metric (like the pipeline model for total execution time) might perform worse in another (like the boss/worker model for average response time).
This leads to fundamental architectural choices:
- Multi-Process vs. Multi-Threaded: A multi-process architecture (like the original Apache web server) provides strong isolation, as each process has its own address space. This makes it robust but memory-intensive and slower due to higher context-switching and inter-process communication (IPC) costs. A multi-threaded architecture shares memory, making it much more resource-efficient and faster to switch contexts, but requires careful synchronization to prevent data races.
- Event-Driven Model: An alternative approach to concurrency, exemplified by servers like Nginx or frameworks like Node.js. An event-driven server uses a single thread and an event loop to handle many connections concurrently. It listens for events (like a new connection or incoming data) and reacts to them without blocking. This model can achieve very high throughput by avoiding the overhead of thread creation and context switching, but it performs poorly if any task blocks the event loop (e.g., a synchronous disk read). A minimal event-loop sketch follows this list.
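Below is a minimal sketch of the event-loop idea using poll() on standard input; a real server would watch a listening socket and many client sockets in the same array, and the 5-second timeout is arbitrary. The key point is that the single thread only reacts when an event is ready and never blocks waiting on any one connection.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Watch a single file descriptor (stdin) for readable data.
    struct pollfd fds[1] = { { .fd = STDIN_FILENO, .events = POLLIN } };
    char buf[256];

    for (;;) {
        int ready = poll(fds, 1, 5000);   // wait up to 5 seconds for any event
        if (ready == -1) { perror("poll"); return 1; }
        if (ready == 0) { printf("no events yet; loop keeps running\n"); continue; }

        if (fds[0].revents & POLLIN) {
            ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
            if (n <= 0) break;                       // EOF or error: exit the loop
            write(STDOUT_FILENO, buf, (size_t)n);    // handle the event, then return to the loop
        }
    }
    return 0;
}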
For more details, see the Thread Performance Considerations lecture which covers:
- A deep dive into a case study comparing different web server architectures: multi-process, multi-threaded, and an event-driven model (the Flash web server).
- A detailed look at experimental methodology, including how to define system comparisons, select appropriate workloads (e.g., real-world traces vs. synthetic loads), and choose the right performance metrics.
- Practical advice on designing and running experiments to produce meaningful and reproducible results.
- The paper "Flash: An Efficient and Portable Web Server" (Pai et al., 1999)
Scheduling
The OS scheduler is the component responsible for deciding which ready process or thread gets to run on the CPU next. This decision is guided by a scheduling policy, which aims to balance competing goals like fairness, responsiveness, and overall throughput. Different policies have different strengths and are suited for different workloads.
- First-Come, First-Serve (FCFS): The simplest policy. Tasks are executed in the order they arrive. It's fair in a "first-in, first-out" sense but can lead to poor performance if a long-running task blocks many shorter tasks (the "convoy effect").
- Shortest Job First (SJF): This policy executes the task with the shortest estimated execution time next. It is provably optimal for minimizing average waiting time but is impractical, as the OS can't know the future execution time of a task. (A short simulation after this list compares FCFS and SJF.)
- Priority Scheduling: Each task is assigned a priority, and the scheduler always runs the highest-priority task. This is common in systems where kernel tasks must take precedence over user tasks. However, it can lead to problems like priority inversion, where a high-priority task is forced to wait for a lower-priority task holding a necessary lock.
- Round Robin (RR): The foundational algorithm for timesharing systems. Each task is given a small time slice (or quantum) to run. If it's still running when the time slice expires, it's preempted and moved to the back of the ready queue. This ensures that all tasks make progress and the system remains responsive.
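Here is a small, self-contained simulation (the burst times are made up) that computes the average waiting time for the same five tasks under FCFS and SJF. Running it shows how one long task at the front of the FCFS queue inflates everyone else's wait, the convoy effect in miniature.

#include <stdio.h>
#include <stdlib.h>

// Hypothetical CPU burst times (ms) for five tasks, all arriving at time 0.
static int bursts[] = {24, 3, 3, 7, 12};
#define N (sizeof(bursts) / sizeof(bursts[0]))

// Average waiting time when tasks run in the given order.
static double avg_wait(const int *order) {
    int wait = 0, elapsed = 0;
    for (size_t i = 0; i < N; i++) {
        wait += elapsed;       // each task waits for everything scheduled before it
        elapsed += order[i];
    }
    return (double)wait / N;
}

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int sjf[N];
    for (size_t i = 0; i < N; i++) sjf[i] = bursts[i];
    qsort(sjf, N, sizeof(int), cmp);   // SJF: shortest bursts first

    printf("FCFS average wait: %.1f ms\n", avg_wait(bursts)); // 23.6 ms
    printf("SJF  average wait: %.1f ms\n", avg_wait(sjf));    // 9.4 ms
    return 0;
}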
Modern systems use more sophisticated schedulers. The Linux Completely Fair Scheduler (CFS), for instance, abandons fixed time slices and instead tries to give each task a fair, proportional amount of the CPU's processing time. On multi-core systems with hyperthreading, schedulers must also consider how to pair workloads. The best performance is often achieved by co-scheduling a mix of CPU-bound and memory-bound tasks, which maximizes the utilization of both the processor pipeline and the memory bus.
For more details, see the Scheduling lecture which covers:
- The mechanics of timeslice length and how it impacts CPU-bound vs. I/O-bound tasks.
- The evolution of Linux schedulers, from the O(1) scheduler to the Completely Fair Scheduler (CFS).
- How hardware counters can be used to identify memory-bound vs. CPU-bound workloads to make smarter scheduling decisions on hyperthreaded systems.
- The paper "Chip Multithreading Systems Need a New Operating System Scheduler" (Fedorova et al., 2004)
Memory, Persistence, & I/O
Memory Management
Memory management is the OS's strategy for arbitrating and controlling the finite physical memory (DRAM) on behalf of all running processes. The central abstraction is virtual memory, which gives each process the illusion of having its own large, contiguous, and private address space. In reality, the OS, with crucial help from the hardware's Memory Management Unit (MMU), translates these virtual addresses to physical addresses in DRAM. To speed up the constant address translations, the MMU uses a fast, specialized hardware cache called a Translation Lookaside Buffer (TLB), which stores recently used mappings. A "TLB miss" forces a much slower lookup in the page table.

The dominant mechanism for this is paging. The virtual address space is divided into fixed-size blocks called pages, and physical memory is divided into identically sized page frames. For each process, the OS maintains a page table that maps virtual page numbers (VPNs) to physical frame numbers (PFNs). Each entry in this table, a Page Table Entry (PTE), also contains important metadata. This metadata varies between architectures (x86, ARM, etc.) and often includes a present bit (is the page in DRAM?), a dirty bit (has the page been modified?), an access bit (has the page been referenced?), and protection bits (read/write/execute permissions).
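A short sketch of the translation itself, assuming 4 KiB pages and a made-up virtual address and frame number: the MMU splits the virtual address into a virtual page number and an offset, the page table supplies the frame, and the two are recombined.

#include <inttypes.h>
#include <stdio.h>

// Assumes 4 KiB pages: the low 12 bits are the offset, the rest is the VPN.
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

int main(void) {
    uint64_t vaddr = 0x00007f3a2b4c5d6eULL;        // made-up virtual address

    uint64_t vpn    = vaddr >> PAGE_SHIFT;         // index used to walk the page table
    uint64_t offset = vaddr & (PAGE_SIZE - 1);     // byte position within the page

    // Pretend the page table maps this VPN to physical frame 0x1a2b3.
    uint64_t pfn   = 0x1a2b3;
    uint64_t paddr = (pfn << PAGE_SHIFT) | offset;

    printf("vaddr  = 0x%" PRIx64 "\n", vaddr);
    printf("VPN    = 0x%" PRIx64 ", offset = 0x%" PRIx64 "\n", vpn, offset);
    printf("paddr  = 0x%" PRIx64 "\n", paddr);
    return 0;
}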
Demand Paging allows pages to be loaded from disk only when a process first accesses them (a "page fault"), enabling programs to run even if they don't fit entirely in memory. When memory is full, page replacement algorithms (like approximations of Least Recently Used) use the "access" and "dirty" bits to decide which pages to evict to disk.
Since each process has its own virtual address space with its own page tables, a context switch also requires the OS to switch from the current process's page table to the next process's. Copy-on-Write (COW) is an important optimization for process creation (fork()), where the parent's pages are shared with the child instead of being copied. A real copy is only made if and when one of the processes attempts to write to a shared page, as in the small sketch below.
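A tiny fork() sketch of that isolation: after the fork, parent and child initially share the same physical pages, and the child's write to value triggers a private copy, so the parent's view is unchanged.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int value = 100;

    pid_t pid = fork();   // child starts with copy-on-write references to the parent's pages
    if (pid == -1) { perror("fork"); return 1; }

    if (pid == 0) {
        value = 200;      // first write: the kernel copies this page just for the child
        printf("child : value = %d\n", value);
        return 0;
    }

    waitpid(pid, NULL, 0);
    printf("parent: value = %d (unaffected by the child's write)\n", value);
    return 0;
}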
Since a page table for a large address space can itself be enormous, modern systems use multi-level page tables to save space. Each entry in an outer level points to a page table at the next level, so tables for unused regions of the address space never need to be allocated. The same optimization of using a TLB to cache these entries still applies.

Beyond managing memory for user processes, the OS kernel must also efficiently allocate memory for its own internal data structures. This presents unique challenges, primarily fragmentation. External fragmentation occurs when free memory is broken into many small, non-contiguous blocks, eventually making it impossible to satisfy a large allocation request even though enough total memory is free. Internal fragmentation occurs when the allocator provides a block of memory that is larger than the request, wasting the space left over inside the block.
For more details, see the Memory Management lecture which covers:
- Different memory allocation strategies inside the Linux kernel, like the Buddy Allocator and Slab Allocator.
- The trade-offs of different page sizes.
- Alternative memory management designs like segmentation.
- Using memory management hardware for other services like checkpointing and process migration.
Inter-Process Communication (IPC)
Inter-Process Communication (IPC) provides a set of mechanisms for separate processes, each with its own private address space, to communicate and coordinate with each other. The OS provides several ways to achieve this, which generally fall into two categories: message-based and memory-based communication.
Message-based IPC is an abstraction where the OS manages a communication channel between processes. Processes send messages into the channel, and the OS handles delivering them to the receiving process. This category includes familiar tools like pipes (simple byte streams between two processes, common in shell command chaining) and sockets (a more general-purpose interface that can work between processes on the same machine or across a network). The main advantage of this approach is simplicity; the OS handles the underlying details and synchronization. The disadvantage is performance overhead, as every send and receive operation requires a system call and a data copy between the user process and the kernel.
Shared memory, the primary form of memory-based IPC, offers a much higher-performance alternative. Here, the OS maps a region of physical memory into the virtual address spaces of two or more processes. Once this mapping is established, the processes can read and write to this shared area as if it were their own memory, with no further kernel involvement. This avoids the system call and data copy overhead on every communication, making it ideal for transferring large amounts of data. However, this performance comes at the cost of complexity. The processes themselves are now responsible for synchronizing access to the shared memory using constructs like mutexes or semaphores to prevent race conditions.
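Here is the writer side of a minimal POSIX shared-memory sketch; the segment name /gios_demo_shm and the 4 KiB size are arbitrary assumptions, and on older glibc versions the program may need to be linked with -lrt. A reader process would shm_open the same name, mmap it, and see the same bytes, and in a real program the two processes would still need a semaphore or mutex to coordinate access.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/gios_demo_shm";   // arbitrary segment name
    const size_t size = 4096;

    // Create the shared memory object and give it a size.
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) == -1) { perror("ftruncate"); return 1; }

    // Map the segment into this process's address space.
    char *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    // After the mapping is established, reads and writes need no system calls.
    strcpy(region, "hello from the writer process");
    printf("wrote: %s\n", region);

    munmap(region, size);
    close(fd);
    // shm_unlink(name) would remove the segment once all processes are done with it.
    return 0;
}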
For more details, see the Inter-Process Communication lecture which covers:
- Specific APIs for message queues and shared memory, including SysV and POSIX standards.
- How to use PThread synchronization primitives (mutexes, condition variables) for inter-process synchronization.
- Command-line tools for inspecting IPC facilities on a system.
- Design considerations for using shared memory, such as segment size and management.
Synchronization Constructs
While mutexes and condition variables are the fundamental building blocks for synchronization, they can be low-level and error-prone. To address this, operating systems and libraries provide higher-level and more specialized synchronization constructs. The correctness and performance of all these constructs, however, depend entirely on low-level atomic instructions provided by the hardware (e.g., test-and-set, compare-and-swap). These instructions guarantee that a sequence of operations, like reading a value, checking it, and writing a new value, happens indivisibly.
On modern multi-core systems, the performance of these atomic operations is deeply tied to cache coherence. Each CPU core has its own private cache, which can lead to situations where one core's view of a memory location is different from another's. Cache coherence protocols (like the common write-invalidate protocol) solve this by forcing other cores to discard their local copies of a memory location whenever one core writes to it. This is crucial for correctness, but it means that when one thread modifies a shared lock, it can trigger a cascade of cache misses for all other spinning threads, significantly impacting performance.
Building on this hardware foundation, we get more expressive tools:
- Spinlocks: The most basic type of lock. Unlike a mutex, which puts a waiting thread to sleep, a thread waiting on a spinlock will "spin" in a tight loop, repeatedly checking if the lock is free. This is highly efficient on multi-processor systems if the critical section is very short, as it avoids the high cost of a context switch. However, it's extremely wasteful if the lock is held for a long time or on a single-core system. (A minimal spinlock sketch follows this list.)
- Semaphores: A more general synchronization tool. A semaphore is essentially a counter that controls access to a resource. Threads "wait" on (decrement) the semaphore to acquire it and "signal" (increment) it to release. A semaphore initialized with a value of 1 behaves like a mutex. When initialized with a larger value N, it can be used to grant access to a pool of N identical resources.
- Reader/Writer Locks: A specialized lock designed for scenarios where a resource is read frequently but written to infrequently. It allows any number of "reader" threads to access the resource concurrently, but ensures that a "writer" thread has exclusive access. This provides much better performance than a simple mutex for read-dominant workloads.
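As a sketch of how these constructs sit on top of hardware atomics, here is a minimal spinlock built from C11's atomic_flag (which compiles down to a test-and-set style instruction). The example is mine, not from the course, and it is only reasonable because the critical section, a single increment, is extremely short.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void spin_lock(void) {
    // Atomically set the flag and return its old value; spin while it was already set.
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   // busy-wait instead of sleeping
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;          // the critical section: one shared increment
        spin_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}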
For more details, see the Synchronization Constructs lecture which covers:
- The impact of cache coherence on spinlock performance and the design of more advanced spinlocks (e.g., Test-and-Test-and-Set, Queueing Locks).
- Higher-level constructs like Monitors, which bundle data with the synchronization logic needed to access it.
- The specific POSIX API for semaphores.
- The paper "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors" (Anderson, 1990)
Disk and File Systems
The OS manages I/O devices through device drivers, which abstract away the specific hardware details. For efficiency, modern systems use Direct Memory Access (DMA), allowing devices to transfer data directly to and from main memory without involving the CPU for every byte. The CPU initiates the transfer, but the device and DMA controller handle the rest, freeing the CPU to do other work.
To manage persistent storage, the OS provides a file system abstraction. A key component in Linux and other Unix-like systems is the Virtual File System (VFS), an interface that allows the system to support many different underlying file systems (like ext4, XFS, NFS) in a uniform way. Applications interact with the VFS using standard system calls (open, read, write), and the VFS directs these calls to the correct filesystem driver.
The VFS operates on a few key abstractions. Every file and directory is represented by an inode (index node), a data structure that stores the file's metadata (permissions, owner, size) and, crucially, points to the actual data blocks on the disk. Because files don't need to be stored contiguously, the inode acts as an index for all of a file's blocks. To support files larger than what can be referenced by direct pointers in the inode, filesystems like ext2 use indirect pointers, where an inode points to a block of more pointers, adding levels of indirection to address vast amounts of data.
To make disk access efficient, filesystems rely on several optimizations. A buffer cache in main memory holds recently used disk blocks to avoid slow physical disk reads. I/O schedulers reorder disk requests to minimize the physical movement of the disk head, turning many small random writes into larger sequential ones. Finally, journaling provides reliability by writing changes to a log before committing them to their final location on disk, ensuring the filesystem can be quickly restored to a consistent state after a crash.
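To see inode metadata from user space, here is a small sketch using stat(); the default path is just an assumption, and any file can be passed as an argument. The printed fields (inode number, size, permissions, link count) come straight from the inode the VFS looks up for the path.

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    const char *path = (argc > 1) ? argv[1] : "/etc/hostname";  // any existing file
    struct stat st;

    // stat() asks the VFS for the metadata stored in the file's inode.
    if (stat(path, &st) == -1) { perror("stat"); return 1; }

    printf("path         : %s\n", path);
    printf("inode number : %llu\n", (unsigned long long)st.st_ino);
    printf("size (bytes) : %lld\n", (long long)st.st_size);
    printf("permissions  : %o\n", st.st_mode & 0777);
    printf("hard links   : %lu\n", (unsigned long)st.st_nlink);
    return 0;
}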
For more details, see the I/O Management lecture which goes deeper into:
- Different types of I/O devices (block, character, network) and how they are represented in the OS.
- The details of the block device stack in Linux.
- The role of the dentry cache in speeding up path lookups.
- The specific on-disk layout of the ext2 filesystem, including block groups and bitmaps.
System Virtualization
System virtualization allows multiple, isolated operating systems, known as Virtual Machines (VMs) or guests, to run concurrently on a single physical machine. This is orchestrated by a Virtual Machine Monitor (VMM), or hypervisor, which manages and allocates the underlying hardware resources. The VMM's goal is to provide each VM with an environment that is a faithful, performant, and isolated duplicate of the real hardware.
There are two primary models for this:
- Type 1 (Bare-Metal): The hypervisor runs directly on the physical hardware, and the guest VMs run on top of it. This model is common in datacenter environments, with examples like VMware ESX and Xen. Often, a privileged "service VM" (like Xen's dom0) is used to manage device drivers.
- Type 2 (Hosted): The hypervisor is an application that runs on top of a conventional host OS (e.g., Linux, Windows). Examples include KVM, VirtualBox, and VMware Workstation. This model leverages the host OS for device support and management.

The core challenge of virtualization is that the guest OS expects to have full, privileged control over the hardware, but the hypervisor must remain in charge. Historically, this was solved with clever software techniques. For CPU virtualization, where some privileged instructions on older x86 chips wouldn't trap to the hypervisor, solutions included binary translation (dynamically rewriting problematic guest code) and paravirtualization (modifying the guest OS to make explicit calls, "hypercalls", to the VMM). For memory, the hypervisor creates shadow page tables to map the guest's virtual addresses directly to machine addresses. For I/O, the split device driver model is common, with a lightweight "front-end" driver in the guest communicating with a full "back-end" driver in the host or service VM.
Today, most of these complex software techniques have been supplanted by direct hardware support (e.g., Intel VT-x and AMD-V). Modern CPUs have features that allow a guest OS to run in a non-privileged mode where sensitive instructions automatically trap to the hypervisor. Hardware also provides support for memory virtualization (e.g., Extended Page Tables) and I/O virtualization, making the whole process much simpler and more efficient.
For more details, see the Virtualization lecture which digs into:
- The benefits of virtualization, such as server consolidation and easier migration.
- The different models for virtualizing I/O devices, including passthrough, hypervisor-direct, and the split device driver model.
- A deeper look at the hardware features that enable efficient virtualization on modern platforms.
- The papers "Formal Requirements for Virtualizable Third Generation Architectures" (Popek & Goldberg, 1974) and "Virtual Machine Monitors: Current Technology and Future Trends" (Rosenblum & Garfinkel, 2005)
Distributed Systems
Remote Procedure Calls (RPC)
Remote Procedure Calls (RPC) are a powerful abstraction that makes a function call on a remote machine look and feel like a local function call. The goal is to hide the underlying complexities of network communication, like socket programming, data serialization, and error handling, from the application developer.
The magic of RPC is implemented through several key components. First, the client and server agree on a contract defined in an Interface Definition Language (IDL). The IDL specifies the procedures that can be called, their parameters, and their return types, independent of any programming language. An RPC compiler then uses this IDL file to automatically generate stubs for both the client and the server. When the client calls a remote procedure, it's actually calling the local client stub. This stub's job is to take the arguments, package them into a standardized, machine-independent format (marshalling), and send them across the network to the server.
On the server side, the server stub receives the message, unpacks it (unmarshalling), and makes a regular local call to the actual procedure implementation. The return value is then marshalled and sent back to the client stub, which unmarshals it and returns it to the calling client application. Before any of this can happen, the client must first find the server, a process called binding, which is typically handled by a central registry or "portmapper" service. A major challenge in any RPC system is handling partial failures: what happens if the server crashes or the network fails? Modern RPC frameworks must provide mechanisms to handle these inevitable distributed systems problems.
For more details, see the Remote Procedure Calls lecture which also covers:
- A concrete example using SunRPC (now ONC RPC) and its IDL, XDR (External Data Representation).
- How RPC systems handle complex data types like pointers.
- The specific encoding rules used by XDR to ensure cross-platform compatibility.
- Other RPC systems like Java RMI.
- The paper "Implementing Remote Procedure Calls" (Birrell & Nelson, 1984)
Distributed File Systems
A Distributed File System (DFS) abstracts away network storage to make remote files appear as if they were on the local machine. This is a powerful concept, but it introduces significant design challenges, primarily centered around performance, fault tolerance, and consistency. A fundamental choice is how to organize the servers: files can be partitioned across many servers to improve scalability, or replicated across servers to improve fault tolerance and availability. Most large systems use a hybrid approach.
The core tension in any DFS is between performance and consistency. A simple model where the client downloads a file, modifies it, and uploads it back is efficient for local work but terrible for network traffic and concurrent access. The opposite, where every single read or write is a remote operation, is great for consistency but suffers from high latency and poor server scalability. The practical compromise is client-side caching, where clients cache file blocks locally. This improves performance dramatically but creates a new problem: how to ensure a client's cache is up-to-date if another client modifies the file on the server.
This leads to another critical design choice: is the server stateless or stateful? A stateless server keeps no information about which clients are accessing which files. This makes crash recovery trivial (the server just restarts) but it limits performance and makes consistency management difficult. A stateful server keeps track of clients and their open files, which enables high-performance caching strategies and features like file locking. However, if a stateful server crashes, recovering its state without losing data or breaking consistency is much more complex. Different file sharing semantics, from the strict "UNIX semantics" to the looser "session semantics," represent different trade-offs in this design space.
For more details, see the Distributed File Systems lecture which also covers:
- Network File System (NFS), a classic example of a DFS that evolved from a stateless protocol (v3) to a stateful one (v4) to better support caching, locking, and performance.
- Sprite DFS, an influential research system whose design was based on an analysis of real-world file access patterns, which revealed that most files are short-lived, leading to a caching system that optimized for delayed writes.
- The different file sharing semantics (UNIX, session, etc.) in more detail.
- The paper "Caching in the Sprite Network File System" (Nelson et al., 1988)
Distributed Shared Memory (DSM) & Consistency
Distributed Shared Memory (DSM) is an abstraction that allows processes running on different machines to share a common memory address space, giving them the illusion of running on a single, large shared-memory machine. This is a powerful technique for scaling applications beyond the memory capacity of a single node. DSM can be implemented directly in expensive, specialized hardware, but it is more commonly realized in software at the OS or programming language level.
The central challenge in any DSM system is maintaining consistency. When one node writes to a shared memory location, how and when do other nodes see that change? This is defined by a consistency model, which is a contract between the system and the programmer about the ordering and visibility of memory operations. Different models offer different trade-offs between performance and programming simplicity.
The spectrum of consistency models includes:
- Strict Consistency: The ideal model where any write is instantaneously visible to all other processes. This is impossible to implement in a real distributed system due to network latency.
- Sequential Consistency: A more relaxed model where all processes see the same global ordering (interleaving) of memory operations, though this order might not reflect the real-time execution. Operations from a single process always appear in the order they were issued. This is easier to reason about but can be slow, as it requires global coordination.
- Causal Consistency: This model only enforces ordering for writes that are causally related (i.e., a write that depends on a previous read). Concurrent, unrelated writes can be seen in different orders by different processes, which allows for better performance.
- Weak Consistency: This model gives the most control (and responsibility) to the programmer. It introduces explicit synchronization points (acquire/release operations). Memory updates are only guaranteed to be visible to other processes after an explicit synchronization call, allowing the system to batch updates and reduce network traffic, significantly improving performance. (The sketch after this list shows the same acquire/release idea on a single machine.)
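The acquire/release contract shows up on a single machine with C11 atomics; the sketch below (the names are mine) publishes a payload with a release store, and the consumer's acquire load guarantees the payload is visible once the flag is seen. DSM systems apply the same contract across the network, batching updates until the synchronization point.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload = 0;              // ordinary shared data
static atomic_int ready = 0;         // synchronization flag

static void *producer(void *arg) {
    payload = 42;                                            // plain write
    atomic_store_explicit(&ready, 1, memory_order_release);  // publish: writes above become visible
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   // spin until the flag has been published
    printf("payload = %d\n", payload);   // guaranteed to print 42
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}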
For more details, see the Distributed Shared Memory lecture which also covers:
- The trade-offs of different sharing granularities (page-based vs. object-based).
- DSM design choices like data migration vs. replication.
- How software DSM systems leverage the Memory Management Unit (MMU) and page faults to trap on remote or protected memory accesses.
- The paper "Distributed Shared Memory: Concepts and Systems" (Protic et al., 1996)
Datacenter Technologies
The final topic of the course brings everything together by looking at the technologies that power modern datacenters and cloud computing. Large-scale internet services are typically designed in one of two ways: a homogeneous architecture, where any server can handle any request, which is simple to load-balance but has poor data locality; or a heterogeneous architecture, where servers are specialized for specific tasks, which improves performance through caching and locality but is more complex to manage and scale.
Cloud Computing emerged as a powerful solution to the challenges of building and scaling these services. It abstracts away physical hardware, offering on-demand, elastic resources as a utility. This model is built on two key principles: the law of large numbers, which allows providers to serve many customers with variable needs using a fixed set of resources, and economies of scale, which dramatically lowers the cost of hardware and operations.
The services offered by cloud providers are typically categorized into different models, defining the trade-off between user control and provider management:
- Infrastructure as a Service (IaaS): Provides the fundamental building blocksâvirtual machines, storage, and networking. The user manages the OS, applications, and data. (e.g., Amazon EC2, Google Compute Engine).
- Platform as a Service (PaaS): Provides a platform for developers to build and deploy applications without worrying about the underlying infrastructure. The provider manages the OS, runtime, and middleware. (e.g., Vercel, Google App Engine).
- Software as a Service (SaaS): Delivers a complete software application over the internet. The user simply consumes the service. (e.g., Gmail, Salesforce).
These cloud services are made possible by many of the technologies discussed throughout the course. Virtualization provides fungible, isolated compute resources. Large-scale cluster schedulers (like Kubernetes, Mesos, or YARN) manage resource allocation across thousands of machines. Distributed file systems and NoSQL databases provide scalable storage, and the entire stack is built on the principles of fault tolerance and redundancy to handle the inevitable failures in a large, complex system.
For more details, see the Datacenter Technologies lecture which also covers:
- Cloud deployment models: Public, Private, and Hybrid clouds.
- The "poster child" case study of Animoto's massive scaling on AWS.
- An overview of big data processing stacks like Hadoop and Spark.
Course Overview
The projects were a great way to solidify the concepts learned in the course and took up a significant portion of the time I spent on it. The average across all of the student reviews on OMSCS Central was 18.5 hours per week. I averaged a bit more than that, particularly on the first project, because of my lack of C and C++ experience. There were a few weeks, while I was hammering out the projects, when I spent closer to 40 hours. Once the projects were done, I was able to focus on the lectures and readings, which took less time.
I'm writing follow-up posts for each project to explore them in more detail. Here is the first one:
- Project 1: Concurrent File Server
I'll update this list as I write about the other projects.
Grading
Here's the grading breakdown from when I took the course in Spring 2024:
Component | Weight | Description |
---|---|---|
Participation | 5% | Views and posts in the discussion forum |
Project 1 | 15% | Multi-threaded Web Server |
Project 3 | 15% | Cache & Proxy Servers with Shared Memory IPC |
Project 4 | 15% | GRPC and Distributed Systems |
Midterm Exam | 25% | Everything up to Memory, Persistence, & I/O |
Final Exam | 25% | Everything after that |
The course was graded on a slight curve. A ton of folks withdrew (38.1%), so the grade distribution for the ones that stayed was pretty good. It was as follows:
Grade | Percentage |
---|---|
A | 39.1% |
B | 16.5% |
C | 3.6% |
D | 1.2% |
F | 1.4% |
Withdrew | 38.1% |
Books
These are the books I used to complement the lectures and help me with the projects.

Operating Systems: Three Easy Pieces
Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau
This is the main book I used to supplement the lectures. It's filled with easy-to-understand explanations of complex topics and covers most of the material in the course. I highly recommend it.

Computer Systems: A Programmer's Perspective
Randal E. Bryant and David R. O'Hallaron
A classic for a reason. It provides a great bridge between hardware and software. Chapters 1, 6, and 8-12 were especially helpful for this course.

The C Programming Language
Brian W. Kernighan and Dennis M. Ritchie
The definitive book on C, written by its creators. I read it cover to cover before the course started to brush up on my C knowledge.

Beej's Guides to C, Network Programming, and IPC
Brian 'Beej' Hall
An invaluable, practical, and free resource for C programming, socket programming, and inter-process communication. These were immensely helpful for the projects.

The Linux Programming Interface
Michael Kerrisk
This is structured as an encyclopedic guide to the Linux and UNIX system programming interface. It was a great resource for the projects.
Other Resources
Here are some other posts I found helpful:
My Experience
GIOS was the first course I took for the program and it was a great way to get my feet wet. The projects were difficult but meaningful, and the concepts challenged me to think at a systems level.
I hope this post is helpful for others considering GIOS, and I welcome feedback or corrections, especially if you catch something I got wrong. I'm still learning, and learning in public is part of the process.
Thanks for reading!