Graduate Introduction to Operating Systems
Introduction
This article is part personal reflection, part technical summary. I recently completed Georgia Tech's CS 6200: Graduate Introduction to Operating Systems (GIOS), and I wanted to take the time to solidify what I've learned, both for myself, in the spirit of the Feynman technique, and for others who might be curious about the course or about me. This includes future OMSCS students, hiring managers, and peers who want a better sense of what this course covers and what I now know (and don't know yet).
There's always a tension when writing something like this. I want to be honest about what I've learned and the depth of my understanding, while acknowledging that I'm still learning. This course was challenging and rewarding, and this write-up is my way of digesting the material and sharing it in a format that feels productive.
Operating Systems Overview
At a high level, an operating system (OS) "is a special piece of software that abstracts and arbitrates the use of a computer system." It sits between user applications and hardware, providing the illusion of a cohesive, manageable system despite the complexity and diversity of hardware components.

This layered architecture separates user space, where applications run, from kernel space, where the OS performs core functions like managing hardware and system resources. Applications interact with the OS through standard libraries and system calls, enabling functionality like file operations or memory allocation without direct hardware access. The OS, in turn, controls and coordinates the hardware layer, which includes the CPU, memory, storage, displays, network devices, and input peripherals.
The OS fulfills several critical roles:
- Controls access to critical resources like the CPU, memory, and I/O devices. Without this, applications could interfere with each other or with the hardware directly.
- Provides abstractions such as files, processes, and virtual memory. These simplify development and improve portability by hiding hardware-specific details behind a consistent interface exposed via system calls.
- Enforces policies to manage limited resources: deciding which processes run when, how memory is allocated, and which users or programs can access specific files or devices. These policies balance competing goals like fairness, responsiveness, efficiency, and security.
- Virtualizes everything but time, enabling multiple processes (and potentially multiple users) to share the same hardware as if each had their own isolated system. For instance, CPU virtualization creates the illusion of multiple CPUs; memory virtualization ensures processes operate within isolated memory spaces.
In short, the OS is the software layer that turns raw, unreliable hardware into a stable, usable, and secure programming environment.
For more details, see the Introduction to Operating Systems lecture which covers:
- Separation of Mechanism and Policy: This principle emphasizes designing flexible mechanisms that can support various policies, allowing the OS to adapt to different scenarios.
- Operating system structures such as monolithic, microkernel, and hybrid kernels, and the practical design choices each architecture reflects.
- Examples of major operating systems (Windows, Linux, macOS, Android, and iOS) that illustrate the diversity of OS implementations.
CPU Virtualization
Processes, Threads, and Context Switching
One of the central responsibilities of the operating system is to virtualize the CPU, creating the illusion that each process has its own dedicated processor. This abstraction is achieved through process management and scheduling, enabling multitasking and isolation across programs.
A process is an instance of a running program, and the OS is responsible for managing its entire lifecycle: from creation and execution to suspension and termination. Each process is given its own private address space and a set of resources, such as open file descriptors and execution context, ensuring that it operates independently of other processes. The Process Control Block (PCB) is the data structure the OS uses to manage the process's state and metadata.
- A process is an instance of a program in execution.
- The OS manages the lifecycle of processes: creation, execution, suspension, and termination.
- Each process has its own memory space, file descriptors, and state.

To enable more granular concurrency within a process, modern systems use threads. A thread is the smallest unit of execution and shares the same address space with other threads in the same process. This shared memory model allows for lightweight context switches and efficient communication but also introduces complexity: threads must coordinate access to shared resources to avoid data races and inconsistencies.
- A thread is a unit of execution within a process.
- Multiple threads in a single process share the same memory space.
- Threads introduce concurrency and require careful synchronization.

A context switch is the process of switching the CPU from one process or thread to another. It involves saving the current state of the running process to its PCB and then loading the state of the next process to be executed from its PCB. This allows multiple processes to share a single CPU, enabling multitasking. The operating system code (in kernel mode) manages the context switch by interrupting the current process and switching to the next process.
To execute a system call, like read or write, the current process must trap into kernel mode, a privileged mode in which the operating system can execute instructions and access resources that user processes cannot. A system call is the mechanism by which a user process requests a service from the operating system. A minimal example follows the bullets below.
- A system call is how a user process requests a service from the operating system.
- Kernel mode is a privileged mode in which the OS can execute code and access hardware that user processes cannot.
- User mode is the unprivileged mode in which application code runs; privileged operations require a trap into the kernel.
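As a concrete sketch (the file path is just an assumption; any readable file works), here is user code invoking system calls. Each call to open, read, and write traps into kernel mode, the kernel does the privileged work, and control then returns to user mode.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    char buf[128];

    // open() traps into the kernel, which checks permissions and returns a file descriptor
    int fd = open("/etc/hostname", O_RDONLY);
    if (fd == -1) { perror("open"); return 1; }

    // read() asks the kernel to copy up to 127 bytes into our buffer
    ssize_t n = read(fd, buf, sizeof(buf) - 1);
    if (n == -1) { perror("read"); return 1; }

    // write() asks the kernel to send those bytes to stdout (file descriptor 1)
    write(STDOUT_FILENO, buf, (size_t)n);
    close(fd);
    return 0;
}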
For more details, see the Processes and Process Management lecture, which covers:
- The memory layout of a process including the stack, heap, and data segments.
- The process address space and memory management.
- What's in the process control block (PCB) and how it's used to manage the process's state and metadata.
- The execution states of a process and the life cycle as it transitions between them (new, ready, running, waiting, terminated).
- The paper "An Introduction to Programming with Threads" (Birrell, 1989)
Concurrency and Synchronization
The ability to run multiple threads concurrently unlocks significant performance gains. Even on a single CPU, multithreading improves responsiveness by allowing a program to perform other work while one thread is blocked on a long-running operation like disk I/O. On multi-core systems, it enables true parallelism, where multiple threads execute simultaneously. However, this concurrency introduces challenges, chiefly data races, where multiple threads access shared data and at least one is writing, leading to unpredictable outcomes.
To prevent such issues, we use synchronization constructs:
- Mutexes (Mutual Exclusion) are the most basic synchronization primitive. They act as a lock, ensuring that only one thread can access a critical section of code at a time. A thread must acquire the mutex before entering the critical section and release it upon exiting.
- Condition Variables allow threads to wait for a specific condition to become true. They are used with a mutex to block a thread until another thread signals that the condition is met, avoiding inefficient busy-waiting.
These constructs are critical to avoid common pitfalls like deadlocks, where two or more threads are stuck waiting for each other to release a resource, and spurious wake-ups, where a waiting thread is awakened without the condition it was waiting for being met.
For more details, see the Threads and Concurrency lecture which covers:
- Synchronization primitives like the Readers/Writer lock.
- Common pitfalls like spurious wake-ups and deadlocks in more detail.
- Kernel vs. User-Level Threads and different multithreading models (One-to-One, Many-to-One, Many-to-Many).
- Multithreading patterns like the Boss/Workers Pattern, Pipeline Pattern, and Layered Pattern are covered in the Thread Design & Performance Considerations section.
These mechanisms enable the operating system to support concurrent applications running on shared hardware, while maintaining isolation, fairness, and control.
PThreads
POSIX Threads (PThreads) are the standard C API for creating and synchronizing threads on Linux and other Unix-like operating systems such as macOS. The following producer-consumer example demonstrates thread creation, mutexes, and condition variables in a single, concise program. I included it here to illustrate how little code is needed to work with the abstractions of threads and synchronization.
Feel free to skip the code block below and move on to the next section.
#include <stdio.h>
#include <pthread.h>
#include <unistd.h>

// Shared resources protected by a mutex
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
int message_ready = 0;

// Consumer thread waits for a condition
void* consumer(void* arg) {
    pthread_mutex_lock(&lock);
    // Use a while loop to protect against spurious wakeups
    while (message_ready == 0) {
        printf("Consumer: Waiting for message...\n");
        // Unlocks the mutex and waits; re-acquires lock upon wakeup
        pthread_cond_wait(&cond, &lock);
    }
    printf("Consumer: Message received!\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

// Producer thread signals the condition
void* producer(void* arg) {
    // sleep(1) helps ensure the consumer runs first and waits
    sleep(1);
    pthread_mutex_lock(&lock);
    printf("Producer: Preparing message...\n");
    message_ready = 1;
    // Wakes up one waiting thread
    pthread_cond_signal(&cond);
    printf("Producer: Message sent!\n");
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main() {
    pthread_t producer_thread, consumer_thread;

    // Create the two threads
    pthread_create(&consumer_thread, NULL, consumer, NULL);
    pthread_create(&producer_thread, NULL, producer, NULL);

    // Wait for both threads to complete
    pthread_join(producer_thread, NULL);
    pthread_join(consumer_thread, NULL);

    printf("Producer and consumer have finished.\n");
    return 0;
}
This C code demonstrates a basic producer-consumer pattern using POSIX Threads (PThreads). The producer thread prepares a "message" and signals the consumer thread when it's ready. The consumer thread waits for this signal before processing the message, ensuring proper synchronization between the two threads using a mutex and a condition variable.
A classic real-world scenario for the producer-consumer pattern is a message queue or a task processing system. Imagine a web server (the producer) that receives many user requests. Instead of processing each request immediately, which could overwhelm the server if there's a sudden surge, it adds these requests as "messages" to a queue. A separate pool of worker processes or threads (the consumers) then pick up these messages from the queue one by one, process them, and perform the necessary tasks (e.g., database operations, sending emails, generating reports). This allows the server to handle incoming requests efficiently by decoupling the request reception from its actual processing, ensuring smooth operation even under heavy load.
For more details, see the Threads Case Study - PThreads lecture which covers:
- Thread attributes (pthread_attr_t) for customizing thread behavior (e.g., stack size, detached state).
- Safety tips for using mutexes and condition variables to avoid common errors.
- A complete, classic Producer-Consumer problem implementation using PThreads.
Thread Design Considerations
This section is a bit more abstract. It delves into the tradeoffs that operating system designers make when designing their threading models.
Beyond choosing a concurrency pattern, building a robust multithreaded system requires careful consideration of how threads are managed by the operating system and how they interact with system events. While most developers are unlikely to design their own threading models, understanding these tradeoffs reveals how threads work under the hood.
- User-level vs. Kernel-level Threads: This is a core design choice. User-level threads are managed by a library within a process, making them fast to create and switch. However, the OS kernel is unaware of them; if one user thread blocks on I/O, the entire process is blocked. Kernel-level threads are managed by the OS, which can schedule them across multiple CPU cores and handle blocking independently. This comes at the cost of slower creation and context switching due to the need for system calls.
- Threading Models: To balance these trade-offs, systems use different models. The one-to-one model (e.g., modern Linux, Windows) maps each user thread to a kernel thread, offering good performance and simplicity. The many-to-one model maps many user threads to a single kernel thread, which is efficient but suffers from the blocking problem. The many-to-many model offers a hybrid approach, but adds significant complexity.
- Handling Interrupts and Signals: Figure 4 illustrates the interrupt handling cycle. Interrupts are asynchronous signals sent to the CPU from software, I/O devices, or internal processor exceptions. When an interrupt is issued, the CPU halts the currently running code, saves its state, and executes a special function called an interrupt handler to address the event. Once the handler finishes, the CPU restores the original state and resumes execution. This mechanism is fundamental to multitasking and allows the OS to respond to events efficiently. In a multithreaded environment, handling these asynchronous events is particularly complex. A classic problem arises when a signal handler, which runs in the context of the interrupted thread, tries to acquire a mutex that the thread already holds, causing an instant deadlock. A common, simple solution is to disable signals around critical sections. More advanced OS designs may treat the main work of an interrupt handler as a separate, lower-priority thread (a "bottom-half" handler) to avoid such issues.
- Threading in Practice: Linux: Modern Linux provides a concrete example of these designs. It uses a one-to-one threading model called the Native POSIX Thread Library (NPTL). Internally, Linux doesn't strongly distinguish between a process and a thread; both are simply "tasks" created by the clone() system call. The flags passed to clone() determine how much of the parent's context (memory, file descriptors, etc.) is shared, making a "thread" just a task that shares almost everything. A minimal sketch using clone() follows this list.
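The Linux-specific sketch below is illustrative only: the flag combination and the one-megabyte stack size are my assumptions, not the exact flags NPTL uses. Sharing CLONE_VM and CLONE_FILES makes the new task behave much more like a thread than a separate process.

#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

// Entry point for the new task created by clone()
static int child_fn(void *arg) {
    printf("child task running, pid=%d\n", getpid());
    return 0;
}

int main(void) {
    const size_t stack_size = 1024 * 1024;
    char *stack = malloc(stack_size);
    if (!stack) { perror("malloc"); return 1; }

    // CLONE_VM and CLONE_FILES share the address space and file descriptor
    // table with the parent; SIGCHLD lets the parent wait for the child.
    int flags = CLONE_VM | CLONE_FILES | SIGCHLD;

    // The stack grows downward on x86, so pass the top of the allocation.
    pid_t pid = clone(child_fn, stack + stack_size, flags, NULL);
    if (pid == -1) { perror("clone"); free(stack); return 1; }

    waitpid(pid, NULL, 0);
    free(stack);
    printf("parent done, child task %d finished\n", (int)pid);
    return 0;
}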
For more details, see the Thread Design Considerations lecture which covers:
- Data structures used by the OS (like in Solaris) to manage user and kernel threads, including the concept of a Lightweight Process (LWP).
- The difference between "hard" and "light" process state and how it optimizes context switching.
- The mechanics of signal masks and interrupt handling on multicore systems in greater detail.
- The papers "Beyond Multiprocessing: Multithreading the SunOS Kernel" (Eykholt et al., 1992) and "Implementing Lightweight Threads" (Stein & Shah, 1992)
Thread Performance Considerations
Evaluating the performance of a multithreaded system is not just about measuring raw speed; it's about choosing the right metrics for the right goals. Different applications have different needs. A scientific computing task might prioritize minimum execution time, while a web server is likely more concerned with throughput (requests per second) and response time. As the lecture notes highlight, a design that excels in one metric (like the pipeline model for total execution time) might perform worse in another (like the boss/worker model for average response time).
This leads to fundamental architectural choices:
- Multi-Process vs. Multi-Threaded: A multi-process architecture (like the original Apache web server) provides strong isolation, as each process has its own address space. This makes it robust but memory-intensive and slower due to higher context-switching and inter-process communication (IPC) costs. A multi-threaded architecture shares memory, making it much more resource-efficient and faster to switch contexts, but requires careful synchronization to prevent data races.
- Event-Driven Model: An alternative approach to concurrency, exemplified by servers like Nginx or frameworks like Node.js. An event-driven server uses a single thread and an event loop to handle many connections concurrently. It listens for events (like a new connection or incoming data) and reacts to them without blocking. This model can achieve very high throughput by avoiding the overhead of thread creation and context switching, but it performs poorly if any task blocks the event loop (e.g., a synchronous disk read). A minimal event-loop sketch follows this list.
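Below is a minimal sketch of the event-loop idea using poll() on standard input; a real server would watch a listening socket and many client sockets in the same array, and the 5-second timeout is arbitrary. The key point is that the single thread only reacts when an event is ready and never blocks waiting on any one connection.

#include <poll.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    // Watch a single file descriptor (stdin) for readable data.
    struct pollfd fds[1] = { { .fd = STDIN_FILENO, .events = POLLIN } };
    char buf[256];

    for (;;) {
        int ready = poll(fds, 1, 5000);   // wait up to 5 seconds for any event
        if (ready == -1) { perror("poll"); return 1; }
        if (ready == 0) { printf("no events yet; loop keeps running\n"); continue; }

        if (fds[0].revents & POLLIN) {
            ssize_t n = read(STDIN_FILENO, buf, sizeof(buf));
            if (n <= 0) break;                       // EOF or error: exit the loop
            write(STDOUT_FILENO, buf, (size_t)n);    // handle the event, then return to the loop
        }
    }
    return 0;
}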
For more details, see the Thread Performance Considerations lecture which covers:
- A deep dive into a case study comparing different web server architectures: multi-process, multi-threaded, and an event-driven model (the Flash web server).
- A detailed look at experimental methodology, including how to define system comparisons, select appropriate workloads (e.g., real-world traces vs. synthetic loads), and choose the right performance metrics.
- Practical advice on designing and running experiments to produce meaningful and reproducible results.
- The paper "Flash: An Efficient and Portable Web Server" (Pai et al., 1999)
Scheduling
The OS scheduler is the component responsible for deciding which ready process or thread gets to run on the CPU next. This decision is guided by a scheduling policy, which aims to balance competing goals like fairness, responsiveness, and overall throughput. Different policies have different strengths and are suited for different workloads.
- First-Come, First-Serve (FCFS): The simplest policy. Tasks are executed in the order they arrive. It's fair in a "first-in, first-out" sense but can lead to poor performance if a long-running task blocks many shorter tasks (the "convoy effect").
- Shortest Job First (SJF): This policy executes the task with the shortest estimated execution time next. It is provably optimal for minimizing average waiting time but is impractical, as the OS can't know the future execution time of a task. (A short simulation after this list compares FCFS and SJF.)
- Priority Scheduling: Each task is assigned a priority, and the scheduler always runs the highest-priority task. This is common in systems where kernel tasks must take precedence over user tasks. However, it can lead to problems like priority inversion, where a high-priority task is forced to wait for a lower-priority task holding a necessary lock.
- Round Robin (RR): The foundational algorithm for timesharing systems. Each task is given a small time slice (or quantum) to run. If it's still running when the time slice expires, it's preempted and moved to the back of the ready queue. This ensures that all tasks make progress and the system remains responsive.
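Here is a small, self-contained simulation (the burst times are made up) that computes the average waiting time for the same five tasks under FCFS and SJF. Running it shows how one long task at the front of the FCFS queue inflates everyone else's wait, the convoy effect in miniature.

#include <stdio.h>
#include <stdlib.h>

// Hypothetical CPU burst times (ms) for five tasks, all arriving at time 0.
static int bursts[] = {24, 3, 3, 7, 12};
#define N (sizeof(bursts) / sizeof(bursts[0]))

// Average waiting time when tasks run in the given order.
static double avg_wait(const int *order) {
    int wait = 0, elapsed = 0;
    for (size_t i = 0; i < N; i++) {
        wait += elapsed;       // each task waits for everything scheduled before it
        elapsed += order[i];
    }
    return (double)wait / N;
}

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int sjf[N];
    for (size_t i = 0; i < N; i++) sjf[i] = bursts[i];
    qsort(sjf, N, sizeof(int), cmp);   // SJF: shortest bursts first

    printf("FCFS average wait: %.1f ms\n", avg_wait(bursts)); // 23.6 ms
    printf("SJF  average wait: %.1f ms\n", avg_wait(sjf));    // 9.4 ms
    return 0;
}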
Modern systems use more sophisticated schedulers. The Linux Completely Fair Scheduler (CFS), for instance, abandons fixed time slices and instead tries to give each task a fair, proportional amount of the CPU's processing time. On multi-core systems with hyperthreading, schedulers must also consider how to pair workloads. The best performance is often achieved by co-scheduling a mix of CPU-bound and memory-bound tasks, which maximizes the utilization of both the processor pipeline and the memory bus.
For more details, see the Scheduling lecture which covers:
- The mechanics of timeslice length and how it impacts CPU-bound vs. I/O-bound tasks.
- The evolution of Linux schedulers, from the O(1) scheduler to the Completely Fair Scheduler (CFS).
- How hardware counters can be used to identify memory-bound vs. CPU-bound workloads to make smarter scheduling decisions on hyperthreaded systems.
- The paper "Chip Multithreading Systems Need a New Operating System Scheduler" (Fedorova et al., 2004)
Memory, Persistence, & I/O
Memory Management
Memory management is the OS's strategy for arbitrating and controlling the finite physical memory (DRAM) on behalf of all running processes. The central abstraction is virtual memory, which gives each process the illusion of having its own large, contiguous, and private address space. In reality, the OS, with crucial help from the hardware's Memory Management Unit (MMU), translates these virtual addresses to physical addresses in DRAM. To speed up the constant address translations, the MMU uses a fast, specialized hardware cache called a Translation Lookaside Buffer (TLB), which stores recently used mappings. A "TLB miss" forces a much slower lookup in the page table.

The dominant mechanism for this is paging. The virtual address space is divided into fixed-size blocks called pages, and physical memory is divided into identically sized page frames. For each process, the OS maintains a page table that maps virtual page numbers (VPNs) to physical frame numbers (PFNs). Each entry in this table, a Page Table Entry (PTE), also contains important metadata. This metadata varies between architectures (x86, ARM, etc.) and often includes a present bit (is the page in DRAM?), a dirty bit (has the page been modified?), an access bit (has the page been referenced?), and protection bits (read/write/execute permissions).
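A short sketch of the translation itself, assuming 4 KiB pages and a made-up virtual address and frame number: the MMU splits the virtual address into a virtual page number and an offset, the page table supplies the frame, and the two are recombined.

#include <inttypes.h>
#include <stdio.h>

// Assumes 4 KiB pages: the low 12 bits are the offset, the rest is the VPN.
#define PAGE_SHIFT 12
#define PAGE_SIZE  (1ULL << PAGE_SHIFT)

int main(void) {
    uint64_t vaddr = 0x00007f3a2b4c5d6eULL;        // made-up virtual address

    uint64_t vpn    = vaddr >> PAGE_SHIFT;         // index used to walk the page table
    uint64_t offset = vaddr & (PAGE_SIZE - 1);     // byte position within the page

    // Pretend the page table maps this VPN to physical frame 0x1a2b3.
    uint64_t pfn   = 0x1a2b3;
    uint64_t paddr = (pfn << PAGE_SHIFT) | offset;

    printf("vaddr  = 0x%" PRIx64 "\n", vaddr);
    printf("VPN    = 0x%" PRIx64 ", offset = 0x%" PRIx64 "\n", vpn, offset);
    printf("paddr  = 0x%" PRIx64 "\n", paddr);
    return 0;
}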
Demand Paging allows pages to be loaded from disk only when a process first accesses them (a "page fault"), enabling programs to run even if they don't fit entirely in memory. When memory is full, page replacement algorithms (like approximations of Least Recently Used) use the "access" and "dirty" bits to decide which pages to evict to disk.
Since each process has its own virtual address space with its own page tables, a context switch also requires the OS to switch from the current process's page table to the next process's. Copy-on-Write (COW) is an important optimization for process creation (fork()), where the parent's pages are shared with the child instead of being copied. A real copy is only made if and when one of the processes attempts to write to a shared page, as in the small sketch below.
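A tiny fork() sketch of that isolation: after the fork, parent and child initially share the same physical pages, and the child's write to value triggers a private copy, so the parent's view is unchanged.

#include <stdio.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    int value = 100;

    pid_t pid = fork();   // child starts with copy-on-write references to the parent's pages
    if (pid == -1) { perror("fork"); return 1; }

    if (pid == 0) {
        value = 200;      // first write: the kernel copies this page just for the child
        printf("child : value = %d\n", value);
        return 0;
    }

    waitpid(pid, NULL, 0);
    printf("parent: value = %d (unaffected by the child's write)\n", value);
    return 0;
}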
Since a page table for a large address space can itself be enormous, modern systems use multi-level page tables to save space. Each entry in an outer level points to a page table at the next level, so tables for unused regions of the address space never need to be allocated. The same optimization of using a TLB to cache these entries still applies.

Beyond managing memory for user processes, the OS kernel must also efficiently allocate memory for its own internal data structures. This presents unique challenges, primarily fragmentation. External fragmentation occurs when free memory is broken into many small, non-contiguous blocks, eventually making it impossible to satisfy a large allocation request even though enough total memory is free. Internal fragmentation occurs when the allocator provides a block of memory that is larger than the request, wasting the space left over inside the block.
For more details, see the Memory Management lecture which covers:
- Different memory allocation strategies inside the Linux kernel, like the Buddy Allocator and Slab Allocator.
- The trade-offs of different page sizes.
- Alternative memory management designs like segmentation.
- Using memory management hardware for other services like checkpointing and process migration.
Inter-Process Communication (IPC)
Inter-Process Communication (IPC) provides a set of mechanisms for separate processes, each with its own private address space, to communicate and coordinate with each other. The OS provides several ways to achieve this, which generally fall into two categories: message-based and memory-based communication.
Message-based IPC is an abstraction where the OS manages a communication channel between processes. Processes send messages into the channel, and the OS handles delivering them to the receiving process. This category includes familiar tools like pipes (simple byte streams between two processes, common in shell command chaining) and sockets (a more general-purpose interface that can work between processes on the same machine or across a network). The main advantage of this approach is simplicity; the OS handles the underlying details and synchronization. The disadvantage is performance overhead, as every send and receive operation requires a system call and a data copy between the user process and the kernel.
Shared memory, the primary form of memory-based IPC, offers a much higher-performance alternative. Here, the OS maps a region of physical memory into the virtual address spaces of two or more processes. Once this mapping is established, the processes can read and write to this shared area as if it were their own memory, with no further kernel involvement. This avoids the system call and data copy overhead on every communication, making it ideal for transferring large amounts of data. However, this performance comes at the cost of complexity. The processes themselves are now responsible for synchronizing access to the shared memory using constructs like mutexes or semaphores to prevent race conditions.
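Here is the writer side of a minimal POSIX shared-memory sketch; the segment name /gios_demo_shm and the 4 KiB size are arbitrary assumptions, and on older glibc versions the program may need to be linked with -lrt. A reader process would shm_open the same name, mmap it, and see the same bytes, and in a real program the two processes would still need a semaphore or mutex to coordinate access.

#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    const char *name = "/gios_demo_shm";   // arbitrary segment name
    const size_t size = 4096;

    // Create the shared memory object and give it a size.
    int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
    if (fd == -1) { perror("shm_open"); return 1; }
    if (ftruncate(fd, size) == -1) { perror("ftruncate"); return 1; }

    // Map the segment into this process's address space.
    char *region = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    // After the mapping is established, reads and writes need no system calls.
    strcpy(region, "hello from the writer process");
    printf("wrote: %s\n", region);

    munmap(region, size);
    close(fd);
    // shm_unlink(name) would remove the segment once all processes are done with it.
    return 0;
}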
For more details, see the Inter-Process Communication lecture which covers:
- Specific APIs for message queues and shared memory, including SysV and POSIX standards.
- How to use PThread synchronization primitives (mutexes, condition variables) for inter-process synchronization.
- Command-line tools for inspecting IPC facilities on a system.
- Design considerations for using shared memory, such as segment size and management.
Synchronization Constructs
While mutexes and condition variables are the fundamental building blocks for synchronization, they can be low-level and error-prone. To address this, operating systems and libraries provide higher-level and more specialized synchronization constructs. The correctness and performance of all these constructs, however, depend entirely on low-level atomic instructions provided by the hardware (e.g., test-and-set, compare-and-swap). These instructions guarantee that a sequence of operations, like reading a value, checking it, and writing a new value, happens indivisibly.
On modern multi-core systems, the performance of these atomic operations is deeply tied to cache coherence. Each CPU core has its own private cache, which can lead to situations where one core's view of a memory location is different from another's. Cache coherence protocols (like the common write-invalidate protocol) solve this by forcing other cores to discard their local copies of a memory location whenever one core writes to it. This is crucial for correctness, but it means that when one thread modifies a shared lock, it can trigger a cascade of cache misses for all other spinning threads, significantly impacting performance.
Building on this hardware foundation, we get more expressive tools:
- Spinlocks: The most basic type of lock. Unlike a mutex, which puts a waiting thread to sleep, a thread waiting on a spinlock will "spin" in a tight loop, repeatedly checking if the lock is free. This is highly efficient on multi-processor systems if the critical section is very short, as it avoids the high cost of a context switch. However, it's extremely wasteful if the lock is held for a long time or on a single-core system. (A minimal spinlock sketch follows this list.)
- Semaphores: A more general synchronization tool. A semaphore is essentially a counter that controls access to a resource. Threads "wait" on (decrement) the semaphore to acquire it and "signal" (increment) it to release. A semaphore initialized with a value of 1 behaves like a mutex. When initialized with a larger value N, it can be used to grant access to a pool of N identical resources.
- Reader/Writer Locks: A specialized lock designed for scenarios where a resource is read frequently but written to infrequently. It allows any number of "reader" threads to access the resource concurrently, but ensures that a "writer" thread has exclusive access. This provides much better performance than a simple mutex for read-dominant workloads.
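As a sketch of how these constructs sit on top of hardware atomics, here is a minimal spinlock built from C11's atomic_flag (which compiles down to a test-and-set style instruction). The example is mine, not from the course, and it is only reasonable because the critical section, a single increment, is extremely short.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;
static long counter = 0;

static void spin_lock(void) {
    // Atomically set the flag and return its old value; spin while it was already set.
    while (atomic_flag_test_and_set_explicit(&lock, memory_order_acquire))
        ;   // busy-wait instead of sleeping
}

static void spin_unlock(void) {
    atomic_flag_clear_explicit(&lock, memory_order_release);
}

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        spin_lock();
        counter++;          // the critical section: one shared increment
        spin_unlock();
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld (expected 200000)\n", counter);
    return 0;
}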
For more details, see the Synchronization Constructs lecture which covers:
- The impact of cache coherence on spinlock performance and the design of more advanced spinlocks (e.g., Test-and-Test-and-Set, Queueing Locks).
- Higher-level constructs like Monitors, which bundle data with the synchronization logic needed to access it.
- The specific POSIX API for semaphores.
- The paper "The Performance of Spin Lock Alternatives for Shared-Memory Multiprocessors" (Anderson, 1990)
Disk and File Systems
The OS manages I/O devices through device drivers, which abstract away the specific hardware details. For efficiency, modern systems use Direct Memory Access (DMA), allowing devices to transfer data directly to and from main memory without involving the CPU for every byte. The CPU initiates the transfer, but the device and DMA controller handle the rest, freeing the CPU to do other work.
To manage persistent storage, the OS provides a file system abstraction. A key component in Linux and other Unix-like systems is the Virtual File System (VFS), an interface that allows the system to support many different underlying file systems (like ext4, XFS, NFS) in a uniform way. Applications interact with the VFS using standard system calls (open, read, write), and the VFS directs these calls to the correct filesystem driver.
The VFS operates on a few key abstractions. Every file and directory is represented by an inode (index node), a data structure that stores the file's metadata (permissions, owner, size) and, crucially, points to the actual data blocks on the disk. Because files don't need to be stored contiguously, the inode acts as an index for all of a file's blocks. To support files larger than what can be referenced by direct pointers in the inode, filesystems like ext2 use indirect pointers, where an inode points to a block of more pointers, adding levels of indirection to address vast amounts of data.
To make disk access efficient, filesystems rely on several optimizations. A buffer cache in main memory holds recently used disk blocks to avoid slow physical disk reads. I/O schedulers reorder disk requests to minimize the physical movement of the disk head, turning many small random writes into larger sequential ones. Finally, journaling provides reliability by writing changes to a log before committing them to their final location on disk, ensuring the filesystem can be quickly restored to a consistent state after a crash.
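To see inode metadata from user space, here is a small sketch using stat(); the default path is just an assumption, and any file can be passed as an argument. The printed fields (inode number, size, permissions, link count) come straight from the inode the VFS looks up for the path.

#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char *argv[]) {
    const char *path = (argc > 1) ? argv[1] : "/etc/hostname";  // any existing file
    struct stat st;

    // stat() asks the VFS for the metadata stored in the file's inode.
    if (stat(path, &st) == -1) { perror("stat"); return 1; }

    printf("path         : %s\n", path);
    printf("inode number : %llu\n", (unsigned long long)st.st_ino);
    printf("size (bytes) : %lld\n", (long long)st.st_size);
    printf("permissions  : %o\n", st.st_mode & 0777);
    printf("hard links   : %lu\n", (unsigned long)st.st_nlink);
    return 0;
}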
For more details, see the I/O Management lecture which goes deeper into:
- Different types of I/O devices (block, character, network) and how they are represented in the OS.
- The details of the block device stack in Linux.
- The role of the dentry cache in speeding up path lookups.
- The specific on-disk layout of the ext2 filesystem, including block groups and bitmaps.
System Virtualization
System virtualization allows multiple, isolated operating systems, known as Virtual Machines (VMs) or guests, to run concurrently on a single physical machine. This is orchestrated by a Virtual Machine Monitor (VMM), or hypervisor, which manages and allocates the underlying hardware resources. The VMM's goal is to provide each VM with an environment that is a faithful, performant, and isolated duplicate of the real hardware.
There are two primary models for this:
- Type 1 (Bare-Metal): The hypervisor runs directly on the physical hardware, and the guest VMs run on top of it. This model is common in datacenter environments, with examples like VMware ESX and Xen. Often, a privileged "service VM" (like Xen's dom0) is used to manage device drivers.
- Type 2 (Hosted): The hypervisor is an application that runs on top of a conventional host OS (e.g., Linux, Windows). Examples include KVM, VirtualBox, and VMware Workstation. This model leverages the host OS for device support and management.

The core challenge of virtualization is that the guest OS expects to have full, privileged control over the hardware, but the hypervisor must remain in charge. Historically, this was solved with clever software techniques. For CPU virtualization, where some privileged instructions on older x86 chips wouldn't trap to the hypervisor, solutions included binary translation (dynamically rewriting problematic guest code) and paravirtualization (modifying the guest OS to make explicit calls, "hypercalls", to the VMM). For memory, the hypervisor creates shadow page tables to map the guest's virtual addresses directly to machine addresses. For I/O, the split device driver model is common, with a lightweight "front-end" driver in the guest communicating with a full "back-end" driver in the host or service VM.
Today, most of these complex software techniques have been supplanted by direct hardware support (e.g., Intel VT-x and AMD-V). Modern CPUs have features that allow a guest OS to run in a non-privileged mode where sensitive instructions automatically trap to the hypervisor. Hardware also provides support for memory virtualization (e.g., Extended Page Tables) and I/O virtualization, making the whole process much simpler and more efficient.
For more details, see the Virtualization lecture which digs into:
- The benefits of virtualization, such as server consolidation and easier migration.
- The different models for virtualizing I/O devices, including passthrough, hypervisor-direct, and the split device driver model.
- A deeper look at the hardware features that enable efficient virtualization on modern platforms.
- The papers "Formal Requirements for Virtualizable Third Generation Architectures" (Popek & Goldberg, 1974) and "Virtual Machine Monitors: Current Technology and Future Trends" (Rosenblum & Garfinkel, 2005)
Distributed Systems
Remote Procedure Calls (RPC)
Remote Procedure Calls (RPC) are a powerful abstraction that makes a function call on a remote machine look and feel like a local function call. The goal is to hide the underlying complexities of network communication, like socket programming, data serialization, and error handling, from the application developer.
The magic of RPC is implemented through several key components. First, the client and server agree on a contract defined in an Interface Definition Language (IDL). The IDL specifies the procedures that can be called, their parameters, and their return types, independent of any programming language. An RPC compiler then uses this IDL file to automatically generate stubs for both the client and the server. When the client calls a remote procedure, it's actually calling the local client stub. This stub's job is to take the arguments, package them into a standardized, machine-independent format (marshalling), and send them across the network to the server.
On the server side, the server stub receives the message, unpacks it (unmarshalling), and makes a regular local call to the actual procedure implementation. The return value is then marshalled and sent back to the client stub, which unmarshals it and returns it to the calling client application. Before any of this can happen, the client must first find the server, a process called binding, which is typically handled by a central registry or "portmapper" service. A major challenge in any RPC system is handling partial failures: what happens if the server crashes or the network fails? Modern RPC frameworks must provide mechanisms to handle these inevitable distributed systems problems.
For more details, see the Remote Procedure Calls lecture which also covers:
- A concrete example using SunRPC (now ONC RPC) and its IDL, XDR (External Data Representation).
- How RPC systems handle complex data types like pointers.
- The specific encoding rules used by XDR to ensure cross-platform compatibility.
- Other RPC systems like Java RMI.
- The paper "Implementing Remote Procedure Calls" (Birrell & Nelson, 1984)
Distributed File Systems
A Distributed File System (DFS) abstracts away network storage to make remote files appear as if they were on the local machine. This is a powerful concept, but it introduces significant design challenges, primarily centered around performance, fault tolerance, and consistency. A fundamental choice is how to organize the servers: files can be partitioned across many servers to improve scalability, or replicated across servers to improve fault tolerance and availability. Most large systems use a hybrid approach.
The core tension in any DFS is between performance and consistency. A simple model where the client downloads a file, modifies it, and uploads it back is efficient for local work but terrible for network traffic and concurrent access. The opposite, where every single read or write is a remote operation, is great for consistency but suffers from high latency and poor server scalability. The practical compromise is client-side caching, where clients cache file blocks locally. This improves performance dramatically but creates a new problem: how to ensure a client's cache is up-to-date if another client modifies the file on the server.
This leads to another critical design choice: is the server stateless or stateful? A stateless server keeps no information about which clients are accessing which files. This makes crash recovery trivial (the server just restarts) but it limits performance and makes consistency management difficult. A stateful server keeps track of clients and their open files, which enables high-performance caching strategies and features like file locking. However, if a stateful server crashes, recovering its state without losing data or breaking consistency is much more complex. Different file sharing semantics, from the strict "UNIX semantics" to the looser "session semantics," represent different trade-offs in this design space.
For more details, see the Distributed File Systems lecture which also covers:
- Network File System (NFS), a classic example of a DFS that evolved from a stateless protocol (v3) to a stateful one (v4) to better support caching, locking, and performance.
- Sprite DFS, an influential research system whose design was based on an analysis of real-world file access patterns, which revealed that most files are short-lived, leading to a caching system that optimized for delayed writes.
- The different file sharing semantics (UNIX, session, etc.) in more detail.
- The paper "Caching in the Sprite Network File System" (Nelson et al., 1988)
Distributed Shared Memory (DSM) & Consistency
Distributed Shared Memory (DSM) is an abstraction that allows processes running on different machines to share a common memory address space, giving them the illusion of running on a single, large shared-memory machine. This is a powerful technique for scaling applications beyond the memory capacity of a single node. DSM can be implemented directly in expensive, specialized hardware, but it is more commonly realized in software at the OS or programming language level.
The central challenge in any DSM system is maintaining consistency. When one node writes to a shared memory location, how and when do other nodes see that change? This is defined by a consistency model, which is a contract between the system and the programmer about the ordering and visibility of memory operations. Different models offer different trade-offs between performance and programming simplicity.
The spectrum of consistency models includes:
- Strict Consistency: The ideal model where any write is instantaneously visible to all other processes. This is impossible to implement in a real distributed system due to network latency.
- Sequential Consistency: A more relaxed model where all processes see the same global ordering (interleaving) of memory operations, though this order might not reflect the real-time execution. Operations from a single process always appear in the order they were issued. This is easier to reason about but can be slow, as it requires global coordination.
- Causal Consistency: This model only enforces ordering for writes that are causally related (i.e., a write that depends on a previous read). Concurrent, unrelated writes can be seen in different orders by different processes, which allows for better performance.
- Weak Consistency: This model gives the most control (and responsibility) to the programmer. It introduces explicit synchronization points (acquire/release operations). Memory updates are only guaranteed to be visible to other processes after an explicit synchronization call, allowing the system to batch updates and reduce network traffic, significantly improving performance. (The sketch after this list shows the same acquire/release idea on a single machine.)
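The acquire/release contract shows up on a single machine with C11 atomics; the sketch below (the names are mine) publishes a payload with a release store, and the consumer's acquire load guarantees the payload is visible once the flag is seen. DSM systems apply the same contract across the network, batching updates until the synchronization point.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload = 0;              // ordinary shared data
static atomic_int ready = 0;         // synchronization flag

static void *producer(void *arg) {
    payload = 42;                                            // plain write
    atomic_store_explicit(&ready, 1, memory_order_release);  // publish: writes above become visible
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;   // spin until the flag has been published
    printf("payload = %d\n", payload);   // guaranteed to print 42
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}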
For more details, see the Distributed Shared Memory lecture which also covers:
- The trade-offs of different sharing granularities (page-based vs. object-based).
- DSM design choices like data migration vs. replication.
- How software DSM systems leverage the Memory Management Unit (MMU) and page faults to trap on remote or protected memory accesses.
- The paper "Distributed Shared Memory: Concepts and Systems" (Protic et al., 1996)
Datacenter Technologies
The final topic of the course brings everything together by looking at the technologies that power modern datacenters and cloud computing. Large-scale internet services are typically designed in one of two ways: a homogeneous architecture, where any server can handle any request, which is simple to load-balance but has poor data locality; or a heterogeneous architecture, where servers are specialized for specific tasks, which improves performance through caching and locality but is more complex to manage and scale.
Cloud Computing emerged as a powerful solution to the challenges of building and scaling these services. It abstracts away physical hardware, offering on-demand, elastic resources as a utility. This model is built on two key principles: the law of large numbers, which allows providers to serve many customers with variable needs using a fixed set of resources, and economies of scale, which dramatically lowers the cost of hardware and operations.
The services offered by cloud providers are typically categorized into different models, defining the trade-off between user control and provider management:
- Infrastructure as a Service (IaaS): Provides the fundamental building blocksâvirtual machines, storage, and networking. The user manages the OS, applications, and data. (e.g., Amazon EC2, Google Compute Engine).
- Platform as a Service (PaaS): Provides a platform for developers to build and deploy applications without worrying about the underlying infrastructure. The provider manages the OS, runtime, and middleware. (e.g., Vercel, Google App Engine).
- Software as a Service (SaaS): Delivers a complete software application over the internet. The user simply consumes the service. (e.g., Gmail, Salesforce).
These cloud services are made possible by many of the technologies discussed throughout the course. Virtualization provides fungible, isolated compute resources. Large-scale cluster schedulers (like Kubernetes, Mesos, or YARN) manage resource allocation across thousands of machines. Distributed file systems and NoSQL databases provide scalable storage, and the entire stack is built on the principles of fault tolerance and redundancy to handle the inevitable failures in a large, complex system.
For more details, see the Datacenter Technologies lecture which also covers:
- Cloud deployment models: Public, Private, and Hybrid clouds.
- The "poster child" case study of Animoto's massive scaling on AWS.
- An overview of big data processing stacks like Hadoop and Spark.
Course Overview
The projects were a great way to solidify the concepts learned in the course and took up a significant portion of the time I spent on it. The average across all of the student reviews on OMSCS Central was 18.5 hours per week. I averaged a bit more than that, particularly on the first project, because of my lack of C and C++ experience. There were a few weeks, while I was hammering out the projects, when I spent closer to 40 hours. Once the projects were done, I was able to focus on the lectures and readings, which took less time.
I'm writing follow-up posts for each project to explore them in more detail. Here is the first one:
- Project 1: Concurrent File Server
I'll update this list as I write about the other projects.
Grading
Here's the grading breakdown from when I took the course in Spring 2024:
Component | Weight | Description |
---|---|---|
Participation | 5% | Views and posts in the discussion forum |
Project 1 | 15% | Multi-threaded Web Server |
Project 3 | 15% | Cache & Proxy Servers with Shared Memory IPC |
Project 4 | 15% | GRPC and Distributed Systems |
Midterm Exam | 25% | Everything up to Memory, Persistence, & I/O |
Final Exam | 25% | Everything after that |
The course was graded on a slight curve. A ton of folks withdrew (38.1%), so the grade distribution for the ones that stayed was pretty good. It was as follows:
Grade | Percentage |
---|---|
A | 39.1% |
B | 16.5% |
C | 3.6% |
D | 1.2% |
F | 1.4% |
Withdrew | 38.1% |
Books
These are the books I used to complement the lectures and help me with the projects.

Operating Systems: Three Easy Pieces
Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau
This is the main book I used to supplement the lectures. It's filled with easy-to-understand explanations of complex topics and covers most of the material in the course. I highly recommend it.

Computer Systems: A Programmer's Perspective
Randal E. Bryant and David R. O'Hallaron
A classic for a reason. It provides a great bridge between hardware and software. Chapters 1, 6, and 8-12 were especially helpful for this course.

The C Programming Language
Brian W. Kernighan and Dennis M. Ritchie
The definitive book on C, written by its creators. I read it cover to cover before the course started to brush up on my C knowledge.

Beej's Guides to C, Network Programming, and IPC
Brian 'Beej' Hall
An invaluable, practical, and free resource for C programming, socket programming, and inter-process communication. These were immensely helpful for the projects.

The Linux Programming Interface
Michael Kerrisk
This is structured as an encyclopedic guide to the Linux and UNIX system programming interface. It was a great resource for the projects.
Other Resources
Here are some other posts I found helpful:
My Experience
GIOS was the first course I took for the program and it was a great way to get my feet wet. The projects were difficult but meaningful, and the concepts challenged me to think at a systems level.
I hope this post is helpful for others considering GIOS, and I welcome feedback or corrections, especially if you catch something I got wrong. I'm still learning, and learning in public is part of the process.
Thanks for reading!