OS system design problems are the most open-ended and challenging category of operating systems interview questions. Unlike numerical problems with exact answers, system design questions evaluate your ability to make architectural decisions, reason about tradeoffs, and communicate complex designs clearly.
These problems simulate real engineering work: you're given requirements and constraints, and must design a working system. Interviewers assess not just correctness, but also your design methodology, ability to handle ambiguity, and consideration of edge cases.
System design questions reveal how you think. They expose whether you can decompose problems systematically, identify key abstractions, and anticipate failure modes. This page provides frameworks and detailed examples for every common OS system design question.
By the end of this page, you will: (1) Apply a systematic methodology for OS design problems, (2) Design schedulers with specific properties, (3) Design memory allocators and virtual memory systems, (4) Design file systems meeting given constraints, (5) Design synchronization mechanisms and concurrent data structures.
Before diving into specific designs, establish a systematic approach. This methodology applies to any OS design question.
The DRIVE Framework:
D - Define Requirements
R - Review Existing Approaches
I - Identify Core Abstractions
V - Validate with Examples
E - Evaluate Tradeoffs
| Category | Example Problems | Key Considerations |
|---|---|---|
| Scheduling | Thread pool, Task scheduler, Fair scheduler | Fairness, priority, starvation, overhead |
| Memory | malloc/free, Object pool, Memory-mapped files | Fragmentation, speed, metadata overhead |
| File Systems | Log-structured FS, Distributed FS | Consistency, crash recovery, performance |
| Synchronization | Custom locks, Lock-free structures | Correctness, progress, performance |
| IPC | Message queue, Shared memory system | Latency, copying overhead, ordering |
Spend the first 5 minutes asking questions and restating requirements. This prevents designing the wrong system. Good questions: What's the expected load? What are the failure modes we care about? What's the priority ordering for tradeoffs? Is there existing infrastructure we can leverage?
Problem Statement:
Design a CPU scheduler for a general-purpose operating system. The scheduler should provide good interactive response time while maintaining throughput for batch processes. It should prevent starvation and adapt to process behavior.
Clarifying Questions:
How many cores must the scheduler support? (Design for a single core first, then extend to SMP.)
What is the timer tick granularity, and how configurable should quanta be?
Are there hard real-time requirements, or is "good interactive response" a soft goal?
Can processes change behavior over time (e.g., a compiler alternating between I/O and CPU phases)?
Design: Multi-Level Feedback Queue (MLFQ)
After analyzing requirements, MLFQ is the best fit because:
It prioritizes interactive (I/O-bound) tasks without requiring advance knowledge of process behavior.
It adapts dynamically: demotion penalizes CPU-bound tasks, promotion rewards tasks that block early.
A periodic priority boost prevents starvation of long-running batch work.
Per-level quanta trade responsiveness (short quanta at high levels) against throughput (long quanta at low levels).
Core Data Structures:
// Per-priority-level queue
struct run_queue {
struct list_head tasks; // Linked list of runnable tasks
int time_quantum; // Time slice for this level
int num_tasks; // Count for O(1) length check
};
// Global scheduler state
struct scheduler {
struct run_queue queues[NUM_PRIORITY_LEVELS]; // e.g., 40 levels
int highest_nonempty; // Optimization: track highest priority queue
spinlock_t lock; // Protect scheduler state
uint64_t boost_timer; // Time until next boost
};
// Per-task scheduling information
struct task_sched_info {
int priority; // Current priority level
int time_remaining; // Time left in current quantum
int time_used_at_level; // Used for demotion decisions
enum { CPU_BOUND, IO_BOUND, UNKNOWN } behavior;
};
Key Algorithms:
1. Task Selection (pick_next_task):
struct task* pick_next_task(struct scheduler* sched) {
// Start from highest priority
for (int i = sched->highest_nonempty; i >= 0; i--) {
if (!list_empty(&sched->queues[i].tasks)) {
// Round-robin within priority level
struct task* t = list_first_entry(
&sched->queues[i].tasks, struct task, run_list);
list_move_tail(&t->run_list, &sched->queues[i].tasks);
t->sched.time_remaining = sched->queues[i].time_quantum;
return t;
}
}
return idle_task; // No runnable tasks
}
2. Timer Tick Handler:
void scheduler_tick(struct task* current) {
current->sched.time_remaining--;
current->sched.time_used_at_level++;
if (current->sched.time_remaining == 0) {
// Quantum expired: demote (CPU-bound behavior)
if (current->sched.priority > 0) {
demote_task(current);
}
schedule(); // Trigger reschedule
}
// Check for periodic boost
if (should_boost()) {
boost_all_tasks();
}
}
3. I/O Return Handler:
void wake_from_io(struct task* t) {
// I/O-bound behavior: keep or boost priority
// QUANTUM stands for the time_quantum of t's current level
if (t->sched.time_used_at_level < QUANTUM / 2) {
// Used little CPU before blocking: likely interactive
if (t->sched.priority < MAX_PRIORITY) {
promote_task(t);
}
}
t->sched.time_used_at_level = 0;
enqueue_task(t);
}
A malicious process could game MLFQ by yielding just before quantum expiration to avoid demotion. Mitigation: Track total CPU time at each level, not just current quantum. Demote after cumulative threshold, regardless of individual quantum usage.
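One way to implement this mitigation, as a minimal sketch building on the structures above (the per-level budget constant LEVEL_CPU_BUDGET is an assumed tunable):
// Charge ticks against a cumulative per-level budget so that yielding
// just before quantum expiry no longer resets the demotion clock
void charge_cpu_time(struct task* t, int ticks) {
    t->sched.time_used_at_level += ticks;
    if (t->sched.time_used_at_level >= LEVEL_CPU_BUDGET &&
        t->sched.priority > 0) {
        demote_task(t);                   // Demote on total use at this level
        t->sched.time_used_at_level = 0;  // Fresh budget at the new level
    }
}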
Evaluation:
| Requirement | How Design Satisfies It |
|---|---|
| Interactive response | High priority for I/O-bound behavior |
| Batch throughput | Long quanta at low priority levels reduce context-switch overhead |
| No starvation | Periodic boost brings all to top |
| Adapts to behavior | Demotion on CPU use, promotion on I/O |
| Configurable | Per-level quantum, boost interval adjustable |
SMP Extension:
For multiprocessors, give each CPU its own MLFQ instance (per-CPU run queues) so the common scheduling path takes no global lock. Periodically rebalance: an idle CPU steals runnable tasks from the busiest CPU, and affinity keeps cache-warm tasks on their last CPU.
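A minimal sketch of that extension, reusing pick_next_task() from above (busiest_cpu() and migrate_task() are hypothetical helpers):
// One scheduler instance per CPU; an idle CPU steals from a loaded one
struct scheduler per_cpu_sched[NR_CPUS];

struct task* smp_pick_next(int cpu) {
    struct task* t = pick_next_task(&per_cpu_sched[cpu]);
    if (t != idle_task) {
        return t;
    }
    int victim = busiest_cpu();           // CPU with the most runnable tasks
    if (victim >= 0 && victim != cpu) {
        return migrate_task(victim, cpu); // Pull one runnable task over
    }
    return idle_task;
}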
Problem Statement:
Design malloc() and free() functions for heap memory management. The allocator should minimize fragmentation, provide fast allocation/deallocation, and handle a wide range of allocation sizes.
Requirements Clarification:
Must it be thread-safe? (Yes: assume a multi-threaded process.)
What is the size distribution? (Mostly small allocations, with occasional large buffers.)
What alignment is guaranteed? (At least 8/16 bytes, per the platform ABI.)
Must memory ever be returned to the OS, or may the heap only grow?
Design: Segregated Free Lists with Buddy System for Large Blocks
Hybrid approach to handle different size classes optimally:
Small Allocation: Segregated Free Lists
Size classes: 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096 bytes
#define NUM_SIZE_CLASSES 10
struct free_list_node {
struct free_list_node* next;
};
struct allocator {
struct free_list_node* free_lists[NUM_SIZE_CLASSES];
size_t size_class_sizes[NUM_SIZE_CLASSES]; // 8, 16, 32...
spinlock_t locks[NUM_SIZE_CLASSES]; // Fine-grained locking
};
int size_class_index(size_t size) {
// Add header overhead, round up to power of 2
size_t total = size + sizeof(block_header);
if (total <= 8) return 0;
if (total <= 16) return 1;
// ... (or use bit manipulation: 63 - clz(total-1) - 2)
return NUM_SIZE_CLASSES - 1;
}
Block Header (minimal metadata):
struct block_header {
size_t size; // Actual allocation size (for free)
uint8_t size_class; // Which free list this belongs to
uint8_t flags; // Allocated, large block, etc.
};
malloc() Implementation:
void* malloc(size_t size) {
if (size == 0) return NULL;
if (size > LARGE_THRESHOLD) {
return large_alloc(size); // Buddy or mmap
}
int class_idx = size_class_index(size);
lock(&allocator.locks[class_idx]);
if (allocator.free_lists[class_idx] == NULL) {
    // Refill: get a page from the OS, carve it into class-sized blocks
    refill_free_list(class_idx);
    if (allocator.free_lists[class_idx] == NULL) {
        unlock(&allocator.locks[class_idx]);
        return NULL; // OS refused to give us more memory
    }
}
struct free_list_node* block = allocator.free_lists[class_idx];
allocator.free_lists[class_idx] = block->next;
unlock(&allocator.locks[class_idx]);
// Set up header
struct block_header* header = (struct block_header*)block;
header->size = size;
header->size_class = class_idx;
header->flags = ALLOCATED;
return (void*)(header + 1); // Return pointer after header
}
free() Implementation:
void free(void* ptr) {
if (ptr == NULL) return;
struct block_header* header = ((struct block_header*)ptr) - 1;
if (header->flags & LARGE_BLOCK) {
large_free(header);
return;
}
int class_idx = header->size_class;
lock(&allocator.locks[class_idx]);
struct free_list_node* node = (struct free_list_node*)header;
node->next = allocator.free_lists[class_idx];
allocator.free_lists[class_idx] = node;
unlock(&allocator.locks[class_idx]);
}
For high-performance multi-threaded allocation, use thread-local caches. Each thread has its own small free lists. Only when thread cache is empty/full do we access shared global state. This eliminates lock contention for common cases.
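A minimal sketch of the fast path, assuming C11 thread-local storage (fetch_batch_from_global(), which takes the per-class lock once and grabs several blocks, is a hypothetical helper):
// Per-thread cache: a small free list per size class, touched without locks
static _Thread_local struct free_list_node* tcache[NUM_SIZE_CLASSES];

void* tcache_alloc(int class_idx) {
    struct free_list_node* block = tcache[class_idx];
    if (block != NULL) {
        tcache[class_idx] = block->next; // Lock-free pop: list is thread-private
        return block;
    }
    // Miss: hit shared state once, refilling the thread cache in bulk
    return fetch_batch_from_global(class_idx);
}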
Large Block Allocation: Buddy System
For blocks >4KB, use buddy allocation:
void* buddy_alloc(size_t size) {
int order = get_order(size); // Which power of 2
// Find smallest available block >= required order
for (int o = order; o < MAX_ORDER; o++) {
if (buddy_free_lists[o] != NULL) {
struct buddy_block* block = buddy_free_lists[o];
buddy_free_lists[o] = block->next;
// Split block down to required size
while (o > order) {
o--;
// Create buddy and add to free list (pointer arithmetic is in
// minimum-block units: assumes sizeof(struct buddy_block) == MIN_BLOCK)
struct buddy_block* buddy = block + (1 << o);
buddy->next = buddy_free_lists[o];
buddy_free_lists[o] = buddy;
}
return block;
}
}
return NULL; // Out of memory
}
Coalescing on Free:
void buddy_free(void* ptr, int order) {
struct buddy_block* block = ptr;
// Try to coalesce with buddy
while (order < MAX_ORDER - 1) {
struct buddy_block* buddy = find_buddy(block, order);
if (!is_free(buddy) || buddy->order != order) {
break; // Can't coalesce
}
remove_from_free_list(buddy, order);
block = min(block, buddy); // Take lower address
order++;
}
add_to_free_list(block, order);
}
| Metric | Our Design | Why |
|---|---|---|
| Small alloc time | O(1) | Direct free list lookup |
| Large alloc time | O(log n) worst | Buddy split/coalesce |
| External fragmentation | Low | Buddy coalescing |
| Internal fragmentation | ~25% avg for small | Power-of-2 classes waste up to ~50% worst case |
| Thread scalability | Good | Per-size-class locks or TLS caches |
Problem Statement:
Design a file system optimized for a modern SSD with high write throughput and crash consistency. The file system should support POSIX-like semantics and scale to millions of files.
Requirements Clarification:
What consistency guarantee on crash? (Metadata always consistent; data since the last commit may be lost.)
Expected file size distribution? (Many small files plus some very large ones.)
Are snapshots/clones desirable? (CoW makes them nearly free, so yes.)
Single device or multi-device? (Assume a single SSD; don't preclude RAID later.)
Design: Copy-on-Write B-tree File System (inspired by Btrfs/ZFS)
Key insight: SSDs change the tradeoff landscape:
Random reads cost nearly the same as sequential reads, so scattering data (as CoW does) is cheap.
Flash requires erase-before-write, so in-place overwrites are expensive; writing new copies elsewhere matches the hardware.
Wear leveling favors spreading writes over repeatedly hammering hot blocks (such as a fixed journal region).
On-Disk Structures:
1. Superblock:
Offset 0: Magic number, version, block size
Offset 32: Root tree location (block number)
Offset 48: Free space tree location
Offset 64: Checksum tree location
Offset 80: Filesystem size, used blocks
Offset 128: Commit generation number
Two superblocks (mirrored) for reliability.
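The same layout expressed as a C struct; a sketch only, with illustrative field names rather than a real on-disk format:
struct superblock {
    uint64_t magic;            // Identifies the filesystem
    uint32_t version;
    uint32_t block_size;
    uint64_t root_tree;        // Block number of the root B-tree
    uint64_t free_space_tree;  // Block number of the free-space tree
    uint64_t checksum_tree;    // Block number of the checksum tree
    uint64_t total_blocks;
    uint64_t used_blocks;
    uint64_t generation;       // Monotonic commit counter; higher = newer
    uint8_t  checksum[32];     // Covers the whole superblock
};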
2. B-tree Nodes:
All data organized in B-trees:
struct btree_node {
uint32_t magic;
uint32_t generation; // For CoW detection
uint16_t num_items;
uint16_t level; // 0 = leaf
uint8_t checksum[32]; // Self-checksum
// Keys and values/child pointers follow
};
struct btree_key {
uint64_t objectid; // Inode number
uint8_t type; // INODE_ITEM, DIR_ITEM, EXTENT_DATA, etc.
uint64_t offset; // Interpretation depends on type
};
Key Operations:
1. File Write (Copy-on-Write):
1. Allocate new blocks for modified data
2. Write new data to new blocks
3. Update B-tree leaf with new extent pointers (CoW the leaf)
4. CoW up the tree to root
5. Atomically update root pointer in superblock
6. Old blocks become free after checkpoint
2. Directory Lookup:
1. Hash filename
2. Search directory B-tree with key (parent_inode, hash, name)
3. Return child inode number
4. Complexity: O(log n) for n directory entries
3. Crash Recovery:
1. Read both superblocks, use one with higher valid generation
2. Tree is always consistent (CoW ensures atomicity)
3. Roll forward any pending committed transactions
4. No fsck needed!
CoW provides atomic updates without journaling overhead. A transaction is committed by atomically updating the root pointer. Either the new tree is visible (committed) or the old tree is visible (rolled back). No torn writes possible.
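A sketch of that commit step, under the assumptions above (write_block(), flush(), recompute_checksum(), and the superblock locations are hypothetical):
// Commit: everything lands in NEW blocks first; the superblock update
// is the single atomic step that makes the new tree visible
void commit_transaction(struct superblock* sb, uint64_t new_root) {
    flush();                          // 1. Make all new tree blocks durable
    sb->root_tree = new_root;         // 2. Point at the new root
    sb->generation++;                 // 3. Higher generation wins at recovery
    recompute_checksum(sb);           //    Lets recovery detect a torn write
    write_block(SUPERBLOCK_ALT, sb);  // 4. Update the mirrored copies one at
    flush();                          //    a time, so at least one valid
    write_block(SUPERBLOCK_MAIN, sb); //    superblock exists at every instant
    flush();
}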
Free Space Management:
Extent-based Allocation:
struct extent {
uint64_t start_block;
uint64_t num_blocks;
};
Free space B-tree stores (start, length) extents. Allocation strategy: Best-fit or next-fit within preferred region.
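A best-fit search might look like the following sketch (in practice the free-space B-tree would be queried rather than a flat array; all names here are illustrative):
// Best-fit: the smallest free extent that still satisfies the request
struct extent* best_fit(struct extent* free_extents, int n, uint64_t need) {
    struct extent* best = NULL;
    for (int i = 0; i < n; i++) {
        if (free_extents[i].num_blocks >= need &&
            (best == NULL || free_extents[i].num_blocks < best->num_blocks)) {
            best = &free_extents[i];
        }
    }
    return best; // Caller splits the extent and updates the free-space tree
}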
Write Amplification Mitigation:
CoW rewrites the whole path from leaf to root, so batch many changes into one transaction (group commit) to amortize that path rewrite.
Use large extents and delayed allocation so data reaches the SSD in big sequential chunks its FTL handles well.
Inline tiny files in B-tree leaves rather than dedicating whole blocks to them.
Scalability Features:
| Challenge | Solution |
|---|---|
| Many small files | Inline data in B-tree leaves (< 2KB) |
| Large directories | B-tree with O(log n) lookup |
| Parallel access | Per-inode locking, concurrent B-tree |
| Large files | Extent-based allocation, sparse files |
| Requirement | Solution | Complexity |
|---|---|---|
| SSD optimization | CoW avoids random overwrites | O(log n) for updates |
| Crash consistency | Atomic root pointer update | Instant recovery |
| Scale to millions | B-tree directories | O(log n) lookup |
| POSIX semantics | Full inode/directory model | Standard API |
| Space efficiency | Inline data, extents | Minimal overhead |
Problem Statement:
Design a reader-writer lock that provides writer preference (writers don't starve) while allowing high read concurrency.
Requirements:
Many readers may hold the lock simultaneously; a writer gets exclusive access.
A waiting writer must eventually acquire the lock even under a continuous stream of readers (writer preference).
Overhead on the read path should be low, since reads dominate.
Design: Fair Reader-Writer Lock with Writer Preference
Data Structure:
struct rw_lock {
pthread_mutex_t mutex; // Protects all state
pthread_cond_t readers_cv; // Readers wait here
pthread_cond_t writers_cv; // Writers wait here
int readers_active; // Currently reading
int readers_waiting; // Waiting to read
int writers_active; // 0 or 1
int writers_waiting; // Waiting to write
};
void rw_lock_init(struct rw_lock* lock) {
pthread_mutex_init(&lock->mutex, NULL);
pthread_cond_init(&lock->readers_cv, NULL);
pthread_cond_init(&lock->writers_cv, NULL);
lock->readers_active = 0;
lock->readers_waiting = 0;
lock->writers_active = 0;
lock->writers_waiting = 0;
}
Read Lock:
void read_lock(struct rw_lock* lock) {
pthread_mutex_lock(&lock->mutex);
// Writer preference: wait if writers are waiting or active
lock->readers_waiting++;
while (lock->writers_active > 0 || lock->writers_waiting > 0) {
pthread_cond_wait(&lock->readers_cv, &lock->mutex);
}
lock->readers_waiting--;
lock->readers_active++;
pthread_mutex_unlock(&lock->mutex);
}
void read_unlock(struct rw_lock* lock) {
pthread_mutex_lock(&lock->mutex);
lock->readers_active--;
// If last reader and writers waiting, signal a writer
if (lock->readers_active == 0 && lock->writers_waiting > 0) {
pthread_cond_signal(&lock->writers_cv);
}
pthread_mutex_unlock(&lock->mutex);
}
Write Lock:
void write_lock(struct rw_lock* lock) {
pthread_mutex_lock(&lock->mutex);
lock->writers_waiting++;
while (lock->readers_active > 0 || lock->writers_active > 0) {
pthread_cond_wait(&lock->writers_cv, &lock->mutex);
}
lock->writers_waiting--;
lock->writers_active = 1;
pthread_mutex_unlock(&lock->mutex);
}
void write_unlock(struct rw_lock* lock) {
pthread_mutex_lock(&lock->mutex);
lock->writers_active = 0;
// Prefer writers: signal another writer if waiting
if (lock->writers_waiting > 0) {
pthread_cond_signal(&lock->writers_cv);
} else {
// No writers waiting: wake all readers
pthread_cond_broadcast(&lock->readers_cv);
}
pthread_mutex_unlock(&lock->mutex);
}
With pure writer preference, continuous writer arrivals could starve readers. To guarantee bounded waiting: (1) Use a ticket system ordering arrivals, or (2) After N successive writers, allow waiting readers through. Our design above does starve readers under heavy write load—acknowledge this limitation!
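Mitigation (2) needs only small changes, sketched below: two hypothetical fields added to the struct above (int writers_in_a_row, bool readers_turn) plus a tunable MAX_CONSECUTIVE_WRITERS. Readers yield to writers unless it is explicitly their turn:
void read_lock(struct rw_lock* lock) {
    pthread_mutex_lock(&lock->mutex);
    lock->readers_waiting++;
    while (lock->writers_active > 0 ||
           (lock->writers_waiting > 0 && !lock->readers_turn)) {
        pthread_cond_wait(&lock->readers_cv, &lock->mutex);
    }
    lock->readers_waiting--;
    lock->readers_active++;
    if (lock->readers_waiting == 0) {
        lock->readers_turn = false;   // Cohort admitted; writers go next
    }
    pthread_mutex_unlock(&lock->mutex);
}

void write_unlock(struct rw_lock* lock) {
    pthread_mutex_lock(&lock->mutex);
    lock->writers_active = 0;
    if (lock->writers_waiting > 0 &&
        (lock->readers_waiting == 0 ||
         ++lock->writers_in_a_row < MAX_CONSECUTIVE_WRITERS)) {
        pthread_cond_signal(&lock->writers_cv);    // Keep preferring writers...
    } else {
        lock->writers_in_a_row = 0;                // ...but every N hand-offs
        lock->readers_turn = true;                 // drain the waiting reader
        pthread_cond_broadcast(&lock->readers_cv); // cohort before continuing
    }
    pthread_mutex_unlock(&lock->mutex);
}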
Alternative: Fair Lock with Ticket System
struct fair_rw_lock {
atomic_uint64_t next_ticket; // Next ticket to issue
atomic_uint64_t serving; // Currently serving ticket
atomic_int readers_active;
bool writer_active;
// Per-ticket type: READER or WRITER
uint8_t ticket_types[MAX_PENDING];
};
Each arrival takes a ticket. Tickets are served in order. When a reader ticket is served, all consecutive reader tickets are served together. This provides FIFO fairness while maintaining read concurrency.
Lock-Free Readers (Optimistic Read):
For read-heavy workloads, optimistic reading can avoid locks entirely:
struct seqlock {
atomic_uint64_t sequence; // Odd = write in progress
// Protected data follows
};
// Reader: retry until the sequence is stable across the read
uint64_t seq1, seq2;
do {
    while ((seq1 = atomic_load(&lock->sequence)) & 1)
        ; // Odd: write in progress, spin until it finishes
    // ... copy out the protected data here ...
    seq2 = atomic_load(&lock->sequence);
} while (seq1 != seq2); // Sequence moved: data changed mid-read, retry
Readers never block or take locks—ideal for frequently-read, rarely-written data.
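The writer side is what keeps the sequence protocol honest; a minimal sketch with C11 atomics (memory-ordering details simplified, and writers assumed to be serialized by an external mutex):
// Writer: bump sequence to odd before writing, back to even after
void seq_write_begin(struct seqlock* lock) {
    atomic_fetch_add(&lock->sequence, 1); // Now odd: readers will retry
}
void seq_write_end(struct seqlock* lock) {
    atomic_fetch_add(&lock->sequence, 1); // Even again: readers may proceed
}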
Problem Statement:
Design the virtual memory subsystem for an OS. It should support demand paging, memory-mapped files, copy-on-write for fork(), and shared memory between processes.
Key Components:
Per-process address space descriptor: a list/tree of VMAs plus the page table root
Page fault handler: demand paging, permission checks, CoW resolution
File-backed and anonymous page paths: page cache pages vs. zero-filled pages
Physical page allocator and page replacement (eviction policy)
Address Space Data Structure:
// Virtual Memory Area - represents a contiguous region
struct vm_area {
uint64_t start; // Start virtual address
uint64_t end; // End virtual address
uint32_t flags; // VM_READ, VM_WRITE, VM_EXEC, VM_SHARED
struct file* mapped_file; // NULL for anonymous memory
uint64_t file_offset; // Offset in file
struct vm_area* next; // Next VMA in address space
};
// Process address space
struct mm_struct {
struct vm_area* vma_list; // List of VMAs (sorted by address)
struct rb_root vma_tree; // Red-black tree for fast lookup
pgd_t* page_table; // Top-level page table
atomic_int users; // Reference count
spinlock_t lock;
// Statistics
size_t total_vm, resident_vm, shared_vm;
};
VMA Operations:
find_vma(mm, addr): Find VMA containing address (O(log n) with tree)
insert_vma(mm, vma): Add new VMA, merge with adjacent if possible
split_vma(mm, vma, addr): Split VMA at address (for partial unmap)
Page Fault Handler:
int handle_page_fault(struct mm_struct* mm, uint64_t fault_addr,
uint32_t error_code) {
struct vm_area* vma = find_vma(mm, fault_addr);
// Check if address is valid
if (vma == NULL || fault_addr < vma->start) {
return -SIGSEGV; // Segmentation fault
}
// Check permissions
if (error_code & WRITE_FAULT) {
    if (!(vma->flags & VM_WRITE)) {
        return -SIGSEGV; // Write to a read-only mapping
    }
    // VMA is writable but the PTE is read-only: this is copy-on-write
    if (is_cow_page(mm, fault_addr)) {
        return handle_cow_fault(mm, fault_addr);
    }
}
// Page not present - need to bring it in
if (vma->mapped_file) {
return handle_file_fault(mm, vma, fault_addr);
} else {
return handle_anonymous_fault(mm, vma, fault_addr);
}
}
int handle_anonymous_fault(struct mm_struct* mm, struct vm_area* vma,
uint64_t fault_addr) {
// Allocate physical page
struct page* page = alloc_page(GFP_USER);
if (!page) {
return -ENOMEM;
}
// Zero the page (security)
memset(page_address(page), 0, PAGE_SIZE);
// Install in page table
pte_t pte = mk_pte(page, vma_prot(vma));
set_pte(mm->page_table, fault_addr, pte);
return 0;
}
Copy-on-Write for fork():
struct mm_struct* copy_mm(struct mm_struct* parent) {
struct mm_struct* child = alloc_mm();
// Copy VMA list
for (struct vm_area* vma = parent->vma_list; vma; vma = vma->next) {
struct vm_area* new_vma = copy_vma(vma);
insert_vma(child, new_vma);
}
// Mark all writable pages as COW (read-only + COW flag)
for_each_pte(parent->page_table, pte) {
if (pte_write(pte)) {
pte_clear_write(pte); // Make read-only
pte_set_cow(pte); // Mark as COW
// Increment page reference count
struct page* page = pte_page(pte);
atomic_inc(&page->refcount);
}
// Copy PTE to child (pointing to same physical page)
copy_pte(child->page_table, parent->page_table, pte_addr);
}
return child;
}
int handle_cow_fault(struct mm_struct* mm, uint64_t addr) {
pte_t* pte = lookup_pte(mm->page_table, addr);
struct page* old_page = pte_page(*pte);
if (atomic_read(&old_page->refcount) == 1) {
// We're the only user - just make writable
pte_set_write(pte);
pte_clear_cow(pte);
return 0;
}
// Allocate new page and copy
struct page* new_page = alloc_page(GFP_USER);
if (!new_page) {
    return -ENOMEM;
}
memcpy(page_address(new_page), page_address(old_page), PAGE_SIZE);
// Update PTE to point to new page
set_pte(mm->page_table, addr, mk_pte(new_page, VM_READ | VM_WRITE));
// Decrement old page refcount
put_page(old_page);
return 0;
}
For page replacement, use Clock (Second-Chance) algorithm: maintain a circular list of pages with reference bits. On eviction, clear reference bit and advance; evict first page with cleared bit. This approximates LRU with O(1) amortized cost.
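A compact sketch of the sweep (the frame table, its fields, and the hand cursor are illustrative):
// Circular sweep: every referenced page gets one second chance
struct frame { struct page* page; bool referenced; };

int clock_evict(struct frame* frames, int nframes, int* hand) {
    for (;;) {
        struct frame* f = &frames[*hand];
        int idx = *hand;
        *hand = (*hand + 1) % nframes; // Always advance the clock hand
        if (!f->referenced) {
            return idx;                // Victim: unreferenced since last pass
        }
        f->referenced = false;         // Second chance: clear and continue
    }
}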
Beyond the technical content, how you present your design matters enormously. Here are proven strategies for OS system design interviews.
Communication Strategies:
Think Out Loud: Narrate your reasoning as you go; interviewers can only credit thinking they can hear.
Draw Diagrams: Sketch data structures, queues, and data flow; a picture of an MLFQ or a B-tree path saves minutes of description.
Start Simple, Then Elaborate: Get a correct baseline working end-to-end before adding optimizations like per-CPU queues or thread caches.
Acknowledge Tradeoffs: State what your design gives up (e.g., internal fragmentation for speed) before the interviewer has to ask.
| ❌ Red Flags | ✅ Green Flags |
|---|---|
| Jumping to code immediately | Thoroughly understanding requirements first |
| Single solution without alternatives | Discussing multiple approaches |
| Ignoring edge cases | Proactively considering failure modes |
| Claiming design is perfect | Acknowledging limitations |
| Silent thinking | Continuous communication |
| Rigid design | Iterating based on feedback |
For a 45-minute design interview: 5 min requirements, 25 min core design with data structures and algorithms, 10 min deep dive on one component, 5 min tradeoffs and alternatives. Adjust based on interviewer cues.
Congratulations! You've completed the Interview Problem Types module. You now have comprehensive knowledge of scheduling, synchronization, memory, file system, and system design problems. Practice these patterns with real problems, and you'll be prepared for any OS technical interview.