Device Drivers

Level: Advanced

Duration: 90 mins

Topic: Device Drivers

Driver Bugs

When Drivers Go Wrong

Device driver bugs are among the most severe software defects possible. Unlike application bugs that crash a single program, driver bugs can crash the entire operating system, corrupt data, or compromise security. Studies consistently show that driver code contains significantly higher bug density than most other kernel code, and drivers account for a disproportionate share of kernel vulnerabilities.

Why are drivers so error-prone? They operate at the boundary between hardware and software—two domains with different failure modes, timing characteristics, and error behaviors. Drivers must handle asynchronous events, concurrent access, and hardware that doesn't always behave as documented. Understanding common driver bugs, their symptoms, and how to debug them is essential for writing reliable kernel code.

What You Will Learn

By the end of this page, you will understand the most common categories of driver bugs, how to identify and diagnose kernel crashes and hangs, debugging tools and techniques specific to driver development, race condition detection and prevention, memory safety bugs and how to prevent them, and strategies for writing robust driver code. This knowledge is essential for developing production-quality drivers and debugging system issues.

Common Bug Categories

Driver bugs fall into several major categories, each with distinct symptoms, causes, and debugging approaches. Understanding these categories helps you recognize bugs quickly and focus your debugging efforts effectively.

Driver Bug Categories and Symptoms
Category	Common Symptoms	Typical Causes	Impact Level
Memory Corruption	Random crashes, data corruption, strange behavior	Buffer overflows, use-after-free, double-free	Critical
Race Conditions	Intermittent failures, timing-dependent bugs, data corruption	Missing locks, wrong lock type, incorrect ordering	Critical
Resource Leaks	System slowdown over time, eventual failure to allocate	Missing cleanup on error paths, reference count errors	High
Deadlocks	System freeze, hung processes, watchdog triggers	Lock ordering violations, interrupt context mistakes	Critical
Hardware Assumptions	Works on some hardware, fails on others	Undocumented hardware behavior, timing assumptions	Medium-High
Context Violations	Kernel panic, sleeping in atomic context warnings	Sleeping in interrupt handler, wrong GFP flags	Critical
Error Handling Failures	Crash on error, resource leak, security vulnerability	Not checking return values, incomplete cleanup	High
Concurrency Bugs	SMP-only failures, ref count issues	Per-CPU vs global data confusion, RCU misuse	High

Bug Frequency Distribution:

Studies of Linux kernel bugs report roughly the following distribution (exact figures vary by study and kernel version):

25-30%: Memory safety bugs (buffer overflows, use-after-free)
20-25%: Concurrency bugs (races, deadlocks)
15-20%: Error handling bugs (unchecked returns, cleanup failures)
10-15%: Resource management bugs (leaks, reference counting)
10-15%: Semantic bugs (incorrect logic, API misuse)
5-10%: Other (hardware interaction, performance issues)

Notably, many bugs only manifest under specific conditions: high load, SMP systems, rare error paths, or unusual hardware configurations. This makes testing challenging—a driver may work perfectly in development but fail in production.

The Testing Problem

A study of Windows driver bugs found that 70% of crashes occurred in error handling paths that were rarely exercised. Normal testing validates the happy path; bugs lurk in the code that handles failures, unusual inputs, and edge cases. Systematic testing of error paths is essential.
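
One way to exercise error paths systematically is to fail each allocation point in turn and check that nothing leaks. The sketch below is a user-space model, not kernel code: test_alloc, test_free, toy_dev_init, and the fail_at counter are hypothetical names invented for this illustration. In a real kernel build, the fault-injection framework (for example CONFIG_FAILSLAB) would play the role of fail_at.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Hypothetical fault-injection state: fail the Nth allocation (1-based); 0 = never fail. */
static int fail_at;
static int alloc_calls;
static int live_allocs;

static void *test_alloc(size_t n)
{
    void *p;

    if (++alloc_calls == fail_at)
        return NULL;                /* injected failure */
    p = malloc(n);
    if (p)
        live_allocs++;
    return p;
}

static void test_free(void *p)
{
    if (p) {
        assert(live_allocs > 0);    /* freeing more than we allocated is a bug */
        live_allocs--;
        free(p);
    }
}

struct toy_dev {
    void *rx;
    void *tx;
    void *stats;
};

/* Three-step init; on failure, unwind in reverse order of acquisition. */
static int toy_dev_init(struct toy_dev *d)
{
    d->rx = test_alloc(64);
    if (!d->rx)
        goto err;
    d->tx = test_alloc(64);
    if (!d->tx)
        goto err_rx;
    d->stats = test_alloc(32);
    if (!d->stats)
        goto err_tx;
    return 0;

err_tx:
    test_free(d->tx);
    d->tx = NULL;
err_rx:
    test_free(d->rx);
    d->rx = NULL;
err:
    return -ENOMEM;
}

static void toy_dev_exit(struct toy_dev *d)
{
    test_free(d->stats);
    test_free(d->tx);
    test_free(d->rx);
    d->stats = d->tx = d->rx = NULL;
}

/* Run init with the nth allocation failing; return how many allocations leaked. */
static int leaks_with_failure_at(int n)
{
    struct toy_dev d = {0};

    fail_at = n;
    alloc_calls = 0;
    live_allocs = 0;
    if (toy_dev_init(&d) == 0)
        toy_dev_exit(&d);
    return live_allocs;
}
```

Calling leaks_with_failure_at(n) for every n from 0 to the number of allocation points covers the happy path and each failure path once. The same loop scales to longer init sequences without writing a separate test per path.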

Memory Safety Bugs

Memory safety bugs are the most dangerous class of driver bugs. They can cause crashes, data corruption, and security vulnerabilities. Drivers share a single address space with the rest of the kernel and are not isolated from it, so a stray pointer write can corrupt critical data structures anywhere in writable kernel memory.

Common Memory Safety Bugs:

Memory Bug Types

•Buffer Overflow — Writing past the end of an allocated buffer, corrupting adjacent memory. Stack overflows can overwrite return addresses; heap overflows corrupt allocator metadata or other objects.
•Use After Free — Accessing memory after it's been freed. Another allocation may now occupy that memory, causing data corruption or security exploits.
•Double Free — Freeing memory twice, corrupting allocator data structures. Often causes crashes on the second free or next allocation.
•Uninitialized Memory — Reading from memory that was never initialized. May contain sensitive data from previous uses or random values.
•NULL Pointer Dereference — Accessing offset from NULL pointer. In kernel, address 0 is often unmapped, causing immediate crash.
•Type Confusion — Treating memory as the wrong type, misinterpreting data layouts. Common with void pointers and union types.
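
The buffer overflow case in the list above comes down to one rule: validate the length before writing a single byte. A minimal user-space sketch of that rule follows; copy_bounded is a hypothetical helper written for this page, not a kernel or libc function.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/*
 * Copy src_len bytes of untrusted input into dst and NUL-terminate it.
 * Returns the number of bytes copied, or -E2BIG if the input plus the
 * terminator would not fit. Nothing is written when the check fails.
 */
static long copy_bounded(char *dst, size_t dst_size,
                         const char *src, size_t src_len)
{
    assert(dst != NULL && src != NULL);

    if (dst_size == 0 || src_len > dst_size - 1)
        return -E2BIG;
    memcpy(dst, src, src_len);
    dst[src_len] = '\0';
    return (long)src_len;
}
```

Rejecting oversized input, rather than silently truncating it, keeps the caller from acting on a partial command. Whether to reject or truncate is a design choice that depends on the interface.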

memory-bugs.c
Memory Bug Examples
/* Bug 1: Buffer Overflow */
static int process_user_command_buggy(struct device *dev,
                                      const char __user *cmd)
{
    char buf[64];

    /*
     * BUG: strlen() dereferences a user pointer directly, and the
     * length is unbounded, so more than 64 bytes can be copied.
     */
    if (copy_from_user(buf, cmd, strlen(cmd)))  /* WRONG */
        return -EFAULT;
    /* ... parse buf ... */
    return 0;
}

/* CORRECT: bounded copy that also NUL-terminates */
static int process_user_command(struct device *dev,
                                const char __user *cmd)
{
    char buf[64];
    long len;

    len = strncpy_from_user(buf, cmd, sizeof(buf));
    if (len < 0)
        return len;            /* -EFAULT */
    if (len == sizeof(buf))
        return -E2BIG;         /* No room left for the terminator */
    /* buf now holds a NUL-terminated string of length len */
    return 0;
}

/* Bug 2: Use After Free */
static void process_and_free_buggy(struct request *req)
{
    complete_request(req);

    /* BUG: req was freed by complete_request()! */
    pr_info("Completed request type %d\n", req->type); /* WRONG */
}

/* CORRECT: Save needed data before freeing */
static void process_and_free(struct request *req)
{
    int type = req->type;

    complete_request(req);
    pr_info("Completed request type %d\n", type);
}

/* Bug 3: Double Free */
static int init_device(struct mydev *dev)
{
    dev->buffer = kmalloc(SIZE, GFP_KERNEL);
    if (!dev->buffer)
        return -ENOMEM;

    if (init_hardware(dev)) {
        kfree(dev->buffer);  /* First free; pointer left dangling */
        return -EIO;
    }

    if (register_device(dev)) {
        kfree(dev->buffer);  /* First free; pointer left dangling */
        return -ENODEV;
    }

    return 0;
}

static void cleanup_device(struct mydev *dev)
{
    unregister_device(dev);
    kfree(dev->buffer);  /* BUG: second free if init_device() failed */
}

/* CORRECT: clear the pointer wherever it is freed; kfree(NULL) is a no-op */
static int init_device_safe(struct mydev *dev)
{
    int ret;

    dev->buffer = kmalloc(SIZE, GFP_KERNEL);
    if (!dev->buffer)
        return -ENOMEM;

    ret = init_hardware(dev) ? -EIO : 0;
    if (!ret && register_device(dev))
        ret = -ENODEV;
    if (ret) {
        kfree(dev->buffer);
        dev->buffer = NULL;  /* A later kfree() is now harmless */
    }
    return ret;
}

static void cleanup_device_safe(struct mydev *dev)
{
    unregister_device(dev);
    kfree(dev->buffer);
    dev->buffer = NULL;  /* Prevent double-free */
}

/* Bug 4: NULL Pointer Dereference */
static int use_optional_feature_buggy(struct device *dev)
{
    struct my_data *data = get_device_data(dev);

    /* BUG: data might be NULL if device not initialized! */
    data->ref_count++;  /* CRASH if NULL */
    return 0;
}

/* CORRECT: */
static int use_optional_feature(struct device *dev)
{
    struct my_data *data = get_device_data(dev);

    if (!data)
        return -ENODEV;
    data->ref_count++;
    return 0;
}

Detection with KASAN:

# Enable KASAN in kernel config
CONFIG_KASAN=y
CONFIG_KASAN_INLINE=y

# KASAN output on use-after-free:
[   12.345678] BUG: KASAN: use-after-free in mydriver_process+0x123/0x456
[   12.345679] Read of size 4 at addr ffff888012345678 by task test/1234
[   12.345680]
[   12.345681] Allocated by task 1234:
[   12.345682]  kmalloc+0x12/0x34
[   12.345683]  mydriver_init+0x56/0x78
[   12.345684]
[   12.345685] Freed by task 1234:
[   12.345686]  kfree+0x12/0x34
[   12.345687]  mydriver_process+0x100/0x456

KASAN shows exactly where the memory was allocated, where it was freed, and where the use-after-free occurred.

Use-After-Free Prevention

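Set pointers to NULL after freeing. Use reference counting for shared objects. Consider devm_* functions, which tie a resource's lifetime to the device binding so it is released automatically when the driver detaches. For complex lifetimes, use RCU, which keeps old data valid until every reader that might still hold a pointer to it has finished (one grace period).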
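
Reference counting can be sketched in user space with C11 atomics. This is a toy model under stated assumptions: toy_obj and its functions are hypothetical names, and kernel code would normally use the existing kref or refcount_t helpers rather than hand-rolled counters.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_obj {
    atomic_int refs;
    int payload;
};

static int toy_objs_freed;      /* test-only counter, single-threaded use */

/* A new object starts with one reference, owned by the creator. */
static struct toy_obj *toy_obj_new(int payload)
{
    struct toy_obj *o = malloc(sizeof(*o));

    if (!o)
        return NULL;
    atomic_init(&o->refs, 1);
    o->payload = payload;
    return o;
}

/* Take an extra reference; the caller must already hold a valid one. */
static struct toy_obj *toy_obj_get(struct toy_obj *o)
{
    int old = atomic_fetch_add_explicit(&o->refs, 1, memory_order_relaxed);

    assert(old > 0);            /* a count of 0 means the object is already dead */
    (void)old;
    return o;
}

/* Drop a reference. Returns 1 if this call freed the object, else 0. */
static int toy_obj_put(struct toy_obj *o)
{
    if (atomic_fetch_sub_explicit(&o->refs, 1, memory_order_acq_rel) != 1)
        return 0;
    toy_objs_freed++;
    free(o);
    return 1;
}
```

Every user that keeps the pointer beyond the current call takes its own reference with toy_obj_get and drops it with toy_obj_put. The object is freed exactly once, by whoever drops the last reference, so no user can free it out from under another.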

Race Conditions and Deadlocks

Concurrency bugs are particularly insidious because they may only manifest under specific timing conditions. A driver might work for years before a race condition triggers under unusual load. When they do occur, the symptoms often manifest far from the actual bug.

Race Condition Examples:

race-conditions.c
Race Condition Examples
/* Race 1: Check-then-act (TOCTOU vulnerability) */
static int allocate_if_needed_buggy(struct mydev *dev)
{
    /* BUG: Another thread could allocate between check and set! */
    if (dev->buffer == NULL) {         /* Thread A checks */
        /* Thread B also sees NULL, allocates */
        dev->buffer = kmalloc(SIZE, GFP_KERNEL);  /* Both allocate! */
        /* Memory leak: one allocation lost */
    }
    return dev->buffer ? 0 : -ENOMEM;
}

/* CORRECT: Check and assignment happen under one lock */
static int allocate_if_needed(struct mydev *dev)
{
    int ret = 0;

    mutex_lock(&dev->lock);
    if (dev->buffer == NULL) {
        dev->buffer = kmalloc(SIZE, GFP_KERNEL);
        if (!dev->buffer)
            ret = -ENOMEM;
    }
    mutex_unlock(&dev->lock);
    return ret;
}

/* Race 2: Interrupt vs process context */
static void update_status_buggy(struct mydev *dev)
{
    /* BUG: Interrupt could fire between read and write! */
    dev->status_reg |= NEW_FLAG;  /* Read-modify-write is NOT atomic */
}

/* CORRECT: Disable interrupts or use atomics */
static void update_status(struct mydev *dev)
{
    unsigned long flags;

    spin_lock_irqsave(&dev->lock, flags);
    dev->status_reg |= NEW_FLAG;
    spin_unlock_irqrestore(&dev->lock, flags);

    /* Or for simple flags: */
    set_bit(FLAG_BIT, &dev->flags);  /* Atomic bit operation */
}

/* Race 3: Missing memory barriers */
/* Producer: */
data->value = 42;
data->ready = true;  /* BUG: Compiler/CPU might reorder! */

/* Consumer: */
while (!data->ready) cpu_relax();
use(data->value);    /* Might see stale value! */

/* CORRECT: Use proper primitives */
/* Producer: */
WRITE_ONCE(data->value, 42);
smp_store_release(&data->ready, true);  /* Orders the value store before it */

/* Consumer: */
while (!smp_load_acquire(&data->ready)) cpu_relax();
use(READ_ONCE(data->value));  /* Now guaranteed to see 42 */

/* Race 4: Object lifetime during operation */
static void mydev_work_buggy(struct work_struct *work)
{
    struct mydev *dev = container_of(work, struct mydev, work);

    /* BUG: Device might be freed while we're running! */
    process_data(dev);
}

/* CORRECT: Take a reference before scheduling the work */
static void mydev_kick(struct mydev *dev)
{
    get_device(&dev->dev);
    if (!schedule_work(&dev->work))
        put_device(&dev->dev);  /* Already queued: drop the extra ref */
}

static void mydev_work(struct work_struct *work)
{
    struct mydev *dev = container_of(work, struct mydev, work);

    process_data(dev);
    put_device(&dev->dev);  /* Release at end */
}
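
The check-then-act race can also be closed without a lock by publishing the pointer with a compare-and-swap. This is a different technique from the mutex shown in the listing, sketched here in user-space C11. The names lazy_dev and lazy_get_buffer are hypothetical; in kernel code, cmpxchg() would be the rough analogue.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct lazy_dev {
    _Atomic(void *) buffer;     /* NULL until some caller installs a buffer */
};

static atomic_int lazy_allocs;  /* bookkeeping for the example only */
static atomic_int lazy_frees;

/* Return the device buffer, allocating it on first use. */
static void *lazy_get_buffer(struct lazy_dev *d, size_t size)
{
    void *cur = atomic_load_explicit(&d->buffer, memory_order_acquire);
    void *fresh;
    void *expected = NULL;

    if (cur)
        return cur;             /* fast path: already installed */

    fresh = malloc(size);
    if (!fresh)
        return NULL;
    atomic_fetch_add(&lazy_allocs, 1);

    /* Install our buffer only if the slot is still NULL. */
    if (atomic_compare_exchange_strong_explicit(&d->buffer, &expected, fresh,
                                                memory_order_acq_rel,
                                                memory_order_acquire))
        return fresh;

    /* Another caller won the race: discard ours and use theirs. */
    free(fresh);
    atomic_fetch_add(&lazy_frees, 1);
    return expected;
}
```

Two racing callers may both allocate, but only one pointer is ever published and the loser frees its copy, so nothing leaks. A mutex is simpler when the initialization has side effects that cannot be undone.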

Deadlock Scenarios:

deadlocks.c
Deadlock Examples
/* Deadlock 1: ABBA ordering violation */
/* Thread A: */
mutex_lock(&lock_A);
mutex_lock(&lock_B);    /* Waits for B */

/* Thread B: */
mutex_lock(&lock_B);
mutex_lock(&lock_A);    /* Waits for A - DEADLOCK! */

/* CORRECT: Always acquire in consistent order (A before B) */

/* Deadlock 2: Self-deadlock (kernel mutexes are not recursive) */
static void inner_function(struct mydev *dev)
{
    mutex_lock(&dev->lock);  /* DEADLOCK: Already held by the caller! */
    /* ... */
    mutex_unlock(&dev->lock);
}

static void outer_function(struct mydev *dev)
{
    mutex_lock(&dev->lock);
    inner_function(dev);  /* Calls back to locked code */
    mutex_unlock(&dev->lock);
}

/* CORRECT: Don't reacquire; provide a variant that expects the lock held */
static void __inner_function(struct mydev *dev)
{
    lockdep_assert_held(&dev->lock);  /* Documents and checks the rule */
    /* ... */
}

/* Deadlock 3: Lock held across sleeping operation */
static ssize_t read_with_spinlock_buggy(struct mydev *dev,
                                        char __user *buf, size_t len)
{
    spin_lock(&dev->lock);

    /* BUG: copy_to_user can sleep (page fault)! */
    copy_to_user(buf, dev->data, len);  /* DEADLOCK risk */

    spin_unlock(&dev->lock);
    return len;
}

/* CORRECT: Copy to local buffer first */
static ssize_t read_with_spinlock(struct mydev *dev,
                                  char __user *buf, size_t len)
{
    /* MYDEV_DATA_SIZE is a hypothetical constant, assumed small enough for the stack */
    char local_buf[MYDEV_DATA_SIZE];

    if (len > sizeof(local_buf))
        len = sizeof(local_buf);

    spin_lock(&dev->lock);
    memcpy(local_buf, dev->data, len);
    spin_unlock(&dev->lock);

    if (copy_to_user(buf, local_buf, len))
        return -EFAULT;
    return len;
}

/* Deadlock 4: Interrupt handler vs process lock */
static irqreturn_t my_isr(int irq, void *data)
{
    struct mydev *dev = data;

    spin_lock(&dev->lock);  /* Works: already in interrupt context */
    /* ... */
    spin_unlock(&dev->lock);
    return IRQ_HANDLED;
}

static void process_func_buggy(struct mydev *dev)
{
    spin_lock(&dev->lock);  /* Acquires lock */
    /* Interrupt fires here on this CPU - ISR tries to acquire same lock */
    /* DEADLOCK: ISR spins forever waiting for us! */
    spin_unlock(&dev->lock);
}

/* CORRECT: Disable IRQ when taking lock in process context */
static void process_func(struct mydev *dev)
{
    unsigned long flags;

    spin_lock_irqsave(&dev->lock, flags);
    /* Local interrupts are disabled, so the ISR can't run on this CPU */
    spin_unlock_irqrestore(&dev->lock, flags);
}

lockdep is Your Friend

Enable CONFIG_PROVE_LOCKING during development. lockdep tracks every lock acquisition and immediately warns on ordering violations, even if no actual deadlock occurred. It finds deadlock potential before the bug manifests. Always run development with lockdep enabled.
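
To make the idea concrete, here is a toy user-space model of order tracking. It is only a sketch of the concept: toy_acquire and toy_release are hypothetical names, and real lockdep is far more thorough (it follows transitive dependency chains and tracks interrupt state, for example). The model records which lock classes have been taken while others were held and flags the first acquisition that inverts a recorded order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_CLASSES 8
#define MAX_HELD    8

/* order[a][b] is true once class b has been acquired while a was held. */
static bool order[MAX_CLASSES][MAX_CLASSES];
static int held[MAX_HELD];
static int nheld;

/*
 * Record acquisition of lock class id. Returns false if it re-acquires a
 * held lock or inverts a previously seen order (a potential ABBA deadlock),
 * even though nothing actually deadlocks in this single-threaded model.
 */
static bool toy_acquire(int id)
{
    bool ok = true;

    assert(id >= 0 && id < MAX_CLASSES && nheld < MAX_HELD);
    for (int i = 0; i < nheld; i++) {
        int h = held[i];

        if (h == id || order[id][h])
            ok = false;             /* self-deadlock or inverted order */
        if (h != id)
            order[h][id] = true;    /* remember: id taken while h held */
    }
    held[nheld++] = id;
    return ok;
}

/* Remove lock class id from the held list (no-op if it is not held). */
static void toy_release(int id)
{
    for (int i = 0; i < nheld; i++) {
        if (held[i] == id) {
            memmove(&held[i], &held[i + 1],
                    (size_t)(nheld - i - 1) * sizeof(held[0]));
            nheld--;
            return;
        }
    }
}
```

Taking A then B on one code path and B then A on another is reported on the second path, even when both run on the same thread and never overlap. That is why an order checker can find a deadlock long before the unlucky timing occurs.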

Debugging Kernel Crashes

When a driver bug causes a kernel crash (Oops, panic, or BUG), the kernel prints diagnostic information to the console and logs. Learning to read and interpret this information is crucial for debugging.

Types of Kernel Crashes:

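Oops: Usually non-fatal; the kernel kills the offending process and tries to continue (unless panic_on_oops is set or the fault happened in interrupt context)
Panic: Fatal error, system halts immediately
BUG(): Developer assertion failure, triggers Oops
WARN(): Non-fatal warning with stack trace
Hung Task: Task stuck in uninterruptible sleep past the hung-task timeout (120 seconds by default); the detector logs its stack trace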

oops-example.txt

Kernel Oops Example

[  123.456789] BUG: unable to handle page fault for address: 0000000000000010
[  123.456790] #PF: supervisor read access in kernel mode
[  123.456791] #PF: error_code(0x0000) - not-present page
[  123.456792] PGD 0 P4D 0
[  123.456793] Oops: 0000 [#1] PREEMPT SMP KASAN
[  123.456794] CPU: 2 PID: 1234 Comm: test_program Tainted: G        W   E      5.10.0
[  123.456795] Hardware name: QEMU Standard PC
[  123.456796] RIP: 0010:mydriver_process+0x123/0x456 [mydriver]
[  123.456797] Code: 48 89 f3 48 8b 43 10 48 85 c0 74 20 48 8b 00 <48> 8b 50 10 48 85 d2 74 0e
[  123.456798] RSP: 0018:ffffc9000123def0 EFLAGS: 00010286
[  123.456799] RAX: 0000000000000000 RBX: ffff888012345678 RCX: 0000000000000001
[  123.456800] RDX: 0000000000000000 RSI: ffff888023456789 RDI: ffff888012345678
[  123.456801] RBP: ffffc9000123df30 R08: 0000000000000000 R09: 0000000000000001
[  123.456802] ...
[  123.456803] Call Trace:
[  123.456804]  mydriver_ioctl+0x78/0x1a0 [mydriver]
[  123.456805]  __x64_sys_ioctl+0x91/0xc0
[  123.456806]  do_syscall_64+0x33/0x40
[  123.456807]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  123.456808] Modules linked in: mydriver(OE) other_module...

Reading the Oops:

Fault Address (0x10): Near-zero address suggests NULL pointer dereference plus offset. A field at offset 0x10 in a NULL structure pointer.
RIP (Instruction Pointer): mydriver_process+0x123 — crash at offset 0x123 in mydriver_process function.
Register Values: RAX=0 is the NULL pointer; we tried to read offset 0x10 from it.
Call Trace: Shows how we got here: ioctl → mydriver_ioctl → mydriver_process.
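Taint Flags: 'W' means a warning was issued earlier; 'E' means an unsigned module was loaded. The (OE) after mydriver in the module list marks it as out-of-tree and unsigned, so it is the first suspect.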
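
The link between a small fault address and a structure field can be checked with offsetof. The layout below is hypothetical (toy_request is not the driver's real structure); it only shows why reading a field through a NULL base pointer faults at the field's offset.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; fixed-width fields keep the offsets stable across common ABIs. */
struct toy_request {
    uint64_t id;        /* offset 0x00 */
    uint64_t flags;     /* offset 0x08 */
    uint64_t length;    /* offset 0x10 */
};

/* Address the CPU would access when reading the length field through base. */
static uintptr_t field_address(uintptr_t base)
{
    return base + offsetof(struct toy_request, length);
}
```

With a NULL base, field_address(0) is 0x10, which matches the shape of the fault address in the Oops. For a real module built with debug info, a tool such as pahole can print the actual structure layout, so the faulting offset can be matched to a field name.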

Using addr2line:

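# The Oops gives a symbol-relative offset (mydriver_process+0x123), not a
# file address, so plain addr2line needs the symbol's address added first.
# The kernel tree's faddr2line script accepts the symbol+offset form directly:
scripts/faddr2line mydriver.ko mydriver_process+0x123/0x456
# Prints the source file and line for that offset

# Or with GDB
gdb mydriver.ko
(gdb) list *(mydriver_process+0x123)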

Kernel Debugging Tools
Tool	Purpose	Usage
dmesg	View kernel log messages	dmesg | tail -50
addr2line	Convert addresses to source lines	addr2line -e module.ko <address>
objdump	Disassemble to see exact instruction	objdump -dS module.ko > module.asm
gdb	Debug kernel/modules	gdb vmlinux /proc/kcore
crash	Analyze kernel memory dumps	crash vmlinux vmcore
ftrace	Trace function calls live	echo function > /sys/kernel/debug/tracing/current_tracer
kgdb	Source-level kernel debugging	Requires serial connection
perf	Performance analysis and sampling	perf record -g myprogram

Enable Debug Info

Build your module with CONFIG_DEBUG_INFO and -g CFLAGS. Without debug symbols, you only get addresses and offsets. With debug info, addr2line and gdb give exact source lines, making debugging vastly easier.

Dynamic Debug and Tracing

Sometimes you need to observe driver behavior without reproducing crashes. Dynamic debug lets you enable/disable debug prints at runtime. Tracing with ftrace provides detailed function-level visibility into kernel execution.

Dynamic Debug (pr_debug/dev_dbg):

When you use pr_debug() or dev_dbg() in driver code, and CONFIG_DYNAMIC_DEBUG is enabled, these messages are compiled in but disabled by default. You can enable them at runtime:

dynamic-debug.sh
Dynamic Debug Usage
# View available debug messages
cat /sys/kernel/debug/dynamic_debug/control | grep mydriver

# Enable all debug messages in a file
echo 'file mydriver.c +p' > /sys/kernel/debug/dynamic_debug/control

# Enable specific function
echo 'func mydriver_process +p' > /sys/kernel/debug/dynamic_debug/control

# Enable with line numbers and function names
echo 'file mydriver.c +pfl' > /sys/kernel/debug/dynamic_debug/control

# Enable by module name
echo 'module mydriver +p' > /sys/kernel/debug/dynamic_debug/control

# Disable all messages for the module
echo 'module mydriver -p' > /sys/kernel/debug/dynamic_debug/control

# In driver code (C, not shell):
#include <linux/device.h>   /* dev_dbg(), dev_info() */

static void my_function(struct device *dev, int value)
{
    dev_dbg(dev, "Entering function with value=%d\n", value);
    /* This prints nothing by default, enable via dynamic debug */

    /* Always logged (not controlled by dynamic debug): */
    dev_info(dev, "Processing value %d\n", value);
}

ftrace.sh
Function Tracing with ftrace
# Go to the tracing control directory (also available at /sys/kernel/tracing)
cd /sys/kernel/debug/tracing
 
# See available tracers
cat available_tracers
# nop function function_graph wakeup...
 
# Enable function tracer
echo function > current_tracer
 
# Filter to specific module functions
echo ':mod:mydriver' > set_ftrace_filter
 
# Or specific functions
echo 'mydriver_open mydriver_read mydriver_ioctl' > set_ftrace_filter
 
# Enable tracing
echo 1 > tracing_on
 
# Perform operations that use the driver
./test_program
 
# Stop and view trace
echo 0 > tracing_on
cat trace
 
# Example output:
#           test_program-1234  [001] ....   123.456: mydriver_open <-chrdev_open
#           test_program-1234  [001] ....   123.457: mydriver_ioctl <-vfs_ioctl
#           test_program-1234  [001] ....   123.458: mydriver_process <-mydriver_ioctl
 
# Function graph tracer shows call hierarchy
echo function_graph > current_tracer
echo 1 > tracing_on
 
# Output shows indented call tree:
#  0)               |  mydriver_ioctl() {
#  0)   0.123 us    |    get_device();
#  0)               |    mydriver_process() {
#  0)   0.456 us    |      validate_input();
#  0)   1.234 us    |    }
#  0)   2.345 us    |  }

Trace Events

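For more structured tracing, use kernel trace events; for quick ad-hoc debugging, trace_printk() writes into the same trace buffer. Trace events are tracepoints defined in the kernel source (drivers can define their own with TRACE_EVENT()) and provide standardized tracing points. They are typically cheaper than tracing every function and can be filtered and processed by tools like perf.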

Hardware Interaction Bugs

Some of the most frustrating driver bugs stem from incorrect assumptions about hardware behavior. Hardware doesn't always behave as documented, and different revisions may have different quirks.

Common Hardware Interaction Bugs:

Hardware Bug Categories

•Timing Violations — Not waiting long enough for hardware operations. Some registers need delays before/after access. Missing delays cause intermittent failures depending on CPU speed.
•Register Access Ordering — CPU and compiler may reorder memory operations. Hardware often requires specific access order. Missing memory barriers cause random failures.
•Undocumented Behavior — Hardware does something not in the datasheet. Often discovered through reverse engineering or vendor hints. Must handle gracefully.
•Power State Transitions — Not waiting for power-up stabilization. Accessing registers before device fully powered causes undefined behavior.
•Interrupt Handling — Not clearing interrupt cause properly. Stuck interrupts (IRQ storm) or lost interrupts. Edge vs level triggering confusion.
•DMA Coherence — Not flushing caches before/after DMA. CPU sees stale data or device sees stale data. Manifests as data corruption.

hardware-bugs.c
Hardware Interaction Patterns
/* Timing: Wait for hardware operation */
static int reset_device_buggy(struct mydev *dev)
{
    /* Write reset command */
    writel(RESET_CMD, dev->regs + CTRL_REG);

    /* BUG: Checking status immediately - reset takes time! */
    if (readl(dev->regs + STATUS) & RESET_DONE)  /* WRONG */
        return 0;
    return -EIO;
}

/* CORRECT: Wait with timeout */
static int reset_device_loop(struct mydev *dev)
{
    unsigned long deadline = jiffies + msecs_to_jiffies(1000);  /* 1 second */

    writel(RESET_CMD, dev->regs + CTRL_REG);
    while (time_before(jiffies, deadline)) {
        if (readl(dev->regs + STATUS) & RESET_DONE)
            return 0;
        usleep_range(1000, 2000);
    }
    /* Final check in case we slept past the deadline */
    return (readl(dev->regs + STATUS) & RESET_DONE) ? 0 : -ETIMEDOUT;
}

/* BETTER: Use readl_poll_timeout helper (linux/iopoll.h) */
static int reset_device(struct mydev *dev)
{
    u32 val;

    writel(RESET_CMD, dev->regs + CTRL_REG);
    return readl_poll_timeout(dev->regs + STATUS, val,
                              val & RESET_DONE,
                              1000,     /* poll interval us */
                              1000000); /* timeout us */
}

/* Register ordering with memory barriers */
static void start_dma_buggy(struct mydev *dev, dma_addr_t addr)
{
    /* Set up DMA address */
    writel(lower_32_bits(addr), dev->regs + DMA_ADDR_LO);
    writel(upper_32_bits(addr), dev->regs + DMA_ADDR_HI);

    /* Mark a descriptor in coherent DMA memory as ready.
     * dev->desc and DESC_READY are hypothetical, for illustration. */
    dev->desc->ctrl = cpu_to_le32(DESC_READY);

    /* BUG: a relaxed MMIO write is not ordered after the memory store,
     * so hardware might see START before the descriptor is visible! */
    writel_relaxed(DMA_START, dev->regs + DMA_CTRL);  /* WRONG */
}

/* CORRECT: Ensure writes complete in order */
static void start_dma(struct mydev *dev, dma_addr_t addr)
{
    writel(lower_32_bits(addr), dev->regs + DMA_ADDR_LO);
    writel(upper_32_bits(addr), dev->regs + DMA_ADDR_HI);
    dev->desc->ctrl = cpu_to_le32(DESC_READY);
    wmb();  /* Write memory barrier: descriptor before doorbell */
    writel_relaxed(DMA_START, dev->regs + DMA_CTRL);

    /* Note: plain writel() is already ordered after earlier memory
     * writes and after other MMIO writes to the same device, so using
     * writel() for the final write makes the explicit wmb() unnecessary. */
}

/* Interrupt acknowledgment */
static irqreturn_t mydev_isr(int irq, void *data)
{
    struct mydev *dev = data;
    u32 status;

    status = readl(dev->regs + INT_STATUS);
    if (!status)
        return IRQ_NONE;

    /* Clear interrupt BEFORE processing
     * (for level-triggered interrupts) */
    writel(status, dev->regs + INT_CLEAR);

    /* Ensure clear completes before returning
     * (some hardware needs this) */
    readl(dev->regs + INT_CLEAR);  /* Flush posted write */

    /* Now process the interrupt */
    handle_interrupt(dev, status);

    return IRQ_HANDLED;
}

Posted Writes

PCI and many buses use posted writes—the CPU write returns before data reaches the device. This can cause races: you write to clear an interrupt, return from handler, but the clear hasn't reached hardware yet, triggering a spurious second interrupt. A read from the device (any register) flushes posted writes.

Prevention Strategies

The best debugging is not needing to debug. Following defensive coding practices and using kernel infrastructure correctly prevents entire categories of bugs.

Prevention Best Practices:

Bug Prevention Strategies
Strategy	Prevents	Implementation
Use devm_* functions	Resource leaks, cleanup bugs	devm_kzalloc, devm_request_irq, etc.
Enable sanitizers	Memory corruption, uninitialized reads, undefined behavior	CONFIG_KASAN, CONFIG_KMSAN, CONFIG_UBSAN
Enable lockdep	Deadlocks	CONFIG_PROVE_LOCKING
Use helpers/macros	Off-by-one, overflow	array_size(), struct_size(), ARRAY_SIZE()
Static analysis	Address-space mix-ups, missing error checks, lock imbalances	sparse, smatch, coccinelle
Consistent patterns	Ad-hoc errors	Use established driver patterns
Defensive checks	NULL deref, bad input	WARN_ON (BUG_ON only when there is no safe way to continue), parameter validation
Annotation	Address-space mix-ups, ignored return values	__user, __iomem, __must_check

defensive-coding.c
Defensive Coding Examples
/* Use size macros to prevent overflow */
struct my_device {
    struct dev_info info;
    struct entry entries[];  /* Flexible array */
};

struct my_device *dev;

/* WRONG: Manual calculation can overflow */
size_t size = sizeof(struct my_device) + count * sizeof(struct entry);

/* CORRECT: Use struct_size helper (saturates at SIZE_MAX on overflow) */
size_t size = struct_size(dev, entries, count);

/* WRONG: Array size confusion */
void process_all_buggy(struct entry *entries)
{
    for (int i = 0; i < sizeof(entries); i++)  /* WRONG: pointer size! */
        process(&entries[i]);
}

/* CORRECT: Pass the count; ARRAY_SIZE() only works on true arrays */
void process_all(struct entry *entries, size_t count)
{
    for (size_t i = 0; i < count; i++)
        process(&entries[i]);
}

static struct entry table[16];
process_all(table, ARRAY_SIZE(table));

/* Defensive assertions */
static int configure_device(struct mydev *dev, struct config *cfg)
{
    /* Check invariants */
    BUG_ON(dev == NULL);  /* Kernel BUG if violated - fatal, use sparingly */

    if (WARN_ON(cfg == NULL))  /* Print warning with stack trace, then bail out */
        return -EINVAL;

    /* Validate input ranges */
    if (cfg->speed > MAX_SPEED || cfg->speed < MIN_SPEED) {
        dev_err(dev->dev, "Invalid speed %u\n", cfg->speed);
        return -EINVAL;
    }

    /* Check device state */
    if (dev->state != STATE_READY) {
        dev_warn(dev->dev, "Device not ready for configuration\n");
        return -EBUSY;
    }

    /* All checks passed, proceed */
    return do_configure(dev, cfg);
}

/* Annotate for static analysis */
static int __must_check read_user_data(const void __user *src, void *dst, size_t n)
{
    /* __must_check: caller must check return value */
    if (copy_from_user(dst, src, n))
        return -EFAULT;
    return 0;
}

/* Use this: */
int ret = read_user_data(user_ptr, kernel_buf, len);
if (ret)  /* Compiler warns if you forget this check */
    return ret;
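
To show what an overflow-checked size helper guards against, here is a user-space sketch in the spirit of struct_size(). It is an illustration, not the kernel's implementation: toy_table and toy_table_size are hypothetical names, and the code assumes the GCC/Clang overflow builtins are available.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_entry {
    uint32_t key;
    uint32_t value;
};

struct toy_table {
    size_t count;
    struct toy_entry entries[];     /* flexible array member */
};

/*
 * Compute sizeof(header) + count * sizeof(entry). On overflow, return
 * SIZE_MAX so that a following allocation fails instead of silently
 * succeeding with a buffer that is too small.
 */
static size_t toy_table_size(size_t count)
{
    size_t bytes;

    if (__builtin_mul_overflow(count, sizeof(struct toy_entry), &bytes) ||
        __builtin_add_overflow(bytes, sizeof(struct toy_table), &bytes))
        return SIZE_MAX;
    return bytes;
}
```

With an attacker-controlled count, the unchecked arithmetic can wrap around to a small number, and the code then writes count entries into an undersized buffer. Saturating to SIZE_MAX turns that into an ordinary allocation failure.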

Defense in Depth

Apply multiple prevention strategies simultaneously. Static analysis catches some bugs, sanitizers catch others, code review finds yet others. No single technique catches everything. Use them all, especially during development. Sanitizers such as KASAN slow the kernel noticeably, but that cost in development builds is minor compared to debugging crashes in production.

Summary: Driver Bugs

We've explored the landscape of driver bugs, which are among the most severe software defects. Let's consolidate the key takeaways:

Key Takeaways

•Driver bugs are uniquely severe — They can crash the system, corrupt data, and compromise security, unlike application bugs that affect only one process.
•Memory safety bugs dominate — Buffer overflows, use-after-free, and double-free are the most common and dangerous categories.
•Concurrency bugs are subtle — Race conditions and deadlocks often require specific timing to manifest, making them hard to find but devastating when they occur.
•Learn to read Oops output — Understanding crash dumps enables rapid identification of bug location and cause.
•Dynamic debug enables visibility — pr_debug and ftrace provide runtime insight without recompiling.
•Hardware interaction requires care — Timing, ordering, and undocumented behavior cause many hard-to-debug issues.
•Prevention beats debugging — Sanitizers, lockdep, static analysis, and defensive coding prevent bugs before they occur.

Module Complete:

You've now completed the comprehensive study of device drivers—from architecture and interfaces through development, loading, and debugging. This knowledge equips you to understand, develop, maintain, and troubleshoot the critical software layer that connects operating systems to hardware.


5 / 5

Loading learning content...

Operating SystemsDevice Drivers

Device Drivers

LevelAdvanced

Duration90 mins

TopicDevice Drivers

5 / 5

Driver Bugs

When Drivers Go Wrong

What You Will Learn

Common Bug Categories

Driver Bug Categories and Symptoms
Category	Common Symptoms	Typical Causes	Impact Level
Memory Corruption	Random crashes, data corruption, strange behavior	Buffer overflows, use-after-free, double-free	Critical
Race Conditions	Intermittent failures, timing-dependent bugs, data corruption	Missing locks, wrong lock type, incorrect ordering	Critical
Resource Leaks	System slowdown over time, eventual failure to allocate	Missing cleanup on error paths, reference count errors	High
Deadlocks	System freeze, hung processes, watchdog triggers	Lock ordering violations, interrupt context mistakes	Critical
Hardware Assumptions	Works on some hardware, fails on others	Undocumented hardware behavior, timing assumptions	Medium-High
Context Violations	Kernel panic, sleeping in atomic context warnings	Sleeping in interrupt handler, wrong GFP flags	Critical
Error Handling Failures	Crash on error, resource leak, security vulnerability	Not checking return values, incomplete cleanup	High
Concurrency Bugs	SMP-only failures, ref count issues	Per-CPU vs global data confusion, RCU misuse	High

Bug Frequency Distribution:

Research on Linux kernel bugs shows:

25-30%: Memory safety bugs (buffer overflows, use-after-free)
20-25%: Concurrency bugs (races, deadlocks)
15-20%: Error handling bugs (unchecked returns, cleanup failures)
10-15%: Resource management bugs (leaks, reference counting)
10-15%: Semantic bugs (incorrect logic, API misuse)
5-10%: Other (hardware interaction, performance issues)

The Testing Problem

Memory Safety Bugs

Common Memory Safety Bugs:

Memory Bug Types

•Buffer Overflow — Writing past the end of an allocated buffer, corrupting adjacent memory. Stack overflows can overwrite return addresses; heap overflows corrupt allocator metadata or other objects.
•Use After Free — Accessing memory after it's been freed. Another allocation may now occupy that memory, causing data corruption or security exploits.
•Double Free — Freeing memory twice, corrupting allocator data structures. Often causes crashes on the second free or next allocation.
•Uninitialized Memory — Reading from memory that was never initialized. May contain sensitive data from previous uses or random values.
•NULL Pointer Dereference — Accessing offset from NULL pointer. In kernel, address 0 is often unmapped, causing immediate crash.
•Type Confusion — Treating memory as the wrong type, misinterpreting data layouts. Common with void pointers and union types.

memory-bugs.c
Memory Bug Examples
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
/* Bug 1: Buffer Overflow */
static void process_user_command(struct device *dev, 
                                  const char __user *cmd)
{
    char buf[64];
    
    /* BUG: User could provide more than 64 bytes! */
    copy_from_user(buf, cmd, strlen(cmd));  /* WRONG */
    
    /* CORRECT: */
    if (copy_from_user(buf, cmd, min(sizeof(buf) - 1, strlen(cmd))))
        return -EFAULT;
    buf[sizeof(buf) - 1] = '\0';
}
 
/* Bug 2: Use After Free */
static void process_and_free(struct request *req)
{
    struct mydev_data *data = req->private_data;
    
    complete_request(req);
    
    /* BUG: req was freed by complete_request()! */
    pr_info("Completed request type %d\n", req->type); /* WRONG */
    
    /* CORRECT: Save needed data before freeing */
    int type = req->type;
    complete_request(req);
    pr_info("Completed request type %d\n", type);
}
 
/* Bug 3: Double Free */
static int init_device(struct mydev *dev)
{
    dev->buffer = kmalloc(SIZE, GFP_KERNEL);
    if (!dev->buffer)
        return -ENOMEM;
    
    if (init_hardware(dev)) {
        kfree(dev->buffer);
        return -EIO;
    }
    
    if (register_device(dev)) {
        kfree(dev->buffer);  /* First free */
        return -ENODEV;
    }
    
    return 0;
}
 
static void cleanup_device(struct mydev *dev)
{
    unregister_device(dev);
    kfree(dev->buffer);  /* BUG: May already be freed! */
}
 
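/* CORRECT: Set the pointer to NULL after every kfree(), including the
 * error paths in init_device(); kfree(NULL) is a no-op */
static void cleanup_device_safe(struct mydev *dev)
{
    unregister_device(dev);
    kfree(dev->buffer);
    dev->buffer = NULL;  /* A later kfree() of this pointer is harmless */
}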
 
/* Bug 4: NULL Pointer Dereference */
static int use_optional_feature(struct device *dev)
{
    struct my_data *data = get_device_data(dev);

    /* BUG: data might be NULL if device not initialized! */
    data->ref_count++;  /* CRASH if NULL */

    /* CORRECT: */
    if (!data)
        return -ENODEV;
    data->ref_count++;
    return 0;
}
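
The double-free in Bug 3 comes from two code paths both believing they own the buffer. A minimal user-space sketch of the single-unwind-path idiom follows, with `malloc`/`free` standing in for `kmalloc`/`kfree`. The `toy_dev` type and the `fail_at` parameter are assumptions made up for this illustration.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical device with two heap buffers. */
struct toy_dev {
    char *rx_buf;
    char *tx_buf;
};

/*
 * Initialize the device. fail_at simulates a failure at a given step
 * (0 = no failure). On error, every buffer is freed in exactly one
 * place and its pointer is reset to NULL.
 */
int toy_dev_init(struct toy_dev *dev, int fail_at)
{
    int ret;

    dev->rx_buf = NULL;
    dev->tx_buf = NULL;

    dev->rx_buf = malloc(64);
    if (dev->rx_buf == NULL)
        return -ENOMEM;

    dev->tx_buf = malloc(64);
    if (dev->tx_buf == NULL || fail_at == 1) {
        ret = -ENOMEM;
        goto err_free;
    }

    if (fail_at == 2) {          /* simulated hardware init failure */
        ret = -EIO;
        goto err_free;
    }
    return 0;

err_free:
    free(dev->tx_buf);           /* free(NULL) is a no-op */
    dev->tx_buf = NULL;
    free(dev->rx_buf);
    dev->rx_buf = NULL;
    return ret;
}

/* Safe to call after a failed init and safe to call twice. */
void toy_dev_cleanup(struct toy_dev *dev)
{
    free(dev->tx_buf);
    dev->tx_buf = NULL;
    free(dev->rx_buf);
    dev->rx_buf = NULL;
}
```

Because every free is followed by a NULL assignment, calling `toy_dev_cleanup()` after a failed `toy_dev_init()` cannot free the same pointer twice.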

Detection with KASAN:

# Enable KASAN in kernel config
CONFIG_KASAN=y
CONFIG_KASAN_INLINE=y

# KASAN output on use-after-free:
[   12.345678] BUG: KASAN: use-after-free in mydriver_process+0x123/0x456
[   12.345679] Read of size 4 at addr ffff888012345678 by task test/1234
[   12.345680]
[   12.345681] Allocated by task 1234:
[   12.345682]  kmalloc+0x12/0x34
[   12.345683]  mydriver_init+0x56/0x78
[   12.345684]
[   12.345685] Freed by task 1234:
[   12.345686]  kfree+0x12/0x34
[   12.345687]  mydriver_process+0x100/0x456

KASAN shows exactly where the memory was allocated, where it was freed, and where the use-after-free occurred.

Use-After-Free Prevention
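
One widely used way to prevent use-after-free is reference counting: the object is freed only when the last holder drops its reference. The following is a user-space sketch using C11 atomics, loosely in the spirit of the kernel's kref. The `toy_req` type and its functions are hypothetical, not a kernel API.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical request object with an embedded reference count. */
struct toy_req {
    atomic_int refs;
    int type;
};

/* Allocate a request holding one reference for the caller. */
struct toy_req *toy_req_alloc(int type)
{
    struct toy_req *req = malloc(sizeof(*req));

    if (req == NULL)
        return NULL;
    atomic_init(&req->refs, 1);
    req->type = type;
    return req;
}

/* Take an additional reference; the caller must already hold one. */
void toy_req_get(struct toy_req *req)
{
    atomic_fetch_add(&req->refs, 1);
}

/* Drop a reference. Returns 1 if this call freed the object, else 0. */
int toy_req_put(struct toy_req *req)
{
    if (atomic_fetch_sub(&req->refs, 1) == 1) {
        free(req);
        return 1;
    }
    return 0;
}
```

Any code path that needs `req` after handing it to someone else takes its own reference first with `toy_req_get()` and releases it with `toy_req_put()` when done.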

Race Conditions and Deadlocks

Race Condition Examples:

race-conditions.c
Race Condition Examples
/* Race 1: Check-then-act (TOCTOU vulnerability) */
static int allocate_if_needed(struct mydev *dev)
{
    int ret = 0;

    /* BUG: Another thread could allocate between check and set! */
    if (dev->buffer == NULL) {         /* Thread A checks */
        /* Thread B also sees NULL, allocates */
        dev->buffer = kmalloc(SIZE, GFP_KERNEL);  /* Both allocate! */
        /* Memory leak: one allocation lost */
    }

    /* CORRECT: Check and allocate under the same lock */
    mutex_lock(&dev->lock);
    if (dev->buffer == NULL) {
        dev->buffer = kmalloc(SIZE, GFP_KERNEL);
        if (!dev->buffer)
            ret = -ENOMEM;
    }
    mutex_unlock(&dev->lock);
    return ret;
}
 
/* Race 2: Interrupt vs process context */
static void update_status(struct mydev *dev)
{
    /* BUG: Interrupt could fire between read and write! */
    dev->status_reg |= NEW_FLAG;  /* Read-modify-write is NOT atomic */
    
    /* CORRECT: Disable interrupts or use atomics */
    unsigned long flags;
    spin_lock_irqsave(&dev->lock, flags);
    dev->status_reg |= NEW_FLAG;
    spin_unlock_irqrestore(&dev->lock, flags);
    
    /* Or for simple flags: */
    set_bit(FLAG_BIT, &dev->flags);  /* Atomic bit operation */
}
 
/* Race 3: Missing memory barriers */
/* Producer: */
data->value = 42;
data->ready = true;  /* BUG: Compiler/CPU might reorder! */
 
/* Consumer: */
while (!data->ready) cpu_relax();
use(data->value);    /* Might see stale value! */
 
/* CORRECT: Use proper primitives */
/* Producer: */
WRITE_ONCE(data->value, 42);
smp_store_release(&data->ready, true);  /* Barrier before */
 
/* Consumer: */
while (!smp_load_acquire(&data->ready)) cpu_relax();
use(READ_ONCE(data->value));  /* Now guaranteed to see 42 */
 
/* Race 4: Object lifetime during operation */
static void mydev_work(struct work_struct *work)
{
    struct mydev *dev = container_of(work, struct mydev, work);

    /* BUG: Device might be freed while we're running! */
    process_data(dev);
}

/* CORRECT: Take a reference before scheduling the work */
static void mydev_queue_work(struct mydev *dev)
{
    get_device(&dev->dev);
    if (!schedule_work(&dev->work))
        put_device(&dev->dev);  /* Already queued: drop the extra reference */
}

static void mydev_work_safe(struct work_struct *work)
{
    struct mydev *dev = container_of(work, struct mydev, work);

    process_data(dev);
    put_device(&dev->dev);  /* Release at end */
}
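
Race 1 above is fixed with a mutex. An alternative for the narrow "initialize once" case is an atomic compare-and-swap, sketched here in user space with C11 atomics. This is a different technique from the locking fix in the listing, and `lazy_dev` and `lazy_buffer_get` are hypothetical names.

```c
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical device whose buffer is allocated on first use. */
struct lazy_dev {
    _Atomic(char *) buffer;
};

/*
 * Return the device buffer, allocating it on first use. If two callers
 * race, exactly one allocation is published and the loser frees its own
 * copy, so nothing leaks.
 */
char *lazy_buffer_get(struct lazy_dev *dev, size_t size)
{
    char *cur = atomic_load(&dev->buffer);
    char *fresh;

    if (cur != NULL)
        return cur;

    fresh = malloc(size);
    if (fresh == NULL)
        return NULL;

    /* Publish only if the slot is still NULL; otherwise cur is updated
     * to the pointer that another caller installed first. */
    cur = NULL;
    if (atomic_compare_exchange_strong(&dev->buffer, &cur, fresh))
        return fresh;

    free(fresh);
    return cur;
}
```

The trade-off is that a losing caller performs a wasted allocation, which is usually acceptable for rare one-time setup but not for hot paths.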

Deadlock Scenarios:

deadlocks.c
Deadlock Examples
/* Deadlock 1: ABBA ordering violation */
/* Thread A: */
mutex_lock(&lock_A);
mutex_lock(&lock_B);    /* Waits for B */
 
/* Thread B: */
mutex_lock(&lock_B);
mutex_lock(&lock_A);    /* Waits for A - DEADLOCK! */
 
/* CORRECT: Always acquire in consistent order (A before B) */
 
/* Deadlock 2: Self-deadlock (reentrant lock needed) */
static void outer_function(struct mydev *dev)
{
    mutex_lock(&dev->lock);
    inner_function(dev);  /* Calls back to locked code */
    mutex_unlock(&dev->lock);
}
 
static void inner_function(struct mydev *dev)
{
    mutex_lock(&dev->lock);  /* DEADLOCK: Already held! */
    /* ... */
    mutex_unlock(&dev->lock);
}
 
/* CORRECT: Don't reacquire, or use recursive locking (discouraged) */
 
/* Deadlock 3: Lock held across sleeping operation */
static int read_with_spinlock(struct mydev *dev, char __user *buf, size_t len)
{
    char local_buf[MYDEV_DATA_SIZE];  /* Hypothetical size of dev->data */

    spin_lock(&dev->lock);

    /* BUG: copy_to_user can sleep (page fault)! */
    copy_to_user(buf, dev->data, len);  /* Sleeping in atomic context */

    spin_unlock(&dev->lock);

    /* CORRECT: Copy to local buffer first */
    if (len > sizeof(local_buf))
        return -EINVAL;
    spin_lock(&dev->lock);
    memcpy(local_buf, dev->data, len);
    spin_unlock(&dev->lock);
    if (copy_to_user(buf, local_buf, len))
        return -EFAULT;
    return 0;
}
 
/* Deadlock 4: Interrupt handler vs process lock */
static irqreturn_t my_isr(int irq, void *data)
{
    struct mydev *dev = data;

    spin_lock(&dev->lock);  /* Fine in hard-IRQ context */
    /* ... */
    spin_unlock(&dev->lock);
    return IRQ_HANDLED;
}

static int process_func(struct mydev *dev)
{
    unsigned long flags;

    spin_lock(&dev->lock);  /* Acquires lock */
    /* Interrupt fires here on the same CPU - ISR tries to acquire same lock */
    /* DEADLOCK: ISR spins forever, and we can never run to release it! */
    spin_unlock(&dev->lock);

    /* CORRECT: Disable IRQ when taking lock in process context */
    spin_lock_irqsave(&dev->lock, flags);
    /* Now local interrupts are disabled, ISR can't run on this CPU */
    spin_unlock_irqrestore(&dev->lock, flags);
    return 0;
}

lockdep is Your Friend
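
lockdep (enabled with CONFIG_PROVE_LOCKING) records the order in which lock classes are taken and reports an inversion even if the deadlock never actually happens in that run. The toy user-space model below illustrates the idea only. It is not how lockdep is implemented, it catches only direct A/B inversions rather than longer cycles, and all names are hypothetical.

```c
#include <stdbool.h>

#define MAX_LOCK_CLASSES 8

/* order_seen[a][b] is true once class a was held while taking class b. */
struct lock_order_tracker {
    bool order_seen[MAX_LOCK_CLASSES][MAX_LOCK_CLASSES];
    int held[MAX_LOCK_CLASSES];   /* stack of currently held classes */
    int depth;
};

/*
 * Record that lock class id is being acquired. Returns false when the
 * acquisition would be a self-deadlock or would invert an order that
 * was recorded earlier; in that case nothing is recorded.
 */
bool tracker_acquire(struct lock_order_tracker *t, int id)
{
    int i;

    if (id < 0 || id >= MAX_LOCK_CLASSES || t->depth >= MAX_LOCK_CLASSES)
        return false;

    for (i = 0; i < t->depth; i++) {
        int h = t->held[i];

        if (h == id || t->order_seen[id][h])
            return false;   /* recursive acquire or ABBA inversion */
    }
    for (i = 0; i < t->depth; i++)
        t->order_seen[t->held[i]][id] = true;
    t->held[t->depth++] = id;
    return true;
}

/* Release the most recently acquired lock class. */
void tracker_release(struct lock_order_tracker *t)
{
    if (t->depth > 0)
        t->depth--;
}
```

The useful property is that the inversion is flagged the first time the second ordering is seen, on a single thread, without needing the unlucky timing that a real ABBA deadlock requires.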

Debugging Kernel Crashes

Types of Kernel Crashes:

Oops: Non-fatal error, kernel may continue (process killed)
Panic: Fatal error, system halts immediately
BUG(): Developer assertion failure, triggers Oops
WARN(): Non-fatal warning with stack trace
Hung Task: Task blocked in uninterruptible sleep for too long (120 seconds by default); the hung-task detector logs a warning with a stack trace

oops-example.txt

Kernel Oops Example

[  123.456789] BUG: unable to handle page fault for address: 0000000000000010
[  123.456790] #PF: supervisor read access in kernel mode
[  123.456791] #PF: error_code(0x0000) - not-present page
[  123.456792] PGD 0 P4D 0
[  123.456793] Oops: 0000 [#1] PREEMPT SMP KASAN
[  123.456794] CPU: 2 PID: 1234 Comm: test_program Tainted: G        W   E      5.10.0
[  123.456795] Hardware name: QEMU Standard PC
[  123.456796] RIP: 0010:mydriver_process+0x123/0x456 [mydriver]
[  123.456797] Code: 48 89 f3 48 8b 43 10 48 85 c0 74 20 48 8b 00 <48> 8b 50 10 48 85 d2 74 0e
[  123.456798] RSP: 0018:ffffc9000123def0 EFLAGS: 00010286
[  123.456799] RAX: 0000000000000000 RBX: ffff888012345678 RCX: 0000000000000001
[  123.456800] RDX: 0000000000000000 RSI: ffff888023456789 RDI: ffff888012345678
[  123.456801] RBP: ffffc9000123df30 R08: 0000000000000000 R09: 0000000000000001
[  123.456802] ...
[  123.456803] Call Trace:
[  123.456804]  mydriver_ioctl+0x78/0x1a0 [mydriver]
[  123.456805]  __x64_sys_ioctl+0x91/0xc0
[  123.456806]  do_syscall_64+0x33/0x40
[  123.456807]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  123.456808] Modules linked in: mydriver(OE) other_module...

Reading the Oops:

Fault Address (0x10): Near-zero address suggests NULL pointer dereference plus offset. A field at offset 0x10 in a NULL structure pointer.
RIP (Instruction Pointer): mydriver_process+0x123 — crash at offset 0x123 in mydriver_process function.
Register Values: RAX=0 is the NULL pointer; we tried to read offset 0x10 from it.
Call Trace: Shows how we got here: ioctl → mydriver_ioctl → mydriver_process.
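Taint Flags: 'G' means no proprietary modules are loaded, 'W' means a warning was issued earlier, and 'E' means an unsigned module was loaded (less trustworthy). The '(OE)' after mydriver marks it as an out-of-tree, unsigned module.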
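
To connect a small fault address such as 0x10 to a structure member, `offsetof` can be used on the suspected structure. The sketch below assumes a made-up `toy_ctx` layout; in practice the structure would be the one dereferenced at the faulting source line.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout; the real struct comes from the driver source. */
struct toy_ctx {
    uint64_t flags;    /* offset 0x00 */
    uint64_t cookie;   /* offset 0x08 */
    uint64_t ring;     /* offset 0x10 */
};

/*
 * Return the name of the member that starts at fault_offset, or NULL
 * if no member starts exactly there.
 */
const char *toy_ctx_field_at(size_t fault_offset)
{
    if (fault_offset == offsetof(struct toy_ctx, flags))
        return "flags";
    if (fault_offset == offsetof(struct toy_ctx, cookie))
        return "cookie";
    if (fault_offset == offsetof(struct toy_ctx, ring))
        return "ring";
    return NULL;
}
```

With this layout, a fault at address 0x10 through a NULL `struct toy_ctx *` points at an access to the `ring` member.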

Using addr2line:

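# addr2line expects an address inside the object file, not an offset
# relative to a function. The kernel tree's faddr2line script accepts
# the function+offset form printed in the Oops:
./scripts/faddr2line mydriver.ko mydriver_process+0x123/0x456
# Reports the source location, e.g. /path/to/mydriver.c:456

# Or with GDB
gdb mydriver.ko
(gdb) list *(mydriver_process+0x123)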

Kernel Debugging Tools
Tool	Purpose	Usage
dmesg	View kernel log messages	dmesg | tail -50
addr2line	Convert addresses to source lines	addr2line -e module.ko 0x123
objdump	Disassemble to see exact instruction	objdump -dS module.ko > module.asm
gdb	Debug kernel/modules	gdb vmlinux /proc/kcore
crash	Analyze kernel memory dumps	crash vmlinux vmcore
ftrace	Trace function calls live	echo function > /sys/kernel/debug/tracing/current_tracer
kgdb	Source-level kernel debugging	Requires serial connection
perf	Performance analysis and sampling	perf record -g myprogram

Enable Debug Info

Dynamic Debug and Tracing

Dynamic Debug (pr_debug/dev_dbg):

When you use pr_debug() or dev_dbg() in driver code, and CONFIG_DYNAMIC_DEBUG is enabled, these messages are compiled in but disabled by default. You can enable them at runtime:

dynamic-debug.sh
Dynamic Debug Usage
# View available debug messages
cat /sys/kernel/debug/dynamic_debug/control | grep mydriver
 
# Enable all debug messages in a file
echo 'file mydriver.c +p' > /sys/kernel/debug/dynamic_debug/control
 
# Enable specific function
echo 'func mydriver_process +p' > /sys/kernel/debug/dynamic_debug/control
 
# Enable with line numbers and function names
echo 'file mydriver.c +pfl' > /sys/kernel/debug/dynamic_debug/control
 
# Enable by module name
echo 'module mydriver +p' > /sys/kernel/debug/dynamic_debug/control
 
# Disable all
echo 'module mydriver -p' > /sys/kernel/debug/dynamic_debug/control
 
# In driver code (dev_dbg() and dev_info() are declared via <linux/device.h>):
#include <linux/device.h>
 
static void my_function(struct device *dev, int value)
{
    dev_dbg(dev, "Entering function with value=%d\n", value);
    /* This prints nothing by default, enable via dynamic debug */
    
    /* Always print (not controllable): */
    dev_info(dev, "Processing value %d\n", value);
}

ftrace.sh
Function Tracing with ftrace
# Go to the tracing control directory
cd /sys/kernel/debug/tracing
 
# See available tracers
cat available_tracers
# nop function function_graph wakeup...
 
# Enable function tracer
echo function > current_tracer
 
# Filter to specific module functions
echo ':mod:mydriver' > set_ftrace_filter
 
# Or specific functions
echo 'mydriver_open mydriver_read mydriver_ioctl' > set_ftrace_filter
 
# Enable tracing
echo 1 > tracing_on
 
# Perform operations that use the driver
./test_program
 
# Stop and view trace
echo 0 > tracing_on
cat trace
 
# Example output:
#           test_program-1234  [001] ....   123.456: mydriver_open <-chrdev_open
#           test_program-1234  [001] ....   123.457: mydriver_ioctl <-vfs_ioctl
#           test_program-1234  [001] ....   123.458: mydriver_process <-mydriver_ioctl
 
# Function graph tracer shows call hierarchy
echo function_graph > current_tracer
echo 1 > tracing_on
 
# Output shows indented call tree:
#  0)               |  mydriver_ioctl() {
#  0)   0.123 us    |    get_device();
#  0)               |    mydriver_process() {
#  0)   0.456 us    |      validate_input();
#  0)   1.234 us    |    }
#  0)   2.345 us    |  }

Trace Events

Hardware Interaction Bugs

Some of the most frustrating driver bugs stem from incorrect assumptions about hardware behavior. Hardware doesn't always behave as documented, and different revisions may have different quirks.

Common Hardware Interaction Bugs:

Hardware Bug Categories

•Timing Violations — Not waiting long enough for hardware operations. Some registers need delays before/after access. Missing delays cause intermittent failures depending on CPU speed.
•Register Access Ordering — CPU and compiler may reorder memory operations. Hardware often requires specific access order. Missing memory barriers cause random failures.
•Undocumented Behavior — Hardware does something not in the datasheet. Often discovered through reverse engineering or vendor hints. Must handle gracefully.
•Power State Transitions — Not waiting for power-up stabilization. Accessing registers before device fully powered causes undefined behavior.
•Interrupt Handling — Not clearing interrupt cause properly. Stuck interrupts (IRQ storm) or lost interrupts. Edge vs level triggering confusion.
•DMA Coherence — Not flushing caches before/after DMA. CPU sees stale data or device sees stale data. Manifests as data corruption.

hardware-bugs.c
Hardware Interaction Patterns
/* Timing: Wait for hardware operation */
static int reset_device(struct mydev *dev)
{
    /* Write reset command */
    writel(RESET_CMD, dev->regs + CTRL_REG);
    
    /* BUG: Checking status immediately - reset takes time! */
    if (readl(dev->regs + STATUS) & RESET_DONE)  /* WRONG */
        return 0;
    
    /* CORRECT: Wait with timeout */
    int timeout = 1000;  /* At least ~1 second; msleep(1) may sleep longer than 1 ms */
    while (--timeout) {
        if (readl(dev->regs + STATUS) & RESET_DONE)
            return 0;
        msleep(1);
    }
    return -ETIMEDOUT;
    
    /* BETTER: Use readl_poll_timeout helper */
    u32 val;
    int ret = readl_poll_timeout(dev->regs + STATUS, val,
                                  val & RESET_DONE,
                                  1000,    /* poll interval us */
                                  1000000); /* timeout us */
    return ret;
}
 
/* Register ordering with memory barriers */
static void start_dma(struct mydev *dev, dma_addr_t addr)
{
    /* BUG: __raw_writel() has no ordering guarantees; on weakly ordered
     * CPUs the hardware might see START before the address is set! */
    __raw_writel(lower_32_bits(addr), dev->regs + DMA_ADDR_LO);
    __raw_writel(upper_32_bits(addr), dev->regs + DMA_ADDR_HI);
    __raw_writel(DMA_START, dev->regs + DMA_CTRL);  /* WRONG */

    /* CORRECT: writel() keeps MMIO writes to one device in program order
     * and orders them after earlier CPU writes to memory */
    writel(lower_32_bits(addr), dev->regs + DMA_ADDR_LO);
    writel(upper_32_bits(addr), dev->regs + DMA_ADDR_HI);
    writel(DMA_START, dev->regs + DMA_CTRL);

    /* Note: writel_relaxed() drops the ordering against normal memory;
     * add wmb() before it when the device must see prior memory writes
     * (e.g. DMA descriptors) */
}
 
/* Interrupt acknowledgment */
static irqreturn_t mydev_isr(int irq, void *data)
{
    struct mydev *dev = data;
    u32 status;
    
    status = readl(dev->regs + INT_STATUS);
    if (!status)
        return IRQ_NONE;
    
    /* Clear interrupt BEFORE processing 
     * (for level-triggered interrupts) */
    writel(status, dev->regs + INT_CLEAR);
    
    /* Ensure clear completes before returning 
     * (some hardware needs this) */
    readl(dev->regs + INT_CLEAR);  /* Flush posted write */
    
    /* Now process the interrupt */
    handle_interrupt(dev, status);
    
    return IRQ_HANDLED;
}
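
The poll-with-timeout pattern from `reset_device()` can be exercised without hardware. The following user-space sketch uses a simulated status register. `sim_dev`, `sim_read_status`, and `sim_poll_status` are hypothetical names, and the delay between polls that a real driver needs is deliberately left out.

```c
#include <errno.h>
#include <stdint.h>

#define SIM_RESET_DONE 0x1u

/* Simulated device: STATUS reports RESET_DONE after a number of reads. */
struct sim_dev {
    unsigned int reads_until_done;
    unsigned int reads;
};

/* Stand-in for readl(dev->regs + STATUS). */
uint32_t sim_read_status(struct sim_dev *dev)
{
    dev->reads++;
    return dev->reads > dev->reads_until_done ? SIM_RESET_DONE : 0;
}

/*
 * Poll until (status & mask) is nonzero or max_polls reads were made.
 * Returns 0 on success and -ETIMEDOUT otherwise, so a dead device can
 * never hang the caller.
 */
int sim_poll_status(struct sim_dev *dev, uint32_t mask,
                    unsigned int max_polls)
{
    unsigned int i;

    for (i = 0; i < max_polls; i++) {
        if (sim_read_status(dev) & mask)
            return 0;
    }
    return -ETIMEDOUT;
}
```

Bounding the loop is the essential part: hardware that never sets the bit yields an error code instead of an unbounded wait.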

Posted Writes

Prevention Strategies

The best debugging is not needing to debug. Following defensive coding practices and using kernel infrastructure correctly prevents entire categories of bugs.

Prevention Best Practices:

Bug Prevention Strategies
Strategy	Prevents	Implementation
Use devm_* functions	Resource leaks, cleanup bugs	devm_kzalloc, devm_request_irq, etc.
Enable sanitizers	Memory corruption	CONFIG_KASAN, CONFIG_KMSAN, CONFIG_UBSAN
Enable lockdep	Deadlocks	CONFIG_PROVE_LOCKING
Use helpers/macros	Off-by-one, overflow	array_size(), struct_size(), ARRAY_SIZE()
Static analysis	Various static bugs	sparse, smatch, coccinelle
Consistent patterns	Ad-hoc errors	Use established driver patterns
Defensive checks	NULL deref, bad input	WARN_ON, parameter validation (BUG_ON only when continuing is unsafe)
Annotation	Context violations	__user, __iomem, __must_check

defensive-coding.c
Defensive Coding Examples
/* Use size macros to prevent overflow */
struct my_device {
    struct dev_info info;
    struct entry entries[];  /* Flexible array */
};
 
/* WRONG: Manual calculation can overflow */
size_t size = sizeof(struct my_device) + count * sizeof(struct entry);
 
/* CORRECT: Use struct_size helper */
size_t size = struct_size(dev, entries, count);
 
/* WRONG: Array size confusion */
void process_all(struct entry *entries)
{
    for (int i = 0; i < sizeof(entries); i++)  /* WRONG: pointer size! */
        process(&entries[i]);
}

/* CORRECT: Pass the count explicitly; ARRAY_SIZE() works only on true arrays */
void process_all_safe(struct entry *entries, size_t count)
{
    for (size_t i = 0; i < count; i++)
        process(&entries[i]);
}
 
/* Defensive assertions */
static int configure_device(struct mydev *dev, struct config *cfg)
{
    /* Check invariants */
    BUG_ON(dev == NULL);  /* Kernel BUG if violated - fatal; prefer WARN_ON when recovery is possible */
    
    if (WARN_ON(cfg == NULL))  /* Print warning, continue */
        return -EINVAL;
    
    /* Validate input ranges */
    if (cfg->speed > MAX_SPEED || cfg->speed < MIN_SPEED) {
        dev_err(dev->dev, "Invalid speed %u\n", cfg->speed);
        return -EINVAL;
    }
    
    /* Check device state */
    if (dev->state != STATE_READY) {
        dev_warn(dev->dev, "Device not ready for configuration\n");
        return -EBUSY;
    }
    
    /* All checks passed, proceed */
    return do_configure(dev, cfg);
}
 
/* Annotate for static analysis */
static int __must_check read_user_data(void __user *src, void *dst, size_t n)
{
    /* __must_check: caller must check return value */
    if (copy_from_user(dst, src, n))
        return -EFAULT;
    return 0;
}
 
/* Use this: */
int ret = read_user_data(user_ptr, kernel_buf, len);
if (ret)  /* Compiler warns if you forget this check */
    return ret;
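
To make the overflow risk behind `struct_size()` concrete, here is a user-space sketch of an overflow-checked size calculation for a structure with a flexible array member. It relies on the GCC/Clang builtins `__builtin_mul_overflow` and `__builtin_add_overflow`. The `toy_table` type is hypothetical, and saturating to `SIZE_MAX` is chosen to mirror the kernel helper's behavior.

```c
#include <stddef.h>
#include <stdint.h>

struct toy_entry {
    uint32_t id;
    uint32_t value;
};

struct toy_table {
    size_t count;
    struct toy_entry entries[];   /* flexible array member */
};

/*
 * Bytes needed for a toy_table holding count entries, or SIZE_MAX if
 * the arithmetic would overflow. An allocation of SIZE_MAX bytes fails,
 * so the caller cannot end up with an undersized buffer.
 */
size_t toy_table_size(size_t count)
{
    size_t bytes;

    if (__builtin_mul_overflow(count, sizeof(struct toy_entry), &bytes))
        return SIZE_MAX;
    if (__builtin_add_overflow(bytes, sizeof(struct toy_table), &bytes))
        return SIZE_MAX;
    return bytes;
}
```

With the unchecked expression, a huge `count` wraps around to a small size and the later writes overflow the allocation; the checked version turns that into an allocation failure.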

Defense in Depth

Summary: Driver Bugs

We've explored the landscape of driver bugs—the most dangerous category of software defects. Let's consolidate the key takeaways:

Key Takeaways

•Driver bugs are uniquely severe — They can crash the system, corrupt data, and compromise security, unlike application bugs that affect only one process.
•Memory safety bugs dominate — Buffer overflows, use-after-free, and double-free are the most common and dangerous categories.
•Concurrency bugs are subtle — Race conditions and deadlocks often require specific timing to manifest, making them hard to find but devastating when they occur.
•Learn to read Oops output — Understanding crash dumps enables rapid identification of bug location and cause.
•Dynamic debug enables visibility — pr_debug and ftrace provide runtime insight without recompiling.
•Hardware interaction requires care — Timing, ordering, and undocumented behavior cause many hard-to-debug issues.
•Prevention beats debugging — Sanitizers, lockdep, static analysis, and defensive coding prevent bugs before they occur.
