Memory is finite. No matter how much RAM a system has, it's possible for processes to collectively demand more than is available. When swap is exhausted and page reclamation cannot free enough memory to satisfy allocations, the kernel faces a stark choice: let the system lock up and become unresponsive, or terminate processes to free memory and keep the system running.
Linux chooses survival. The Out-of-Memory (OOM) Killer is the kernel's mechanism for selecting and terminating processes when memory is critically exhausted. It's a controversial feature—killing processes is inherently destructive—but the alternative (system hang) is usually worse.
This page provides an expert-level examination of the OOM killer: when it triggers, how it selects victims, how to tune its behavior, and strategies for preventing OOM situations in the first place.
By the end of this page, you will understand: (1) when and why the OOM killer activates, (2) the victim selection algorithm and scoring system, (3) oom_score_adj tuning for protecting critical processes, (4) memory cgroups and OOM handling, (5) the oom_reaper and its role in recovery, and (6) strategies for preventing and detecting OOM situations.
The OOM killer is a mechanism of last resort. The kernel employs multiple layers of memory management before reaching this point:
Memory Pressure Response Hierarchy:
Normal allocation: Sufficient free memory exists, allocation succeeds immediately.
Zone reclaim: Free pages drop below watermarks, kswapd wakes up and reclaims pages in the background.
Direct reclaim: Allocation fails initial checks, the requesting process synchronously reclaims pages before retrying.
Compaction: High-order allocation fails, system compacts memory to create contiguous regions.
Swap writeback: Anonymous pages are written to swap to free physical memory.
OOM kill: All reclamation options exhausted, memory still unavailable, a process must die.
By the time OOM triggers, the system has already been under severe memory pressure for some time. kswapd has been frantically reclaiming, processes have been stalling in direct reclaim, and performance has likely degraded significantly. OOM is the culmination of escalating memory problems.
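To see this escalation in practice, a deliberately unsafe test program can allocate and touch memory until the kernel steps in. The sketch below is illustrative only and assumes you run it inside a VM or a cgroup with memory.max set; on a production machine it will cause exactly the degradation described above before being OOM-killed.

/*
 * Walk a machine through the reclaim hierarchy by allocating and
 * touching memory until the kernel intervenes.
 * DANGEROUS on a real system: run inside a VM or a memory-limited
 * cgroup, and expect the process to be OOM-killed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t chunk = 64UL << 20;          /* 64 MB per step */
    size_t total = 0;

    for (;;) {
        char *p = malloc(chunk);
        if (!p) {                        /* usually only seen with strict overcommit */
            printf("malloc failed after %zu MB\n", total >> 20);
            break;
        }
        /* Touch every page so the allocation becomes resident (RSS grows);
         * with default overcommit this is what eventually forces OOM. */
        memset(p, 0x5A, chunk);
        total += chunk;
        printf("allocated %zu MB\n", total >> 20);
    }
    return 0;
}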
The OOM Trigger Condition:
The OOM killer is invoked when:
Direct reclaim has failed to free enough pages to satisfy the allocation.
Compaction (for high-order requests) has also failed.
Further reclaim retries are judged hopeless, and the allocation is not allowed to simply fail (it is a low-order request or carries the __GFP_NOFAIL flag).
At this point, out_of_memory() is called, initiating the OOM selection and kill process.
/* OOM invocation path (mm/page_alloc.c, simplified) */

struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                    struct alloc_context *ac)
{
    struct page *page = NULL;
    unsigned int alloc_flags;

    /* Try direct reclaim */
    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                        &did_some_progress);
    if (page)
        return page;

    /* Try compaction for high-order allocations */
    if (order > 0) {
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                            compact_priority);
        if (page)
            return page;
    }

    /* Check if we should invoke OOM */
    if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                             did_some_progress > 0))
        goto retry;                      /* Keep trying reclaim */

    /* Should we invoke OOM killer? */
    if (gfp_mask & __GFP_NOFAIL) {
        /* This allocation MUST succeed - invoke OOM */
        if (oom_killer_disabled)
            wait_event_freezable(oom_wait, false);   /* Wait forever */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
            return page;
        goto retry;
    }

    /* Allocation can fail - check if OOM would help */
    if (check_oom_conditions()) {
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
            return page;
    }

    return NULL;                         /* Allocation failed */
}

/* The actual OOM decision point */
static struct page *__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
                                          struct alloc_context *ac,
                                          unsigned long *did_some_progress)
{
    struct oom_control oc = {
        .zonelist = ac->zonelist,
        .nodemask = ac->nodemask,
        .memcg    = NULL,                /* System-wide OOM */
        .gfp_mask = gfp_mask,
        .order    = order,
    };

    /* Serialize OOM killing - only one at a time */
    if (!mutex_trylock(&oom_lock))
        return NULL;

    /* Try once more after getting lock */
    page = get_page_from_freelist(...);
    if (page)
        goto out;

    /* Invoke the OOM killer */
    if (!out_of_memory(&oc)) {
        /* OOM killer decided not to kill (panic, etc.) */
    }

    *did_some_progress = 1;
out:
    mutex_unlock(&oom_lock);
    return page;
}

When OOM is triggered, the kernel must select which process to kill. This is a complex decision with significant consequences—killing the wrong process could make things worse (killing a critical system service) or be ineffective (killing a small process that frees little memory).
The Selection Criteria:
The OOM killer aims to: free the largest amount of memory while killing the fewest processes, avoid killing processes that are not contributing to the shortage, never touch init or kernel threads, and respect administrator preferences expressed through oom_score_adj.
The oom_score:
Each process has an oom_score visible in /proc/[pid]/oom_score. This score (0-1000+) represents how "good" a candidate the process is for killing:
/* OOM badness scoring (mm/oom_kill.c) */

/**
 * oom_badness - calculate oom_score for a task
 * @p: task struct of which task we should calculate
 * @totalpages: total number of pages in the system
 *
 * The formula is:
 *   points  = (process_rss + swap_usage) / totalpages * 1000
 *   points += oom_score_adj (user-adjustable)
 */
long oom_badness(struct task_struct *p, unsigned long totalpages)
{
    long points;
    long adj;

    /* Never kill init (PID 1) */
    if (is_global_init(p))
        return LONG_MIN;

    /* Never kill kernel threads */
    if (p->flags & PF_KTHREAD)
        return LONG_MIN;

    /* Get user-specified adjustment */
    adj = (long)p->signal->oom_score_adj;

    /* OOM_SCORE_ADJ_MIN (-1000) means "never kill" */
    if (adj == OOM_SCORE_ADJ_MIN)
        return LONG_MIN;

    /* Calculate base score from memory usage */
    points  = get_mm_rss(p->mm);                    /* Resident Set Size */
    points += get_mm_counter(p->mm, MM_SWAPENTS);   /* Swap entries */

    /* Thread group: count children's memory too */
    /* (children inherit parent's oom_score_adj) */

    /* Normalize to 0-1000 scale */
    /* Score of 1000 = using 100% of memory */
    points = points * 1000 / totalpages;

    /* Apply user adjustment */
    /* oom_score_adj ranges from -1000 to +1000 */
    points += adj;

    /* Ensure non-negative (unless OOM_SCORE_ADJ_MIN) */
    if (points < 1)
        points = 1;

    return points;
}

/* Select the worst (highest scoring) process */
static void select_bad_process(struct oom_control *oc)
{
    struct task_struct *p;
    long worst_score = 0;
    struct task_struct *worst = NULL;

    rcu_read_lock();
    for_each_process(p) {
        /* Skip tasks that can't be killed */
        if (oom_unkillable_task(p))
            continue;

        /* Check if task is in target memcg (if applicable) */
        if (!oom_cpuset_eligible(p, oc))
            continue;

        /* Calculate score */
        long score = oom_badness(p, oc->totalpages);

        /* Skip tasks excluded from killing */
        if (score == LONG_MIN)
            continue;

        if (score > worst_score) {
            worst_score = score;
            worst = p;
        }
    }
    rcu_read_unlock();

    oc->chosen = worst;
    oc->chosen_points = worst_score;
}

The OOM score is based primarily on Resident Set Size (RSS)—the amount of physical memory actually used by the process. Virtual memory size doesn't matter because OOM is about freeing physical memory. A process with 10 GB virtual size but only 100 MB RSS is a poor candidate compared to one with 500 MB RSS.
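Because the scoring formula is so simple, it can be approximated from userspace. The sketch below (the helper name read_kb_field is ours, and recent kernels also fold page-table memory into the score, so expect small differences) reads VmRSS and VmSwap from /proc/[pid]/status, MemTotal from /proc/meminfo, and the process's oom_score_adj, then applies the same normalization shown above; the result should land near the kernel's /proc/[pid]/oom_score.

/*
 * Estimate a process's oom_score from userspace using the rss + swap
 * formula shown above. A sketch only; expect small differences from
 * /proc/<pid>/oom_score on recent kernels.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read a "Key:   12345 kB" style field from a /proc file, in kB. */
static long read_kb_field(const char *path, const char *key)
{
    char line[256];
    long val = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(int argc, char **argv)
{
    char path[64];
    long adj = 0;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    long rss_kb   = read_kb_field(path, "VmRSS:");
    long swap_kb  = read_kb_field(path, "VmSwap:");
    long total_kb = read_kb_field("/proc/meminfo", "MemTotal:");

    snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", argv[1]);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &adj) != 1)
            adj = 0;
        fclose(f);
    }

    if (rss_kb < 0 || total_kb <= 0)
        return 1;

    /* points = (rss + swap) / totalpages * 1000, then add oom_score_adj.
     * Working in kB on both sides keeps the ratio identical. */
    long points = (rss_kb + swap_kb) * 1000 / total_kb + adj;
    if (points < 1)
        points = 1;

    printf("estimated oom_score for pid %s: %ld\n", argv[1], points);
    return 0;
}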
The kernel allows administrators to influence OOM victim selection through the oom_score_adj mechanism. This is crucial for protecting critical services and ensuring less important processes are killed first.
oom_score_adj Range:
The adjustment ranges from -1000 to +1000 and is added directly to the normalized badness score. A value of -1000 (OOM_SCORE_ADJ_MIN) exempts the process from OOM killing entirely, 0 is the default, and +1000 makes the process the preferred victim.
Common Use Cases:
# === Viewing OOM Scores ===

# Current OOM score (kernel-calculated)
cat /proc/self/oom_score
# 150

# Current adjustment (user-set)
cat /proc/self/oom_score_adj
# 0

# View for all processes
ps -eo pid,comm,oom_score,oom_score_adj --sort=-oom_score | head -20
#   PID COMMAND    OOM_SCORE  OOM_SCORE_ADJ
# 15234 firefox          823              0
# 12456 chrome           756              0
# 23456 slack            412              0
#  1234 sshd               2           -900

# === Setting oom_score_adj ===

# Protect a critical service (e.g., database)
echo -900 > /proc/$(pidof postgres)/oom_score_adj

# Mark a process as expendable
echo 500 > /proc/$(pidof batch_worker)/oom_score_adj

# Make process immune to OOM (requires root)
echo -1000 > /proc/$(pidof critical_daemon)/oom_score_adj

# Make process always first candidate
echo 1000 > /proc/$(pidof test_process)/oom_score_adj

# === Systemd Integration ===
# In service unit files:
# [Service]
# OOMScoreAdjust=-900

# Example: /etc/systemd/system/database.service
cat << 'EOF'
[Unit]
Description=Critical Database Service

[Service]
ExecStart=/usr/bin/database_server
OOMScoreAdjust=-900
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# === Protecting SSH ===
# sshd is often protected to ensure remote access during OOM
cat /proc/$(pidof sshd)/oom_score_adj
# -900 (typical default set by systemd)

# === Docker/Container OOM adjustment ===
# In docker-compose.yml:
# services:
#   myservice:
#     oom_score_adj: -500

# Docker run:
docker run --oom-score-adj=-500 myimage

| Process Type | Suggested Value | Rationale |
|---|---|---|
| SSH daemon | -900 | Maintain remote access during OOM |
| Database | -800 to -900 | Critical data service |
| Init system (systemd) | -900 | Core system process |
| Web server | -200 to -400 | Important but not critical |
| Worker processes | 0 | Default, normal priority |
| Batch jobs | 200 to 500 | Expendable, kill first |
| Test/dev processes | 800 to 1000 | Kill before production |
Setting oom_score_adj to -1000 makes a process completely immune to OOM killing. If too many processes are immune and they collectively exhaust memory, the OOM killer cannot free any memory and the system may hang. Use -1000 sparingly for only the most critical processes.
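Beyond systemd and container settings, a long-running service can also apply protection to itself at startup by writing to /proc/self/oom_score_adj. The sketch below is a minimal example, not a fixed convention; the -900 value mirrors the table above, and lowering the score requires CAP_SYS_RESOURCE (typically root), while raising it never does.

/*
 * Lower our own OOM score at startup so the OOM killer prefers other
 * victims. The -900 value is an illustrative choice.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>

static int set_oom_score_adj(int adj)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");

    if (!f)
        return -errno;
    if (fprintf(f, "%d\n", adj) < 0) {
        fclose(f);
        return -EIO;
    }
    if (fclose(f) != 0)
        return -errno;
    return 0;
}

int main(void)
{
    int err = set_oom_score_adj(-900);

    if (err)
        fprintf(stderr, "could not adjust OOM score: %s\n", strerror(-err));

    /* ... continue with normal service initialization ... */
    return 0;
}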
Once a victim is selected, the OOM killer executes a carefully orchestrated sequence to terminate the process and reclaim its memory.
/* OOM kill execution (mm/oom_kill.c) */

static void __oom_kill_process(struct task_struct *victim, const char *message)
{
    struct task_struct *p;
    struct task_struct *t;
    struct mm_struct *mm;

    /* Get victim's mm_struct */
    mm = victim->mm;

    /* Log the OOM kill event */
    pr_err("%s: Killed process %d (%s) total-vm:%ldkB, "
           "anon-rss:%ldkB, file-rss:%ldkB, shmem-rss:%ldkB, "
           "oom_score_adj:%hd\n",
           message, task_pid_nr(victim), victim->comm,
           K(mm->total_vm),
           K(get_mm_counter(mm, MM_ANONPAGES)),
           K(get_mm_counter(mm, MM_FILEPAGES)),
           K(get_mm_counter(mm, MM_SHMEMPAGES)),
           victim->signal->oom_score_adj);

    /* Dump system/process memory info */
    dump_header(oc);

    /* Mark the mm as OOM victim - prevents further faults */
    mark_mm_as_oom_victim(mm);

    /*
     * Set TIF_MEMDIE flag - grants access to memory reserves
     * This allows the victim to exit cleanly even under
     * extreme memory pressure
     */
    mark_oom_victim(victim);

    /* Wake the oom_reaper to reclaim memory in parallel */
    wake_oom_reaper(victim);

    /* Send SIGKILL to the victim */
    do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);

    /* Kill all threads in the process */
    for_each_thread(victim, t) {
        if (t != victim)
            do_send_sig_info(SIGKILL, SEND_SIG_PRIV, t, PIDTYPE_PID);
    }

    /* Also kill processes sharing the same mm (unusual) */
    for_each_process(p) {
        if (p->mm == mm && !same_thread_group(p, victim)) {
            pr_err("oom_kill: Killing multi-threaded process %d (%s)\n",
                   task_pid_nr(p), p->comm);
            do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
        }
    }
}

/*
 * OOM logged message format (dmesg output):
 *
 * [12345.678901] Out of memory: Killed process 4567 (firefox)
 *                total-vm:4567890kB, anon-rss:1234567kB,
 *                file-rss:123456kB, shmem-rss:12345kB,
 *                oom_score_adj:0
 */

Key Steps in OOM Kill:
Logging: Extensive information is logged to help diagnose why OOM occurred
TIF_MEMDIE flag: Grants the victim access to memory reserves, allowing it to execute exit handlers even under severe pressure
SIGKILL: Sends an un-catchable signal to terminate the process
Thread group kill: All threads in the process group are killed, not just the main thread
oom_reaper: A kernel thread that can reclaim memory from the victim's address space without waiting for the victim to fully exit (handles victims stuck in D-state)
A subtle but critical problem: the OOM-killed process must run to exit. But if it's stuck waiting for memory (D-state), it can't run at all. The system deadlocks.
The Solution: oom_reaper
Linux 4.6 introduced the oom_reaper kernel thread. Instead of waiting for the victim to exit, the reaper walks the victim's address space and directly unmaps its private anonymous memory, returning those pages to the free lists.
This bypasses the deadlock—memory is freed without requiring the victim to execute.
/* OOM Reaper (mm/oom_kill.c) */

static struct task_struct *oom_reaper_th;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
static struct task_struct *oom_reaper_list;

/* Wake the reaper for a new victim */
static void wake_oom_reaper(struct task_struct *victim)
{
    /* Add victim to reaper's work list */
    victim->oom_reaper_list = oom_reaper_list;
    oom_reaper_list = victim;

    /* Wake the reaper thread */
    wake_up(&oom_reaper_wait);
}

/* The reaper thread */
static int oom_reaper(void *unused)
{
    set_freezable();

    while (true) {
        struct task_struct *victim;

        /* Wait for work */
        wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);

        /* Process all victims on list */
        while ((victim = oom_reaper_list) != NULL) {
            /* Remove from list */
            oom_reaper_list = victim->oom_reaper_list;

            /* Attempt to reap the victim's memory */
            oom_reap_task(victim);
        }
    }

    return 0;
}

/* Actually reap the memory */
static void oom_reap_task(struct task_struct *victim)
{
    struct mm_struct *mm = victim->mm;
    struct vm_area_struct *vma;

    if (!mm || !mmget_not_zero(mm))
        return;                          /* Already dead */

    /*
     * Only reap anonymous memory (private, not file-backed)
     * File-backed pages may be dirty and need writeback
     */
    down_read(&mm->mmap_lock);

    for_each_vma(mm, vma) {
        /* Skip non-anonymous VMAs */
        if (vma->vm_file)
            continue;
        if (vma->vm_flags & (VM_SHARED | VM_HUGETLB))
            continue;

        /* Unmap and free the pages */
        unmap_page_range(&mm->mmap, vma, vma->vm_start, vma->vm_end, NULL);
    }

    up_read(&mm->mmap_lock);

    pr_info("oom_reaper: reaped process %d (%s), freed anonymous memory\n",
            task_pid_nr(victim), victim->comm);

    mmput(mm);
}

The oom_reaper only frees anonymous (non-file-backed) memory because file-backed dirty pages must be written back to disk. Unmapping them without writeback would cause data loss. Anonymous pages (stack, heap, private mmap) have no backing store, so they can be safely discarded.
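The same property can be demonstrated from userspace: private anonymous pages have no backing store, so a process can discard them at any moment and receive zero-filled pages on the next access. The sketch below uses madvise(MADV_DONTNEED) on an anonymous mapping as an analogue of what the reaper does to a victim's VMAs; it is an illustration, not the reaper mechanism itself.

/*
 * Show that anonymous memory can be discarded without writeback:
 * after MADV_DONTNEED the physical pages are freed, and the next
 * access sees zero-filled pages.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16UL << 20;             /* 16 MB of private anonymous memory */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(buf, 0xAB, len);              /* fault in the pages (RSS grows) */
    printf("before discard: buf[0] = 0x%02x\n", (unsigned char)buf[0]);

    /* Discard the pages: no writeback needed, contents are simply lost */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        perror("madvise");
        return 1;
    }

    /* The next touch faults in fresh zero pages */
    printf("after discard:  buf[0] = 0x%02x\n", (unsigned char)buf[0]);

    munmap(buf, len);
    return 0;
}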
Memory cgroups (cgroup v2 memory controller) add another dimension to OOM handling. Each cgroup can have memory limits, and OOM can be scoped to a cgroup rather than affecting the entire system.
Cgroup-Level OOM:
When a cgroup exceeds its memory limit:
The kernel first tries to reclaim pages charged to that cgroup.
If reclaim cannot bring usage back under memory.max, a cgroup-scoped OOM kill is triggered.
Victim selection considers only processes belonging to that cgroup (or, with memory.oom.group enabled, the entire cgroup is killed together).
This is crucial for container isolation—a runaway container shouldn't affect other containers.
# === Cgroup v2 Memory Controller ===

# Enable the memory controller for child cgroups (written in the parent)
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup
mkdir -p /sys/fs/cgroup/myapp

# Set memory limit (512 MB)
echo 536870912 > /sys/fs/cgroup/myapp/memory.max

# Set swap limit (1 GB, if swap is enabled)
echo 1073741824 > /sys/fs/cgroup/myapp/memory.swap.max

# Move a process to the cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs

# === OOM Behavior Configuration ===

# View OOM kill events
cat /sys/fs/cgroup/myapp/memory.events
# low 0
# high 12
# max 5
# oom 2
# oom_kill 2
# oom_group_kill 0

# Configure OOM behavior

# Option 1: Kill individual process (default)
# When OOM triggers, kill one process in cgroup

# Option 2: Kill entire cgroup (oom_group_kill)
# When OOM triggers, kill ALL processes in cgroup
echo 1 > /sys/fs/cgroup/myapp/memory.oom.group

# === Systemd/Container Integration ===

# Docker with memory limit:
docker run --memory=512m --memory-swap=512m myimage

# Kubernetes memory limit:
# resources:
#   limits:
#     memory: "512Mi"

# Systemd unit with memory limit:
# [Service]
# MemoryMax=512M
# MemorySwapMax=0

# === Monitoring Cgroup OOM ===

# Watch for OOM events
inotifywait -m /sys/fs/cgroup/myapp/memory.events

# Or parse from dmesg
dmesg | grep -E "memory cgroup out of memory|oom_kill_process"

# === OOM Notification for Application Handling ===

# eventfd-based notification (for applications that want to handle OOM)
# Create an eventfd, register it via cgroup.event_control with
# memory.oom_control (cgroup v1)
# Or use memory.events inotify (cgroup v2)

# Example: Docker OOM event
docker events --filter 'event=oom'

| Aspect | System OOM | Cgroup OOM |
|---|---|---|
| Trigger | System-wide memory exhaustion | Cgroup memory limit exceeded |
| Scope | All user processes considered | Only cgroup members considered |
| Impact | Any process may be killed | Only cgroup processes affected |
| Isolation | None | Other cgroups/containers protected |
| Configuration | oom_score_adj | oom_score_adj + cgroup settings |
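Supervisors and applications can also detect these kills programmatically instead of scraping dmesg: cgroup v2 generates a file-modified notification on memory.events whenever a counter changes, so an inotify watcher is enough. The sketch below assumes the /sys/fs/cgroup/myapp path used in the examples above; adjust it to your hierarchy.

/*
 * Watch a cgroup v2 memory.events file for oom_kill increments.
 * The cgroup path is an assumption for the example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

#define EVENTS_FILE "/sys/fs/cgroup/myapp/memory.events"

/* Parse the "oom_kill <n>" line out of memory.events. */
static long read_oom_kill_count(void)
{
    char line[128];
    long count = -1;
    FILE *f = fopen(EVENTS_FILE, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "oom_kill %ld", &count) == 1)
            break;
    }
    fclose(f);
    return count;
}

int main(void)
{
    char buf[4096];
    long last = read_oom_kill_count();
    int fd = inotify_init1(0);

    if (fd < 0 || inotify_add_watch(fd, EVENTS_FILE, IN_MODIFY) < 0) {
        perror("inotify");
        return 1;
    }

    printf("watching %s (oom_kill=%ld)\n", EVENTS_FILE, last);

    for (;;) {
        /* Block until the kernel touches memory.events */
        if (read(fd, buf, sizeof(buf)) <= 0)
            break;

        long now = read_oom_kill_count();
        if (now > last) {
            printf("OOM kill detected in cgroup (total %ld)\n", now);
            last = now;
        }
    }

    close(fd);
    return 0;
}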
The best OOM handling is avoiding OOM in the first place. Here are strategies for preventing out-of-memory conditions:
# === Memory Monitoring ===

# Basic memory stats
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           31Gi       8.2Gi       1.5Gi       512Mi        22Gi        22Gi
# Swap:         8.0Gi          0B       8.0Gi

# "available" = memory available for new allocations (free + reclaimable cache)
# This is a better indicator than "free"

# Watch memory over time
watch -n 1 'free -h; echo; vmstat 1 5'

# Per-process memory usage
ps aux --sort=-%mem | head -20

# Detailed memory stats
cat /proc/meminfo | grep -E 'MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree'

# === Overcommit Configuration ===

# View current overcommit settings
cat /proc/sys/vm/overcommit_memory
# 0 = Heuristic overcommit (default)
# 1 = Always overcommit (never fail malloc)
# 2 = Don't overcommit (strict, commit limit = swap + ratio*RAM)

cat /proc/sys/vm/overcommit_ratio
# 50 (default) - when mode=2, limit = swap + 50% RAM

# Disable overcommit (strict memory accounting)
echo 2 > /proc/sys/vm/overcommit_memory
echo 80 > /proc/sys/vm/overcommit_ratio

# View commit limit and committed memory
cat /proc/meminfo | grep -E 'CommitLimit|Committed_AS'
# CommitLimit:    49123456 kB  (how much can be committed)
# Committed_AS:   12345678 kB  (currently committed)

# === Early Warning with OOM Threshold ===

# Set up earlyoom (userspace OOM killer with warnings)
# https://github.com/rfjakob/earlyoom
earlyoom -m 5 -s 5 -n --prefer '(^|/)(java|chromium)$'
# Kills when <5% RAM or <5% swap available
# Notifies before killing
# Prefers to kill java or chromium

# === Memory Pressure (PSI) ===

# View memory pressure
cat /proc/pressure/memory
# some avg10=0.00 avg60=0.00 avg300=0.00 total=12345678
# full avg10=0.00 avg60=0.00 avg300=0.00 total=1234567

# Interpret: "some" means at least one task waiting for memory
# If avg60 > 10%, system is under significant memory pressure

# === Swap Configuration ===

# View swap priority and usage
swapon --show
# NAME      TYPE      SIZE USED PRIO
# /dev/sda2 partition   8G   0B   -2

# Add more swap (emergency)
dd if=/dev/zero of=/swapfile bs=1G count=8
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Adjust swappiness (0-100, higher = swap more aggressively)
cat /proc/sys/vm/swappiness
# 60 (default)
echo 10 > /proc/sys/vm/swappiness   # Prefer keeping data in RAM

Don't panic when 'free' memory is low—Linux uses free RAM for caching. The 'available' metric from free/meminfo shows memory that can actually be allocated (free + easily reclaimable cache). Monitor 'available' falling toward zero, not 'free'.
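The MemAvailable and PSI signals shown above combine naturally into a userspace early-warning check, which is essentially what earlyoom does before resorting to kills. The sketch below uses illustrative thresholds (5% available memory, 10% PSI "some" avg10); it only reports, it does not kill anything.

/*
 * Minimal memory early-warning check: flag trouble when MemAvailable
 * drops below ~5% of MemTotal or when PSI "some" avg10 exceeds 10%.
 * Thresholds are illustrative; real tools such as earlyoom are configurable.
 */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *key)
{
    char line[256];
    long val = -1;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

static double psi_some_avg10(void)
{
    char line[256];
    double avg10 = -1.0;
    FILE *f = fopen("/proc/pressure/memory", "r");

    if (!f)
        return -1.0;    /* PSI requires CONFIG_PSI (kernel >= 4.20) */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "some avg10=%lf", &avg10) == 1)
            break;
    }
    fclose(f);
    return avg10;
}

int main(void)
{
    long total = meminfo_kb("MemTotal:");
    long avail = meminfo_kb("MemAvailable:");
    double psi = psi_some_avg10();

    if (total <= 0 || avail < 0)
        return 1;

    printf("MemAvailable: %ld kB of %ld kB, PSI some avg10: %.2f%%\n",
           avail, total, psi);

    if (avail < total / 20 || psi > 10.0)
        printf("WARNING: memory pressure high, OOM risk increasing\n");

    return 0;
}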
When OOM kills occur, understanding why is crucial for prevention. The kernel logs extensive information about each OOM event.
# === Viewing OOM Kill Logs ===

# Check dmesg for OOM events
dmesg | grep -A 50 "Out of memory"

# Example OOM log output:
# [12345.678901] myapp invoked oom-killer:
#                gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE),
#                order=0, oom_score_adj=0
# [12345.678902] CPU: 3 PID: 12345 Comm: myapp Kdump: loaded Not tainted
# [12345.678903] Hardware name: Dell Inc. PowerEdge R640/...
# [12345.678904] Call Trace:
#                 dump_stack+0x5c/0x80
#                 dump_header+0x4a/0x1e0
#                 out_of_memory+0x2a9/0x550
#                 __alloc_pages_slowpath+0xa5a/0xc30
#                 ...
# [12345.678920] Mem-Info:
# [12345.678921] active_anon:123456 inactive_anon:234567 isolated_anon:0
# [12345.678922] active_file:345678 inactive_file:456789 isolated_file:0
# [12345.678923] unevictable:0 dirty:1234 writeback:0 unstable:0
# [12345.678924] slab_reclaimable:12345 slab_unreclaimable:67890
# [12345.678925] mapped:45678 shmem:12345 pagetables:6789 bounce:0
# [12345.678926] free:1234 free_pcp:123 free_cma:0
# [12345.678930] Node 0 hugepages_total=0 hugepages_free=0 ...
# [12345.678935] Tasks state (memory values in pages):
# [12345.678936] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
# [12345.678937] [   1234]     0  1234   123456    45678        2097152        0          -900 sshd
# [12345.678938] [   5678]  1000  5678  4567890  1234567       10485760        0             0 firefox
# [12345.678939] ... (list of all processes)
# [12345.678990] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
#                cpuset=/,mems_allowed=0,
#                global_oom,task_memcg=/user.slice/...,
#                task=firefox,pid=5678,uid=1000
# [12345.678991] Out of memory: Killed process 5678 (firefox)
#                total-vm:18271560kB, anon-rss:4938268kB,
#                file-rss:123456kB, shmem-rss:0kB,
#                oom_score_adj:0

# === Parse OOM logs ===

# Extract killed processes
dmesg | grep "Out of memory: Killed process" | tail -20

# Find what triggered OOM
dmesg | grep "invoked oom-killer" | tail -20

# === journalctl (systemd) ===

journalctl -k | grep -i oom
journalctl --since "1 hour ago" | grep -i "out of memory"

# === System-wide OOM counter ===

cat /proc/vmstat | grep oom_kill
# oom_kill 3

# === Per-cgroup OOM events ===

cat /sys/fs/cgroup/user.slice/memory.events
# oom 2
# oom_kill 2

Key Information in OOM Logs:
The allocation that failed: gfp_mask, order, and which task invoked the oom-killer.
Mem-Info: a system-wide snapshot of anonymous, file, slab, and free page counts.
Tasks state: per-process total_vm, rss, swap entries, and oom_score_adj at the moment of the kill.
The final "Out of memory: Killed process" line with the victim's memory footprint and adjustment.
In OOM logs, the 'Tasks state' section lists all processes. Look at the 'rss' column (resident set size, in pages) to identify memory hogs. The process with the highest RSS at the time of the OOM is often the real culprit, even if it was not the process that was killed or the one that invoked the oom-killer.
The OOM killer is a critical safety mechanism that keeps Linux systems running when memory is exhausted. Let's consolidate the key concepts:
OOM is a last resort, invoked only after reclaim, compaction, and swap have all failed to satisfy an allocation.
Victims are chosen by badness score: roughly RSS plus swap usage normalized against total memory, shifted by oom_score_adj.
oom_score_adj lets administrators protect critical services (negative values) or sacrifice expendable ones (positive values); -1000 makes a process immune.
The oom_reaper frees a victim's anonymous memory even if the victim cannot run, avoiding deadlock.
Memory cgroups scope OOM to a container, protecting the rest of the system.
Prevention beats reaction: watch MemAvailable and PSI, configure overcommit deliberately, and use tools like earlyoom where appropriate.
Module Complete:
You have now completed the Linux Memory Management module. You understand:
Virtual address spaces and page tables.
The slab allocator and kernel object caching.
Physical memory zones and watermark-driven reclaim.
The OOM killer: when it triggers, how victims are chosen, and how to tune and prevent it.
This knowledge is essential for kernel development, system administration, performance tuning, and debugging memory-related issues.
Congratulations! You now have an expert-level understanding of Linux memory management internals. From virtual address spaces to page tables, from slab caching to zone management, from watermarks to the OOM killer—you understand how Linux manages the most precious computing resource: memory.