Memory is finite. No matter how much RAM a system has, it's possible for processes to collectively demand more than is available. When swap is exhausted and page reclamation cannot free enough memory to satisfy allocations, the kernel faces a stark choice: let the system lock up and become unresponsive, or terminate processes to free memory and keep the system running.
Linux chooses survival. The Out-of-Memory (OOM) Killer is the kernel's mechanism for selecting and terminating processes when memory is critically exhausted. It's a controversial feature—killing processes is inherently destructive—but the alternative (system hang) is usually worse.
This page provides an expert-level examination of the OOM killer: when it triggers, how it selects victims, how to tune its behavior, and strategies for preventing OOM situations in the first place.
By the end of this page, you will understand: (1) when and why the OOM killer activates, (2) the victim selection algorithm and scoring system, (3) oom_score_adj tuning for protecting critical processes, (4) memory cgroups and OOM handling, (5) the oom_reaper and its role in recovery, and (6) strategies for preventing and detecting OOM situations.
The OOM killer is a mechanism of last resort. The kernel employs multiple layers of memory management before reaching this point:
Memory Pressure Response Hierarchy:
Normal allocation: Sufficient free memory exists, allocation succeeds immediately.
Zone reclaim: Free pages drop below watermarks, kswapd wakes up and reclaims pages in the background.
Direct reclaim: Allocation fails initial checks, the requesting process synchronously reclaims pages before retrying.
Compaction: High-order allocation fails, system compacts memory to create contiguous regions.
Swap writeback: Anonymous pages are written to swap to free physical memory.
OOM kill: All reclamation options exhausted, memory still unavailable, a process must die.
By the time OOM triggers, the system has already been under severe memory pressure for some time. kswapd has been frantically reclaiming, processes have been stalling in direct reclaim, and performance has likely degraded significantly. OOM is the culmination of escalating memory problems.
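To see this escalation in practice, a deliberately unsafe test program can allocate and touch memory until the kernel steps in. The sketch below is illustrative only and assumes you run it inside a VM or a cgroup with memory.max set; on a production machine it will cause exactly the degradation described above before being OOM-killed.

/*
 * Walk a machine through the reclaim hierarchy by allocating and
 * touching memory until the kernel intervenes.
 * DANGEROUS on a real system: run inside a VM or a memory-limited
 * cgroup, and expect the process to be OOM-killed.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    size_t chunk = 64UL << 20;          /* 64 MB per step */
    size_t total = 0;

    for (;;) {
        char *p = malloc(chunk);
        if (!p) {                        /* usually only seen with strict overcommit */
            printf("malloc failed after %zu MB\n", total >> 20);
            break;
        }
        /* Touch every page so the allocation becomes resident (RSS grows);
         * with default overcommit this is what eventually forces OOM. */
        memset(p, 0x5A, chunk);
        total += chunk;
        printf("allocated %zu MB\n", total >> 20);
    }
    return 0;
}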
The OOM Trigger Condition:
The OOM killer is invoked when:
Direct reclaim has failed to free enough pages to satisfy the allocation.
Compaction (for high-order requests) has also failed.
Further reclaim retries are judged hopeless, and the allocation is not allowed to simply fail (it is a low-order request or carries the __GFP_NOFAIL flag).
At this point, out_of_memory() is called, initiating the OOM selection and kill process.
/* OOM invocation path (mm/page_alloc.c, simplified) */

struct page *__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                                    struct alloc_context *ac)
{
    struct page *page = NULL;
    unsigned int alloc_flags;

    /* Try direct reclaim */
    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                                        &did_some_progress);
    if (page)
        return page;

    /* Try compaction for high-order allocations */
    if (order > 0) {
        page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                            compact_priority);
        if (page)
            return page;
    }

    /* Check if we should invoke OOM */
    if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                             did_some_progress > 0))
        goto retry;                      /* Keep trying reclaim */

    /* Should we invoke OOM killer? */
    if (gfp_mask & __GFP_NOFAIL) {
        /* This allocation MUST succeed - invoke OOM */
        if (oom_killer_disabled)
            wait_event_freezable(oom_wait, false);   /* Wait forever */
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
            return page;
        goto retry;
    }

    /* Allocation can fail - check if OOM would help */
    if (check_oom_conditions()) {
        page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
        if (page)
            return page;
    }

    return NULL;                         /* Allocation failed */
}

/* The actual OOM decision point */
static struct page *__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
                                          struct alloc_context *ac,
                                          unsigned long *did_some_progress)
{
    struct oom_control oc = {
        .zonelist = ac->zonelist,
        .nodemask = ac->nodemask,
        .memcg    = NULL,                /* System-wide OOM */
        .gfp_mask = gfp_mask,
        .order    = order,
    };

    /* Serialize OOM killing - only one at a time */
    if (!mutex_trylock(&oom_lock))
        return NULL;

    /* Try once more after getting lock */
    page = get_page_from_freelist(...);
    if (page)
        goto out;

    /* Invoke the OOM killer */
    if (!out_of_memory(&oc)) {
        /* OOM killer decided not to kill (panic, etc.) */
    }

    *did_some_progress = 1;
out:
    mutex_unlock(&oom_lock);
    return page;
}

When OOM is triggered, the kernel must select which process to kill. This is a complex decision with significant consequences—killing the wrong process could make things worse (killing a critical system service) or be ineffective (killing a small process that frees little memory).
The Selection Criteria:
The OOM killer aims to: free the largest amount of memory while killing the fewest processes, avoid killing processes that are not contributing to the shortage, never touch init or kernel threads, and respect administrator preferences expressed through oom_score_adj.
The oom_score:
Each process has an oom_score visible in /proc/[pid]/oom_score. This score (0-1000+) represents how "good" a candidate the process is for killing:
/* OOM badness scoring (mm/oom_kill.c) */

/**
 * oom_badness - calculate oom_score for a task
 * @p: task struct of which task we should calculate
 * @totalpages: total number of pages in the system
 *
 * The formula is:
 *   points  = (process_rss + swap_usage) / totalpages * 1000
 *   points += oom_score_adj (user-adjustable)
 */
long oom_badness(struct task_struct *p, unsigned long totalpages)
{
    long points;
    long adj;

    /* Never kill init (PID 1) */
    if (is_global_init(p))
        return LONG_MIN;

    /* Never kill kernel threads */
    if (p->flags & PF_KTHREAD)
        return LONG_MIN;

    /* Get user-specified adjustment */
    adj = (long)p->signal->oom_score_adj;

    /* OOM_SCORE_ADJ_MIN (-1000) means "never kill" */
    if (adj == OOM_SCORE_ADJ_MIN)
        return LONG_MIN;

    /* Calculate base score from memory usage */
    points  = get_mm_rss(p->mm);                    /* Resident Set Size */
    points += get_mm_counter(p->mm, MM_SWAPENTS);   /* Swap entries */

    /* Thread group: count children's memory too */
    /* (children inherit parent's oom_score_adj) */

    /* Normalize to 0-1000 scale */
    /* Score of 1000 = using 100% of memory */
    points = points * 1000 / totalpages;

    /* Apply user adjustment */
    /* oom_score_adj ranges from -1000 to +1000 */
    points += adj;

    /* Ensure non-negative (unless OOM_SCORE_ADJ_MIN) */
    if (points < 1)
        points = 1;

    return points;
}

/* Select the worst (highest scoring) process */
static void select_bad_process(struct oom_control *oc)
{
    struct task_struct *p;
    long worst_score = 0;
    struct task_struct *worst = NULL;

    rcu_read_lock();
    for_each_process(p) {
        /* Skip tasks that can't be killed */
        if (oom_unkillable_task(p))
            continue;

        /* Check if task is in target memcg (if applicable) */
        if (!oom_cpuset_eligible(p, oc))
            continue;

        /* Calculate score */
        long score = oom_badness(p, oc->totalpages);

        /* Skip tasks excluded from killing */
        if (score == LONG_MIN)
            continue;

        if (score > worst_score) {
            worst_score = score;
            worst = p;
        }
    }
    rcu_read_unlock();

    oc->chosen = worst;
    oc->chosen_points = worst_score;
}

The OOM score is based primarily on Resident Set Size (RSS)—the amount of physical memory actually used by the process. Virtual memory size doesn't matter because OOM is about freeing physical memory. A process with 10 GB virtual size but only 100 MB RSS is a poor candidate compared to one with 500 MB RSS.
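Because the scoring formula is so simple, it can be approximated from userspace. The sketch below (the helper name read_kb_field is ours, and recent kernels also fold page-table memory into the score, so expect small differences) reads VmRSS and VmSwap from /proc/[pid]/status, MemTotal from /proc/meminfo, and the process's oom_score_adj, then applies the same normalization shown above; the result should land near the kernel's /proc/[pid]/oom_score.

/*
 * Estimate a process's oom_score from userspace using the rss + swap
 * formula shown above. A sketch only; expect small differences from
 * /proc/<pid>/oom_score on recent kernels.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read a "Key:   12345 kB" style field from a /proc file, in kB. */
static long read_kb_field(const char *path, const char *key)
{
    char line[256];
    long val = 0;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

int main(int argc, char **argv)
{
    char path[64];
    long adj = 0;
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/status", argv[1]);
    long rss_kb   = read_kb_field(path, "VmRSS:");
    long swap_kb  = read_kb_field(path, "VmSwap:");
    long total_kb = read_kb_field("/proc/meminfo", "MemTotal:");

    snprintf(path, sizeof(path), "/proc/%s/oom_score_adj", argv[1]);
    f = fopen(path, "r");
    if (f) {
        if (fscanf(f, "%ld", &adj) != 1)
            adj = 0;
        fclose(f);
    }

    if (rss_kb < 0 || total_kb <= 0)
        return 1;

    /* points = (rss + swap) / totalpages * 1000, then add oom_score_adj.
     * Working in kB on both sides keeps the ratio identical. */
    long points = (rss_kb + swap_kb) * 1000 / total_kb + adj;
    if (points < 1)
        points = 1;

    printf("estimated oom_score for pid %s: %ld\n", argv[1], points);
    return 0;
}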
The kernel allows administrators to influence OOM victim selection through the oom_score_adj mechanism. This is crucial for protecting critical services and ensuring less important processes are killed first.
oom_score_adj Range:
The adjustment ranges from -1000 to +1000 and is added directly to the normalized badness score. A value of -1000 (OOM_SCORE_ADJ_MIN) exempts the process from OOM killing entirely, 0 is the default, and +1000 makes the process the preferred victim.
Common Use Cases:
# === Viewing OOM Scores ===

# Current OOM score (kernel-calculated)
cat /proc/self/oom_score
# 150

# Current adjustment (user-set)
cat /proc/self/oom_score_adj
# 0

# View for all processes
ps -eo pid,comm,oom_score,oom_score_adj --sort=-oom_score | head -20
#   PID COMMAND    OOM_SCORE  OOM_SCORE_ADJ
# 15234 firefox          823              0
# 12456 chrome           756              0
# 23456 slack            412              0
#  1234 sshd               2           -900

# === Setting oom_score_adj ===

# Protect a critical service (e.g., database)
echo -900 > /proc/$(pidof postgres)/oom_score_adj

# Mark a process as expendable
echo 500 > /proc/$(pidof batch_worker)/oom_score_adj

# Make process immune to OOM (requires root)
echo -1000 > /proc/$(pidof critical_daemon)/oom_score_adj

# Make process always first candidate
echo 1000 > /proc/$(pidof test_process)/oom_score_adj

# === Systemd Integration ===
# In service unit files:
# [Service]
# OOMScoreAdjust=-900

# Example: /etc/systemd/system/database.service
cat << 'EOF'
[Unit]
Description=Critical Database Service

[Service]
ExecStart=/usr/bin/database_server
OOMScoreAdjust=-900
Restart=always

[Install]
WantedBy=multi-user.target
EOF

# === Protecting SSH ===
# sshd is often protected to ensure remote access during OOM
cat /proc/$(pidof sshd)/oom_score_adj
# -900 (typical default set by systemd)

# === Docker/Container OOM adjustment ===
# In docker-compose.yml:
# services:
#   myservice:
#     oom_score_adj: -500

# Docker run:
docker run --oom-score-adj=-500 myimage

| Process Type | Suggested Value | Rationale |
|---|---|---|
| SSH daemon | -900 | Maintain remote access during OOM |
| Database | -800 to -900 | Critical data service |
| Init system (systemd) | -900 | Core system process |
| Web server | -200 to -400 | Important but not critical |
| Worker processes | 0 | Default, normal priority |
| Batch jobs | 200 to 500 | Expendable, kill first |
| Test/dev processes | 800 to 1000 | Kill before production |
Setting oom_score_adj to -1000 makes a process completely immune to OOM killing. If too many processes are immune and they collectively exhaust memory, the OOM killer cannot free any memory and the system may hang. Use -1000 sparingly for only the most critical processes.
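Beyond systemd and container settings, a long-running service can also apply protection to itself at startup by writing to /proc/self/oom_score_adj. The sketch below is a minimal example, not a fixed convention; the -900 value mirrors the table above, and lowering the score requires CAP_SYS_RESOURCE (typically root), while raising it never does.

/*
 * Lower our own OOM score at startup so the OOM killer prefers other
 * victims. The -900 value is an illustrative choice.
 */
#include <stdio.h>
#include <errno.h>
#include <string.h>

static int set_oom_score_adj(int adj)
{
    FILE *f = fopen("/proc/self/oom_score_adj", "w");

    if (!f)
        return -errno;
    if (fprintf(f, "%d\n", adj) < 0) {
        fclose(f);
        return -EIO;
    }
    if (fclose(f) != 0)
        return -errno;
    return 0;
}

int main(void)
{
    int err = set_oom_score_adj(-900);

    if (err)
        fprintf(stderr, "could not adjust OOM score: %s\n", strerror(-err));

    /* ... continue with normal service initialization ... */
    return 0;
}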
Once a victim is selected, the OOM killer executes a carefully orchestrated sequence to terminate the process and reclaim its memory.
/* OOM kill execution (mm/oom_kill.c) */

static void __oom_kill_process(struct task_struct *victim, const char *message)
{
    struct task_struct *p;
    struct task_struct *t;
    struct mm_struct *mm;

    /* Get victim's mm_struct */
    mm = victim->mm;

    /* Log the OOM kill event */
    pr_err("%s: Killed process %d (%s) total-vm:%ldkB, "
           "anon-rss:%ldkB, file-rss:%ldkB, shmem-rss:%ldkB, "
           "oom_score_adj:%hd\n",
           message, task_pid_nr(victim), victim->comm,
           K(mm->total_vm),
           K(get_mm_counter(mm, MM_ANONPAGES)),
           K(get_mm_counter(mm, MM_FILEPAGES)),
           K(get_mm_counter(mm, MM_SHMEMPAGES)),
           victim->signal->oom_score_adj);

    /* Dump system/process memory info */
    dump_header(oc);

    /* Mark the mm as OOM victim - prevents further faults */
    mark_mm_as_oom_victim(mm);

    /*
     * Set TIF_MEMDIE flag - grants access to memory reserves
     * This allows the victim to exit cleanly even under
     * extreme memory pressure
     */
    mark_oom_victim(victim);

    /* Wake the oom_reaper to reclaim memory in parallel */
    wake_oom_reaper(victim);

    /* Send SIGKILL to the victim */
    do_send_sig_info(SIGKILL, SEND_SIG_PRIV, victim, PIDTYPE_TGID);

    /* Kill all threads in the process */
    for_each_thread(victim, t) {
        if (t != victim)
            do_send_sig_info(SIGKILL, SEND_SIG_PRIV, t, PIDTYPE_PID);
    }

    /* Also kill processes sharing the same mm (unusual) */
    for_each_process(p) {
        if (p->mm == mm && !same_thread_group(p, victim)) {
            pr_err("oom_kill: Killing multi-threaded process %d (%s)\n",
                   task_pid_nr(p), p->comm);
            do_send_sig_info(SIGKILL, SEND_SIG_PRIV, p, PIDTYPE_TGID);
        }
    }
}

/*
 * OOM logged message format (dmesg output):
 *
 * [12345.678901] Out of memory: Killed process 4567 (firefox)
 *                total-vm:4567890kB, anon-rss:1234567kB,
 *                file-rss:123456kB, shmem-rss:12345kB,
 *                oom_score_adj:0
 */

Key Steps in OOM Kill:
Logging: Extensive information is logged to help diagnose why OOM occurred
TIF_MEMDIE flag: Grants the victim access to memory reserves, allowing it to execute exit handlers even under severe pressure
SIGKILL: Sends an un-catchable signal to terminate the process
Thread group kill: All threads in the process group are killed, not just the main thread
oom_reaper: A kernel thread that can reclaim memory from the victim's address space without waiting for the victim to fully exit (handles victims stuck in D-state)
A subtle but critical problem: the OOM-killed process must run to exit. But if it's stuck waiting for memory (D-state), it can't run at all. The system deadlocks.
The Solution: oom_reaper
Linux 4.6 introduced the oom_reaper kernel thread. Instead of waiting for the victim to exit, the reaper walks the victim's address space and directly unmaps its private anonymous memory, returning those pages to the free lists.
This bypasses the deadlock—memory is freed without requiring the victim to execute.
/* OOM Reaper (mm/oom_kill.c) */

static struct task_struct *oom_reaper_th;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
static struct task_struct *oom_reaper_list;

/* Wake the reaper for a new victim */
static void wake_oom_reaper(struct task_struct *victim)
{
    /* Add victim to reaper's work list */
    victim->oom_reaper_list = oom_reaper_list;
    oom_reaper_list = victim;

    /* Wake the reaper thread */
    wake_up(&oom_reaper_wait);
}

/* The reaper thread */
static int oom_reaper(void *unused)
{
    set_freezable();

    while (true) {
        struct task_struct *victim;

        /* Wait for work */
        wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);

        /* Process all victims on list */
        while ((victim = oom_reaper_list) != NULL) {
            /* Remove from list */
            oom_reaper_list = victim->oom_reaper_list;

            /* Attempt to reap the victim's memory */
            oom_reap_task(victim);
        }
    }

    return 0;
}

/* Actually reap the memory */
static void oom_reap_task(struct task_struct *victim)
{
    struct mm_struct *mm = victim->mm;
    struct vm_area_struct *vma;

    if (!mm || !mmget_not_zero(mm))
        return;                          /* Already dead */

    /*
     * Only reap anonymous memory (private, not file-backed)
     * File-backed pages may be dirty and need writeback
     */
    down_read(&mm->mmap_lock);

    for_each_vma(mm, vma) {
        /* Skip non-anonymous VMAs */
        if (vma->vm_file)
            continue;
        if (vma->vm_flags & (VM_SHARED | VM_HUGETLB))
            continue;

        /* Unmap and free the pages */
        unmap_page_range(&mm->mmap, vma, vma->vm_start, vma->vm_end, NULL);
    }

    up_read(&mm->mmap_lock);

    pr_info("oom_reaper: reaped process %d (%s), freed anonymous memory\n",
            task_pid_nr(victim), victim->comm);

    mmput(mm);
}

The oom_reaper only frees anonymous (non-file-backed) memory because file-backed dirty pages must be written back to disk. Unmapping them without writeback would cause data loss. Anonymous pages (stack, heap, private mmap) have no backing store, so they can be safely discarded.
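The same property can be demonstrated from userspace: private anonymous pages have no backing store, so a process can discard them at any moment and receive zero-filled pages on the next access. The sketch below uses madvise(MADV_DONTNEED) on an anonymous mapping as an analogue of what the reaper does to a victim's VMAs; it is an illustration, not the reaper mechanism itself.

/*
 * Show that anonymous memory can be discarded without writeback:
 * after MADV_DONTNEED the physical pages are freed, and the next
 * access sees zero-filled pages.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
    size_t len = 16UL << 20;             /* 16 MB of private anonymous memory */
    char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

    if (buf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }

    memset(buf, 0xAB, len);              /* fault in the pages (RSS grows) */
    printf("before discard: buf[0] = 0x%02x\n", (unsigned char)buf[0]);

    /* Discard the pages: no writeback needed, contents are simply lost */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        perror("madvise");
        return 1;
    }

    /* The next touch faults in fresh zero pages */
    printf("after discard:  buf[0] = 0x%02x\n", (unsigned char)buf[0]);

    munmap(buf, len);
    return 0;
}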
Memory cgroups (cgroup v2 memory controller) add another dimension to OOM handling. Each cgroup can have memory limits, and OOM can be scoped to a cgroup rather than affecting the entire system.
Cgroup-Level OOM:
When a cgroup exceeds its memory limit:
The kernel first tries to reclaim pages charged to that cgroup.
If reclaim cannot bring usage back under memory.max, a cgroup-scoped OOM kill is triggered.
Victim selection considers only processes belonging to that cgroup (or, with memory.oom.group enabled, the entire cgroup is killed together).
This is crucial for container isolation—a runaway container shouldn't affect other containers.
# === Cgroup v2 Memory Controller ===

# Enable the memory controller for child cgroups (written in the parent)
echo "+memory" > /sys/fs/cgroup/cgroup.subtree_control

# Create a cgroup
mkdir -p /sys/fs/cgroup/myapp

# Set memory limit (512 MB)
echo 536870912 > /sys/fs/cgroup/myapp/memory.max

# Set swap limit (1 GB, if swap is enabled)
echo 1073741824 > /sys/fs/cgroup/myapp/memory.swap.max

# Move a process to the cgroup
echo $PID > /sys/fs/cgroup/myapp/cgroup.procs

# === OOM Behavior Configuration ===

# View OOM kill events
cat /sys/fs/cgroup/myapp/memory.events
# low 0
# high 12
# max 5
# oom 2
# oom_kill 2
# oom_group_kill 0

# Configure OOM behavior

# Option 1: Kill individual process (default)
# When OOM triggers, kill one process in cgroup

# Option 2: Kill entire cgroup (oom_group_kill)
# When OOM triggers, kill ALL processes in cgroup
echo 1 > /sys/fs/cgroup/myapp/memory.oom.group

# === Systemd/Container Integration ===

# Docker with memory limit:
docker run --memory=512m --memory-swap=512m myimage

# Kubernetes memory limit:
# resources:
#   limits:
#     memory: "512Mi"

# Systemd unit with memory limit:
# [Service]
# MemoryMax=512M
# MemorySwapMax=0

# === Monitoring Cgroup OOM ===

# Watch for OOM events
inotifywait -m /sys/fs/cgroup/myapp/memory.events

# Or parse from dmesg
dmesg | grep -E "memory cgroup out of memory|oom_kill_process"

# === OOM Notification for Application Handling ===

# eventfd-based notification (for applications that want to handle OOM)
# Create an eventfd, register it via cgroup.event_control with
# memory.oom_control (cgroup v1)
# Or use memory.events inotify (cgroup v2)

# Example: Docker OOM event
docker events --filter 'event=oom'

| Aspect | System OOM | Cgroup OOM |
|---|---|---|
| Trigger | System-wide memory exhaustion | Cgroup memory limit exceeded |
| Scope | All user processes considered | Only cgroup members considered |
| Impact | Any process may be killed | Only cgroup processes affected |
| Isolation | None | Other cgroups/containers protected |
| Configuration | oom_score_adj | oom_score_adj + cgroup settings |
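Supervisors and applications can also detect these kills programmatically instead of scraping dmesg: cgroup v2 generates a file-modified notification on memory.events whenever a counter changes, so an inotify watcher is enough. The sketch below assumes the /sys/fs/cgroup/myapp path used in the examples above; adjust it to your hierarchy.

/*
 * Watch a cgroup v2 memory.events file for oom_kill increments.
 * The cgroup path is an assumption for the example.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/inotify.h>
#include <unistd.h>

#define EVENTS_FILE "/sys/fs/cgroup/myapp/memory.events"

/* Parse the "oom_kill <n>" line out of memory.events. */
static long read_oom_kill_count(void)
{
    char line[128];
    long count = -1;
    FILE *f = fopen(EVENTS_FILE, "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "oom_kill %ld", &count) == 1)
            break;
    }
    fclose(f);
    return count;
}

int main(void)
{
    char buf[4096];
    long last = read_oom_kill_count();
    int fd = inotify_init1(0);

    if (fd < 0 || inotify_add_watch(fd, EVENTS_FILE, IN_MODIFY) < 0) {
        perror("inotify");
        return 1;
    }

    printf("watching %s (oom_kill=%ld)\n", EVENTS_FILE, last);

    for (;;) {
        /* Block until the kernel touches memory.events */
        if (read(fd, buf, sizeof(buf)) <= 0)
            break;

        long now = read_oom_kill_count();
        if (now > last) {
            printf("OOM kill detected in cgroup (total %ld)\n", now);
            last = now;
        }
    }

    close(fd);
    return 0;
}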
The best OOM handling is avoiding OOM in the first place. Here are strategies for preventing out-of-memory conditions:
# === Memory Monitoring ===

# Basic memory stats
free -h
#               total        used        free      shared  buff/cache   available
# Mem:           31Gi       8.2Gi       1.5Gi       512Mi        22Gi        22Gi
# Swap:         8.0Gi          0B       8.0Gi

# "available" = memory available for new allocations (free + reclaimable cache)
# This is a better indicator than "free"

# Watch memory over time
watch -n 1 'free -h; echo; vmstat 1 5'

# Per-process memory usage
ps aux --sort=-%mem | head -20

# Detailed memory stats
cat /proc/meminfo | grep -E 'MemTotal|MemFree|MemAvailable|Buffers|Cached|SwapTotal|SwapFree'

# === Overcommit Configuration ===

# View current overcommit settings
cat /proc/sys/vm/overcommit_memory
# 0 = Heuristic overcommit (default)
# 1 = Always overcommit (never fail malloc)
# 2 = Don't overcommit (strict, commit limit = swap + ratio*RAM)

cat /proc/sys/vm/overcommit_ratio
# 50 (default) - when mode=2, limit = swap + 50% RAM

# Disable overcommit (strict memory accounting)
echo 2 > /proc/sys/vm/overcommit_memory
echo 80 > /proc/sys/vm/overcommit_ratio

# View commit limit and committed memory
cat /proc/meminfo | grep -E 'CommitLimit|Committed_AS'
# CommitLimit:    49123456 kB  (how much can be committed)
# Committed_AS:   12345678 kB  (currently committed)

# === Early Warning with OOM Threshold ===

# Set up earlyoom (userspace OOM killer with warnings)
# https://github.com/rfjakob/earlyoom
earlyoom -m 5 -s 5 -n --prefer '(^|/)(java|chromium)$'
# Kills when <5% RAM or <5% swap available
# Notifies before killing
# Prefers to kill java or chromium

# === Memory Pressure (PSI) ===

# View memory pressure
cat /proc/pressure/memory
# some avg10=0.00 avg60=0.00 avg300=0.00 total=12345678
# full avg10=0.00 avg60=0.00 avg300=0.00 total=1234567

# Interpret: "some" means at least one task waiting for memory
# If avg60 > 10%, system is under significant memory pressure

# === Swap Configuration ===

# View swap priority and usage
swapon --show
# NAME      TYPE      SIZE USED PRIO
# /dev/sda2 partition   8G   0B   -2

# Add more swap (emergency)
dd if=/dev/zero of=/swapfile bs=1G count=8
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Adjust swappiness (0-100, higher = swap more aggressively)
cat /proc/sys/vm/swappiness
# 60 (default)
echo 10 > /proc/sys/vm/swappiness   # Prefer keeping data in RAM

Don't panic when 'free' memory is low—Linux uses free RAM for caching. The 'available' metric from free/meminfo shows memory that can actually be allocated (free + easily reclaimable cache). Monitor 'available' falling toward zero, not 'free'.
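The MemAvailable and PSI signals shown above combine naturally into a userspace early-warning check, which is essentially what earlyoom does before resorting to kills. The sketch below uses illustrative thresholds (5% available memory, 10% PSI "some" avg10); it only reports, it does not kill anything.

/*
 * Minimal memory early-warning check: flag trouble when MemAvailable
 * drops below ~5% of MemTotal or when PSI "some" avg10 exceeds 10%.
 * Thresholds are illustrative; real tools such as earlyoom are configurable.
 */
#include <stdio.h>
#include <string.h>

static long meminfo_kb(const char *key)
{
    char line[256];
    long val = -1;
    FILE *f = fopen("/proc/meminfo", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof(line), f)) {
        if (strncmp(line, key, strlen(key)) == 0) {
            sscanf(line + strlen(key), " %ld", &val);
            break;
        }
    }
    fclose(f);
    return val;
}

static double psi_some_avg10(void)
{
    char line[256];
    double avg10 = -1.0;
    FILE *f = fopen("/proc/pressure/memory", "r");

    if (!f)
        return -1.0;    /* PSI requires CONFIG_PSI (kernel >= 4.20) */
    while (fgets(line, sizeof(line), f)) {
        if (sscanf(line, "some avg10=%lf", &avg10) == 1)
            break;
    }
    fclose(f);
    return avg10;
}

int main(void)
{
    long total = meminfo_kb("MemTotal:");
    long avail = meminfo_kb("MemAvailable:");
    double psi = psi_some_avg10();

    if (total <= 0 || avail < 0)
        return 1;

    printf("MemAvailable: %ld kB of %ld kB, PSI some avg10: %.2f%%\n",
           avail, total, psi);

    if (avail < total / 20 || psi > 10.0)
        printf("WARNING: memory pressure high, OOM risk increasing\n");

    return 0;
}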
When OOM kills occur, understanding why is crucial for prevention. The kernel logs extensive information about each OOM event.
# === Viewing OOM Kill Logs ===

# Check dmesg for OOM events
dmesg | grep -A 50 "Out of memory"

# Example OOM log output:
# [12345.678901] myapp invoked oom-killer:
#                gfp_mask=0x1100cca(GFP_HIGHUSER_MOVABLE),
#                order=0, oom_score_adj=0
# [12345.678902] CPU: 3 PID: 12345 Comm: myapp Kdump: loaded Not tainted
# [12345.678903] Hardware name: Dell Inc. PowerEdge R640/...
# [12345.678904] Call Trace:
#                 dump_stack+0x5c/0x80
#                 dump_header+0x4a/0x1e0
#                 out_of_memory+0x2a9/0x550
#                 __alloc_pages_slowpath+0xa5a/0xc30
#                 ...
# [12345.678920] Mem-Info:
# [12345.678921] active_anon:123456 inactive_anon:234567 isolated_anon:0
# [12345.678922] active_file:345678 inactive_file:456789 isolated_file:0
# [12345.678923] unevictable:0 dirty:1234 writeback:0 unstable:0
# [12345.678924] slab_reclaimable:12345 slab_unreclaimable:67890
# [12345.678925] mapped:45678 shmem:12345 pagetables:6789 bounce:0
# [12345.678926] free:1234 free_pcp:123 free_cma:0
# [12345.678930] Node 0 hugepages_total=0 hugepages_free=0 ...
# [12345.678935] Tasks state (memory values in pages):
# [12345.678936] [  pid  ]   uid  tgid total_vm      rss pgtables_bytes swapents oom_score_adj name
# [12345.678937] [   1234]     0  1234   123456    45678        2097152        0          -900 sshd
# [12345.678938] [   5678]  1000  5678  4567890  1234567       10485760        0             0 firefox
# [12345.678939] ... (list of all processes)
# [12345.678990] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),
#                cpuset=/,mems_allowed=0,
#                global_oom,task_memcg=/user.slice/...,
#                task=firefox,pid=5678,uid=1000
# [12345.678991] Out of memory: Killed process 5678 (firefox)
#                total-vm:18271560kB, anon-rss:4938268kB,
#                file-rss:123456kB, shmem-rss:0kB,
#                oom_score_adj:0

# === Parse OOM logs ===

# Extract killed processes
dmesg | grep "Out of memory: Killed process" | tail -20

# Find what triggered OOM
dmesg | grep "invoked oom-killer" | tail -20

# === journalctl (systemd) ===

journalctl -k | grep -i oom
journalctl --since "1 hour ago" | grep -i "out of memory"

# === System-wide OOM counter ===

cat /proc/vmstat | grep oom_kill
# oom_kill 3

# === Per-cgroup OOM events ===

cat /sys/fs/cgroup/user.slice/memory.events
# oom 2
# oom_kill 2

Key Information in OOM Logs:
The allocation that failed: gfp_mask, order, and which task invoked the oom-killer.
Mem-Info: a system-wide snapshot of anonymous, file, slab, and free page counts.
Tasks state: per-process total_vm, rss, swap entries, and oom_score_adj at the moment of the kill.
The final "Out of memory: Killed process" line with the victim's memory footprint and adjustment.
In OOM logs, the 'Tasks state' section lists all processes. Look at the 'rss' column (resident set size, in pages) to identify memory hogs. The process with the highest RSS at the time of the OOM is often the real culprit, even if it was not the process that was killed or the one that invoked the oom-killer.
The OOM killer is a critical safety mechanism that keeps Linux systems running when memory is exhausted. Let's consolidate the key concepts:
OOM is a last resort, invoked only after reclaim, compaction, and swap have all failed to satisfy an allocation.
Victims are chosen by badness score: roughly RSS plus swap usage normalized against total memory, shifted by oom_score_adj.
oom_score_adj lets administrators protect critical services (negative values) or sacrifice expendable ones (positive values); -1000 makes a process immune.
The oom_reaper frees a victim's anonymous memory even if the victim cannot run, avoiding deadlock.
Memory cgroups scope OOM to a container, protecting the rest of the system.
Prevention beats reaction: watch MemAvailable and PSI, configure overcommit deliberately, and use tools like earlyoom where appropriate.
Module Complete:
You have now completed the Linux Memory Management module. You understand:
Virtual address spaces and page tables.
The slab allocator and kernel object caching.
Physical memory zones and watermark-driven reclaim.
The OOM killer: when it triggers, how victims are chosen, and how to tune and prevent it.
This knowledge is essential for kernel development, system administration, performance tuning, and debugging memory-related issues.
Congratulations! You now have an expert-level understanding of Linux memory management internals. From virtual address spaces to page tables, from slab caching to zone management, from watermarks to the OOM killer—you understand how Linux manages the most precious computing resource: memory.