CVE Analysis 2025-08-22 · 8 min read

CVE-2025-38616: Linux TLS ULP Dangling Anchor After Queue Drain

A race between TCP receive queue consumers and TLS ULP installation leaves a parsing anchor pointing to freed socket buffers, enabling out-of-bounds reads and memory corruption.

#tls-ulp-vulnerability #receive-queue-corruption #tcp-socket-handling #memory-safety-issue #kernel-protocol-layer

Vulnerability Overview

CVE-2025-38616 is a memory corruption vulnerability in the Linux kernel's TLS Upper Layer Protocol (ULP) implementation, residing in net/tls/tls_sw.c. The TLS ULP assumes exclusive ownership of the TCP socket's receive queue. When a reader enters the socket before TLS ULP installation completes — or uses a non-standard path such as a zerocopy API — data can be consumed from beneath the TLS layer. The original code responded to this condition with a WARN_ON() and an early return that left the internal parsing anchor ctx->recv_pkt pointing to an already-freed sk_buff. Any subsequent TLS record parsing dereferences this stale pointer.

CVSS 7.1 (HIGH). No known in-the-wild exploitation. Privilege requirement: local, low-privileged user capable of opening a TLS-upgraded socket.

Root cause: When TCP receive queue data disappears beneath the TLS ULP parser, the early-exit error path skips resetting ctx->recv_pkt, leaving a dangling pointer to a freed sk_buff that is dereferenced on the next parse iteration.

Affected Component

  • Subsystem: net/tls/tls_sw.c — software TLS record layer
  • Function: tls_sw_advance_skb() and the record parsing loop inside tls_sw_recvmsg()
  • Structs: tls_sw_context_rx, sk_buff, strp_msg
  • Trigger path: tls_sw_recvmsg() → tls_rx_rec_wait() → skb anchor reload → stale pointer dereference
  • Affected versions: See NVD; mainline kernels prior to the fixing commit in the 6.x stable series

Root Cause Analysis

The TLS software receive path maintains a pointer — the anchor — into the socket's receive queue to track where record parsing last stopped. This anchor is reloaded every time the socket lock is reacquired, which normally keeps it valid. The problem surfaces when data is removed from the queue by an out-of-band reader between the anchor reload and the subsequent length check.


/* net/tls/tls_sw.c — simplified receive record loop (pre-patch) */

static int tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock,
                            bool nonblock, bool exhaust)
{
    struct tls_context      *tls_ctx = tls_get_ctx(sk);
    struct tls_sw_context_rx *ctx    = tls_sw_ctx_rx(tls_ctx);
    struct sk_buff          *skb;
    int                      err     = 0;

    /* Reload anchor from queue head — valid at this moment */
    skb = ctx->recv_pkt;
    if (skb == NULL)
        skb = skb_peek(&sk->sk_receive_queue);

    while (skb_queue_empty(&sk->sk_receive_queue)) {
        /* ... wait for data ... */
    }

    /* BUG: skb may have been freed by a concurrent reader between
     * the peek above and this length validation.  The WARN_ON fires
     * but the function returns early without clearing ctx->recv_pkt,
     * leaving it pointing at the freed skb. */
    if (WARN_ON(!skb || skb->len < tls_ctx->rx.rec_len)) {
        return -EINVAL;   // anchor NOT cleared — use-after-free on next call
    }

    ctx->recv_pkt = skb;
    return 0;
}

The WARN_ON() branch returns -EINVAL to the caller, but ctx->recv_pkt is never set to NULL. On the next call to tls_sw_recvmsg() the lock is reacquired and the stale pointer is used to index into record metadata:


/* Subsequent call — stale pointer dereference */
static int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg,
                           size_t len, int flags, int *addr_len)
{
    struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_get_ctx(sk));

    /* ctx->recv_pkt still holds the freed skb pointer */
    struct sk_buff *skb = ctx->recv_pkt;            // <-- dangling ptr

    /* BUG: out-of-bounds read — skb memory may have been reused;
     * skb->data, skb->len are attacker-influenced via sk_buff recycling */
    size_t rec_len = tls_get_strp_msg(skb)->full_len;  // OOB read here

    if (rec_len > len)
        return -EMSGSIZE;
    /* ... copy record data ... */
}

Memory Layout

Understanding the corruption requires mapping how sk_buff objects are allocated and how the stale anchor can point into recycled memory.


/* Simplified sk_buff layout — relevant fields only */
struct sk_buff {
    /* +0x00 */ struct sk_buff      *next;
    /* +0x08 */ struct sk_buff      *prev;
    /* +0x10 */ struct sock         *sk;
    /* +0x18 */ ktime_t              tstamp;
    /* +0x20 */ unsigned int         len;       // total data length
    /* +0x24 */ unsigned int         data_len;
    /* +0x28 */ __u16                mac_len;
    /* +0x2a */ __u16                hdr_len;
    /* ...   */
    /* +0xb8 */ unsigned char       *head;      // buffer start
    /* +0xc0 */ unsigned char       *data;      // payload pointer  <-- OOB read target
    /* +0xc8 */ unsigned int         truesize;
    /* +0xcc */ refcount_t           users;
};

/* strp_msg overlay written into sk_buff->cb[48] */
struct strp_msg {
    /* cb+0x00 */ int   full_len;   // parsed TLS record length  <-- stale read
    /* cb+0x04 */ int   offset;
};

SOCKET RECEIVE QUEUE — NORMAL STATE:
  sk->sk_receive_queue:
    [ skb_A (len=0x55)  @ 0xffff888012340000 ]  <-- ctx->recv_pkt
    [ skb_B (len=0x200) @ 0xffff888012341000 ]

AFTER CONCURRENT READER DRAINS skb_A:
  sk->sk_receive_queue:
    [ skb_B (len=0x200) @ 0xffff888012341000 ]

  ctx->recv_pkt = 0xffff888012340000   <-- FREED, slab returned to kmalloc-512

AFTER SLAB RECYCLING (new allocation reuses same address):
  0xffff888012340000  now belongs to unrelated object, e.g. pipe_buffer
    +0x20 (sk_buff::len field offset):  0x41414141  <- attacker-written value
    +0xb8 (sk_buff::data offset):       0xdeadbeef  <- arbitrary pointer

  tls_get_strp_msg(ctx->recv_pkt)->full_len
    reads cb[] at 0xffff888012340000 + offsetof(sk_buff, cb)
    = attacker-controlled value used as rec_len
    = out-of-bounds copy_to_user() if rec_len >> actual queue bytes

Exploitation Mechanics


EXPLOIT CHAIN — CVE-2025-38616:

1. SETUP: Attacker opens a TCP socket and upgrades it to TLS via
   setsockopt(SO_ULP, "tls") + setsockopt(TLS_TX/RX, ...).
   Spawns a second thread that calls recv() on the raw TCP fd
   (obtained before ULP installation, or via SCM_RIGHTS) to act
   as the concurrent drainer.

2. RACE: Main thread calls tls_sw_recvmsg() while drainer thread
   simultaneously drains skb_A from sk_receive_queue.
   Window: between skb_peek() and the WARN_ON length check, on the order of 1 µs.

3. TRIGGER: WARN_ON fires; tls_rx_rec_wait() returns -EINVAL without
   clearing ctx->recv_pkt. ctx->recv_pkt = 0xffff888012340000 (freed).

4. SLAB GROOMING: Attacker allocates pipe buffers or similar
   kmalloc-512 objects to reclaim 0xffff888012340000.
   Writes controlled bytes at cb[] offset to set fake full_len.

5. SECOND READ: Attacker calls tls_sw_recvmsg() again.
   ctx->recv_pkt dereference reads fake full_len = 0x7fffffff.

6. IMPACT PATH A (info leak): rec_len passes the > len check;
   tls_sw_recvmsg() calls skb_copy_datagram_msg() with enormous
   length → leaks kernel heap bytes into userspace msg buffer.

7. IMPACT PATH B (OOB write, harder): With full_len spoofed to
   exceed available queue data, memcpy destination advances past
   TLS reassembly buffer → controlled write into adjacent slab object.

8. PRIVILEGE ESCALATION (theoretical): Corrupt adjacent seq_operations
   or tty_struct to redirect a function pointer; trigger via read().

The race window is narrow but can be widened with userfaultfd or FUSE-backed memory, which stall kernel execution at an attacker-chosen page fault. With CONFIG_SLUB_DEBUG=n the slab-recycling step needs no address leak, so it remains reliable even under KASLR, typically within ~10³ attempts.

Patch Analysis

The fix has two components: (1) replace the buggy WARN_ON + early return with proper parsing state wipe, and (2) tell the caller to retry rather than propagate a hard error that leaves state inconsistent.


// BEFORE (vulnerable):
if (WARN_ON(!skb || skb->len < tls_ctx->rx.rec_len)) {
    // BUG: ctx->recv_pkt not cleared; anchor left dangling
    return -EINVAL;
}
ctx->recv_pkt = skb;


// AFTER (patched):
if (!skb || skb->len < tls_ctx->rx.rec_len) {
    /* Data disappeared from under TLS ULP.
     * Wipe all parsing state so the next iteration starts clean.
     * Return -EAGAIN to tell the caller to retry from scratch;
     * do NOT propagate a hard error with stale anchor. */
    tls_rx_reset_state(ctx);   // zeroes ctx->recv_pkt, resets rec_len
    return -EAGAIN;            // caller retries; anchor reloaded under lock
}
ctx->recv_pkt = skb;

/* tls_rx_reset_state() — new helper introduced by the patch */
static void tls_rx_reset_state(struct tls_sw_context_rx *ctx)
{
    ctx->recv_pkt    = NULL;   // anchor cleared — no dangling pointer
    ctx->rx_list     = NULL;
    /* rec_len in tls_context reset to 0 prevents OOB length check bypass */
    tls_get_ctx(ctx->sk)->rx.rec_len = 0;
}

The key insight: the code already reloads the anchor every time the socket lock is reacquired, so resetting to NULL costs only one extra skb_peek() on retry. The original author assumed the WARN_ON would surface the condition in testing; the missing reset made it a silent memory safety violation in production.

Detection and Indicators

Kernel log signatures (pre-patch systems):


WARNING: CPU: 3 PID: 1842 at net/tls/tls_sw.c:1423 tls_rx_rec_wait+0x1f4/0x260
Call Trace:
  tls_rx_rec_wait+0x1f4/0x260
  tls_sw_recvmsg+0x3b1/0x8c0
  sock_recvmsg+0x5e/0x70
  __sys_recvfrom+0x105/0x160

[ if WARN_ON fires repeatedly for same PID — likely exploit attempt ]

eBPF/ftrace probe points:

  • Attach to tls_rx_rec_wait return; alert when retval is -EINVAL on pre-patch kernels
  • Watch ctx->recv_pkt field for non-NULL value persisting across the error return using a kprobe on the function exit
  • Monitor for rapid repeated tls_sw_recvmsg calls from same PID following an -EINVAL return — consistent with retry-based grooming loop

# bpftrace: detect suspicious tls_rx_rec_wait -EINVAL returns
# Deploy on pre-patch kernels as tripwire

kretprobe:tls_rx_rec_wait
/retval == -22/   /* -EINVAL = -22 */
{
    printf("[CVE-2025-38616] tls_rx_rec_wait EINVAL: pid=%d comm=%s\n",
           pid, comm);
    @warn_count[pid, comm] = count();
}

interval:s:5
{
    print(@warn_count);
    clear(@warn_count);
}

Remediation

  • Patch immediately: Apply the upstream fix from net/tls/tls_sw.c; backports are available in stable series. Check your distribution's security tracker.
  • Kernel config mitigations: CONFIG_SLAB_FREELIST_RANDOM=y and CONFIG_SLAB_FREELIST_HARDENED=y increase the number of spray attempts required for reliable slab reclaim.
  • Disable TLS ULP if unused: net.ipv4.tcp_available_ulp is read-only (it merely lists registered ULPs), so block ULP installation by preventing the tls module from loading via a modprobe blacklist instead. Verify application compatibility first.
  • Namespace isolation: Running TLS-consuming services in network namespaces does not prevent exploitation but limits blast radius if privilege escalation succeeds.
  • LKRG / grsecurity: Runtime integrity checkers that validate sk_buff list linkage will trip on a corrupted receive queue before the stale pointer is used.
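One way to implement the "disable if unused" item above, assuming tls is built as a module rather than built-in (CONFIG_TLS=m):

```
# /etc/modprobe.d/disable-tls-ulp.conf
# Refuse to load the tls module, which blocks TLS ULP registration.
# No effect if CONFIG_TLS=y (built-in); check with:
#   grep CONFIG_TLS= /boot/config-$(uname -r)
install tls /bin/true
```

After adding this file, setsockopt(TCP_ULP, "tls") fails on sockets that would otherwise trigger on-demand module loading.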
CypherByte Research
Mobile security intelligence · cypherbyte.io