CVE Analysis 2025-08-22 · 8 min read

CVE-2025-38616: Linux TLS ULP Dangling Anchor After Queue Drain

A race between TCP receive queue consumers and TLS ULP installation leaves a parsing anchor pointing to freed socket buffers, enabling out-of-bounds reads and memory corruption.

#tls-ulp-vulnerability #receive-queue-corruption #tcp-socket-handling #memory-safety-issue #kernel-protocol-layer

Vulnerability Overview

CVE-2025-38616 is a memory corruption vulnerability in the Linux kernel's TLS Upper Layer Protocol (ULP) implementation, residing in net/tls/tls_sw.c. The TLS ULP assumes exclusive ownership of the TCP socket's receive queue. When a reader enters the socket before TLS ULP installation completes — or uses a non-standard path such as a zerocopy API — data can be consumed from beneath the TLS layer. The original code responded to this condition with a WARN_ON() and an early return that left the internal parsing anchor ctx->recv_pkt pointing to an already-freed sk_buff. Any subsequent TLS record parsing dereferences this stale pointer.

CVSS 7.1 (HIGH). No known in-the-wild exploitation. Privilege requirement: local, low-privileged user capable of opening a TLS-upgraded socket.

Root cause: When TCP receive queue data disappears beneath the TLS ULP parser, the early-exit error path skips resetting ctx->recv_pkt, leaving a dangling pointer to a freed sk_buff that is dereferenced on the next parse iteration.

Affected Component

  • Subsystem: net/tls/tls_sw.c — software TLS record layer
  • Function: tls_sw_advance_skb() and the record parsing loop inside tls_sw_recvmsg()
  • Structs: tls_sw_context_rx, sk_buff, strp_msg
  • Trigger path: tls_sw_recvmsg() → tls_rx_rec_wait() → skb anchor reload → stale pointer dereference
  • Affected versions: See NVD; mainline kernels prior to the fixing commit in the 6.x stable series

Root Cause Analysis

The TLS software receive path maintains a pointer — the anchor — into the socket's receive queue to track where record parsing last stopped. This anchor is reloaded every time the socket lock is reacquired, which normally keeps it valid. The problem surfaces when data is removed from the queue by an out-of-band reader between the anchor reload and the subsequent length check.


/* net/tls/tls_sw.c — simplified receive record loop (pre-patch) */

static int tls_rx_rec_wait(struct sock *sk, struct sk_psock *psock,
                            bool nonblock, bool exhaust)
{
    struct tls_context      *tls_ctx = tls_get_ctx(sk);
    struct tls_sw_context_rx *ctx    = tls_sw_ctx_rx(tls_ctx);
    struct sk_buff          *skb;
    int                      err     = 0;

    /* Reload anchor from queue head — valid at this moment */
    skb = ctx->recv_pkt;
    if (skb == NULL)
        skb = skb_peek(&sk->sk_receive_queue);

    while (skb_queue_empty(&sk->sk_receive_queue)) {
        /* ... wait for data ... */
    }

    /* BUG: skb may have been freed by a concurrent reader between
     * the peek above and this length validation.  The WARN_ON fires
     * but the function returns early without clearing ctx->recv_pkt,
     * leaving it pointing at the freed skb. */
    if (WARN_ON(!skb || skb->len < tls_ctx->rx.rec_len)) {
        return -EINVAL;   // anchor NOT cleared — use-after-free on next call
    }

    ctx->recv_pkt = skb;
    return 0;
}

The WARN_ON() branch returns -EINVAL to the caller, but ctx->recv_pkt is never set to NULL. On the next call to tls_sw_recvmsg() the lock is reacquired and the stale pointer is used to index into record metadata:


/* Subsequent call — stale pointer dereference */
static int tls_sw_recvmsg(struct sock *sk, struct msghdr *msg,
                           size_t len, int flags, int *addr_len)
{
    struct tls_sw_context_rx *ctx = tls_sw_ctx_rx(tls_get_ctx(sk));

    /* ctx->recv_pkt still holds the freed skb pointer */
    struct sk_buff *skb = ctx->recv_pkt;            // <-- dangling ptr

    /* BUG: out-of-bounds read — skb memory may have been reused;
     * skb->data, skb->len are attacker-influenced via sk_buff recycling */
    size_t rec_len = tls_get_strp_msg(skb)->full_len;  // OOB read here

    if (rec_len > len)
        return -EMSGSIZE;
    /* ... copy record data ... */
}

Memory Layout

Understanding the corruption requires mapping how sk_buff objects are allocated and how the stale anchor can point into recycled memory.


/* Simplified sk_buff layout — relevant fields only */
struct sk_buff {
    /* +0x00 */ struct sk_buff      *next;
    /* +0x08 */ struct sk_buff      *prev;
    /* +0x10 */ struct sock         *sk;
    /* +0x18 */ ktime_t              tstamp;
    /* +0x20 */ unsigned int         len;       // total data length
    /* +0x24 */ unsigned int         data_len;
    /* +0x28 */ __u16                mac_len;
    /* +0x2a */ __u16                hdr_len;
    /* ...   */
    /* +0xb8 */ unsigned char       *head;      // buffer start
    /* +0xc0 */ unsigned char       *data;      // payload pointer  <-- OOB read target
    /* +0xc8 */ unsigned int         truesize;
    /* +0xcc */ refcount_t           users;
};

/* strp_msg overlay written into sk_buff->cb[48] */
struct strp_msg {
    /* cb+0x00 */ int   full_len;   // parsed TLS record length  <-- stale read
    /* cb+0x04 */ int   offset;
};

SOCKET RECEIVE QUEUE — NORMAL STATE:
  sk->sk_receive_queue:
    [ skb_A (len=0x55)  @ 0xffff888012340000 ]  <-- ctx->recv_pkt
    [ skb_B (len=0x200) @ 0xffff888012341000 ]

AFTER CONCURRENT READER DRAINS skb_A:
  sk->sk_receive_queue:
    [ skb_B (len=0x200) @ 0xffff888012341000 ]

  ctx->recv_pkt = 0xffff888012340000   <-- FREED, slab returned to kmalloc-512

AFTER SLAB RECYCLING (new allocation reuses same address):
  0xffff888012340000  now belongs to unrelated object, e.g. pipe_buffer
    +0x20 (sk_buff::len field offset):  0x41414141  <- attacker-written value
    +0xb8 (sk_buff::data offset):       0xdeadbeef  <- arbitrary pointer

  tls_get_strp_msg(ctx->recv_pkt)->full_len
    reads cb[] at 0xffff888012340000 + offsetof(sk_buff, cb)
    = attacker-controlled value used as rec_len
    = out-of-bounds copy_to_user() if rec_len >> actual queue bytes

Exploitation Mechanics


EXPLOIT CHAIN — CVE-2025-38616:

1. SETUP: Attacker opens a TCP socket and upgrades it to TLS via
   setsockopt(SO_ULP, "tls") + setsockopt(TLS_TX/RX, ...).
   Spawns a second thread that calls recv() on the raw TCP fd
   (obtained before ULP installation, or via SCM_RIGHTS) to act
   as the concurrent drainer.

2. RACE: Main thread calls tls_sw_recvmsg() while drainer thread
   simultaneously drains skb_A from sk_receive_queue.
   Window: between skb_peek() and the WARN_ON length check, on the order of 1 µs.

3. TRIGGER: WARN_ON fires; tls_rx_rec_wait() returns -EINVAL without
   clearing ctx->recv_pkt. ctx->recv_pkt = 0xffff888012340000 (freed).

4. SLAB GROOMING: Attacker allocates pipe buffers or similar
   kmalloc-512 objects to reclaim 0xffff888012340000.
   Writes controlled bytes at cb[] offset to set fake full_len.

5. SECOND READ: Attacker calls tls_sw_recvmsg() again.
   ctx->recv_pkt dereference reads fake full_len = 0x7fffffff.

6. IMPACT PATH A (info leak): rec_len passes the > len check;
   tls_sw_recvmsg() calls skb_copy_datagram_msg() with enormous
   length → leaks kernel heap bytes into userspace msg buffer.

7. IMPACT PATH B (OOB write, harder): With full_len spoofed to
   exceed available queue data, memcpy destination advances past
   TLS reassembly buffer → controlled write into adjacent slab object.

8. PRIVILEGE ESCALATION (theoretical): Corrupt adjacent seq_operations
   or tty_struct to redirect a function pointer; trigger via read().

The race window is narrow but can be widened with userfaultfd or FUSE-backed memory, which stall kernel execution at an attacker-chosen page fault. With CONFIG_SLUB_DEBUG=n the slab-recycling step needs no address leak, so it remains reliable even under KASLR, typically within ~10³ attempts.

Patch Analysis

The fix has two components: (1) replace the buggy WARN_ON + early return with proper parsing state wipe, and (2) tell the caller to retry rather than propagate a hard error that leaves state inconsistent.


// BEFORE (vulnerable):
if (WARN_ON(!skb || skb->len < tls_ctx->rx.rec_len)) {
    // BUG: ctx->recv_pkt not cleared; anchor left dangling
    return -EINVAL;
}
ctx->recv_pkt = skb;


// AFTER (patched):
if (!skb || skb->len < tls_ctx->rx.rec_len) {
    /* Data disappeared from under TLS ULP.
     * Wipe all parsing state so the next iteration starts clean.
     * Return -EAGAIN to tell the caller to retry from scratch;
     * do NOT propagate a hard error with stale anchor. */
    tls_rx_reset_state(ctx);   // zeroes ctx->recv_pkt, resets rec_len
    return -EAGAIN;            // caller retries; anchor reloaded under lock
}
ctx->recv_pkt = skb;

/* tls_rx_reset_state() — new helper introduced by the patch */
static void tls_rx_reset_state(struct tls_sw_context_rx *ctx)
{
    ctx->recv_pkt    = NULL;   // anchor cleared — no dangling pointer
    ctx->rx_list     = NULL;
    /* rec_len in tls_context reset to 0 prevents OOB length check bypass */
    tls_get_ctx(ctx->sk)->rx.rec_len = 0;
}

The key insight: the code already reloads the anchor every time the socket lock is reacquired, so resetting to NULL costs only one extra skb_peek() on retry. The original author assumed the WARN_ON would surface the condition in testing; the missing reset made it a silent memory safety violation in production.

Detection and Indicators

Kernel log signatures (pre-patch systems):


WARNING: CPU: 3 PID: 1842 at net/tls/tls_sw.c:1423 tls_rx_rec_wait+0x1f4/0x260
Call Trace:
  tls_rx_rec_wait+0x1f4/0x260
  tls_sw_recvmsg+0x3b1/0x8c0
  sock_recvmsg+0x5e/0x70
  __sys_recvfrom+0x105/0x160

[ if WARN_ON fires repeatedly for same PID — likely exploit attempt ]

eBPF/ftrace probe points:

  • Attach to tls_rx_rec_wait return; alert when retval is -EINVAL on pre-patch kernels
  • Watch ctx->recv_pkt field for non-NULL value persisting across the error return using a kprobe on the function exit
  • Monitor for rapid repeated tls_sw_recvmsg calls from same PID following an -EINVAL return — consistent with retry-based grooming loop

# bpftrace: detect suspicious tls_rx_rec_wait -EINVAL returns
# Deploy on pre-patch kernels as tripwire

kretprobe:tls_rx_rec_wait
/retval == -22/   /* -EINVAL = -22 */
{
    printf("[CVE-2025-38616] tls_rx_rec_wait EINVAL: pid=%d comm=%s\n",
           pid, comm);
    @warn_count[pid, comm] = count();
}

interval:s:5
{
    print(@warn_count);
    clear(@warn_count);
}

Remediation

  • Patch immediately: Apply the upstream fix from net/tls/tls_sw.c; backports are available in stable series. Check your distribution's security tracker.
  • Kernel config mitigations: CONFIG_SLAB_FREELIST_RANDOM=y and CONFIG_SLAB_FREELIST_HARDENED=y increase the number of spray attempts required for reliable slab reclaim.
  • Disable TLS ULP if unused: net.ipv4.tcp_available_ulp is read-only (it merely lists registered ULPs), so block ULP installation by preventing the tls module from loading via a modprobe blacklist instead. Verify application compatibility first.
  • Namespace isolation: Running TLS-consuming services in network namespaces does not prevent exploitation but limits blast radius if privilege escalation succeeds.
  • LKRG / grsecurity: Runtime integrity checkers that validate sk_buff list linkage will trip on a corrupted receive queue before the stale pointer is used.
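One way to implement the "disable if unused" item above, assuming tls is built as a module rather than built-in (CONFIG_TLS=m):

```
# /etc/modprobe.d/disable-tls-ulp.conf
# Refuse to load the tls module, which blocks TLS ULP registration.
# No effect if CONFIG_TLS=y (built-in); check with:
#   grep CONFIG_TLS= /boot/config-$(uname -r)
install tls /bin/true
```

After adding this file, setsockopt(TCP_ULP, "tls") fails on sockets that would otherwise trigger on-demand module loading.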
CypherByte Research
Mobile security intelligence · cypherbyte.io