Renwei Meng

VulnAgent-X: From One-Shot Vulnerability Guessing to Evidence-Grounded Agentic Auditing

A layered multi-agent workflow for repository-level bug and vulnerability detection: risk triage → context expansion → specialist agents → selective verification → evidence fusion, aiming for higher recall, fewer false positives, and better interpretability.

Sep 1, 2025 · Ongoing · First author
Security · Vulnerability Detection · Agents · LLM · Code Intelligence · Repository-level · Verification · Explainability

Background: Why vulnerability detection is hard

Vulnerabilities rarely live in a single line of code. In real repositories, security issues often depend on:

  • Interprocedural flows (source → sink across functions/files)
  • Configurations and policies (authz rules, env vars, feature flags)
  • Runtime conditions (only triggered by specific inputs/states)
  • Project-specific conventions (custom sanitizers, wrappers, guards)

This is why one-shot “LLM code review” typically fails in two ways:

  1. False negatives: the model lacks the right context.
  2. False positives: the model produces a plausible story without strong evidence (unreachable paths, already-sanitized inputs, etc.).

Problem: The ceiling of one-shot LLM auditing

Even with strong code models, practical auditing still suffers from:

  • Insufficient context for real data/control flows
  • No verification to distinguish “suspicious” from “exploitable”
  • No counter-evidence mechanisms to challenge claims
  • Weak interpretability (conclusions without a defensible evidence trail)

So the key question becomes:

Can we turn vulnerability detection into a reproducible auditing loop that collects evidence, validates hypotheses, and produces actionable, review-ready findings?

Method: VulnAgent-X (layered agents + evidence-driven decisions)

VulnAgent-X decomposes detection into five stages:

1) Fast Risk Screening

Low-cost triage to reduce the search space:

  • lightweight static signals (dangerous APIs, missing checks)
  • change metadata (auth modules, input processing, config diffs)
  • lightweight model scoring

Only the Top-K regions proceed to deeper analysis.
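As a rough illustration, the triage pass can be thought of as cheap signal scoring followed by a Top-K cut. The signal names and weights below are hypothetical, chosen only to make the idea concrete:

```python
# Hypothetical sketch of the fast risk-screening stage.
# Signal names and weights are illustrative, not taken from the system.

DANGEROUS_APIS = {"eval", "exec", "os.system", "pickle.loads", "subprocess.Popen"}

def risk_score(region):
    """Combine lightweight static signals into a single triage score."""
    score = 0.0
    # Substring match against a dangerous-API list (cheap, noisy on purpose).
    score += 2.0 * sum(api in region["code"] for api in DANGEROUS_APIS)
    # Change metadata: auth modules, external input handling, config diffs.
    score += 1.5 * region.get("touches_auth_module", False)
    score += 1.0 * region.get("handles_external_input", False)
    score += 0.5 * region.get("config_diff", False)
    return score

def triage(regions, k=3):
    """Rank candidate regions and keep only the Top-K for deeper analysis."""
    return sorted(regions, key=risk_score, reverse=True)[:k]
```

In practice a lightweight model score would be one more additive term; the point is that this stage must be orders of magnitude cheaper than the agents downstream.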

2) Context Expansion (minimal sufficient context)

Instead of dumping the whole repo into the model, we retrieve a bounded, relevant context:

  • callers/callees (call chain)
  • configs and policy definitions
  • tests and exception handling
  • similar historical fixes

Goal: enough evidence, minimal noise.
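One way to picture the budgeted retrieval: pull context pieces in priority order and stop as soon as a token budget is exhausted. The retriever interface and budget policy here are assumptions for illustration:

```python
# Illustrative sketch of budgeted context expansion.
# Retriever signatures and the budget policy are assumptions, not the system's API.

def expand_context(target, retrievers, token_budget=4000):
    """Collect context pieces in priority order until the budget is spent."""
    context, used = [], 0
    for retrieve in retrievers:           # e.g. callers, configs, tests, past fixes
        for piece in retrieve(target):
            cost = len(piece.split())     # crude token estimate for the sketch
            if used + cost > token_budget:
                return context            # stop early: enough evidence, minimal noise
            context.append(piece)
            used += cost
    return context
```

Ordering the retrievers by expected evidential value (call chain before historical fixes, say) is what makes the early exit safe: the budget cuts off the least valuable context first.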

3) Multi-Agent Collaborative Analysis

Specialised agents analyse the same target from complementary viewpoints:

  • Router Agent: categorises the issue and dispatches experts
  • Semantic Agent: control flow, boundaries, exception logic
  • Security Agent: source→sink reasoning, injection/authz risks, dangerous sinks
  • Logic Bug Agent: state/ordering/transaction/concurrency failures
  • Sceptic Agent: searches for counter-evidence (unreachability, sanitisation, constraints)

This is not “more text”. It is structured evidence collection and adversarial cross-checking.
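The orchestration pattern can be sketched as: the router maps an issue category to a set of specialists, their evidence is pooled, and the sceptic independently searches for counter-evidence. Category names and agent interfaces below are hypothetical:

```python
# Minimal sketch of router -> specialists -> sceptic orchestration.
# The routing table and agent callables are illustrative stand-ins for LLM agents.

SPECIALISTS = {
    "injection": ["security", "semantic"],
    "concurrency": ["logic", "semantic"],
}

def run_analysis(finding, agents):
    """Dispatch a candidate finding to specialists, then challenge it."""
    category = finding["category"]
    evidence = []
    for name in SPECIALISTS.get(category, ["semantic"]):
        evidence.extend(agents[name](finding))   # each agent returns evidence items
    counter = agents["sceptic"](finding)         # counter-evidence is collected separately
    return {"evidence": evidence, "counter_evidence": counter}
```

Keeping the sceptic's output in a separate channel, rather than mixing it into the evidence list, is what lets the fusion stage subtract it explicitly later.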

4) Selective Dynamic Verification

For high-risk or uncertain cases, a verification step generates:

  • minimal repro inputs / payloads
  • unit or regression tests
  • bounded execution traces (if sandboxed execution is allowed)

This helps distinguish “looks vulnerable” from “is triggerable”.
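The selection rule itself is simple: escalate to execution only when the finding is uncertain or high-severity. The thresholds and field names below are illustrative, not tuned values from the experiments:

```python
# Sketch of selective dynamic verification gating. Only high-risk or uncertain
# findings pay the cost of execution; thresholds here are illustrative.

def needs_verification(finding, low=0.3, high=0.8):
    """Verify when the model is uncertain, or when the stakes are high."""
    uncertain = low < finding["confidence"] < high
    return uncertain or finding["severity"] == "critical"

def verify_batch(findings):
    """Select the subset of findings that should be dynamically verified."""
    return [f for f in findings if needs_verification(f)]
```

Confident low-severity findings skip execution entirely, which is where the cost savings of selective verification come from.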

5) Evidence Fusion

Final decisions are made via evidence aggregation:

  • static + contextual + cross-agent agreement + dynamic signals
  • minus counter-evidence

Outputs are structured findings:

  • type, file/line localisation
  • evidence trail (e.g., source→sink path)
  • confidence and severity
  • fix hints and regression test suggestions
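A toy version of the fusion rule makes the "signals minus counter-evidence" arithmetic concrete. The weights and the `Finding` fields are assumptions for illustration, not the system's actual schema:

```python
# Toy evidence-fusion score: positive signals add, counter-evidence subtracts.
# Weights and field names are illustrative assumptions.

from dataclasses import dataclass, field

@dataclass
class Finding:
    vuln_type: str
    location: str                                  # file:line localisation
    evidence: list = field(default_factory=list)
    counter_evidence: list = field(default_factory=list)
    dynamic_confirmed: bool = False

def fuse(finding, agent_votes):
    """Aggregate static, cross-agent, and dynamic signals minus counter-evidence."""
    score = 0.5 * len(finding.evidence)            # static + contextual evidence
    score += 1.0 * sum(agent_votes)                # cross-agent agreement
    score += 2.0 if finding.dynamic_confirmed else 0.0
    score -= 1.5 * len(finding.counter_evidence)   # sceptic's findings pull the score down
    return max(score, 0.0)
```

Weighting dynamic confirmation heavily and counter-evidence negatively is the mechanism that separates "looks vulnerable" from "is triggerable" in the final ranking.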

Experiments: How we validate it

We evaluate:

  1. Detection quality: Precision/Recall/F1/AUROC
  2. Localisation: Top-1/Top-3/MRR
  3. Practical cost: tokens, latency, verification rate, early-exit rate
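For reference, the detection-quality metrics in (1) reduce to simple counts over labels and predictions; a minimal sketch:

```python
# Precision/Recall/F1 from binary labels and predictions (standard definitions).

def prf1(y_true, y_pred):
    """Compute precision, recall, and F1 for binary vulnerability labels."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

The false-positive/false-negative split these counts expose is exactly the tension the sceptic agent and the verification stage are designed to manage.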

Public benchmarks (function-level + commit-level):

  • Devign
  • Big-Vul
  • PrimeVul (more realistic evaluation settings)
  • a just-in-time commit-level benchmark (detection + localisation)

What the figures highlight

  • Pareto plot: performance vs cost, showing the benefit of layered escalation
  • Sensitivity curves: performance saturates with larger K/context budgets while cost grows
  • Threshold effects: how confidence thresholds trade off quality vs verification cost
  • Error breakdown: environment-dependent semantics and missing helper-function meaning remain major failure modes

Key takeaways

  • The main bottleneck is often not “understanding code tokens” but collecting and validating evidence.
  • A staged workflow (triage → context → specialists → verification → fusion) improves reliability and interpretability.
  • A dedicated sceptic/counter-evidence step is crucial for suppressing false positives.
  • Verification can be selective: targeting only high-risk/uncertain cases yields strong gains without always paying full execution cost.

Limitations and next steps

  • Environment-dependent vulnerabilities remain hard without stronger execution context.
  • Dataset label noise can distort evaluation; auditing-style evaluation is preferable.
  • Next steps include tighter integration into PR review workflows (review comments, patch diffs, and auto-generated regression tests).

Links to the paper and code are provided in the header.