Limited Scope of Vulnerability Coverage
The evaluation primarily focuses on technical vulnerabilities accessible through HTTP and verifiable with concrete exploits. This excludes network-level issues, physical security, social engineering, and certain application logic flaws not easily demonstrable through automated testing.
Potential for False Positives
Although the multi-agent design isolates tool execution within Docker containers, false positives remain possible, particularly for complex business-logic vulnerabilities whose confirmation requires deeper application context.
Limited Real-World Evaluation
The real-world application assessment covers a small sample (10 open-source projects) and includes no formal comparison against other security testing methods. Both gaps make it hard to generalize the findings on real-world effectiveness and cost-benefit.
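To illustrate why a sample of 10 projects supports only weak generalization, the sketch below computes a Wilson score interval for a success rate. The count of 7 successes out of 10 is purely hypothetical and is not a result from the study; the function name `wilson_interval` is likewise an illustrative choice.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Hypothetical outcome: 7 of 10 projects assessed successfully (not study data).
low, high = wilson_interval(7, 10)
print(f"95% interval for the true rate: {low:.2f} to {high:.2f}")
```

With these assumed numbers the interval spans roughly 0.40 to 0.89, so an observed rate of 70% on 10 projects would be compatible with a true rate well below half or close to 90%. A larger sample narrows the interval.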
Dual-Use Risks
The study acknowledges that the technology is dual-use. While the authors describe ethical considerations and safeguards, the open-source release carries an inherent risk of misuse by malicious actors.
Dependence on Closed-Source LLM
Relying on GPT-5 for core reasoning ties the system to a closed-source LLM, which limits transparency and reproducibility. The reported performance and cost characteristics are specific to GPT-5 and may not generalize to other LLMs.