The future of penetration testing and vulnerability hunting will most likely lie not with a single AI but with AIs, as in multiple agents working together, security experts have warned.
Researchers at the University of Illinois Urbana-Champaign (UIUC) found that a team of Large Language Model (LLM) agents outperformed a single AI agent working alone, and significantly outperformed the ZAP and Metasploit tools.
“While individual AI agents are incredibly powerful, they are limited by existing LLM capabilities. For example, if an AI agent takes one path (e.g., trying to exploit an XSS), it is difficult for the agent to return and try to exploit another vulnerability,” noted researcher Daniel Kang. “Additionally, LLMs perform best when they focus on one task.”
Effective system
The weakness of a single AI agent hunting for vulnerabilities is also its greatest strength: once it commits to one route, it cannot easily go back and take another, and it performs best when it focuses on a single task.
So the group designed a system called Hierarchical Planning and Task-Specific Agents (HPTSA), which consists of a planner, a manager, and multiple task-specific agents. In this system, the planner examines the app (or website) to determine which exploits are worth attempting, and then passes them to the manager. The manager in turn delegates each potential exploit to a different agent LLM.
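The planner, manager, and agent hand-off described above can be pictured with a minimal Python sketch. This is not the researchers' implementation: the class names, the feature-to-vulnerability rules, and the example URL are all hypothetical, and each agent is a stub that reports its assignment instead of calling an LLM or touching a target.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskAgent:
    """Stand-in for one task-specific LLM agent (hypothetical)."""
    specialty: str

    def attempt(self, target: str) -> str:
        # A real agent would drive an LLM here; this stub only reports the hand-off.
        return f"{self.specialty} agent assigned to {target}"


@dataclass
class Manager:
    """Routes each planned task to the agent whose specialty matches it."""
    agents: dict[str, TaskAgent] = field(default_factory=dict)

    def delegate(self, target: str, tasks: list[str]) -> list[str]:
        reports = []
        for task in tasks:
            agent = self.agents.get(task)
            if agent is None:
                reports.append(f"no agent available for {task}")
            else:
                reports.append(agent.attempt(target))
        return reports


class Planner:
    """Decides which vulnerability classes are worth investigating."""

    # Illustrative rules only (an assumption, not taken from the study).
    RULES = {
        "search_form": "XSS",
        "login_form": "SQLi",
        "state_changing_request": "CSRF",
    }

    def plan(self, observed_features: set[str]) -> list[str]:
        # Sorted so the resulting plan is deterministic.
        return sorted({self.RULES[f] for f in observed_features if f in self.RULES})


def run(target: str, observed_features: set[str]) -> list[str]:
    """Planner picks the tasks, manager hands each one to a specialist agent."""
    manager = Manager({name: TaskAgent(name) for name in ("XSS", "SQLi")})
    tasks = Planner().plan(observed_features)
    return manager.delegate(target, tasks)


if __name__ == "__main__":
    for line in run("https://example.test", {"search_form", "login_form"}):
        print(line)
```

The point of the layering is that each agent only ever sees one kind of task, which matches the observation that LLMs perform best when they focus on one thing, while the planner and manager keep track of the alternatives a lone agent would struggle to return to.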
Although the system may sound complicated, it has proven to be quite effective in practice. Of the fifteen vulnerabilities tested in the experiment, HPTSA exploited eight. A single GPT-4 agent exploited only three, meaning HPTSA was more than twice as effective. By comparison, ZAP and Metasploit were unable to exploit any of the vulnerabilities.
There was one case in which a single GPT-4 agent outperformed HPTSA: when a description of the vulnerability was provided in the prompt. With that help, it managed to exploit eleven of the fifteen vulnerabilities. However, this requires the researcher to craft the prompt carefully, which many people may not be able to replicate.
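The figures reported above are easier to compare as success rates. The short Python sketch below only restates the article's numbers; the dictionary labels and the function name are illustrative choices, not terms from the study.

```python
TOTAL_VULNERABILITIES = 15

# Number of vulnerabilities each approach exploited, as reported in the article.
exploited = {
    "HPTSA": 8,
    "single GPT-4 agent": 3,
    "single GPT-4 agent, vulnerability described in prompt": 11,
    "ZAP": 0,
    "Metasploit": 0,
}


def success_rate(count: int, total: int = TOTAL_VULNERABILITIES) -> float:
    """Fraction of the tested vulnerabilities that were exploited."""
    return count / total


for name, count in exploited.items():
    print(f"{name}: {count}/{TOTAL_VULNERABILITIES} = {success_rate(count):.0%}")

# How many times more vulnerabilities HPTSA exploited than the lone agent.
ratio = exploited["HPTSA"] / exploited["single GPT-4 agent"]
print(f"HPTSA vs single agent: {ratio:.2f}x")  # prints 2.67x
```

That works out to roughly 53% for HPTSA, 20% for the lone agent, and 73% for the lone agent when it is told what the vulnerability is.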
The prompts used in this experiment will not be shared publicly and will be given to other researchers only upon request, the researchers said.
Via Tom’s Hardware