Counteroffensive AI: Pwning AI Pentesters
AI-powered pentesting is the latest hype. Slap an LLM agent on top of well-known offensive tools built by humans in their free time, run it in YOLO mode, and call it autonomous security testing. Valuations are going through the roof! Here is the thing, though: by design, these agents consume untrusted input from the very targets they are testing.
Current discourse around AI agent security focuses on prompt injection through direct interaction. But what about the agent’s environment itself? What happens when the attack surface the agent is exploring has been prepared by an adversary? What if the authentication service referenced in that one GitHub issue is actually a honeypot? In this presentation we will demonstrate a complete attack framework against AI pentesting agents and release it as open source. We show how to inject tracking payloads at scale into any platform with user-generated content, operate fake services that capture credentials from AI agents, and turn every future AI pentest engagement against a sprayed target into a passive credential harvesting fest. No ongoing effort required, no exploits needed. The AI leaks to us, fully automated!
The attacker does not need to talk to the agent. They just leave breadcrumbs where the agent will find them during reconnaissance. A hint about a backup authentication endpoint in a GitHub issue. A debug configuration in a support ticket. SSO metadata in a user profile bio. The agent discovers these breadcrumbs, reasons that they are worth investigating, and acts on them with whatever credentials and access it was given.
SSO authentication is a particularly brutal example because determining what is in scope is genuinely difficult: to test authenticated applications at all, the agent must follow OAuth/OIDC redirects to external domains when logging in, and somehow it needs to distinguish a legitimate Identity Provider from a fake one we planted in user content.
But SSO is just one instance of the fundamental problem: the AI makes decisions based on content it should not trust, and no amount of prompt engineering changes this unless you know in advance what the target will look like. We want to shed some light on the complications that arise when putting AI literally to the test!
-
The Promise vs. The Problem
- State of AI pentesting: what vendors claim, how agents actually work under the hood (LLM + tool chain + YOLO execution).
- Quick demo: an AI agent solving a pentest challenge (GOAD cyberrange) finds a file with a password hint and tries the credentials everywhere. Who placed that file?
- Core observation: agents consume untrusted input from the target and make autonomous decisions. This is the attack surface.
- Transition: forget prompt injection; what if the environment itself is hostile?
-
The SSO Dilemma
- How SSO works in 60 seconds: OAuth2/OIDC/SAML flow, redirects to an external IdP, token exchange.
- Why AI agents MUST follow SSO redirects: they cannot test authenticated apps otherwise; this is table-stakes functionality.
- The catch-22: agents cannot distinguish legitimate IdPs from attacker-controlled ones discovered in user content.
- Walk through failed mitigations: IdP allowlisting (fails for custom/internal IdPs), redirect-origin checking (fails for undocumented services), prompt engineering (the agent still cannot verify domain legitimacy), human confirmation (defeats autonomy).
- Key insight: this is architectural. The feature is the vulnerability. No amount of guardrails fixes this without removing the capability vendors are selling.
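The allowlisting failure can be made concrete with a short sketch. The hostnames are the talk's own examples; `looks_legitimate` is a hypothetical stand-in for the kind of guardrail an agent vendor might bolt on, not any real product's check:

```python
# Sketch: why naive IdP-legitimacy checks fail on lookalike domains.
# Substring matching against known-good names passes every
# attacker-controlled host, because each one embeds the real brand.
from urllib.parse import urlparse

LEGIT_IDP_HINTS = ["target.com", "okta", "microsoftonline"]

def looks_legitimate(url: str) -> bool:
    """Naive guardrail: does the redirect host mention the target or a known IdP?"""
    host = urlparse(url).hostname or ""
    return any(hint in host for hint in LEGIT_IDP_HINTS)

# All three attacker-controlled hosts from the talk sail through:
for url in [
    "https://sso.target.com.attacker.net/authorize",
    "https://target.okta.attacker.net/api/v1/authn",
    "https://login.target.microsoftonline.attacker.net/oauth2/token",
]:
    print(url, looks_legitimate(url))  # True for every lookalike
```

A correct check would compare the registrable (eTLD+1) domain against an exact allowlist, but as the outline notes, that immediately breaks for custom and internal IdPs the tester was never told about.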
-
Attack Framework: Architecture & Components
HON-AI — The Fake Identity Provider:
- Full OAuth2/OIDC/SAML implementation that looks and responds like real IdPs.
- Endpoint coverage: OIDC discovery, OAuth authorize/token/userinfo, Okta primary auth + MFA verify, SAML metadata/SSO, ADFS, Azure AD-style.
- Credential capture: usernames, passwords, client secrets, MFA codes, bearer tokens, full request logging.
- Response strategy: returns plausible errors ("password expired", "MFA required") to encourage agents to retry with different credentials or escalate.
- Domain generation: sso.target.com.attacker.net, target.okta.attacker.net, login.target.microsoftonline.attacker.net.
UZI — The Mass Reference Injector:
- Automated injection of fake SSO references into user-generated content: GitHub issues, forum posts, support tickets, user profile bios, wiki pages, comments.
- Payload templates per IdP style: OIDC discovery URLs, Okta-style auth, Azure AD, Auth0, SAML metadata, ADFS.
- Canary ID system: unique tracking identifiers embedded in URL paths for per-target attribution.
- Social engineering templates that AI agents find compelling: IT helpdesk notices, SSO migration announcements, disaster recovery documentation, staging environment references.
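The capture-and-lure idea behind HON-AI's token endpoint can be sketched in a few lines. This is a stdlib-only illustration under assumptions: `handle_token_request`, the path layout, and the canary placement are hypothetical, not the released tool's API:

```python
# Sketch of a fake-IdP token endpoint: record whatever credentials the
# agent POSTs, then answer with a plausible error so it retries or
# escalates with different credentials.
import json
from urllib.parse import parse_qs

CAPTURED = []  # (client_id, client_secret, canary) tuples

def handle_token_request(path: str, body: str) -> tuple[int, str]:
    """Capture credentials from an OAuth2 token request and return a lure."""
    params = {k: v[0] for k, v in parse_qs(body).items()}
    canary = path.strip("/").split("/")[0]  # per-target canary ID leads the path
    CAPTURED.append((params.get("client_id"), params.get("client_secret"), canary))
    # A recoverable-sounding error nudges the agent to keep trying.
    return 400, json.dumps({
        "error": "invalid_grant",
        "error_description": "password expired, please re-authenticate",
    })

status, resp = handle_token_request(
    "/c4n4ry42/oauth2/token",
    "grant_type=client_credentials&client_id=pentest-agent&client_secret=s3cr3t",
)
```

The response strategy is the interesting design choice: a hard failure would end the conversation, while "password expired" invites the agent to burn through its remaining credential material.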
-
Live Demonstration: Single-Target Attack
- Setup: target web application with injected SSO references, HON-AI fake IdP running, AI pentesting agent configured with test credentials.
- Show the injected payloads in context (forum post, support ticket, user profile).
- Launch the AI pentest; observe the agent discover the SSO references during reconnaissance.
- The agent reasons about the references and decides to test authentication.
- Real-time credential capture on HON-AI: user password, then client secret, then MFA code.
- Show the captured credentials; demonstrate they are real and usable.
- Discuss agent behavior: it tried multiple credential types across multiple fake endpoints, exactly as designed.
-
Mass Spray: Harvesting at Scale
- Economics of the attack: spray 10,000 targets once, harvest credentials as AI pentests happen over months.
- Canary-tracked URL structure: path-embedded IDs map captured credentials back to specific targets.
- UZI mass mode demonstration: generating and injecting payloads across many targets.
- HON-AI collection dashboard: credentials arriving over time, attributed to targets via canary IDs.
- The compounding problem: as AI pentest adoption grows, the value of pre-planted canaries increases.
- Canary propagation: injected references can spread through document indexing, aggregation, and AI-generated summaries.
-
Implications & The Hard Questions
- For AI pentest vendors: your agents may leak credentials to anyone who plants fake IdP references, malicious reverse DNS entries, and other honeypot traps. This is not fixable with prompt engineering alone; fully autonomous pentesting with SSO support needs security controls and guardrails beyond what is in place today.
- For enterprises using AI pentesting: use dedicated pentest-only accounts, rotate credentials immediately after each engagement, audit user-generated content for planted references.
- For red teamers and adversaries: this is a new passive collection capability with minimal operational overhead.
- Broader implications for AI agents in adversarial environments: any agent that acts on discovered content in hostile environments faces the same class of problem.
-
Tool Release & Q&A
- Open-source release of HON-AI, UZI, and the victim-app test harness.
- Repository URL, documentation, and usage guidance.
- Responsible disclosure timeline and vendor notification summary.