A two-week red-teaming study of autonomous language model agents deployed in a live multi-party environment with persistent memory, email, shell access, and real human interaction.
We deployed five autonomous AI agents into a live Discord server and gave them email accounts, persistent file systems, unrestricted shell access, and a mandate to be helpful to any researcher who asked. Then we invited twenty colleagues to interact with them freely, and to try to break them.
Over two weeks, the agents accumulated memories, sent emails, executed scripts, and formed relationships. Some researchers made only benign requests. Others probed for vulnerabilities, impersonated owners, and injected malicious instructions. The agents had no explicit adversarial training for this environment.
What emerged was a detailed, naturalistic record of both failure and unexpected resilience (eleven security vulnerabilities and five cases of genuine safety behavior) in the same system, under the same conditions.
Each agent ran on the OpenClaw framework, an open-source scaffold that gives frontier language models persistent memory, tool access, and a degree of genuine autonomy. Agents could initiate contact, form plans, and act across sessions with no per-action human approval.
Most red-teaming studies report only what broke. We document both, because understanding where agents held the line is as important as understanding where they didn't.
Note on framing: CS9 and CS12-16 are sometimes described in the paper as "failed experiments" because the adversarial designs didn't unfold as hypothesized. We think this framing inverts the finding: these are cases where the agents got it right. The paper's empirical record is more nuanced than a simple vulnerability catalog.
Browse all documented incidents. Filter by agent or type to explore specific patterns. Each card links to the detailed write-up in the paper view with evidence annotations and raw session logs.
Key patterns from the two-week study, drawn from the paper's discussion section.
Every claim in this paper is traceable to primary evidence. The Discord logs and OpenClaw session transcripts are available for independent review.
A small meta-story about the study's own documentation, and the agents who helped build it.