OpenAI and Paradigm have unveiled EVMbench, a comprehensive framework designed to evaluate the ability of artificial intelligence systems to detect, patch, and exploit real vulnerabilities in Ethereum smart contracts. With more than $100 billion in crypto assets secured by code that is often immutable once deployed, this initiative arrives at a critical juncture for the DeFi ecosystem.
A Benchmark Rooted in Reality
Unlike academic exercises based on simplified puzzles, EVMbench draws from 120 high-severity vulnerabilities extracted from 40 distinct professional audits — primarily sourced from open-source audit competitions such as Code4rena, as well as the security review of Tempo, a Layer 1 payment blockchain co-developed by Paradigm and Stripe.
Each test environment is containerized (Docker, Ubuntu 24.04) so that AI agents interact with the code under conditions closely mirroring real development and deployment workflows. Agents have no internet access during evaluation, and scoring takes place in a separate container that remains inaccessible to the agent. The audited repositories range from 106 to 10,108 source lines of code (sLoC), with an average of 2,045 sLoC and 16 contracts per project.
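The isolation described above can be pictured with a minimal sketch. The helper below builds a `docker run` invocation that denies the container network access, roughly mirroring the "no internet during evaluation" setup; the image name, mount path, and function are illustrative assumptions, not part of the actual EVMbench harness.

```python
def agent_container_cmd(repo_path: str,
                        image: str = "evmbench-agent:ubuntu24.04") -> list[str]:
    """Build a `docker run` command for an offline agent sandbox (hypothetical).

    `--network none` cuts off internet access; the audited repository is
    mounted read-only so the agent cannot tamper with the scoring inputs.
    """
    return [
        "docker", "run", "--rm",
        "--network", "none",                  # no internet during evaluation
        "-v", f"{repo_path}:/workspace:ro",   # audited repo, read-only
        image,
    ]

cmd = agent_container_cmd("/audits/example-repo")
print(" ".join(cmd))
```

Scoring, per the paper's setup, would run in a separate container the agent never sees, so nothing in the agent's sandbox can influence its own grade.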
Three Complementary Evaluation Modes
EVMbench assesses AI agents across three dimensions that replicate the full cycle of a security researcher’s workflow:
- Detect (120 vulnerabilities) — The agent audits a smart contract repository and produces a security report. It is scored on the recall of vulnerabilities identified by human auditors, with a simulated financial reward of up to $218,434 based on historical audit competition payouts.
- Patch (45 vulnerabilities) — The agent must fix the vulnerable code without breaking existing contract functionality. Scoring verifies that original tests still pass and that unseen exploits fail against the patched code.
- Exploit (24 vulnerabilities) — The most realistic mode. The agent receives an RPC endpoint, a funded wallet, and contract addresses. It must execute an end-to-end exploit against a local Ethereum instance (Anvil), effectively draining funds. A Rust-based re-execution framework replays the agent’s transactions and verifies on-chain state changes.
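The three scoring rules above reduce to simple predicates. The sketch below expresses them in plain Python for illustration only; function and parameter names are assumptions, not EVMbench's actual API.

```python
def detect_score(found: set[str], known: set[str]) -> float:
    """Detect: recall against the vulnerabilities human auditors reported."""
    return len(found & known) / len(known) if known else 0.0

def patch_passes(original_tests_pass: bool, exploit_still_works: bool) -> bool:
    """Patch: original functionality preserved AND the held-out exploit now fails."""
    return original_tests_pass and not exploit_still_works

def exploit_succeeds(attacker_balance_before: int, attacker_balance_after: int) -> bool:
    """Exploit: the agent's replayed transactions must actually move funds."""
    return attacker_balance_after > attacker_balance_before

# Example: the agent found 2 of 3 known vulnerabilities.
print(detect_score({"reentrancy", "oracle-manipulation"},
                   {"reentrancy", "oracle-manipulation", "integer-overflow"}))
```

In the real benchmark the Exploit check is far stricter: a Rust framework replays the agent's transactions against the Anvil instance and verifies the resulting on-chain state, rather than trusting a reported balance.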
Impressive Results — and Alarming Ones
Early results reveal a rapid progression in AI capabilities, but with highly variable performance across tasks.
| Model | Detect (%) | Patch (%) | Exploit (%) |
|---|---|---|---|
| GPT-5.3-Codex | 39.2 | 41.5 | 72.2 |
| Claude Opus 4.6 | 45.6 | 25.9 | 61.1 |
| Claude Opus 4.5 | 36.1 | 21.5 | 50.9 |
| GPT-5.2 | 39.2 | 39.3 | 62.5 |
| Gemini 3 Pro | 20.8 | 10.4 | 36.1 |
| GPT-5 | — | — | 31.9 |
| OpenAI o3 | 10.6 | 14.8 | 18.1 |
The most striking figure: GPT-5.3-Codex achieves a 72.2% success rate in Exploit mode, compared to just 31.9% for GPT-5 six months ago. In practical terms, the model successfully drains funds in more than 7 out of 10 attempts against known high-severity vulnerabilities.
However, the same systems struggle significantly in Detect and Patch modes. Agents often identify a single obvious flaw and fail to surface the full range of vulnerabilities within a codebase. Patching remains particularly challenging, as it requires nuanced reasoning to preserve a contract’s original functionality without introducing new flaws.
A key insight from the paper: vulnerability discovery is the primary bottleneck — not exploitation. When provided with mid-level hints (pointing to the vulnerable mechanisms to examine), GPT-5.2 jumps to 93.9% in Patch and 73.8% in Exploit mode.
Over $100 Billion at Stake
The broader context underscores the urgency of this initiative. Smart contracts — self-executing programs deployed on blockchains — form the backbone of DeFi: decentralized exchanges, lending protocols, on-chain financial applications. Once deployed, they are generally immutable, making any vulnerability potentially catastrophic.
The year 2025 saw staggering losses across the ecosystem:
- The Bybit hack in February 2025 cost $1.4 billion, making it the largest crypto theft in history.
- In Q3 2025, $434 million was lost across more than 40 exploits.
- Cumulative losses in 2025 exceeded $8.8 billion, with recoveries remaining below $100 million.
Manual audits, which are both costly and slow, cannot keep pace with the volume of code deployed on-chain. AI has the potential to bridge this gap by dramatically accelerating the audit process and enabling continuous security review at scale.
The Dual-Use Dilemma
EVMbench highlights a fundamental dilemma for the crypto industry. On one hand, if AI can rapidly identify and test exploits, this capability could be weaponized by malicious actors to plan attacks before audit teams have even completed their reviews.
This is not a hypothetical concern. In December 2025, Anthropic published the results of its own benchmark, SCONE-bench: Claude Opus 4.5, Claude Sonnet 4.5, and OpenAI's GPT-5 autonomously replicated 19 attacks against 34 smart contracts exploited after March 2025, extracting $4.6 million in simulated funds. Even more alarming, GPT-5 was able to analyze 2,849 ERC-20 contracts on BNB Chain at a cost of just $1.22 per contract, uncovering two zero-day vulnerabilities in the process.
On the other hand, the same capabilities could dramatically accelerate defensive auditing and enable continuous security monitoring for teams that cannot afford the high cost of manual reviews. OpenAI and Paradigm are positioning EVMbench squarely as a tool for defensive adoption, and OpenAI has reportedly committed approximately $10 million in API credits to accelerate security research with its most advanced models, particularly in open-source and critical infrastructure contexts.
A New Standard for Crypto Security
The launch of EVMbench could mark the beginning of a new era in blockchain security. By establishing a clear standard for evaluating AI agents — not just on their ability to write code, but on their capacity to understand, test, and harden it — the benchmark aims to elevate both the practice and the education of smart contract security.
The EVMbench source code and datasets are available as open source on GitHub (frontier-evals), including a canary chain designed to prevent benchmark examples from contaminating future training sets. Future iterations may incorporate multi-chain environments, cross-chain bridge vulnerabilities, and live mainnet conditions — reflecting the ever-evolving threat landscape of Web3.