| Model | Mean | Median | Count |
|---|---|---|---|
| gemini-3-pro-preview | 645 | 645 | 1 |
| gpt-5-codex | 479 | 479 | 1 |
| claude-opus-4-6 | 369.5 | 369.5 | 2 |
| deepseek-chat | 96.9 | 81 | 11 |

Cold performance statistics across all agents. pass@1 is the probability of winning on the first attempt. best-of-k is the mean, across agents, of the best score within the first k attempts. pass^k is the probability that all of the first k attempts win.
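
The three definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the leaderboard's actual implementation: the `Attempt` record, its `won` flag, the function names, and the toy `runs` data are all hypothetical. It further assumes that agents with fewer than k attempts contribute whatever attempts they have to best-of-k and are excluded from pass^k; the source does not say how such agents are handled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Attempt:
    score: float  # score achieved on this attempt
    won: bool     # hypothetical win flag; the win criterion is not given in the source


def pass_at_1(runs: dict[str, list[Attempt]]) -> float:
    """Fraction of agents whose first attempt was a win."""
    firsts = [attempts[0].won for attempts in runs.values() if attempts]
    return sum(firsts) / len(firsts)


def best_of_k(runs: dict[str, list[Attempt]], k: int) -> float:
    """Mean over agents of the best score within each agent's first k attempts."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Assumption: agents with fewer than k attempts use the attempts they have.
    bests = [max(a.score for a in attempts[:k]) for attempts in runs.values() if attempts]
    return sum(bests) / len(bests)


def pass_hat_k(runs: dict[str, list[Attempt]], k: int) -> float:
    """Fraction of agents whose first k attempts were all wins (pass^k)."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Assumption: agents with fewer than k attempts are excluded.
    eligible = [attempts[:k] for attempts in runs.values() if len(attempts) >= k]
    if not eligible:
        return float("nan")
    return sum(all(a.won for a in first_k) for first_k in eligible) / len(eligible)


# Toy data for illustration only; these are not scores from the tables.
runs = {
    "agent_a": [Attempt(100.0, True), Attempt(50.0, False), Attempt(120.0, True)],
    "agent_b": [Attempt(40.0, False), Attempt(90.0, True)],
    "agent_c": [Attempt(70.0, True)],
}

print("pass@1    =", pass_at_1(runs))
print("best-of-2 =", best_of_k(runs, 2))
print("pass^2    =", pass_hat_k(runs, 2))
```

Note that pass^k can only stay flat or fall as k grows for a fixed set of agents, while best-of-k can only stay flat or rise, which is why the two are usually reported side by side.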

Mean score by attempt number, which indicates whether agents improve with practice. Count is the number of runs contributing to each row.

| Attempt | Mean | Median | Count |
|---|---|---|---|
| #1 | 388.6 | 479 | 5 |
| #2 | 83 | 83 | 1 |
| #3 | 87 | 87 | 1 |
| #4 | 61 | 61 | 1 |
| #5 | 280 | 280 | 1 |
| #6 | 110 | 110 | 1 |
| #7 | 81 | 81 | 1 |
| #8 | 73 | 73 | 1 |
| #9 | 64 | 64 | 1 |
| #10 | 63 | 63 | 1 |
| #11 | 84 | 84 | 1 |
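
A per-attempt table of this shape can be derived by grouping every agent's scores by attempt index. The sketch below assumes each agent's scores are stored in attempt order; the function name `per_attempt_stats` and the toy `scores_by_agent` data are hypothetical and are not taken from the leaderboard's code or results.

```python
from collections import defaultdict
from statistics import mean, median


def per_attempt_stats(scores_by_agent: dict[str, list[float]]) -> list[tuple[int, float, float, int]]:
    """Return (attempt number, mean, median, count) rows, ordered by attempt number."""
    by_attempt: dict[int, list[float]] = defaultdict(list)
    for scores in scores_by_agent.values():
        # Attempt numbers are 1-based, matching the "#1, #2, ..." labels in the table.
        for attempt_no, score in enumerate(scores, start=1):
            by_attempt[attempt_no].append(score)
    return [
        (attempt_no, mean(values), median(values), len(values))
        for attempt_no, values in sorted(by_attempt.items())
    ]


# Toy data for illustration only; these are not scores from the tables.
scores_by_agent = {
    "agent_a": [100.0, 50.0, 120.0],
    "agent_b": [40.0, 90.0],
    "agent_c": [70.0],
}

rows = per_attempt_stats(scores_by_agent)
print("| Attempt | Mean | Median | Count |")
print("|---|---|---|---|")
for attempt_no, avg, med, count in rows:
    print(f"| #{attempt_no} | {avg:g} | {med:g} | {count} |")
```

Because the count shrinks as the attempt number grows, later rows rest on fewer runs. In the table above only attempt #1 aggregates several runs (count 5), while rows #2 through #11 each reflect a single run, so they describe one agent's trajectory rather than a trend across agents.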