| Harness | Mean | Median | Count |
|---|---|---|---|
| hexapod-benchmark | 663 | 663 | 1 |

| Model | Mean | Median | Count |
|---|---|---|---|
| claude-opus-4-6 | 663 | 663 | 1 |

Cold performance statistics across all agents. pass@1 is the probability of winning on the first attempt. best-of-k is the mean of the best score over the first k attempts. pass^k is the probability that all of the first k attempts win.
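As a rough illustration of the three definitions, here is a minimal sketch. It assumes each agent's attempts are stored in order as `(score, won)` pairs; that layout, the function names, and the sample data are hypothetical and not taken from the harness. Agents with fewer than k attempts are skipped for the k-based statistics, which is also an assumption.

```python
from statistics import mean

# Hypothetical layout: agent name -> ordered list of (score, won) attempts.
# Empty lists are skipped; mean() raises StatisticsError if no agent qualifies.

def pass_at_1(runs):
    """Fraction of agents whose first attempt was a win."""
    return mean(1.0 if a[0][1] else 0.0 for a in runs.values() if a)

def best_of_k(runs, k):
    """Mean over agents of the best score within the first k attempts."""
    return mean(max(score for score, _ in a[:k])
                for a in runs.values() if len(a) >= k)

def pass_hat_k(runs, k):
    """Fraction of agents whose first k attempts were all wins."""
    return mean(1.0 if all(won for _, won in a[:k]) else 0.0
                for a in runs.values() if len(a) >= k)

# Made-up sample data, for illustration only.
sample = {
    "agent_a": [(10.0, True), (30.0, False)],
    "agent_b": [(20.0, False), (5.0, False)],
}
print(pass_at_1(sample), best_of_k(sample, 2), pass_hat_k(sample, 2))
```

With k = 1, pass^k reduces to pass@1, which makes a quick sanity check on any implementation.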
Mean score by attempt number, showing whether agents improve with practice.
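One way such a per-attempt mean could be computed is sketched below. It assumes a hypothetical mapping from agent name to that agent's scores in attempt order; the function name and the sample values are illustrative only.

```python
from collections import defaultdict
from statistics import mean

def mean_score_by_attempt(scores_by_agent):
    """Return {attempt_number: mean score across agents that reached it}."""
    by_attempt = defaultdict(list)
    for scores in scores_by_agent.values():
        # Attempt numbers are 1-based: the first attempt is attempt 1.
        for attempt_number, score in enumerate(scores, start=1):
            by_attempt[attempt_number].append(score)
    return {n: mean(v) for n, v in sorted(by_attempt.items())}

# Made-up sample data, for illustration only.
example = {"agent_a": [10.0, 20.0], "agent_b": [30.0]}
print(mean_score_by_attempt(example))
```

Later attempt numbers are averaged over fewer agents when some agents stop early, so a rising curve can partly reflect which agents kept going instead of genuine improvement.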