a benchmark of co-evolving test / production pairs
drawn from real open-source Java projects
In real software, tests and the production code they exercise are written and
revised together; a change to a production method almost always arrives with a
coordinated change to its test. Existing benchmarks for test generation and test
update evaluate the two sides in isolation, which discards this co-evolution
signal and makes it hard to tell whether a model actually understands how a
change should propagate from implementation to test. We introduce
TestEvo-Bench, a live, continuously ingested
benchmark of co-evolving test/production pairs mined from the public
commit histories of real open-source Java projects. Each task is anchored to a
concrete rev-pair in which a production edit is accompanied by a contemporaneous
edit to the associated test suite, preserving the temporal and semantic link
between the two artefacts. TestEvo-Bench ships two evaluation tracks.
In test update, a model must revise an existing test so that it still
compiles, exercises the new behaviour of the modified production method, and
passes, all without weakening assertions or deleting coverage. In
test generation, the model must write a brand-new test that compiles, runs,
passes, and meaningfully covers the changed lines of newly added or updated
production code. Because tasks are ingested from live commit streams rather
than fixed at release time, TestEvo-Bench enables
contamination-aware evaluation: a time filter restricts tasks to those whose
rev2 commit date falls within a chosen window, so results can be reported
using only commits authored after a model's pre-training cutoff. We release the
benchmark together with an open leaderboard that tracks compile pass rate, test
pass rate, line and mutation coverage, and, for test update, the
coverage delta on the lines that were actually changed.
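
The coverage delta on changed lines can be illustrated with a small sketch. The benchmark's exact formula is not given here, so this assumes one plausible definition: the share of changed production lines executed by the updated test minus the share executed by the original test. The function names and the line-number sets are hypothetical.

```python
def changed_line_coverage(changed: set[int], covered: set[int]) -> float:
    """Fraction of changed lines that a test executes (0.0 if nothing changed)."""
    if not changed:
        return 0.0
    return len(changed & covered) / len(changed)


def coverage_delta(changed: set[int],
                   covered_before: set[int],
                   covered_after: set[int]) -> float:
    """Assumed definition: coverage of changed lines after the test update
    minus coverage of the same lines before it."""
    return (changed_line_coverage(changed, covered_after)
            - changed_line_coverage(changed, covered_before))


# Hypothetical rev-pair: four production lines changed.
changed = {10, 11, 12, 13}
old_test_covers = {10, 40}          # old test hits 1 of the 4 changed lines
new_test_covers = {10, 11, 12, 40}  # updated test hits 3 of the 4

print(coverage_delta(changed, old_test_covers, new_test_covers))  # 0.5
```

Restricting the measurement to changed lines keeps the score from being inflated by coverage of code the commit never touched.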
Compare model performance across test update and test generation tracks. All metrics are computed over the selected time window.
| # | Agent | Model | Success% | Pass% | ExecFail% | CmplFail% | HrnsFail% | CovOnPass | MutOnPass | Submitted |
|---|---|---|---|---|---|---|---|---|---|---|
Browse every repository in the benchmark. Drag the time slider to filter by commit date, then click a row to inspect individual rev-pairs and their GitHub diffs.
| Repository | Total | test update | test generation | Date range | |
|---|---|---|---|---|---|
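
The commit-date filter can also be sketched offline. This is a minimal illustration, not the benchmark's actual tooling: the `Task` record, its field names, and the repository slugs are hypothetical, and it assumes each task carries the commit date of its rev2 revision.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass(frozen=True)
class Task:
    repo: str        # hypothetical repository slug
    track: str       # "test update" or "test generation"
    rev2_date: date  # assumed: commit date of rev2 in the rev-pair


def in_window(tasks, start: Optional[date] = None, end: Optional[date] = None):
    """Keep tasks whose rev2 date lies in [start, end]; None leaves that side open."""
    return [
        t for t in tasks
        if (start is None or t.rev2_date >= start)
        and (end is None or t.rev2_date <= end)
    ]


def after_cutoff(tasks, cutoff: date):
    """Keep tasks whose rev2 date is strictly later than a model's training cutoff."""
    return [t for t in tasks if t.rev2_date > cutoff]


tasks = [
    Task("example/alpha", "test update", date(2023, 11, 2)),
    Task("example/alpha", "test generation", date(2024, 6, 15)),
    Task("example/beta", "test update", date(2025, 1, 20)),
]

recent = after_cutoff(tasks, cutoff=date(2024, 6, 15))
print([t.repo for t in recent])  # ['example/beta']
```

The strict comparison in `after_cutoff` is a conservative choice: a commit dated exactly on the cutoff day is treated as potentially seen during pre-training.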