TestEvo-Bench

A benchmark of co-evolving test/production pairs
drawn from real open-source Java projects

Abstract

In real software, tests and the production code they exercise are written and revised together; a change to a production method almost always arrives with a coordinated change to its test. Existing benchmarks for test generation and test update evaluate the two sides in isolation, which discards this co-evolution signal and makes it hard to tell whether a model actually understands how a change should propagate from implementation to test. We introduce TestEvo-Bench, a live, continuously ingested benchmark of co-evolving test/production pairs mined from the public commit histories of real open-source Java projects. Each task is anchored to a concrete rev-pair in which a production edit is accompanied by a contemporaneous edit to the associated test suite, preserving the temporal and semantic link between the two artifacts. TestEvo-Bench ships two evaluation tracks. In test update, a model must revise an existing test so that it still compiles, exercises the new behavior of the modified production method, and passes — without weakening assertions or deleting coverage. In test generation, the model must write a brand-new test for newly added or updated production code that compiles, runs, passes, and meaningfully covers the changed lines. Because tasks are ingested from live commit streams rather than fixed at release time, TestEvo-Bench enables contamination-aware evaluation: a time filter restricts tasks to any chosen rev2 date window, so results can be reported strictly against commits authored after a model's pre-training cutoff. We release the benchmark together with an open leaderboard that tracks compile pass rate, test pass rate, line and mutation coverage, and — for test update — the coverage delta on lines that were actually changed.
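
At its core, contamination-aware filtering is a date comparison on each task's rev2 commit. The sketch below shows one plausible shape of such a filter; the Task record, its field names, and the cutoff handling are illustrative assumptions, not the benchmark's actual API.

import java.time.Instant;
import java.util.List;

public final class TimeFilter {

  // Hypothetical task shape: each task is anchored to a rev pair, and
  // rev2Date is the author date of the later (post-change) commit.
  record Task(String repo, String revPair, Instant rev2Date) {}

  // Keep only tasks whose rev2 commit was authored strictly after the
  // model's pre-training cutoff, so no evaluated commit predates it.
  static List<Task> afterCutoff(List<Task> tasks, Instant cutoff) {
    return tasks.stream()
        .filter(t -> t.rev2Date().isAfter(cutoff))
        .toList();
  }
}

For example, afterCutoff(tasks, Instant.parse("2024-01-01T00:00:00Z")) restricts scoring to tasks whose rev2 commit was authored in 2024 or later.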

Track 1 — Test Update

Production code changes; the paired test must be updated to compile, exercise the new behavior, and pass. Example from gazbert/bxbot v1.4.0.
production: BotStatus.java
@Override
public String toString() {
  return MoreObjects.toStringHelper(this)
      .add("botId", botId)
      .add("displayName", displayName)
      .add("status", status)
+     .add("datetime", getDatetime())
      .toString();
}
test update: TestBotStatus.java
@Test
public void testToStringWorksAsExpected() {
-  final BotStatus botStatus = new BotStatus(BOT_ID, DISPLAY_NAME, STATUS);
-  assertEquals(
-      "BotStatus{botId=avro-707_1, displayName=Avro 707, status=running}",
-      botStatus.toString());
+  final BotStatus botStatus = new BotStatus(BOT_ID, DISPLAY_NAME, STATUS, DATE);
+  assertTrue(botStatus.toString().startsWith(
+      "BotStatus{botId=avro-707_1, displayName=Avro 707, status=running, datetime="));
}
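
The updated test also assumes a production-side change the diff above does not show: BotStatus gains a datetime value and a four-argument constructor. A minimal sketch of that assumed shape (names inferred from the example, not copied from the bxbot source):

import java.util.Date;

public class BotStatus {

  private String botId;
  private String displayName;
  private String status;
  private Date datetime; // field assumed to arrive with the toString() change

  // Assumed four-argument constructor matching the updated test's call site.
  public BotStatus(String botId, String displayName, String status, Date datetime) {
    this.botId = botId;
    this.displayName = displayName;
    this.status = status;
    this.datetime = datetime;
  }

  public Date getDatetime() {
    return datetime;
  }
}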

Track 2 — Test Generation

Production code changes; a new test must be written to compile, exercise the change, and pass. Example from casbin/jcasbin v1.99.0.
production: Util.java
public static boolean hasEval(String exp) {
-  return evalReg.matcher(exp).matches();
+  return evalReg.matcher(exp).find();
}
test generation: UtilTest.java
@Test
public void testReplaceEval() {
  Util.logPrint(Util.replaceEval("eval(test)", "testEval"));
}

+@Test
+public void testHasEval() {
+  assertTrue(Util.hasEval("eval(test)"));
+  assertTrue(Util.hasEval("r_act == p_act && eval(p_sub_rule) && eval(p_obj_rule)"));
+  assertFalse(Util.hasEval("evaltest"));
+}

Dataset overview

Per-repository task counts and commit-date ranges are listed in the Data Explorer below; aggregate statistics update continuously as new rev pairs are ingested from live commit streams.

Leaderboard

Compare model performance across test update and test generation tracks. All metrics are computed over the selected time window.

# | Agent | Model | Success% | Pass% | ExecFail% | CmplFail% | HrnsFail% | CovOnPass | MutOnPass | Submitted
Metric definitions

Success%: share of tasks solved end to end (the submitted test compiles, runs, passes, and clears the track's checks).
Pass%: share of tasks whose submitted test passes when executed.
ExecFail%: execution (runtime) failure rate.
CmplFail%: compile failure rate.
HrnsFail%: evaluation-harness failure rate.
CovOnPass: line coverage, averaged over passing tasks.
MutOnPass: mutation score, averaged over passing tasks.
Submitted: date the results were submitted.
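
As a rough illustration of how these rates partition per-task outcomes, the sketch below maps results to the failure buckets above. All names are illustrative assumptions, not the benchmark's internal API.

import java.util.List;

public final class LeaderboardRates {

  // Illustrative outcome buckets mirroring the leaderboard columns.
  enum Outcome { PASS, EXEC_FAIL, COMPILE_FAIL, HARNESS_FAIL }

  record Result(String taskId, Outcome outcome) {}

  // Each rate is one bucket's share of all evaluated tasks; e.g.
  // rate(results, Outcome.COMPILE_FAIL) corresponds to CmplFail%.
  static double rate(List<Result> results, Outcome bucket) {
    return results.stream().filter(r -> r.outcome() == bucket).count()
        / (double) results.size();
  }
}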

Data Explorer

Browse every repository in the benchmark. Drag the time slider to filter by commit date, then click a row to inspect individual rev pairs and GitHub diffs.

Repository | Total tasks | Test update | Test generation | Date range