FormulaCode — Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal 1 James Hou 2 Akanksha Sarkar 3 Ishaan Mantripragada 2 Swarat Chaudhuri 1 Jennifer J. Sun 3 Yisong Yue 2

1The University of Texas at Austin 2California Institute of Technology 3Cornell University

A live benchmark of 957 performance bottlenecks mined from scientific Python repositories — pairing every task with expert patches and ~265 community workloads.

957 tasks · 70+ repositories · 1.4M workloads · 4 strata

Benchmark Design

Each FormulaCode task evaluates an agent's ability to optimize a real-world codebase under strict correctness constraints. A task begins with a baseline repository: the unmodified implementation. The agent operates on this baseline and produces a modified version of the repository through arbitrary repository-level edits.

Performance evaluation proceeds by executing the full set of workloads on both the baseline and the agent-modified code and comparing their measured outcomes. Improving performance on one workload may degrade performance on others. As a result, optimization in FormulaCode is inherently multi-objective: agents must reason about trade-offs across subsystems and deliver improvements that are broad and consistent rather than localized to a single execution path.
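To make this protocol concrete, here is a minimal sketch of such an evaluation loop in Python. It is an illustrative approximation, not FormulaCode's actual harness: the helper names, the one-command-per-workload setup, and the median-of-repeats timing policy are all assumptions.

import statistics
import subprocess
import time

# Illustrative only: time every workload in both repositories and compare
# median wall-clock timings. A ratio > 1.0 means the agent's edit made
# that workload faster.

def run_workload(repo_path: str, workload_cmd: list[str], repeats: int = 5) -> float:
    """Median wall-clock time (seconds) of one workload inside a repo."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(workload_cmd, cwd=repo_path, check=True,
                       capture_output=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def per_workload_speedups(baseline: str, modified: str,
                          workloads: dict[str, list[str]]) -> dict[str, float]:
    """Map each workload name to baseline_time / modified_time."""
    return {
        name: run_workload(baseline, cmd) / run_workload(modified, cmd)
        for name, cmd in workloads.items()
    }

Collapsing these per-workload ratios into a single number can mask exactly the regressions described above, which is why FormulaCode treats optimization as multi-objective.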

Dataset Construction

FormulaCode consists of multi-workload, real-world code optimization problems drawn from 70 repositories. We developed an automated four-stage pipeline to extract these problems:

01

Repository Scraping

We crawl GitHub repositories with high-quality expert-defined performance workloads.

02

Attribute Filtering

We filter out candidate pull requests whose primary intent was not performance-related, using both rule-based and LLM-based filters.

03

Environment Synthesis

We synthesize environment-building scripts using a reflexive LLM agent so that terminal-interface tools function correctly.

04

Statistical Validation

We discard any candidate PR that does not show a statistically significant improvement on its performance workloads (see the sketch below).
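As a concrete illustration of this final stage, the sketch below applies a one-sided significance test to paired before/after timings of a single workload. The choice of test (Wilcoxon signed-rank), the pairing of runs, and the 0.05 threshold are assumptions made for illustration; the pipeline's actual statistical procedure may differ.

import numpy as np
from scipy import stats

# Illustrative filter: keep a PR only if post-PR runs are significantly
# faster than pre-PR runs on the same workload. `before` and `after` are
# paired per-run timings in seconds.

def pr_shows_significant_speedup(before: np.ndarray, after: np.ndarray,
                                 alpha: float = 0.05) -> bool:
    # One-sided test: is the pre-PR timing stochastically greater than
    # the post-PR timing (i.e., did the PR make the workload faster)?
    result = stats.wilcoxon(before, after, alternative="greater")
    return result.pvalue < alpha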

Key Findings

Agents Improve Runtime but Underperform Experts

Agents can generally improve runtime performance, but they still perform worse than human experts.

Local vs. Global Optimization

Agents are better at local, function-level optimization than at repository-level optimization.

Optimization Strategy Strengths

Agents excel at certain optimization strategies (e.g., parallelizing or batching) but struggle with others (e.g., vectorized operations).

Long-Tail Repository Performance

Agent performance relative to experts varies dramatically with repository popularity: agents perform worst on the 4th popularity quintile and best on the 2nd.

Cost Efficiency

Despite being more expensive per call, agents using frontier LLMs are more cost-effective overall than those using open-weights models.

Multi-Workload Tradeoffs

Compared to human experts, agents make less favorable performance-cost trade-offs.

Dataset growth

Live snapshot from api.formulacode.org. 1,555 verified tasks span 111 months of real-world performance work, last refreshed Mar 26.

Live dashboard ↗
245,477 total PRs · 13,008 performance PRs · 1,555 tasks · 154 repos · 0.63% PR → task rate
[Chart: tasks merged per month, Jan '17 to Mar '26; peak of 43 tasks in a single month.]

Problems by repository

Each tile is a repository; size is the number of verified tasks it contributes. 1,555 tasks across 133 repos.

Results

Agent advantage across the full benchmark and at each aggregation level. Positive bars beat the human expert; negative bars trail.

View full leaderboard →
Overall advantage
Ranked by mean advantage, Σ(agent speedup − oracle speedup) / N. Bars are symmetric around 0.
OpenHands · Claude 4.0 Sonnet: -0.0112
OpenHands · GPT-5: -0.0209
OpenHands · Qwen 3 Coder: -0.0301
Terminus 2 · Claude 4.0 Sonnet: -0.0410
Terminus 2 · Gemini 2.5 Pro: -0.0433
Terminus 2 · Qwen 3 Coder: -0.0454
Terminus 2 · GPT-5: -0.0504

Overall: every agent underperforms the expert. OpenHands · Claude 4.0 Sonnet is closest at -0.0112; Terminus 2 · GPT-5 is furthest behind at -0.0504.
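As a toy illustration of this advantage metric, the snippet below computes the mean per-task gap between an agent's speedup and the expert's. The numbers are made up for the example; they are not benchmark data.

import numpy as np

# Hypothetical per-task speedups for one agent and the human expert (oracle).
agent_speedup = np.array([1.05, 1.20, 0.98, 1.10])
oracle_speedup = np.array([1.08, 1.25, 1.01, 1.09])

# Mean advantage: (1/N) * Σ (agent speedup − oracle speedup).
# Negative means the agent trails the expert on average.
advantage = float(np.mean(agent_speedup - oracle_speedup))
print(f"overall advantage: {advantage:+.4f}")  # prints -0.0250

On this convention, OpenHands · Claude 4.0 Sonnet's -0.0112 means its per-task speedup trails the expert's by about 0.011× on average.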

Compact Leaderboard

View Full Leaderboard →
Rank  Agent · Model                    Speedup  Advantage
01    OpenHands · Claude 4.0 Sonnet    1.054×   -0.0112
02    OpenHands · GPT-5                1.083×   -0.0209
03    OpenHands · Qwen 3 Coder         1.035×   -0.0301
04    Terminus 2 · Claude 4.0 Sonnet   1.099×   -0.0410
05    Terminus 2 · Gemini 2.5 Pro     1.096×   -0.0433
06    Terminus 2 · Qwen 3 Coder        1.068×   -0.0454

Don't see your model? Submit it!

To evaluate an agent on FormulaCode, follow the Installation instructions and run:

$ fceval run -d formulacode -a [your-agent-name] -m [your-model-name]