FormulaCode — Evaluating Agentic Optimization on Large Codebases

Atharva Sehgal 1 James Hou 2 Akanksha Sarkar 3 Ishaan Mantripragada 2 Swarat Chaudhuri 1 Jennifer J. Sun 3 Yisong Yue 2

1The University of Texas at Austin 2California Institute of Technology 3Cornell University

A live benchmark of 957 performance bottlenecks mined from scientific Python repositories — pairing every task with expert patches and ~265 community workloads.

957 tasks · 70+ repositories · 1.4M workloads · 4 strata

Benchmark Design

Each FormulaCode task evaluates an agent's ability to optimize a real-world codebase under strict correctness constraints. A task begins with a baseline repository: the unmodified implementation. The agent operates on this baseline and produces a modified version of the repository through arbitrary repository-level edits.

Performance evaluation proceeds by executing the full set of workloads on both the baseline and the agent-modified code and comparing their measured outcomes. Improving performance on one workload may degrade performance on others. As a result, optimization in FormulaCode is inherently multi-objective: agents must reason about trade-offs across subsystems and deliver improvements that are broad and consistent rather than localized to a single execution path.
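To make this protocol concrete, here is a minimal sketch of such an evaluation loop in Python. It is an illustrative approximation, not FormulaCode's actual harness: the helper names, the one-command-per-workload setup, and the median-of-repeats timing policy are all assumptions.

import statistics
import subprocess
import time

# Illustrative only: time every workload in both repositories and compare
# median wall-clock timings. A ratio > 1.0 means the agent's edit made
# that workload faster.

def run_workload(repo_path: str, workload_cmd: list[str], repeats: int = 5) -> float:
    """Median wall-clock time (seconds) of one workload inside a repo."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        subprocess.run(workload_cmd, cwd=repo_path, check=True,
                       capture_output=True)
        timings.append(time.perf_counter() - start)
    return statistics.median(timings)

def per_workload_speedups(baseline: str, modified: str,
                          workloads: dict[str, list[str]]) -> dict[str, float]:
    """Map each workload name to baseline_time / modified_time."""
    return {
        name: run_workload(baseline, cmd) / run_workload(modified, cmd)
        for name, cmd in workloads.items()
    }

Collapsing these per-workload ratios into a single number can mask exactly the regressions described above, which is why FormulaCode treats optimization as multi-objective.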

Dataset Construction

FormulaCode consists of multi-workload, real-world code optimization problems drawn from 70 repositories. We developed an automated four-stage pipeline to extract these problems:

01

Repository Scraping

We crawl GitHub repositories with high-quality expert-defined performance workloads.

02

Attribute Filtering

We filter out candidate pull requests whose primary intent was not performance-related, using both rule-based and LLM-based filters.

03

Environment Synthesis

We synthesize environment-building scripts using a reflexive LLM agent so that terminal-interface tools function correctly.

04

Statistical Validation

We discard any candidate PR that does not show a statistically significant improvement on its performance workloads (see the sketch below).
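As a concrete illustration of this final stage, the sketch below applies a one-sided significance test to paired before/after timings of a single workload. The choice of test (Wilcoxon signed-rank), the pairing of runs, and the 0.05 threshold are assumptions made for illustration; the pipeline's actual statistical procedure may differ.

import numpy as np
from scipy import stats

# Illustrative filter: keep a PR only if post-PR runs are significantly
# faster than pre-PR runs on the same workload. `before` and `after` are
# paired per-run timings in seconds.

def pr_shows_significant_speedup(before: np.ndarray, after: np.ndarray,
                                 alpha: float = 0.05) -> bool:
    # One-sided test: is the pre-PR timing stochastically greater than
    # the post-PR timing (i.e., did the PR make the workload faster)?
    result = stats.wilcoxon(before, after, alternative="greater")
    return result.pvalue < alpha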

Key Findings

Agents Improve Runtime but Underperform Experts

Agents can generally improve runtime performance, but they still perform worse than human experts.

Local vs. Global Optimization

Agents are better at local, function-level optimization than at repository-level optimization.

Optimization Strategy Strengths

Agents excel at certain optimization strategies (e.g., parallelizing or batching) but struggle with others (e.g., vectorized operations).

Long-Tail Repository Performance

Agent performance relative to experts varies dramatically with repository popularity: agents perform worst on the 4th popularity quintile and best on the 2nd.

Cost Efficiency

Despite being more expensive per call, agents using frontier LLMs are more cost-effective overall than those using open-weights models.

Multi-Workload Tradeoffs

Compared to human experts, agents make less favorable performance-cost trade-offs.

Dataset growth

Live snapshot from api.formulacode.org. 1,555 verified tasks span 111 months of real-world performance work, last refreshed Mar 26.

Live dashboard ↗
245,477 total PRs · 13,008 performance PRs · 1,555 tasks · 154 repos · 0.63% PR → task rate
[Chart: tasks merged per month, Jan '17 to Mar '26; peak of 43 tasks in a single month.]

Problems by repository

Each tile is a repository; size is the number of verified tasks it contributes. 1,555 tasks across 133 repos.

Results

Agent advantage across the full benchmark and at each aggregation level. Positive bars beat the human expert; negative bars trail.

View full leaderboard →
Overall advantage
Ranked by mean advantage, Σ(agent speedup − oracle speedup) / N. Bars are symmetric around 0.
OpenHands · Claude 4.0 Sonnet: -0.0112
OpenHands · GPT-5: -0.0209
OpenHands · Qwen 3 Coder: -0.0301
Terminus 2 · Claude 4.0 Sonnet: -0.0410
Terminus 2 · Gemini 2.5 Pro: -0.0433
Terminus 2 · Qwen 3 Coder: -0.0454
Terminus 2 · GPT-5: -0.0504

Overall: every agent underperforms the expert. OpenHands · Claude 4.0 Sonnet is closest at -0.0112; Terminus 2 · GPT-5 is furthest behind at -0.0504.
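As a toy illustration of this advantage metric, the snippet below computes the mean per-task gap between an agent's speedup and the expert's. The numbers are made up for the example; they are not benchmark data.

import numpy as np

# Hypothetical per-task speedups for one agent and the human expert (oracle).
agent_speedup = np.array([1.05, 1.20, 0.98, 1.10])
oracle_speedup = np.array([1.08, 1.25, 1.01, 1.09])

# Mean advantage: (1/N) * Σ (agent speedup − oracle speedup).
# Negative means the agent trails the expert on average.
advantage = float(np.mean(agent_speedup - oracle_speedup))
print(f"overall advantage: {advantage:+.4f}")  # prints -0.0250

On this convention, OpenHands · Claude 4.0 Sonnet's -0.0112 means its per-task speedup trails the expert's by about 0.011× on average.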

Compact Leaderboard

View Full Leaderboard →
Rank  Agent · Model                    Speedup  Advantage
01    OpenHands · Claude 4.0 Sonnet    1.054×   -0.0112
02    OpenHands · GPT-5                1.083×   -0.0209
03    OpenHands · Qwen 3 Coder         1.035×   -0.0301
04    Terminus 2 · Claude 4.0 Sonnet   1.099×   -0.0410
05    Terminus 2 · Gemini 2.5 Pro     1.096×   -0.0433
06    Terminus 2 · Qwen 3 Coder        1.068×   -0.0454

Don't see your model? Submit it!

To evaluate an agent on FormulaCode, follow the Installation instructions and run:

$ fceval run -d formulacode -a [your-agent-name] -m [your-model-name]