Instructed code editing, where LLMs directly modify a developer's existing code based on a user instruction, is becoming a widely used interaction mode in AI coding assistants. However, few benchmarks directly evaluate this capability, and current datasets often rely on artificial sources. We introduce EDIT-Bench, a benchmark for evaluating LLM code editing capabilities grounded in real-world usage, i.e., user instructions and code contexts collected in the wild. EDIT-Bench comprises 540 problems, multiple natural and programming languages, and a diverse set of real-world use cases, ranging from resolving errors to adding features. EDIT-Bench introduces context-dependent problems that require the model to understand code context, highlighted code, and cursor position in addition to the user instruction. We evaluate 40 diverse LLMs and observe that EDIT-Bench is a challenging set of problems where only 1 model scores over 60%. We find that model performance varies across different categories of user instructions. Further, we find that varying levels of contextual information greatly affect task success rate, with performance varying by up to 11%, indicating the importance of evaluating with realistic context.
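To make the abstract's notion of a context-dependent problem concrete, below is a minimal sketch of what a single benchmark instance might contain, based only on the elements the abstract names (code context, highlighted code, cursor position, user instruction). The field names and structure are illustrative assumptions, not EDIT-Bench's actual schema.

```python
# Hypothetical sketch of a context-dependent instructed-edit problem.
# Fields are assumptions inferred from the abstract, not the benchmark's real format.
from dataclasses import dataclass
from typing import Tuple


@dataclass
class EditProblem:
    code_context: str                 # the developer's existing file contents
    highlighted_range: Tuple[int, int]  # (start_line, end_line) the user selected
    cursor_line: int                  # editor cursor position
    instruction: str                  # natural-language edit request
    language: str                     # programming language of the file


# Illustrative example in the spirit of the "resolving errors" use case.
example = EditProblem(
    code_context="def mean(xs):\n    return sum(xs) / len(xs)\n",
    highlighted_range=(2, 2),
    cursor_line=2,
    instruction="Handle the empty-list case instead of raising ZeroDivisionError.",
    language="python",
)

print(example.instruction)
```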
Wayne Chi, Valerie Chen, Ryan Shar, Anuj Mittal, Jenny T. Liang, Wei-Lin Chiang, Anastasios N. Angelopoulos, Ion Stoica, Graham Neubig, Ameet Talwalkar, Chris Donahue (2025). EDIT-Bench: Evaluating LLM Abilities to Perform Real-World Instructed Code Edits. arXiv preprint. DOI: https://doi.org/10.48550/arxiv.2511.04486.
Type: Preprint
Year: 2025
Authors: 11
Datasets: 0
Total Files: 0
DOI: https://doi.org/10.48550/arxiv.2511.04486