ResearchGym

Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi1    Manasi Patwardhan1    Arman Cohan2
1TCS Research 2Yale University
5 tasks · 39 sub-tasks | 3 scaffolds · 3 seeds | $10-$20 budget per run | 12-24h autonomous runs

Tasks sourced from ACL, ICML, and ICLR oral & spotlight papers · objective SOTA-based evaluation · runs on a single GPU · covers the full ML research cycle

Leaderboard

Best scores are normalized against state-of-the-art baselines. Higher is better; 100% matches SOTA.

#  Agent        Model                CL      MDT     CMR     TIM      IRB     Avg
1  RG-Agent     gpt-5-high           94.3%   49.4%   96.3%   107.2%   34.3%   76.3%
2  Codex        gpt-5.2-codex xhigh  97.9%   49.0%   96.6%   17.1%    49.7%   62.1%
3  Claude Code  claude-opus-4.5      98.5%   --      --      --       21.7%   24.0%

CL = Continual Learning · MDT = Materials Tokenization · CMR = Cross-Modal Retrieval · TIM = Time-Series Explanation · IRB = Improving Replay Buffers · -- = no valid submission
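The normalization can be sketched as a simple ratio to the SOTA baseline. This is a hypothetical helper for illustration, not ResearchGym's actual scoring code:

```python
def normalized_score(agent_metric: float, sota_metric: float) -> float:
    """Express an agent's metric as a percentage of the SOTA baseline.

    100.0 means the agent matches SOTA; values above 100 exceed it.
    Assumes a higher-is-better metric on both sides.
    """
    return 100.0 * agent_metric / sota_metric
```

For a lower-is-better metric the ratio would be inverted; the sketch above covers only the higher-is-better case.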

Research Tasks

Each task mirrors the full ML research cycle: problem understanding, approach design, implementation, and experimentation with fixed compute.

Continual Learning (Vision)

Scalable continual learning for foundation models without rehearsal. Benchmarks: ImageNet-R/A, CIFAR-100, CUB-200.

Metrics: Accuracy, AAA · Baseline: InfLoRA 86.75% · Peak agent score: 91% of SOTA
Cross-Modal Retrieval (Vision-Language)

Query shift adaptation in cross-modal retrieval. Benchmarks: COCO-C, Flickr-C with 16 image and 15 text corruptions.

Metrics: Recall@1 · Baseline: EATA 54.4% · Peak agent score: 96% of SOTA
Improving Replay Buffers (RL)

Memory systems for efficient experience replay. Environments: DeepMind Control Suite, OpenAI Gym.

Metrics: Average Return · Baseline: SynthER 727 · Peak agent score: 34% of SOTA
Materials Tokenization (NLP / Science)

Domain-preserving tokenization for materials science. SciBERT backbone, MatSci-NLP benchmark.

Metrics: Micro-F1, Macro-F1 · Baseline: PickyBPE 75.6% · Peak agent score: 48% of SOTA
Time-Series Explanation (XAI)

Directionally-aware explanations for time series. Datasets: PAM, Boiler, Epilepsy, Wafer, Freezer.

Metrics: CPD, AUP, AUR · Baseline: IG 0.573 CPD · Peak agent score: 107% of SOTA (exceeded baseline)

Agent Scaffolds

Three agent scaffolds, each backed by a frontier model, cover the research capabilities required by the tasks.

RG-Agent (gpt-5-high)

Primary research agent with comprehensive tooling for file operations, code execution, and experiment management.

Tools: bash, python, file ops, async jobs, web search

Codex (gpt-5.2-codex xhigh)

Code-specialized agent optimized for implementation tasks, rapid prototyping, and iterative development.

Tools: code generation, refactoring, debugging, testing

Claude Code (claude-opus-4.5)

Anthropic's agentic coding assistant with strong reasoning and multi-step planning capabilities.

Strengths: reasoning, planning, analysis, code

Architecture

Modular adapter pattern with unified interfaces for agents, runtimes, and evaluation systems.

Entry Point: CLI, workspace, resume
Agent Adapters: prepare_workspace() → run()
Runtime Systems: UV local | Docker
AgenticEnv: Gym-like interface
💰 Budget Enforcement: real-time cost tracking with per-model pricing via LiteLLM
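Budget enforcement of this kind can be sketched as a running tally charged per API call. This is a hypothetical tracker with placeholder prices; ResearchGym's actual accounting goes through LiteLLM's per-model pricing and may differ.

```python
class BudgetTracker:
    """Accumulate per-call cost and stop the run once the budget is spent."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, prompt_tokens: int, completion_tokens: int,
               price_in: float, price_out: float) -> None:
        """Record one call's cost; prices are USD per 1M tokens (illustrative)."""
        self.spent_usd += (prompt_tokens * price_in
                           + completion_tokens * price_out) / 1e6
        if self.spent_usd > self.budget_usd:
            raise RuntimeError("budget exhausted")
```

Raising as soon as the tally crosses the budget is what makes the $10-$20 cap a hard limit rather than a soft guideline.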

🔄 Resume & Continuation: session state preservation with transcript-based context recovery

🐳 Docker Isolation: containerized execution with deterministic environments

📊 Comprehensive Logging: transcripts, cost summaries, and execution artifacts

🔒 Workspace Security: path-bounded file operations with process isolation

Citation

If you use ResearchGym in your research, please cite our paper.

ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
arXiv preprint, 2026
BibTeX
@misc{garikaparthi2026researchgymevaluatinglanguagemodel,
      title={ResearchGym: Evaluating Language Model Agents on Real-World AI Research}, 
      author={Aniketh Garikaparthi and Manasi Patwardhan and Arman Cohan},
      year={2026},
      eprint={2602.15112},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.15112}, 
}

Quick Start

Requires Python 3.12+ and uv package manager.

# Clone and setup
git clone https://github.com/Anikethh/ResearchGym.git
cd ResearchGym
uv sync

# Run an agent on a task
python run_agent.py tasks/test/continual-learning rg-agent \
    --runtime docker \
    --model openai/gpt-5-high \
    --basic_hours 12