ResearchGym

Evaluating Language Model Agents on Real-World AI Research

Aniketh Garikaparthi1    Manasi Patwardhan1    Arman Cohan2
1TCS Research 2Yale University
5 tasks · 39 sub-tasks | 3 scaffolds · 3 seeds | $10-$20 budget per run | 12-24h autonomous runs

Tasks sourced from ACL, ICML, and ICLR oral & spotlight papers · objective SOTA-based evaluation · runs on a single GPU · covers the full ML research cycle

Leaderboard

Best scores are normalized against state-of-the-art baselines. Higher is better; 100% matches SOTA.

#  Agent        Model                CL      MDT     CMR     TIM      IRB     Avg
1  RG-Agent     gpt-5-high           94.3%   49.4%   96.3%   107.2%   34.3%   76.3%
2  Codex        gpt-5.2-codex xhigh  97.9%   49.0%   96.6%   17.1%    49.7%   62.1%
3  Claude Code  claude-opus-4.5      98.5%   --      --      --       21.7%   24.0%

CL = Continual Learning · MDT = Materials Tokenization · CMR = Cross-Modal Retrieval · TIM = Time-Series Explanation · IRB = Improving Replay Buffers · -- = no valid submission
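The normalization can be sketched as a simple ratio to the SOTA baseline. This is a hypothetical helper for illustration, not ResearchGym's actual scoring code:

```python
def normalized_score(agent_metric: float, sota_metric: float) -> float:
    """Express an agent's metric as a percentage of the SOTA baseline.

    100.0 means the agent matches SOTA; values above 100 exceed it.
    Assumes a higher-is-better metric on both sides.
    """
    return 100.0 * agent_metric / sota_metric
```

For a lower-is-better metric the ratio would be inverted; the sketch above covers only the higher-is-better case.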

Research Tasks

Each task mirrors the full ML research cycle: problem understanding, approach design, implementation, and experimentation with fixed compute.

Continual Learning (Vision)

Scalable continual learning for foundation models without rehearsal. Benchmarks: ImageNet-R/A, CIFAR-100, CUB-200.

Metrics: Accuracy, AAA · Baseline: InfLoRA 86.75% · Peak agent score: 91% of SOTA
Cross-Modal Retrieval (Vision-Language)

Query shift adaptation in cross-modal retrieval. Benchmarks: COCO-C, Flickr-C with 16 image and 15 text corruptions.

Metrics: Recall@1 · Baseline: EATA 54.4% · Peak agent score: 96% of SOTA
Improving Replay Buffers (RL)

Memory systems for efficient experience replay. Environments: DeepMind Control Suite, OpenAI Gym.

Metrics: Average Return · Baseline: SynthER 727 · Peak agent score: 34% of SOTA
Materials Tokenization (NLP / Science)

Domain-preserving tokenization for materials science. SciBERT backbone, MatSci-NLP benchmark.

Metrics: Micro-F1, Macro-F1 · Baseline: PickyBPE 75.6% · Peak agent score: 48% of SOTA
Time-Series Explanation (XAI)

Directionally-aware explanations for time series. Datasets: PAM, Boiler, Epilepsy, Wafer, Freezer.

Metrics: CPD, AUP, AUR · Baseline: IG 0.573 CPD · Peak agent score: 107% of SOTA (exceeded baseline)

Agent Scaffolds

Three agent scaffolds, each backed by a frontier model, cover the research capabilities required by the tasks.

RG-Agent (gpt-5-high)

Primary research agent with comprehensive tooling for file operations, code execution, and experiment management.

Tools: bash, python, file ops, async jobs, web search

Codex (gpt-5.2-codex xhigh)

Code-specialized agent optimized for implementation tasks, rapid prototyping, and iterative development.

Tools: code generation, refactoring, debugging, testing

Claude Code (claude-opus-4.5)

Anthropic's agentic coding assistant with strong reasoning and multi-step planning capabilities.

Strengths: reasoning, planning, analysis, code

Architecture

Modular adapter pattern with unified interfaces for agents, runtimes, and evaluation systems.

Entry Point: CLI, workspace, resume
Agent Adapters: prepare_workspace() → run()
Runtime Systems: UV local | Docker
AgenticEnv: Gym-like interface
💰 Budget Enforcement: real-time cost tracking with per-model pricing via LiteLLM
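Budget enforcement of this kind can be sketched as a running tally charged per API call. This is a hypothetical tracker with placeholder prices; ResearchGym's actual accounting goes through LiteLLM's per-model pricing and may differ.

```python
class BudgetTracker:
    """Accumulate per-call cost and stop the run once the budget is spent."""

    def __init__(self, budget_usd: float):
        self.budget_usd = budget_usd
        self.spent_usd = 0.0

    def charge(self, prompt_tokens: int, completion_tokens: int,
               price_in: float, price_out: float) -> None:
        """Record one call's cost; prices are USD per 1M tokens (illustrative)."""
        self.spent_usd += (prompt_tokens * price_in
                           + completion_tokens * price_out) / 1e6
        if self.spent_usd > self.budget_usd:
            raise RuntimeError("budget exhausted")
```

Raising as soon as the tally crosses the budget is what makes the $10-$20 cap a hard limit rather than a soft guideline.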

🔄 Resume & Continuation: session state preservation with transcript-based context recovery

🐳 Docker Isolation: containerized execution with deterministic environments

📊 Comprehensive Logging: transcripts, cost summaries, and execution artifacts

🔒 Workspace Security: path-bounded file operations with process isolation

Citation

If you use ResearchGym in your research, please cite our paper.

ResearchGym: Evaluating Language Model Agents on Real-World AI Research
Aniketh Garikaparthi, Manasi Patwardhan, Arman Cohan
arXiv preprint, 2026
BibTeX
@misc{garikaparthi2026researchgymevaluatinglanguagemodel,
      title={ResearchGym: Evaluating Language Model Agents on Real-World AI Research}, 
      author={Aniketh Garikaparthi and Manasi Patwardhan and Arman Cohan},
      year={2026},
      eprint={2602.15112},
      archivePrefix={arXiv},
      primaryClass={cs.AI},
      url={https://arxiv.org/abs/2602.15112}, 
}

Quick Start

Requires Python 3.12+ and uv package manager.

# Clone and setup
git clone https://github.com/Anikethh/ResearchGym.git
cd ResearchGym
uv sync

# Run an agent on a task
python run_agent.py tasks/test/continual-learning rg-agent \
    --runtime docker \
    --model openai/gpt-5-high \
    --basic_hours 12