OpenHands Index Results

This repository contains benchmark results for various OpenHands agents and LLM configurations.

Data Structure

Agent-Centric Format (Recommended)

Results are organized in the results/ directory with the following structure:

results/
├── {version}_{model_name}/
│   ├── metadata.json
│   └── scores.json

Directory Naming Convention

Each agent directory follows the format: {version}_{model_name}/

{version}: Agent version (semantic version starting with 'v', e.g., v1.8.3)
{model_name}: LLM model name (e.g., claude-sonnet-4-5, GPT-5.2)
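
The naming convention can be checked mechanically. The following is a minimal sketch, not the repository's own validation code: the regular expression and the names DIR_NAME_RE, AgentDir, and parse_agent_dir are illustrative assumptions. It assumes a plain vMAJOR.MINOR.PATCH version with no pre-release suffix.

```python
import re
from typing import NamedTuple

# Assumed pattern: "v" + MAJOR.MINOR.PATCH, an underscore, then the model name.
# Pre-release suffixes are not handled in this sketch.
DIR_NAME_RE = re.compile(r"^(v\d+\.\d+\.\d+)_(.+)$")


class AgentDir(NamedTuple):
    version: str
    model_name: str


def parse_agent_dir(name: str) -> AgentDir:
    """Split '{version}_{model_name}' into its two parts."""
    match = DIR_NAME_RE.match(name.rstrip("/"))
    if match is None:
        raise ValueError(f"not a valid agent directory name: {name!r}")
    return AgentDir(version=match.group(1), model_name=match.group(2))


print(parse_agent_dir("v1.8.3_claude-sonnet-4-5"))
# prints AgentDir(version='v1.8.3', model_name='claude-sonnet-4-5')
```

Because the version part cannot contain an underscore, the split always happens at the first underscore, so model names that contain underscores themselves still parse.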

metadata.json

Contains agent metadata and configuration:

{
  "agent_name": "OpenHands",
  "agent_version": "v1.8.3",
  "model": "claude-sonnet-4-5",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895",
  "directory_name": "v1.8.3_claude-sonnet-4-5"
}

Fields:

agent_name: Display name of the agent
agent_version: Semantic version number prefixed with "v" (e.g., "v1.8.3")
model: LLM model used
openness: Model availability type
- closed_api_available: Commercial API-based models
- open_api_available: Open-source models with API access
- open_weights_available: Open-weights models that can be self-hosted
tool_usage: Agent tooling type
- standard: Standard tool usage
- custom_interface: Custom tool interface
submission_time: ISO 8601 timestamp
directory_name: Name of the directory that contains this file (e.g., "v1.8.3_claude-sonnet-4-5")
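
A quick sanity check for a metadata.json record might look like the sketch below. It is an illustration built only from the fields and allowed values listed above, not the logic of scripts/validate_schema.py. The function name check_metadata and the choice to treat every field as required are assumptions.

```python
from datetime import datetime

# Allowed values, taken from the field descriptions above.
OPENNESS = {"closed_api_available", "open_api_available", "open_weights_available"}
TOOL_USAGE = {"standard", "custom_interface"}
# Assumption: every field shown in the example is required.
REQUIRED_FIELDS = (
    "agent_name", "agent_version", "model", "openness",
    "tool_usage", "submission_time", "directory_name",
)


def check_metadata(meta: dict) -> list:
    """Return a list of problems; an empty list means the record looks fine."""
    problems = [f"missing field: {key}" for key in REQUIRED_FIELDS if key not in meta]
    if "openness" in meta and meta["openness"] not in OPENNESS:
        problems.append(f"unknown openness: {meta['openness']!r}")
    if "tool_usage" in meta and meta["tool_usage"] not in TOOL_USAGE:
        problems.append(f"unknown tool_usage: {meta['tool_usage']!r}")
    if "submission_time" in meta:
        try:
            datetime.fromisoformat(meta["submission_time"])
        except (TypeError, ValueError):
            problems.append("submission_time is not an ISO 8601 timestamp")
    return problems


example = {
    "agent_name": "OpenHands",
    "agent_version": "v1.8.3",
    "model": "claude-sonnet-4-5",
    "openness": "closed_api_available",
    "tool_usage": "standard",
    "submission_time": "2025-11-24T19:56:00.092895",
    "directory_name": "v1.8.3_claude-sonnet-4-5",
}
print(check_metadata(example))  # prints []
```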

scores.json

Contains benchmark scores and performance metrics:

[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "average_runtime": 3600,
    "tags": ["bug_fixing"]
  },
  ...
]

Fields:

benchmark: Benchmark identifier (e.g., "swe-bench", "commit0")
score: Primary metric score (percentage or numeric value)
metric: Type of metric (e.g., "resolve_rate", "success_rate")
total_cost: Total API cost in USD
average_runtime: Average runtime per instance in seconds (optional)
tags: Category tags for grouping (e.g., ["bug_fixing"], ["app_creation"])
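
The field list above translates into a small structural check. This is a sketch under stated assumptions: check_scores and is_number are hypothetical helpers, and every field except average_runtime is treated as required, which the text implies but does not state outright.

```python
import json

# Assumption: only average_runtime is optional, as noted in the field list.
REQUIRED_FIELDS = ("benchmark", "score", "metric", "total_cost", "tags")


def is_number(value) -> bool:
    """True for int or float, but not bool (which is a subclass of int)."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def check_scores(entries: list) -> list:
    """Return a list of problems found across all score entries."""
    problems = []
    for index, entry in enumerate(entries):
        for key in REQUIRED_FIELDS:
            if key not in entry:
                problems.append(f"entry {index}: missing field: {key}")
        for key in ("score", "total_cost", "average_runtime"):
            if key in entry and not is_number(entry[key]):
                problems.append(f"entry {index}: {key} must be a number")
        tags = entry.get("tags", [])
        if not isinstance(tags, list) or not all(isinstance(t, str) for t in tags):
            problems.append(f"entry {index}: tags must be a list of strings")
    return problems


entries = json.loads("""
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "average_runtime": 3600,
    "tags": ["bug_fixing"]
  }
]
""")
print(check_scores(entries))  # prints []
```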

Alternative Agents

Results from non-OpenHands agents (Claude Code, Codex, etc.) are stored in the alternative_agents/ directory, organized by agent type and model:

alternative_agents/
├── claude_code/
│   └── {model_name}/
│       ├── metadata.json
│       └── scores.json
├── openhands_subagents/
│   └── {model_name}/
│       ├── metadata.json
│       └── scores.json

Each model directory under an agent type uses the same metadata.json and scores.json format as results/. The directory_name field in metadata.json should match the name of the containing directory, as in results/; here that is the model name.

The validation script (scripts/validate_schema.py) automatically validates both results/ and alternative_agents/ directories. Alternative agent results are reported separately in both validation and progress reports.
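
To illustrate how both trees can be walked with one routine, here is a minimal sketch. It does not reproduce scripts/validate_schema.py; find_result_dirs is a hypothetical helper, and the demo builds a throwaway directory tree with empty placeholder files rather than reading the real repository.

```python
import tempfile
from pathlib import Path


def find_result_dirs(repo_root: Path) -> list:
    """Return (top_level, directory) pairs for every directory holding both JSON files."""
    found = []
    for top in ("results", "alternative_agents"):
        base = repo_root / top
        if not base.is_dir():
            continue
        # rglob handles both depths: results/{dir}/ and alternative_agents/{agent}/{model}/
        for meta in sorted(base.rglob("metadata.json")):
            if (meta.parent / "scores.json").is_file():
                found.append((top, meta.parent))
    return found


# Demo on a temporary tree that mimics the layout described above.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for rel in (
        "results/v1.8.3_claude-sonnet-4-5",
        "alternative_agents/claude_code/claude-sonnet-4-5",
    ):
        directory = root / rel
        directory.mkdir(parents=True)
        (directory / "metadata.json").write_text("{}", encoding="utf-8")
        (directory / "scores.json").write_text("[]", encoding="utf-8")
    found = [(top, path.relative_to(root).as_posix()) for top, path in find_result_dirs(root)]

for top, rel_path in found:
    print(top, rel_path)
```

Keeping the top-level name in each pair makes it straightforward to report alternative agent results separately from results/.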

Legacy Format (Backward Compatible)

The 1.0.0-dev1/ directory contains the original benchmark-centric JSONL files:

swe-bench.jsonl
swe-bench-multimodal.jsonl
commit0.jsonl
swt-bench.jsonl
gaia.jsonl

This format is maintained for backward compatibility.
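
The legacy files are JSONL, that is, one JSON object per line. A generic reader is sketched below. The record schema of the legacy files is not described here, so the demo uses an obviously fake key (placeholder_key) and a temporary file. read_jsonl is a hypothetical helper name.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path) -> list:
    """Parse a JSONL file into a list of objects, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as handle:
        for line_no, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_no}: {exc}") from exc
    return records


# Demo with placeholder content; real legacy records will have other keys.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.jsonl"
    sample.write_text('{"placeholder_key": 1}\n\n{"placeholder_key": 2}\n', encoding="utf-8")
    records = read_jsonl(sample)

print(len(records))  # prints 2
```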

Supported Benchmarks

Bug Fixing

SWE-Bench: Resolving GitHub issues from real Python repositories
SWE-Bench-Multimodal: Similar to SWE-Bench with multimodal inputs

App Creation

Commit0: Building applications from scratch based on specifications

Test Generation

SWT-Bench: Generating comprehensive test suites

Information Gathering

GAIA: General AI assistant tasks requiring web search and reasoning

Benchmark Categories

Results are grouped into four main categories on the leaderboard:

Bug Fixing: SWE-Bench, SWE-Bench-Multimodal
App Creation: Commit0
Test Generation: SWT-Bench
Information Gathering: GAIA
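
The grouping above can be expressed as a lookup table. In this sketch the benchmark identifiers are assumptions: "swe-bench" and "commit0" appear as identifiers in the scores.json field list, while the other three are inferred from the legacy file names. category_of is a hypothetical helper, not part of the leaderboard code.

```python
from typing import Optional

# Category -> benchmark identifiers (lowercase). Identifiers other than
# "swe-bench" and "commit0" are inferred from the legacy file names.
CATEGORY_BENCHMARKS = {
    "Bug Fixing": ["swe-bench", "swe-bench-multimodal"],
    "App Creation": ["commit0"],
    "Test Generation": ["swt-bench"],
    "Information Gathering": ["gaia"],
}

# Inverted index for direct lookup by benchmark identifier.
BENCHMARK_TO_CATEGORY = {
    benchmark: category
    for category, benchmarks in CATEGORY_BENCHMARKS.items()
    for benchmark in benchmarks
}


def category_of(benchmark: str) -> Optional[str]:
    """Return the leaderboard category for a benchmark, or None if it is unknown."""
    return BENCHMARK_TO_CATEGORY.get(benchmark.strip().lower())


print(category_of("swe-bench"))  # prints Bug Fixing
print(category_of("SWT-Bench"))  # prints Test Generation
```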

Adding New Results

To add new benchmark results:

1. Create a directory following the naming convention: results/{version}_{model_name}/
2. Add metadata.json with agent configuration
3. Add scores.json with benchmark results
4. Commit and push to the repository

Example:

# Create directory
mkdir -p results/v1.8.3_claude-sonnet-4-5/

# Add metadata
cat > results/v1.8.3_claude-sonnet-4-5/metadata.json << 'EOF'
{
  "agent_name": "OpenHands",
  "agent_version": "v1.8.3",
  "model": "claude-sonnet-4-5",
  "openness": "closed_api_available",
  "tool_usage": "standard",
  "submission_time": "2025-11-24T19:56:00.092895",
  "directory_name": "v1.8.3_claude-sonnet-4-5"
}
EOF

# Add scores
cat > results/v1.8.3_claude-sonnet-4-5/scores.json << 'EOF'
[
  {
    "benchmark": "swe-bench",
    "score": 45.1,
    "metric": "resolve_rate",
    "total_cost": 32.55,
    "average_runtime": 3600,
    "tags": ["bug_fixing"]
  }
]
EOF

# Commit and push
git add results/v1.8.3_claude-sonnet-4-5/
git commit -m "Add results for OpenHands v1.8.3 with Claude 4.5 Sonnet"
git push origin main
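
The file-creation steps can also be scripted so that directory_name never drifts from the actual directory. The sketch below is one possible approach, not an official tool. write_submission is a hypothetical helper, and it assumes the directory name is always agent_version, an underscore, then model, as in the example above. The demo writes to a temporary directory instead of the real results/ tree.

```python
import json
import tempfile
from pathlib import Path


def write_submission(results_root, metadata: dict, scores: list) -> Path:
    """Create {version}_{model_name}/ under results_root and write both JSON files."""
    # Assumption: the directory name is derived from agent_version and model.
    dir_name = f"{metadata['agent_version']}_{metadata['model']}"
    if metadata.get("directory_name", dir_name) != dir_name:
        raise ValueError(f"directory_name should be {dir_name!r}")
    target = Path(results_root) / dir_name
    target.mkdir(parents=True, exist_ok=True)
    full_metadata = {**metadata, "directory_name": dir_name}
    (target / "metadata.json").write_text(
        json.dumps(full_metadata, indent=2) + "\n", encoding="utf-8"
    )
    (target / "scores.json").write_text(
        json.dumps(scores, indent=2) + "\n", encoding="utf-8"
    )
    return target


metadata = {
    "agent_name": "OpenHands",
    "agent_version": "v1.8.3",
    "model": "claude-sonnet-4-5",
    "openness": "closed_api_available",
    "tool_usage": "standard",
    "submission_time": "2025-11-24T19:56:00.092895",
}
scores = [
    {
        "benchmark": "swe-bench",
        "score": 45.1,
        "metric": "resolve_rate",
        "total_cost": 32.55,
        "average_runtime": 3600,
        "tags": ["bug_fixing"],
    }
]

with tempfile.TemporaryDirectory() as tmp:
    target = write_submission(Path(tmp) / "results", metadata, scores)
    written = json.loads((target / "metadata.json").read_text(encoding="utf-8"))
    created = sorted(path.name for path in target.iterdir())

print(target.name)  # prints v1.8.3_claude-sonnet-4-5
```

The git add, commit, and push steps stay the same as in the shell example.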

Leaderboard

View the live leaderboard at: https://huggingface.co/spaces/OpenHands/openhands-index

License

MIT License - See repository for details.
