
Commit bd8b1bf

Add a new benchmark: AlgoTune (#10724)
Co-authored-by: linhaowei <linhaowei@wizardquant.com>
Co-authored-by: Graham Neubig <neubig@gmail.com>
1 parent a4f1100 commit bd8b1bf

10 files changed: +1723 −0 lines changed

Lines changed: 108 additions & 0 deletions
@@ -0,0 +1,108 @@
# AlgoTune Benchmark

AlgoTune is a benchmark for evaluating AI agents' ability to optimize and accelerate algorithmic implementations across a wide range of computational tasks.

## Overview

The benchmark consists of over 100 diverse algorithmic tasks spanning:
- **Numerical Computing**: Matrix operations, linear algebra, numerical integration
- **Machine Learning**: Clustering, dimensionality reduction, optimization
- **Graph Algorithms**: Path finding, graph theory, network analysis
- **Signal Processing**: FFT, convolution, filtering
- **Cryptography**: Encryption, hashing, security primitives
- **Optimization Problems**: Linear programming, constraint satisfaction
- **Statistical Methods**: Hypothesis testing, density estimation

Each task requires implementing a `Solver` class with a `solve()` method that must outperform a reference implementation while maintaining correctness.

## Task Structure

Each task directory (`tasks/algotune-*`) contains:
- `problem_statement.txt`: Detailed description, input/output format, and reference implementation
- `evaluator.py`: Validation logic to check solution correctness
- `solution.sh`: Script template for the solver
- `test_outputs.py`: Test cases and evaluation harness
- `run-tests.sh`: Test runner script
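For example, the `algotune-kmeans` task (listed later under Available Tasks) is laid out as follows; file contents differ per task, but the file set is the same:

```
tasks/algotune-kmeans/
├── problem_statement.txt
├── evaluator.py
├── solution.sh
├── test_outputs.py
└── run-tests.sh
```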
## Usage

### Running the Benchmark

Use the main evaluation script:

```bash
poetry run python evaluation/benchmarks/algotune/adapter/run_adapter.py --output-path evaluation/benchmarks/algotune/tasks

poetry run python evaluation/benchmarks/algotune/run_infer.py \
  --agent-cls CodeActAgent \
  --llm-config llm.gpt-5 \
  --optim_task all \
  --max-iterations 500 \
  --eval-num-workers 7
```

Or use the convenience script:

```bash
scripts/run_infer.sh <model_config> <commit_hash> <agent> <optim_task> <max_iter> <num_workers>
```

### Available Tasks

Run with `--optim_task all` to execute all tasks, or specify individual tasks:
- `--optim_task algotune-kmeans` - K-means clustering optimization
- `--optim_task algotune-svd` - Singular Value Decomposition
- ...and many more (see the `tasks/` directory)

## Evaluation Metrics

Each task is evaluated on:
1. **Correctness**: The solution must pass all validation tests
2. **Speedup**: Performance improvement over the reference implementation (target: 10x)
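For intuition, the speedup is essentially the ratio of the reference implementation's runtime to the solver's runtime on the same inputs. A minimal sketch of that arithmetic (illustrative only, not the benchmark's actual harness; `reference_solve` and `candidate_solve` are placeholder callables):

```python
import time


def measure_speedup(reference_solve, candidate_solve, problem, repeats: int = 5) -> float:
    """Return reference_time / candidate_time for one problem instance."""

    def best_time(fn):
        times = []
        for _ in range(repeats):
            start = time.perf_counter()
            fn(problem)
            times.append(time.perf_counter() - start)
        return min(times)  # best-of-N reduces noise from transient system load

    return best_time(reference_solve) / best_time(candidate_solve)


# A task counts as solved only if the output passes validation AND the
# measured speedup reaches the target (e.g. 10x over the reference).
```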
## Environment

The benchmark runs in a Docker container with extensive scientific computing libraries:
- NumPy, SciPy, scikit-learn
- JAX, PyTorch, TensorFlow
- CVXPY, OR-Tools, PuLP
- Numba, Cython, Pythran
- NetworkX, FAISS, and more

## Implementation Requirements

Each solver must implement:

```python
class Solver:
    def solve(self, problem: dict, **kwargs) -> Any:
        # Optimized implementation
        pass
```

The `solve` method should:
- Accept a problem dictionary with task-specific inputs
- Return the solution in the specified format
- Be significantly faster than the reference implementation
- Maintain mathematical correctness and numerical stability
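As a concrete illustration, a solver for a hypothetical linear-system task might look like the sketch below. The problem keys (`'A'`, `'b'`) and the output format are invented for this example; the real inputs and expected outputs are defined per task in `problem_statement.txt`:

```python
from typing import Any

import numpy as np
from scipy.linalg import solve as scipy_solve  # SciPy is available in the benchmark image


class Solver:
    def solve(self, problem: dict, **kwargs) -> Any:
        # Hypothetical inputs: a dense coefficient matrix A and right-hand side b.
        A = np.asarray(problem['A'], dtype=np.float64)
        b = np.asarray(problem['b'], dtype=np.float64)
        # Delegate to an optimized LAPACK-backed routine rather than a
        # hand-rolled elimination loop; the return format must match the task spec.
        return scipy_solve(A, b, assume_a='gen').tolist()
```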
## Development

To add a new task:
1. Create a directory in `tasks/algotune-<task-name>`
2. Include all required files (`problem_statement.txt`, `evaluator.py`, etc.)
3. Define clear input/output specifications
4. Provide a reference implementation and validation logic (see the sketch after this list)
5. Add comprehensive test cases
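For step 4, the adapter-generated `evaluator.py` is essentially the upstream task class renamed to `Task`, exposing a reference `solve` and an `is_solution` validity check; a hand-written evaluator can follow the same shape. A minimal skeleton (method bodies are placeholders to fill in):

```python
class Task:
    def solve(self, problem: dict):
        """Reference implementation; serves as the correctness and timing baseline."""
        raise NotImplementedError

    def is_solution(self, problem: dict, solution) -> bool:
        """Return True if `solution` is a valid answer for `problem`."""
        raise NotImplementedError
```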
## Results

Evaluation results are saved in JSONL format with detailed metrics:
- Runtime performance comparisons
- Validation outcomes
- Error analysis
- Solution quality scores
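Because each result is one JSON object per line, the files can be inspected with standard tooling. A small example of loading them (the path and any field names are illustrative; check the actual output for the exact schema):

```python
import json

records = []
with open('output.jsonl') as f:  # path to a run's results file (illustrative)
    for line in f:
        records.append(json.loads(line))

print(f'{len(records)} task results')
print(sorted(records[0].keys()))  # inspect which metric fields are present
```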
## License

This benchmark is part of the OpenHands evaluation framework. See the main project for licensing information.
Lines changed: 178 additions & 0 deletions
@@ -0,0 +1,178 @@
import logging
import shutil
from abc import ABC, abstractmethod
from pathlib import Path
from typing import ClassVar

from utils import (
    AlgoTuneData,
    extract_function_source,
    get_instruction,
    prepare_solver_code_from_task_file,
    replace_class_name_and_save,
)

logger = logging.getLogger(__name__)

# Assumes a 'template' directory exists alongside this adapter.py file
TEMPLATE_DIR = Path(__file__).parent / 'template'


class BaseAdapter(ABC):
    """Abstract Base Class for adapters."""

    NAME: ClassVar[str]

    def __init__(self, **kwargs):
        super().__init__()

    @abstractmethod
    def generate_task(self, **kwargs) -> None:
        raise NotImplementedError('Adapter must implement this method.')


class AlgoTuneAdapter(BaseAdapter):
    NAME = 'ALGOTUNE'

    def __init__(
        self,
        task_dir: Path,
        cloned_repo_root: Path,
        template_dir: Path = TEMPLATE_DIR,
        **kwargs,
    ):
        super().__init__(**kwargs)
        self.task_dir = task_dir
        self.template_dir = template_dir

        # Encapsulate all data source interactions in a helper class
        self.algotune_data = AlgoTuneData(cloned_repo_root)

        logger.info(
            f'AlgoTuneAdapter initialized. Outputting tasks to: {self.task_dir.resolve()}'
        )
        if not self.template_dir.exists():
            raise FileNotFoundError(
                f'Template directory not found at {self.template_dir}'
            )

    def _prepare_task_from_template(
        self, output_dir: Path, task_name: str, task_py_path: Path
    ):
        """Copies the template and populates all placeholder files."""
        # 1. Copy the entire template directory
        if output_dir.exists():
            shutil.rmtree(output_dir)
        shutil.copytree(self.template_dir, output_dir)

        # 2. Extract all necessary content
        description_content = (
            (task_py_path.parent / 'description.txt').read_text().strip()
        )
        ref_solver_content = extract_function_source(task_py_path, 'solve')
        is_solution_content = extract_function_source(task_py_path, 'is_solution')

        # Get the path to the best solver (can be a file or a directory)
        best_solver_path = self.algotune_data.get_reference_solver(task_name)

        if not all([description_content, ref_solver_content, is_solution_content]):
            logger.error(
                f"Failed to extract necessary source code/text for '{task_name}'. Skipping."
            )
            return False

        # 3. Populate problem_statement.txt
        instruction = get_instruction(
            description_content, ref_solver_content, is_solution_content
        )

        problem_path = output_dir / 'problem_statement.txt'
        problem_path.write_text(instruction)

        # 4. Dynamically generate shell command content for the solver
        sh_commands = ''
        if best_solver_path:
            if best_solver_path.is_dir():
                # Case 1: Best solver is a directory with multiple files.
                files_to_copy = [f for f in best_solver_path.iterdir() if f.is_file()]
                if files_to_copy:
                    command_parts = [
                        f"cat > /workspace/{item.name} << 'EOF'\n{item.read_text(encoding='utf-8')}\nEOF"
                        for item in sorted(files_to_copy)
                    ]
                    if len(command_parts) > 1:
                        command_parts.append('/usr/local/bin/pip install .')
                    sh_commands = '\n\n'.join(command_parts)
            elif best_solver_path.is_file():
                # Case 2: Best solver is a single file.
                content = best_solver_path.read_text(encoding='utf-8')
                sh_commands = (
                    f"cat > /workspace/{best_solver_path.name} << 'EOF'\n{content}\nEOF"
                )

        # Case 3: Fallback if no solver path was found or the directory was empty.
        if not sh_commands:
            logger.warning(
                f"No valid solver found for '{task_name}'. Using reference implementation as fallback."
            )
            fallback_code = prepare_solver_code_from_task_file(task_py_path)
            if not fallback_code:
                logger.error(
                    f"Failed to generate fallback solver for '{task_name}'. Skipping."
                )
                return False
            sh_commands = f"cat > /workspace/solver.py << 'EOF'\n{fallback_code}\nEOF"

        # 5. Populate solution.sh
        solution_sh_path = output_dir / 'solution.sh'
        solution_script_content = f"""#!/bin/bash

{sh_commands}
"""
        solution_sh_path.write_text(solution_script_content)

        # 6. Create evaluator.py
        evaluator_dest_path = output_dir / 'evaluator.py'
        replace_class_name_and_save(task_py_path, evaluator_dest_path, new_name='Task')

        # 7. Get task difficulty
        n_value = self.algotune_data.get_task_difficulty(task_name)

        # 8. Update test_outputs.py
        test_outputs_path = output_dir / 'test_outputs.py'
        test_content = test_outputs_path.read_text()
        test_content = test_content.replace(
            'PROBLEM_SIZE = 100', f'PROBLEM_SIZE = {n_value}'
        )
        test_outputs_path.write_text(test_content)

        # 9. Update run-tests.sh
        run_test_path = output_dir / 'run-tests.sh'
        run_test_content = run_test_path.read_text()
        run_test_content = run_test_content.replace(
            '{test_output_content}', test_content
        )
        run_test_content = run_test_content.replace(
            '{solution_code_command}', sh_commands
        )
        run_test_path.write_text(run_test_content)

        return True

    def generate_task(self, task_name: str, task_py_path: Path) -> None:
        if not task_py_path.is_file() or task_py_path.suffix != '.py':
            logger.warning(f'Skipping non-Python file: {task_py_path}')
            return

        t_bench_task_id = f'algotune-{task_name.lower()}'.replace('_', '-')
        logger.info(f'--- Generating task: {t_bench_task_id} ---')

        output_dir = self.task_dir / t_bench_task_id

        success = self._prepare_task_from_template(output_dir, task_name, task_py_path)
        if success:
            logger.info(f'Successfully generated task at {output_dir}')
        else:
            # Clean up partially generated directory on failure
            if output_dir.exists():
                shutil.rmtree(output_dir)
Lines changed: 111 additions & 0 deletions
@@ -0,0 +1,111 @@
# run_adapter.py

import argparse
import logging
import subprocess
import sys
import tempfile
from pathlib import Path

from adapter import AlgoTuneAdapter

BENCH_ROOT = Path(__file__).resolve().parent.parent
ALGOTUNE_TASK_SUBDIR = 'AlgoTuneTasks'

# Configure logging for this runner script
logging.basicConfig(level=logging.INFO, format='%(name)s - %(levelname)s: %(message)s')
logger = logging.getLogger(__name__)


def clone_repo(repo_url: str, temp_dir: Path) -> Path:
    """Clones the specified repository to a temporary directory."""
    repo_path = temp_dir / 'algotune_repo'
    logger.info(f'Cloning AlgoTune repository from {repo_url}...')

    try:
        # Using --depth 1 for a shallow clone to save time and space
        subprocess.run(
            ['git', 'clone', '--depth', '1', repo_url, str(repo_path)],
            check=True,
            capture_output=True,
            text=True,
        )
        logger.info(f'Successfully cloned repository to {repo_path}')
        return repo_path
    except subprocess.CalledProcessError as e:
        logger.error(f'Failed to clone repository: {e.stderr.strip()}')
        raise


def main():
    parser = argparse.ArgumentParser(
        description='Convert AlgoTune tasks into T-Bench format using the adapter.',
        formatter_class=argparse.RawTextHelpFormatter,
    )
    parser.add_argument(
        '--repo-url',
        type=str,
        default='git@github.com:oripress/AlgoTune.git',
        help='The URL of the Git repository containing AlgoTune tasks.',
    )
    parser.add_argument(
        '--output-path',
        type=str,
        default=str(BENCH_ROOT / 'tasks'),
        help='Output directory for generated T-Bench tasks.',
    )
    args = parser.parse_args()

    output_task_dir = Path(args.output_path)
    output_task_dir.mkdir(parents=True, exist_ok=True)
    logger.info(f'Adapter will output tasks to: {output_task_dir.resolve()}')

    with tempfile.TemporaryDirectory() as temp_dir_str:
        temp_path = Path(temp_dir_str)
        try:
            # 1. Get the source code
            cloned_repo_root = clone_repo(args.repo_url, temp_path)

            # 2. Locate the folder with task files
            source_tasks_dir = cloned_repo_root / ALGOTUNE_TASK_SUBDIR
            if not source_tasks_dir.is_dir():
                logger.error(
                    f"Task subdirectory '{ALGOTUNE_TASK_SUBDIR}' not found in repo."
                )
                sys.exit(1)

            # 3. Initialize the adapter and data loader
            adapter = AlgoTuneAdapter(
                task_dir=output_task_dir, cloned_repo_root=cloned_repo_root
            )

            # 4. Get the list of tasks from the repo's data file
            task_list = adapter.algotune_data.get_task_list()
            logger.info(
                f'Found {len(task_list)} tasks in repository data. Starting generation...'
            )

            # 5. Process each valid task
            for task_name in task_list:
                task_py_path = source_tasks_dir / task_name / f'{task_name}.py'
                if task_py_path.exists():
                    adapter.generate_task(
                        task_name=task_name, task_py_path=task_py_path
                    )
                else:
                    logger.warning(
                        f"Task '{task_name}' found in data files but source not found at: {task_py_path}"
                    )

        except (subprocess.CalledProcessError, FileNotFoundError) as e:
            logger.error(f'Aborting due to a critical error: {e}')
            sys.exit(1)
        except Exception as e:
            logger.error(f'An unexpected error occurred: {e}', exc_info=True)
            sys.exit(1)

    logger.info('Task generation complete.')


if __name__ == '__main__':
    main()
