---
title: SearchAgent_Leaderboard
emoji: 🥇
colorFrom: green
colorTo: indigo
sdk: gradio
app_file: app.py
pinned: true
license: apache-2.0
short_description: A standardized leaderboard for search agents
sdk_version: 4.23.0
tags:
- leaderboard
---

# Overview

SearchAgent Leaderboard provides a simple, standardized way to compare search-augmented QA agents across:
- General QA: NQ, TriviaQA, PopQA
- Multi-hop QA: HotpotQA, 2wiki, Musique, Bamboogle
- Novel closed-world: FictionalHot

We display a minimal set of columns for clarity:
- Rank, Model, Average, per-dataset scores, Model Size (3B/7B)

# Data format (results)

Place each model's result file in `eval-results/` as JSON. Scores are decimals in [0, 1]; the UI multiplies them by 100 for display.

```json
{
  "config": {
    "model_dtype": "torch.float16",
    "model_name": "YourMethod-Qwen2.5-7b-Instruct",
    "model_sha": "main"
  },
  "results": {
    "nq": { "exact_match": 0.469 },
    "triviaqa": { "exact_match": 0.640 },
    "popqa": { "exact_match": 0.501 },
    "hotpotqa": { "exact_match": 0.389 },
    "2wiki": { "exact_match": 0.382 },
    "musique": { "exact_match": 0.185 },
    "bamboogle": { "exact_match": 0.392 },
    "fictionalhot": { "exact_match": 0.061 }
  }
}
```

Notes:
- `model_name` uses the format `Method-Qwen2.5-{3b|7b}-Instruct` (no org prefix required)
- Tasks: `nq`, `triviaqa`, `popqa`, `hotpotqa`, `2wiki`, `musique`, `bamboogle`, `fictionalhot`
- Metric key: `exact_match`

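Before submitting, it can help to check a result file against the rules above. The following is a minimal sketch, not the leaderboard's own validation: the names `validate_result` and `model_size` are hypothetical, and the regular expression is an assumed, case-sensitive reading of the `Method-Qwen2.5-{3b|7b}-Instruct` format.

```python
import re

# Task and metric keys as listed in the notes above.
TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]
METRIC = "exact_match"

# Assumption: a strict, case-sensitive reading of Method-Qwen2.5-{3b|7b}-Instruct.
MODEL_NAME_RE = re.compile(r"^(?P<method>.+)-Qwen2\.5-(?P<size>3b|7b)-Instruct$")


def validate_result(data: dict) -> list:
    """Return a list of problems; an empty list means the file looks well-formed."""
    problems = []
    name = data.get("config", {}).get("model_name", "")
    if not MODEL_NAME_RE.match(name):
        problems.append(f"unexpected model_name: {name!r}")
    results = data.get("results", {})
    for task in TASKS:
        entry = results.get(task)
        score = entry.get(METRIC) if isinstance(entry, dict) else None
        if isinstance(score, bool) or not isinstance(score, (int, float)):
            problems.append(f"{task}: missing or non-numeric {METRIC}")
        elif not 0.0 <= score <= 1.0:
            problems.append(f"{task}: {METRIC}={score} is outside [0, 1]")
    return problems


def model_size(model_name: str):
    """Extract the size label (e.g. '7B') from a model name, or None if it does not match."""
    match = MODEL_NAME_RE.match(model_name)
    return match.group("size").upper() if match else None
```

A common mistake this catches is reporting percentages (for example `46.9`) instead of decimals (`0.469`).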
# Submission (via Community)

We accept submissions via the Space Community (Discussions):
1) Open the Space page and go to Community: `https://huggingface.co/spaces/TencentBAC/SearchAgent_Leaderboard`
2) Create a discussion with title `Submission: <YourMethod>-<model_name>-<model_size>`
3) Include:
   - Model weights link (HF or GitHub)
   - Short method description
   - Evaluation JSON (inline or attached)

# Local development

Run locally (example):
```bash
python app.py
```

The app reads local data only (no remote download) from:
- Results: `./eval-results`
- (Optional) Requests: `./eval-queue` (not required for the simplified table)

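To preview roughly what the table will contain without starting the app, the result files can be read directly. This is a sketch under stated assumptions, not the app's actual loader (that logic lives in `src/leaderboard/read_evals.py` and `src/populate.py`): the function name `load_rows` is hypothetical, Average is assumed to be the unweighted mean of the eight task scores, and files are assumed to follow the JSON format described above.

```python
import json
from pathlib import Path

TASKS = ["nq", "triviaqa", "popqa", "hotpotqa",
         "2wiki", "musique", "bamboogle", "fictionalhot"]


def load_rows(results_dir: str = "eval-results") -> list:
    """Read every JSON result file under results_dir and return rows ranked by average."""
    rows = []
    for path in sorted(Path(results_dir).rglob("*.json")):
        data = json.loads(path.read_text(encoding="utf-8"))
        # Scores are stored as decimals in [0, 1]; scale to percentages as the UI does.
        scores = {t: data["results"][t]["exact_match"] * 100 for t in TASKS}
        # Assumption: Average is the unweighted mean over all tasks.
        average = sum(scores.values()) / len(scores)
        rows.append({"Model": data["config"]["model_name"],
                     "Average": average, **scores})
    rows.sort(key=lambda row: row["Average"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["Rank"] = rank
    return rows


if __name__ == "__main__":
    for row in load_rows():
        print(f'{row["Rank"]:>2}  {row["Model"]:<40} {row["Average"]:.1f}')
```

Ties and missing tasks are not handled here; a file that lacks any of the eight tasks raises a `KeyError`.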
If dependencies are missing, install the minimal set:
```bash
pip install gradio gradio_leaderboard pandas huggingface_hub apscheduler
```

# Customize

- Tasks and page texts: `src/about.py`
- Displayed columns: `src/display/utils.py` (we keep Rank, Model, Average, per-dataset, Model Size)
- Custom model links (name→URL mapping): `src/display/formatting.py` (`custom_links` dict)
- Data loading and ranking: `src/leaderboard/read_evals.py`, `src/populate.py`

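As an illustration of the name→URL mapping, here is a hypothetical sketch of how a `custom_links` dict could be used. The entry, the URL, the helper name `model_link`, and the Markdown link format are all assumptions; check `src/display/formatting.py` for the real structure before editing it.

```python
# Hypothetical entry; the real dict lives in src/display/formatting.py.
custom_links = {
    "YourMethod-Qwen2.5-7b-Instruct": "https://example.com/your-method",
}


def model_link(model_name: str) -> str:
    """Return a Markdown link if a custom URL is registered, else the plain name."""
    url = custom_links.get(model_name)
    return f"[{model_name}]({url})" if url else model_name
```

The key is assumed to match the `model_name` field in the result JSON exactly.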
Restart the app after changes.