Update data/leaderboard_json/afrobench_lite.json
Add N-ATLAS-LLM AfroBench-LITE scores to leaderboard
This PR adds N-ATLAS-LLM results from an AfroBench-LITE evaluation run to the leaderboard.
Scores added (N-ATLAS-LLM)
- NLI (AfriXNLI): 47.7
- Intent (InjongoIntent): 70.0
- MT (FLORES en→xx): 50.1
- MMLU (AfriMMLU): 37.8
- Math (AfriMGSM): 40.4
- Topic (SIB): 80.5
- RC (Belebele): 51.4
Reference
Thank you for the contribution. This is a timely evaluation.
Please update:
https://huggingface.co/spaces/McGill-NLP/AfroBench/blob/main/data/leaderboard_json/lite_language_scores.json - with the per-language average across the tasks
https://huggingface.co/spaces/McGill-NLP/AfroBench/blob/main/data/community_results/New%20Results%20-%20June2025.csv - with the full results for each task
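For reference, here is a rough sketch (not the exact script we use) of how the per-language averages could be computed from the full results and merged into lite_language_scores.json. The column names (`model`, `task`, `language`, `score`) and the JSON layout are assumptions; adapt them to the actual files:

```python
# Sketch only: assumes the full-results CSV has columns model, task, language, score,
# and that lite_language_scores.json maps language -> {model: average score}.
import json
import pandas as pd

results = pd.read_csv("New Results - June2025.csv")
natlas = results[results["model"] == "N-ATLAS-LLM"]

# Per-language average across all AfroBench-LITE tasks
lang_avg = natlas.groupby("language")["score"].mean().round(1)

with open("data/leaderboard_json/lite_language_scores.json") as f:
    leaderboard = json.load(f)

for lang, score in lang_avg.items():
    leaderboard.setdefault(lang, {})["N-ATLAS-LLM"] = float(score)

with open("data/leaderboard_json/lite_language_scores.json", "w") as f:
    json.dump(leaderboard, f, indent=2, ensure_ascii=False)
```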
I believe the results on the leaderboard are the zero-shot results?
Also, what does the prompt value in the New%20Results%20-%20June2025.csv file represent?
Yes, the results on the leaderboard are zero-shot.
We prompted the models with multiple prompts for each task and report the best-performing prompt in the leaderboard. The prompt value indicates which prompt produced the reported scores; the values map back to the prompts specified in the Appendix of the paper.
I see, what I did was aggregate the results across the prompts. How do you then choose which result to present: the best-performing prompt?
We pick the prompt with the highest average across languages per task.
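Roughly, the selection works like this (again assuming columns `task`, `prompt`, `language`, `score` in the full-results CSV; this is a sketch, not the exact script):

```python
# Sketch of the selection rule: for each task, keep the prompt whose mean score
# across languages is highest, then report that prompt's per-language scores.
import pandas as pd

results = pd.read_csv("New Results - June2025.csv")

prompt_means = (
    results.groupby(["task", "prompt"])["score"].mean().reset_index()
)
best = prompt_means.loc[
    prompt_means.groupby("task")["score"].idxmax(), ["task", "prompt"]
]

# Per-language scores of the winning prompt for each task
reported = results.merge(best, on=["task", "prompt"])
```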
Hey Seun, thank you for updating the pull request. At the moment, though, we can’t merge the results because the evaluation only covers 3 of the 14 AfroBench languages (plus English). Since the leaderboard compares models across the full 14-language set, partial results aren’t directly comparable and would distort the ranking.
I understand that the model is specialized for Nigerian languages, but for inclusion on the AfroBench leaderboard we need results across all 14 languages, even if the model performs poorly on the remaining ones. This is the same requirement applied to all models so that the relative scores remain meaningful.
If you can run the model across all 14 languages, we’d be happy to include the results in the leaderboard.
I understand. The resources required to run the evaluation across all 14 languages are a concern, but I will still look into the feasibility.