Update data/leaderboard_json/afrobench_lite.json
Add N-ATLAS-LLM AfroBench-LITE scores to leaderboard
This PR adds N-ATLAS-LLM results from an AfroBench-LITE evaluation run to the leaderboard.
Scores added (N-ATLAS-LLM)
- NLI (AfriXNLI): 47.7
- Intent (InjongoIntent): 70.0
- MT (FLORES en→xx): 50.1
- MMLU (AfriMMLU): 37.8
- Math (AfriMGSM): 40.4
- Topic (SIB): 80.5
- RC (Belebele): 51.4
Reference
Thank you for the contribution. This is a timely evaluation.
Please update:
https://huggingface.co/spaces/McGill-NLP/AfroBench/blob/main/data/leaderboard_json/lite_language_scores.json - with the per-language average across the tasks
https://huggingface.co/spaces/McGill-NLP/AfroBench/blob/main/data/community_results/New%20Results%20-%20June2025.csv - with the full results for each task
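For reference, here is a rough sketch (not the exact script we use) of how the per-language averages could be computed from the full results and merged into lite_language_scores.json. The column names (`model`, `task`, `language`, `score`) and the JSON layout are assumptions; adapt them to the actual files:

```python
# Sketch only: assumes the full-results CSV has columns model, task, language, score,
# and that lite_language_scores.json maps language -> {model: average score}.
import json
import pandas as pd

results = pd.read_csv("New Results - June2025.csv")
natlas = results[results["model"] == "N-ATLAS-LLM"]

# Per-language average across all AfroBench-LITE tasks
lang_avg = natlas.groupby("language")["score"].mean().round(1)

with open("data/leaderboard_json/lite_language_scores.json") as f:
    leaderboard = json.load(f)

for lang, score in lang_avg.items():
    leaderboard.setdefault(lang, {})["N-ATLAS-LLM"] = float(score)

with open("data/leaderboard_json/lite_language_scores.json", "w") as f:
    json.dump(leaderboard, f, indent=2, ensure_ascii=False)
```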
I believe the results on the leaderboard are the zero-shot results?
Also, what does the prompt value in the New%20Results%20-%20June2025.csv file represent?
Yes, the results on the leaderboard are zero-shot.
We prompted the models with multiple prompts for each task and report the best-performing prompt in the leaderboard. The prompt value indicates which prompt produced the reported scores; the values map back to the prompts specified in the Appendix of the paper.
I see, what I did was aggregate the results across the prompts. How do you then choose which result to present: the best-performing prompt?
We pick the prompt with the highest average across languages per task.
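Roughly, the selection works like this (again assuming columns `task`, `prompt`, `language`, `score` in the full-results CSV; this is a sketch, not the exact script):

```python
# Sketch of the selection rule: for each task, keep the prompt whose mean score
# across languages is highest, then report that prompt's per-language scores.
import pandas as pd

results = pd.read_csv("New Results - June2025.csv")

prompt_means = (
    results.groupby(["task", "prompt"])["score"].mean().reset_index()
)
best = prompt_means.loc[
    prompt_means.groupby("task")["score"].idxmax(), ["task", "prompt"]
]

# Per-language scores of the winning prompt for each task
reported = results.merge(best, on=["task", "prompt"])
```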
Hey Seun, thank you for updating the pull request. At the moment, though, we can’t merge the results because the evaluation only covers 3 of the 14 AfroBench languages (plus English). Since the leaderboard compares models across the full 14-language set, partial results aren’t directly comparable and would distort the ranking.
I understand that the model is specialized for Nigerian languages, but for inclusion on the AfroBench leaderboard we need results across all 14 languages, even if the model performs poorly on the remaining ones. This is the same requirement applied to all models so that the relative scores remain meaningful.
If you can run the model across all 14 languages, we’d be happy to include the results in the leaderboard.
I understand. The resources required to run the evaluation across all 14 languages are a concern, but I will still look into the feasibility.