XLM-RoBERTa-based model with a CRF layer for location and address span extraction. Trained on a weakly labeled dataset (part of Dataset 10 of the Epstein files). Geographical entities in the dataset were labeled with Qwen3 70B, and BIO tags were added automatically. The model still needs testing in the wild.
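
As an illustration of the automatic BIO tagging step, here is a minimal sketch. It is not the actual labeling pipeline, which this card does not describe: the whitespace tokenization, the character-offset span format, the `LOC` label name, and the helper names `tokenize` and `bio_tags` are all assumptions made for the example.

```python
import re


def tokenize(text):
    """Split on whitespace and keep character offsets (assumed tokenization)."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def bio_tags(text, spans, label="LOC"):
    """Assign BIO tags to tokens given entity spans as (start, end) char offsets.

    `label` is a hypothetical tag name; the card does not state the real tag set.
    """
    tokens = tokenize(text)
    tags = ["O"] * len(tokens)
    for start, end in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            # A token belongs to the entity if it overlaps the span at all.
            if tok_start < end and tok_end > start:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return [tok for tok, _, _ in tokens], tags


text = "Flight to Palm Beach, Florida today"
entity = "Palm Beach, Florida"
start = text.index(entity)
words, tags = bio_tags(text, [(start, start + len(entity))])
for word, tag in zip(words, tags):
    print(f"{word}\t{tag}")
```

The first token of each entity gets a `B-` tag and the remaining tokens get `I-` tags, so adjacent entities stay separable after tagging.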
Trained until: Epoch 6 | Loss: 62.9988 (CRF loss) | Token F1: 0.8452 | Binary F1: 0.8170 | Token Acc: 0.9842 | Span Acc: 0.6357 | Partial: 0.7419

- Token F1: based on token matching
- Binary F1: shows performance of geo entity extraction only
- Token Acc: based on token matching
- Span Acc: based on span matching (the whole geo entity must match)
- Partial: based on span matching (at least 50% of the geo entity overlaps correctly)
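
The span-level metrics can be sketched as follows. This is one plausible reading of the definitions, not the evaluation code behind the reported numbers: it assumes scores are computed as the fraction of gold spans matched, that overlap is measured relative to the gold span length, and that tags use a `B-`/`I-`/`O` scheme. The function names `extract_spans` and `span_scores` are hypothetical.

```python
def extract_spans(tags):
    """Return (start, end) token index pairs (end exclusive) for each entity run."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and start is None):
            # A new entity begins; close the previous one if still open.
            if start is not None:
                spans.append((start, i))
            start = i
        elif tag == "O" and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def span_scores(gold_tags, pred_tags, min_overlap=0.5):
    """Return (exact span accuracy, partial span accuracy) over gold spans."""
    gold = extract_spans(gold_tags)
    pred = extract_spans(pred_tags)
    if not gold:
        return 0.0, 0.0
    pred_set = set(pred)
    exact = sum(1 for span in gold if span in pred_set)
    partial = 0
    for g_start, g_end in gold:
        # Largest token overlap between this gold span and any predicted span.
        best = max(
            (min(g_end, p_end) - max(g_start, p_start) for p_start, p_end in pred),
            default=0,
        )
        if best / (g_end - g_start) >= min_overlap:
            partial += 1
    return exact / len(gold), partial / len(gold)


gold = ["O", "B-LOC", "I-LOC", "I-LOC", "O", "B-LOC", "I-LOC"]
pred = ["O", "B-LOC", "I-LOC", "O", "O", "B-LOC", "I-LOC"]
exact_acc, partial_acc = span_scores(gold, pred)
print(exact_acc, partial_acc)
```

In this example the first predicted span misses one token, so it fails the exact match but still counts as a partial match because two of the three gold tokens are covered.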
language:
- en
metrics:
- f1
- accuracy
base_model:
- FacebookAI/xlm-roberta-base
pipeline_tag: token-classification
tags:
- geoparsing
- location
- ner
- informationextraction