Training Data

by jo-mengr - opened Jul 30, 2025

Jul 30, 2025

Hi!
Love the model and am working with it for my PhD. Would it be possible for you to share the training dataset? I would like to train a ModernBERT model with a larger context window with the same objective.
Thanks!
Jonatan

davidmezzetti

NeuML org Jul 30, 2025

Thank you, I appreciate it!

The dataset is just a random sample of PubMed title/abstract pairs. In addition, for each randomly selected article, a similar title is found. I don't think it's hard to reproduce, and it could probably even be improved upon with good dataset engineering, analysis and parameter tuning. PaperETL can handle all the PubMed article processing.
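
For anyone wanting to try this, here is a rough sketch of the two steps described above: take a random sample of articles, then look up a similar title for each one. It is not the actual pipeline. It assumes the articles have already been parsed (for example with PaperETL) into plain dicts with `title` and `abstract` keys, which is a made-up record shape. The thread doesn't say how the similar title is found, so word-overlap (Jaccard) similarity is used purely as a stand-in, and `build_examples` is a hypothetical helper name.

```python
import random


def tokenize(text):
    """Lowercase and split on whitespace into a set of words."""
    return set(text.lower().split())


def jaccard(a, b):
    """Word-overlap similarity between two token sets, in [0, 1]."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_examples(articles, sample_size, seed=42):
    """Sample articles at random and attach the most similar other title.

    ASSUMPTION: `articles` is a list of dicts with "title" and "abstract"
    keys. Jaccard overlap is only a placeholder for whatever similarity
    method was really used to find similar titles.
    """
    rng = random.Random(seed)
    sample = rng.sample(articles, min(sample_size, len(articles)))

    # Tokenize every title once so each lookup is a simple linear scan.
    tokenized = [(article, tokenize(article["title"])) for article in articles]

    examples = []
    for article in sample:
        query = tokenize(article["title"])
        candidates = [
            (jaccard(query, tokens), other["title"])
            for other, tokens in tokenized
            if other is not article
        ]
        best = max(candidates, key=lambda c: c[0]) if candidates else None
        examples.append(
            {
                "title": article["title"],
                "abstract": article["abstract"],
                "similar_title": best[1] if best else None,
            }
        )
    return examples
```

The linear scan is fine for a small sample but won't scale to all of PubMed. At that size an embeddings index or other approximate nearest-neighbor search would be the obvious replacement for the `jaccard` loop.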

There is also another model that uses a ModernBERT fine-tuned model as the base: https://huggingface.co/NeuML/bioclinical-modernbert-base-embeddings

jo-mengr

Jul 30, 2025

Perfect! In that case I'll just use that model instead.
Thanks!

jo-mengr changed discussion status to closed Jul 30, 2025
