Could you please disclose the full list of training data for embedding supervised finetuning?

#32
by kwang2049 - opened

Although there is a general mention in the paper about it as

"Dataset with annotated negatives: We have prepared retrieval datasets, such as MSMarco [Bajaj et al., 2016] and Natural Questions (NQ) [Kwiatkowski et al., 2019], in addition to multiple non-retrieval datasets like the Natural Language Inference (NLI) dataset [Bowman et al., 2015]."

Could you please disclose the full list of dataset names? This is very important for research that uses Jina or builds on it. Thanks in advance.

Hi @kwang2049, yes, we used:

  1. SNLI data from SimCSE (https://github.com/princeton-nlp/SimCSE#training), with 1 hard negative plus random negatives.
  2. MS MARCO, NQ, Quora-QA, HotpotQA, and FEVER, with mined hard negatives.
  3. CC News title-description pairs with random negatives: https://huggingface.co/datasets/cc_news

Each row consists of 17 items: 1 anchor, 1 positive, and 15 negatives.
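The 17-item row layout described above can be sketched as follows. This is a minimal illustration, not the actual dataset schema: the field order (anchor first, then positive, then negatives) and the example texts are assumptions.

```python
def split_row(row):
    """Split a 17-item training row into (anchor, positive, negatives).

    Assumes the layout: [anchor, positive, neg_1, ..., neg_15].
    """
    if len(row) != 17:
        raise ValueError(f"expected 17 items, got {len(row)}")
    anchor, positive, *negatives = row
    return anchor, positive, negatives


# Hypothetical example row: one query, one relevant passage,
# and 15 negative passages (1 hard negative + 14 random ones).
row = (
    ["what is jina?"]
    + ["Jina AI builds neural search and embedding models."]
    + [f"unrelated passage {i}" for i in range(15)]
)
anchor, positive, negatives = split_row(row)
```

With this layout, each row maps directly onto a contrastive loss such as InfoNCE, where the positive is scored against the 15 in-row negatives.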

Thanks❤️!

kwang2049 changed discussion status to closed
