Declarative Experimentation in Information Retrieval using PyTerrier
Paper: arXiv:2007.14271
How to use macavaney/doc2query-t5-base-msmarco with Transformers:
# Use a pipeline as a high-level helper
# Warning: Pipeline type "translation" is no longer supported in transformers v5.
# You must load the model directly (see below) or downgrade to v4.x with:
# pip install "transformers<5.0.0"
from transformers import pipeline
pipe = pipeline("translation", model="macavaney/doc2query-t5-base-msmarco")
# Load model directly
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
tokenizer = AutoTokenizer.from_pretrained("macavaney/doc2query-t5-base-msmarco")
model = AutoModelForSeq2SeqLM.from_pretrained("macavaney/doc2query-t5-base-msmarco")
A Doc2Query model based on t5-base and trained on MS MARCO. This is a version of the checkpoint released by the original authors, converted to PyTorch format and ready for use with pyterrier_doc2query.
Creating a transformer
import pyterrier as pt
pt.init()
from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco')
Transforming documents
import pandas as pd
doc2query(pd.DataFrame([
    {'docno': '0', 'text': 'Hello Terrier!'},
    {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
]))
# docno  text                                               querygen
# 0      Hello Terrier!                                     hello terrier what kind of dog is a terrier wh...
# 1      Doc2Query expands queries with potentially rel...  can dodoc2query extend query query? what is do...
Indexing transformed documents
doc2query.append = True # append querygen to text
indexer = pt.IterDictIndexer('./my_index', fields=['text'])
pipeline = doc2query >> indexer
pipeline.index([
    {'docno': '0', 'text': 'Hello Terrier!'},
    {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
])
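To make the effect of `append = True` more concrete, here is a minimal plain-Python sketch of the kind of concatenation it implies: the generated queries are added to the end of the document text, so the indexer sees both. This is an illustration only, not the pyterrier_doc2query implementation. The helper name `append_querygen` and the single-space separator are assumptions, and the sketch leaves the `querygen` field in place without claiming the library does the same.

```python
def append_querygen(doc, text_field="text", query_field="querygen"):
    """Return a copy of `doc` with the generated queries appended to its text.

    Hypothetical helper for illustration; the separator (a single space)
    is an assumption, not taken from pyterrier_doc2query.
    """
    expanded = dict(doc)  # shallow copy so the input dict is not mutated
    queries = expanded.get(query_field, "")
    if queries:
        expanded[text_field] = f"{expanded[text_field]} {queries}"
    return expanded


doc = {"docno": "0", "text": "Hello Terrier!", "querygen": "hello terrier"}
expanded_doc = append_querygen(doc)
print(expanded_doc["text"])
```

Because the expanded text is what gets indexed, a query that matches only the generated terms can still retrieve the document.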
Expanding and indexing a dataset
dataset = pt.get_dataset('irds:vaswani')
pipeline.index(dataset.get_corpus_iter())
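Query generation is run for every document, so expanding a full corpus can take a while. For a quick trial run it may help to cap the number of documents drawn from the corpus iterator. The sketch below shows the idea with `itertools.islice`; `fake_corpus_iter` is a hypothetical stand-in that yields `docno`/`text` dicts so that the example runs without downloading a dataset, and `take` is an illustrative helper name.

```python
from itertools import islice


def fake_corpus_iter(n_docs=1000):
    """Hypothetical stand-in for a corpus iterator: yields docno/text dicts."""
    for i in range(n_docs):
        yield {"docno": str(i), "text": f"document {i}"}


def take(corpus_iter, limit):
    """Lazily yield at most `limit` documents from `corpus_iter`."""
    return islice(corpus_iter, limit)


sample = list(take(fake_corpus_iter(), 5))
print([d["docno"] for d in sample])
```

Assuming the indexing pipeline accepts any iterable of dicts, the same helper could wrap the real iterator, for example `pipeline.index(take(dataset.get_corpus_iter(), 100))`.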