According to the `tokenizers` library, `Tokenizer.decoder` is an optional property, but `PreTrainedTokenizerFast.convert_tokens_to_string` (in `src/transformers/tokenization_utils_fast.py`) never checks whether `decoder` is `None`. Some tokenizers (such as those trained with `WordLevelTrainer`) have no decoder, and this breaks projects like TGI and Outlines, which rely on the `convert_tokens_to_string` method.
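Here is a minimal sketch that reproduces the issue, assuming a freshly trained WordLevel tokenizer (the toy corpus and special token are just placeholders):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

# Train a tiny WordLevel tokenizer; WordLevelTrainer does not set a decoder.
backend = Tokenizer(WordLevel(unk_token="[UNK]"))
backend.pre_tokenizer = Whitespace()
backend.train_from_iterator(["hello world"], WordLevelTrainer(special_tokens=["[UNK]"]))

fast = PreTrainedTokenizerFast(tokenizer_object=backend)
print(fast.backend_tokenizer.decoder)  # None

# convert_tokens_to_string calls self.backend_tokenizer.decoder.decode(tokens),
# so this raises AttributeError: 'NoneType' object has no attribute 'decode'.
fast.convert_tokens_to_string(["hello", "world"])
```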
What is the correct approach here? Should I convert the fast tokenizer to a slow one, or should I open a PR that checks whether `decoder` is `None` and, if so, falls back to a simple join of the tokens?
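For concreteness, the fallback I have in mind would look something like this (a sketch, not the actual transformers code; joining with `" "` is my assumption, and the right separator is itself debatable):

```python
from typing import List

def convert_tokens_to_string(self, tokens: List[str]) -> str:
    decoder = self.backend_tokenizer.decoder
    if decoder is None:
        # No decoder configured (e.g. a WordLevel tokenizer):
        # fall back to a plain whitespace join instead of crashing.
        return " ".join(tokens)
    return decoder.decode(tokens)
```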