Short version:
Transformers.js does not currently expose anything like return_offsets_mapping. You will need a separate tokenizer that can give you offsets, then pass only input_ids into transformers.js.
Below is the detailed reasoning plus concrete options, with URLs.
1. What return_offsets_mapping is in Python
In Python, fast tokenizers (PreTrainedTokenizerFast, backed by the Rust tokenizers library) let you do:
```python
enc = tokenizer(
    "Some text",
    return_offsets_mapping=True,
)
print(enc["offset_mapping"])
# [(start0, end0), (start1, end1), ...]
```
Docs (the page you linked):
- Python tokenizer docs:
https://huggingface.co/docs/transformers/en/main_classes/tokenizer
Key points:
- Works only for fast tokenizers (backed by the Rust `tokenizers` library).
- `offset_mapping` is a list of pairs `[start_char, end_char]` into the original string.
- This is exactly what powers the official QA and token-classification examples in Python.
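To make the second point concrete, here is a minimal TypeScript sketch (hand-written example offsets, since the rest of this answer stays on the JS side): each `[start, end)` pair slices a token's exact surface form out of the original, unnormalized string.

```ts
// Hard-coded example values; not produced by a real tokenizer here.
const text = "Some text";
const offsets: [number, number][] = [
  [0, 4], // "Some"
  [5, 9], // "text"
];

// Each [start, end) pair indexes into the ORIGINAL string, untouched
// by lowercasing, accent stripping, or other normalization.
for (const [start, end] of offsets) {
  console.log(text.slice(start, end)); // "Some", then "text"
}
```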
So you are asking for the same thing, but in JavaScript.
2. What transformers.js exposes today
Transformers.js tokenizers are documented here:
- Transformers.js tokenizer API:
https://huggingface.co/docs/transformers.js/en/api/tokenizers
From that page and the general examples:
```js
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/t5-small');
const { input_ids, attention_mask } = await tokenizer('text here');
```
The documented outputs are:
- `input_ids`
- `attention_mask`
- maybe `token_type_ids` for some models
There is no offset_mapping field and no flag like return_offsets_mapping documented.
There is also an external "good first issues" list that includes:
- "huggingface/transformers.js - [Feature request] Return offset mapping using tokenizer"
  https://github.com/drkrillo/good-first-issues
That tells you:
- This capability is not available yet.
- It has been requested as a feature for transformers.js.
So: Xenova's JS tokenizer does not currently give you offset mappings.
3. Why "decode and build a map" is unsafe
Your first idea:
- Tokenize with transformers.js → get `input_ids`.
- Decode those IDs back to a string.
- Align the decoded text with the original, and infer character spans.
The Rust tokenizers maintainers have an issue and a forum thread for this exact pattern:
- GitHub issue "Return_offsets_mapping when decoding":
  https://github.com/huggingface/tokenizers/issues/1769
- HF forum "Return_offsets_mapping when decoding":
  https://huggingface.co/static-proxy/discuss.huggingface.co/t/return-offsets-mapping-when-decoding/152215
Core problems they describe:
- Encoding and decoding are not inverse in general.
  People tried to get offsets by doing:
  - decode `input_ids` to text
  - re-encode that text with `return_offsets_mapping=True`

  and discovered that the new IDs do not always match the original IDs. So any offsets you compute this way can be wrong.
- Normalization destroys information.
  The tokenizer can:
  - lowercase
  - remove accents
  - apply Unicode normalization

  After that, you no longer have a simple position-by-position mapping to the original characters.
- Special markers and whitespace.
  Some tokenizers use special markers for spaces (`▁`, `Ġ`) or collapse multiple spaces. Offsets will jump in non-obvious ways.
Conclusion: a "decode and guess offsets" approach is fine as a rough visualization, but it is not a safe substitute for `offset_mapping` if you need a correct map.
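If you do use the heuristic for rough highlighting, at least detect when it breaks. A minimal sketch with a transformers.js tokenizer (the model name is only an example): decode the IDs, re-encode the decoded string, and compare the two ID sequences; when they diverge for an input, offsets inferred from the decoded text are wrong for that input.

```ts
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

const text = 'Héllo   WORLD'; // accents, casing, and extra spaces stress the round trip
const { input_ids } = await tokenizer(text);
const originalIds = Array.from(input_ids.data, Number);

// Decode, then re-encode the decoded string.
const decoded = tokenizer.decode(originalIds, { skip_special_tokens: true });
const reEncoded = await tokenizer(decoded);
const roundTripIds = Array.from(reEncoded.input_ids.data, Number);

const matches =
  originalIds.length === roundTripIds.length &&
  originalIds.every((id, i) => id === roundTripIds[i]);
console.log(matches ? 'round trip OK for this input' : 'IDs diverged; inferred offsets unsafe');
```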
4. The robust pattern: separate tokenizer with offsets
The reliable solution is:
Use a tokenizer implementation that already supports offsets (the Rust `tokenizers` library), and then feed only `input_ids` into transformers.js.
You make tokenization + offsets its own subsystem. Transformers.js becomes an inference layer that consumes your input_ids, instead of being your only tokenizer.
4.1. Node: use Hugging Face tokenizers bindings
Official tokenizers repo (Rust + Python + Node):
https://github.com/huggingface/tokenizers
Node bindings expose an Encoding object which has:
- `getIds()` → token IDs
- `getOffsets()` → list of `[start, end]` character positions
See the Encoding docs:
https://huggingface.co/docs/tokenizers/en/api/encoding
To make Node usage easier, there are multi-arch wrappers such as:
- `@anush008/tokenizers` (multi-arch Node 20 bindings):
GitHub: https://github.com/Anush008/tokenizers
From its README:
```js
import { Tokenizer } from "@anush008/tokenizers";

const tokenizer = await Tokenizer.fromFile("tokenizer.json");
const enc = await tokenizer.encode("Who is John?");
console.log(enc.getIds());     // token IDs
console.log(enc.getOffsets()); // [[start, end], ...]
```
So a realistic Node-side helper:
```ts
type Offset = [number, number];

interface EncodedWithOffsets {
  ids: number[];
  offsets: Offset[];
}

// Wraps a Rust-backed tokenizer (e.g. from @anush008/tokenizers) and
// returns token IDs together with their character spans.
async function encodeWithOffsets(
  tokenizer: any,
  text: string,
): Promise<EncodedWithOffsets> {
  const enc = await tokenizer.encode(text);
  return {
    ids: enc.getIds(),
    offsets: enc.getOffsets(),
  };
}
```
Then you:
- Load the same `tokenizer.json` as your model (download it from the model's Hugging Face repo; transformers.js expects the same file).
- Use Rust tokenizers via Node to get `ids` + `offsets`.
- Convert `ids` into tensors and pass them to transformers.js models (see the sketch below).
- Keep `offsets` in your own data structures and use them to map predictions back to the original text.
This gives you exactly what Python's `return_offsets_mapping=True` provides, but from JavaScript.
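And the hand-off into transformers.js, as a sketch (it reuses the `encodeWithOffsets` helper above; the model name and task are only examples, and the exact input names a model expects can vary):

```ts
import { AutoModelForTokenClassification, Tensor } from '@huggingface/transformers';

const model = await AutoModelForTokenClassification.from_pretrained('Xenova/bert-base-NER');

const text = 'Who is John?';
const { ids, offsets } = await encodeWithOffsets(tokenizer, text);

// transformers.js models take int64 tensors of shape [batch, seq_len].
const input_ids = new Tensor('int64', BigInt64Array.from(ids, BigInt), [1, ids.length]);
const attention_mask = new Tensor('int64', BigInt64Array.from(ids, () => 1n), [1, ids.length]);

const { logits } = await model({ input_ids, attention_mask });

// offsets[i] maps the prediction for token i back to the original text:
// the characters text.slice(offsets[i][0], offsets[i][1]).
```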
4.2. Browser-only: WASM or a microservice
If you are strictly in the browser and cannot run Node:
- Tiny tokenizer service (simplest)
  - Backend (Node or Python) exposes `/tokenize` that returns `input_ids` and `offsets` (see the sketch below).
  - Frontend uses transformers.js for inference.
- WASM build of tokenizers in the browser
  - Use a WebAssembly port of Hugging Face tokenizers (there are several community projects).
  - Same API idea: `encode` gives IDs and offsets; pass IDs to transformers.js.
One Reddit / JS ecosystem pattern is exactly this: use HF tokenizers (Node/WASM) plus ONNX or JS runtimes that only consume IDs. Example discussion: "Huggingface tokenizers in javascript for web" describes pairing JS runtimes with external tokenizers.
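Here is a minimal sketch of the tokenizer-service option, reusing `@anush008/tokenizers` from section 4.1 (error handling, validation, and CORS are left out):

```ts
import { createServer } from 'node:http';
import { Tokenizer } from '@anush008/tokenizers';

// Assumes the model's tokenizer.json sits next to this file.
const tokenizer = await Tokenizer.fromFile('tokenizer.json');

createServer(async (req, res) => {
  if (req.method !== 'POST' || req.url !== '/tokenize') {
    res.writeHead(404).end();
    return;
  }
  let body = '';
  for await (const chunk of req) body += chunk;
  const { text } = JSON.parse(body);

  const enc = await tokenizer.encode(text);
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ input_ids: enc.getIds(), offsets: enc.getOffsets() }));
}).listen(3000);
```

The frontend then POSTs text to `/tokenize`, feeds the returned `input_ids` to transformers.js, and keeps `offsets` for highlighting.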
5. What about "using Xenova's tokenizer but with offsets"?
Right now:
- The official transformers.js tokenizer docs show the public API. No offset or mapping field is listed.
- The external "good first issues" list confirms that "Return offset mapping using tokenizer" is a requested feature, not an existing one.
So you cannot enable this with a hidden option. The only ways today are:
- Fork transformers.js and add full offset tracking yourself (complex, easy to get wrong), or
- Use Rust tokenizers separately and feed IDs into transformers.js as described above.
Given that Rust tokenizers are already battle-tested for offsets and used in the official Python APIs and course material, they are the safer choice.
6. If you really do not want Python at all
You said:
> Can you just use Python: Although I could communicate between languages, I'd rather not.
So the order of options, in practice:
- Best JS-friendly option
  - Node environment: use Hugging Face `tokenizers` bindings via a multi-arch wrapper like `@anush008/tokenizers`.
  - Return `ids` + `offsets`.
  - Call transformers.js with those IDs.
- Browser with backend
  - Backend (Node or Python) provides `ids` + `offsets`.
  - Frontend only runs transformers.js models.
- Pure browser
  - Use a WASM build of tokenizers in a Web Worker (see the sketch below).
  - Same pattern: `encode` → `ids` + `offsets`, then transformers.js with `ids`.
- Last-resort heuristics
  - Decode tokens and approximate offsets only for non-critical UI highlighting, accepting that it will be wrong in edge cases.
  - Not suitable if you "really need a map" in the precise sense.
You only need Python if you decide a small Python tokenization service is simpler than wiring Node/WASM. The mapping idea is identical: return_offsets_mapping in Python or getOffsets() in Rust/Node.
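For the pure-browser route, the Web Worker message protocol is the stable part even though the tokenizer package varies. A sketch where the tokenizer import and its `fromUrl`/`encode` API are hypothetical placeholders for whichever WASM port you choose:

```ts
// worker.ts - the './wasm-tokenizer' module below is a hypothetical
// stand-in; substitute the actual WASM/JS tokenizer port you adopt.
import { Tokenizer } from './wasm-tokenizer';

let tokenizer: Tokenizer | null = null;

self.onmessage = async (event: MessageEvent<{ text: string }>) => {
  // Lazy-load the tokenizer on first use, off the main thread.
  tokenizer ??= await Tokenizer.fromUrl('/models/tokenizer.json'); // assumed API
  const enc = await tokenizer.encode(event.data.text);
  // Send ids + offsets back; the main thread feeds ids to transformers.js
  // and keeps offsets to map predictions onto the original text.
  self.postMessage({ ids: enc.getIds(), offsets: enc.getOffsets() });
};
```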
7. Summary
- Python's `return_offsets_mapping` comes from the Rust `tokenizers` library and gives per-token `[start_char, end_char]` spans in the original text.
  Docs: https://huggingface.co/docs/transformers/en/main_classes/tokenizer
- transformers.js tokenizers currently expose `input_ids`, `attention_mask`, etc., but no `offset_mapping` and no `return_offsets_mapping`-style flag.
  Docs: https://huggingface.co/docs/transformers.js/en/api/tokenizers
- There is an external listing for a transformers.js feature request called "Return offset mapping using tokenizer", which confirms it is missing right now.
  Listing: https://github.com/drkrillo/good-first-issues
- Building a map by decoding tokens is not reliable, as shown by the `tokenizers` issue and discussion on "Return_offsets_mapping when decoding".
  Issue: https://github.com/huggingface/tokenizers/issues/1769
  Discussion: https://huggingface.co/static-proxy/discuss.huggingface.co/t/return-offsets-mapping-when-decoding/152215
- The robust solution, without Python, is to use Hugging Face's Rust tokenizers via Node or WASM, get `ids` + `offsets` from there, and feed only `ids` into transformers.js.
  Tokenizers repo: https://github.com/huggingface/tokenizers
  Node wrapper example: https://github.com/Anush008/tokenizers