Short version:
Transformers.js does not currently expose anything like return_offsets_mapping. You will need a separate tokenizer that can give you offsets, then pass only input_ids into transformers.js.
Below is the detailed reasoning plus concrete options, with URLs.
1. What return_offsets_mapping is in Python
In Python, fast tokenizers (PreTrainedTokenizerFast, backed by the Rust tokenizers library) let you do:
```python
enc = tokenizer(
    "Some text",
    return_offsets_mapping=True,
)
print(enc["offset_mapping"])
# [(start0, end0), (start1, end1), ...]
```
Docs (the page you linked):
- Python tokenizer docs:
https://huggingface.co/docs/transformers/en/main_classes/tokenizer
Key points:
- Works only for fast tokenizers (backed by the Rust `tokenizers` library).
- `offset_mapping` is a list of pairs `[start_char, end_char]` into the original string.
- This is exactly what powers the official QA and token-classification examples in Python.
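To make the second point concrete, here is a minimal TypeScript sketch (hand-written example offsets, since the rest of this answer stays on the JS side): each `[start, end)` pair slices a token's exact surface form out of the original, unnormalized string.

```ts
// Hard-coded example values; not produced by a real tokenizer here.
const text = "Some text";
const offsets: [number, number][] = [
  [0, 4], // "Some"
  [5, 9], // "text"
];

// Each [start, end) pair indexes into the ORIGINAL string, untouched
// by lowercasing, accent stripping, or other normalization.
for (const [start, end] of offsets) {
  console.log(text.slice(start, end)); // "Some", then "text"
}
```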
So you are asking for the same thing, but in JavaScript.
2. What transformers.js exposes today
Transformers.js tokenizers are documented here:
- Transformers.js tokenizer API:
https://huggingface.co/docs/transformers.js/en/api/tokenizers
From that page and the general examples:
```js
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/t5-small');
const { input_ids, attention_mask } = await tokenizer('text here');
```
The documented outputs are:
- `input_ids`
- `attention_mask`
- maybe `token_type_ids` for some models
There is no offset_mapping field and no flag like return_offsets_mapping documented.
There is also an external "good first issues" list that includes:
- "huggingface/transformers.js - [Feature request] Return offset mapping using tokenizer"
  https://github.com/drkrillo/good-first-issues
That tells you:
- This capability is not available yet.
- It has been requested as a feature for transformers.js.
So: Xenova's JS tokenizer does not currently give you offset mappings.
3. Why "decode and build a map" is unsafe
Your first idea:
- Tokenize with transformers.js → get `input_ids`.
- Decode those IDs back to a string.
- Align the decoded text with the original, and infer character spans.
The Rust tokenizers maintainers have an issue and a forum thread for this exact pattern:
- GitHub issue "Return_offsets_mapping when decoding":
  https://github.com/huggingface/tokenizers/issues/1769
- HF forum "Return_offsets_mapping when decoding":
  https://huggingface.co/static-proxy/discuss.huggingface.co/t/return-offsets-mapping-when-decoding/152215
Core problems they describe:
- Encoding and decoding are not inverse in general.
  People tried to get offsets by doing:
  - decode `input_ids` to text
  - re-encode that text with `return_offsets_mapping=True`

  and discovered that the new IDs do not always match the original IDs. So any offsets you compute this way can be wrong.
- Normalization destroys information.
  The tokenizer can:
  - lowercase
  - remove accents
  - apply Unicode normalization

  After that, you no longer have a simple position-by-position mapping to the original characters.
- Special markers and whitespace.
  Some tokenizers use special markers for spaces (`▁`, `Ġ`) or collapse multiple spaces. Offsets will jump in non-obvious ways.
Conclusion: a "decode and guess offsets" approach is fine as a rough visualization, but it is not a safe substitute for `offset_mapping` if you need a correct map.
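If you do use the heuristic for rough highlighting, at least detect when it breaks. A minimal sketch with a transformers.js tokenizer (the model name is only an example): decode the IDs, re-encode the decoded string, and compare the two ID sequences; when they diverge for an input, offsets inferred from the decoded text are wrong for that input.

```ts
import { AutoTokenizer } from '@huggingface/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('Xenova/bert-base-uncased');

const text = 'Héllo   WORLD'; // accents, casing, and extra spaces stress the round trip
const { input_ids } = await tokenizer(text);
const originalIds = Array.from(input_ids.data, Number);

// Decode, then re-encode the decoded string.
const decoded = tokenizer.decode(originalIds, { skip_special_tokens: true });
const reEncoded = await tokenizer(decoded);
const roundTripIds = Array.from(reEncoded.input_ids.data, Number);

const matches =
  originalIds.length === roundTripIds.length &&
  originalIds.every((id, i) => id === roundTripIds[i]);
console.log(matches ? 'round trip OK for this input' : 'IDs diverged; inferred offsets unsafe');
```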
4. The robust pattern: separate tokenizer with offsets
The reliable solution is:
Use a tokenizer implementation that already supports offsets (the Rust `tokenizers` library), and then feed only `input_ids` into transformers.js.
You make tokenization + offsets its own subsystem. Transformers.js becomes an inference layer that consumes your input_ids, instead of being your only tokenizer.
4.1. Node: use Hugging Face tokenizers bindings
Official tokenizers repo (Rust + Python + Node):
https://github.com/huggingface/tokenizers
Node bindings expose an Encoding object which has:
- `getIds()` → token IDs
- `getOffsets()` → list of `[start, end]` character positions
See the Encoding docs:
https://huggingface.co/docs/tokenizers/en/api/encoding
To make Node usage easier, there are multi-arch wrappers such as:
- `@anush008/tokenizers` (multi-arch Node 20 bindings):
GitHub: https://github.com/Anush008/tokenizers
From its README:
```js
import { Tokenizer } from "@anush008/tokenizers";

const tokenizer = await Tokenizer.fromFile("tokenizer.json");
const enc = await tokenizer.encode("Who is John?");
console.log(enc.getIds());     // token IDs
console.log(enc.getOffsets()); // [[start, end], ...]
```
So a realistic Node-side helper:
```ts
type Offset = [number, number];

interface EncodedWithOffsets {
  ids: number[];
  offsets: Offset[];
}

// Wraps a Rust-backed tokenizer (e.g. from @anush008/tokenizers) and
// returns token IDs together with their character spans.
async function encodeWithOffsets(
  tokenizer: any,
  text: string,
): Promise<EncodedWithOffsets> {
  const enc = await tokenizer.encode(text);
  return {
    ids: enc.getIds(),
    offsets: enc.getOffsets(),
  };
}
```
Then you:
- Load the same `tokenizer.json` as your model (download it from the model's Hugging Face repo; transformers.js expects the same file).
- Use Rust tokenizers via Node to get `ids` + `offsets`.
- Convert `ids` into tensors and pass them to transformers.js models (see the sketch below).
- Keep `offsets` in your own data structures and use them to map predictions back to the original text.
This gives you exactly what Python's `return_offsets_mapping=True` provides, but from JavaScript.
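And the hand-off into transformers.js, as a sketch (it reuses the `encodeWithOffsets` helper above; the model name and task are only examples, and the exact input names a model expects can vary):

```ts
import { AutoModelForTokenClassification, Tensor } from '@huggingface/transformers';

const model = await AutoModelForTokenClassification.from_pretrained('Xenova/bert-base-NER');

const text = 'Who is John?';
const { ids, offsets } = await encodeWithOffsets(tokenizer, text);

// transformers.js models take int64 tensors of shape [batch, seq_len].
const input_ids = new Tensor('int64', BigInt64Array.from(ids, BigInt), [1, ids.length]);
const attention_mask = new Tensor('int64', BigInt64Array.from(ids, () => 1n), [1, ids.length]);

const { logits } = await model({ input_ids, attention_mask });

// offsets[i] maps the prediction for token i back to the original text:
// the characters text.slice(offsets[i][0], offsets[i][1]).
```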
4.2. Browser-only: WASM or a microservice
If you are strictly in the browser and cannot run Node:
- Tiny tokenizer service (simplest)
  - Backend (Node or Python) exposes `/tokenize` that returns `input_ids` and `offsets` (see the sketch below).
  - Frontend uses transformers.js for inference.
- WASM build of tokenizers in the browser
  - Use a WebAssembly port of Hugging Face tokenizers (there are several community projects).
  - Same API idea: `encode` gives IDs and offsets; pass IDs to transformers.js.
One Reddit / JS ecosystem pattern is exactly this: use HF tokenizers (Node/WASM) plus ONNX or JS runtimes that only consume IDs. Example discussion: "Huggingface tokenizers in javascript for web" describes pairing JS runtimes with external tokenizers.
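Here is a minimal sketch of the tokenizer-service option, reusing `@anush008/tokenizers` from section 4.1 (error handling, validation, and CORS are left out):

```ts
import { createServer } from 'node:http';
import { Tokenizer } from '@anush008/tokenizers';

// Assumes the model's tokenizer.json sits next to this file.
const tokenizer = await Tokenizer.fromFile('tokenizer.json');

createServer(async (req, res) => {
  if (req.method !== 'POST' || req.url !== '/tokenize') {
    res.writeHead(404).end();
    return;
  }
  let body = '';
  for await (const chunk of req) body += chunk;
  const { text } = JSON.parse(body);

  const enc = await tokenizer.encode(text);
  res.writeHead(200, { 'Content-Type': 'application/json' });
  res.end(JSON.stringify({ input_ids: enc.getIds(), offsets: enc.getOffsets() }));
}).listen(3000);
```

The frontend then POSTs text to `/tokenize`, feeds the returned `input_ids` to transformers.js, and keeps `offsets` for highlighting.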
5. What about "using Xenova's tokenizer but with offsets"?
Right now:
- The official transformers.js tokenizer docs show the public API. No offset or mapping field is listed.
- The external "good first issues" list confirms that "Return offset mapping using tokenizer" is a requested feature, not an existing one.
So you cannot enable this with a hidden option. The only ways today are:
- Fork transformers.js and add full offset tracking yourself (complex, easy to get wrong), or
- Use Rust tokenizers separately and feed IDs into transformers.js as described above.
Given that Rust tokenizers are already battle-tested for offsets and used in the official Python APIs and course material, they are the safer choice.
6. If you really do not want Python at all
You said:
> Can you just use Python: Although I could communicate between languages, I'd rather not.
So the order of options, in practice:
- Best JS-friendly option
  - Node environment: use Hugging Face `tokenizers` bindings via a multi-arch wrapper like `@anush008/tokenizers`.
  - Return `ids` + `offsets`.
  - Call transformers.js with those IDs.
- Browser with backend
  - Backend (Node or Python) provides `ids` + `offsets`.
  - Frontend only runs transformers.js models.
- Pure browser
  - Use a WASM build of tokenizers in a Web Worker (see the sketch below).
  - Same pattern: `encode` → `ids` + `offsets`, then transformers.js with `ids`.
- Last-resort heuristics
  - Decode tokens and approximate offsets only for non-critical UI highlighting, accepting that it will be wrong in edge cases.
  - Not suitable if you "really need a map" in the precise sense.
You only need Python if you decide a small Python tokenization service is simpler than wiring Node/WASM. The mapping idea is identical: return_offsets_mapping in Python or getOffsets() in Rust/Node.
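For the pure-browser route, the Web Worker message protocol is the stable part even though the tokenizer package varies. A sketch where the tokenizer import and its `fromUrl`/`encode` API are hypothetical placeholders for whichever WASM port you choose:

```ts
// worker.ts - the './wasm-tokenizer' module below is a hypothetical
// stand-in; substitute the actual WASM/JS tokenizer port you adopt.
import { Tokenizer } from './wasm-tokenizer';

let tokenizer: Tokenizer | null = null;

self.onmessage = async (event: MessageEvent<{ text: string }>) => {
  // Lazy-load the tokenizer on first use, off the main thread.
  tokenizer ??= await Tokenizer.fromUrl('/models/tokenizer.json'); // assumed API
  const enc = await tokenizer.encode(event.data.text);
  // Send ids + offsets back; the main thread feeds ids to transformers.js
  // and keeps offsets to map predictions onto the original text.
  self.postMessage({ ids: enc.getIds(), offsets: enc.getOffsets() });
};
```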
7. Summary
- Python's `return_offsets_mapping` comes from the Rust `tokenizers` library and gives per-token `[start_char, end_char]` spans in the original text.
  Docs: https://huggingface.co/docs/transformers/en/main_classes/tokenizer
- transformers.js tokenizers currently expose `input_ids`, `attention_mask`, etc., but no `offset_mapping` and no `return_offsets_mapping`-style flag.
  Docs: https://huggingface.co/docs/transformers.js/en/api/tokenizers
- There is an external listing for a transformers.js feature request called "Return offset mapping using tokenizer", which confirms it is missing right now.
  Listing: https://github.com/drkrillo/good-first-issues
- Building a map by decoding tokens is not reliable, as shown by the `tokenizers` issue and discussion on "Return_offsets_mapping when decoding".
  Issue: https://github.com/huggingface/tokenizers/issues/1769
  Discussion: https://huggingface.co/static-proxy/discuss.huggingface.co/t/return-offsets-mapping-when-decoding/152215
- The robust solution, without Python, is to use Hugging Face's Rust tokenizers via Node or WASM, get `ids` + `offsets` from there, and feed only `ids` into transformers.js.
  Tokenizers repo: https://github.com/huggingface/tokenizers
  Node wrapper example: https://github.com/Anush008/tokenizers