ID Token Nicer
A nicer tokenizer
Agents, IDs, and a tiny repo of utilities that makes them less painful
If you’ve built LLM agents that call tools or reference source materials, you’ve probably hit the “identifiers problem”: UUIDs and numeric IDs are everywhere, and agents are weirdly bad at handling them reliably — (shit em-dash, well I did pass the content through an LLM for feedback, but I am a human) — cheaper/faster models especially struggle with this.
So I bundled a set of practical fixes into a small repo: id-tokenizer (in id-token-nicer). It’s a grab-bag of utilities for turning UUIDs and integer IDs into LLM-friendlier forms, plus some extra modes for validation, pattern hiding, and “UUID-shaped” public IDs.
The headache
Two failure modes show up constantly:
1) UUIDs get “almost” copied
An agent sees a UUID like:
550e8400-e29b-41d4-a716-446655440000
…and later reproduces something that’s UUID-shaped but wrong, maybe one character off. It looks valid. It points at nothing. This seems to be extra common with structured output, where you force the agent to provide a UUID even in cases where no ID exists (like if it came up empty-handed on a tool call, but your instruction demanded an ID).
2) Integers get invented
If the model sees a handful of integer IDs in a tight range, it’ll happily output other numbers in that range. Again: plausible, wrong.
I’m not claiming a universal explanation here, but in practice both UUIDs and plain digit strings are dense, semantically empty, and easy for models to mutate while still “looking right”.
What I’m solving for (and what I’m not)
What we can solve for:
Make IDs harder to accidentally mutate
Add ways to detect outright fabrication (typos, invented words, invented numbers)
What we can’t solve for (without application logic):
An agent copying a valid ID into the wrong context
An agent reusing a valid-looking ID it has seen before, where it shouldn’t
So the goal here is not “perfect truth.” It’s to reduce the worst failure mode: silent corruption.
The core idea: stop feeding the model raw hex
A v4 UUID is 36 characters of hex + dashes, but the useful information content is only 122 bits. So the repo offers an alternative representation: encode the payload into a fixed number of words from a controlled vocabulary. Let’s use a vocabulary that minimizes the number of tokens — and has other interesting properties.
Example:
550e8400-e29b-41d4-a716-446655440000
--> all-ecize-vejovis-minos-abb-allseed-heretic-signum-archhead
Instead of copying a long hex string, the model is dealing with discrete “units” (words). If it mutates one, the checksum is likely to catch it.
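To make the idea concrete, here is a toy sketch of that encode/decode scheme. The vocabulary, chunk size, and checksum are all made up for illustration (the repo uses curated wordlists); the point is just that a mutated word trips the checksum instead of silently pointing at nothing.

```python
import uuid

# Toy word encoding: split the UUID's 128-bit integer into fixed-width
# chunks, map each chunk to a word, and append a checksum word so a
# single-word mutation is caught on decode.
BITS = 12
VOCAB = [f"w{i:04d}" for i in range(2**BITS)]   # toy 4096-word list
N_WORDS = -(-128 // BITS)                       # ceil(128 / BITS) = 11

def encode(u: uuid.UUID) -> str:
    n = u.int
    idxs = [(n >> (BITS * i)) & (2**BITS - 1) for i in range(N_WORDS)]
    checksum = sum(idxs) % len(VOCAB)
    return "-".join(VOCAB[i] for i in idxs + [checksum])

def decode(s: str) -> uuid.UUID:
    *idxs, check = [VOCAB.index(w) for w in s.split("-")]
    if check != sum(idxs) % len(VOCAB):
        raise ValueError("checksum mismatch: likely a mutated ID")
    n = 0
    for i in reversed(idxs):
        n = (n << BITS) | i
    return uuid.UUID(int=n)
```

With this toy checksum, changing any single word (other than to itself) shifts the running sum, so the decode raises instead of returning a plausible-but-wrong UUID.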
There are two wordlist styles:
memorable: more human-friendly words
token: words chosen so each word tends to be a single token in common tokenizers, making cost more predictable
Fun finding: since UUIDs are so common, the tokenizers/embeddings seem to contain a lot of “hex n-gram with dash” tokens — likely with little semantic distinction, meaning they’re easy to mix up when models are less sophisticated or run at high temperatures. Anywho, thanks to this, UUIDs are surprisingly token efficient out of the box.
Another interesting property of the word list we construct here: words are chosen to hash to unique buckets between 0 and {dictionary size} - 1. So to translate a word/token to an integer, you just hash it. To translate a number to a word, you just look up that index in an array. Both are O(1) operations per word.
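A toy illustration of that hash-bucket property (not the repo’s actual construction): greedily keep candidate words whose hash lands in an unused bucket, and afterwards word-to-int is a single hash while int-to-word is a single array lookup.

```python
import hashlib

SIZE = 16  # toy dictionary size

def bucket(word: str) -> int:
    # Stable hash (Python's built-in hash() is salted per process).
    return int.from_bytes(hashlib.sha256(word.encode()).digest()[:4], "big") % SIZE

# Hypothetical candidate pool; keep the first word seen for each bucket.
candidates = [f"word{i}" for i in range(1000)]
table = [None] * SIZE
for w in candidates:
    b = bucket(w)
    if table[b] is None:
        table[b] = w
    if all(x is not None for x in table):
        break

def word_to_int(w: str) -> int:
    return bucket(w)    # just hash it

def int_to_word(i: int) -> str:
    return table[i]     # just an array index
```

The expensive part (finding words that fill every bucket exactly once) happens ahead of time; at runtime there is no searching at all.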
The “just remove UUIDs entirely” trick: UUID substitution
This feature is the bluntest instrument, in a good way, and if you have the luxury of persisted state, it’s what I’d recommend.
Instead of trying to make UUIDs easier for an LLM to copy, you keep UUIDs out of the LLM context entirely:
Before calling the model, replace UUIDs with short placeholders like ${UUID_1}
After the model responds, swap placeholders back to the original UUIDs
If the model never sees a real UUID, it can’t hallucinate one. It just echoes ${UUID_1} (or it doesn’t). In the “agent pipelines” world, that’s often exactly what you want: stable references, zero drift.
This substitution/restore pattern is especially useful when UUIDs appear inside long prompts, tool output, or multi-step plans where copy mistakes are common. Again though, you can’t guarantee it is using the references correctly, but you can limit the typos.
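A minimal sketch of the substitute/restore round-trip, assuming the ${UUID_n} placeholder style described above (the repo’s actual API may differ):

```python
import re

# Match standard hex UUIDs (8-4-4-4-12).
UUID_RE = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}",
    re.IGNORECASE,
)

def substitute(text: str):
    """Replace each distinct UUID with a placeholder; return text + mapping."""
    seen = {}
    def repl(m):
        u = m.group(0)
        if u not in seen:
            seen[u] = f"${{UUID_{len(seen) + 1}}}"
        return seen[u]
    out = UUID_RE.sub(repl, text)
    return out, {placeholder: u for u, placeholder in seen.items()}

def restore(text: str, mapping: dict) -> str:
    """Swap placeholders in the model's response back to real UUIDs."""
    for placeholder, u in mapping.items():
        text = text.replace(placeholder, u)
    return text
```

The same UUID always maps to the same placeholder within a conversation, so the model can refer to it repeatedly without ever seeing (or mangling) the hex.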
The “numeric suffix” in the placeholder here is a possible anti-pattern: an agent might guess that ${UUID_2} is valid if it has seen 1 and 3, and “hallucinate” it. See the section on pattern-hiding below.
Token cost and why numeric mode is underrated
If you process a ton of UUIDs, the hex form isn’t the most token efficient.
A neat alternative is numeric mode: convert a UUID to its integer form. In many tokenizers this is materially cheaper than hex UUID text (and has fewer “places to screw up”), while still round-tripping exactly.
So sometimes the best “ID safety improvement” is literally: don’t represent UUIDs as hex. Many integers tokenize into single tokens, so the numeric form comes in at roughly half the tokens of the hex form.
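Numeric mode needs nothing beyond the standard library: Python’s `uuid.UUID` exposes the 128-bit value as `.int`, and the conversion round-trips exactly.

```python
import uuid

u = uuid.UUID("550e8400-e29b-41d4-a716-446655440000")
n = u.int                      # 128-bit integer, at most 39 decimal digits
assert uuid.UUID(int=n) == u   # exact round-trip, no information lost
```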
Pattern-hiding and “faux UUIDs”
Two more utilities that turn out to be practical:
Feistel mixing / shuffled fixed-width encodings: sequential IDs don’t look sequential anymore, which helps both with agent “pattern completion” (“I saw 1,2,3 so 4 must exist”) and with not leaking obvious ordering signals.
Faux UUIDs: map internal integer IDs to UUID-formatted public IDs (bijective and deterministic with a salt), so you can keep integer primary keys but expose UUID-looking values. This is more of a side-effect of running our “UUID to INT” pipeline backwards. The target audience here is probably someone who wants a URL param to look like a UUID instead of “leaking” the size of the database through sequential user identifiers: a non-agentic use case.
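The two ideas above can be sketched together with a toy 4-round Feistel permutation over 64-bit IDs. The round function, salt, and round count here are illustrative, not the repo’s; what matters is that running the rounds in reverse order inverts the mapping, so it stays bijective and deterministic.

```python
import hashlib
import uuid

ROUNDS = 4
SALT = b"example-salt"  # hypothetical salt

def _round(half: int, key: int) -> int:
    # Keyed pseudo-random function over a 32-bit half.
    data = SALT + half.to_bytes(4, "big") + bytes([key])
    return int.from_bytes(hashlib.sha256(data).digest()[:4], "big")

def encrypt(n: int) -> int:
    left, right = n >> 32, n & 0xFFFFFFFF
    for k in range(ROUNDS):
        left, right = right, left ^ _round(right, k)
    return (left << 32) | right

def decrypt(n: int) -> int:
    # Same rounds, reverse order: undoes encrypt exactly.
    left, right = n >> 32, n & 0xFFFFFFFF
    for k in reversed(range(ROUNDS)):
        left, right = right ^ _round(left, k), left
    return (left << 32) | right

# Faux UUID: pad the permuted 64 bits to UUID width with a salt-derived
# constant, so a sequential integer key comes out UUID-shaped but invertible.
PAD = int.from_bytes(hashlib.sha256(SALT).digest()[:8], "big")

def faux_uuid(internal_id: int) -> str:
    return str(uuid.UUID(int=(PAD << 64) | encrypt(internal_id)))

def from_faux(s: str) -> int:
    return decrypt(uuid.UUID(s).int & 0xFFFFFFFFFFFFFFFF)
```

Note this toy version doesn’t set the UUID version/variant bits, which a production faux-UUID scheme would want to handle; the takeaway is just that IDs 1, 2, 3 come out looking unrelated, with no pattern to complete.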
The linear-time bit (why it stays fast)
Agents call tools in tight loops, so encode/decode needs to be cheap.
The repo keeps encoding/decoding O(n) in the number of words because runtime work is just: hash each word once + arithmetic mapping. No searching, no iterative matching. The “smart vocabulary selection” is done ahead of time when constructing the word lists, so decode can be a direct computation per token.
Where I’d use this
Not at all :) no, but in fairness, I’d probably pick some pattern from here and re-implement it with Claude rather than actually install the repo, anyway…
Any multi-step agent workflow that must carry entity references across steps
Structured outputs where “looks valid” is dangerous
Toolchains where you want a clean distinction between:
“model reasoning tokens” and
“ground-truth identifiers”
If your goal is to reduce silent ID drift, id-tokenizer gives you a few different levers—from “more robust encodings” to “don’t show the model UUIDs at all.”
