Why There Is No "Tokenization-Only Model" in Modern NLP

Introduction

When working with Large Language Models (LLMs), developers frequently encounter tokenizers—AutoTokenizer, tiktoken, SentencePiece, or WordPiece. A natural question arises:

Why isn’t there a standalone “tokenization model,” like there are embedding and transformer models?

This article explains why tokenization is not treated as a neural model, how tokenizers actually work, and why they are always tightly coupled with language models.

What Tokenization Really Is

Tokenization is the process of converting raw text into discrete units called tokens, which can be:

  • Words ("hello")

  • Subwords ("play" + "ing")

  • Characters

  • Byte-level units

Example:

"I am learning NLP"
→ ["I", "am", "learn", "ing", "NLP"]

These tokens are later mapped to token IDs, which are numerical indices used by neural networks.
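The token-to-ID step is a plain dictionary lookup. Here is a minimal sketch with an invented five-entry vocabulary (real tokenizers ship vocabularies with tens of thousands of entries):

```python
# Toy illustration of the token -> ID mapping step.
# This vocabulary is invented for demonstration purposes only.
vocab = {"I": 0, "am": 1, "learn": 2, "ing": 3, "NLP": 4}

tokens = ["I", "am", "learn", "ing", "NLP"]
token_ids = [vocab[t] for t in tokens]

print(token_ids)  # [0, 1, 2, 3, 4]
```

No computation beyond a table lookup happens here, which is exactly why this step needs no neural network.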

Why Tokenization Is Not a Neural Model

Tokenization Is Deterministic, Not Predictive

A neural model:

  • Learns patterns

  • Makes predictions

  • Produces probabilities

A tokenizer:

  • Applies fixed rules

  • Uses a static vocabulary

  • Always produces the same output for the same input

Input: "ChatGPT"
Output tokens: always the same

There is no inference, no gradients, and no learning during runtime.
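The determinism is easy to demonstrate. A sketch using a purely rule-based regex tokenizer (the pattern is a simplification for illustration, not what production tokenizers use):

```python
import re

# A purely rule-based tokenizer: one regular expression, no learned weights.
# Running it any number of times on the same input yields identical output.
def tokenize(text):
    return re.findall(r"\w+|[^\w\s]", text)

a = tokenize("ChatGPT is deterministic!")
b = tokenize("ChatGPT is deterministic!")
print(a)       # ['ChatGPT', 'is', 'deterministic', '!']
print(a == b)  # True
```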

Tokenizers Are Algorithms + Lookup Tables

Modern tokenizers are built using algorithms such as:

Algorithm                   Used By
BPE (Byte Pair Encoding)    GPT, RoBERTa
WordPiece                   BERT
SentencePiece               T5, LLaMA
Byte-level BPE              GPT-2

These consist of:

  • A merge rules file

  • A vocabulary file

  • A normalization pipeline

This is fundamentally different from neural network weights.
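To make "merge rules file" concrete, here is a simplified sketch of how BPE merges could be applied at encode time. The merge list is invented for this example, and the loop applies each rule exhaustively in order; production BPE instead re-scans for the lowest-ranked pair after every merge:

```python
# Invented merge rules; real tokenizers load these from a merges file.
merges = [("l", "e"), ("le", "a"), ("lea", "r"), ("lear", "n"),
          ("i", "n"), ("in", "g")]

def bpe_encode(word, merges):
    """Split a word into characters, then apply merge rules in priority order."""
    symbols = list(word)
    for a, b in merges:
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]  # merge the adjacent pair in place
            else:
                i += 1
    return symbols

print(bpe_encode("learning", merges))  # ['learn', 'ing']
```

Everything here is list manipulation and string concatenation driven by a static rule table; there are no weights anywhere.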

Tokenization Happens Before the Model

LLM pipelines look like this:

Text → Tokenizer (rules + vocab) → Token IDs → Embedding Layer → Transformer Layers → Output

The tokenizer is outside the neural network graph.

Since it doesn’t participate in training or inference, it isn’t treated as a "model".

Why Tokenizers Are Bundled with Models

Each LLM is trained using a specific tokenizer.

Token ID Dependency

If a model expects:

"hello" → token ID 15496

and you use a different tokenizer that produces:

"hello" → token ID 2091

then the model breaks.

That’s why tokenizers are inseparable from their models.

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

You never choose a tokenizer independently of the model because:

  • It must match the model’s embedding matrix

  • Token IDs must align exactly
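A toy sketch of why this alignment matters. The IDs and "embedding" rows below are invented: a mismatched tokenizer makes the model look up the wrong row, so the network receives a vector it never learned to associate with the word:

```python
# Vocabulary the model was trained with vs. a different tokenizer's vocabulary.
# All numbers here are invented toy values.
model_vocab = {"hello": 2, "world": 3}
wrong_vocab = {"hello": 7, "world": 2}

# Stand-in for the model's embedding matrix: token ID -> learned vector.
embedding_matrix = {2: [0.1, 0.9], 3: [0.8, 0.2], 7: [0.5, 0.5]}

right_id = model_vocab["hello"]
wrong_id = wrong_vocab["hello"]

print(embedding_matrix[right_id])  # the row the model learned for "hello"
print(embedding_matrix[wrong_id])  # an unrelated row: garbage input to the model
```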

Why We Can Train Tokenizers but Still Don’t Call Them Models

You can train a tokenizer:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["[UNK]"])
tokenizer.train_from_iterator(corpus, trainer=trainer)  # corpus: any iterable of strings

But this training:

  • Does not involve backpropagation

  • Does not optimize loss

  • Does not generalize

It simply builds a compressed vocabulary based on frequency.

Thus, it’s considered preprocessing, not modeling.
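The core of that "training" is nothing more than counting. A toy sketch of a single BPE training step on an invented three-word corpus: tally adjacent symbol pairs and pick the most frequent one as the next vocabulary entry:

```python
from collections import Counter

# One BPE "training" step: count adjacent symbol pairs by frequency.
# No loss, no gradients, no backpropagation; just counting.
corpus = [list("low"), list("lower"), list("lowest")]

pair_counts = Counter()
for word in corpus:
    for a, b in zip(word, word[1:]):
        pair_counts[(a, b)] += 1

# The most frequent pair becomes a new merged vocabulary entry.
best_pair = pair_counts.most_common(1)[0][0]
print(best_pair)  # ('l', 'o')
```

Real tokenizer training repeats this step until the target vocabulary size is reached, but each iteration remains frequency counting, not optimization.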

Role of AutoTokenizer

AutoTokenizer is a factory, not a model.

AutoTokenizer.from_pretrained("bert-base-uncased")

What it does:

  • Reads tokenizer configuration

  • Loads correct algorithm (BPE, WordPiece, etc.)

  • Loads vocabulary and merges

  • Applies normalization rules

It abstracts tokenizer differences, so developers don’t have to manage them manually.
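The factory idea can be sketched in a few lines. The class names and registry below are hypothetical stand-ins, not the real transformers internals:

```python
# Hypothetical sketch of the factory pattern behind AutoTokenizer:
# read a config, then dispatch to the matching tokenizer class.
class BPETokenizer: ...
class WordPieceTokenizer: ...

_REGISTRY = {"bpe": BPETokenizer, "wordpiece": WordPieceTokenizer}

def auto_tokenizer(config):
    """Pick the tokenizer class named in the config. No learning involved."""
    return _REGISTRY[config["tokenizer_type"]]()

tok = auto_tokenizer({"tokenizer_type": "wordpiece"})
print(type(tok).__name__)  # WordPieceTokenizer
```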

Why Embedding Models Exist but Tokenization Models Don’t

Feature               Tokenizer    Embedding Model
Learns from data      No           Yes
Uses neural network   No           Yes
Trainable with loss   No           Yes
Produces vectors      No           Yes
Deterministic         Yes          No

Embedding models learn semantic meaning, which requires neural networks.

Tokenization does not.

Key Takeaways

  1. Tokenization is not a neural task

  2. It is deterministic preprocessing

  3. Tokenizers are algorithm + vocabulary, not models

  4. They are tightly coupled to LLMs

  5. AutoTokenizer exists for convenience, not learning

Final Thought

Tokenization is to LLMs what a compiler is to a program: it prepares the input, but it doesn't think.

That’s why there is no standalone tokenization-only model in modern NLP.