Tokenization is the process of breaking text into smaller pieces called tokens that AI models can process. It's the first step in how language models read and generate text, and understanding it explains some otherwise puzzling AI behaviors — like why models struggle with character counting or simple spelling tasks.
Why tokenization matters: Computers don't understand words. They need numerical inputs. Tokenization converts human-readable text into sequences of numbers that neural networks can process. The way text is split into tokens directly affects model performance, cost, and capabilities.
How modern tokenizers work:
Most current models use subword tokenization — typically the Byte Pair Encoding (BPE) algorithm, or the BPE and unigram variants implemented in the SentencePiece library. These split text into a mix of whole words, word parts, and individual characters:
- Common words stay whole: "the" = 1 token, "hello" = 1 token
- Less common words get split: "tokenization" = "token" + "ization" (2 tokens)
- Rare words split further: "defenestration" = "def" + "en" + "est" + "ration" (4 tokens)
- Numbers often split: "2024" might be "20" + "24" (2 tokens)
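The splitting behavior above can be sketched with a toy greedy longest-match tokenizer. Note the simplification: real BPE applies a learned sequence of merge rules rather than longest-match lookup, and the small vocabulary below is hypothetical — real models learn 50,000-100,000 entries.

```python
# Hypothetical subword vocabulary (a real model learns tens of thousands).
VOCAB = {"the", "hello", "token", "ization", "def", "en", "est", "ration", "20", "24"}

def tokenize(word: str) -> list[str]:
    """Greedily match the longest vocabulary entry at each position,
    falling back to a single character for anything unknown."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):   # try the longest substring first
            piece = word[i:j]
            if piece in VOCAB or j == i + 1:  # single-char fallback
                tokens.append(piece)
                i = j
                break
    return tokens

print(tokenize("hello"))           # ['hello'] -> 1 token
print(tokenize("tokenization"))    # ['token', 'ization'] -> 2 tokens
print(tokenize("defenestration"))  # ['def', 'en', 'est', 'ration'] -> 4 tokens
print(tokenize("2024"))            # ['20', '24'] -> 2 tokens
```

The fallback to single characters is what guarantees coverage: any string can be tokenized, even words the vocabulary has never seen.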
Why this design: A vocabulary of 50,000-100,000 subword tokens balances efficiency and coverage. Pure word-level tokenization would need millions of entries and couldn't handle new words. Pure character-level tokenization would make sequences too long and lose word-level patterns.
Practical implications you should know:
Cost: API pricing is based on tokens. GPT-4, for example, is priced per 1,000 tokens. A typical English word averages about 1.3 tokens, so 750 words is roughly 1,000 tokens. Knowing this helps you estimate costs: a 1,000-word document requires roughly 1,300 input tokens to process.
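That estimate is easy to script. The 1.3 tokens-per-word average comes from the text above; the price per 1,000 tokens below is a placeholder, not a real rate:

```python
TOKENS_PER_WORD = 1.3        # rough average for English text
PRICE_PER_1K_TOKENS = 0.01   # hypothetical $/1K input tokens, not a real price

def estimate_tokens(word_count: int) -> int:
    """Approximate token count for an English word count."""
    return round(word_count * TOKENS_PER_WORD)

def estimate_cost(word_count: int) -> float:
    """Approximate input cost in dollars at the placeholder rate."""
    return estimate_tokens(word_count) / 1000 * PRICE_PER_1K_TOKENS

print(estimate_tokens(1000))          # 1300 tokens
print(f"${estimate_cost(1000):.4f}")  # $0.0130 at the placeholder rate
```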
Context windows: When a model has a 128K token context window, that's approximately 96,000 words or about 300 pages. Understanding token counts helps you know how much information you can feed to a model.
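The same ratio converts a context window into words and pages. The 0.75 words-per-token figure follows from "750 words is roughly 1,000 tokens" above; the 320 words-per-page figure is a rough assumption for a typical printed page:

```python
WORDS_PER_TOKEN = 0.75   # ~750 words per 1,000 tokens, as above
WORDS_PER_PAGE = 320     # rough assumption for a typical printed page

def window_to_words(context_tokens: int) -> int:
    """Approximate how many English words fit in a context window."""
    return round(context_tokens * WORDS_PER_TOKEN)

def window_to_pages(context_tokens: int) -> int:
    """Approximate the same capacity in printed pages."""
    return round(window_to_words(context_tokens) / WORDS_PER_PAGE)

print(window_to_words(128_000))  # 96000 words
print(window_to_pages(128_000))  # 300 pages
```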
Non-English languages: Many tokenizers were trained primarily on English text, so other languages often require more tokens per word. Chinese, Japanese, and Korean can require 2-3x more tokens for equivalent content, making API costs higher.
Why AI struggles with some simple tasks: Because models see tokens, not individual characters, they can't easily count letters in a word or do character-level manipulations. "How many r's in strawberry?" is hard because the model sees tokens like "straw" + "berry," not individual letters.
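The mismatch can be made concrete: counting characters on the raw string is trivial, but the model's actual input is a sequence of opaque token IDs. The split and IDs below are hypothetical, for illustration only:

```python
word = "strawberry"
tokens = ["straw", "berry"]   # hypothetical subword split
token_ids = [31942, 19772]    # hypothetical vocabulary IDs

# Counting characters on the raw string is trivial...
print(word.count("r"))        # 3

# ...but the model never sees the string, only integer IDs
# that carry no direct letter-level information:
print(token_ids)              # [31942, 19772]
```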
Code tokenization: Programming languages tokenize differently from natural language: whitespace, brackets, and operators all consume tokens. Deeply nested, heavily indented code can therefore spend a significant share of its token budget on formatting alone.
Understanding tokenization helps you write better prompts, estimate costs accurately, and understand model limitations that might otherwise seem inexplicable.