Tokenization & Byte Pair Encoding (BPE)
Tokenization is the very first step in how AI processes your text. Neural networks cannot read words — they only understand numbers. A tokenizer breaks your text into smaller units called tokens and maps each one to a unique integer ID.
The most common method is Byte Pair Encoding (BPE). It starts from the smallest units (individual characters or bytes) and repeatedly merges the most frequent adjacent pair into a new token. Over many merges, common fragments like 'ing' or 'tion' become single tokens. For example, the word 'walking' becomes two tokens: 'walk' + 'ing', mapped to integer IDs like 14502 and 389.
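The merge loop described above can be sketched in a few lines of Python. This is a toy illustration, not any production tokenizer: real BPE implementations learn merges from huge corpora and handle byte-level edge cases, but the core idea is just "count adjacent pairs, merge the most frequent one, repeat".

```python
from collections import Counter

def most_frequent_pair(tokens):
    """Count adjacent token pairs and return the most common one (or None)."""
    pairs = Counter(zip(tokens, tokens[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(tokens, pair):
    """Replace every occurrence of `pair` with a single merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged

def bpe_train(text, num_merges):
    """Start from single characters and repeatedly merge the best pair."""
    tokens = list(text)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        merges.append(pair)
        tokens = merge_pair(tokens, pair)
    return tokens, merges

# On a tiny corpus, the first merges build up the common stem 'walk':
tokens, merges = bpe_train("walking walking walked", 3)
# merges == [('w', 'a'), ('wa', 'l'), ('wal', 'k')]
```

After just three merges, 'walk' is a single token even in this toy corpus, which is exactly how frequent fragments earn their own IDs in a real vocabulary.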
Key insight: AI models don't process raw words — they process integer sequences. This is why the same word can be split differently across languages, and why token limits (like '128K context window') matter. Every word you type costs tokens, and tokens cost compute.
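The word-to-integer mapping is just a vocabulary lookup. A minimal sketch, using the illustrative IDs 14502 and 389 from the example above (the vocabulary itself and the '<unk>' fallback for unknown tokens are assumptions for illustration):

```python
def encode(tokens, vocab):
    """Map each token string to its integer ID; unknown tokens fall back to <unk>."""
    return [vocab.get(t, vocab["<unk>"]) for t in tokens]

# Hypothetical vocabulary fragment; real vocabularies hold tens of thousands of entries.
vocab = {"<unk>": 0, "walk": 14502, "ing": 389}

ids = encode(["walk", "ing"], vocab)
# ids == [14502, 389] -- this integer sequence is what the model actually sees
```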
Why it matters: Understanding tokenization explains why AI sometimes struggles with counting letters in words, why different languages use different numbers of tokens for the same content, and why there are limits on how much text you can send to ChatGPT or Claude in one message.