A map of meaning

“A map of meaning where similar ideas live close together.”

Tokens: the unit of language for a model

An LLM doesn't read words. It reads tokens. A token is a chunk of text, usually a sub-word: with a typical tokenizer, "hello" is one token, "unbelievable" might be three ("un", "believ", "able"), an emoji is often one to a few tokens, and a Chinese character is usually one to three. The exact split depends on the tokenizer, which is trained on a text corpus before the model itself and then kept fixed.

Why sub-words and not full words? Two reasons. (1) Vocabulary stays small (roughly 50,000 to 100,000 tokens in many models, more in some recent ones) instead of millions, which keeps the network manageable. (2) The model can handle new words it never saw during training, by composing them from familiar pieces. "Sden-pilled" is novel, but it can be built from pieces such as "sden" + "pill" + "ed" that are already in the vocabulary.

A single English word can split into several tokens. Each chunk is mapped to a numeric id from a fixed vocabulary.
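To make the split concrete, here is a minimal sketch of sub-word tokenization. It assumes a tiny hand-written vocabulary and uses greedy longest-match, which is a simplification: production tokenizers (byte-pair encoding, for example) learn their pieces from data. Every vocabulary entry and id below is hypothetical.

```python
# Toy vocabulary: every entry and id is hypothetical, chosen only for this example.
VOCAB = {"hello": 0, "un": 1, "believ": 2, "able": 3, "pill": 4, "ed": 5}
# Single letters act as a fallback so any lowercase word can still be split.
for ch in "abcdefghijklmnopqrstuvwxyz":
    VOCAB.setdefault(ch, len(VOCAB))


def tokenize(word):
    """Split a word by repeatedly taking the longest vocabulary entry that matches."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest candidate first, then shrink it one character at a time.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[start]!r}")
    return pieces


def encode(word):
    """Map each piece to its numeric id from the fixed vocabulary."""
    return [VOCAB[piece] for piece in tokenize(word)]


print(tokenize("hello"))         # ['hello']
print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(encode("unbelievable"))    # [1, 2, 3]
print(tokenize("pilled"))        # ['pill', 'ed']
```

The last call shows the second benefit: "pilled" is not in the vocabulary, yet it is still covered by two familiar pieces.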

Embeddings: turning tokens into meaning

Each token in the vocabulary is mapped to a vector, a list of, say, 4096 numbers. That vector is called the token's embedding. It is *learned during training*, not designed by hand. After training, similar words end up with similar vectors.
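Mechanically, the mapping is a table lookup. The sketch below assumes a six-token vocabulary and 4-dimensional vectors filled with random numbers; in a real model the table is far larger and its values are set by training, not by a random number generator.

```python
import random

VOCAB_SIZE = 6  # hypothetical tiny vocabulary
DIM = 4         # real models use hundreds to thousands of dimensions

random.seed(0)
# One row per token id. Training would adjust these numbers; here they stay random.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Return the vector for each token id: a plain row lookup, nothing more."""
    return [embedding_table[i] for i in token_ids]


vectors = embed([1, 2, 3])
print(len(vectors), len(vectors[0]))  # 3 4
```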

Concretely: the vectors for "king" and "queen" are close. So are "Paris" and "Berlin." Famously, you can do arithmetic: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). Meaning has been compressed into geometry.

Meaning compressed into geometry. The vector from “man” to “woman” lines up with the vector from “king” to “queen”.
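The arithmetic can be tried on toy data. The vectors below are hand-picked so that the analogy works out; they are not taken from any trained model, where the match is only approximate. Excluding the three input words from the search is a common convention in analogy tests and is assumed here.

```python
import math

# Hand-picked 3-D vectors for illustration only, not learned embeddings.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.2, 0.8],
    "paris": [0.5, 0.5, 0.5],
}


def nearest(target, exclude):
    """Return the word whose vector is closest to target (Euclidean distance)."""
    candidates = [word for word in vectors if word not in exclude]
    return min(candidates, key=lambda word: math.dist(vectors[word], target))


# vector("king") - vector("man") + vector("woman"), computed per dimension
result = [
    k - m + w
    for k, m, w in zip(vectors["king"], vectors["man"], vectors["woman"])
]

print(nearest(result, exclude={"king", "man", "woman"}))  # queen
```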

Static vs contextual embeddings

Early embedding methods (word2vec, 2013; GloVe, 2014) gave each word *one* vector forever. Useful but limited: "bank" (river) and "bank" (money) had the same vector. Modern LLMs use contextual embeddings: the vector for "bank" in "river bank" is genuinely different from "bank" in "bank account." The vector you get depends on the whole sentence.

How? The token's initial embedding is just a lookup, but then it flows through layers of attention (chapter 5) that *blend* it with the embeddings of the other tokens in the context. By the last layer, each token's vector has absorbed information about its context. That, in short, is contextual embedding.
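A stripped-down sketch of that blending step, under heavy assumptions: one attention step, hand-picked 2-D starting vectors, and none of the learned projections, multiple heads, or stacked layers a real transformer has. It shows only the core idea, that the same starting vector for "bank" comes out different depending on its neighbours.

```python
import math

# Hand-picked 2-D starting embeddings (hypothetical values).
static = {
    "river":   [1.0, 0.0],
    "bank":    [0.5, 0.5],
    "account": [0.0, 1.0],
}


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def contextualize(tokens):
    """One simplified attention step: each output is a weighted average of all inputs."""
    inputs = [static[t] for t in tokens]
    outputs = []
    for query in inputs:
        # Dot product with every token (itself included) scores how related they are.
        scores = [sum(q * k for q, k in zip(query, key)) for key in inputs]
        weights = softmax(scores)
        blended = [
            sum(w * vec[d] for w, vec in zip(weights, inputs))
            for d in range(len(query))
        ]
        outputs.append(blended)
    return outputs


bank_near_river = contextualize(["river", "bank"])[1]
bank_near_account = contextualize(["bank", "account"])[0]
print(bank_near_river)    # [0.75, 0.25]
print(bank_near_account)  # [0.25, 0.75]
```

Both calls start from the same lookup vector for "bank", but the output leans toward "river" in one case and toward "account" in the other.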

What's actually in those numbers

No single dimension of the vector is "the gender dimension" or "the food dimension." The directions that carry meaning are not aligned with the axes. They're learned, distributed across many dimensions. You can sometimes find interpretable directions such as "royalty," "plural," or "capital city," but they're not pre-labelled. Interpretability research is a young field trying to read the vectors back into human concepts.

Cosine similarity measures the angle between two vectors, not the distance, so two ideas can match in direction even when their magnitudes differ.
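A minimal sketch of the calculation, using made-up 2-D vectors. The two vectors point the same way but differ tenfold in length, so their cosine similarity is essentially 1 while their Euclidean distance is large.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: dot product divided by both lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


short = [1.0, 2.0]
long_ = [10.0, 20.0]  # same direction, ten times the magnitude

print(cosine_similarity(short, long_))            # approximately 1.0
print(math.dist(short, long_))                    # about 20.12
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0 (perpendicular)
```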

Embeddings outside LLMs

Embeddings are not just an internal LLM artefact; they're a product on their own. You can call an API, give it a chunk of text, and get back a 1024- or 4096-dim vector. That vector is the input to:

Semantic search: "find me documents about Q4 hiring" instead of "match the exact words."
RAG (retrieval-augmented generation): find the right docs, then feed them to the LLM.
Clustering: group support tickets by topic without labels.
Recommendation: products with embeddings close to what the user liked.
Anomaly detection: flag inputs whose embedding is far from anything else.
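Most of the uses above reduce to the same operation: compare vectors and rank by similarity. Here is a sketch of semantic search under that assumption. The document titles and their 3-D vectors are invented for illustration; in practice the vectors would come from an embedding model or API, and `query_vector` stands in for the embedding of a query like "Q4 hiring".

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical precomputed embeddings, one per document.
documents = {
    "Q4 recruiting plan": [0.9, 0.1, 0.0],
    "Office lunch menu": [0.0, 0.2, 0.9],
    "Headcount targets for next quarter": [0.8, 0.3, 0.1],
}

# Stand-in for the embedding of the user's query.
query_vector = [1.0, 0.2, 0.0]


def search(query_vec, docs, top_k=2):
    """Return the top_k document titles, most similar to the query first."""
    ranked = sorted(
        docs,
        key=lambda title: cosine_similarity(query_vec, docs[title]),
        reverse=True,
    )
    return ranked[:top_k]


print(search(query_vector, documents))
# ['Q4 recruiting plan', 'Headcount targets for next quarter']
```

Neither result shares the word "hiring" with the query; they rank highest because their vectors point in a similar direction. Flipping the sort order, or thresholding the best score, gives a rough starting point for anomaly detection.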

Embeddings turn out to be one of the most useful, cheapest, lowest-risk applications of LLM technology. We come back to RAG in chapter 7.

In one line each

Models read tokens, not words. Pricing, limits, and many subtle bugs live at this layer.
Each token maps to a learned vector (embedding) so that similar meanings end up close in vector space.
Modern embeddings are contextual: the same word has different vectors in different sentences.
Embeddings are useful on their own: semantic search, RAG, clustering, recommendation, anomaly detection.

Where to go next

Chapter 5: Transformers & attention