Chapter 04 · 10 min

From words to numbers

A neural network only understands numbers. So before any AI can read your sentence, it has to turn the sentence into vectors. The way it does that is more interesting than it sounds, and it's where words start having geometry.

Figure: a 2D plane scattered with small dots in three labeled clusters (finance, food, and music). Similar concepts sit close together; unrelated ones are far apart.

A map of meaning where similar ideas live close together.

Tokens: the unit of language for a model

An LLM doesn't read words. It reads tokens. A token is a chunk of text, usually a sub-word. In a typical tokenizer, "hello" is one token and "unbelievable" is three ("un", "believ", "able"), while an emoji or a Chinese character may take one token or several. The exact split depends on the tokenizer, which is built from a large text sample before the model itself is trained and then stays fixed.

Why sub-words and not full words? Two reasons. (1) The vocabulary stays small (typically tens of thousands to a few hundred thousand tokens) instead of millions, which keeps the network manageable. (2) The model can handle new words it never saw during training by composing them from familiar pieces. "Sden-pilled" is novel, but "sden" + "pill" + "ed" are all in the vocabulary.

Figure: the word "unbelievable" split into three sub-word tokens, each mapped to a numeric token id from the model's vocabulary: "un" (ID 1986), "believ" (ID 2113), "able" (ID 4041).
A single English word can split into several tokens. Each chunk is mapped to a numeric id from a fixed vocabulary.
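The split in the figure can be imitated with a toy tokenizer. This is a minimal sketch that uses greedy longest-match against a tiny hand-written vocabulary; real tokenizers usually learn their pieces with algorithms such as byte-pair encoding, so treat the matching rule as a simplification. The three ids for "unbelievable" come from the figure, and every other id is made up for illustration.

```python
# Toy vocabulary: sub-word piece -> token id.
# The first three ids are taken from the figure; the rest are hypothetical.
VOCAB = {
    "un": 1986,
    "believ": 2113,
    "able": 4041,
    "sden": 7001,  # hypothetical id
    "pill": 7002,  # hypothetical id
    "ed": 7003,    # hypothetical id
    "-": 7004,     # hypothetical id
}


def tokenize(word: str) -> list[str]:
    """Split a word into known pieces, always taking the longest match."""
    pieces = []
    start = 0
    while start < len(word):
        # Try the longest remaining substring first, then shrink it.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in VOCAB:
                pieces.append(piece)
                start = end
                break
        else:
            raise ValueError(f"no vocabulary piece matches {word[start:]!r}")
    return pieces


def encode(word: str) -> list[int]:
    """Map a word to the list of token ids the model actually receives."""
    return [VOCAB[piece] for piece in tokenize(word)]


print(tokenize("unbelievable"))  # ['un', 'believ', 'able']
print(encode("unbelievable"))    # [1986, 2113, 4041]
print(tokenize("sden-pilled"))   # ['sden', '-', 'pill', 'ed']
```

The last call shows the second benefit of sub-words: a word that was never seen whole still encodes cleanly because its pieces are known.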

Embeddings: turning tokens into meaning

Each token in the vocabulary is mapped to a vector, a list of, say, 4096 numbers. That vector is called the token's embedding. It is *learned during training*, not designed by hand. After training, similar words end up with similar vectors.
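Mechanically, the mapping is a table lookup. The sketch below assumes a toy vocabulary of 10 tokens and 4 numbers per vector (instead of the 4096 mentioned above), and it fills the table with random values that stand in for what training would learn. All names are illustrative.

```python
import random

VOCAB_SIZE = 10    # toy value; real vocabularies are far larger
EMBEDDING_DIM = 4  # toy value; the text uses 4096 as an example

# One row per token id. Training would adjust these numbers; here they
# simply keep their random starting values.
rng = random.Random(0)
embedding_table = [
    [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids: list[int]) -> list[list[float]]:
    """Look up one vector per token id. No computation, just indexing."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([3, 7, 3])
print(len(vectors), len(vectors[0]))  # 3 4
print(vectors[0] == vectors[2])       # True: same token id, same vector
```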

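Concretely: the vectors for "king" and "queen" are close. So are "Paris" and "Berlin." Famously, you can even do approximate arithmetic: vector("king") − vector("man") + vector("woman") ≈ vector("queen"). The result does not land exactly on "queen", but "queen" is usually among its nearest neighbours. Meaning has been compressed into geometry.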

Figure: four word-points (man, woman, king, queen) in a 2D projection of embedding space. The vector from "man" to "woman" is parallel to the vector from "king" to "queen", visualising the king − man + woman ≈ queen relationship.
Meaning compressed into geometry. The vector from “man” to “woman” lines up with the vector from “king” to “queen”.
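The arithmetic can be tried on a toy example. The 2D vectors below are hand-picked so that the man-to-woman offset equals the king-to-queen offset exactly; learned embeddings only satisfy this approximately. Excluding the three input words from the candidates is a common convention in such demos and is an assumption here.

```python
import math

# Hand-picked 2D vectors, chosen so the analogy holds exactly.
EMBEDDINGS = {
    "man": [1.0, 1.0],
    "woman": [0.0, 4.0],
    "king": [4.0, 2.0],
    "queen": [3.0, 5.0],
    "apple": [-3.0, 0.0],
}


def analogy(a: str, b: str, c: str) -> str:
    """Return the word closest to vector(a) - vector(b) + vector(c)."""
    target = [
        x - y + z
        for x, y, z in zip(EMBEDDINGS[a], EMBEDDINGS[b], EMBEDDINGS[c])
    ]
    # Skip the input words so the answer is a genuinely different word.
    candidates = [word for word in EMBEDDINGS if word not in (a, b, c)]
    return min(candidates, key=lambda word: math.dist(EMBEDDINGS[word], target))


print(analogy("king", "man", "woman"))  # queen
```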

Static vs contextual embeddings

Early embedding methods (word2vec 2013, GloVe 2014) gave each word *one* vector forever. Useful but limited: "bank" (river) and "bank" (money) had the same vector. Modern LLMs use contextual embeddings: the vector for "bank" in "river bank" is genuinely different from "bank" in "bank account." The vector you get depends on the whole sentence.

How? The token's initial embedding is just a lookup, but then it flows through layers of attention (chapter 5) that *blend* it with the embeddings of the other tokens in the context. By the last layer, each token's vector has absorbed information about its surroundings. That, in short, is contextual embedding.
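The blending step can be sketched in a few lines. This is a heavily simplified stand-in for one attention layer: each token's new vector is a weighted average of all vectors in the sentence, with weights from a softmax over dot products. Real attention adds learned projections, scaling, and multiple heads, and the 2D static vectors here are hypothetical.

```python
import math

# Hypothetical static embeddings (2D, hand-picked).
STATIC = {
    "river": [1.0, 0.0],
    "account": [0.0, 1.0],
    "bank": [0.5, 0.5],
}


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into positive weights that sum to 1."""
    exps = [math.exp(score) for score in scores]
    total = sum(exps)
    return [value / total for value in exps]


def contextualize(tokens: list[str]) -> list[list[float]]:
    """Blend each token's vector with every vector in the sentence."""
    vectors = [STATIC[token] for token in tokens]
    blended_vectors = []
    for query in vectors:
        # Dot-product similarity between this token and each token.
        scores = [sum(q * k for q, k in zip(query, key)) for key in vectors]
        weights = softmax(scores)
        blended = [
            sum(weight * vector[dim] for weight, vector in zip(weights, vectors))
            for dim in range(len(query))
        ]
        blended_vectors.append(blended)
    return blended_vectors


river_bank = contextualize(["river", "bank"])[1]
bank_account = contextualize(["bank", "account"])[0]
print(river_bank)    # pulled toward "river"
print(bank_account)  # pulled toward "account"
```

The same starting vector for "bank" ends up in two different places, depending only on its neighbour.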

What's actually in those numbers

No single dimension of the vector is "the gender dimension" or "the food dimension." The directions that carry meaning are not aligned with the axes. They're learned, and distributed across many dimensions. You can sometimes find interpretable directions such as "royalty," "plural," or "capital city," but they're not pre-labelled. Interpretability research is a young field trying to read the vectors back into human concepts.
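One simple way to look for such a direction is to subtract two vectors that differ mainly in one concept, then project other words onto the result. The sketch below does this with hand-picked 2D vectors; the "royalty" label and the choice of "man" and "king" as the defining pair are assumptions for illustration, and real interpretability work uses far more careful methods.

```python
import math

# Hand-picked 2D vectors (hypothetical, not learned).
EMBEDDINGS = {
    "man": [1.0, 1.0],
    "woman": [0.0, 4.0],
    "king": [4.0, 2.0],
    "queen": [3.0, 5.0],
}


def direction(from_word: str, to_word: str) -> list[float]:
    """Vector pointing from one word's embedding to another's."""
    return [b - a for a, b in zip(EMBEDDINGS[from_word], EMBEDDINGS[to_word])]


def projection(word: str, axis: list[float]) -> float:
    """Length of the word's vector along the given direction."""
    dot = sum(x * y for x, y in zip(EMBEDDINGS[word], axis))
    return dot / math.sqrt(sum(y * y for y in axis))


royalty = direction("man", "king")
print(royalty)  # [3.0, 1.0]: a mix of both axes, not a single dimension

for word in EMBEDDINGS:
    print(word, round(projection(word, royalty), 2))
# man 1.26, woman 1.26, king 4.43, queen 4.43
```

Neither raw coordinate separates royals from non-royals on its own, but the projection onto the combined direction does.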

Figure: two vectors A and B drawn from a shared origin, with the small angle θ between them marked. Cosine similarity is the cosine of that angle, cos(θ) = (A · B) / (‖A‖ ‖B‖): close to 1 when the vectors point the same way, 0 when they are perpendicular.
Cosine similarity measures the angle between two vectors, not the distance, so two ideas can match in direction even when their magnitudes differ.
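The formula cos(θ) = (A · B) / (‖A‖ ‖B‖) translates directly into code. A minimal sketch using only the standard library; the function name is illustrative, and the function assumes neither vector is all zeros.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so this would divide by zero.
    return dot / (norm_a * norm_b)


# Same direction, very different magnitudes: similarity is about 1.
print(round(cosine_similarity([1.0, 2.0], [10.0, 20.0]), 6))  # 1.0

# Perpendicular vectors: similarity is 0.
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```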

Embeddings outside LLMs

Embeddings are not just an internal LLM artefact; they're a product on their own. You can call an API, give it a chunk of text, and get back a 1024- or 4096-dim vector. That vector is the input to:

  • Semantic search: "find me documents about Q4 hiring" instead of "match the exact words."
  • RAG (retrieval-augmented generation): find the right docs, then feed them to the LLM.
  • Clustering: group support tickets by topic without labels.
  • Recommendation: products with embeddings close to what the user liked.
  • Anomaly detection: flag inputs whose embedding is far from anything else.

Embeddings turn out to be one of the most useful, cheapest, lowest-risk applications of LLM technology. We come back to RAG in chapter 7.
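Semantic search, the first item in the list above, reduces to ranking documents by how close their embeddings are to the query's embedding. The sketch below assumes the vectors have already been computed by some embedding model or API; the three-number vectors and document titles are invented for illustration, and a real system would store thousands of higher-dimensional vectors in an index.

```python
import math

# Hypothetical pre-computed embeddings (document text -> vector).
DOCUMENTS = {
    "Recruiting plan and headcount targets for next quarter": [0.9, 0.1, 0.0],
    "Office lunch menu for December": [0.0, 0.2, 0.9],
    "Quarterly revenue forecast": [0.3, 0.9, 0.1],
}

# Hypothetical embedding of the query "documents about Q4 hiring".
QUERY_VECTOR = [0.8, 0.2, 0.1]


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def search(query_vector, documents, top_k=2):
    """Return the top_k (score, text) pairs, best match first."""
    scored = [
        (cosine_similarity(query_vector, vector), text)
        for text, vector in documents.items()
    ]
    scored.sort(reverse=True)
    return scored[:top_k]


results = search(QUERY_VECTOR, DOCUMENTS)
for score, text in results:
    print(f"{score:.2f}  {text}")
```

The top result is the recruiting document even though it shares no words with the query, which is the difference from exact-word matching. RAG uses the same retrieval step and then hands the retrieved text to the LLM.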

In one line each

  • Models read tokens, not words. Pricing, limits, and many subtle bugs live at this layer.
  • Each token maps to a learned vector (embedding) so that similar meanings end up close in vector space.
  • Modern embeddings are contextual: the same word has different vectors in different sentences.
  • Embeddings are useful on their own: semantic search, RAG, clustering, recommendation, anomaly detection.
From words to numbers · AI courses · SDEN