A confidence gauge stuck near 100%

“A confident first-year intern with photographic memory and no judgment.”

What LLMs are genuinely great at

Anything that's mostly a transformation of text from one form to another, where being approximately right is acceptable, plays to the model's strengths. Summaries. Drafts. Translations. Rewrites. Extracting structured data from unstructured prose. Brainstorming a list. Explaining a paragraph at a different reading level. These are not toys; they are some of the highest-leverage tasks in knowledge work, and the model does them well.

Generation is fast, plausibility is high, and the cost per task is approaching zero. The right mental model is not "oracle" but "infinite patient assistant who drafts everything in seconds."

What they can do, with the right scaffolding

Multi-step reasoning, code generation, tool use, knowledge retrieval. None of these are reliable from a model alone; all of them work well when you wrap the model in some structure.

Coding: accept that the first answer is a draft. Pair the model with a real type checker, a test suite, and a feedback loop. The model is excellent at producing plausible code; correctness comes from the loop.
Math and arithmetic: give the model a calculator or a Python tool. On its own it confabulates numbers.
Knowledge retrieval: pair it with a search index or vector database (RAG). Don't expect the model to remember accurate facts after its training cutoff.
Multi-step tasks: break the task into smaller prompts, or use an explicit "chain of thought" approach. Reasoning improves dramatically when the model is allowed to think out loud.

What they cannot do, no matter how you prompt

There are limits that no amount of clever prompting fixes. Recognising them is the difference between a working system and a broken one.

They do not know what they don't know. The model will produce a plausible answer with the same confidence whether it actually knows or is guessing. This is what "hallucination" really means: not malice or error, but uncalibrated confidence.

They have no persistent state. Between two API calls, the model remembers nothing. The illusion of memory is just the conversation being replayed in the prompt every turn. When the context runs out, the earliest parts of the conversation fall off the edge of the world.

They cannot truly plan over long horizons. Anything that requires sustained, multi-step strategy where errors compound (booking a complex trip, executing a non-trivial project, debugging a system end-to-end) degrades quickly. "Agent" frameworks help but do not solve this.

They are not calibrated. Probability estimates that come out of the model are not real probabilities. "I'm 90% confident" means very little.

They cannot learn from your conversation. Whatever they got wrong today, they will get wrong tomorrow. Fine-tuning happens on a separate, expensive track.

The deception of fluency

The single most dangerous property of an LLM is that it is fluent. Fluent text feels authoritative. A wrong fact in clumsy English raises suspicion; the same wrong fact in elegant English doesn't. Your job as a user, and especially as an operator, is to remain suspicious *in proportion to the stakes*, regardless of how good the prose sounds.

Numbers that calibrate expectations

Context window: frontier models support 100k to 2M tokens in 2025. That's between a novel and a small library. The catch: performance degrades within the window; what's in the middle gets less attention than what's at the start or end ("lost in the middle").

Context windows have grown by orders of magnitude, but more tokens isn't always better. Quality of attention degrades long before the limit.

Cost: a single inference call ranges from $0.0001 to $0.10 depending on model and length. At application scale this matters; for one-off use it's negligible.

Latency: 0.5 to 10 seconds for a typical answer. Streaming hides this. Tool-using agents stack latency multiplicatively; a 10-step agent at 2s/step is 20 seconds.

Benchmarks: don't trust them. A model that scores 95% on a benchmark may fail your specific task. The gap between "benchmark performance" and "production performance" is the central engineering challenge.

Retrieval vs reasoning

A useful distinction. Retrieval is "what did the model see during training, and can it spit it back?" Reasoning is "can the model derive something it has never seen?" Models are very good at retrieval (sometimes scarily good) and uneven at reasoning.

Models are excellent at the bottom-left (easy retrieval) and worsen toward the top-right (hard reasoning). Match the task to the quadrant.

The trap: reasoning often *looks* like retrieval. A model that solves a logic puzzle may have solved that exact puzzle in training. The 2024 "reversal curse" paper showed that if a model has only seen "A is the father of B," it cannot reliably answer "who is B's father?" The information is there but the model can't manipulate it. Treat impressive reasoning demos with care.

In one line each

Strong at: drafting, transforming text, summarising, extracting structure, brainstorming.
Strong with scaffolding: coding (+ tests), math (+ tools), facts (+ retrieval), reasoning (+ steps).
Cannot do: know what they don't know, persist state, plan over long horizons, learn from conversation.
Fluency creates false trust. Benchmarks are misleading. Your real benchmark is the only one that matters.

Where to go next

Chapter 7: Using AI well