The workshop metaphor

“Stop asking the genie. Build the workshop.”

The four levers of a working AI system

Almost every real AI application that works pulls some combination of four levers. Most failing AI products fail because they only pulled one.

Prompting: telling the model precisely what you want, with constraints and examples.
Retrieval: feeding the model the right context so it doesn't have to guess.
Tools: letting the model call deterministic systems (calculators, databases, type checkers) instead of pretending.
Evaluation: measuring whether the system actually works, before and after every change.

Pull at least three of the four. The lever most often pulled alone is prompting.

Fine-tuning is a fifth lever but a much more expensive one. Most teams who think they need it actually need better retrieval or better evals. Reach for it last.

Prompting: just the load-bearing parts

Most "prompt engineering" content describes local optima: tricks tuned to one model or one task. Four principles that actually generalise:

Constraints: tell the model what format, what length, what style, what to exclude. The clearer the box, the better the output fills it.
Examples (few-shot): show two or three sample inputs paired with the kind of output you want. The model is much better at imitating than at obeying.
Decomposition: if the task has multiple steps, run them as separate prompts (or ask for explicit chain-of-thought reasoning) instead of asking for the whole answer at once.
Verification: ask the model to check its own output, or pass it to a second model with a different prompt. Cheap, often catches the silly errors.
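
As a rough illustration of the first two principles, here is a minimal Python sketch that assembles a prompt from explicit constraints and few-shot examples. The names `Example` and `build_prompt`, the sentiment task, and the "Input:/Output:" layout are assumptions made for this example, not a required format.

```python
from dataclasses import dataclass


@dataclass
class Example:
    """One few-shot pair: a sample input and the output we want for it."""
    input_text: str
    output_text: str


def build_prompt(task: str, constraints: list[str],
                 examples: list[Example], user_input: str) -> str:
    """Assemble a prompt from a task, explicit constraints, and few-shot examples."""
    lines = [task, "", "Constraints:"]
    lines += [f"- {c}" for c in constraints]
    for ex in examples:
        lines += ["", f"Input: {ex.input_text}", f"Output: {ex.output_text}"]
    # End with the real input and an open "Output:" slot for the model to fill.
    lines += ["", f"Input: {user_input}", "Output:"]
    return "\n".join(lines)


# Hypothetical task and examples, chosen only to show the shape of the prompt.
prompt = build_prompt(
    task="Classify the sentiment of each product review.",
    constraints=["Answer with exactly one word: positive, negative, or neutral.",
                 "Do not explain your answer."],
    examples=[Example("Arrived broken and late.", "negative"),
              Example("Does what it says.", "neutral"),
              Example("Best purchase this year.", "positive")],
    user_input="The battery died after two days.",
)
print(prompt)
```

Keeping the constraints and examples as data rather than one long string makes it easier to change one of them and re-run an eval.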

Retrieval (RAG): when the model needs facts

Retrieval-augmented generation pairs a model with a search system. When the user asks a question, you first search a knowledge base (vector store, database, web), retrieve the most relevant chunks, and feed those into the prompt as context. The model answers from the context rather than from its training data.
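
The search-then-prompt flow described above can be sketched in a few lines of Python. This is a toy under stated assumptions: word-count cosine similarity stands in for a real embedding model and vector store, and `retrieve`, `build_rag_prompt`, and the three-sentence knowledge base are hypothetical names and data invented for the example.

```python
import re
from collections import Counter
from math import sqrt


def tokens(text: str) -> Counter:
    """Lowercased word counts; a toy stand-in for a real embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the question."""
    q = tokens(question)
    return sorted(chunks, key=lambda c: cosine(q, tokens(c)), reverse=True)[:k]


def build_rag_prompt(question: str, context: list[str]) -> str:
    """Put the retrieved chunks in front of the question."""
    joined = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(context))
    return ("Answer using only the context below. "
            "If the answer is not there, say so.\n\n"
            f"Context:\n{joined}\n\nQuestion: {question}")


# Hypothetical knowledge base, already split into chunks.
knowledge_base = [
    "The refund window is 30 days from purchase.",
    "Support hours: Monday to Friday, 9am to 5pm.",
    "Shipping to Europe takes 5 to 7 business days.",
]
question = "What is the refund window?"
top_chunks = retrieve(question, knowledge_base, k=1)
print(build_rag_prompt(question, top_chunks))
```

The resulting string is what would be sent to the model; the model call itself is left out because it depends on the provider.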

RAG is the right answer to most "chat with our docs" or "customer support bot" problems. It separates what the system knows (the index) from what the system says (the model). You can update the index hourly; you cannot update the model hourly.

Where RAG goes wrong: bad chunking (the right answer is split across two chunks), bad retrieval (the relevant doc isn't even in the top 10), wrong embedding model (your domain isn't represented), or the model ignores the retrieved context. Each is fixable; none are obvious until you measure.
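
One way to measure the "bad retrieval" failure is recall at k: for a set of questions with a known correct document, how often does that document appear in the top k results? The sketch below assumes a hand-labelled mapping from question to document id; the ids and rankings are made-up illustration data, not results from any real system.

```python
def recall_at_k(results: dict[str, list[str]],
                relevant: dict[str, str], k: int) -> float:
    """Fraction of questions whose relevant doc id appears in the top k results."""
    hits = sum(1 for q, doc_id in relevant.items() if doc_id in results[q][:k])
    return hits / len(relevant)


# Hypothetical retriever output: question -> ranked doc ids.
results = {
    "refund window": ["doc_refunds", "doc_shipping", "doc_hours"],
    "support hours": ["doc_shipping", "doc_hours", "doc_refunds"],
    "eu delivery time": ["doc_hours", "doc_refunds", "doc_shipping"],
}
# Hand-labelled ground truth: question -> the doc that answers it.
relevant = {
    "refund window": "doc_refunds",
    "support hours": "doc_hours",
    "eu delivery time": "doc_shipping",
}

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(results, relevant, k):.2f}")
```

If recall is low at the k you actually pass to the model, the problem is in chunking, retrieval, or the embedding model, not in the prompt.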

The model answers from the retrieved chunks, not from its training data. Update the index, not the model.

Tools and agents

A model that can call tools is dramatically more capable than one that can't. Give the model a calculator and it stops faking math. Give it a database query tool and it stops hallucinating SQL. Give it a code interpreter and it can verify its own outputs.
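
To make the calculator case concrete, here is a minimal sketch of a deterministic tool plus a dispatcher. The call shape `{"tool": ..., "input": ...}` and the names `TOOLS` and `run_tool_call` are assumptions for illustration; real providers each define their own tool-calling format.

```python
import ast
import operator

# Arithmetic operators the calculator tool is allowed to apply.
_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def calculator(expression: str) -> float:
    """Evaluate a plain arithmetic expression without using eval()."""
    def walk(node: ast.AST) -> float:
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expression, mode="eval").body)


# Registry the application consults when the model asks for a tool by name.
TOOLS = {"calculator": calculator}


def run_tool_call(call: dict) -> str:
    """Execute a tool call of the assumed shape {"tool": ..., "input": ...}."""
    tool = TOOLS.get(call["tool"])
    if tool is None:
        return f"error: unknown tool {call['tool']!r}"
    return str(tool(call["input"]))


print(run_tool_call({"tool": "calculator", "input": "1234 * 5678"}))
```

The model only has to produce the expression; the arithmetic itself is done by ordinary code.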

An "agent" is just a model in a loop: on each turn it can call a tool, see the result, and decide what to do next. The loop typically has a step limit and some kind of stopping condition. Most production agents are loops of 3 to 20 steps; agents that go beyond that without strong constraints rarely work.
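
The loop described above might look roughly like this in Python. Everything here is a sketch under assumptions: the action format (`{"tool": ..., "input": ...}` or `{"final": ...}`), the `run_agent` function, and the `scripted_model` stand-in are hypothetical, and a real system would put an LLM call where the scripted function is.

```python
from typing import Callable

# Assumed message shape: the model returns either
# {"tool": name, "input": text} or {"final": answer}.
Action = dict[str, str]
Model = Callable[[list[str]], Action]


def run_agent(model: Model, tools: dict[str, Callable[[str], str]],
              task: str, max_steps: int = 10) -> str:
    """Loop: ask the model, run the tool it picks, feed back the result."""
    transcript = [f"task: {task}"]
    for _ in range(max_steps):
        action = model(transcript)
        if "final" in action:              # stopping condition
            return action["final"]
        tool = tools.get(action["tool"])
        if tool is None:
            result = f"error: unknown tool {action['tool']}"
        else:
            result = tool(action["input"])
        transcript.append(f"{action['tool']}({action['input']}) -> {result}")
    return "stopped: step limit reached"   # step limit


def scripted_model(transcript: list[str]) -> Action:
    """Stand-in for an LLM: looks up the order once, then answers."""
    if len(transcript) == 1:
        return {"tool": "order_status", "input": "A-1001"}
    return {"final": f"Based on {transcript[-1]}"}


# Hypothetical tool that always reports the same status.
tools = {"order_status": lambda order_id: "shipped"}
print(run_agent(scripted_model, tools, "Where is order A-1001?"))
```

Swapping the scripted function for a real model changes nothing else in the loop, which is what makes the step limit and stopping condition easy to test on their own.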

The honest state of agents in 2025: they're useful for well-scoped tasks ("answer this support ticket using these tools") and unreliable for open-ended ones ("plan a launch and execute it"). Errors compound. One step at 95% accuracy is fine; ten steps in a row at 95% each all succeed just under 60% of the time (0.95^10 ≈ 0.599).
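
The compounding arithmetic is easy to check. The sketch below assumes every step succeeds independently with the same probability, which is a simplification; `chain_success` is a name invented for this example.

```python
def chain_success(per_step_accuracy: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_accuracy ** steps


for steps in (1, 5, 10, 20):
    print(f"{steps:>2} steps at 95%: {chain_success(0.95, steps):.1%}")
```

Under that assumption, ten steps come out just under 60%, and twenty steps fall to roughly a third.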

The simplest agent: model calls a tool, reads the result, decides what to do next, stops when it has a final answer.

Evals: the only honest signal

An eval is a set of inputs paired with expected outputs (or a way to grade outputs), plus a script that runs them through your system and reports how it did. Without evals, you do not know whether a change improved or broke your system. Without evals, you are doing AI engineering on vibes.

Start small: 20 to 50 examples drawn from real usage, each one a problem your system should handle. Add the failures you discover. Re-run the eval on every prompt change, every model upgrade, every retrieval tweak. If you only do one thing from this chapter, do this.
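
A first eval script can be very small. The sketch below assumes exact-match grading after lowercasing and trimming whitespace, which only suits tasks with short, fixed answers; `run_eval`, `toy_system`, and the three cases are hypothetical placeholders for your own pipeline and data.

```python
from typing import Callable


def run_eval(system: Callable[[str], str],
             cases: list[tuple[str, str]]) -> tuple[float, list[str]]:
    """Run every case through the system; return pass rate and failing inputs."""
    failures = [inp for inp, expected in cases
                if system(inp).strip().lower() != expected.strip().lower()]
    return 1 - len(failures) / len(cases), failures


def toy_system(text: str) -> str:
    """Stand-in for the real pipeline (prompt + retrieval + model)."""
    return "negative" if "broken" in text.lower() else "positive"


cases = [
    ("Arrived broken.", "negative"),
    ("Works great.", "positive"),
    ("Stopped charging after a week.", "negative"),  # a failure found in real usage
]
pass_rate, failures = run_eval(toy_system, cases)
print(f"pass rate: {pass_rate:.0%}")
for inp in failures:
    print("FAILED:", inp)
```

Running the same script before and after a change gives two numbers to compare, plus the list of inputs that regressed.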

When not to use AI

Some tasks should not be solved with an LLM. A regex is faster, cheaper, and more reliable than a model for "extract this email address." A database query is more honest than a model for "count rows where status='active'." A type checker is better than a model for "is this code valid Rust?"
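
For the email case, the deterministic version is a few lines of Python. The pattern below is deliberately simple and is an assumption of this example: it handles typical addresses but is not a full RFC 5322 parser.

```python
import re

# Deliberately simple pattern: good enough for typical addresses,
# not a full RFC 5322 parser.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text: str) -> list[str]:
    """Return every email-like substring, in order of appearance."""
    return EMAIL_RE.findall(text)


print(extract_emails("Contact ana@example.com or ops-team@mail.example.org today."))
```

It gives the same answer on every run and needs no network call.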

Rule of thumb: if the task has a deterministic correct answer and there's a deterministic tool for it, use the tool. Reach for AI when the task is fuzzy, the inputs are messy, or the cost of being approximately right is acceptable. Don't use a language model to add two numbers.

When in doubt, walk down this tree before adding an LLM to your stack.

What this course doesn't cover

This course does not cover computer vision, multimodal models, robotics, RL beyond RLHF, mechanistic interpretability, alignment research, or the specifics of any one model provider. Each is a course of its own. What you have now is enough to follow the field, and enough to spot when someone is selling you something.

Where to go from here

If you want depth on prompting: read "Prompt Engineering: First Principles" in our guides. If you want to test whether your team is ready to put AI in production: take the AI Readiness Self-Audit. If you want to build something and need a partner: that's literally what SDEN does.

In one line each

Four levers: prompting, retrieval, tools, evaluation. Pull at least three for any serious system.
RAG separates what the system knows from what it says. Update the index, not the model.
Agents work when scoped narrowly with idempotent tools, dry-runs, and human gates on high-stakes actions.
Evals are the only honest signal. Without them you ship vibes; with them you ship something you can reason about.
