What is Llama?
Llama is Meta's family of open-weight large language models, among the most widely downloaded foundations for building and self-hosting AI. You download the weights and run them on your own hardware or cloud, fine-tune them on your data, or call them through one of many hosting providers.
The family spans tiny models that run on a laptop or phone up to large instruction-tuned and multimodal variants, released under the Llama Community License (broadly permissive, with conditions only at very large scale). A deep open ecosystem (llama.cpp, Ollama, vLLM, and Hugging Face) grew up around it, and Meta AI is the consumer assistant built on top.
If you want to own the model that runs your product (for cost, latency, privacy, or fine-tuning), Llama is the default open-weight starting point.
What it's best for
- Self-hosting: run the model entirely on your own infrastructure so nothing leaves your network.
- On-device and edge: small Llama models run locally on laptops and phones via llama.cpp or Ollama.
- Fine-tuning: adapt the open weights to your domain, data, and tone with techniques like LoRA.
- Cost control at scale: pay only for your own compute instead of a per-token vendor bill.
- The widest open ecosystem: tooling, quantizations, guides, and hosting providers outnumber any other open family.
- Privacy- and residency-sensitive deployments where data simply cannot leave your environment.
Where it falls short
- Absolute top-of-leaderboard reasoning. The largest closed models still tend to lead the very hardest benchmarks.
- Teams with no appetite to run infrastructure, unless they call Llama through a managed hosting provider.
- A polished, batteries-included consumer assistant. Meta AI is consumer-facing but narrower than ChatGPT or Gemini, and varies by region.
- Workloads that need a vendor SLA out of the box. Self-hosting shifts uptime and support onto your team.
Getting the weights
Download Llama from Hugging Face or llama.com after accepting the license. Pick a size that fits your hardware and an instruction-tuned ('Instruct') variant for chat-style use rather than the raw base model.
Quantized builds (smaller, lower-precision copies) let larger models fit on modest GPUs or even CPUs, trading a little quality for a lot of reach.
Running it: local or production
For local and on-device work, llama.cpp and Ollama get a quantized model running in minutes. For production serving, vLLM or TGI provide batching and an OpenAI-compatible endpoint your existing code can talk to.
If you'd rather not manage GPUs, providers such as Together, Groq, Fireworks, and the major clouds serve Llama by API, so you get open weights with someone else handling the infrastructure.
Fine-tuning and retrieval
LoRA and QLoRA make domain fine-tuning cheap: you train a small adapter rather than the whole model, to teach Llama your tone, formats, or jargon.
For knowledge that changes, keep the base model and add retrieval (RAG) instead of fine-tuning facts in; you update an index rather than retraining.
Getting better answers
Use the Instruct variants with a clear system prompt, and pick the smallest size that passes your evals. Over-provisioning a large model wastes money and latency.
Match quantization to the job: aggressive quantization is fine for classification or extraction, less so for hard reasoning. Test a couple of configurations before committing.
What Llama costs
Approximate, in USD, as of January 2026. Prices change often. Confirm on the official site before you rely on them.
Open weights
$0 (self-host)
Free to download and run; you pay only for your own compute. License adds conditions at very large scale.
Hosted API (third-party)
Usage-based
Many providers serve Llama per token, often at low cost, with no GPUs to manage.
Meta AI
$0
The consumer assistant built on Llama, free where it's available.
Example prompts
Copy these into Llama as starting points, then adapt them to your task.
Right-size the model
I want to run a chatbot on a single 24GB GPU. Which Llama model and quantization should I use, what context length is realistic, and what throughput should I expect?
Plan a fine-tune
Outline a LoRA fine-tuning plan to adapt an Instruct Llama model to our support tone. Cover dataset size, how to build the eval set, and the common pitfalls to avoid.
Design a self-host stack
Recommend a production serving stack for Llama on our own Kubernetes cluster: serving engine, batching, an OpenAI-compatible endpoint, and how to size the GPU pool.
Llama
common questions.
Direct answers to the questions we get asked the most. If yours isn't covered, write to the team.