One of the most common questions in the AI agent community is: should I use llama.cpp or vLLM to serve my local model? Both are excellent tools, but they're optimized for different situations. This guide cuts through the noise.
## The Short Answer

| Situation | Use This |
|---|---|
| CPU or mixed CPU+GPU | llama.cpp |
| Model fits fully in VRAM | vLLM |
| Single user, dev machine | llama.cpp |
| Multiple concurrent users | vLLM |
| Minimal Python dependencies | llama.cpp |
| Maximum throughput | vLLM |
| Quantized models (GGUF) | llama.cpp |
| Production API serving | vLLM |
## llama.cpp — The Versatile Swiss Army Knife
llama.cpp is a C++ inference engine that runs virtually anywhere. Its biggest strength is flexibility.
### Strengths
- Partial GPU offloading: Only have 8GB of VRAM but a 14GB model? llama.cpp can offload as many layers as fit to the GPU and run the rest on CPU. vLLM has no comparable layer-level offload.
- GGUF quantization: Run 70B-parameter models on consumer hardware with smart quantization (Q4_K_M, Q5_K_M, etc.)
- Low dependencies: Compile it, run it. No Python environment hell.
- Battery/resource friendly: Great for always-on personal assistants on laptops or Macs
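The offload decision is mostly arithmetic. The helper below is an illustrative sketch (the function name and all numbers are made up, and it assumes equally sized layers), estimating how many layers fit in a VRAM budget, which is roughly the value you'd then hand to llama.cpp's `-ngl`/`--n-gpu-layers` flag:

```python
def layers_that_fit(vram_gb: float, n_layers: int, model_gb: float,
                    overhead_gb: float = 1.5) -> int:
    """Rough estimate of how many of n_layers fit in vram_gb.

    Assumes all layers are equally sized and reserves overhead_gb
    for KV cache and scratch buffers. Purely illustrative arithmetic.
    """
    per_layer_gb = model_gb / n_layers
    usable = max(0.0, vram_gb - overhead_gb)
    return min(n_layers, int(usable // per_layer_gb))

# A ~14 GB, 40-layer model on an 8 GB card: offload what fits, CPU runs the rest.
print(layers_that_fit(vram_gb=8, n_layers=40, model_gb=14))  # → 18
```

Real per-layer sizes vary (embeddings and the output head are chunkier), so treat the result as a starting point and nudge `-ngl` up or down from there.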
### Weaknesses
- Single request at a time (by default): Not built for concurrent users
- Throughput ceiling: Even with batching options, vLLM wins on pure tokens/second at scale
- Build flags matter a lot: A misconfigured build (missing CUDA/Metal/AVX2) can make it feel unbearably slow
### Common llama.cpp gotcha

If llama.cpp feels slow, check your build. Run `./llama-cli --version` and look at the backend details printed at startup. If no GPU backend is listed, your CUDA or Metal support didn't compile in. Rebuild with:

```shell
cmake -B build -DGGML_CUDA=ON    # for NVIDIA
cmake -B build -DGGML_METAL=ON   # for Mac
```

(Older llama.cpp releases used `-DLLAMA_CUDA=ON` and `-DLLAMA_METAL=ON`; check the README of your checkout for the exact flag names.)
## vLLM — The Throughput Machine
vLLM is a Python-based serving framework built specifically for high-throughput inference. Its core innovation is continuous batching — instead of waiting for a full batch before processing, it continuously adds new requests to in-flight batches.
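The difference is easy to see in a toy discrete-time simulation (this is a deliberately simplified sketch, not vLLM's scheduler): a static batcher waits for the whole batch to drain before admitting waiting requests, while a continuous batcher backfills freed slots every decode step.

```python
from collections import deque

def simulate(arrivals, tokens_per_request, max_batch, continuous):
    """Toy decode loop. Each step, every in-flight request emits one token.
    arrivals[t] = number of requests arriving at step t.
    Returns the step at which all requests have finished."""
    waiting = deque()
    active = []  # remaining tokens per in-flight request
    total, done, t = sum(arrivals), 0, 0
    while done < total:
        if t < len(arrivals):
            waiting.extend([tokens_per_request] * arrivals[t])
        # Static batching only admits when the batch has fully drained;
        # continuous batching backfills freed slots every step.
        if continuous or not active:
            while waiting and len(active) < max_batch:
                active.append(waiting.popleft())
        active = [r - 1 for r in active]
        done += active.count(0)
        active = [r for r in active if r > 0]
        t += 1
    return t

print(simulate([2, 0, 2], tokens_per_request=4, max_batch=4, continuous=False))  # → 8
print(simulate([2, 0, 2], tokens_per_request=4, max_batch=4, continuous=True))   # → 6
```

Even in this tiny example the late-arriving requests finish two steps sooner; with real traffic and large batch sizes the gap is far bigger.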
### Strengths
- Continuous batching: Dramatically higher throughput for concurrent requests
- PagedAttention: Efficient KV cache management means more requests in flight
- OpenAI-compatible API: Drop-in replacement for OpenAI API calls in your agents
- Speculative decoding: Draft-model and multi-token-prediction speculation are supported for many models, often cutting latency without changing output quality
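PagedAttention's core idea fits in a few lines. The sketch below is a simplified illustration of a block-based KV cache allocator (the class and method names are made up, and this is not vLLM's actual implementation): instead of reserving one contiguous max-length buffer per request, the cache is carved into fixed-size blocks handed out on demand.

```python
class PagedKVCache:
    """Toy block-based KV cache allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.tables = {}   # request id -> list of block ids
        self.lengths = {}  # request id -> tokens stored

    def append(self, req_id: str) -> bool:
        """Reserve room for one more token; allocate a new block only when
        the current one is full. Returns False when the pool is exhausted
        (where a real scheduler would preempt or queue the request)."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # first token, or current block full
            if not self.free_blocks:
                return False
            self.tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1
        return True

    def release(self, req_id: str) -> None:
        """Request finished: its blocks go straight back to the shared pool."""
        self.free_blocks.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

With 16-token blocks, a request holding 17 tokens pins two blocks rather than a full max-context buffer, which is why so many more requests stay in flight.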
### Weaknesses
- Full VRAM required: vLLM needs the entire model in GPU memory. No partial offloading.
- Python overhead: More complex setup and dependencies vs llama.cpp
- Overkill for solo use: If you're the only one calling your model, the complexity isn't worth it
### vLLM minimum viable setup

```shell
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000
```

Then point your agents at `http://localhost:8000/v1` — it's OpenAI API compatible.
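Because the endpoint speaks the OpenAI chat-completions protocol, any OpenAI client library works; here's a dependency-free sketch using only the Python standard library (the helper name, model, and prompt are just placeholders):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request for a local vLLM server."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request(
    "http://localhost:8000/v1",
    "Qwen/Qwen2.5-7B-Instruct",
    [{"role": "user", "content": "Say hello."}],
)
# With the server running:
#   resp = urllib.request.urlopen(req)
#   print(json.load(resp)["choices"][0]["message"]["content"])
```

Or just drop the base URL into the official `openai` client's `base_url` parameter; no agent-side code changes needed.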
## For AI Agent Workflows Specifically
If you're running AI agents (not just chatbots), here's what matters:
llama.cpp is usually the right call for:
- A personal agent that runs 24/7 on your own machine
- Agentic loops with tool use (single-threaded is fine)
- Hardware where you can't fit the model fully in VRAM
- Running alongside other GPU-heavy tasks
vLLM is the right call for:
- A shared agent server multiple team members hit
- High-frequency tool-calling loops where latency adds up
- Running a "router" model that routes between specialized agents
- Production deployments serving end users
## The Build Quality Problem (llama.cpp)
Many people switch to vLLM after getting frustrated with llama.cpp — but the issue is often a bad build, not the tool itself.
LM Studio's CLI being faster than your hand-compiled llama.cpp? That's a red flag. LM Studio ships with optimized binaries. Your compile might be missing:
- GPU backend (CUDA/Metal/Vulkan)
- AVX2/AVX512 CPU optimizations
- Flash attention support
Fix your build before switching frameworks — it might be all you need.
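On Linux you can sanity-check the CPU side without rebuilding anything. This small helper (an illustrative sketch, Linux-only since it parses `/proc/cpuinfo`) reports whether your CPU even advertises the SIMD flags your build should be targeting:

```python
def cpu_has(flag: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the CPU advertises `flag` (e.g. 'avx2', 'avx512f').

    Linux-only: reads the 'flags' line of /proc/cpuinfo; returns False
    on other platforms or if the file can't be read.
    """
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return flag in line.split(":", 1)[1].split()
    except OSError:
        pass
    return False

for flag in ("avx2", "avx512f"):
    print(flag, cpu_has(flag))
```

If the CPU reports `avx2` but your llama.cpp startup banner doesn't, the build is leaving performance on the table.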
## Quick Decision Tree

```
Do you have enough VRAM to fit the full model?
├── YES → Are you serving multiple concurrent users?
│   ├── YES → vLLM
│   └── NO  → Either works; llama.cpp is simpler
└── NO → llama.cpp (only option with partial offload)
```
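The same tree as a trivial function, if you want it in a setup script (the function name and return strings are of course made up for this sketch):

```python
def pick_engine(fits_in_vram: bool, concurrent_users: bool) -> str:
    """Encode the decision tree above as a recommendation string."""
    if not fits_in_vram:
        return "llama.cpp"  # only option with partial offload
    if concurrent_users:
        return "vLLM"       # continuous batching pays off
    return "either (llama.cpp is simpler)"

print(pick_engine(fits_in_vram=False, concurrent_users=True))  # → llama.cpp
```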
## Resources
- llama.cpp GitHub
- vLLM docs
- Ask Patrick Library — agent configs and templates for both setups