One of the most common questions in the AI agent community is: should I use llama.cpp or vLLM to serve my local model? Both are excellent tools, but they're optimized for different situations. This guide cuts through the noise.
## The Short Answer

| Situation | Use This |
|---|---|
| CPU or mixed CPU+GPU | llama.cpp |
| Model fits fully in VRAM | vLLM |
| Single user, dev machine | llama.cpp |
| Multiple concurrent users | vLLM |
| Minimal Python dependencies | llama.cpp |
| Maximum throughput | vLLM |
| Quantized models (GGUF) | llama.cpp |
| Production API serving | vLLM |
## llama.cpp — The Versatile Swiss Army Knife
llama.cpp is a C++ inference engine that runs virtually anywhere. Its biggest strength is flexibility.
### Strengths
- Partial GPU offloading: Only have 8GB of VRAM but a 14GB model? llama.cpp can offload as many layers as fit to the GPU and run the rest on CPU. vLLM has no comparable layer-level offload.
- GGUF quantization: Run 70B-parameter models on consumer hardware with smart quantization (Q4_K_M, Q5_K_M, etc.)
- Low dependencies: Compile it, run it. No Python environment hell.
- Battery/resource friendly: Great for always-on personal assistants on laptops or Macs
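The offload decision is mostly arithmetic. The helper below is an illustrative sketch (the function name and all numbers are made up, and it assumes equally sized layers), estimating how many layers fit in a VRAM budget, which is roughly the value you'd then hand to llama.cpp's `-ngl`/`--n-gpu-layers` flag:

```python
def layers_that_fit(vram_gb: float, n_layers: int, model_gb: float,
                    overhead_gb: float = 1.5) -> int:
    """Rough estimate of how many of n_layers fit in vram_gb.

    Assumes all layers are equally sized and reserves overhead_gb
    for KV cache and scratch buffers. Purely illustrative arithmetic.
    """
    per_layer_gb = model_gb / n_layers
    usable = max(0.0, vram_gb - overhead_gb)
    return min(n_layers, int(usable // per_layer_gb))

# A ~14 GB, 40-layer model on an 8 GB card: offload what fits, CPU runs the rest.
print(layers_that_fit(vram_gb=8, n_layers=40, model_gb=14))  # → 18
```

Real per-layer sizes vary (embeddings and the output head are chunkier), so treat the result as a starting point and nudge `-ngl` up or down from there.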
### Weaknesses
- Single request at a time (by default): Not built for concurrent users
- Throughput ceiling: Even with batching options, vLLM wins on pure tokens/second at scale
- Build flags matter a lot: A misconfigured build (missing CUDA/Metal/AVX2) can make it feel unbearably slow
### Common llama.cpp gotcha

If llama.cpp feels slow, check your build. Run `./llama-cli --version` and look at the backend details printed at startup. If no GPU backend is listed, your CUDA or Metal support didn't compile in. Rebuild with:

```shell
cmake -B build -DGGML_CUDA=ON    # for NVIDIA
cmake -B build -DGGML_METAL=ON   # for Mac
```

(Older llama.cpp releases used `-DLLAMA_CUDA=ON` and `-DLLAMA_METAL=ON`; check the README of your checkout for the exact flag names.)
## vLLM — The Throughput Machine
vLLM is a Python-based serving framework built specifically for high-throughput inference. Its core innovation is continuous batching — instead of waiting for a full batch before processing, it continuously adds new requests to in-flight batches.
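The difference is easy to see in a toy discrete-time simulation (this is a deliberately simplified sketch, not vLLM's scheduler): a static batcher waits for the whole batch to drain before admitting waiting requests, while a continuous batcher backfills freed slots every decode step.

```python
from collections import deque

def simulate(arrivals, tokens_per_request, max_batch, continuous):
    """Toy decode loop. Each step, every in-flight request emits one token.
    arrivals[t] = number of requests arriving at step t.
    Returns the step at which all requests have finished."""
    waiting = deque()
    active = []  # remaining tokens per in-flight request
    total, done, t = sum(arrivals), 0, 0
    while done < total:
        if t < len(arrivals):
            waiting.extend([tokens_per_request] * arrivals[t])
        # Static batching only admits when the batch has fully drained;
        # continuous batching backfills freed slots every step.
        if continuous or not active:
            while waiting and len(active) < max_batch:
                active.append(waiting.popleft())
        active = [r - 1 for r in active]
        done += active.count(0)
        active = [r for r in active if r > 0]
        t += 1
    return t

print(simulate([2, 0, 2], tokens_per_request=4, max_batch=4, continuous=False))  # → 8
print(simulate([2, 0, 2], tokens_per_request=4, max_batch=4, continuous=True))   # → 6
```

Even in this tiny example the late-arriving requests finish two steps sooner; with real traffic and large batch sizes the gap is far bigger.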
### Strengths
- Continuous batching: Dramatically higher throughput for concurrent requests
- PagedAttention: Efficient KV cache management means more requests in flight
- OpenAI-compatible API: Drop-in replacement for OpenAI API calls in your agents
- Speculative decoding: Draft-model and multi-token-prediction speculation are supported for many models, often cutting latency without changing output quality
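PagedAttention's core idea fits in a few lines. The sketch below is a simplified illustration of a block-based KV cache allocator (the class and method names are made up, and this is not vLLM's actual implementation): instead of reserving one contiguous max-length buffer per request, the cache is carved into fixed-size blocks handed out on demand.

```python
class PagedKVCache:
    """Toy block-based KV cache allocator in the spirit of PagedAttention."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.tables = {}   # request id -> list of block ids
        self.lengths = {}  # request id -> tokens stored

    def append(self, req_id: str) -> bool:
        """Reserve room for one more token; allocate a new block only when
        the current one is full. Returns False when the pool is exhausted
        (where a real scheduler would preempt or queue the request)."""
        n = self.lengths.get(req_id, 0)
        if n % self.block_size == 0:  # first token, or current block full
            if not self.free_blocks:
                return False
            self.tables.setdefault(req_id, []).append(self.free_blocks.pop())
        self.lengths[req_id] = n + 1
        return True

    def release(self, req_id: str) -> None:
        """Request finished: its blocks go straight back to the shared pool."""
        self.free_blocks.extend(self.tables.pop(req_id, []))
        self.lengths.pop(req_id, None)
```

With 16-token blocks, a request holding 17 tokens pins two blocks rather than a full max-context buffer, which is why so many more requests stay in flight.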
### Weaknesses
- Full VRAM required: vLLM needs the entire model in GPU memory. No partial offloading.
- Python overhead: More complex setup and dependencies vs llama.cpp
- Overkill for solo use: If you're the only one calling your model, the complexity isn't worth it
### vLLM minimum viable setup

```shell
pip install vllm
python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen2.5-7B-Instruct \
  --port 8000
```

Then point your agents at `http://localhost:8000/v1` — it's OpenAI API compatible.
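Because the endpoint speaks the OpenAI chat-completions protocol, any OpenAI client library works; here's a dependency-free sketch using only the Python standard library (the helper name, model, and prompt are just placeholders):

```python
import json
import urllib.request

def chat_request(base_url: str, model: str, messages: list) -> urllib.request.Request:
    """Build an OpenAI-style /chat/completions request for a local vLLM server."""
    body = json.dumps({"model": model, "messages": messages}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

req = chat_request(
    "http://localhost:8000/v1",
    "Qwen/Qwen2.5-7B-Instruct",
    [{"role": "user", "content": "Say hello."}],
)
# With the server running:
#   resp = urllib.request.urlopen(req)
#   print(json.load(resp)["choices"][0]["message"]["content"])
```

Or just drop the base URL into the official `openai` client's `base_url` parameter; no agent-side code changes needed.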
## For AI Agent Workflows Specifically
If you're running AI agents (not just chatbots), here's what matters:
llama.cpp is usually the right call for:
- A personal agent that runs 24/7 on your own machine
- Agentic loops with tool use (single-threaded is fine)
- Hardware where you can't fit the model fully in VRAM
- Running alongside other GPU-heavy tasks
vLLM is the right call for:
- A shared agent server multiple team members hit
- High-frequency tool-calling loops where latency adds up
- Running a "router" model that routes between specialized agents
- Production deployments serving end users
## The Build Quality Problem (llama.cpp)
Many people switch to vLLM after getting frustrated with llama.cpp — but the issue is often a bad build, not the tool itself.
LM Studio's CLI being faster than your hand-compiled llama.cpp? That's a red flag. LM Studio ships with optimized binaries. Your compile might be missing:
- GPU backend (CUDA/Metal/Vulkan)
- AVX2/AVX512 CPU optimizations
- Flash attention support
Fix your build before switching frameworks — it might be all you need.
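On Linux you can sanity-check the CPU side without rebuilding anything. This small helper (an illustrative sketch, Linux-only since it parses `/proc/cpuinfo`) reports whether your CPU even advertises the SIMD flags your build should be targeting:

```python
def cpu_has(flag: str, cpuinfo_path: str = "/proc/cpuinfo") -> bool:
    """Return True if the CPU advertises `flag` (e.g. 'avx2', 'avx512f').

    Linux-only: reads the 'flags' line of /proc/cpuinfo; returns False
    on other platforms or if the file can't be read.
    """
    try:
        with open(cpuinfo_path) as f:
            for line in f:
                if line.startswith("flags"):
                    return flag in line.split(":", 1)[1].split()
    except OSError:
        pass
    return False

for flag in ("avx2", "avx512f"):
    print(flag, cpu_has(flag))
```

If the CPU reports `avx2` but your llama.cpp startup banner doesn't, the build is leaving performance on the table.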
## Quick Decision Tree

```
Do you have enough VRAM to fit the full model?
├── YES → Are you serving multiple concurrent users?
│   ├── YES → vLLM
│   └── NO  → Either works; llama.cpp is simpler
└── NO → llama.cpp (only option with partial offload)
```
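The same tree as a trivial function, if you want it in a setup script (the function name and return strings are of course made up for this sketch):

```python
def pick_engine(fits_in_vram: bool, concurrent_users: bool) -> str:
    """Encode the decision tree above as a recommendation string."""
    if not fits_in_vram:
        return "llama.cpp"  # only option with partial offload
    if concurrent_users:
        return "vLLM"       # continuous batching pays off
    return "either (llama.cpp is simpler)"

print(pick_engine(fits_in_vram=False, concurrent_users=True))  # → llama.cpp
```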
## Resources
- llama.cpp GitHub
- vLLM docs
- Ask Patrick Library — agent configs and templates for both setups