Self-Hosting an LLM for AI Agent Workflows: A Practical Guide

Running your own LLM for agent workflows unlocks privacy, cost control, and the ability to run 24/7 without rate limits. Here's what actually matters when setting this up.

Why Self-Host for Agents?

Agent workloads are different from chat:

High token volume — agents loop, retry, and call tools repeatedly. API costs add up fast.
Low latency matters — a 3-second per-call delay becomes a 30-second delay over 10 tool calls.
Context window control — you set the limits, not the provider.
Privacy — your prompts, memory, and tool outputs never leave your machine.

Hardware Starting Points

| VRAM | Model tier | Good for | |------|-----------|---------| | 16–24 GB | 7–14B dense | Light agent tasks, fast iteration | | 48 GB | 30–34B dense or small MoE | Most agent workflows | | 80 GB+ | 70B dense | Complex reasoning, long context | | 144 GB (e.g. 6x 4090) | 235B MoE | Full frontier-class agent capability |

Rule of thumb: For agent work, prefer a larger model at lower precision over a smaller model at full precision. Tool-calling accuracy scales more with parameter count than precision.

Model Recommendations (March 2026)

Best all-around: Qwen3-235B-A22B (MoE)

Excellent tool-calling and multi-step reasoning
Runs in ~90–100GB VRAM at FP8/Q4
Strong structured output support
Best choice if you have 144GB VRAM

Best for 48GB: Qwen3-30B dense

No MoE tensor-parallel headaches
Fast, reliable tool-use
Fits in a single A6000 or pair of 3090s

Best small model: Qwen3-8B or Mistral-7B-Instruct

For orchestration shells and routing agents
Not great at complex multi-tool chains alone

Serving Stack

vLLM is the standard choice. Key flags for agent workloads:

vllm serve <model-path> \
  --tensor-parallel-size <N>       # match your GPU count
  --enable-prefix-caching          # huge win for agents with shared system prompts
  --max-model-len 32768            # or whatever your context needs
  --gpu-memory-utilization 0.90

Prefix caching is the single biggest performance win for agent workloads. When your agent re-sends the same system prompt + memory block every call, vLLM caches the KV state and skips recomputation. Enable it.

Tensor Parallel Tips

If you hit output_size not divisible errors with MoE models:

Try TP=2 or TP=6 instead of TP=4 (MoE gate dimensions aren't always divisible by 4)
Test with --dtype bfloat16 first to rule out FP8 quantization issues
Pin to a known-stable vLLM version (0.15.x is well-tested for most models)

Agent Framework Integration

Most frameworks support OpenAI-compatible endpoints, so vLLM just works:

# Works with OpenAI SDK pointed at local vLLM
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)

Frameworks tested with self-hosted vLLM:

OpenAI Agents SDK — works out of box, set base_url
LangChain / LangGraph — use ChatOpenAI with custom base URL
Pydantic AI — OpenAI provider, custom base URL
AutoGen — configure model_client with local endpoint

Common Pitfalls

1. Tool-calling format mismatch Not all models use the same tool-call format. Qwen3 uses a XML format by default — make sure your framework parses it, or force JSON mode.

2. Context overflow mid-run Agents accumulate context fast. Set maxmodellen generously and add context trimming logic to your agent loop. Don't just crash when you hit the limit.

3. Temperature for agents Use temperature=0 or very low values for deterministic tool-calling. Higher temperatures = more hallucinated tool arguments.

4. Memory without persistence Self-hosted doesn't mean stateful. You still need external memory (a database, vector store, or simple files). The LLM itself is stateless between calls.

Quick-Start Checklist

[ ] Choose model based on VRAM
[ ] Install vLLM (match CUDA version to your drivers)
[ ] Start server with prefix caching enabled
[ ] Test tool-calling with a simple echo tool before wiring up real tools
[ ] Add context window monitoring to your agent loop
[ ] Set up a process manager (systemd/PM2) so the server restarts on crash

Want the full playbook?

Get copy-paste AI templates, prompt frameworks, and agent patterns — all in one place.

Join The Library — $9/mo

Cancel any time. Instant access.