AI Agents

Self-Hosting an LLM for AI Agent Workflows: A Practical Guide


Running your own LLM for agent workflows unlocks privacy, cost control, and the ability to run 24/7 without rate limits. Here's what actually matters when setting this up.


Why Self-Host for Agents?

Agent workloads are different from chat:

- Volume: an agent loop can fire hundreds of calls per task, so per-token pricing and rate limits bite fast.
- Repetition: the same system prompt and memory block get re-sent on every call, which prefix caching turns into nearly free tokens.
- Determinism: reliable tool-calling wants low temperature and a pinned model version, both of which you control locally.
- Privacy: tool outputs often include code, files, and credentials you'd rather keep on your own hardware.


Hardware Starting Points

| VRAM | Model tier | Good for |
|------|-----------|----------|
| 16–24 GB | 7–14B dense | Light agent tasks, fast iteration |
| 48 GB | 30–34B dense or small MoE | Most agent workflows |
| 80 GB+ | 70B dense | Complex reasoning, long context |
| 144 GB (e.g. 6x 4090) | 235B MoE | Full frontier-class agent capability |

Rule of thumb: For agent work, prefer a larger model at lower precision over a smaller model at full precision. Tool-calling accuracy scales more with parameter count than precision.
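The tradeoff is easy to sanity-check: weight memory scales linearly with both parameter count and bits per weight. A back-of-envelope sketch (the 1.2 overhead factor is an assumption, and KV cache is excluded):

```python
def estimate_vram_gb(params_b: float, bits_per_param: int, overhead: float = 1.2) -> float:
    """Rough VRAM needed for model weights alone (KV cache excluded).

    overhead=1.2 is an assumed fudge factor for buffers and
    fragmentation; real usage varies by engine and context length.
    """
    weight_gb = params_b * bits_per_param / 8  # 1B params at 8 bits ~= 1 GB
    return weight_gb * overhead

# A 70B model at 4-bit needs ~42 GB for weights, while a 14B model at
# full bf16 (16-bit) needs ~34 GB: the bigger quantized model can be
# cheaper to host than the smaller full-precision one.
print(round(estimate_vram_gb(70, 4), 1))
print(round(estimate_vram_gb(14, 16), 1))
```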


Model Recommendations (March 2026)

Best all-around: Qwen3-235B-A22B (MoE)

Best for 48 GB: Qwen3-32B dense

Best small model: Qwen3-8B or Mistral-7B-Instruct


Serving Stack

vLLM is the standard choice. Key flags for agent workloads:

vllm serve <model-path> \
  --tensor-parallel-size <N> \
  --enable-prefix-caching \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

- --tensor-parallel-size: match your GPU count
- --enable-prefix-caching: the big win for agents with shared system prompts
- --max-model-len: 32768 here; set it to whatever your context actually needs

Prefix caching is the single biggest performance win for agent workloads. When your agent re-sends the same system prompt + memory block every call, vLLM caches the KV state and skips recomputation. Enable it.
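The effect is easy to quantify: between two consecutive agent calls, everything up to the newest turn is an identical prefix. A toy measurement (the prompt strings are stand-ins for a real system prompt and memory block):

```python
def common_prefix_len(a: str, b: str) -> int:
    """Count how many leading characters two prompts share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

system = "You are a helpful agent." * 40   # stand-in for a long system prompt
memory = "Known facts: ..."                # stand-in for a memory block
call_1 = system + memory + "\nUser: list files"
call_2 = system + memory + "\nUser: read config.yaml"

shared = common_prefix_len(call_1, call_2)
# Nearly the entire second request is a cache hit; only the new turn
# needs fresh prefill compute.
print(f"{shared / len(call_2):.0%} of call 2 is a shared prefix")
```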

Tensor Parallel Tips

If you hit output_size not divisible errors with MoE models:

  1. Try TP=2 or TP=6 instead of TP=4 (MoE gate dimensions aren't always divisible by 4)
  2. Test with --dtype bfloat16 first to rule out FP8 quantization issues
  3. Pin to a known-stable vLLM version (0.15.x is well-tested for most models)

Agent Framework Integration

Most frameworks support OpenAI-compatible endpoints, so vLLM just works:

# Works with OpenAI SDK pointed at local vLLM
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed"
)
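Tool definitions then ride along in the standard OpenAI tools schema. A sketch (read_file is a made-up tool; serving tool calls usually also needs vLLM launched with --enable-auto-tool-choice and a --tool-call-parser matching your model, so check your version's docs):

```python
# Minimal tool definition in the OpenAI function-calling schema.
# read_file is hypothetical; vLLM forwards this list to the model's
# chat template.
tools = [{
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a text file from the agent workspace",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}]

# With the client above, a tool-enabled call looks like:
# resp = client.chat.completions.create(
#     model="<model-path>",      # vLLM serves under the model path/name
#     messages=[{"role": "user", "content": "Open config.yaml"}],
#     tools=tools,
#     temperature=0,             # deterministic tool arguments
# )
# resp.choices[0].message.tool_calls  # parsed calls, if the model emitted any
```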

Most agent frameworks (LangChain, LlamaIndex, CrewAI, AutoGen) accept a custom base_url, so they run against self-hosted vLLM without code changes.


Common Pitfalls

1. Tool-calling format mismatch Not all models use the same tool-call format. Qwen3 uses an XML-style tag format by default — make sure your framework (or vLLM's tool-call parser) handles it, or force JSON mode.

2. Context overflow mid-run Agents accumulate context fast. Set --max-model-len generously and add context-trimming logic to your agent loop. Don't just crash when you hit the limit.

3. Temperature for agents Use temperature=0 or very low values for deterministic tool-calling. Higher temperatures = more hallucinated tool arguments.

4. Memory without persistence Self-hosted doesn't mean stateful. You still need external memory (a database, vector store, or simple files). The LLM itself is stateless between calls.
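Pitfalls 2 and 4 meet in the agent loop: trim the transcript, and persist anything worth keeping to external storage. A minimal trimming sketch (the character budget and message shape are illustrative stand-ins for real token counting):

```python
def trim_messages(messages, max_chars=60_000):
    """Drop the oldest non-system messages until the transcript fits.

    A crude character-budget stand-in for token counting; swap in a
    real tokenizer for production use.
    """
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]

    def size(msgs):
        return sum(len(m["content"]) for m in msgs)

    while rest and size(system + rest) > max_chars:
        rest.pop(0)  # drop the oldest turn first; system prompt survives
    return system + rest

msgs = [{"role": "system", "content": "agent rules"}] + [
    {"role": "user", "content": "x" * 10_000} for _ in range(10)
]
trimmed = trim_messages(msgs, max_chars=35_000)
print(len(trimmed))  # system prompt plus the most recent turns
```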


Quick-Start Checklist

  1. Pick a model tier that fits your VRAM (see the table above).
  2. Install vLLM and pin a known-stable version.
  3. Launch with --enable-prefix-caching and a generous --max-model-len.
  4. Point your framework's OpenAI-compatible client at http://localhost:8000/v1 and use temperature=0 for tool calls.
  5. Add context trimming and external memory to your agent loop.

Want the full playbook?

Get copy-paste AI templates, prompt frameworks, and agent patterns — all in one place.

Get Access — It’s Free

No credit card. No fluff. Just the good stuff.