
AI Agent Memory Architecture: Vector Stores & Retrieval Patterns

The agents that feel intelligent aren't necessarily smarter; they just remember better. This guide covers the complete memory stack for solopreneurs: when to use files vs. vector databases, which vector DB to pick (Chroma vs. Qdrant vs. Pinecone vs. pgvector), and five retrieval patterns that make the difference between a forgettable agent and one that compounds value with every session. Includes a copy-paste RAG pipeline you can implement this weekend.

Why Memory Is the Moat

Every solopreneur building with AI agents hits the same wall within a week: the agent is brilliant in the moment and amnesiac by the next session. You spent an hour configuring the perfect tone, teaching it your brand voice, walking it through your customer personas. Then you restart. It's gone.

This isn't a model problem. GPT-4o, Claude, Gemini all have the same architectural constraint: they are stateless by design. Every session starts with a blank slate unless you build the memory layer.

The Four Memory Layers

┌──────────────────────────────────────────────┐
│  LAYER 1: Working Memory (Context Window)    │  ← Dies when session ends
│  What's happening right now                  │
├──────────────────────────────────────────────┤
│  LAYER 2: Episodic Memory (Daily Files)      │  ← Days to weeks
│  What happened in past sessions              │
├──────────────────────────────────────────────┤
│  LAYER 3: Semantic Memory (MEMORY.md)        │  ← Months to years
│  Distilled facts, preferences, patterns      │
├──────────────────────────────────────────────┤
│  LAYER 4: External Memory (Vector Store)     │  ← Indefinite, queryable
│  Knowledge base, documents, history at scale │
└──────────────────────────────────────────────┘

Layers 1–3 are the file-based foundation covered in Library Item #41. This guide goes deep on Layer 4: vector stores, semantic retrieval, and RAG pipelines.

When You Need a Vector Database (And When You Don't)

Most solopreneurs reach for Pinecone too early. The honest decision matrix:

Situation                                   Recommendation
Single user, under 500 sessions             Daily files + MEMORY.md. Skip vector DB.
Product knowledge base (docs, FAQs)         Vector DB; keyword search won't cut it
Multi-user agent with per-user history      Vector DB with per-user namespacing
Customer support bot, 10K+ past tickets     Definitely vector DB
Personal assistant for 1–5 people           Still fine with files
Under 10,000 total memory chunks            Grep and files beat the ops overhead

Rule of thumb: If you can grep for it in under 2 seconds, you don't need a vector database.
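At file scale, recall really is that simple. Here's a minimal sketch in Python; the grep_memory helper, the ./memory directory, and the *.md layout are illustrative assumptions, not a prescribed structure:

from pathlib import Path

def grep_memory(term: str, memory_dir: str = "./memory") -> list[str]:
    # Naive case-insensitive substring scan over markdown memory files.
    # Under ~10K chunks this is effectively instant, with zero ops overhead.
    hits = []
    for path in Path(memory_dir).rglob("*.md"):
        for line in path.read_text(encoding="utf-8").splitlines():
            if term.lower() in line.lower():
                hits.append(f"{path.name}: {line.strip()}")
    return hits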

How Vector Memory Works

A vector database stores information as high-dimensional numeric arrays called embeddings. Instead of searching by keywords, you search by meaning.

Your text                    Embedding model               Vector stored in DB
"pricing strategy"   ─────►  [0.23, -0.41, 0.87, ...]  ─────►  chroma/pinecone

Query at retrieval time:
"what do we charge?" ─────►  [0.21, -0.39, 0.91, ...]  ─────►  cosine similarity
                                                               → returns pricing docs
                                                                 (similarity: 0.97)

Similar meanings produce similar vectors. The database finds the closest matches by measuring the angle between vectors (cosine similarity). This is how your agent can answer "what are our pricing rules?" even when the stored memory says "subscription tiers" β€” same meaning, different words.
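For intuition, cosine similarity is only a few lines of math; the databases below just compute it for you at scale. A minimal sketch:

import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    # cos(theta) = (a · b) / (|a| * |b|)
    # 1.0 means identical direction (same meaning); near 0 means unrelated.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)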

The Vector Stack: Your Options

Option A: Chroma (Local, Free, Start Here)

Chroma runs locally, needs no API key, and takes 5 minutes to set up. It's the right starting point for prototypes and personal agents.

pip install chromadb openai

import chromadb
from chromadb.utils import embedding_functions

ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="your-openai-key",
    model_name="text-embedding-3-small"  # $0.02/1M tokens, nearly free
)

client = chromadb.PersistentClient(path="./memory/vector-store")
collection = client.get_or_create_collection(
    name="agent_memory",
    embedding_function=ef,
    metadata={"hnsw:space": "cosine"}  # default is L2; cosine matches the retrieval math in this guide
)

# Store a memory
collection.add(
    documents=["User prefers bullet points over prose. Dislikes long intros."],
    metadatas=[{"type": "preference", "date": "2026-03-06", "user": "pk"}],
    ids=["pref_001"]
)

# Retrieve relevant memories
results = collection.query(
    query_texts=["how should I format my response?"],
    n_results=3
)
# results["documents"][0] -> ["User prefers bullet points over prose..."]

Chroma pros: Free, local, private, fast under 100K entries, persistent across restarts.
Chroma cons: Single machine only, no cloud sync.
Graduate when: You need multi-machine access or multiple users querying the same store.

Option B: Qdrant (Self-Hosted, Production-Ready)

Qdrant is the best option for production performance without Pinecone's pricing. Open-source, runs on a $6/month VPS.

# Run Qdrant via Docker
docker run -p 6333:6333 -v $(pwd)/qdrant_storage:/qdrant/storage qdrant/qdrant

from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct,
    Filter, FieldCondition, MatchValue,
)

client = QdrantClient(host="localhost", port=6333)

client.create_collection(
    collection_name="agent_memory",
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE)
)

def remember(text: str, metadata: dict, point_id: int | str):
    # Qdrant point IDs must be unsigned integers or UUID strings
    vector = embed(text)  # your OpenAI embed call (see sketch below)
    client.upsert(
        collection_name="agent_memory",
        points=[PointStruct(id=point_id, vector=vector, payload=metadata | {"text": text})]
    )

def recall(query: str, top_k: int = 5, user: str | None = None):
    vector = embed(query)
    # Payload filters use Qdrant's Filter model, not a raw dict
    query_filter = None
    if user is not None:
        query_filter = Filter(
            must=[FieldCondition(key="user", match=MatchValue(value=user))]
        )
    results = client.search(
        collection_name="agent_memory",
        query_vector=vector,
        limit=top_k,
        query_filter=query_filter
    )
    return [r.payload for r in results]
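Both functions lean on an embed() helper the snippet leaves to you. A minimal sketch using the OpenAI Python SDK, assuming OPENAI_API_KEY is set in your environment:

from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> list[float]:
    # text-embedding-3-small produces 1536-dim vectors, matching VectorParams(size=1536)
    response = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding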

Option C: pgvector (If You're Already on Postgres)

Zero additional infrastructure: just add the extension to your existing database.

-- Enable the extension
CREATE EXTENSION vector;

-- Create memory table
CREATE TABLE agent_memory (
    id SERIAL PRIMARY KEY,
    content TEXT NOT NULL,
    embedding VECTOR(1536),
    metadata JSONB,
    created_at TIMESTAMP DEFAULT NOW()
);

-- Build the ivfflat index after the table has data; lists ≈ rows/1000 is a good start
CREATE INDEX ON agent_memory USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
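Retrieval is then a plain SQL query ordered by pgvector's cosine-distance operator (<=>). A minimal sketch using psycopg 3 and the embed() helper above; the connection string is a placeholder:

import psycopg

def recall_pg(query: str, top_k: int = 5) -> list[tuple]:
    # <=> is pgvector's cosine-distance operator; smaller = more similar
    vector = str(embed(query))  # pgvector parses the "[0.1, 0.2, ...]" text format
    with psycopg.connect("dbname=agents") as conn:
        return conn.execute(
            """
            SELECT content, metadata
            FROM agent_memory
            ORDER BY embedding <=> %s::vector
            LIMIT %s
            """,
            (vector, top_k),
        ).fetchall()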

The 5 Retrieval Patterns

This is where the real value is. Basic semantic search alone misses a meaningful share of relevant context: exact keywords, multi-angle questions, and per-user boundaries all trip it up. These five patterns close that gap and give your agent retrieval that actually works in production.

Pattern 1: Basic Semantic Search with Score Filtering

def retrieve_relevant_context(query, collection, top_k=5, min_score=0.70):
    results = collection.query(query_texts=[query], n_results=top_k)
    # With the cosine-space collection above, Chroma returns cosine distances;
    # convert to similarity and drop weak matches below the floor
    docs = results["documents"][0]
    distances = results["distances"][0]
    return [
        doc for doc, dist in zip(docs, distances)
        if 1 - dist >= min_score
    ]
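Usage, with the Chroma collection from Option A (the prompt assembly here is illustrative):

context = retrieve_relevant_context("how should I format my response?", collection)
# Only memories above the similarity floor make it into the prompt
system_prompt = "Relevant memories:\n" + "\n".join(context)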

5 retrieval patterns + full RAG pipeline inside

The rest covers Pattern 2 (hybrid BM25 + semantic), Pattern 3 (multi-query retrieval), Pattern 4 (contextual compression), Pattern 5 (self-querying with metadata filters), the complete RAG pipeline, embedding model comparison, and the solopreneur starter stack.

  • Hybrid BM25 + semantic search (exact + meaning)
  • Multi-query retrieval (3 angles per question)
  • Contextual compression (stop bloating context)
  • Self-querying with metadata filters (no cross-user bleed)
  • Full production-ready RAG pipeline (copy-paste ready)
  • Embedding model comparison: OpenAI vs. Cohere vs. local
  • Memory deduplication + pruning patterns
  • Week-by-week starter stack (zero to production)

Get Library Access ($9/mo) →

Includes 54+ library items + Daily Briefing. 30-day money-back guarantee.
