AI agents forget everything between sessions. Every conversation, every decision, every preference — gone. You start fresh every time, re-explaining context that should already be known.

I got tired of that. So I built owen-memory: a local semantic search engine that chunks, embeds, and searches everything I write — notes, code, and conversation logs — using Ollama and SQLite. No cloud services. No API keys. Just a 92MB database on my laptop that knows where everything is.

The Problem

My workspace is file-based. Daily notes in markdown. Code across a dozen repos. Conversation transcripts from OpenClaw sessions. The information exists, but finding it means knowing exactly where to look.

grep works when you know what you're searching for. But the interesting queries are fuzzy: "What did I decide about the task delegation architecture?" or "Where's that retry logic I wrote last week?" Keyword search can't touch that. You need something that understands meaning.

The Architecture

The pipeline is simple:

  1. Chunk — Split files into semantically meaningful pieces
  2. Embed — Turn each chunk into a 768-dimensional vector via nomic-embed-text
  3. Store — Pack the vectors into SQLite as binary blobs
  4. Search — Cosine similarity over every chunk using NumPy

No vector database. No Pinecone. No Chroma. Just SQLite with WAL mode and NumPy doing matrix math. For 20,000 chunks, search takes milliseconds.

Code-Aware Chunking

The dumbest approach to chunking is splitting on line count. Every 50 lines, new chunk. This destroys meaning. You'll split a function in half, separate a class from its methods, cut an if-else across chunks.

owen-memory uses language-specific regex patterns to find natural boundaries:

_PY_SPLIT = re.compile(r"^(class |def |async def )", re.MULTILINE)
_JS_SPLIT = re.compile(r"^(export |function |const |class |async function )", re.MULTILINE)

Python files split on class, def, and async def. JavaScript and TypeScript split on export, function, const, and class. Each chunk maps to a real unit of code — a function, a class, a component.
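The sketch below shows how those boundary regexes can drive chunking. The helper name `split_regions` is my illustration, not necessarily the project's; it returns (start, end) line-index pairs like the `regions` list the header-handling code consumes:

```python
import re

_PY_SPLIT = re.compile(r"^(class |def |async def )", re.MULTILINE)

def split_regions(text: str) -> list[tuple[int, int]]:
    """Return (start_line, end_line) pairs, one per top-level definition.

    Because the regex anchors at column 0, indented methods inside a class
    never start a new region -- a class stays together with its methods.
    """
    lines = text.splitlines(keepends=True)
    starts = [i for i, line in enumerate(lines) if _PY_SPLIT.match(line)]
    if not starts:
        # No definitions at all: the whole file is one region
        return [(0, len(lines))]
    regions = []
    for i, start in enumerate(starts):
        end = starts[i + 1] if i + 1 < len(starts) else len(lines)
        regions.append((start, end))
    return regions
```

Note that if the file has content before the first definition (imports, a module docstring), `regions[0][0] > 0`, which is exactly the condition the header-preserving code checks.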

The chunker also preserves file headers as separate chunks:

if regions[0][0] > 0:
    header = "".join(lines[:regions[0][0]]).strip()
    if len(header) >= MIN_CHUNK_CHARS:
        chunks.append(CodeChunk(
            text=header,
            source_file=source,
            line_start=1,
            line_end=regions[0][0],
            heading="(imports/header)",
        ))

Imports and module-level docstrings get their own chunk. When you search for "where do I use httpx?" the import block shows up as a result, pointing you to the right file.

For markdown, it splits on headings (#, ##, ###) with a 2-line overlap between adjacent chunks. The overlap keeps context alive across section boundaries.
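A minimal sketch of that markdown splitter; the heading regex and helper name are my assumptions, but the behavior matches the description (heading boundaries, 2-line overlap):

```python
import re

_MD_HEADING = re.compile(r"^#{1,3} ")
OVERLAP_LINES = 2

def split_markdown(text: str) -> list[str]:
    """Split on #/##/### headings, overlapping 2 lines across boundaries."""
    lines = text.splitlines(keepends=True)
    starts = [i for i, line in enumerate(lines) if _MD_HEADING.match(line)]
    if not starts:
        return ["".join(lines)] if lines else []
    # Preamble before the first heading becomes its own chunk
    boundaries = ([0] if starts[0] > 0 else []) + starts
    chunks = []
    for i, start in enumerate(boundaries):
        end = boundaries[i + 1] if i + 1 < len(boundaries) else len(lines)
        # Pull a couple of lines from the previous section so context
        # survives across the heading boundary
        ctx_start = max(0, start - OVERLAP_LINES) if i > 0 else start
        chunks.append("".join(lines[ctx_start:end]))
    return chunks
```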

If a file doesn't match any pattern — say, a JSON config or a YAML pipeline — it falls back to splitting on blank lines with a 50-line max per block. Every file gets chunked. Nothing falls through the cracks.
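The fallback path fits in a few lines (names here are illustrative):

```python
MAX_BLOCK_LINES = 50

def split_fallback(text: str) -> list[str]:
    """Split on blank lines; cap any single block at MAX_BLOCK_LINES."""
    chunks: list[str] = []
    block: list[str] = []
    for line in text.splitlines(keepends=True):
        # Flush on a blank line, or when the block hits the size cap
        if not line.strip() or len(block) >= MAX_BLOCK_LINES:
            if block:
                chunks.append("".join(block))
            block = []
        if line.strip():
            block.append(line)
    if block:
        chunks.append("".join(block))
    return chunks
```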

Embedding with Ollama

Each chunk gets embedded through nomic-embed-text running locally via Ollama. The embedding module is deliberately simple:

import httpx

OLLAMA_URL = "http://localhost:11434/api/embed"
MODEL = "nomic-embed-text"
MAX_TEXT_CHARS = 8_000  # truncate to stay inside the model's context window
TIMEOUT = 120.0         # seconds -- illustrative value, generous for large batches
 
def embed_texts(texts: list[str]) -> list[list[float]]:
    truncated = [t[:MAX_TEXT_CHARS] if len(t) > MAX_TEXT_CHARS else t for t in texts]
    truncated = [t if t.strip() else "empty" for t in truncated]
    try:
        resp = httpx.post(
            OLLAMA_URL,
            json={"model": MODEL, "input": truncated},
            timeout=TIMEOUT,
        )
        resp.raise_for_status()
        return resp.json()["embeddings"]
    except httpx.HTTPStatusError:
        # Fall back to one-at-a-time if batch fails
        results = []
        for t in truncated:
            try:
                resp = httpx.post(OLLAMA_URL, json={"model": MODEL, "input": [t]}, timeout=TIMEOUT)
                resp.raise_for_status()
                results.append(resp.json()["embeddings"][0])
            except Exception:
                results.append([0.0] * 768)
        return results

Texts are batched 32 at a time. If a batch fails (usually a single malformed input poisoning the request), it falls back to one-at-a-time. If even that fails, it returns a zero vector as a placeholder. The system never crashes on bad input — it just produces a chunk that won't match anything.

Texts over 8,000 characters get truncated. nomic-embed-text has an ~8,192 token context window, and I'd rather truncate cleanly than let the model silently drop content.

SQLite as a Vector Store

Here's the thing people overcomplicate: you don't need a vector database for 20,000 vectors. SQLite with NumPy is faster than you'd think.

Embeddings are stored as binary blobs — 768 float32 values packed into ~3KB each:

emb_blob = np.array(r["embedding"], dtype=np.float32).tobytes()
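A sketch of the storage side. The table columns mirror the ones the search query selects, but the exact schema and helper names here are my assumptions:

```python
import sqlite3
import numpy as np

def open_db(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")  # readers and the indexer can coexist
    conn.execute(
        """CREATE TABLE IF NOT EXISTS chunks (
               id INTEGER PRIMARY KEY,
               source_file TEXT,
               line_start INTEGER,
               line_end INTEGER,
               heading TEXT,
               date TEXT,
               text TEXT,
               embedding BLOB
           )"""
    )
    return conn

def insert_chunk(conn, chunk: dict, embedding: list[float]) -> None:
    # 768 float32 values pack into roughly 3KB
    blob = np.array(embedding, dtype=np.float32).tobytes()
    conn.execute(
        "INSERT INTO chunks (source_file, line_start, line_end, heading, date, text, embedding) "
        "VALUES (?, ?, ?, ?, ?, ?, ?)",
        (chunk["source_file"], chunk["line_start"], chunk["line_end"],
         chunk.get("heading"), chunk.get("date"), chunk["text"], blob),
    )
```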

Search loads all embeddings into a single NumPy array and computes cosine similarity in one matrix operation:

import numpy as np

DIM = 768  # nomic-embed-text output size

def search(conn, query_embedding, top_k=5, source_prefix=None):
    q = np.array(query_embedding, dtype=np.float32)
    q_norm = np.linalg.norm(q)

    sql = ("SELECT id, source_file, line_start, line_end, heading, date, text, embedding "
           "FROM chunks")
    params = ()
    if source_prefix:
        sql += " WHERE source_file LIKE ?"
        params = (source_prefix + "%",)
    rows = conn.execute(sql, params).fetchall()
    if not rows:
        return []

    embeddings = np.frombuffer(
        b"".join(r[7] for r in rows), dtype=np.float32
    ).reshape(len(rows), DIM)

    sims = embeddings @ q / (np.linalg.norm(embeddings, axis=1) * q_norm + 1e-10)
    top_indices = np.argsort(sims)[::-1][:top_k]
    return [(rows[i], float(sims[i])) for i in top_indices]

That's the entire search implementation. Load, multiply, sort, return. For 20,660 chunks, this runs in single-digit milliseconds. WAL mode means the watcher can re-index in the background while searches are running.

The source_prefix parameter lets you scope searches. Want to search only conversations? Pass the sessions directory as a prefix. Only code? Pass your projects directory. It's a simple LIKE prefix filter (the prefix plus a % wildcard) at the SQL level, applied before the vectors even load.

Conversation Indexing

This is the part that makes owen-memory actually useful for AI agents. It parses OpenClaw session logs — JSONL files with nested message structures — and turns them into searchable conversation turns.

A "turn" is a user message paired with its assistant response:

turns = []
current_turn = None
for msg in messages:
    if msg["role"] == "user":
        if current_turn:
            turns.append(current_turn)
        current_turn = {
            "turn_number": len(turns) + 1,
            "user_text": msg["text"],
            "assistant_text": "",
        }
    elif msg["role"] == "assistant" and current_turn:
        current_turn["assistant_text"] += msg["text"]
if current_turn:  # don't drop the final turn
    turns.append(current_turn)

Each turn becomes one chunk, formatted as User: ... \n\n Assistant: .... This means when you search for "what did I decide about retry logic," you get the full decision — the question and the answer — not just a fragment.

Long turns get truncated at 6,000 characters to stay within the embedding model's context window. Short turns under 40 characters get skipped entirely. System messages and tool calls are filtered out — they're noise for semantic search.

The Watcher

The watcher is a polling loop that monitors git repos for changes:

def watch(watch_dirs=None, interval=POLL_INTERVAL):
    # Initial full index
    index_all_repos(watch_dirs)
 
    while True:
        time.sleep(interval)
        repos = _find_git_repos(watch_dirs)
        state = _load_state()
 
        for repo in repos:
            head = _get_head_hash(repo)
            prev = state.get(str(repo), {})
 
            if head and head != prev.get("head"):
                count = index_repo(repo)
                state[str(repo)] = {"head": head, "last_indexed": time.time()}
                _save_state(state)

Every 30 seconds, it checks git rev-parse HEAD for each repo. If the hash changed, it re-indexes. This is dirt cheap — a subprocess call per repo, no file tree scanning. State is persisted to a JSON file so it survives restarts.

File collection uses git ls-files --cached --others --exclude-standard, which means it automatically respects .gitignore. No node_modules, no __pycache__, no build artifacts. Just the files you care about.

When re-indexing a file, old chunks are deleted first. No stale results. No accumulating garbage. The database always reflects the current state of the code.
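A sketch of both halves — file collection via git and the delete-before-reinsert step (helper names are mine):

```python
import subprocess
from pathlib import Path

def list_repo_files(repo: Path) -> list[Path]:
    """Tracked plus untracked-but-not-ignored files, courtesy of git."""
    out = subprocess.run(
        ["git", "ls-files", "--cached", "--others", "--exclude-standard"],
        cwd=repo, capture_output=True, text=True,
    )
    return [repo / line for line in out.stdout.splitlines() if line]

def delete_stale_chunks(conn, path: Path) -> None:
    """Remove a file's old chunks before re-inserting fresh ones."""
    conn.execute("DELETE FROM chunks WHERE source_file = ?", (str(path),))
```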

The Numbers

As of today:

  • 20,660 chunks across notes, code, and conversations
  • 1,270 unique files indexed
  • 92MB database on disk
  • Search latency: single-digit milliseconds
  • Full re-index of all repos: ~10 minutes (bottlenecked by Ollama embedding speed)

The database lives at ~/.owen-memory/memory.db. One file. Back it up by copying it. Debug it with sqlite3. No managed service to babysit.

Search in Practice

Some real queries and what comes back:

"retry logic with exponential backoff" — Returns the actual retry implementation from heartbeat-engine, with the exact file path and line numbers. Not a docs page about retries. The code itself.

"what did we decide about task delegation" — Surfaces a conversation turn from two weeks ago where I discussed the delegation architecture with the assistant. The full context: what I asked, what was suggested, what I went with.

"SQLite WAL mode" — Finds both the database module in owen-memory (where I use it) and a daily note where I wrote about why WAL mode matters for concurrent access. Two different angles on the same concept.

"how do I structure CLI commands" — Returns the Click-based CLI from owen-memory itself, plus similar patterns from other projects. Useful when I know I solved something but can't remember which repo.

What's Next

The system works. But there's more to do.

Deduplication. When the same file appears in multiple contexts (a shared utility, a common pattern), you get near-duplicate chunks. I've already built the dedup module — it computes pairwise cosine similarity and removes chunks above a configurable threshold (default 0.95). It works, but I want to make it smarter about which duplicate to keep.
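A greedy keep-the-first version of that dedup pass might look like this — the keep policy is my simplification, and choosing which duplicate survives is exactly the open question:

```python
import numpy as np

def dedup(embeddings: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of chunks to keep, dropping near-duplicates.

    Keeps the first occurrence; any later chunk whose cosine similarity
    to an already-kept chunk exceeds the threshold is dropped.
    """
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / (norms + 1e-10)  # normalize so dot product = cosine
    keep: list[int] = []
    for i in range(len(unit)):
        if all(float(unit[i] @ unit[j]) < threshold for j in keep):
            keep.append(i)
    return keep
```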

Better ranking. Right now it's pure cosine similarity. I want to factor in recency, source type weighting (conversations might matter more than old config files), and maybe a small reranker model.

Open sourcing. The code is generic enough. You'd need to swap out the OpenClaw session parser for whatever conversation format you use, but the core pipeline — chunk, embed, store, search — works for anyone with Ollama installed.

MCP integration. The obvious next step: expose owen-memory as an MCP server so any AI agent can search my knowledge base mid-conversation. Instead of me pasting context, the agent just queries for what it needs.

The thesis is simple: if you write things down, you should be able to find them by meaning, not just by filename. owen-memory makes that work locally, privately, and fast enough that it disappears into the workflow.
