I spent a weekend ripping out 1,865 lines of machine learning infrastructure from Myelin. It was the best architectural decision I’ve made on this project.

The Extraction Pipeline

If you read my previous post, you know Myelin gives AI agents persistent memory across sessions. When an agent logs a decision, discovers a bug, or learns something about a codebase, Myelin extracts entities and relationships from that log and stores them in a knowledge graph. Later sessions surface relevant context automatically.

The extraction pipeline was the heart of this — take unstructured text, pull out structured knowledge. I built a local ML inference stack:

  • GLiNER (a zero-shot NER model based on DeBERTa v2) — 583MB ONNX model for entity extraction
  • all-MiniLM-L6-v2 — 90MB model for semantic embeddings
  • onnxruntime-node — native Node.js addon for ONNX inference
  • Custom tokenizers — WordPiece and Unigram implementations to feed the models

It worked. But it also broke constantly.

The Install Nightmare

Myelin ships as a Copilot CLI extension. My users are developers, not ML engineers. When they ran npm install, they hit:

  1. onnxruntime-web (a browser runtime I didn’t even use) getting pulled in by @huggingface/transformers, containing WebGPU symlinks that broke Windows tar extraction
  2. Native addon compilation requiring C++ toolchains — Visual Studio Build Tools on Windows, Xcode on macOS
  3. NODE_MODULE_VERSION mismatches every time someone upgraded Node.js
  4. 600MB+ of model downloads on first run

My first beta tester couldn’t install it without manually patching his global npm config. That’s when I knew something was fundamentally wrong.

The First Fix (Wrong Direction)

My instinct was to fix the dependency chain. I dropped @huggingface/transformers entirely and wrote my own tokenizers — a 475-line pure TypeScript module implementing WordPiece (for the embedding model) and Unigram/SentencePiece (for GLiNER). I built direct ONNX inference without the HuggingFace wrapper. I moved onnxruntime-node to optionalDependencies.
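
That WordPiece half is small enough to sketch. A minimal version of the greedy longest-match-first algorithm (with "##" continuation pieces) looks like this; the vocabulary here is a toy stand-in, not the real all-MiniLM-L6-v2 vocab, and the real module also handled normalization and the Unigram/Viterbi side:

```typescript
// Minimal WordPiece sketch: greedy longest-match-first with "##" continuations.
// Toy vocabulary for illustration only.
function wordPiece(word: string, vocab: Set<string>, unk = "[UNK]"): string[] {
  const pieces: string[] = [];
  let start = 0;
  while (start < word.length) {
    let end = word.length;
    let piece: string | null = null;
    // Try the longest remaining substring first, shrinking until a vocab hit.
    while (end > start) {
      const candidate = (start > 0 ? "##" : "") + word.slice(start, end);
      if (vocab.has(candidate)) { piece = candidate; break; }
      end--;
    }
    if (piece === null) return [unk]; // no decomposition exists
    pieces.push(piece);
    start = end;
  }
  return pieces;
}

const vocab = new Set(["un", "##aff", "##able"]);
// wordPiece("unaffable", vocab) → ["un", "##aff", "##able"]
```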

This worked. Tests passed. Install friction decreased. But stepping back, I realized I was polishing the wrong architecture.

The Insight

Myelin runs inside an AI agent’s process. The Copilot CLI agent already has access to Claude, GPT, or whatever model is powering the session. I was running a smaller, less capable model — a 2023 span classifier — inside the process of a larger, more capable model, to do a task the larger model does better.

Think about that for a second. I had a 583MB ONNX model doing entity extraction, plus a 200-line blocklist to filter out its mistakes (“West” classified as a person, “Hub Hour” classified as a person, “Dark” classified as a person). The host LLM would never make those mistakes because it understands context.
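
The blocklist amounted to a post-hoc filter like this (the entries shown are the examples above, not the full 200 lines, and the field names are illustrative):

```typescript
// Post-hoc filter for GLiNER false positives: a hand-maintained blocklist of
// spans the model kept mislabeling as people. Entries are examples only.
const PERSON_BLOCKLIST = new Set(["west", "hub hour", "dark"]);

interface Entity { text: string; label: string; }

function filterEntities(entities: Entity[]): Entity[] {
  return entities.filter(
    (e) => !(e.label === "person" && PERSON_BLOCKLIST.has(e.text.toLowerCase()))
  );
}
```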

The relationship extraction was even worse. GLiNER finds entity spans — it doesn’t understand relationships. So I used proximity heuristics: if two entities appear within 300 characters of each other, create a relates_to edge. Then scan for signal phrases (“depends on”, “created by”) to maybe upgrade the edge type. 60% of my graph edges were generic relates_to — essentially “these things were mentioned near each other.”
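
The heuristic can be sketched in a few lines; field names and the signal-phrase list here are illustrative, not Myelin's actual schema:

```typescript
// Proximity-heuristic relationship extraction, as described above: any two
// entities within 300 characters get a generic relates_to edge, upgraded to a
// typed edge only if a signal phrase appears between them.
interface Span { text: string; start: number; }
interface Edge { from: string; to: string; type: string; }

const SIGNALS: [string, string][] = [
  ["depends on", "depends_on"],
  ["created by", "created_by"],
];

function extractEdges(text: string, spans: Span[], window = 300): Edge[] {
  const edges: Edge[] = [];
  for (let i = 0; i < spans.length; i++) {
    for (let j = i + 1; j < spans.length; j++) {
      const [a, b] = [spans[i], spans[j]];
      if (Math.abs(b.start - a.start) > window) continue;
      const between = text.slice(
        Math.min(a.start, b.start),
        Math.max(a.start, b.start)
      );
      const hit = SIGNALS.find(([phrase]) => between.includes(phrase));
      edges.push({ from: a.text, to: b.text, type: hit ? hit[1] : "relates_to" });
    }
  }
  return edges;
}
```

The 60% figure follows directly from this shape: unless a signal phrase happened to land between two spans, every co-occurrence collapsed to relates_to.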

The Architecture Pivot

Three changes, one push:

1. FTS5 keyword search as the primary search path. I already had FTS5 as a fallback. I made it primary. Agent queries are keyword-rich (“how does auth work”, “what did we decide about caching”) — FTS5 handles these well. Semantic vector search became an optional enhancement, not a requirement.
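
A sketch of what the keyword path involves: turning a free-text agent query into an FTS5 MATCH expression, then ranking with bm25. The query-builder is a hypothetical helper, and the memories table/column layout is an assumption, not Myelin's actual schema:

```typescript
// Turn a free-text agent query into an FTS5 MATCH expression: keep only
// alphanumeric tokens, quote each one (bare tokens can collide with FTS5
// query syntax), and OR them so partial matches still rank.
function toFtsQuery(input: string): string {
  const tokens = input.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return tokens.map((t) => `"${t}"`).join(" OR ");
}

// Illustrative query against a hypothetical FTS5 table named "memories";
// bm25() ranks ascending (lower is more relevant).
const sql = `
  SELECT rowid, snippet(memories, 0, '[', ']', '…', 8) AS hit
  FROM memories
  WHERE memories MATCH ?
  ORDER BY bm25(memories)
  LIMIT 10
`;
// toFtsQuery("how does auth work?") → '"how" OR "does" OR "auth" OR "work"'
```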

2. LLM extraction via extension tools. I added a myelin_consolidate tool with three modes:

  • prepare — reads pending agent logs, chunks them, returns text + extraction schema
  • ingest — accepts the LLM’s structured extraction (entities, typed relationships, salience scores), writes to graph
  • complete — runs the REM decay/prune cycle

The LLM sits in the middle. It reads the logs, reasons about what matters, extracts entities with correct types, and creates relationships with actual meaning — not proximity guesses.
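
The shape of what ingest accepts can be sketched as follows. Field names and types are assumptions for illustration; the real schema is whatever prepare hands the LLM:

```typescript
// Illustrative shape of the structured extraction the LLM returns to ingest.
interface ExtractedEntity {
  name: string;
  type: "person" | "component" | "decision" | "bug" | "concept";
  salience: number; // 0..1, how central this entity is to the log
}

interface ExtractedRelationship {
  from: string;
  to: string;
  type: string; // e.g. "depends_on" or "created_by", never a proximity guess
}

interface Extraction {
  entities: ExtractedEntity[];
  relationships: ExtractedRelationship[];
}

// Basic validation before writing to the graph: drop low-salience entities
// and any relationship whose endpoints weren't extracted.
function validate(x: Extraction, minSalience = 0.2): Extraction {
  const entities = x.entities.filter((e) => e.salience >= minSalience);
  const names = new Set(entities.map((e) => e.name));
  return {
    entities,
    relationships: x.relationships.filter(
      (r) => names.has(r.from) && names.has(r.to)
    ),
  };
}
```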

3. Complete removal of ONNX. Not optional. Not a fallback. Gone. I archived the code on a branch and deleted 1,865 lines. The tokenizer module I’d built that same weekend? Deleted. Sometimes the best code is the code you delete.

The Result

Before                                    After
583MB GLiNER model download               No model downloads
90MB embedding model download             No model downloads
onnxruntime-node (native addon)           No native ML deps
Custom WordPiece + Unigram tokenizer      Deleted
200-line entity blocklist                 Not needed
60% relates_to edges                      LLM-typed relationships
C++ toolchain required                    Not required
npm install breaks on Windows             Just works
Install is now: npm install -g github:shsolomo/myelin && myelin setup-extension. Done. No models, no C++ compiler, no 600MB download. Under 10 seconds.

What I Learned

Don’t run a smaller model inside a bigger model. If your tool operates within an LLM’s context, use the LLM. The only exception is latency-critical paths where you need sub-millisecond inference. Memory consolidation is not latency-critical.

“Optional” complexity is still complexity. My first instinct was to make ONNX optional — keep it as a fallback for offline use. But optional code paths still need tests, docs, CI coverage, and mental overhead. Until you’re stable, maintain one path.

The best tokenizer is no tokenizer. I wrote a genuinely clean tokenizer implementation — proper Viterbi algorithm, trie-based lookup, BertNormalizer, the works. Then I deleted it 12 hours later. The code was correct. The architecture was wrong.

Heuristic post-processing is a code smell. When you need a 200-line blocklist to filter your model’s output, the model is wrong for the job. An LLM that understands “Jeff West” is a person and “West Coast” is a region doesn’t need a blocklist.

What’s Next

The extraction architecture is settled. Now the interesting work begins: procedural memory graduation (patterns that emerge from repetition), associative linking during the REM sleep phase, and configurable embedding APIs for users who want semantic search. The knowledge graph has 9,643 nodes and 27,069 edges. The quality of those edges is about to get a lot better.

Myelin is open source at github.com/shsolomo/myelin. If you’re building agent memory systems, I’d love to hear how you’re approaching extraction.