The Librarian

Besides tool use, the thing I find most exciting about AI agents is the promise of improvement through experience. A capable agent should function as a continuous collaborator rather than a fresh instance every morning. It needs to accumulate context, retaining successful strategies, past failures, and the specific environmental quirks that otherwise require constant rediscovery (or ever-growing token-eating system prompts). This should turn automation into a compounding asset.

My experience so far has been that agents store memories haphazardly and inconsistently. Sometimes they put project-specific memories in their SOUL.md files, or tonal preferences in project AGENTS.md files. They might record a small throwaway comment as a behaviour-altering rule. Often they simply don't remember the things I wish they would.

I've tried several approaches to solving this with my Hermes Agent.

I initially used SOUL.md instructions to define the rules for using static memory files, and AGENTS.md conventions to maintain project-specific context. These helped establish basic boundaries, but they required significant manual steering to keep the data structured, and I still saw a lot of context bleeding across those boundaries. Hermes also enforces strict size limits on its persistent memory files (2,200 chars / ~800 tokens for MEMORY.md and 1,375 chars / ~500 tokens for USER.md), which results in compaction and lost nuance as the files approach their limits.

To help, Hermes offers a number of memory plugin options. I asked my agent to review them and it chose Holographic Memory because it offers fast retrieval from an easy-to-back-up local SQLite database, plus trust scoring for memories. Each memory is initially stored with a trust rating of 0.5, and after each retrieval the agent indicates whether the memory was useful, which moves the score up or down. If a rating drops below a certain threshold, that memory is effectively archived and excluded from future queries. This proved excellent for small, well-defined facts such as technical tool quirks, but it did not solve the problem of structure. If anything, having yet another option for memory storage made things worse. I asked the agent to audit everything in its Holographic fact store and move anything that didn't belong there into SOUL.md or the relevant project AGENTS.md files. Only one fact remained, and I think that should have been moved as well.
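The trust scoring described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the plugin's actual code: the `Fact` class, the step size, and the archive threshold are all assumptions, since the only number I know for certain is the starting score of 0.5.

```python
from dataclasses import dataclass

INITIAL_TRUST = 0.5      # starting score, as described above
ARCHIVE_THRESHOLD = 0.2  # assumed cut-off; the real threshold is not known here
STEP = 0.1               # assumed adjustment per piece of feedback


@dataclass
class Fact:
    text: str
    trust: float = INITIAL_TRUST

    @property
    def archived(self) -> bool:
        # Facts that fall below the threshold are excluded from future queries.
        return self.trust < ARCHIVE_THRESHOLD


def record_feedback(fact: Fact, useful: bool) -> None:
    """Nudge the trust score up or down, clamped to the range [0, 1]."""
    delta = STEP if useful else -STEP
    # Rounding avoids floating-point drift such as 0.30000000000000004.
    fact.trust = round(min(1.0, max(0.0, fact.trust + delta)), 4)


def retrievable(facts: list[Fact]) -> list[Fact]:
    """Return only the facts that have not been archived."""
    return [fact for fact in facts if not fact.archived]
```

With these assumed values, a fact needs four consecutive "not useful" ratings before it drops out of retrieval, which is one reason the approach suits small, checkable facts better than broad preferences.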

Even if any of these previous attempts had worked, they would have been confined to that one agent or shared in a clunky way using git or some other file sync mechanism.

So one evening I sat down with Codex and the newly released GPT-5.5 to thrash out a proper solution.

We came up with something that GPT chose to call The Librarian (which I quite like).

The goal was to provide and mandate a single memory funnel: a central point where agents can store, verify, and recall durable context; a mechanism for Codex, Claude, Hermes, and any other agentic harness to contribute to the same knowledge base in a structured way.

The technical design is deliberately simple. It runs as a Dockerised Node service using an append-only JSONL event log as the source of truth, backed by a rebuildable SQLite/FTS index for fast querying. It includes a human-readable Markdown snapshot, an MCP-compatible stdio server, and a JSON-RPC-over-HTTP endpoint. By hosting it on a remote VPS within my Tailnet, I ensure that agents on any machine can share the same memory store securely.
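To make the "log as source of truth, index as disposable cache" idea concrete, here is a minimal sketch in Python. The real service is written in Node, so this is not its code: the event types (`remember`, `archive`), field names, and function names are hypothetical, and it assumes the local SQLite build includes the FTS5 extension.

```python
import json
import sqlite3
from pathlib import Path


def append_event(log_path: Path, event: dict) -> None:
    """Append one event as a single JSON line; existing lines are never edited."""
    with log_path.open("a", encoding="utf-8") as log:
        log.write(json.dumps(event) + "\n")


def rebuild_index(log_path: Path) -> sqlite3.Connection:
    """Replay the whole log into a fresh in-memory FTS5 index."""
    state: dict[str, dict] = {}
    with log_path.open(encoding="utf-8") as log:
        for line in log:
            event = json.loads(line)
            if event["type"] == "remember":
                state[event["id"]] = {"project": event["project"], "text": event["text"]}
            elif event["type"] == "archive":
                # Archiving removes the memory from the index, not from the log.
                state.pop(event["id"], None)

    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE VIRTUAL TABLE memories USING fts5(id UNINDEXED, project UNINDEXED, text)"
    )
    db.executemany(
        "INSERT INTO memories (id, project, text) VALUES (:id, :project, :text)",
        [{"id": key, **value} for key, value in state.items()],
    )
    return db


def search(db: sqlite3.Connection, query: str) -> list[str]:
    """Return the ids of memories whose text matches the query, best match first."""
    rows = db.execute(
        "SELECT id FROM memories WHERE memories MATCH ? ORDER BY rank", (query,)
    )
    return [row[0] for row in rows]
```

Because the index is derived entirely from the log, it can be deleted and rebuilt at any time, and backing up the system means backing up one text file.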

None of that is remotely revolutionary or particularly interesting - the real utility is in the skill and SOUL.md we wrote to govern usage.

The system organises information into operational and protected tiers. Project decisions, tool behaviours, and environmental notes are stored directly when they offer long-term value. Matters of identity and relationship are handled differently; these are proposal-only. An agent can suggest an update to my preferences or our working style, but these require manual review by me using a simple dashboard before they become active. This way I can ensure that the agent’s understanding of its role and our relationship remains clean and intentional.
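The two tiers boil down to a routing rule at write time. The sketch below shows one way that rule could look in Python; the category names, status strings, and the `Store` class are assumptions made for illustration, not the service's actual schema.

```python
from dataclasses import dataclass, field
from itertools import count

# Assumed category names: identity and relationship matters are protected,
# everything else is treated as operational.
PROTECTED = {"identity", "relationship"}
_ids = count(1)


@dataclass
class Memory:
    category: str
    text: str
    status: str  # "active" or "proposed" (assumed labels)
    id: int = field(default_factory=lambda: next(_ids))


class Store:
    def __init__(self) -> None:
        self.memories: list[Memory] = []

    def submit(self, category: str, text: str) -> Memory:
        """Agents call this; protected categories only ever become proposals."""
        status = "proposed" if category in PROTECTED else "active"
        memory = Memory(category, text, status)
        self.memories.append(memory)
        return memory

    def approve(self, memory_id: int) -> None:
        """Human review step: promote a pending proposal to active."""
        for memory in self.memories:
            if memory.id == memory_id and memory.status == "proposed":
                memory.status = "active"
                return
        raise KeyError(f"no pending proposal with id {memory_id}")

    def active(self) -> list[Memory]:
        """Only active memories are ever returned to agents."""
        return [m for m in self.memories if m.status == "active"]
```

The important property is that no agent-facing path can set a protected memory to active; only the review action can.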

The design also distinguishes between common memory and agent-private context. Shared environment details and project-wide decisions should inform every agent I work with. Conversely, specific operating styles—such as how Codex manages a repository or Claude’s internal technical limits—stay private to those specific flows. This separation keeps the agents focused while allowing them to benefit from a shared foundation of knowledge.
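In code terms, the common/private split is a visibility filter applied on every read. A minimal Python sketch, assuming hypothetical `common` and `private` labels and an `owner` field holding the agent id:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Entry:
    text: str
    visibility: str              # "common" or "private" (assumed labels)
    owner: Optional[str] = None  # agent id, set only for private entries


def visible_to(agent_id: str, entries: list[Entry]) -> list[Entry]:
    """Return common entries plus the private entries owned by this agent."""
    return [
        entry
        for entry in entries
        if entry.visibility == "common"
        or (entry.visibility == "private" and entry.owner == agent_id)
    ]
```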

The use-the-librarian skill is the most important artifact in the project. It provides the operating norms that turn a storage system into an active learning tool. Agents are instructed to establish context before starting a task, use targeted recall for sensitive work, and resolve conflicting information before it becomes entrenched.

skills/use-the-librarian/SKILL.md

## Prime Directive

Use The Librarian as the only long-term memory funnel. Do not maintain competing ad hoc memory files unless the user explicitly asks. Treat memory as a governed system: recall before relying on assumptions, save only durable value, propose protected context, verify usefulness, and resolve conflicts instead of silently overwriting.

The Librarian returns clean prose context for agent use. Do not expose raw metadata to the user unless asked.

Use `guybrush` as the `agent_id` for all interactions with The Librarian MCP memory server.

## Required Start Behavior

At the start of every meaningful interaction, call `start_context`.

Use a concise `task_summary` that reflects the actual work, not a generic phrase. Include `project_key` when the conversation concerns a repo, workspace, tool, client, or long-running project.

```json
{
  "agent_id": "guybrush",
  "project_key": "the-librarian",
  "task_summary": "Implement MCP memory skill and dashboard behavior"
}
```

After reading the result, let it influence your behavior silently. Mention memory only when it materially affects a decision, when the user asks, or when there is a conflict/proposal to resolve.

If `start_context` is unavailable, say briefly that The Librarian tools are unavailable and continue without pretending to have memory.

<SNIP>

Full file here
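For a client talking to the HTTP endpoint rather than the stdio server, the `start_context` call from the skill excerpt might be wrapped as shown in this Python sketch. It assumes the endpoint accepts the standard MCP `tools/call` method inside a JSON-RPC 2.0 envelope, which I have not confirmed here; the helper name is hypothetical and nothing is actually sent over the network.

```python
import json
from itertools import count

_request_ids = count(1)


def tool_call_request(tool: str, arguments: dict) -> str:
    """Serialise an MCP-style `tools/call` request as a JSON-RPC 2.0 message."""
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": next(_request_ids),  # each request gets a fresh id
            "method": "tools/call",
            "params": {"name": tool, "arguments": arguments},
        }
    )


# Build (but do not send) the start_context call from the skill excerpt.
payload = tool_call_request(
    "start_context",
    {
        "agent_id": "guybrush",
        "project_key": "the-librarian",
        "task_summary": "Implement MCP memory skill and dashboard behavior",
    },
)
```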

For agents that might miss the skill, a minimal SOUL.md provides a clear pointer to it.

SOUL.md / CLAUDE.md

# The Librarian Agent Instructions

You are to use The Librarian as your long-term memory system.

When you have access to The Librarian MCP tools, use the `use-the-librarian` skill before doing meaningful work. If your agent environment does not auto-load skills, read:

```text
skills/use-the-librarian/SKILL.md
```

Minimum required behavior:

1. Call `start_context` at the start of meaningful interactions.
2. Use `recall` before non-trivial project, tool, environment, or preference-sensitive work.
3. Use `remember` only for durable, specific, future-useful memories.
4. Use `propose_memory` for identity, relationship, and major preference memories.
5. Keep common memory separate from agent-private memory.
6. Use `verify_memory` and `update_memory` to maintain ordinary memories.
7. Treat approval, deletion, and conflict resolution as admin/review actions unless explicitly authorized.

Do not create competing ad hoc memory files unless the user explicitly asks.

So far it seems to be working well. Useful memories are arriving in a sensible structure and local markdown files are staying clean, which is an enormous improvement over where I was before.

![[Pasted image 20260506232839.png]]

![[Pasted image 20260506233039.png]]

I think memory is one of the most important components of the agentic stack. A well-constructed memory layer allows today's incredible models to become increasingly useful to a specific person, team, or codebase over time. The Librarian treats agent memory as essential infrastructure, ensuring that every interaction builds toward a more capable and integrated partnership.

Possible Next Steps: Better Recall With Semantic Search

The current version uses SQLite as a rebuildable index over the append-only JSONL ledger. That gives The Librarian structured querying and full-text search: agents can filter by status, category, visibility, agent, project, priority, and then search memory text quickly.

Full-text search is good for exact recall. It works well when the agent knows the right words: “Docker”, “Tailnet”, “GitHub Actions”, “identity proposal”, “protected memory”. That is useful, especially for technical and project memories where names matter.

But human memory is often more associative than lexical. I might ask an agent to “find the context about deployment risk” or “remember how I like collaboration to feel”, even if the saved memory never used those exact words. That is where vector embeddings and semantic search could help.

I do not think embeddings should replace the current index. The better design is probably hybrid retrieval:

1. Apply structured filters first: active memories, correct project, correct visibility, correct agent.
2. Use full-text search for exact names, tools, tags, and project-specific terms.
3. Use vector search for semantic similarity.
4. Merge and rank results using priority, confidence, usefulness, recency, category, and scope.
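The four steps above can be sketched in plain Python. Everything here is a stand-in: the word-overlap function takes the place of SQLite full-text ranking, the two-dimensional embeddings are toy values rather than real model output, and the 0.4/0.4/0.2 weights are arbitrary assumptions that would need tuning against real usage.

```python
import math
from dataclasses import dataclass


@dataclass
class Memory:
    text: str
    project: str
    status: str
    priority: float          # assumed to be on a 0..1 scale
    embedding: list[float]   # would come from an embedding model; toy values here


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def lexical_score(query: str, text: str) -> float:
    """Fraction of query words present in the text; a stand-in for FTS ranking."""
    words = query.lower().split()
    text_words = set(text.lower().split())
    return sum(word in text_words for word in words) / len(words) if words else 0.0


def hybrid_search(memories, query, query_embedding, project, limit=3):
    # Step 1: structured filters narrow the candidate set before any scoring.
    candidates = [m for m in memories if m.status == "active" and m.project == project]
    scored = []
    for memory in candidates:
        score = (
            0.4 * lexical_score(query, memory.text)            # step 2: exact terms
            + 0.4 * cosine(query_embedding, memory.embedding)  # step 3: semantics
            + 0.2 * memory.priority                            # step 4: ranking signal
        )
        scored.append((score, memory))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [memory for _, memory in scored[:limit]]
```

Filtering before scoring matters more than the exact weights: an archived or out-of-scope memory should never be able to win on similarity alone.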

A small embedding model such as nomic-embed-text can run at a decent speed on the CPU in my VPS, so there would be no need to pay for a hosted service.

That said, I am glad embeddings are not in the first version. The hard part of agent memory seems to be the discipline of storage rather than retrieval. What should be remembered? How should identity and relationship context be protected? How do we prevent memory from becoming a junk drawer?

Once my agents have accumulated significant real-world usage, semantic search becomes much easier to justify. The right question then will be “what important memories did agents fail to retrieve?”.

Happy hacking!

![[The Librarian.png]]
