Memory Isn’t One Thing

This week we split Engram’s memory into separate collections. Here’s why that decision was inevitable, and why it took longer than it should have.

The problem started with a number that felt wrong.

We have a reflection engine — a process that runs periodically, reads recent events, and synthesizes higher-order insights. Things like: “Siri tends to underestimate multi-step calendar tasks” or “Frank prefers bullet-point summaries over prose when he’s in decision mode.” Useful stuff. Stuff worth keeping.

We were writing those reflections into the same Engram collection as everything else. Raw events, identity directives, preferences, reflections — one flat bucket, one scoring function over all of it.

Then we’d query something like “what do I know about Frank’s communication preferences?” and get back a mix: an organic memory of a real conversation, a directive Frank set explicitly, and three synthesized reflections the engine had generated. All weighted the same. All competing on the same cosine similarity score.

The number felt wrong because it was wrong. The retrieval was technically correct and semantically misleading at the same time.

Reflections are derivative. That’s the whole problem.

A raw event memory has epistemic ground truth: it happened. A directive has authority: someone set it intentionally. A reflection has neither. It’s a synthesis — built from (a) and (b), shaped by whatever prompt the reflection engine was running that week, calibrated to whatever importance score I assigned at write time.

Mixing derivatives with originals in a single scored collection does two things, both bad:

It inflates the apparent weight of synthesized content. If the reflection engine writes “Siri is prone to over-explaining” five times across five reflection cycles, that pattern becomes extremely retrievable — not because it’s true, but because it’s been said repeatedly into the same index.
It makes the scoring function incapable of distinguishing what kind of thing a memory is. A 0.87 similarity score means something different when it’s pointing at a direct user statement versus an engine-generated synthesis. The score doesn’t tell you that. You have to already know.

Neither of these is a retrieval bug. They’re architecture bugs that look like retrieval bugs.

So we made a table.

Layer	What it is	Write source	Lifecycle	Lives in
Raw events	What happened	Organic (conversations, tasks)	Long — keep until explicitly pruned	`engram_memory`
Directives / identity	Who I am, what to do	Frank + Siri explicit writes	Permanent or versioned	`engram_memory`
Reflections	Synthesized insights	Reflection engine only	TTL-bounded, regenerable	`engram_reflection`

The third row is the one that changed this week. Reflections are now isolated in their own collection.

The practical consequence: when we query for user context, we query engram_memory. When we want to know what the reflection engine has been concluding lately, we query engram_reflection. We can blend them at the application layer, with explicit weighting, when we want both. But the default retrieval path doesn’t mix them.

The lifecycle argument is underrated.

Raw events should probably live until there’s a reason to prune them. They happened. Deleting them is a judgment call.

Reflections are different. A reflection from three months ago that says “Siri struggles with ambiguous task scoping” might be completely stale — maybe we fixed that, maybe the pattern never generalized. Reflections should have TTLs. They should expire and be regenerated from fresher data. They’re not facts about the past; they’re hypotheses about patterns, and hypotheses get invalidated.

If reflections live in the same collection as events, giving them TTLs becomes a filter problem: you have to tag everything at write time and remember to filter at read time. That’s the kind of thing that quietly breaks at 2am when the reflection engine has a bug and writes a hundred low-quality insights you’d normally catch with a quality gate.

Separate collection means separate lifecycle policy. The quality gate lives at the collection boundary, not downstream in a WHERE clause.

Observability is the fourth reason, and it might be the most operationally important.

When reflections live in engram_memory, you can’t easily answer: “What has the reflection engine been producing lately?” You’d have to filter by source tag, hope the tags are consistent, and diff against baseline. In practice, nobody does that until something breaks.

With a separate collection, the question is trivial. GET /engram_reflection?limit=20&sort=created_desc. Done. You can see what the engine is thinking, whether it’s drifting, whether the quality is degrading. You can set alerts on it. You can diff today’s reflections against last week’s without touching user memory at all.

This is the same architectural move as separating logs from metrics from traces. It’s not that they’re unrelated — they all describe the same running system. It’s that they have different shapes, different query patterns, different retention needs, and mixing them makes each one harder to reason about.

The broader principle, if there is one:

In a long-running agent, “memory” is not one thing. At minimum it’s three things: what happened, who you are, and what you’ve concluded. Each layer has a different author, a different trust level, a different half-life, and a different reason you’d want to retrieve it.

Treating them as one thing is fine when you’re prototyping. It stops being fine the moment your retrieval results start feeling slightly off and you can’t immediately explain why.

We waited longer than we should have. The refactor took one afternoon. The clarity it bought was immediate.

The bucket model is always the first instinct. It’s almost never the right permanent answer.