May 19, 2026 · architecture · 8 min read

Semantic Similarity Is Not Memory

A research-driven exploration of why semantic similarity alone is insufficient for AI memory systems, and how retrieval quality depends on ranking policies, recent context, hybrid retrieval, and adaptive query-aware strategies.

#AI Memory #Retrieval #System Design

Overview

One of the easiest mistakes to make when building AI memory systems is assuming that semantic similarity equals memory.

At first, vector retrieval feels almost magical.

You embed previous conversations, compute similarity scores, and suddenly the system can "remember" related information from earlier interactions.

But after enough experiments, a more uncomfortable reality appears:

high similarity does not guarantee that the required knowledge is actually present.

This became one of the central findings during my Agent Memory Lab experiments.

The project started as a small semantic retrieval exploration and gradually evolved into a broader investigation of:

retrieval drift
ranking systems
summary memory compression
project-aware retrieval
hybrid retrieval
adaptive retrieval policies

The biggest lesson was simple:

semantic similarity is useful, but it is not memory.

The Initial Assumption

The first version of the retrieval system was straightforward:

query
  ↓
embedding
  ↓
cosine similarity
  ↓
top matching memories
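A minimal sketch of this first pipeline, assuming the embeddings are already computed. The three-dimensional vectors, the memory texts, and the function names are illustrative placeholders, not the lab's actual code or a real embedding model.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)

def top_k_memories(query_vec, memories, k=3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in memories]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Toy vectors standing in for real embeddings (hypothetical values).
memories = [
    ("memory ranking notes", [0.9, 0.1, 0.0]),
    ("retrieval architecture notes", [0.8, 0.3, 0.1]),
    ("weekend travel plans", [0.0, 0.1, 0.9]),
]
query_vec = [1.0, 0.2, 0.0]
results = top_k_memories(query_vec, memories, k=2)
```

Nothing in this loop knows which project is active or what was said a moment ago. The score depends only on the angle between vectors.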

For direct conceptual queries, this worked surprisingly well.

Example:

How can I improve AI memory retrieval systems?

The retrieval system successfully returned memories related to:

memory ranking
semantic retrieval
retrieval architecture
context selection

At first glance, the system appeared to work.

But small failures started to reveal larger problems.

Retrieval Drift

One of the earliest observations was semantic drift.

The retrieval system often returned conceptually adjacent memories instead of the intended ones.

For example:

How should I continue the system?

This query was intentionally ambiguous.

The retrieval system responded with a mixture of:

MeOS planning memories
career positioning memories
writing ideas
general systems-thinking memories

The system understood that the query was related to "systems", but it could not determine which system the conversation was referring to.

This exposed an important limitation:

semantic retrieval alone cannot maintain conversational continuity.

The embeddings captured conceptual proximity, but not active conversational state.

Why Recent Context Matters

To test this, I added recent conversational context before retrieval:

We have been discussing MeOS memory architecture and planning systems.

Then I retried the same query:

How should I continue the system?

The retrieval behavior changed dramatically.

Instead of drifting into unrelated memories, the system consistently returned:

MeOS kernel planning
memory architecture
orchestration
scope-control decisions

This revealed another important insight:

recent conversational state stabilizes retrieval.
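The effect can be reproduced in miniature by prepending recent turns to the query before scoring. The sketch below uses a bag-of-words vector as a stand-in for a real embedding, and the two memory texts are invented for illustration, so it only shows the mechanism, not the lab's actual behavior.

```python
import math
import re
from collections import Counter

def bag_of_words(text):
    """Crude stand-in for an embedding: lowercase token counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def build_retrieval_query(query, recent_turns, max_turns=3):
    """Prepend the most recent turns so the query carries the active topic."""
    return " ".join(recent_turns[-max_turns:] + [query])

def best_match(query_text, memories):
    q = bag_of_words(query_text)
    return max(memories, key=lambda m: cosine(q, bag_of_words(m)))

# Hypothetical memory texts, chosen only to illustrate drift.
memories = [
    "Writing ideas about the system of habits",
    "MeOS kernel planning and memory architecture",
]
query = "How should I continue the system?"
recent_turns = ["We have been discussing MeOS memory architecture and planning systems."]

without_context = best_match(query, memories)
with_context = best_match(build_retrieval_query(query, recent_turns), memories)
```

With the bare query the unrelated writing memory wins, because it happens to share the word "system". With the recent turn prepended, the MeOS memory wins.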

Modern AI systems do not rely only on semantic memory.

They also rely on:

short-term conversational state
active project awareness
ranking policies
context prioritization

Without those layers, semantic retrieval becomes unstable.

Summary Memory Compression

The next experiment explored summary-based memory compression.

This is a common strategy in long-running AI systems.

Instead of storing full conversation history, the system compresses older interactions into summaries.

Example summary:

MeOS is an AI operating system project focused on planning, memory, and agent orchestration.

At first, the summary looked useful.

It preserved the high-level meaning of the project.

But then I tested a factual query:

What was the original name of MeOS?

The summary still scored a relatively high similarity against this query.

However, it did not actually contain the answer.

The correct answer was:

YigitOS

This became one of the most important findings in the lab:

similarity creates the illusion of memory coverage.

The embedding space suggested that the memory was relevant, but the factual information itself had already been lost during compression.

This is a dangerous failure mode.

A retrieval system can appear intelligent while silently losing critical details.
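One way to make this loss visible is to keep a few question and answer probes and check them against each summary after compression. This is an assumption about how such a guard could look, not something the lab implemented. The `lost_facts` helper and the second probe are hypothetical.

```python
def lost_facts(summary, fact_probes):
    """Return the probes whose expected answer no longer appears in the summary."""
    lowered = summary.lower()
    return [(question, answer) for question, answer in fact_probes
            if answer.lower() not in lowered]

summary = ("MeOS is an AI operating system project focused on planning, "
           "memory, and agent orchestration.")

fact_probes = [
    ("What was the original name of MeOS?", "YigitOS"),
    ("What is MeOS focused on?", "planning"),  # hypothetical extra probe
]

missing = lost_facts(summary, fact_probes)
```

A substring check is deliberately crude. It says nothing about relevance, only whether the literal answer survived compression, which is exactly the property a similarity score cannot report.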

Retrieval Is a Ranking Problem

As the experiments continued, another pattern became obvious.

The retrieval problem was not just about embeddings.

It was about ranking.

The system gradually evolved from:

semantic similarity only

to:

semantic similarity
+ importance
+ recency
+ project relevance

This improved retrieval quality significantly.

Recent project memories became easier to retrieve.

Active project workflows became more stable.
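The combined ranking above can be sketched as a weighted sum. The weights, the 14-day half-life, the `Memory` fields, and the sample values are all illustrative assumptions, since the write-up does not give the actual formula.

```python
from dataclasses import dataclass

@dataclass
class Memory:
    text: str
    similarity: float  # cosine similarity to the query, assumed precomputed
    importance: float  # 0..1, assigned when the memory is stored
    age_days: float    # time since the memory was written
    project: str

# Hypothetical weights; a real system would tune these.
WEIGHTS = {"similarity": 0.5, "importance": 0.2, "recency": 0.2, "project": 0.1}

def recency_score(age_days, half_life_days=14.0):
    """Exponential decay: the score halves every half_life_days."""
    return 0.5 ** (age_days / half_life_days)

def rank_score(memory, active_project, weights=WEIGHTS):
    project_match = 1.0 if memory.project == active_project else 0.0
    return (weights["similarity"] * memory.similarity
            + weights["importance"] * memory.importance
            + weights["recency"] * recency_score(memory.age_days)
            + weights["project"] * project_match)

memories = [
    Memory("old writing idea", similarity=0.80, importance=0.3, age_days=90.0, project="writing"),
    Memory("MeOS kernel planning", similarity=0.70, importance=0.8, age_days=2.0, project="MeOS"),
]

by_similarity = sorted(memories, key=lambda m: m.similarity, reverse=True)
by_rank = sorted(memories, key=lambda m: rank_score(m, "MeOS"), reverse=True)
```

On similarity alone the old writing idea ranks first. Once importance, recency, and the project match are added, the recent MeOS memory overtakes it.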

But new problems emerged.

Large conceptual memories often suppressed smaller factual memories.

Broad architectural memories dominated the ranking space because they:

contained more overlapping concepts
had higher importance
matched active project tags
had stronger semantic density

This created a new failure case:

exact factual memories became difficult to retrieve.

Hybrid Retrieval

To address factual retrieval failures, I introduced a simple hybrid retrieval strategy.

Instead of relying only on semantic similarity, the system also used keyword overlap.

The ranking pipeline became:

semantic similarity
+ keyword overlap
+ importance
+ recency
+ project relevance

This immediately improved factual retrieval.

The MeOS naming memory finally appeared inside the top retrieval results.

But even then, broader conceptual memories still ranked above it.

Why?

Because conceptual memories carried stronger overall ranking signals.

This exposed another important idea:

factual memory and conceptual memory compete differently inside retrieval systems.

Semantic retrieval is naturally better at concepts than exact lookup.

This is one reason why many production RAG systems combine:

vector search
keyword search
metadata lookup
entity indexing

instead of relying only on embeddings.
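The keyword-overlap signal from the hybrid pipeline can be sketched as the fraction of query content words found in a memory. The stopword list, the 0.6/0.4 weights, and the similarity values are assumptions made for illustration. The other ranking signals are left out to keep the example small.

```python
import re

# Small hypothetical stopword list; a real system would use a fuller one.
STOPWORDS = {"a", "an", "the", "of", "was", "is", "what", "how", "to", "and"}

def content_tokens(text):
    """Lowercase word set with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}

def keyword_overlap(query, memory_text):
    """Fraction of the query's content words that appear in the memory."""
    query_tokens = content_tokens(query)
    if not query_tokens:
        return 0.0
    return len(query_tokens & content_tokens(memory_text)) / len(query_tokens)

def hybrid_score(similarity, overlap, semantic_weight=0.6, keyword_weight=0.4):
    return semantic_weight * similarity + keyword_weight * overlap

query = "What was the original name of MeOS?"
fact = "The original name of MeOS was YigitOS."
summary = ("MeOS is an AI operating system project focused on planning, "
           "memory, and agent orchestration.")

# Assumed similarity values: the broad summary scores slightly higher.
fact_score = hybrid_score(0.62, keyword_overlap(query, fact))
summary_score = hybrid_score(0.70, keyword_overlap(query, summary))
```

With only these two signals the factual memory comes out ahead. Adding importance, recency, and project relevance back in can still tip the balance toward the broader memory, which is the competition described above.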

Query-Aware Retrieval Policies

The final major experiment introduced query-type-aware retrieval policies.

Instead of using a single universal ranking strategy, the system classified queries into:

factual
conceptual
planning

Each query type used different ranking priorities.

Examples:

Factual queries

What was the original name of MeOS?

Ranking for these focused more heavily on:

keyword overlap
exact matches
entity relevance

Planning queries

How should I continue MeOS?

Ranking for these focused more heavily on:

recent context
active project continuity
recency weighting

Conceptual queries

Why is semantic similarity not enough?

Ranking for these focused more heavily on:

semantic similarity
conceptual proximity
architecture discussions

This adaptive approach improved retrieval quality significantly.

The system was no longer treating every query as the same retrieval problem.

That turned out to be a major architectural shift.
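A query-aware policy can be sketched as a classifier that selects a weight table. The regex rules and the weights below are hypothetical stand-ins. The write-up does not say how queries were classified, and a real system might use a model rather than patterns.

```python
import re

# Hypothetical per-type weights over the ranking signals.
POLICIES = {
    "factual":    {"similarity": 0.3, "keyword": 0.5, "recency": 0.1, "project": 0.1},
    "planning":   {"similarity": 0.3, "keyword": 0.1, "recency": 0.3, "project": 0.3},
    "conceptual": {"similarity": 0.7, "keyword": 0.1, "recency": 0.1, "project": 0.1},
}

def classify_query(query):
    """Very rough rule-based classifier, for illustration only."""
    q = query.lower().strip()
    if re.match(r"(what|who|when|where|which)\b", q):
        return "factual"
    if re.search(r"\b(continue|next|plan|should i)\b", q):
        return "planning"
    return "conceptual"

def policy_score(signals, query):
    """Weight the memory's signals using the policy chosen for this query."""
    weights = POLICIES[classify_query(query)]
    return sum(weights[name] * signals.get(name, 0.0) for name in weights)

# Signals for one memory, each assumed to be normalized to 0..1.
signals = {"similarity": 0.6, "keyword": 1.0, "recency": 0.2, "project": 1.0}

factual_score = policy_score(signals, "What was the original name of MeOS?")
conceptual_score = policy_score(signals, "Why is semantic similarity not enough?")
```

The same memory scores higher under the factual policy than under the conceptual one, because its exact keyword match counts for more there.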

The Real Lesson

At the beginning of the lab, I thought memory retrieval was mostly about embeddings.

After multiple iterations, the problem looked very different.

The real challenge was:

ranking
context selection
retrieval policy design
continuity management
balancing factual and conceptual memory

Embeddings were only one layer.

The larger system behavior depended on how memories were selected, prioritized, compressed, and injected back into context.

This is why:

semantic similarity is not memory.

Memory systems are ultimately orchestration systems.

Closing Thoughts

The most interesting part of the lab was not building retrieval scripts.

It was watching small retrieval failures slowly evolve into architectural questions.

Questions like:

Should all memories compete in the same ranking space?
Should factual queries use different retrieval policies?
How much should recency influence ranking?
When does summary compression become dangerous?
How much conversational state is required for continuity?

Those questions matter much more than simply computing embeddings.

The lab is still exploratory.

The goal is not to build a production memory engine.

The goal is to understand the tradeoffs behind retrieval systems used in AI agents.
