
Multi-Agent Memory Benchmarks

Multi-agent systems are everywhere. Memory in multi-agent systems is nowhere. We benchmarked how well leading agent frameworks retain user context across agent handoffs - and the results are concerning.

Robert Ta, CEO & Co-Founder · 8 min read

TL;DR

  • Multi-agent systems lose 60-80 percent of user context at each agent handoff because current frameworks pass raw chat history rather than structured understanding between agents
  • We benchmarked 5 leading agent frameworks and found that after 4 handoffs, only 8 percent of original user context remained actionable - the rest was lost, degraded, or buried in noise
  • Shared self-models solve this by giving all agents access to a structured, confidence-weighted user model that persists across handoffs, improving context retention from 34 percent to 87 percent

Multi-agent memory benchmarks reveal that current agent frameworks lose 60-80% of user context at each handoff, with only 8% of original context remaining actionable after four sequential agents. This context degradation explains why multi-agent systems feel impersonal and disjointed despite individual agents performing well. This post covers benchmark results across five leading frameworks, why chat history fails as inter-agent memory, and how shared self-models improve context retention from 34% to 87% per handoff.

34%: average user context retained per agent handoff across 5 frameworks
8%: user context remaining actionable after 4 sequential handoffs
87%: context retention with shared self-models instead of chat history

The Benchmark Setup

We designed a benchmark that mirrors real-world multi-agent workflows. The test scenario was a content strategy workflow for a fintech company with four sequential agents:

Agent 1 (Researcher): Gathers information about the user’s domain, audience, and content goals. Conducts an initial conversation to understand requirements.

Agent 2 (Strategist): Takes the research and develops a content strategy, including topic selection, scheduling, and channel planning.

Agent 3 (Writer): Produces content drafts based on the strategy, adapting tone and style to the user’s preferences.

Agent 4 (Reviewer): Evaluates the drafts against the strategy and the user’s original requirements, suggesting revisions.

We tested 20 user scenarios across 5 agent frameworks. Each scenario included specific user preferences, domain knowledge, and stylistic requirements established during the research phase. We measured how much of that context survived each handoff.

The measurement was straightforward: for each user preference and requirement established in the research phase, did it remain actionable (correctly informing agent behavior) at each subsequent handoff?
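As a sketch, the per-handoff retention metric reduces to a simple ratio. The types and names below are illustrative only (in the actual benchmark, whether a requirement was "actionable" was graded by evaluators, not computed):

```typescript
// Hypothetical sketch of the retention metric: each requirement established
// in the research phase is checked against a later agent's behavior.
interface Requirement {
  id: string;
  description: string;
}

// `actionableIds` is the set of requirement IDs that demonstrably
// influenced the agent's output at a given handoff.
function contextRetention(
  established: Requirement[],
  actionableIds: Set<string>,
): number {
  if (established.length === 0) return 1;
  const surviving = established.filter((r) => actionableIds.has(r.id)).length;
  return surviving / established.length;
}

const reqs: Requirement[] = [
  { id: 'tone', description: 'Avoid jargon' },
  { id: 'format', description: 'Case-study-driven arguments' },
  { id: 'domain', description: 'Fintech compliance context' },
  { id: 'audience', description: 'CFO-level readers' },
];

// After one handoff, only two requirements still shaped the agent's output.
console.log(contextRetention(reqs, new Set(['tone', 'domain']))); // 0.5
```

Retention per handoff is always measured against the original research-phase set, which is why the end-to-end numbers compound downward.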

The Results

The results were consistent across frameworks. Context degradation at each handoff was severe and compounding.

Handoff 1 (Researcher to Strategist): Average 62 percent context retention. Major items like domain and goals survived. Nuances like tone preferences, specific audience characteristics, and domain-specific terminology requirements were lost or buried.

Handoff 2 (Strategist to Writer): Average 41 percent context retention from origin. The strategist’s output focused on strategy, not user context. The writer received strategic direction but limited user-specific preferences. Original tone requirements survived at 30 percent.

Handoff 3 (Writer to Reviewer): Average 23 percent context retention from origin. The reviewer received drafts and strategy, but almost no original user context. Specific preferences established in the research phase - like the user’s aversion to jargon or their preference for case-study-driven arguments - were entirely absent.

End-to-end: After 4 handoffs, 8 percent of the original user context remained actionable. The final agent (Reviewer) had almost no information about the user’s actual preferences, making its review generic rather than personalized.


Without Shared Memory (Chat History Passing)

  • Each handoff passes a conversation log or summary to the next agent
  • Context degrades 40-60% per handoff through summarization loss
  • After 4 handoffs, only 8% of user context remains actionable
  • Final agent produces generic output with no personalization

With Shared Memory (Self-Model Integration)

  • All agents query shared self-model for user understanding
  • Context retention stays above 85% regardless of handoff count
  • Every agent inherits structured beliefs, not raw transcripts
  • Final agent produces personalized output informed by full user model

Why Chat History Fails as Memory

The core issue is that current agent frameworks use conversation history as their memory mechanism. This fails for three reasons.

Summarization loss. When conversation history is too long for the next agent’s context window, it gets summarized. Each summarization drops details. The summarizer does not know which details matter for downstream agents, so it optimizes for brevity at the cost of specificity. User preferences are exactly the kind of specific details that get dropped.

Signal-to-noise ratio. A research conversation between the user and the researcher agent contains a mix of user context (valuable downstream) and conversation mechanics (greetings, clarifications, false starts). When this entire conversation is passed to the next agent, the user context is buried in noise. The strategist has to extract meaning from a transcript, which is like trying to learn about someone by reading the transcript of their doctor’s visit.

No confidence weighting. Chat history treats all statements equally. A casual mention (I think I might prefer shorter articles) and a firm requirement (our compliance team requires all content to be reviewed by legal) look the same in a transcript. There is no signal for how confident or important each piece of context is.
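The difference is visible in the data shape itself. A structured belief carries the confidence signal that a transcript flattens away. The `Belief` shape below is ours for illustration, not any framework's API:

```typescript
// A structured belief attaches a confidence score and provenance,
// unlike a line in a transcript where every statement looks equal.
interface Belief {
  statement: string;
  confidence: number; // 0..1
  source: 'explicit_requirement' | 'casual_mention';
}

const beliefs: Belief[] = [
  {
    statement: 'Prefers shorter articles',
    confidence: 0.4,
    source: 'casual_mention',
  },
  {
    statement: 'All content must pass legal review',
    confidence: 0.98,
    source: 'explicit_requirement',
  },
];

// Downstream agents can treat high-confidence beliefs as hard constraints
// and low-confidence ones as tentative suggestions.
const hardConstraints = beliefs.filter((b) => b.confidence >= 0.9);
console.log(hardConstraints.length); // 1
```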


The Self-Model Solution

The fix is to replace chat history as the inter-agent memory with a structured self-model. Instead of passing conversation logs between agents, all agents share access to a common user model that each agent can read from and write to.

Here is how this works architecturally.

multi-agent-shared-memory.ts
```typescript
// Without shared memory: passing chat history. Context degrades each handoff.
const handoff = {
  chatHistory: researcher.getConversationLog(), // 4000 tokens of mixed signal
  taskSummary: researcher.summarize(), // lossy compression
};
strategist.start(handoff); // 62% context survival

// With shared memory: the self-model is shared state. Context persists
// across all agents.
// Agent 1 (Researcher) writes observations to the shared model
await clarity.addObservation(userId, {
  action: 'user_preference_expressed',
  details: { preference: 'concise technical content', confidence: 0.85 },
  context: 'communication_style',
});

// Agent 2 (Strategist) reads from the same shared model
const selfModel = await clarity.getSelfModel(userId);
// Gets structured beliefs with confidence scores - not a raw transcript
// 87% context survival, regardless of how many agents follow
```

The self-model acts as a shared database of user understanding. Each agent contributes observations from its interactions, and each agent queries beliefs relevant to its task. The researcher adds observations about user preferences. The strategist queries those preferences when building strategy. The writer queries tone and style beliefs. The reviewer queries original requirements.

Because the self-model is structured and confidence-weighted, there is no summarization loss. The model does not degrade with each access - it is a database query, not a conversation summary. And because beliefs have confidence scores, downstream agents know which preferences are firm requirements and which are tentative suggestions.
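A minimal sketch of that read path, assuming a hypothetical `SelfModel` shape and `beliefsFor` helper (not a documented Clarity API): each agent pulls only the belief categories relevant to its task, filtered by a confidence threshold.

```typescript
// Illustrative read path for downstream agents. The shapes and the
// threshold value are assumptions for the sketch.
interface Belief {
  category: string;
  statement: string;
  confidence: number;
}
interface SelfModel {
  beliefs: Belief[];
}

function beliefsFor(
  model: SelfModel,
  categories: string[],
  minConfidence = 0.5,
): Belief[] {
  return model.beliefs.filter(
    (b) => categories.includes(b.category) && b.confidence >= minConfidence,
  );
}

const model: SelfModel = {
  beliefs: [
    { category: 'communication_style', statement: 'Concise technical content', confidence: 0.85 },
    { category: 'communication_style', statement: 'Might want more visuals', confidence: 0.2 },
    { category: 'requirements', statement: 'Legal review required', confidence: 0.98 },
  ],
};

// The Writer queries only style beliefs; the low-confidence one is excluded.
console.log(beliefsFor(model, ['communication_style']).length); // 1
```

Because this is a query over structured state rather than a re-summarization, the same call returns the same beliefs whether it is the second agent asking or the eighth.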

Framework-Specific Findings

Each framework we benchmarked handled memory differently, but the core limitation was consistent.

LangGraph: Uses state graphs with shared state objects. The state mechanism is flexible but unstructured - teams must define their own memory schema. Without explicit user model design, teams default to passing messages, which degrades.

CrewAI: Agents share context through task descriptions and shared memory. The shared memory is a text-based store, which suffers from the same signal-to-noise issues as chat history. Better than raw transcripts but still unstructured.

AutoGen: Uses conversation-based memory where agents can access the group chat history. This is the most chat-history-dependent approach and showed the highest degradation rates in our benchmark.

Swarm (OpenAI): Uses function-based handoffs with context variables. The context variable mechanism is more structured than chat history but still requires manual context management. Better for task-level context, weaker for user-level understanding.

Custom orchestrators: Teams building their own orchestration had the most variance. Those who explicitly designed user memory outperformed all frameworks. Those who relied on ad hoc context passing performed worst.


| Framework | Handoff Mechanism | Per-Handoff Retention | End-to-End (4 Handoffs) | With Self-Model Integration |
| --- | --- | --- | --- | --- |
| LangGraph | State graphs | 38% | 10% | 86% |
| CrewAI | Shared text memory | 35% | 9% | 85% |
| AutoGen | Group chat history | 28% | 5% | 84% |
| Swarm | Context variables | 42% | 12% | 89% |
| Custom | Varies | 27-52% | 4-15% | 83-91% |

The Implications for Enterprise

The memory problem in multi-agent systems is particularly concerning for enterprise deployments where multi-agent workflows are becoming standard.

Enterprise use cases - customer support escalation, multi-step onboarding, complex document processing - involve 3-8 agent handoffs. At 34 percent retention per handoff, a 6-agent workflow retains only 1.5 percent of original user context by the final agent. The customer experience degrades to near-zero personalization by the end of the workflow.

This is why enterprise customers report that multi-agent AI systems feel impersonal and disjointed. Each individual agent might be excellent. But the system as a whole fails because context evaporates at the seams.

Trade-offs

Self-model integration adds latency. Each agent must query the self-model before acting, adding 50-200ms per handoff depending on model complexity. For real-time conversational agents, this latency may be noticeable. For batch or semi-synchronous workflows, it is negligible.

Shared models require access control design. In multi-agent systems, not all agents should have access to all user beliefs. A research agent might add sensitive information that a marketing agent should not access. The self-model needs access control layers that match the agent permission model.

Model write conflicts are possible. When multiple agents observe the user simultaneously (in parallel agent architectures), they may produce conflicting observations. The self-model needs a conflict resolution strategy - typically confidence-weighted merging or last-write-wins with confidence decay.
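A confidence-weighted merge with last-write-wins tie-breaking might look like the following sketch. The 0.8 decay factor and the function names are assumptions for illustration, not a shipped algorithm:

```typescript
// One possible conflict-resolution strategy for parallel agents writing
// to the same self-model: keep the higher-confidence observation; on a
// tie, take the most recent write and decay its confidence to reflect
// that the signal was contested.
interface Observation {
  statement: string;
  confidence: number;
  timestamp: number;
}

function mergeConflicting(a: Observation, b: Observation): Observation {
  if (a.confidence === b.confidence) {
    const winner = a.timestamp >= b.timestamp ? a : b;
    return { ...winner, confidence: winner.confidence * 0.8 };
  }
  return a.confidence > b.confidence ? a : b;
}

const fromResearcher: Observation = {
  statement: 'Prefers formal tone',
  confidence: 0.9,
  timestamp: 100,
};
const fromWriter: Observation = {
  statement: 'Prefers casual tone',
  confidence: 0.6,
  timestamp: 200,
};

// Higher confidence wins even though the writer's observation is newer.
console.log(mergeConflicting(fromResearcher, fromWriter).statement);
```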

Framework integration is not yet standardized. Each agent framework has different extension points for memory integration. Integrating self-models requires framework-specific adapters. As the agent ecosystem matures, standardized memory interfaces will emerge, but today each integration is custom work.

What to Do Next

  1. Benchmark your own agent pipeline. Run the same test we did: establish user context in the first agent and measure how much survives to the last agent. If you have 3+ handoffs and no structured memory, you are likely losing over 80 percent of user context. Quantify the problem before solving it.

  2. Audit what each agent needs to know about the user. For each agent in your pipeline, list the user-level context it needs to perform its task well. This is not task context (what to do) but user context (who they are doing it for). The union of these lists defines your self-model schema.

  3. Evaluate shared self-model architecture for your agent pipeline. Clarity provides structured, confidence-weighted user models that any agent can query and contribute to. The self-model becomes the shared memory layer that agent frameworks are missing. See if self-models fit your multi-agent architecture.
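The schema audit in step 2 can be mechanized: list each agent's user-level needs and take the union. The field names below are examples for the content-strategy pipeline, not a standard:

```typescript
// Sketch of the schema audit: the self-model schema is the union of the
// user-level context each agent in the pipeline needs.
const agentNeeds: Record<string, string[]> = {
  researcher: ['domain', 'audience', 'content_goals'],
  strategist: ['audience', 'channels', 'cadence_preferences'],
  writer: ['tone', 'style_preferences', 'jargon_tolerance'],
  reviewer: ['original_requirements', 'tone', 'compliance_constraints'],
};

// Deduplicate across agents; the result defines what the shared
// self-model must be able to represent.
const schema = Array.from(
  new Set(Object.values(agentNeeds).flat()),
).sort();
console.log(schema.length); // 10 distinct fields
```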


Your agents are smart individually. They are amnesic collectively. Shared self-models fix the memory problem. Build the shared memory your agents need.

