Multi-Agent Memory Benchmarks
Multi-agent systems are everywhere. Memory in multi-agent systems is nowhere. We benchmarked how well leading agent frameworks retain user context across agent handoffs - and the results are concerning.
TL;DR
- Multi-agent systems lose 60-80 percent of user context at each agent handoff because current frameworks pass raw chat history rather than structured understanding between agents
- We benchmarked 5 leading agent frameworks and found that after 4 handoffs, only 8 percent of original user context remained actionable - the rest was lost, degraded, or buried in noise
- Shared self-models solve this by giving all agents access to a structured, confidence-weighted user model that persists across handoffs, improving context retention from 34 percent to 87 percent
Multi-agent memory benchmarks reveal that current agent frameworks lose 60-80% of user context at each handoff, with only 8% of original context remaining actionable after four sequential agents. This context degradation explains why multi-agent systems feel impersonal and disjointed despite individual agents performing well. This post covers benchmark results across five leading frameworks, why chat history fails as inter-agent memory, and how shared self-models improve context retention from 34% to 87% per handoff.
The Benchmark Setup
We designed a benchmark that mirrors real-world multi-agent workflows. The test scenario was a content strategy workflow for a fintech company with four sequential agents:
Agent 1 (Researcher): Gathers information about the user’s domain, audience, and content goals. Conducts an initial conversation to understand requirements.
Agent 2 (Strategist): Takes the research and develops a content strategy, including topic selection, scheduling, and channel planning.
Agent 3 (Writer): Produces content drafts based on the strategy, adapting tone and style to the user’s preferences.
Agent 4 (Reviewer): Evaluates the drafts against the strategy and the user’s original requirements, suggesting revisions.
We tested 20 user scenarios across 5 agent frameworks. Each scenario included specific user preferences, domain knowledge, and stylistic requirements established during the research phase. We measured how much of that context survived each handoff.
The measurement was straightforward: for each user preference and requirement established in the research phase, did it remain actionable (correctly informing agent behavior) at each subsequent handoff?
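In code, the scoring looks roughly like the sketch below. This is a minimal illustration of the metric, not our actual harness; the `remainsActionable` judge (a rubric or LLM-as-judge check of the downstream agent's behavior) is a hypothetical stand-in.

```typescript
// Minimal sketch of the retention metric: what fraction of the context items
// established in the research phase still correctly inform agent behavior
// at a given handoff. `remainsActionable` is a hypothetical judge function.
interface ContextItem {
  id: string;
  description: string; // e.g. "avoids jargon", "prefers case-study-driven arguments"
}

function contextRetention(
  items: ContextItem[],
  remainsActionable: (item: ContextItem) => boolean,
): number {
  if (items.length === 0) return 1;
  const survived = items.filter(remainsActionable).length;
  return survived / items.length; // e.g. 0.62 after handoff 1, 0.08 end-to-end
}
```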
The Results
The results were consistent across frameworks. Context degradation at each handoff was severe and compounding.
Handoff 1 (Researcher to Strategist): Average 62 percent context retention. Major items like domain and goals survived. Nuances like tone preferences, specific audience characteristics, and domain-specific terminology requirements were lost or buried.
Handoff 2 (Strategist to Writer): Average 41 percent context retention from origin. The strategist’s output focused on strategy, not user context. The writer received strategic direction but limited user-specific preferences. Original tone requirements survived at 30 percent.
Handoff 3 (Writer to Reviewer): Average 23 percent context retention from origin. The reviewer received drafts and strategy, but almost no original user context. Specific preferences established in the research phase - like the user’s aversion to jargon or their preference for case-study-driven arguments - were entirely absent.
End-to-end: After 4 handoffs, 8 percent of the original user context remained actionable. The final agent (Reviewer) had almost no information about the user’s actual preferences, making its review generic rather than personalized.
Without Shared Memory (Chat History Passing)
- × Each handoff passes a conversation log or summary to the next agent
- × Context degrades 40-60% per handoff through summarization loss
- × After 4 handoffs only 8% of user context remains actionable
- × Final agent produces generic output with no personalization
With Shared Memory (Self-Model Integration)
- ✓ All agents query a shared self-model for user understanding
- ✓ Context retention stays above 85% regardless of handoff count
- ✓ Every agent inherits structured beliefs, not raw transcripts
- ✓ Final agent produces personalized output informed by the full user model
Why Chat History Fails as Memory
The core issue is that current agent frameworks use conversation history as their memory mechanism. This fails for three reasons.
Summarization loss. When conversation history is too long for the next agent’s context window, it gets summarized. Each summarization drops details. The summarizer does not know which details matter for downstream agents, so it optimizes for brevity at the cost of specificity. User preferences are exactly the kind of specific details that get dropped.
Signal-to-noise ratio. A research conversation between the user and the researcher agent contains a mix of user context (valuable downstream) and conversation mechanics (greetings, clarifications, false starts). When this entire conversation is passed to the next agent, the user context is buried in noise. The strategist has to extract meaning from a transcript, which is like trying to learn about someone by reading the transcript of their doctor’s visit.
No confidence weighting. Chat history treats all statements equally. A casual mention (I think I might prefer shorter articles) and a firm requirement (our compliance team requires all content to be reviewed by legal) look the same in a transcript. There is no signal for how confident or important each piece of context is.
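To make that contrast concrete, here is how those two statements might look as structured observations with explicit confidence, which a flat transcript cannot express. The shape is illustrative only, not a specific API.

```typescript
// Illustrative only: the same two statements stored with explicit confidence.
// In a raw transcript, both are just lines of dialogue with equal weight.
const observations = [
  {
    statement: 'I think I might prefer shorter articles',
    context: 'content_length',
    confidence: 0.4, // casual, tentative mention
  },
  {
    statement: 'Our compliance team requires all content to be reviewed by legal',
    context: 'review_process',
    confidence: 0.95, // firm, non-negotiable requirement
  },
];
```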
The Self-Model Solution
The fix is to replace chat history as the inter-agent memory with a structured self-model. Instead of passing conversation logs between agents, all agents share access to a common user model that each agent can read from and write to.
Here is how this works architecturally.
```typescript
// Without shared memory: passing chat history (context degrades each handoff)
const handoff = {
  chatHistory: researcher.getConversationLog(), // 4000 tokens of mixed signal
  taskSummary: researcher.summarize(), // lossy compression
};
strategist.start(handoff); // 62% context survival

// With shared memory: self-model as shared state (context persists across all agents)
// Agent 1 (Researcher) writes observations to shared model
await clarity.addObservation(userId, {
  action: 'user_preference_expressed',
  details: { preference: 'concise technical content', confidence: 0.85 },
  context: 'communication_style',
});

// Agent 2 (Strategist) reads from same shared model
const selfModel = await clarity.getSelfModel(userId);
// Gets structured beliefs with confidence scores - not raw transcript
// 87% context survival, regardless of how many agents follow
```
The self-model acts as a shared database of user understanding. Each agent contributes observations from its interactions, and each agent queries beliefs relevant to its task. The researcher adds observations about user preferences. The strategist queries those preferences when building strategy. The writer queries tone and style beliefs. The reviewer queries original requirements.
Because the self-model is structured and confidence-weighted, there is no summarization loss. The model does not degrade with each access - it is a database query, not a conversation summary. And because beliefs have confidence scores, downstream agents know which preferences are firm requirements and which are tentative suggestions.
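As a sketch of what that looks like downstream (the belief shape - `context` labels and `confidence` fields - is assumed here for illustration, not taken from a documented schema):

```typescript
// Hypothetical downstream reads against the shared self-model. The belief
// fields (context, confidence) are assumed for illustration.
const selfModel = await clarity.getSelfModel(userId);

// Writer: tone and style beliefs, treating low-confidence ones as soft hints
const styleBeliefs = selfModel.beliefs.filter(
  (b) => b.context === 'communication_style',
);

// Reviewer: only the firm requirements established during research
const firmRequirements = selfModel.beliefs.filter((b) => b.confidence >= 0.9);
```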
Framework-Specific Findings
Each framework we benchmarked handled memory differently, but the core limitation was consistent.
LangGraph: Uses state graphs with shared state objects. The state mechanism is flexible but unstructured - teams must define their own memory schema. Without explicit user model design, teams default to passing messages, which degrades.
CrewAI: Agents share context through task descriptions and shared memory. The shared memory is a text-based store, which suffers from the same signal-to-noise issues as chat history. Better than raw transcripts but still unstructured.
AutoGen: Uses conversation-based memory where agents can access the group chat history. This is the most chat-history-dependent approach and showed the highest degradation rates in our benchmark.
Swarm (OpenAI): Uses function-based handoffs with context variables. The context variable mechanism is more structured than chat history but still requires manual context management. Better for task-level context, weaker for user-level understanding.
Custom orchestrators: Teams building their own orchestration had the most variance. Those who explicitly designed user memory outperformed all frameworks. Those who relied on ad hoc context passing performed worst.
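Whatever the framework, the approach that outperformed in our benchmark was making the user model an explicit, typed field in the shared state rather than relying on the message log. A framework-agnostic sketch, with illustrative names that do not belong to any particular framework's API:

```typescript
// Framework-agnostic sketch: carry user-level context as an explicit field
// that handoffs never summarize, separate from task-level messages.
interface UserBelief {
  statement: string;
  context: string;
  confidence: number;
}

interface WorkflowState {
  messages: string[];      // task-level context; may be compressed or replaced
  userModel: UserBelief[]; // user-level context; carried forward intact
}

function handoff(state: WorkflowState, notesForNextAgent: string[]): WorkflowState {
  return {
    messages: notesForNextAgent, // safe to summarize or rewrite
    userModel: state.userModel,  // never lost at the seam
  };
}
```

The table below summarizes per-handoff and end-to-end retention for each framework, with and without self-model integration.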
| Framework | Handoff Mechanism | Per-Handoff Retention | End-to-End (4 Handoffs) | With Self-Model Integration |
|---|---|---|---|---|
| LangGraph | State graphs | 38% | 10% | 86% |
| CrewAI | Shared text memory | 35% | 9% | 85% |
| AutoGen | Group chat history | 28% | 5% | 84% |
| Swarm | Context variables | 42% | 12% | 89% |
| Custom | Varies | 27-52% | 4-15% | 83-91% |
The Implications for Enterprise
The memory problem in multi-agent systems is particularly concerning for enterprise deployments where multi-agent workflows are becoming standard.
Enterprise use cases - customer support escalation, multi-step onboarding, complex document processing - involve 3-8 agent handoffs. At 34 percent retention per handoff, a six-agent workflow retains under 2 percent of the original user context by the time the final agent acts. The customer experience degrades to near-zero personalization by the end of the workflow.
This is why enterprise customers report that multi-agent AI systems feel impersonal and disjointed. Each individual agent might be excellent. But the system as a whole fails because context evaporates at the seams.
Trade-offs
Self-model integration adds latency. Each agent must query the self-model before acting, adding 50-200ms per handoff depending on model complexity. For real-time conversational agents, this latency may be noticeable. For batch or semi-synchronous workflows, it is negligible.
Shared models require access control design. In multi-agent systems, not all agents should have access to all user beliefs. A research agent might add sensitive information that a marketing agent should not access. The self-model needs access control layers that match the agent permission model.
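One simple way to express this is a per-role allowlist over belief categories. The role names and helper below are hypothetical, not an existing permission API:

```typescript
// Hypothetical per-role allowlist over belief categories.
type AgentRole = 'researcher' | 'strategist' | 'writer' | 'reviewer';

interface Belief {
  statement: string;
  context: string;
  confidence: number;
}

const allowedContexts: Record<AgentRole, string[]> = {
  researcher: ['*'], // writes everything it observes
  strategist: ['content_goals', 'audience', 'communication_style'],
  writer: ['communication_style', 'terminology', 'content_goals'],
  reviewer: ['content_goals', 'communication_style', 'review_process'],
};

function visibleBeliefs(role: AgentRole, beliefs: Belief[]): Belief[] {
  const scopes = allowedContexts[role];
  return scopes.includes('*')
    ? beliefs
    : beliefs.filter((b) => scopes.includes(b.context));
}
```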
Model write conflicts are possible. When multiple agents observe the user simultaneously (in parallel agent architectures), they may produce conflicting observations. The self-model needs a conflict resolution strategy - typically confidence-weighted merging or last-write-wins with confidence decay.
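A sketch of the second option, last-write-wins with confidence decay, assuming observations carry a timestamp:

```typescript
// Sketch: resolve conflicting observations by decaying the confidence of the
// older one, then keeping whichever is more confident. Half-life is a tunable.
interface Observation {
  statement: string;
  context: string;
  confidence: number;
  observedAt: number; // epoch ms
}

const WEEK_MS = 7 * 24 * 60 * 60 * 1000;

function resolveConflict(
  existing: Observation,
  incoming: Observation,
  halfLifeMs: number = WEEK_MS,
): Observation {
  const age = Math.max(incoming.observedAt - existing.observedAt, 0);
  const decayedConfidence = existing.confidence * Math.pow(0.5, age / halfLifeMs);
  return incoming.confidence >= decayedConfidence ? incoming : existing;
}
```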
Framework integration is not yet standardized. Each agent framework has different extension points for memory integration. Integrating self-models requires framework-specific adapters. As the agent ecosystem matures, standardized memory interfaces will emerge, but today each integration is custom work.
What to Do Next
1. Benchmark your own agent pipeline. Run the same test we did: establish user context in the first agent and measure how much survives to the last agent. If you have 3+ handoffs and no structured memory, you are likely losing over 80 percent of user context. Quantify the problem before solving it.
2. Audit what each agent needs to know about the user. For each agent in your pipeline, list the user-level context it needs to perform its task well. This is not task context (what to do) but user context (who they are doing it for). The union of these lists defines your self-model schema - see the schema sketch after this list.
3. Evaluate shared self-model architecture for your agent pipeline. Clarity provides structured, confidence-weighted user models that any agent can query and contribute to. The self-model becomes the shared memory layer that agent frameworks are missing. See if self-models fit your multi-agent architecture.
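For step 2, the audit output can be written down directly as a schema. This one mirrors the fintech content workflow from the benchmark; every field name is illustrative:

```typescript
// Illustrative self-model schema for the benchmark's fintech content workflow.
// The union of what each agent needs to know defines the fields.
interface UserSelfModel {
  domain: {
    industry: string;                // e.g. 'fintech'
    regulatoryConstraints: string[]; // e.g. legal review requirements
  };
  audience: {
    segments: string[];
    technicalDepth: 'novice' | 'practitioner' | 'expert';
  };
  communicationStyle: {
    tone: string;                    // e.g. 'concise, technical'
    jargonTolerance: 'low' | 'medium' | 'high';
    preferredEvidence: string;       // e.g. 'case-study-driven arguments'
  };
  contentGoals: {
    primaryGoal: string;
    successMetrics: string[];
  };
}
```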
Your agents are smart individually. They are amnesic collectively. Shared self-models fix the memory problem. Build the shared memory your agents need.