Agent Context Decay: Why Your AI Gets Worse Over Time
AI agents degrade in production not from model decay but from context management failure. Self-models solve context decay with structured user understanding that persists outside the context window.
TL;DR
- AI agents degrade in production not from model decay but from context management failure. Long-running agents silently drop user preferences as context windows fill with recent turns.
- Research confirms this pattern: Chroma’s Context Rot study [1] found that LLM performance becomes increasingly unreliable as input length grows, even on simple tasks.
- Self-models solve it by maintaining structured user understanding outside the context window, so preferences from week 1 persist through week 40.
Agent context decay is the degradation of AI agent performance over time caused by context window saturation, not model quality regression. Users report that the AI “used to get me” because early preferences are silently displaced by recent conversation turns as the context window fills. This post covers the mechanics of context decay, why standard fixes like larger windows and RAG fail, and how structured user models provide decay-proof understanding.
The Mechanics of Decay
Context decay is not a bug. It is the inevitable consequence of finite context windows applied to indefinite user relationships. Research from Liu et al. in “Lost in the Middle” [2] demonstrated that language models perform best when relevant information appears at the beginning or end of their input, with significant degradation for information positioned in the middle. This positional sensitivity compounds over time in long-running agents.
Here is what happens in a typical production agent:
Week 1-2: The Honeymoon. The context window is mostly empty. Early interactions establish user preferences, goals, communication style, and domain context. The agent has room to retain all of it. Responses feel personal because the agent genuinely has the full picture.
Week 3-4: The Crowding. As sessions accumulate, the context window fills. Recent conversation turns take priority because the agent needs immediate context to respond coherently. User preferences from week 1 start getting pushed out of the window or truncated by summarization. As Chroma’s research on context rot [3] documented across 18 LLMs, models do not process their context uniformly. Performance becomes increasingly non-uniform as input length grows.
Week 6-8: The Forgetting. The agent now has detailed context about the last few conversations and almost nothing about the user’s fundamental preferences, goals, or expertise level. It responds to what was just said without understanding who it is talking to. Users notice. They start re-explaining things the agent “should” know.
Week 12+: The Generic. The agent has effectively reset to a zero-knowledge state for everything except recent turns. The user treats it as a stateless tool. The personalization that drove initial adoption is gone. Factory.ai’s analysis of agent context challenges [4] identifies seven distinct types of context that agents need to function well, and long-running sessions steadily crowd out all but the most recent.
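The mechanics of the crowding are mundane. Most agents apply some variant of the sliding-window truncation sketched below; the Turn shape and token budget here are illustrative, not any particular framework’s API:

```typescript
interface Turn {
  role: 'user' | 'assistant';
  content: string;
  tokens: number;
}

// Keep the most recent turns that fit the token budget, newest first.
// Everything older than the cutoff, including the week-1 turns where
// the user explained their preferences, is silently dropped.
function fitToWindow(history: Turn[], budget: number): Turn[] {
  const kept: Turn[] = [];
  let used = 0;
  for (let i = history.length - 1; i >= 0; i--) {
    if (used + history[i].tokens > budget) break;
    kept.unshift(history[i]);
    used += history[i].tokens;
  }
  return kept;
}
```

No single call deletes the preferences. They simply age past the cutoff.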
Why Standard Fixes Fail
Teams that recognize context decay typically try one of three approaches. All three are insufficient.
Larger context windows. Moving from 8K to 128K tokens delays decay but does not prevent it. A 128K window fills in months instead of weeks. The fundamental problem, finite memory applied to indefinite relationships, remains. As IBM Research notes [5], when text length doubles, LLMs need four times more memory and compute power due to the quadratic scaling of the attention mechanism. Larger windows also suffer from the “lost in the middle” effect: models struggle to use information buried in long contexts regardless of window size.
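The quadratic cost is easy to sanity-check with toy arithmetic, assuming attention cost proportional to the square of input length:

```typescript
// Self-attention compares every token with every other token, so cost
// scales with the square of input length (ignoring constant factors).
const attentionCost = (tokens: number) => tokens ** 2;

console.log(attentionCost(16_000) / attentionCost(8_000));  // 4: double the text, 4x the cost
console.log(attentionCost(128_000) / attentionCost(8_000)); // 256: a 16x window, 256x the cost
```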
RAG-based memory retrieval. Storing past interactions in a vector database and retrieving relevant ones per query helps with factual recall but fails at holistic understanding. As Letta explains in “RAG is Not Agent Memory” [6], traditional RAG performs a single retrieval-and-generate cycle that cannot synthesize scattered personal details into coherent understanding. If a user mentioned their risk tolerance in session 7 and their investment timeline in session 12, a similarity search for “financial goals” may surface neither. Research on seven failure points in RAG systems [7] further documents how relevant information can be present in the retrieval corpus yet still fail to reach the model’s output.
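Here is that failure mode in toy form, with made-up similarity scores standing in for real embeddings:

```typescript
// Hypothetical memory fragments scored against the query "financial goals".
// The scores are illustrative, not output from a real embedding model.
const fragments = [
  { session: 7,  text: 'I hate volatility, keep me conservative', similarity: 0.41 },
  { session: 12, text: 'I want to retire around 2045',            similarity: 0.38 },
  { session: 31, text: 'Can you compare index fund fees?',        similarity: 0.72 },
  { session: 33, text: 'What are typical 401k fees?',             similarity: 0.69 },
];

// Single-shot top-k retrieval surfaces the recent fee questions and
// misses both the risk tolerance (session 7) and the timeline (session 12).
const topK = [...fragments].sort((a, b) => b.similarity - a.similarity).slice(0, 2);
console.log(topK.map((f) => f.session)); // [31, 33]
```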
Periodic summarization. Compressing the full interaction history into a summary that fits in the context window is lossy by design. The summarizer decides what matters, often based on frequency rather than importance. A preference stated once with high conviction gets dropped in favor of a topic discussed repeatedly but with low significance.
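A deliberately naive frequency-based compressor makes the bias visible; production summarizers are subtler, but the incentive is the same:

```typescript
// Count how often each topic appears, then keep the top N. A preference
// stated once with conviction loses to small talk repeated every session.
function summarizeByFrequency(topics: string[], keep: number): string[] {
  const counts = new Map<string, number>();
  for (const t of topics) counts.set(t, (counts.get(t) ?? 0) + 1);
  return [...counts.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, keep)
    .map(([topic]) => topic);
}

const topics = [
  'weather', 'weather', 'weather', 'weekend plans', 'weekend plans',
  'never contact me on weekends', // stated once, with high conviction
];
console.log(summarizeByFrequency(topics, 2)); // ['weather', 'weekend plans']: the critical preference is dropped
```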
Standard Context Management
- × Larger windows: delay decay by weeks but do not prevent it
- × RAG retrieval: finds fragments, cannot synthesize understanding
- × Summarization: lossy compression that drops infrequent but critical preferences
- × User understanding treated as a cache problem instead of a modeling problem
Self-Model Architecture
- ✓ Structured beliefs persist outside the context window entirely
- ✓ User understanding is queryable, not retrieved by similarity
- ✓ Confidence-weighted beliefs prioritize conviction over frequency
- ✓ User understanding treated as a first-class data model, not a cache
Self-Models: Understanding Outside the Window
The fix for context decay is architectural: stop storing user understanding inside the context window.
A self-model maintains a structured, compressed representation of who the user is: their beliefs, preferences, goals, expertise, and how these evolve over time. This representation lives outside the context window as a persistent data model. When the agent needs user context, it queries the self-model for relevant beliefs rather than hoping the right information is still in the window.
The difference is fundamental. Context windows are short-term memory. Self-models are long-term understanding. Where context window management strategies [8] like sliding windows and hierarchical summarization try to compress more into the same finite space, self-models sidestep the constraint entirely.
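What “structured” means in practice is easiest to show as data. Here is a minimal sketch of a belief record such a self-model might store; the field names are illustrative, not Clarity’s actual schema:

```typescript
interface Belief {
  id: string;
  context: 'communication_style' | 'goals' | 'expertise' | 'risk_tolerance';
  statement: string;     // e.g. 'prefers concise, jargon-free answers'
  confidence: number;    // 0..1, weighted by conviction, not frequency
  firstObserved: string; // ISO timestamp of the session where it surfaced
  lastConfirmed: string; // most recent supporting evidence
  supersedes?: string;   // id of the belief this one replaced
}
```

With beliefs stored this way, the request-time difference looks like this: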
```typescript
// Without self-model: hope preferences are still in context (context decay)
const decayedResponse = await agent.chat(userMessage, {
  context: conversationHistory.slice(-50) // last 50 turns
});
// User preferences from week 1? Gone.

// With self-model: query structured understanding (decay-proof)
const selfModel = await clarity.getSelfModel(userId);
const beliefs = selfModel.getBeliefs({
  contexts: ['communication_style', 'goals', 'expertise']
});

const recentTurns = conversationHistory.slice(-10); // short-term context only
const decayProofResponse = await agent.chat(userMessage, {
  context: [...recentTurns, ...beliefs]
});
// Week 1 preferences + week 12 conversation = coherent response
```
This architecture has three properties that context windows cannot provide:
Persistence without size limits. The self-model grows with the user relationship without competing for space in the context window. A user with 500 sessions has the same quality of understanding as a user with 5.
Structured evolution. When a user’s preferences change, the self-model updates the relevant belief with a new confidence score and records the trajectory. It does not append a contradictory statement. It resolves the contradiction. Week 1 preferences are superseded, not forgotten.
Selective injection. The agent queries only the beliefs relevant to the current interaction. A financial planning conversation pulls financial goals and risk tolerance. A scheduling conversation pulls communication preferences and time constraints. The context window stays focused, avoiding the context pollution problem [9] where irrelevant retrieved information degrades response quality.
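Structured evolution is the property that differs most from append-only memory, so here is a sketch of the update step, using a pared-down version of the illustrative Belief shape from earlier:

```typescript
type Belief = {
  id: string;
  context: string;
  statement: string;
  confidence: number;
  supersedes?: string; // id of the belief this one replaced
};

// When new evidence contradicts an active belief in the same context,
// retire the old one and record the trajectory on its replacement,
// rather than appending a conflicting statement.
function updateBelief(active: Belief[], incoming: Belief): Belief[] {
  const prior = active.find((b) => b.context === incoming.context);
  if (!prior) return [...active, incoming];
  return [
    ...active.filter((b) => b.id !== prior.id),
    { ...incoming, supersedes: prior.id },
  ];
}
```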
Measuring Decay in Your System
Before implementing self-models, measure whether your agents suffer from context decay. Three diagnostic approaches:
Preference recall test. Identify 10 user preferences stated in early sessions. Ask the agent about those preferences after 30+ subsequent sessions. The recall rate is your decay score. Compare results against the positional sensitivity patterns documented in the Lost in the Middle research [10]. Early preferences are exactly the kind of information that gets buried and lost as context grows.
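Scoring the test is a one-liner once each probe has been judged; the RecallProbe shape below assumes a human or LLM judge has already marked each answer:

```typescript
interface RecallProbe {
  preference: string; // stated in an early session
  recalled: boolean;  // did the agent surface it after 30+ sessions?
}

// Decay score: fraction of early preferences the agent still recalls.
// 10/10 means no decay; 3/10 means most early context is gone.
const recallRate = (probes: RecallProbe[]) =>
  probes.filter((p) => p.recalled).length / probes.length;
```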
Re-explanation rate. Track how often users restate information the agent should already have. A rising re-explanation rate over time is the clearest signal of context decay. Instrument this by detecting user messages that contain phrases like “as I mentioned before” or “I already told you.”
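Lightweight phrase matching is enough to start instrumenting this; the marker list below is a starting point, not an exhaustive detector:

```typescript
// Phrases that signal a user is restating something the agent should know.
const reExplanationMarkers = [
  /as i (mentioned|said) (before|earlier)/i,
  /i('ve| have)? already told you/i,
  /like i said/i,
];

const isReExplanation = (message: string) =>
  reExplanationMarkers.some((re) => re.test(message));

// Rate per user over a window of messages: a rising trend is the signal.
const reExplanationRate = (messages: string[]) =>
  messages.filter(isReExplanation).length / messages.length;
```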
Personalization regression. Compare the specificity of agent responses in sessions 1-5 versus sessions 30-35. If early responses reference user-specific context (goals, preferences, domain) and later responses are generic, decay is present.
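Specificity is the hardest of the three to score automatically. One crude but workable proxy: count references to known user-specific terms per response (goals, named projects, stated constraints) and compare early sessions against late ones:

```typescript
// Average number of user-specific references per response. Compare
// sessions 1-5 against sessions 30-35: a drop signals decay.
function specificityScore(responses: string[], userTerms: string[]): number {
  const hits = responses.reduce((sum, response) => {
    const lower = response.toLowerCase();
    return sum + userTerms.filter((t) => lower.includes(t.toLowerCase())).length;
  }, 0);
  return hits / responses.length;
}
```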
What to Do Next
- Run the preference recall test. Pick your 10 longest-running user sessions. List the preferences each user stated in their first 5 interactions. Query the agent about those preferences now. The gap between what was stated and what is recalled is your context decay score, and the business case for fixing it.
- Instrument re-explanation tracking. Add lightweight detection for user corrections and restated preferences. Track the rate per user over time. This gives you a continuous decay metric that correlates directly with user frustration and churn risk.
- Pilot a self-model for your highest-value users. Take your top 20 users by engagement depth and implement self-models alongside your existing context management. Measure preference recall, personalization specificity, and user satisfaction before and after. The delta is your ROI case. Start building decay-proof agents.
The model is not getting worse. The context is getting crowded. Self-models move user understanding out of the window and into a structure that compounds. Fix context decay.
References
1. Chroma Research, “Context Rot” study on LLM performance as input length grows.
2. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts.”
3. Chroma Research, context rot evaluation across 18 LLMs.
4. Factory.ai, analysis of agent context challenges.
5. IBM Research, on the quadratic scaling of attention with input length.
6. Letta, “RAG is Not Agent Memory.”
7. Barnett et al., “Seven Failure Points When Engineering a Retrieval Augmented Generation System.”
8. Context window management strategies.
9. Context pollution problem.
10. Liu et al., “Lost in the Middle” (see [2]).