RAG Pipelines Need a User Layer
RAG retrieves the same chunks for every user. Adding a self-model user layer means retrieval is filtered and ranked by expertise, goals, and knowledge gaps.
TL;DR
- RAG retrieves documents based on what was asked, not who is asking - a junior engineer and a CTO get identical chunks for the same query
- Adding a self-model user layer to your RAG pipeline enables expertise-aware retrieval, goal-filtered ranking, and knowledge-gap-aware synthesis with minimal architectural changes
- The architecture is straightforward: inject user context at retrieval (filter/rerank) and generation (prompt augmentation), and measure alignment alongside accuracy
RAG pipelines need a user layer because they retrieve the same chunks for every user regardless of expertise, goals, or knowledge gaps. Without user-aware retrieval, a junior engineer and a CTO get identical responses to the same query, producing content calibrated for an average user who does not exist. This post covers the architecture for injecting self-model context at retrieval and generation stages, why top-K is the wrong abstraction, and the cold start path from zero user context to fully personalized knowledge delivery.
The Problem: User-Blind Retrieval
Here is what happens in a standard RAG pipeline when two users ask the same question - “How do we handle authentication for the new microservice?”
User A is a staff engineer who designed the company’s auth infrastructure. They want to know if there is a specific pattern recommended for service-to-service auth in the new deployment model.
User B is a junior engineer three months into the job. They need to understand what authentication even means in the context of microservices before they can implement anything.
The RAG pipeline embeds the query, searches the vector store, retrieves the top-K chunks about microservice authentication, and feeds them to the LLM. Both users get the same chunks. The LLM synthesizes the same response - a middle-ground explanation that is too basic for User A and too advanced for User B.
This is not a retrieval quality problem. The chunks are relevant to the query. They are not relevant to the user.
| Pipeline Stage | What It Considers | What It Ignores |
|---|---|---|
| Query embedding | Semantic meaning of the question | Who is asking it |
| Vector search | Document similarity to query | User’s existing knowledge |
| Reranking | Cross-encoder relevance scores | User’s expertise level |
| Chunk selection | Top-K by relevance score | Whether user already knows this |
| LLM synthesis | Retrieved context + query | User’s goals, role, knowledge gaps |
Every stage is query-aware. No stage is user-aware. The pipeline is optimized end-to-end for document relevance and is completely blind to human relevance.
RAG Without User Layer
- × Same top-K chunks for every user regardless of expertise
- × Retrieval budget wasted on content the user already knows
- × LLM synthesis calibrated for an average user who does not exist
- × Expert users get patronizing explanations, novices get assumed context
RAG With Self-Model User Layer
- ✓ Chunks filtered and reranked by user expertise and knowledge gaps
- ✓ Retrieval budget spent on content that fills actual knowledge gaps
- ✓ LLM synthesis calibrated for the specific person asking
- ✓ Experts get the depth they need, novices get the foundations they need
The Architecture: RAG + Self-Model Integration
The user layer does not replace your RAG pipeline. It augments it at two points: retrieval and generation. Your vector store, chunking strategy, and embedding model stay the same.
At retrieval: the self-model provides user context that filters or reranks results. Chunks covering concepts the user has already demonstrated mastery of get deprioritized. Chunks matching the user’s current goals get boosted. The same query produces different chunk sets for different users.
At generation: the self-model injects structured user context into the LLM prompt alongside the retrieved chunks. Expertise level, active goals, communication preferences, and knowledge gaps shape how the LLM synthesizes the retrieved information.
```typescript
// 1. Standard RAG retrieval - query-only, no user awareness
const chunks = await vectorStore.search(query, { topK: 20 });

// 2. Fetch the user's self-model - add the user layer
const selfModel = await clarity.getSelfModel(userId);
const userContext = {
  expertise: selfModel.getBeliefs({ context: 'expertise_level' }),
  goals: selfModel.getActiveGoals(),
  knowledgeGaps: selfModel.getBeliefs({ context: 'knowledge_gaps' }),
  preferences: selfModel.getBeliefs({ context: 'communication_style' })
};

// 3. User-aware reranking - same chunks, different ranking per user
const reranked = chunks
  .map(chunk => ({
    ...chunk,
    score: chunk.score
      + boostForKnowledgeGap(chunk, userContext.knowledgeGaps)
      + boostForGoalAlignment(chunk, userContext.goals)
      - penalizeAlreadyKnown(chunk, userContext.expertise)
  }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 8);

// 4. User-aware generation - same LLM, personalized synthesis
const response = await llm.generate({
  query,
  context: reranked,
  userModel: userContext, // Shapes depth, tone, emphasis
});
```
The key insight is in step 3: user-aware reranking. Instead of returning the same top-K chunks for everyone, the pipeline adjusts scores based on three signals from the self-model (sketched in code after the list):
- Knowledge gap boost: chunks that address topics the user has not demonstrated understanding of get scored higher
- Goal alignment boost: chunks relevant to the user’s stated or inferred goals get promoted - a user evaluating gets ROI-focused content, a user implementing gets technical content
- Already-known penalty: chunks covering concepts the user has already mastered get deprioritized, freeing retrieval budget for new information
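The three helpers referenced in step 3 can be sketched concretely. This is a minimal sketch, assuming a simple chunk/belief/goal shape and hand-picked weights; none of it is a prescribed schema:

```typescript
// Minimal sketches of the three reranking helpers. The Chunk/Belief/Goal shapes
// and the 0.15 / 0.10 / 0.20 weights are illustrative assumptions.
interface Chunk { id: string; text: string; topics: string[]; score: number }
interface Belief { topic: string; confidence: number }
interface Goal { description: string; keywords: string[] }

// Knowledge gap boost: reward chunks covering topics the user has not yet mastered.
function boostForKnowledgeGap(chunk: Chunk, gaps: Belief[]): number {
  const hits = chunk.topics.filter(t => gaps.some(g => g.topic === t)).length;
  return 0.15 * hits;
}

// Goal alignment boost: promote chunks that mention keywords tied to active goals.
function boostForGoalAlignment(chunk: Chunk, goals: Goal[]): number {
  const text = chunk.text.toLowerCase();
  const hits = goals.filter(g => g.keywords.some(k => text.includes(k))).length;
  return 0.10 * hits;
}

// Already-known penalty: deprioritize chunks on topics the user has demonstrated mastery of.
function penalizeAlreadyKnown(chunk: Chunk, expertise: Belief[]): number {
  return chunk.topics.reduce((penalty, topic) => {
    const known = expertise.find(e => e.topic === topic && e.confidence > 0.8);
    return penalty + (known ? 0.20 * known.confidence : 0);
  }, 0);
}
```

In practice the weights would be tuned against your own relevance and alignment metrics rather than hardcoded.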
This means a staff engineer asking about microservice auth gets chunks about advanced patterns, mTLS edge cases, and service mesh configurations. The junior engineer gets chunks about foundational auth concepts, step-by-step guides, and common pitfalls. Same query. Same vector store. Different knowledge delivery.
Why Top-K Is the Wrong Abstraction
Standard RAG uses top-K retrieval - return the K most semantically similar chunks to the query. This treats relevance as a single dimension: how close is this chunk to what was asked?
But relevance has at least two dimensions: how close is this chunk to the query, and how useful is this chunk to this specific user. A chunk that perfectly matches the query but covers ground the user already knows is wasting a slot in the context window. A chunk that is slightly less query-relevant but fills a critical knowledge gap for this user is far more valuable.
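A toy comparison makes the two axes concrete; the numbers below are invented for illustration:

```typescript
// Two chunks for the same query, with made-up scores.
const chunkA = { querySimilarity: 0.91, fillsGap: false, alreadyKnown: true };  // best match, old news
const chunkB = { querySimilarity: 0.84, fillsGap: true,  alreadyKnown: false }; // close match, new ground

const userAwareScore = (c: { querySimilarity: number; fillsGap: boolean; alreadyKnown: boolean }) =>
  c.querySimilarity + (c.fillsGap ? 0.15 : 0) - (c.alreadyKnown ? 0.20 : 0);

// Single-axis top-K picks A (0.91 vs 0.84).
// User-aware scoring picks B (0.99 vs 0.71): the slot goes to the chunk that teaches something new.
```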
User-aware retrieval turns top-K from a single-axis ranking into a multi-axis optimization. You are not just finding the most relevant documents. You are finding the most useful documents for this person at this moment.
This has measurable downstream effects. When retrieved chunks align with what the user actually needs to learn, the LLM produces responses that are more targeted, less repetitive, and more likely to resolve the user’s actual question on the first attempt.
The Cold Start Path
New users do not have rich self-models. Your pipeline needs graceful degradation:
Interaction 0: No user context. Standard RAG behavior. This is your baseline and it works fine - the same way it works today.
Interactions 1-3: Thin user context. The system infers basic signals - are they asking beginner or advanced questions? Do they prefer code examples or conceptual explanations? Light reranking adjustments with high uncertainty bounds.
Interactions 4-10: Growing user context. Expertise areas begin to differentiate. Knowledge gaps become identifiable. Retrieval reranking has enough signal to meaningfully diverge from default top-K.
Interactions 10+: Rich user context. The self-model has enough observations to confidently filter retrieval, adapt synthesis depth, and anticipate knowledge gaps before the user articulates them.
The critical design principle: user-aware retrieval should always improve or match standard retrieval, never degrade it. When confidence in the user model is low, bias toward standard behavior. As confidence increases, increase the influence of user signals on reranking.
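One way to encode that principle is to blend the query-only score with the user-adjusted score using a confidence weight that grows with observations. A minimal sketch, assuming a linear ramp that saturates around ten interactions:

```typescript
// Blend standard and user-aware scores by self-model confidence.
// The linear ramp saturating at ~10 interactions is an assumption, not a fixed rule.
function blendedScore(queryScore: number, userAdjustedScore: number, observationCount: number): number {
  const confidence = Math.min(observationCount / 10, 1); // 0 at cold start, 1 when the model is rich
  return (1 - confidence) * queryScore + confidence * userAdjustedScore;
}

// Interaction 0: pure standard RAG. Interaction 5: user signals nudge the ranking.
// Interaction 10+: fully user-aware reranking.
```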
Trade-offs to Consider
Latency: Fetching self-model context adds a retrieval step. In practice this is 20-50ms - negligible for most applications, but worth caching for latency-sensitive use cases.
Prompt budget: User context competes with retrieved chunks for tokens. A well-structured self-model summary is 100-200 tokens. For context windows of 8K+, this is a worthwhile trade.
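For a sense of scale, a compact summary injected into the prompt might look like the sketch below; the field names and wording are illustrative, not a required format:

```typescript
// Illustrative rendering of the self-model into a compact prompt block (~100-200 tokens).
function renderUserContext(userContext: {
  expertise: string[]; goals: string[]; knowledgeGaps: string[]; preferences: string[];
}): string {
  return [
    'User profile (from self-model):',
    `- Expertise: ${userContext.expertise.join(', ')}`,
    `- Active goals: ${userContext.goals.join('; ')}`,
    `- Knowledge gaps: ${userContext.knowledgeGaps.join(', ')}`,
    `- Preferences: ${userContext.preferences.join(', ')}`,
  ].join('\n');
}
```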
Model accuracy: If the self-model misclassifies an expert as a beginner, the user gets patronizing results. Use confidence thresholds - only apply user-aware reranking when the self-model has sufficient observations. Fall back to standard RAG when uncertain.
Privacy: User expertise profiles are sensitive data. Build consent into the user layer from the start. Let users see, edit, and delete their self-model. This is not just compliance - it is trust.
What to Do Next
1. Audit your current pipeline for user blindness. Take your five most common queries and trace them through your RAG pipeline. At each stage - embedding, retrieval, reranking, generation - ask: does this stage know anything about who is asking? Document the gaps. That is your user-layer integration map.
2. Start with generation-stage injection. The lowest-effort integration is adding user context to the LLM prompt alongside retrieved chunks. Even a single line - “The user is a senior engineer who prefers concise, code-first responses” - measurably improves output quality for that user. No changes to your retrieval pipeline required (see the sketch after this list).
3. Add the user layer with Clarity’s self-model API. Move from hardcoded user context to dynamic self-models that evolve with every interaction. Start with our API quickstart to see user-aware retrieval working against your own pipeline.
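A minimal sketch of the step-2 injection, reusing the `vectorStore.search` and `llm.generate` interfaces assumed in the pipeline example above (the `system` field is also an assumption):

```typescript
// Generation-stage injection only: retrieval is untouched, user context is one hardcoded line.
const chunks = await vectorStore.search(query, { topK: 8 });

const response = await llm.generate({
  query,
  context: chunks,
  system: [
    'Answer the question using the retrieved context.',
    // Hardcoded for now; swap in a dynamic self-model summary later.
    'The user is a senior engineer who prefers concise, code-first responses.',
  ].join('\n'),
});
```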
Your RAG pipeline finds the right documents. A user layer makes sure the right person gets the right response. Add the missing layer.