RAG Pipelines Need a User Layer
RAG retrieves the same chunks for every user. Adding a self-model user layer means retrieval is filtered and ranked by expertise, goals, and knowledge gaps.
TL;DR
- RAG retrieves documents based on what was asked, not who is asking - a junior engineer and a CTO get identical chunks for the same query
- Adding a self-model user layer to your RAG pipeline enables expertise-aware retrieval, goal-filtered ranking, and knowledge-gap-aware synthesis with minimal architectural changes
- The architecture is straightforward: inject user context at retrieval (filter/rerank) and generation (prompt augmentation), and measure alignment alongside accuracy
RAG pipelines need a user layer because they retrieve the same chunks for every user regardless of expertise, goals, or knowledge gaps. Without user-aware retrieval, a junior engineer and a CTO get identical responses to the same query, producing content calibrated for an average user who does not exist. This post covers the architecture for injecting self-model context at retrieval and generation stages, why top-K is the wrong abstraction, and the cold start path from zero user context to fully personalized knowledge delivery.
The Problem: User-Blind Retrieval
Here is what happens in a standard RAG pipeline when two users ask the same question - “How do we handle authentication for the new microservice?”
User A is a staff engineer who designed the company’s auth infrastructure. They want to know if there is a specific pattern recommended for service-to-service auth in the new deployment model.
User B is a junior engineer three months into the job. They need to understand what authentication even means in the context of microservices before they can implement anything.
The RAG pipeline embeds the query, searches the vector store, retrieves the top-K chunks about microservice authentication, and feeds them to the LLM. Both users get the same chunks. The LLM synthesizes the same response - a middle-ground explanation that is too basic for User A and too advanced for User B.
This is not a retrieval quality problem. The chunks are relevant to the query. They are not relevant to the user.
| Pipeline Stage | What It Considers | What It Ignores |
|---|---|---|
| Query embedding | Semantic meaning of the question | Who is asking it |
| Vector search | Document similarity to query | User’s existing knowledge |
| Reranking | Cross-encoder relevance scores | User’s expertise level |
| Chunk selection | Top-K by relevance score | Whether user already knows this |
| LLM synthesis | Retrieved context + query | User’s goals, role, knowledge gaps |
Every stage is query-aware. No stage is user-aware. The pipeline is optimized end-to-end for document relevance and is completely blind to human relevance.
RAG Without User Layer
- × Same top-K chunks for every user regardless of expertise
- × Retrieval budget wasted on content the user already knows
- × LLM synthesis calibrated for an average user who does not exist
- × Expert users get patronizing explanations, novices get assumed context
RAG With Self-Model User Layer
- ✓ Chunks filtered and reranked by user expertise and knowledge gaps
- ✓ Retrieval budget spent on content that fills actual knowledge gaps
- ✓ LLM synthesis calibrated for the specific person asking
- ✓ Experts get the depth they need, novices get the foundations they need
The Architecture: RAG + Self-Model Integration
The user layer does not replace your RAG pipeline. It augments it at two points: retrieval and generation. Your vector store, chunking strategy, and embedding model stay the same.
At retrieval: the self-model provides user context that filters or reranks results. Chunks covering concepts the user has already demonstrated mastery of get deprioritized. Chunks matching the user’s current goals get boosted. The same query produces different chunk sets for different users.
At generation: the self-model injects structured user context into the LLM prompt alongside the retrieved chunks. Expertise level, active goals, communication preferences, and knowledge gaps shape how the LLM synthesizes the retrieved information.
```typescript
// 1. Standard RAG retrieval - query-only, no user awareness
const chunks = await vectorStore.search(query, { topK: 20 });

// 2. Fetch the user's self-model - add the user layer
const selfModel = await clarity.getSelfModel(userId);
const userContext = {
  expertise: selfModel.getBeliefs({ context: 'expertise_level' }),
  goals: selfModel.getActiveGoals(),
  knowledgeGaps: selfModel.getBeliefs({ context: 'knowledge_gaps' }),
  preferences: selfModel.getBeliefs({ context: 'communication_style' })
};

// 3. User-aware reranking - same chunks, different ranking per user
const reranked = chunks
  .map(chunk => ({
    ...chunk,
    score: chunk.score
      + boostForKnowledgeGap(chunk, userContext.knowledgeGaps)
      + boostForGoalAlignment(chunk, userContext.goals)
      - penalizeAlreadyKnown(chunk, userContext.expertise)
  }))
  .sort((a, b) => b.score - a.score)
  .slice(0, 8);

// 4. User-aware generation - same LLM, personalized synthesis
const response = await llm.generate({
  query,
  context: reranked,
  userModel: userContext, // Shapes depth, tone, emphasis
});
```
The key insight is in step 3: user-aware reranking. Instead of returning the same top-K chunks for everyone, the pipeline adjusts scores based on three signals from the self-model (sketched in code after the list):
- Knowledge gap boost: chunks that address topics the user has not demonstrated understanding of get scored higher
- Goal alignment boost: chunks relevant to the user’s stated or inferred goals get promoted - a user evaluating gets ROI-focused content, a user implementing gets technical content
- Already-known penalty: chunks covering concepts the user has already mastered get deprioritized, freeing retrieval budget for new information
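The three helpers referenced in step 3 can be sketched concretely. This is a minimal sketch, assuming a simple chunk/belief/goal shape and hand-picked weights; none of it is a prescribed schema:

```typescript
// Minimal sketches of the three reranking helpers. The Chunk/Belief/Goal shapes
// and the 0.15 / 0.10 / 0.20 weights are illustrative assumptions.
interface Chunk { id: string; text: string; topics: string[]; score: number }
interface Belief { topic: string; confidence: number }
interface Goal { description: string; keywords: string[] }

// Knowledge gap boost: reward chunks covering topics the user has not yet mastered.
function boostForKnowledgeGap(chunk: Chunk, gaps: Belief[]): number {
  const hits = chunk.topics.filter(t => gaps.some(g => g.topic === t)).length;
  return 0.15 * hits;
}

// Goal alignment boost: promote chunks that mention keywords tied to active goals.
function boostForGoalAlignment(chunk: Chunk, goals: Goal[]): number {
  const text = chunk.text.toLowerCase();
  const hits = goals.filter(g => g.keywords.some(k => text.includes(k))).length;
  return 0.10 * hits;
}

// Already-known penalty: deprioritize chunks on topics the user has demonstrated mastery of.
function penalizeAlreadyKnown(chunk: Chunk, expertise: Belief[]): number {
  return chunk.topics.reduce((penalty, topic) => {
    const known = expertise.find(e => e.topic === topic && e.confidence > 0.8);
    return penalty + (known ? 0.20 * known.confidence : 0);
  }, 0);
}
```

In practice the weights would be tuned against your own relevance and alignment metrics rather than hardcoded.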
This means a staff engineer asking about microservice auth gets chunks about advanced patterns, mTLS edge cases, and service mesh configurations. The junior engineer gets chunks about foundational auth concepts, step-by-step guides, and common pitfalls. Same query. Same vector store. Different knowledge delivery.
Why Top-K Is the Wrong Abstraction
Standard RAG uses top-K retrieval - return the K most semantically similar chunks to the query. This treats relevance as a single dimension: how close is this chunk to what was asked?
But relevance has at least two dimensions: how close is this chunk to the query, and how useful is this chunk to this specific user. A chunk that perfectly matches the query but covers ground the user already knows is wasting a slot in the context window. A chunk that is slightly less query-relevant but fills a critical knowledge gap for this user is far more valuable.
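A toy comparison makes the two axes concrete; the numbers below are invented for illustration:

```typescript
// Two chunks for the same query, with made-up scores.
const chunkA = { querySimilarity: 0.91, fillsGap: false, alreadyKnown: true };  // best match, old news
const chunkB = { querySimilarity: 0.84, fillsGap: true,  alreadyKnown: false }; // close match, new ground

const userAwareScore = (c: { querySimilarity: number; fillsGap: boolean; alreadyKnown: boolean }) =>
  c.querySimilarity + (c.fillsGap ? 0.15 : 0) - (c.alreadyKnown ? 0.20 : 0);

// Single-axis top-K picks A (0.91 vs 0.84).
// User-aware scoring picks B (0.99 vs 0.71): the slot goes to the chunk that teaches something new.
```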
User-aware retrieval turns top-K from a single-axis ranking into a multi-axis optimization. You are not just finding the most relevant documents. You are finding the most useful documents for this person at this moment.
This has measurable downstream effects. When retrieved chunks align with what the user actually needs to learn, the LLM produces responses that are more targeted, less repetitive, and more likely to resolve the user’s actual question on the first attempt.
The Cold Start Path
New users do not have rich self-models. Your pipeline needs graceful degradation:
Interaction 0: No user context. Standard RAG behavior. This is your baseline and it works fine - the same way it works today.
Interactions 1-3: Thin user context. The system infers basic signals - are they asking beginner or advanced questions? Do they prefer code examples or conceptual explanations? Light reranking adjustments with high uncertainty bounds.
Interactions 4-10: Growing user context. Expertise areas begin to differentiate. Knowledge gaps become identifiable. Retrieval reranking has enough signal to meaningfully diverge from default top-K.
Interactions 10+: Rich user context. The self-model has enough observations to confidently filter retrieval, adapt synthesis depth, and anticipate knowledge gaps before the user articulates them.
The critical design principle: user-aware retrieval should always improve or match standard retrieval, never degrade it. When confidence in the user model is low, bias toward standard behavior. As confidence increases, increase the influence of user signals on reranking.
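One way to encode that principle is to blend the query-only score with the user-adjusted score using a confidence weight that grows with observations. A minimal sketch, assuming a linear ramp that saturates around ten interactions:

```typescript
// Blend standard and user-aware scores by self-model confidence.
// The linear ramp saturating at ~10 interactions is an assumption, not a fixed rule.
function blendedScore(queryScore: number, userAdjustedScore: number, observationCount: number): number {
  const confidence = Math.min(observationCount / 10, 1); // 0 at cold start, 1 when the model is rich
  return (1 - confidence) * queryScore + confidence * userAdjustedScore;
}

// Interaction 0: pure standard RAG. Interaction 5: user signals nudge the ranking.
// Interaction 10+: fully user-aware reranking.
```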
Trade-offs to Consider
Latency: Fetching self-model context adds a retrieval step. In practice this is 20-50ms - negligible for most applications, but worth caching for latency-sensitive use cases.
Prompt budget: User context competes with retrieved chunks for tokens. A well-structured self-model summary is 100-200 tokens. For context windows of 8K+, this is a worthwhile trade.
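For a sense of scale, a compact summary injected into the prompt might look like the sketch below; the field names and wording are illustrative, not a required format:

```typescript
// Illustrative rendering of the self-model into a compact prompt block (~100-200 tokens).
function renderUserContext(userContext: {
  expertise: string[]; goals: string[]; knowledgeGaps: string[]; preferences: string[];
}): string {
  return [
    'User profile (from self-model):',
    `- Expertise: ${userContext.expertise.join(', ')}`,
    `- Active goals: ${userContext.goals.join('; ')}`,
    `- Knowledge gaps: ${userContext.knowledgeGaps.join(', ')}`,
    `- Preferences: ${userContext.preferences.join(', ')}`,
  ].join('\n');
}
```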
Model accuracy: If the self-model misclassifies an expert as a beginner, the user gets patronizing results. Use confidence thresholds - only apply user-aware reranking when the self-model has sufficient observations. Fall back to standard RAG when uncertain.
Privacy: User expertise profiles are sensitive data. Build consent into the user layer from the start. Let users see, edit, and delete their self-model. This is not just compliance - it is trust.
What to Do Next
1. Audit your current pipeline for user blindness. Take your five most common queries and trace them through your RAG pipeline. At each stage - embedding, retrieval, reranking, generation - ask: does this stage know anything about who is asking? Document the gaps. That is your user-layer integration map.
2. Start with generation-stage injection. The lowest-effort integration is adding user context to the LLM prompt alongside retrieved chunks. Even a single line - “The user is a senior engineer who prefers concise, code-first responses” - measurably improves output quality for that user. No changes to your retrieval pipeline required (see the sketch after this list).
3. Add the user layer with Clarity’s self-model API. Move from hardcoded user context to dynamic self-models that evolve with every interaction. Start with our API quickstart to see user-aware retrieval working against your own pipeline.
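A minimal sketch of the step-2 injection, reusing the `vectorStore.search` and `llm.generate` interfaces assumed in the pipeline example above (the `system` field is also an assumption):

```typescript
// Generation-stage injection only: retrieval is untouched, user context is one hardcoded line.
const chunks = await vectorStore.search(query, { topK: 8 });

const response = await llm.generate({
  query,
  context: chunks,
  system: [
    'Answer the question using the retrieved context.',
    // Hardcoded for now; swap in a dynamic self-model summary later.
    'The user is a senior engineer who prefers concise, code-first responses.',
  ].join('\n'),
});
```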
Your RAG pipeline finds the right documents. A user layer makes sure the right person gets the right response. Add the missing layer.