The Goldfish Problem in Conversational AI

You’ve built a chatbot, integrated the latest GPT-5 or Claude 4 API, and everything looks great for the first five minutes. Then, around message #40, the agent starts hallucinating. It forgets the user’s preferred programming language, misses the core constraint you defined ten messages ago, and the latency spikes as your token count hits the 128k limit. This is the 'Goldfish Problem.'

In production, context isn't just a list of messages. It is a resource that must be managed, pruned, and prioritized. In 2026, we no longer just 'dump' history into the prompt. We build tiered memory systems that distinguish between what is happening now, what happened recently, and what is true always.

Why Context Management is the New Optimization Layer

While 1M+ token windows are technically possible, they are a trap for production systems. First, there is the 'lost in the middle' phenomenon: even the best models lose reasoning density when processing massive contexts. Second, the cost is linear (or worse) with context length. Third, latency scales with input size. If your RAG (Retrieval-Augmented Generation) system pulls 20 chunks and your conversation history is 50 messages long, you are sending 15,000 tokens per turn just to say 'Yes, I agree.'

To build a system that feels human, you need a memory architecture that mimics human cognition: Ephemeral, Short-term, and Long-term memory.

The Three-Tier Memory Architecture

L1: Ephemeral Memory (The Working State): This is the current turn's data—the specific variables, the current function call, and the immediate preceding message. It lives in the application state, not the database.
L2: Short-term Memory (The Window): A rolling window of the last 5-10 exchanges, but with a twist. We don't just store text; we store structured summaries and extracted 'entities' (e.g., user_preference_python=true).
L3: Long-term Memory (The Semantic Cortex): This is where vector databases like Qdrant or LanceDB come in. We store historical interactions as embeddings, but we only retrieve them when the current query has a high semantic similarity to past topics.

Implementing Windowed Summarization with LangGraph

In my recent projects, I’ve moved away from linear chains to stateful graphs using LangGraph. It allows for 'cycles' where the model can reflect on its own memory before answering. Instead of just appending to a list, we trigger a 'summarization node' once the token count exceeds a threshold (e.g., 4,000 tokens).

Here is a production-ready implementation of a stateful memory graph that performs 'compaction' when the history gets too long.

from typing import Annotated, TypedDict, List
from langgraph.graph import StateGraph, END
from langchain_core.messages import BaseMessage, HumanMessage, AIMessage, SystemMessage
from pydantic import BaseModel

class AgentState(TypedDict):
    messages: Annotated[List[BaseMessage], "The full conversation history"]
    summary: str
    token_count: int

def summarize_history(state: AgentState):
    """Summarizes history if it exceeds our threshold."""
    messages = state["messages"]

# In 2026, we use specialized 'summarizer' models like GPT-5-mini for cost efficiency
summary_prompt = f"Summarize the following conversation into a concise paragraph. Current summary: {state.get('summary', 'None')}"

# Logic to only summarize messages older than the last 5
to_summarize = messages[:-5]
remaining = messages[-5:]

# Assume 'llm' is our summarization model
new_summary = llm.invoke([SystemMessage(content=summary_prompt)] + to_summarize)

return {
    "messages": remaining,
    "summary": new_summary.content,
    "token_count": len(remaining) * 150 # Rough estimate
}

def should_summarize(state: AgentState): """Conditional edge to determine if we need to compact memory.""" if len(state["messages"]) > 15: return "summarize" return END

Build the graph

workflow = StateGraph(AgentState) workflow.add_node("summarize", summarize_history) workflow.add_conditional_edges("agent", should_summarize, {"summarize": "summarize", END: END})

Semantic Retrieval: The External Cortex

Summarization handles the 'flow' of conversation, but what if the user asks about a project they mentioned three weeks ago? This is where metadata-aware vector retrieval becomes critical.

You shouldn't just index everything. You should index Atomic Insights. When a user says 'I prefer Tailwind over Bootstrap,' your background worker should extract that as a fact and store it in a vector DB with a TTL (Time To Live) or a priority score.

import lancedb
from pydantic_ai import Agent # The 2026 standard for typed AI interactions

db = lancedb.connect("/tmp/memory_store")
table = db.open_table("user_facts")

async def retrieve_relevant_memory(user_id: str, current_query: str):
    """Retrieves facts from the vector store based on semantic relevance."""
    results = table.search(current_query) \
        .where(f"user_id = '{user_id}'") \
        .limit(3) \
        .to_list()
    
    context_str = "
".join([r["fact"] for r in results])
    return f"Relevant historical context: {context_str}"

Usage in a tool-calling loop

memory_agent = Agent('openai:gpt-5-preview', deps_type=UserDeps)

@memory_agent.tool async def get_past_preferences(ctx: RunContext[UserDeps]) -> str: return await retrieve_relevant_memory(ctx.deps.user_id, ctx.prompt)

The 'Conflict Resolution' Problem

One thing the documentation rarely mentions is Memory Contradiction. If the user said 'I love Python' in 2024 but says 'I'm moving to Rust' in 2026, a simple vector search will return both. Your agent will be confused.

I solve this by using Temporal Weighting. Every memory in my vector store has a last_updated timestamp. When retrieving, I don't just use cosine similarity; I multiply the similarity score by a decay function:

Score = Similarity * exp(-λ * (CurrentTime - MemoryTime))

This ensures that newer preferences naturally override older ones without explicitly deleting data.

Gotchas: What Usually Goes Wrong

Recursive Summarization Loss: If you summarize a summary, you get 'digital decay.' After 10 cycles, the nuances are gone. Fix: Always keep a 'Core Context' block (e.g., user bio, project goals) that is never summarized and always included in the system prompt.
The Token Trap: Calculating tokens using len(text) / 4 is dangerous. Use tiktoken or the model-specific tokenizer. In 2026, many models use different vocabularies (like the o1-series), and being off by 10% can trigger context overflows in high-load scenarios.
Privacy and Deletion: Memory isn't just technical; it's legal. If a user asks 'Forget everything you know about me,' deleting from a vector DB is easy, but deleting from a 'summarized state' in a message thread is hard. Fix: Store summaries with a version ID linked to the raw messages so you can re-generate summaries post-deletion.

Takeaway

Stop treating conversation history as a simple array. To build a production AI that users actually trust, you must implement a tiered memory controller.

Your action item for today: Implement a token-counting middleware in your LLM calls. If the history exceeds 50% of your target window, trigger a separate LLM call to extract 'Key Entities and Facts' into a structured JSON block. Inject that JSON into the system prompt of the next turn. You'll see an immediate drop in hallucinations and a measurable improvement in user retention.

Context is Everything: Engineering Persistent Memory for LLM Agents

The Goldfish Problem in Conversational AI

Why Context Management is the New Optimization Layer

The Three-Tier Memory Architecture

Implementing Windowed Summarization with LangGraph

Build the graph

Semantic Retrieval: The External Cortex

Usage in a tool-calling loop

The 'Conflict Resolution' Problem

Gotchas: What Usually Goes Wrong

Takeaway

Enjoyed this article?

Related Articles

Beyond Vector Search: Building Production Knowledge Graphs with LLMs

Scaling Engineering Velocity: Building Autonomous Code Review Pipelines in 2026

Uğur Kaval

Engineering Reliable AI Agents: A Practical Guide to Tool Use and Function Calling