Part 1 of 3 — Demystifying Conversational Memory: The Foundation Everything Else Builds On
This series is for developers building with LLM APIs who want to understand what's actually happening under the hood.
You've probably noticed that ChatGPT remembers your name mid-conversation. It refers back to things you said ten messages ago. It feels like it's keeping track of you.
Here's the surprising part: the model behind it doesn't remember a thing.
A quick note on terminology: when people say "LLM memory," they often conflate two very different things. There's parametric memory — the knowledge baked into the model's weights during training — and then there's conversational memory — the ability to recall what was said earlier in a chat session. This series is entirely about the second kind: how we engineer the illusion of continuity on top of a model that is, by design, stateless.
Stateless by Design
LLMs exposed via an API are stateless. Each call you make is completely independent. The model processes your input, generates a response, and then — as far as the infrastructure is concerned — that interaction is over. No memory is stored. No session is kept open. The next request you make is treated as if it's the very first one.
This isn't a limitation to work around. It's a deliberate design choice that makes these systems scalable. But it does create a genuinely interesting engineering challenge: if the model remembers nothing, how do we build something that feels like it does?
Let's look at the raw mechanics.
Setup
No frameworks — no LangChain, no LlamaIndex. Just the OpenAI client and a simple helper function.
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # loads OPENAI_API_KEY from .env file

client = OpenAI()

def chat(messages, model="gpt-4o-mini"):
    """Send messages to the chat completions API and return the response text."""
    response = client.chat.completions.create(
        model=model,
        messages=messages,
    )
    return response.choices[0].message.content
That's it. A function that takes a list of messages, sends them to the API, and returns the response. Every example below uses this.
Proving Statelessness — Three Independent Calls
Let's make three separate API calls and watch what happens.
# Call 1: Tell the model our name and ask it back
response1 = chat([
    {"role": "user", "content": "My name is Joy. What is my name?"}
])
print("Call 1:", response1)
# → Call 1: Your name is Joy.
That works — because the name is right there in the same request.
# Call 2: A completely new request — the model has NO memory of Call 1
response2 = chat([
    {"role": "user", "content": "What is my name?"}
])
print("Call 2:", response2)
# → Call 2: I don't know your name unless you tell me. How can I assist you today?
Same question, but the model has no idea who we are. Call 1 never happened as far as this request is concerned.
# Call 3: Asking about conversation history — impossible without context
response3 = chat([
    {"role": "user", "content": "What was the first thing I said to you?"}
])
print("Call 3:", response3)
# → Call 3: The first thing you said to me was, "What was the first thing I said to you?"
The model can only see what's in front of it right now. It literally thinks this one message is the entire conversation.
Takeaway: Each API call is a blank slate. Call 1 worked only because the name was in that same request. Calls 2 and 3 have zero access to Call 1. The LLM API is a stateless function: response = f(messages).
The Elegant Fix — Chat History
Here's where it gets satisfying. The solution isn't complicated at all. We simply keep track of the conversation ourselves — as a plain Python list — and we pass the entire history to the model on every call. This is exactly what ChatGPT does behind the scenes.
# Turn 1: Start a conversation and keep track of the history
conversation = [
    {"role": "user", "content": "My name is Joy."}
]
response = chat(conversation)
print("Turn 1:", response)
# → Turn 1: Nice to meet you, Joy! How can I assist you today?

# Append the assistant's reply to our history
conversation.append({"role": "assistant", "content": response})

# Turn 2: Ask the same question as Call 2 — but now with history attached
conversation.append({"role": "user", "content": "What is my name?"})
response = chat(conversation)
print("Turn 2:", response)
# → Turn 2: Your name is Joy. How can I help you today, Joy?

conversation.append({"role": "assistant", "content": response})

# Turn 3: Ask the same question as Call 3 — now the model can answer correctly
conversation.append({"role": "user", "content": "What was the first thing I said to you?"})
response = chat(conversation)
print("Turn 3:", response)
# → Turn 3: The first thing you said to me was, "My name is Joy."

conversation.append({"role": "assistant", "content": response})
The exact same questions that failed in the three independent calls above now work perfectly. The difference? Context.
Let's look at what we're actually sending to the API — this IS the "memory":
Full conversation history sent with each request:
USER: My name is Joy.
ASSISTANT: Nice to meet you, Joy! How can I assist you today?
USER: What is my name?
ASSISTANT: Your name is Joy. How can I help you today, Joy?
USER: What was the first thing I said to you?
ASSISTANT: The first thing you said to me was, "My name is Joy."
The model didn't "remember" anything between calls. We achieved conversational memory by appending every message to a list and sending the entire list with each new request. Memory is just input data.
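In practice you'd want to stop hand-appending messages after every turn. Here is a minimal sketch of wrapping the append-and-resend pattern in a small class. The class name `Conversation` and the injected `send_fn` parameter are illustrative choices, not part of any library; in real use you would pass in the `chat` helper defined in the Setup section.

```python
class Conversation:
    """Accumulates chat history and sends the full list on every turn."""

    def __init__(self, send_fn, system_prompt=None):
        # send_fn: any callable that takes a list of message dicts and
        # returns the assistant's reply text (e.g. the `chat` helper above).
        self.send_fn = send_fn
        self.messages = []
        if system_prompt:
            self.messages.append({"role": "system", "content": system_prompt})

    def ask(self, user_text):
        # Append the user's message, send the WHOLE history, record the reply.
        self.messages.append({"role": "user", "content": user_text})
        reply = self.send_fn(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Injecting `send_fn` rather than hard-coding the OpenAI client keeps the memory logic testable with a fake function and independent of any one provider.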
Why This Foundation Matters So Much
Understanding this clearly — really clearly, not just vaguely — opens up a set of important questions that become central the moment you start building production AI systems.
On cost: Every token in that growing history is a token you're paying for. A 40-turn conversation at roughly 500 tokens per turn means Turn 40 sends ~20,000 input tokens. Multiply that across thousands of users and the bill adds up fast.
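The growth is easy to quantify. Sticking with the rough 500-tokens-per-turn assumption above (an illustration, not a measured figure), each request carries the entire history, so input tokens grow linearly per request and quadratically in total:

```python
# Back-of-the-envelope cost of full-buffer memory.
TOKENS_PER_TURN = 500  # rough assumption for illustration

def input_tokens_at_turn(n):
    """Tokens sent with the request for turn n (all prior turns + new one)."""
    return n * TOKENS_PER_TURN

def total_input_tokens(turns):
    """Total input tokens billed across the whole conversation."""
    return sum(input_tokens_at_turn(n) for n in range(1, turns + 1))

print(input_tokens_at_turn(40))  # → 20000, the ~20k figure above
print(total_input_tokens(40))    # → 410000 input tokens billed in total
```

Note the second number: you don't just pay for the 20,000-token final request, you pay for every intermediate request along the way.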
On speed: More input tokens means longer response times. Long conversations become noticeably slower — not because the model is tired, but because the payload is genuinely getting larger with every turn.
On limits: Every LLM has a context window — a ceiling on how many tokens it can process in a single call. When the history exceeds the window, you have to make choices about what to keep and what to drop. Those choices matter enormously.
These aren't theoretical concerns. They are the exact problems that motivated frameworks like LangChain to exist.
What Comes Next
This manual approach works, but it has limitations. In Part 2, we'll explore how frameworks like LangChain manage conversational memory automatically using strategies like:
- Full buffer — store the entire conversation history (what we did here, but framework-managed)
- Sliding window — keep only the last K messages, dropping older ones
- Summary memory — condense older messages into a compact summary using an LLM
These strategies let us maintain useful context without sending the entire conversation history every time.
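As a tiny preview of the sliding-window idea, here is a sketch of what the trimming step boils down to. This is hand-rolled for illustration and is not LangChain's actual API; the function name and the choice to always preserve the system message are assumptions.

```python
def trim_to_window(messages, k=6):
    """Keep any system messages plus only the last k conversational messages."""
    system = [m for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    return system + rest[-k:]

# Example: a system prompt plus ten user messages, trimmed to the last six.
history = [{"role": "system", "content": "You are helpful."}]
history += [{"role": "user", "content": f"msg {i}"} for i in range(10)]
print(len(trim_to_window(history, k=6)))  # → 7 (1 system + last 6 messages)
```

The trade-off is obvious even at this size: anything outside the window, like a name given in turn one, is simply gone. That is exactly the kind of choice Part 2 digs into.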