Ways to Give an LLM a Memory It Doesn't Have

March 22, 2026

Part 2 of 3 — Demystifying Conversational Memory: Framework-Managed Memory with LangChain

This series is for developers building with LLM APIs who want to understand what's actually happening under the hood.


In Part 1, we proved that LLM APIs are stateless and showed how manually appending chat history creates conversational memory. That works — but it doesn't scale.

This post introduces frameworks that manage conversational memory for us, and demonstrates three memory strategies using LangChain.

Why Use a Framework?

The manual approach from Part 1 has three problems:

  1. Unbounded token growth — The conversation list grows with every turn. Eventually it exceeds the model's context window and the API call fails.
  2. Cost — Every token sent costs money. Resending the entire history each time is wasteful for long conversations.
  3. Boilerplate — The append-and-send pattern is repeated everywhere. It's error-prone and tedious to maintain.
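The first problem compounds quietly: because every request resends the whole history, total tokens sent grow quadratically with the number of turns. A back-of-the-envelope sketch (plain Python, no API calls; the tokens-per-turn figure is an illustrative assumption):

```python
def cumulative_tokens(n_turns, tokens_per_turn):
    """Total tokens sent across a conversation when each request
    resends the entire history (the manual approach from Part 1)."""
    total = 0
    history_size = 0
    for _ in range(n_turns):
        history_size += tokens_per_turn  # history grows every turn...
        total += history_size            # ...and is resent in full
    return total

# Growth is quadratic: 10x the turns costs roughly 100x the tokens.
print(cumulative_tokens(10, 100))   # 5500
print(cumulative_tokens(100, 100))  # 505000
```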

Frameworks solve this by abstracting memory management behind a clean interface. We configure a memory strategy, and the framework handles the rest.

Several frameworks offer this — LangChain, LlamaIndex, Haystack, among others. We'll use LangChain here.


Setup

from dotenv import load_dotenv
from langchain_openai import ChatOpenAI
from langchain_core.chat_history import InMemoryChatMessageHistory
from langchain_core.messages import trim_messages, SystemMessage

load_dotenv()

llm = ChatOpenAI(model="gpt-4o-mini")

LangChain provides InMemoryChatMessageHistory — a message store that tracks conversation turns with .add_user_message() and .add_ai_message(). We call the LLM directly with llm.invoke(history.messages).

Each strategy below differs only in how we manage the history before sending it to the LLM.


Strategy 1: Full Buffer

The simplest strategy: store the entire conversation history. This is the framework equivalent of what we built manually in Part 1, with LangChain handling the append-and-send logic for us.

history = InMemoryChatMessageHistory()

def send(user_msg):
    """Add user message, call LLM, store response, and print the exchange."""
    history.add_user_message(user_msg)
    response = llm.invoke(history.messages)
    history.add_ai_message(response.content)
    print(f"User: {user_msg}")
    print(f"AI:   {response.content}")
    print("\nHistory:")
    for msg in history.messages:
        print(f"  {type(msg).__name__:>12}: {msg.content}")

send("My name is Joy.")
#  User: My name is Joy.
#  AI:   Nice to meet you, Joy! How can I assist you today?
#
#  History:
#    HumanMessage: My name is Joy.
#       AIMessage: Nice to meet you, Joy! How can I assist you today?

send("What is my name?")
#  User: What is my name?
#  AI:   Your name is Joy. How can I help you today?
#
#  History:
#    HumanMessage: My name is Joy.
#       AIMessage: Nice to meet you, Joy! How can I assist you today?
#    HumanMessage: What is my name?
#       AIMessage: Your name is Joy. How can I help you today?

send("What was the first thing I said to you?")
#  User: What was the first thing I said to you?
#  AI:   The first thing you said to me was, "My name is Joy."
#
#  History:
#    HumanMessage: My name is Joy.
#       AIMessage: Nice to meet you, Joy! How can I assist you today?
#    HumanMessage: What is my name?
#       AIMessage: Your name is Joy. How can I help you today?
#    HumanMessage: What was the first thing I said to you?
#       AIMessage: The first thing you said to me was, "My name is Joy."

Takeaway: Same result as Part 1's manual approach, with the append-and-send bookkeeping handled by the message store. The downside is the same too: unbounded growth.
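To put a rough number on "unbounded", here is a quick sketch of how many turns fit before a full-buffer history outgrows the context window. Both figures are illustrative assumptions, not measurements:

```python
def turns_until_overflow(context_window_tokens, tokens_per_turn):
    """Number of turns before the full-buffer history no longer
    fits in the model's context window (integer division)."""
    return context_window_tokens // tokens_per_turn

# Assuming a 128k-token context window and ~200 tokens per turn:
print(turns_until_overflow(128_000, 200))  # 640
```

Long-running assistants hit this ceiling sooner than you might expect, which is why the next two strategies bound the history.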


Strategy 2: Sliding Window

Keeps only the last K messages; older messages are dropped entirely using trim_messages. This puts a hard bound on how much history is sent each turn.

history = InMemoryChatMessageHistory()

def send(user_msg):
    """Add user message, trim to last 4 messages, call LLM, and store response."""
    history.add_user_message(user_msg)
    trimmed = trim_messages(
        history.messages,
        max_tokens=4,      # with token_counter=len, this is a message count, not tokens
        strategy="last",   # keep the most recent messages
        token_counter=len,
    )
    response = llm.invoke(trimmed)
    history.add_ai_message(response.content)
    print(f"User: {user_msg}")
    print(f"AI:   {response.content}")
    print("\nWhat the LLM sees (trimmed):")
    for msg in trimmed:
        print(f"  {type(msg).__name__:>12}: {msg.content}")

send("My name is Joy.")
#  User: My name is Joy.
#  AI:   Nice to meet you, Joy! How can I assist you today?
#
#  What the LLM sees (trimmed):
#    HumanMessage: My name is Joy.

send("What is my name?")
#  User: What is my name?
#  AI:   Your name is Joy!
#
#  What the LLM sees (trimmed):
#    HumanMessage: My name is Joy.
#       AIMessage: Nice to meet you, Joy! How can I assist you today?
#    HumanMessage: What is my name?

So far so good — Turn 1's messages still fit within the 4-message window. But watch what happens on Turn 3:

send("What was the first thing I said to you?")
#  User: What was the first thing I said to you?
#  AI:   The first thing you said to me was, "What is my name?"
#
#  What the LLM sees (trimmed):
#       AIMessage: Nice to meet you, Joy! How can I assist you today?
#    HumanMessage: What is my name?
#       AIMessage: Your name is Joy!
#    HumanMessage: What was the first thing I said to you?

The model got it wrong. With 5 messages in history, the oldest one — "My name is Joy." — was trimmed. The model can only see the last 4 messages, so it thinks the conversation started with "What is my name?".

Takeaway: Token usage is bounded, but older context is permanently lost. The sliding window trades recall for efficiency.
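Because token_counter=len counts each message as one unit, the trim above behaves like a simple list slice. A plain-Python sketch of the same windowing (no LangChain required), using the conversation from this section:

```python
def sliding_window(messages, k=4):
    """Keep only the last k messages, mirroring the behavior of
    trim_messages(..., max_tokens=k, strategy="last", token_counter=len)."""
    return messages[-k:]

conversation = [
    "Human: My name is Joy.",
    "AI: Nice to meet you, Joy!",
    "Human: What is my name?",
    "AI: Your name is Joy!",
    "Human: What was the first thing I said to you?",
]
# With 5 messages, the first one falls outside the window,
# so the model never sees the user's introduction.
print(sliding_window(conversation))
```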


Strategy 3: Summary Memory

Instead of dropping old messages, this strategy summarizes them using an LLM. The summary replaces the raw history, preserving the key information in fewer tokens.

The tradeoff: extra LLM calls for summarization, which add latency and cost whenever the history crosses the size threshold.

history = InMemoryChatMessageHistory()

def summarize_messages(messages):
    """Ask the LLM to summarize a list of messages into a single sentence."""
    text = "\n".join(f"{type(m).__name__}: {m.content}" for m in messages)
    return llm.invoke(
        f"Summarize this conversation in one concise sentence:\n{text}"
    ).content

def send(user_msg):
    """Add user message, summarize if history is long, call LLM, and store response."""
    history.add_user_message(user_msg)
    if len(history.messages) > 4:
        summary = summarize_messages(history.messages[:-1])
        history.messages = [
            SystemMessage(content=f"Previous conversation summary: {summary}"),
            history.messages[-1],
        ]
    response = llm.invoke(history.messages)
    history.add_ai_message(response.content)
    print(f"User: {user_msg}")
    print(f"AI:   {response.content}")
    print("\nHistory:")
    for msg in history.messages:
        print(f"  {type(msg).__name__:>14}: {msg.content}")

send("My name is Joy.")
#  User: My name is Joy.
#  AI:   Nice to meet you, Joy! How can I assist you today?
#
#  History:
#      HumanMessage: My name is Joy.
#         AIMessage: Nice to meet you, Joy! How can I assist you today?

send("What is my name?")
#  User: What is my name?
#  AI:   Your name is Joy. How can I help you today, Joy?
#
#  History:
#      HumanMessage: My name is Joy.
#         AIMessage: Nice to meet you, Joy! How can I assist you today?
#      HumanMessage: What is my name?
#         AIMessage: Your name is Joy. How can I help you today, Joy?

Still within 4 messages, so no summarization yet. Now Turn 3 triggers it:

send("What was the first thing I said to you?")
#  User: What was the first thing I said to you?
#  AI:   You introduced yourself by saying your name is Joy.
#
#  History:
#     SystemMessage: Previous conversation summary: The conversation involves Joy
#                    introducing herself to the AI, which then confirms her name
#                    and asks how it can help her.
#      HumanMessage: What was the first thing I said to you?
#         AIMessage: You introduced yourself by saying your name is Joy.

The four previous messages were compressed into a single system message summary — but the key fact (the user's name is Joy) survived. The model answered correctly, unlike the sliding window.

Takeaway: The summary preserves key facts without storing every message verbatim. Token usage stays compact even as conversations grow. The cost is extra LLM calls to refresh the summary as the history repeatedly crosses the threshold.
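The control flow generalizes beyond LangChain. Here is a plain-Python sketch of the same threshold-based compaction, with the summarizer stubbed out as an arbitrary callable (in the code above it is an LLM call):

```python
def compact(messages, summarize, max_len=4):
    """Once history exceeds max_len, collapse everything except the
    newest message into a single summary entry. `summarize` is any
    callable taking a list of messages and returning a string."""
    if len(messages) <= max_len:
        return messages
    summary = summarize(messages[:-1])
    return [f"Previous conversation summary: {summary}", messages[-1]]

msgs = [
    "Human: My name is Joy.",
    "AI: Nice to meet you, Joy!",
    "Human: What is my name?",
    "AI: Your name is Joy.",
    "Human: What was the first thing I said to you?",
]
# Stub summarizer for the sketch; a real one would call the LLM here.
compacted = compact(msgs, lambda ms: f"({len(ms)} earlier messages)")
print(compacted)
```

Separating the summarizer from the compaction logic also makes the strategy easy to unit-test without touching an API.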


What Comes Next

The strategies in this post manage memory by manipulating the chat history sent with each request. This is just one approach. LangChain also offers more advanced memory management — including short-term memory, long-term memory, and persistent stores — that go beyond simple history manipulation. We'll explore these in Part 3.