
Christian Alexander Nonis

Giving LLMs Real Memory: Why It’s Hard, and How BrainAPI Solves It

Large Language Models (LLMs) are impressive. They can write code, answer questions, and chat fluently on almost any topic.

But there’s one fundamental flaw:

LLMs have no memory.

Unless you manually feed the model your entire conversation history each time, it will forget everything.

For developers, this creates a lot of friction when building anything that needs persistence or context.

In this article, we’ll:

  • Explore why LLM memory is a hard problem
  • See how BrainAPI approaches it with a structured, hybrid memory architecture
  • Walk through a mini tutorial to integrate it into your project

Why LLM Memory Is Hard

LLMs are stateless by design. Each prompt is processed independently.

When you ask a follow-up question, the model doesn’t “remember” the previous answer — it only knows what you explicitly include in the input.

Developers try to work around this with:

  • Retrieval-Augmented Generation (RAG) — storing chunks of data in a vector DB and fetching relevant ones per query
  • Prompt stuffing — appending conversation history to every prompt
  • Manual state tracking — keeping facts in variables or databases and re-injecting them
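To see why prompt stuffing breaks down, here's a minimal sketch of the pattern. `count_tokens` is a rough stand-in for a real tokenizer (one token per word), but the shape of the problem is the same: the prompt grows with every turn, and you pay for the whole history every time.

```python
# Naive "prompt stuffing": re-send the entire history on every turn.
# count_tokens is a rough approximation, not a real tokenizer.

def count_tokens(text: str) -> int:
    """Approximate token count: one token per whitespace-separated word."""
    return len(text.split())

history = []

def build_prompt(user_message: str) -> str:
    """Append the new message and concatenate the full history into one prompt."""
    history.append(f"User: {user_message}")
    return "\n".join(history)

for turn in ["Plan a conference in Rome.",
             "Make it in May.",
             "What city did I pick again?"]:
    prompt = build_prompt(turn)
    print(count_tokens(prompt))  # grows every single turn
```

Three turns in, you're already paying for every previous message on every call, and the cost only compounds from there.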

The problems:

  • RAG struggles with multi-turn continuity (“What was the second method again?”)
  • Prompt stuffing bloats tokens and drives up cost
  • Coreference issues — “it” and “she” become ambiguous without entity tracking
  • No high-level awareness — the bot can’t easily remember your goals, preferences, or evolving context

We need something that:

  1. Understands entities and relationships
  2. Stores facts and knowledge in a structured way
  3. Retrieves context intelligently, not just by keyword similarity
  4. Tracks conversation at both a detail and summary level

Introducing BrainAPI

BrainAPI by Lumen Labs is an on-demand memory layer for LLM applications.

It’s accessible via Python and Node.js SDKs, and it handles:

  • Storing conversation messages
  • Injecting static or dynamic knowledge
  • Retrieving relevant context for the current query

Key differences vs. simple RAG:

  • Coreference resolution — normalizes references so “she” and “Mary” are connected
  • Triplet-based knowledge graph — facts are stored as subject → predicate → object
  • Hybrid retrieval — combines graph traversal and vector similarity search
  • High-level observation layer — summaries of user goals, topics, and context slices

How It Works Under the Hood

The architecture has five layers:

1. Coreference Resolution

Ensures entity consistency across messages.

Example:

"Mary is getting married next year. She wants it in Rome."
→ "Mary is getting married next year. Mary wants the wedding in Rome."

BrainAPI currently uses fastcoref in Python for this step, and the team is exploring a faster C++ rule-based resolver.
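To make the transformation concrete, here's a toy resolver. This is not the fastcoref pipeline — a real resolver infers the pronoun-to-entity mapping from the text itself, while here we pass it in by hand purely to show the rewrite the layer produces.

```python
import re

def resolve_coreferences(text: str, antecedents: dict[str, str]) -> str:
    """Replace each pronoun with its resolved entity.

    A real coreference model (e.g. fastcoref) infers `antecedents` from
    context; here the mapping is hand-supplied for illustration.
    """
    for pronoun, entity in antecedents.items():
        # \b word boundaries avoid replacing substrings inside other words
        text = re.sub(rf"\b{re.escape(pronoun)}\b", entity, text)
    return text

print(resolve_coreferences(
    "Mary is getting married next year. She wants it in Rome.",
    {"She": "Mary", "it": "the wedding"},
))
# → Mary is getting married next year. Mary wants the wedding in Rome.
```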


2. Triplet Extraction & Embedding

From each message or knowledge chunk:

  • Extract subject-predicate-object triples
  • Embed whole phrase and individual entities
  • Wikify entity names to avoid duplicates (e.g. "NYC" → "New York City")
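A toy version of the first step, for intuition only — BrainAPI's real extractor handles arbitrary sentences, whereas this sketch splits around a predicate you already know:

```python
# Illustrative subject-predicate-object extraction for a simple sentence.
# Not BrainAPI's extractor -- just shows the target triple representation.

def extract_triplet(sentence: str, predicate: str) -> tuple[str, str, str]:
    """Split a sentence around a known predicate into (subject, predicate, object)."""
    subject, obj = sentence.rstrip(".").split(f" {predicate} ", 1)
    return (subject.strip(), predicate, obj.strip())

print(extract_triplet("Mary is getting married in Rome", "is getting married in"))
# → ('Mary', 'is getting married in', 'Rome')
```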

3. Storage Backend

  • Neo4j — the knowledge graph
  • Pinecone — vector embeddings for semantic search
  • MongoDB — raw text chunks, logs, and metadata
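To clarify what lives where, here are in-memory stand-ins for the three backends. This is illustrative only — in a real deployment the roles below are played by Neo4j, Pinecone, and MongoDB respectively.

```python
# In-memory stand-ins for the three storage backends.
graph_edges = []    # Neo4j's role: (subject, predicate, object) edges
vector_index = {}   # Pinecone's role: fact id -> embedding
doc_store = []      # MongoDB's role: raw text chunks + metadata

def store_fact(triplet, embedding, raw_text):
    """Write one fact into all three stores and return its id."""
    fact_id = len(doc_store)
    graph_edges.append(triplet)
    vector_index[fact_id] = embedding
    doc_store.append({"id": fact_id, "text": raw_text})
    return fact_id

fid = store_fact(("Mary", "wedding_location", "Rome"),
                 [0.1, 0.8, 0.3],
                 "Mary wants the wedding in Rome.")
```

The point of the split: structured traversal hits the graph, semantic similarity hits the vectors, and the raw text is always recoverable from the document store.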

4. Hybrid Retrieval

When asked “Mary’s wedding date”:

  1. Extract (Mary) and (wedding date)
  2. Search Neo4j for subject = “Mary”
  3. Traverse edges for exact match on object
  4. If no exact match, run vector search on connected nodes
  5. If no subject found, run vector search for closest parent entity
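The steps above can be sketched as follows. The graph, embeddings, and similarity function here are toy stand-ins (real traversal happens in Neo4j and similarity search in Pinecone), but the control flow — exact graph match first, vector fallback second — is the same.

```python
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# Toy graph: subject -> {predicate: object}, with toy predicate embeddings.
graph = {
    "Mary": {"wedding_location": "Rome", "wedding_date": "next year"},
}
predicate_embeddings = {
    "wedding_location": [0.9, 0.1],
    "wedding_date": [0.1, 0.9],
}

def query(subject, predicate, query_embedding):
    edges = graph.get(subject, {})
    if predicate in edges:            # steps 2-3: exact match on the graph
        return edges[predicate]
    if edges:                         # step 4: vector search over connected nodes
        best = max(edges, key=lambda p: cosine(predicate_embeddings[p], query_embedding))
        return edges[best]
    return None                       # step 5 would fall back to a parent-entity search

print(query("Mary", "wedding day", [0.2, 0.95]))  # → next year
```

"wedding day" has no exact edge, but its embedding sits closest to `wedding_date`, so the fallback still lands on the right fact.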

5. High-Level LLM Observations

A summarization layer produces structured observations every few turns:

  • Topics discussed
  • User goals
  • Relevant constraints

These summaries give the bot bird’s-eye awareness without flooding the context window.
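The cadence of that layer looks roughly like this. The summarizer here is a stub returning fixed fields — in practice it would be an LLM call — and the every-N-turns trigger is an assumption for illustration:

```python
# Periodic observation layer: every N turns, condense recent messages
# into a structured observation. summarize() stubs out the LLM call.

SUMMARIZE_EVERY = 3
messages, observations = [], []

def summarize(recent):
    """Stub for an LLM summarization call over the recent turns."""
    return {"topics": ["conference planning"],
            "goals": ["book a venue in Rome"],
            "source_turns": len(recent)}

def on_message(text):
    messages.append(text)
    if len(messages) % SUMMARIZE_EVERY == 0:
        observations.append(summarize(messages[-SUMMARIZE_EVERY:]))

for turn in ["Hi!", "I'm planning a conference.", "It should be in Rome next May."]:
    on_message(turn)
print(len(observations))  # one observation after three turns
```

Because each observation is a compact structured record, the bot can carry goals and topics forward without re-reading the full transcript.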


When to Use BrainAPI

  • Documentation Bots — remember context between Q&A and follow-ups
  • Goal-Oriented Assistants — persist user preferences and constraints
  • Educational Tutors — track student progress and personalize lessons
  • Personal AI Companions — maintain continuity across days or weeks

Mini Tutorial: Adding Memory to Your Bot

Let’s add BrainAPI to a Python chatbot.

1. Install the SDK

pip install lumen-brain

2. Save incoming messages

from lumen_brain import LumenBrainDriver
driver = LumenBrainDriver("your-api-key")

driver.save_message(
    memory_uuid="project-chat-memory",
    content="I’m planning a conference in Rome next May.",
    role="user",
    conversation_id="conv-001"
)

3. Inject Knowledge

driver.inject_knowledge(
    memory_uuid="project-chat-memory",
    type="file",
    content="Our conference venue options include the Colosseum and Forum."
)

4. Retrieve relevant context

result = driver.query_memory(
    text="When is the conference happening again?",
    memory_uuid="project-chat-memory",
    conversation_id="conv-001"
)

# `llm` is your own LLM client; prepend the retrieved context to the question
response = llm.invoke("When is the conference happening again?\n\nContext:\n" + result.context)
