Large Language Models (LLMs) are impressive. They can write code, answer questions, and chat fluently on almost any topic.
But there’s one fundamental flaw:
LLMs have no memory.
Unless you manually feed the model your entire conversation history each time, it will forget everything.
For developers, this creates a lot of friction when building anything that needs persistence or context.
In this article, we’ll:
- Explore why LLM memory is a hard problem
- See how BrainAPI approaches it with a structured, hybrid memory architecture
- Walk through a mini tutorial to integrate it into your project
Why LLM Memory Is Hard
LLMs are stateless by design. Each prompt is processed independently.
When you ask a follow-up question, the model doesn’t “remember” the previous answer — it only knows what you explicitly include in the input.
Developers try to work around this with:
- Retrieval-Augmented Generation (RAG) — storing chunks of data in a vector DB and fetching relevant ones per query
- Prompt stuffing — appending conversation history to every prompt
- Manual state tracking — keeping facts in variables or databases and re-injecting them
The problems:
- RAG struggles with multi-turn continuity (“What was the second method again?”)
- Prompt stuffing bloats tokens and drives up cost
- Coreference issues — “it” and “she” become ambiguous without entity tracking
- No high-level awareness — the bot can’t easily remember your goals, preferences, or evolving context
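To see why prompt stuffing bloats cost, consider a toy calculation (using a rough 4-characters-per-token heuristic, purely illustrative): because every turn re-sends the whole history, total tokens sent grow quadratically with conversation length, not linearly.

```python
# Toy illustration of prompt-stuffing cost: each turn re-sends the whole
# history, so tokens *sent* grow roughly quadratically with turn count.
def estimate_tokens(text: str) -> int:
    # Very rough heuristic: ~4 characters per token.
    return max(1, len(text) // 4)

def cumulative_tokens_sent(messages: list[str]) -> int:
    total = 0
    history = ""
    for msg in messages:
        history += msg + "\n"
        total += estimate_tokens(history)  # the whole history is re-sent each turn
    return total

turns = [f"Message number {i} with some typical content." for i in range(50)]
print(cumulative_tokens_sent(turns))  # grows ~quadratically in the number of turns
```

Doubling the conversation length roughly quadruples the tokens you pay for, which is exactly the bloat a dedicated memory layer is meant to avoid.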
We need something that:
- Understands entities and relationships
- Stores facts and knowledge in a structured way
- Retrieves context intelligently, not just by keyword similarity
- Tracks conversation at both a detail and summary level
Introducing BrainAPI
BrainAPI by Lumen Labs is an on-demand memory layer for LLM applications.
It’s accessible via Python and Node.js SDKs, and it handles:
- Storing conversation messages
- Injecting static or dynamic knowledge
- Retrieving relevant context for the current query
Key differences vs. simple RAG:
- Coreference resolution — normalizes references so “she” and “Mary” are connected
- Triplet-based knowledge graph — facts are stored as subject → predicate → object triples
- Hybrid retrieval — combines graph traversal with vector similarity search
- High-level observation layer — summaries of user goals, topics, and context slices
How It Works Under the Hood
The architecture has five layers:
1. Coreference Resolution
Ensures entity consistency across messages.
Example:
"Mary is getting married next year. She wants it in Rome."
→ "Mary is getting married next year. Mary wants the wedding in Rome."
The current implementation uses fastcoref in Python; a faster rule-based C++ resolver is being explored.
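As a toy illustration of what this step produces (not BrainAPI's actual implementation, and not fastcoref itself), here is a minimal substitution pass over a mention cluster that a resolver would have already identified:

```python
# Toy coreference rewrite: given a mention cluster that a resolver such as
# fastcoref would produce, replace pronoun mentions with the canonical name.
def resolve_cluster(text: str, canonical: str, mentions: list[str]) -> str:
    out = []
    for tok in text.split():
        # strip trailing punctuation for matching, then re-attach it
        core = tok.rstrip(".,!?")
        tail = tok[len(core):]
        out.append(canonical + tail if core in mentions else tok)
    return " ".join(out)

text = "Mary is getting married next year. She wants it in Rome."
print(resolve_cluster(text, "Mary", ["She"]))
# → "Mary is getting married next year. Mary wants it in Rome."
```

A real resolver also handles non-pronoun references (resolving "it" to "the wedding" requires semantic context a simple substitution can't provide).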
2. Triplet Extraction & Embedding
From each message or knowledge chunk:
- Extract subject-predicate-object triples
- Embed whole phrase and individual entities
- Wikify entity names to avoid duplicates (e.g. "NYC" → "New York City")
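A sketch of the data shape this step produces (the alias table stands in for wikification, and the extraction itself is omitted; BrainAPI's real pipeline does both automatically):

```python
# Sketch of triplet normalization: canonicalize entity names via an alias
# table (a stand-in for wikification), then store (subject, predicate, object).
ALIASES = {"NYC": "New York City", "Big Apple": "New York City"}  # hypothetical table

def normalize(entity: str) -> str:
    return ALIASES.get(entity, entity)

def make_triple(subject: str, predicate: str, obj: str) -> tuple[str, str, str]:
    # Each triple (and its individual entities) would then be embedded
    # separately for later vector search.
    return (normalize(subject), predicate, normalize(obj))

print(make_triple("Mary", "lives_in", "NYC"))
# → ('Mary', 'lives_in', 'New York City')
```

Canonicalizing names up front is what keeps "NYC" and "New York City" from becoming two disconnected nodes in the graph.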
3. Storage Backend
- Neo4j — the knowledge graph
- Pinecone — vector embeddings for semantic search
- MongoDB — raw text chunks, logs, and metadata
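To make the division of labor concrete, here is a sketch of how a single triple might land in each store (the Cypher statement and document shapes are illustrative, not BrainAPI's actual schema):

```python
# Illustrative mapping of one triple onto the three backends.
def to_cypher(subject: str, predicate: str, obj: str) -> str:
    # Neo4j: upsert both entities and the relationship between them.
    return (
        f"MERGE (s:Entity {{name: '{subject}'}}) "
        f"MERGE (o:Entity {{name: '{obj}'}}) "
        f"MERGE (s)-[:{predicate.upper()}]->(o)"
    )

def to_vector_record(triple_id: str, embedding: list[float]) -> dict:
    # Pinecone: id + embedding + metadata for semantic search.
    return {"id": triple_id, "values": embedding, "metadata": {"kind": "triple"}}

def to_mongo_doc(triple_id: str, raw_text: str) -> dict:
    # MongoDB: the raw chunk the triple was extracted from, for audit/logs.
    return {"_id": triple_id, "text": raw_text}

print(to_cypher("Mary", "marries_in", "Rome"))
```

Each store plays to its strength: the graph answers structured questions, the vectors handle fuzzy similarity, and the document store keeps the original text around.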
4. Hybrid Retrieval
When asked “Mary’s wedding date”:
- Extract the entities (Mary) and (wedding date)
- Search Neo4j for subject = “Mary”
- Traverse edges for exact match on object
- If no exact match, run vector search on connected nodes
- If no subject found, run vector search for closest parent entity
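The fallback chain above can be sketched as plain Python with stubbed-out graph and vector lookups (all function names and the toy "similarity" are hypothetical, not the SDK's API):

```python
# Sketch of the hybrid retrieval fallback chain with stubbed lookups.
GRAPH = {  # toy knowledge graph: subject -> {predicate: object}
    "Mary": {"wedding date": "next year", "wedding city": "Rome"},
}

def exact_edge_match(subject: str, predicate: str):
    return GRAPH.get(subject, {}).get(predicate)

def vector_search_connected(subject: str, query: str):
    # Stand-in for vector search over nodes connected to `subject`:
    # crude similarity = word overlap between edge label and query.
    edges = GRAPH.get(subject, {})
    best = max(edges, key=lambda p: len(set(p.split()) & set(query.split())), default=None)
    return edges.get(best) if best else None

def vector_search_global(query: str):
    # Stand-in for finding the closest parent entity in the whole index.
    for subject in GRAPH:
        if subject.lower() in query.lower():
            return GRAPH[subject]
    return None

def hybrid_retrieve(subject: str, predicate: str, query: str):
    if subject in GRAPH:
        hit = exact_edge_match(subject, predicate)            # 1) exact graph edge
        return hit if hit else vector_search_connected(subject, query)  # 2) fuzzy, local
    return vector_search_global(query)                        # 3) no subject found

print(hybrid_retrieve("Mary", "wedding date", "Mary's wedding date"))  # → "next year"
```

The point of the ordering is cost and precision: exact graph hits are cheap and unambiguous, so vector search is only the fallback.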
5. High-Level LLM Observations
A summarization layer produces structured observations every few turns:
- Topics discussed
- User goals
- Relevant constraints
These summaries give the bot bird’s-eye awareness without flooding the context window.
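A sketch of what such an observation layer might emit every few turns (the schema, the interval, and the keyword "summarizer" are all hypothetical; the real summaries come from an LLM call):

```python
# Sketch of a periodic observation layer: every N turns, condense the
# recent conversation into a small structured summary.
from dataclasses import dataclass, field

SUMMARY_EVERY = 4  # produce an observation every 4 turns (illustrative choice)

@dataclass
class Observation:
    topics: list[str] = field(default_factory=list)
    goals: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)

def maybe_observe(turn_count: int, recent_messages: list[str]):
    if turn_count % SUMMARY_EVERY != 0:
        return None
    # In the real system an LLM summarizes; here we fake it with keywords.
    topics = [w for w in ("conference", "wedding") if any(w in m for m in recent_messages)]
    return Observation(topics=topics, goals=["plan event"], constraints=[])

obs = maybe_observe(4, ["I'm planning a conference in Rome next May."])
print(obs.topics)  # → ['conference']
```

Because only these compact observations (not the full transcript) are injected, the bot keeps high-level awareness at a near-constant token cost.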
When to Use BrainAPI
- Documentation Bots — remember context between Q&A and follow-ups
- Goal-Oriented Assistants — persist user preferences and constraints
- Educational Tutors — track student progress and personalize lessons
- Personal AI Companions — maintain continuity across days or weeks
Mini Tutorial: Adding Memory to Your Bot
Let’s add BrainAPI to a Python chatbot.
1. Install the SDK
pip install lumen-brain
2. Save incoming messages
from lumen_brain import LumenBrainDriver
driver = LumenBrainDriver("your-api-key")
driver.save_message(
    memory_uuid="project-chat-memory",
    content="I’m planning a conference in Rome next May.",
    role="user",
    conversation_id="conv-001"
)
3. Inject Knowledge
driver.inject_knowledge(
    memory_uuid="project-chat-memory",
    type="file",
    content="Our conference venue options include the Colosseum and Forum."
)
4. Retrieve relevant context
result = driver.query_memory(
    text="When is the conference happening again?",
    memory_uuid="project-chat-memory",
    conversation_id="conv-001"
)
# `llm` here is any chat model client, e.g. a LangChain model
response = llm.invoke({"input": "When is the conference happening again?" + result.context})
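Putting the three calls together, a chat turn looks roughly like this. `FakeDriver` and the final LLM call are stubbed so the sketch is self-contained; the call signatures follow the snippets above, but check them against the SDK docs before relying on them:

```python
# End-to-end sketch: save each user message, fetch memory context, and
# prepend it to the LLM call. The driver is stubbed for illustration.
from types import SimpleNamespace

class FakeDriver:  # stand-in for lumen_brain.LumenBrainDriver
    def __init__(self):
        self.messages = []

    def save_message(self, memory_uuid, content, role, conversation_id):
        self.messages.append(content)

    def query_memory(self, text, memory_uuid, conversation_id):
        # The real SDK returns only the *relevant* context; we return everything.
        return SimpleNamespace(context=" ".join(self.messages))

def chat_turn(driver, user_input: str) -> str:
    driver.save_message(
        memory_uuid="project-chat-memory",
        content=user_input,
        role="user",
        conversation_id="conv-001",
    )
    result = driver.query_memory(
        text=user_input,
        memory_uuid="project-chat-memory",
        conversation_id="conv-001",
    )
    prompt = user_input + "\n\nContext:\n" + result.context
    return prompt  # in a real app: llm.invoke({"input": prompt})

driver = FakeDriver()
chat_turn(driver, "I’m planning a conference in Rome next May.")
print(chat_turn(driver, "When is the conference happening again?"))
```

The pattern is always the same: save, retrieve, prepend. The memory layer decides what "relevant context" means, so your application code stays this simple.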