<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Future: Seenivasa Ramadurai</title>
    <description>The latest articles on Future by Seenivasa Ramadurai (@sreeni5018).</description>
    <link>https://future.forem.com/sreeni5018</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1829954%2F564a03dc-062e-4c58-b28e-be52605aefa8.jpg</url>
      <title>Future: Seenivasa Ramadurai</title>
      <link>https://future.forem.com/sreeni5018</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://future.forem.com/feed/sreeni5018"/>
    <language>en</language>
    <item>
      <title>From RAG to Knowledge Graphs: Why the Agent Era Is Redefining AI Architecture</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sun, 12 Apr 2026 15:23:54 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/from-rag-to-knowledge-graphs-why-the-agent-era-is-redefining-ai-architecture-3fgc</link>
      <guid>https://future.forem.com/sreeni5018/from-rag-to-knowledge-graphs-why-the-agent-era-is-redefining-ai-architecture-3fgc</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawrm0mmn6ijdke4059m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawrm0mmn6ijdke4059m5.png" alt=" " width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One question is dominating AI architecture discussions right now: we already built RAG, everyone is talking about GraphRAG, so should we move?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the surface&lt;/strong&gt;, it looks like a &lt;strong&gt;standard tech upgrade cycle.&lt;/strong&gt; &lt;strong&gt;Underneath, something more fundamental is happening&lt;/strong&gt;: a debate about how we &lt;strong&gt;represent knowledge, how we retrieve it, and how we expect machines to reason over it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;last two years&lt;/strong&gt;, the industry followed a predictable path. We started with &lt;strong&gt;raw Large Language Models&lt;/strong&gt;, quickly realized they could &lt;strong&gt;hallucinate&lt;/strong&gt; with &lt;strong&gt;terrifying confidence&lt;/strong&gt;, and turned to &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; to &lt;strong&gt;ground&lt;/strong&gt; them in real data. It was a genuine breakthrough. Suddenly you could connect a model to your &lt;strong&gt;PDFs&lt;/strong&gt;, internal &lt;strong&gt;portals&lt;/strong&gt;, wikis, and live &lt;strong&gt;databases&lt;/strong&gt; without the nightmare of constant retraining. For most teams, it felt like magic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then the ceiling arrived&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teams started noticing that &lt;strong&gt;RAG was useful&lt;/strong&gt;, but not intelligent. It &lt;strong&gt;could find relevant text&lt;/strong&gt;. It couldn't &lt;strong&gt;understand how things actually connected.&lt;/strong&gt; This &lt;strong&gt;gap between finding information&lt;/strong&gt; and &lt;strong&gt;understanding relationships&lt;/strong&gt; is what &lt;strong&gt;drove&lt;/strong&gt; the industry toward &lt;strong&gt;Knowledge Graphs and GraphRAG.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, just as that conversation is picking up steam, another shift is already underway: &lt;strong&gt;agentic AI.&lt;/strong&gt; &lt;strong&gt;Autonomous agents&lt;/strong&gt;, &lt;strong&gt;dynamic tool use&lt;/strong&gt;, and &lt;strong&gt;multi-step orchestration&lt;/strong&gt; are changing the very definition of what &lt;strong&gt;retrieval&lt;/strong&gt; even means. It's no longer about fetching facts; it's &lt;strong&gt;about giving machines the cognitive infrastructure to solve genuinely complex problems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you commit to your next infrastructure pivot, let's slow down and answer the questions that actually matter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What exactly is RAG, and where does it fail?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why did GraphRAG emerge,&lt;/strong&gt; and what is the &lt;strong&gt;real cost of building it?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In a world of agents&lt;/strong&gt;, do we still need it the same way?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This blog is the roadmap for that journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem RAG Solved (and Why It Mattered So Much)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A large language model is trained on enormous amounts of text.&lt;/strong&gt; That gives it remarkable linguistic ability and broad general knowledge, but it &lt;strong&gt;comes with a hard constraint&lt;/strong&gt;. The model doesn't know your &lt;strong&gt;enterprise data&lt;/strong&gt;, your &lt;strong&gt;latest reports&lt;/strong&gt;, your &lt;strong&gt;private&lt;/strong&gt; &lt;strong&gt;documents&lt;/strong&gt;, or the product changes that landed last Tuesday. And if it doesn't know something? It may still generate a confident, fluent answer anyway. That's &lt;strong&gt;hallucination&lt;/strong&gt;, and it's &lt;strong&gt;not a bug you can patch; it's structural.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG solves this by moving knowledge outside the model and fetching it dynamically at query time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv3m3w9v0corobfj385u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv3m3w9v0corobfj385u.png" alt=" " width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The flow is straightforward:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingest your documents&lt;/strong&gt;: PDFs, emails, contracts, meeting notes, tickets, &lt;strong&gt;whatever lives in your knowledge ecosystem&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chunk&lt;/strong&gt; the text into smaller, searchable units (chunk size matters enormously: too small and you lose context, too large and retrieval gets noisy).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embed&lt;/strong&gt; each chunk using an embedding model, converting text into dense numerical vectors that capture semantic meaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Index&lt;/strong&gt; those vectors in a vector database: FAISS, Qdrant, Pinecone, Chroma, Weaviate, or Milvus are common choices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At query time, &lt;strong&gt;embed the user's question&lt;/strong&gt;, find the most semantically similar chunks, inject them into the prompt, and let the LLM answer from real evidence.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
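The five steps above can be sketched end to end. This is a toy illustration, not any particular library: character-bigram counts stand in for a real embedding model, a plain list stands in for a vector database, and the names (embed, retrieve) and sample chunks are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: L2-normalized character-bigram counts.
    A real pipeline would call an embedding model here instead."""
    lowered = text.lower()
    bigrams = Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))
    norm = math.sqrt(sum(v * v for v in bigrams.values()))
    return {k: v / norm for k, v in bigrams.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    return sum(a[k] * b.get(k, 0.0) for k in a)

# Steps 1-3: ingest, chunk, and embed (pre-chunked here for brevity).
chunks = [
    "The payments platform depends on the fraud detection service.",
    "Quarterly revenue grew eight percent year over year.",
    "The new VP of engineering now leads the payments platform team.",
]
# Step 4: the "vector index" is just (chunk, vector) pairs.
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=2):
    """Step 5: embed the query, return the top-k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = retrieve("Who leads the payments platform?")
```

The retrieved chunks would then be injected into the prompt so the LLM answers from evidence rather than memory.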

&lt;p&gt;It changed practical AI development. It gave teams a way to build &lt;strong&gt;grounded document assistants&lt;/strong&gt;, &lt;strong&gt;enterprise search tools&lt;/strong&gt;, &lt;strong&gt;Q&amp;amp;A bots&lt;/strong&gt;, and domain-specific copilots without retraining foundation models. &lt;strong&gt;And it introduced an architectural principle that remains one of the most important ideas in modern AI systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The model doesn't need to contain all knowledge internally, if we can retrieve the right knowledge externally at the right moment.&lt;br&gt;
That idea isn't going away. But it has limits.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where RAG Starts Struggling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The challenge with RAG isn't that it's bad&lt;/strong&gt;. The challenge is that it's optimized for similarity, not structure.&lt;/p&gt;

&lt;p&gt;That difference turns out to matter a great deal in practice.&lt;br&gt;
Imagine someone asks a question: &lt;strong&gt;Which projects are affected by the recent leadership changes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpip4r7eau39nss2okcq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpip4r7eau39nss2okcq.png" alt=" " width="800" height="778"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;classic RAG&lt;/strong&gt; system might &lt;strong&gt;retrieve&lt;/strong&gt; a chunk about a &lt;strong&gt;new VP appointment&lt;/strong&gt;, another about a project roadmap, another about budget realignments, and another about team restructuring. Each chunk could be individually relevant. &lt;strong&gt;But the system has no natural way to understand that the VP change affects Project A through a specific reporting line&lt;/strong&gt;, or that the &lt;strong&gt;budget change flows to Project B&lt;/strong&gt; because of a procurement dependency. &lt;strong&gt;RAG retrieved similar text. It didn't model how things connect.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This plays out in &lt;strong&gt;three structural pain points&lt;/strong&gt; that no amount of implementation tuning fully resolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relationships Don't Live in Paragraphs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real world knowledge is relational.&lt;/strong&gt; Drugs interact with proteins. Engineers depend on infrastructure. Transactions flow through accounts. Court rulings reference precedents. Products belong to supply chains. None of this structure lives cleanly in a paragraph and vector similarity can't reconstruct it from loose chunks.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Context Isn't the Same as Better Context
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;As context windows have grown from 4K to 128K to 1M tokens&lt;/strong&gt;, the tempting fix has been to just send more chunks. &lt;strong&gt;But flooding the LLM with additional text doesn't compensate for missing structure.&lt;/strong&gt; Research has consistently shown that LLMs are sensitive to redundant and noisy context: more text can actively degrade answer quality when the signal is buried in noise. A 2023 paper from Stanford memorably called this the &lt;strong&gt;lost-in-the-middle problem&lt;/strong&gt;: models perform worse when the relevant information is buried inside long contexts rather than positioned at the edges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Relevance ≠ Global Understanding
&lt;/h2&gt;

&lt;p&gt;RAG surfaces locally relevant text fragments. It doesn't provide a &lt;strong&gt;holistic view of a domain, network, or system.&lt;/strong&gt; This becomes a serious limitation in scientific literature review, financial relationship analysis, legal precedent mapping, biomedical research, and any domain where the value lies not just in what's said, but in how facts connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At some point, teams hit a realization: if the problem isn't finding relevant text but navigating connected knowledge, then text chunks might be the wrong unit of retrieval entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Knowledge Graph Actually Is
&lt;/h2&gt;

&lt;p&gt;A Knowledge Graph is a way of representing knowledge as explicit entities and relationships, rather than as paragraphs that leave the model to infer structure later.&lt;br&gt;
At the heart of this is a simple but powerful idea called a &lt;strong&gt;triplet&lt;/strong&gt;:&lt;/p&gt;

&lt;h2&gt;
  
  
  (Subject → Relationship → Object)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Ram → leads → Project A)&lt;br&gt;
(Project A → depends_on → Payments Platform v2)&lt;br&gt;
(Payments Platform v2 → owned_by → FinTech Division)&lt;br&gt;
(FinTech Division → reports_to → CTO Office)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ow1cf18o2xvz6l2ugq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ow1cf18o2xvz6l2ugq8.png" alt=" " width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice what just happened. We didn't store paragraphs. &lt;strong&gt;We stored meaning in a form the system can traverse&lt;/strong&gt;, query, and reason over. &lt;/p&gt;

&lt;h2&gt;
  
  
  Now we can ask:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What does the CTO Office indirectly own?&lt;/strong&gt; and follow the chain. We can also ask:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks if the Payments Platform is delayed?&lt;/strong&gt; and trace the dependencies. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We've moved from retrieving information to navigating knowledge.&lt;/strong&gt;&lt;/p&gt;
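Once the four triplets above are loaded into an adjacency structure, a question like "What breaks if the Payments Platform is delayed?" becomes a reverse traversal over depends_on edges. A minimal sketch in plain Python, with no graph database required:

```python
from collections import defaultdict

# The triplets from the example above.
triplets = [
    ("Ram", "leads", "Project A"),
    ("Project A", "depends_on", "Payments Platform v2"),
    ("Payments Platform v2", "owned_by", "FinTech Division"),
    ("FinTech Division", "reports_to", "CTO Office"),
]

# Index edges by object so we can walk relationships backwards.
in_edges = defaultdict(list)  # object -&gt; [(relation, subject)]
for s, r, o in triplets:
    in_edges[o].append((r, s))

def impacted_by(node):
    """Everything that transitively depends on `node`:
    follow depends_on edges in reverse until the chain ends."""
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for rel, subj in in_edges[current]:
            if rel == "depends_on":
                hits.append(subj)
                stack.append(subj)
    return hits

print(impacted_by("Payments Platform v2"))  # prints ['Project A']
```

The same index, traversed over different relation types, answers the ownership and reporting-line questions too.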

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F983k09wfh68i3qz6aht1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F983k09wfh68i3qz6aht1.png" alt=" " width="800" height="751"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge graphs&lt;/strong&gt; are stored as &lt;strong&gt;directed graphs&lt;/strong&gt;: &lt;strong&gt;nodes&lt;/strong&gt; are &lt;strong&gt;entities&lt;/strong&gt;, &lt;strong&gt;edges&lt;/strong&gt; are typed &lt;strong&gt;relationships&lt;/strong&gt;. This structure enables graph traversal algorithms, &lt;strong&gt;multi-hop queries&lt;/strong&gt;, &lt;strong&gt;shortest-path analysis&lt;/strong&gt;, and network centrality calculations, none of which &lt;strong&gt;are available in a flat vector index&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Knowledge Graphs Already Live
&lt;/h2&gt;

&lt;p&gt;Knowledge graphs aren't a new invention. Google has used its Knowledge Graph to power search since 2012, and Wikidata, the structured-data backbone of Wikipedia, contains over 100 million items. The biomedical knowledge graph OpenBioLink contains millions of interactions between genes, proteins, diseases, and drugs. LinkedIn's economic graph models relationships between professionals, companies, skills, and jobs at scale. These aren't prototypes; they're production systems handling billions of queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GraphRAG Is and How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG&lt;/strong&gt;, popularized significantly by a 2024 Microsoft Research paper, is a framework that uses a knowledge graph as the retrieval layer for an LLM, rather than a flat vector index.&lt;/p&gt;

&lt;p&gt;The core intuition: instead of retrieving &lt;strong&gt;semantically similar text chunks&lt;/strong&gt;, &lt;strong&gt;retrieve connected knowledge from a graph&lt;/strong&gt;, then provide that richer context to the model.&lt;br&gt;
GraphRAG typically involves three stages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fja5bqngzcqiuvxm7f3rw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fja5bqngzcqiuvxm7f3rw.png" alt=" " width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Graph Based Indexing
&lt;/h2&gt;

&lt;p&gt;You build and index a graph. This might be an existing open knowledge graph (Wikidata, ConceptNet, UMLS for medical domains), a domain-specific proprietary graph, or a graph you construct from your own corpus using extraction pipelines. Proper indexing matters: retrieval can use text descriptions, graph topology, embeddings over graph structure, hybrid schemes, or all of the above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: Graph Guided Retrieval
&lt;/h2&gt;

&lt;p&gt;When a user asks a question, the system identifies relevant entities, then traverses relationships, paths, and subgraphs to assemble a richer answer context. This may involve entity linking, k-hop neighborhood expansion, Personalized PageRank, community detection, or LLM-directed graph traversal. The Microsoft GraphRAG paper specifically introduced a community summarization approach, using graph algorithms to identify clusters of related entities and pre-generating summaries, which dramatically improved performance on global sense-making tasks like "What are the major themes in this document corpus?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Graph Enhanced Generation
&lt;/h2&gt;

&lt;p&gt;Once relevant graph knowledge is identified, it's translated into a form the LLM can consume: raw triplets, adjacency lists, natural language descriptions of paths, or structured summaries. This translation step is critical and often underestimated: LLMs are sequence models trained on text, not graph traversal engines. The quality of this bridge between graph structure and language generation largely determines whether GraphRAG actually outperforms RAG in practice.&lt;/p&gt;
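A minimal sketch of that graph-to-text bridge, assuming each retrieved triplet is simply flattened into one sentence per edge. Real systems often use richer path descriptions or pre-generated summaries; this just shows the shape of the step.

```python
def triplets_to_text(triplets):
    """Render (subject, relation, object) triplets as plain sentences
    that can be pasted into an LLM prompt as grounding context."""
    return "\n".join(
        f"{s} {r.replace('_', ' ')} {o}." for s, r, o in triplets
    )

context = triplets_to_text([
    ("Project A", "depends_on", "Payments Platform v2"),
    ("Payments Platform v2", "owned_by", "FinTech Division"),
])
# context:
# Project A depends on Payments Platform v2.
# Payments Platform v2 owned by FinTech Division.
```

Even this naive flattening makes the retrieved structure legible to a sequence model; the design question is how much structure survives the translation.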

&lt;h2&gt;
  
  
  How Knowledge Graphs Get Built: The Extraction Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before you can run GraphRAG, you need a graph&lt;/strong&gt;. Building one from your own data means running an information extraction pipeline over your corpus.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two core tasks are:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Named Entity Recognition (NER):&lt;/strong&gt; Identifying entities in text: &lt;strong&gt;people&lt;/strong&gt;, &lt;strong&gt;organizations&lt;/strong&gt;, &lt;strong&gt;products&lt;/strong&gt;, &lt;strong&gt;locations&lt;/strong&gt;, medical conditions, financial instruments, events, and whatever entity types your domain requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relation Extraction (RE):&lt;/strong&gt; Identifying the relationships between those &lt;strong&gt;entities&lt;/strong&gt;: &lt;strong&gt;works_at&lt;/strong&gt;, &lt;strong&gt;acquired&lt;/strong&gt;, &lt;strong&gt;causes&lt;/strong&gt;, &lt;strong&gt;located_in&lt;/strong&gt;, depends_on, cited_by.&lt;/p&gt;

&lt;p&gt;Historically, this required &lt;strong&gt;expensive annotated training data and domain-specific supervised models&lt;/strong&gt;. Modern LLMs have changed the economics significantly. &lt;strong&gt;You can prompt a model to extract entities and relationships from a document in a single pass, using in context examples to define your schema.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct71v21zzhnlo6rsroen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct71v21zzhnlo6rsroen.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Practical Approaches
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Custom LLM pipelines:
&lt;/h2&gt;

&lt;p&gt;You &lt;strong&gt;design&lt;/strong&gt; &lt;strong&gt;prompts&lt;/strong&gt; that &lt;strong&gt;specify exactly what entity types and relationship types to extract&lt;/strong&gt;, validate the output, handle edge cases, and write the results to your &lt;strong&gt;graph database&lt;/strong&gt;, often &lt;strong&gt;Neo4j&lt;/strong&gt;, &lt;strong&gt;which uses the Cypher query language&lt;/strong&gt;. This gives you fine-grained domain control but requires serious engineering effort: output validation, error handling, entity disambiguation (is OpenAI the same as Open AI?), conflict resolution, and ongoing maintenance. For enterprise-grade graphs that become core assets, this is usually the right investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. LangChain GraphTransformers / LlamaIndex graph tools
&lt;/h2&gt;

&lt;p&gt;Frameworks like &lt;strong&gt;LangChain's&lt;/strong&gt; &lt;strong&gt;LLMGraphTransformer&lt;/strong&gt; abstract much of this into a &lt;strong&gt;few lines of code&lt;/strong&gt;. You hand it documents and get back structured graph documents you can load into a graph store. This is excellent for prototyping and early validation: you can have a working graph in hours, not weeks. &lt;strong&gt;The tradeoff is less control over extraction quality and ontology design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A pragmatic approach:&lt;/strong&gt; use LangChain tools to validate the concept and understand the data, then invest in a custom pipeline when the graph becomes a production dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Costs of GraphRAG (The Part Most Bloggers Skip)
&lt;/h2&gt;

&lt;p&gt;Here's where most GraphRAG enthusiasm runs ahead of reality. The framework is genuinely powerful but it carries costs that compound at scale. Teams that discover these after committing to the architecture tend to have strong opinions about them.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Compute Cost Is a Design Constraint, Not a Detail
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Building a graph&lt;/strong&gt; from a &lt;strong&gt;large corpus&lt;/strong&gt; means running LLM-based extraction over every &lt;strong&gt;document, often multiple passes for NER, RE&lt;/strong&gt;, and disambiguation. At scale, this gets expensive fast: a corpus of 100,000 documents running extraction at &lt;strong&gt;$0.01 per document is $1,000 to build&lt;/strong&gt;. But knowledge changes. Documents get updated, entities evolve, relationships become stale. &lt;strong&gt;This isn't a one-time cost; it's an ongoing infrastructure commitment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Microsoft GraphRAG&lt;/strong&gt; paper noted that graph construction costs can be &lt;strong&gt;10–100x higher than standard RAG indexing&lt;/strong&gt;, depending on corpus size and extraction complexity. For many use cases, that's a reasonable investment. For others, it's prohibitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Maintenance Is Continuous and Non Trivial
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;In a standard RAG system&lt;/strong&gt;, updating the index when data changes is relatively mechanical: process the new document, chunk it, embed it, replace the old vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In GraphRAG, a new document isn't just new text. It may:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Introduce entities&lt;/strong&gt; not yet in the graph&lt;/li&gt;
&lt;li&gt;Rename or merge &lt;strong&gt;existing entities&lt;/strong&gt; (disambiguation challenge)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add&lt;/strong&gt; &lt;strong&gt;relationships&lt;/strong&gt; that contradict previously stored ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require schema updates&lt;/strong&gt; to accommodate new relationship types&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trigger cascading updates across connected subgraphs&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real knowledge graph maintenance involves entity resolution (merging duplicate nodes), relationship validation, conflict handling, ontology management, and quality monitoring.&lt;/strong&gt; This isn't optional: a stale or inconsistent graph produces worse answers than no graph at all. Organizations running production knowledge graphs typically have dedicated data engineering pipelines, not just an extraction script that runs once.&lt;/p&gt;
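Entity resolution is the piece teams most often underestimate. Even a crude normalization pass, sketched below with invented names, shows the shape of the problem; production systems layer embeddings, alias tables, and human review on top of anything this naive.

```python
def normalize(name):
    """Crude canonical key: lowercase and strip non-alphanumerics.
    Real entity resolution also uses embeddings, known aliases,
    and human review, since normalization alone over-merges."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def resolve_entities(names):
    """Group surface forms that collapse to the same canonical key,
    i.e. candidate duplicate nodes to merge in the graph."""
    canonical = {}
    for name in names:
        canonical.setdefault(normalize(name), []).append(name)
    return canonical

groups = resolve_entities(["OpenAI", "Open AI", "open-ai", "Neo4j"])
# {'openai': ['OpenAI', 'Open AI', 'open-ai'], 'neo4j': ['Neo4j']}
```

Every group with more than one surface form is a merge decision, and every merge decision can cascade into relationship updates across the graph.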

&lt;h2&gt;
  
  
  3. Query Complexity Is Significantly Higher
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vector RAG retrieval is fast and conceptually simple&lt;/strong&gt;: &lt;strong&gt;embed&lt;/strong&gt; the &lt;strong&gt;query&lt;/strong&gt;, run approximate &lt;strong&gt;nearest neighbor search&lt;/strong&gt;, return the &lt;strong&gt;top-k chunks&lt;/strong&gt;. The main failure mode is retrieving the wrong chunks, &lt;strong&gt;which you address by improving chunking, embeddings, and reranking.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG retrieval involves:&lt;/strong&gt; identifying &lt;strong&gt;entities&lt;/strong&gt; in the &lt;strong&gt;query&lt;/strong&gt;, &lt;strong&gt;traversing&lt;/strong&gt; the graph, selecting relevant subgraphs, managing traversal depth (too shallow and you miss context, too deep and you hit subgraph explosion), translating graph results into LLM-consumable text, and often generating structured queries in &lt;strong&gt;Cypher&lt;/strong&gt; or &lt;strong&gt;SPARQL&lt;/strong&gt;. Each step introduces new failure modes, and a single error (the entity linker fails to identify a key node, the traversal goes in the wrong direction) can cascade into a wrong answer even if the graph itself is perfectly accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. LLMs Are Not Graph Native Models
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is a foundational point that's easy to underestimate.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; are trained on &lt;strong&gt;sequences of tokens&lt;/strong&gt;. They're &lt;strong&gt;extraordinarily good at language&lt;/strong&gt;, context, and reasoning over text. They're not naturally good at topological reasoning, deep multi-hop graph traversal, or understanding complex graph structure. As graph complexity increases (more hops, more nodes, more relationship types), LLM performance can degrade unless the &lt;strong&gt;graph-to-text translation is carefully designed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why active research exists on Graph Neural Networks (GNNs), Knowledge Graph Embeddings (like TransE, RotatE, ComplEx), and specialized graph reasoning models that can work alongside LLMs: language models alone aren't sufficient for the hardest graph reasoning tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Subgraph Explosion Is a Real Production Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;As your graph grows, so does the number of paths between any two nodes.&lt;/strong&gt; A query that seems simple, "What does this organization depend on?", can trigger traversal over thousands of candidate subgraphs if the graph is dense. Without careful traversal bounds, relevance scoring, and pruning strategies, retrieval latency can blow past acceptable thresholds. Large-scale industrial knowledge graphs at companies like Google and Amazon contain billions of entities and trillions of relationships, and efficient retrieval over those structures requires specialized infrastructure, not just a graph database with default settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use GraphRAG (and When Not To)
&lt;/h2&gt;

&lt;p&gt;Given the &lt;strong&gt;costs&lt;/strong&gt; and &lt;strong&gt;complexity&lt;/strong&gt;, &lt;strong&gt;GraphRAG&lt;/strong&gt; deserves a clear deployment framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use GraphRAG when:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationships are the core question&lt;/strong&gt;. If users routinely ask about dependencies, hierarchies, networks, chains of causation, or multi-hop connections, and your current RAG system struggles with these, a graph likely adds genuine value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your domain has natural graph structure.&lt;/strong&gt; Biomedical research (gene-protein-disease networks), legal precedent analysis, financial transaction monitoring, &lt;strong&gt;supply chain management, security incident investigation: these domains are inherently relational, and graph structure captures meaning that flat text loses.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-hop reasoning is required.&lt;/strong&gt; "What companies did the CTO previously work at, and what products were they responsible for?" requires following a chain of relationships across entities. RAG retrieves disconnected chunks; a graph traverses the chain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global sense-making matters.&lt;/strong&gt; The Microsoft GraphRAG research showed particular strength in tasks that require understanding themes, patterns, and relationships across an entire corpus: summarization tasks where no single document contains the answer. Standard RAG performs poorly on these.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stick with RAG when:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text retrieval is the actual problem.&lt;/strong&gt; If users are asking questions that can be answered by finding the right paragraph (policy lookup, document Q&amp;amp;A, manual search), RAG is often simpler, cheaper, and more maintainable. Don't add complexity for problems that don't require it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your data changes rapidly.&lt;/strong&gt; &lt;strong&gt;Fast-moving data makes graph maintenance expensive. A vector index is much easier to keep current.&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agents&lt;/strong&gt; can resolve the &lt;strong&gt;gap dynamically&lt;/strong&gt;. More on this shortly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You're early in your AI journey.&lt;/strong&gt; Get RAG right first. Chunking, embeddings, metadata filtering, reranking, and permissions are complex enough. Adding graph infrastructure before validating the core product is usually premature.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Then Came Agents, Changing the Game Again
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9ey3dny1ggvdchtkvce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9ey3dny1ggvdchtkvce.png" alt=" " width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;While teams were deep in RAG vs. GraphRAG debates&lt;/strong&gt;, &lt;strong&gt;agentic AI was quietly shifting the entire premise.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An agent isn't a retriever&lt;/strong&gt;. It's a &lt;strong&gt;reasoning&lt;/strong&gt; and &lt;strong&gt;orchestration&lt;/strong&gt; &lt;strong&gt;layer&lt;/strong&gt; that can choose &lt;strong&gt;tools&lt;/strong&gt;, &lt;strong&gt;call&lt;/strong&gt; &lt;strong&gt;APIs&lt;/strong&gt;, &lt;strong&gt;query&lt;/strong&gt; &lt;strong&gt;databases&lt;/strong&gt;, &lt;strong&gt;write&lt;/strong&gt; and &lt;strong&gt;execute&lt;/strong&gt; code, &lt;strong&gt;maintain state across steps&lt;/strong&gt;, and decide what to do next based on intermediate results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This changes the architectural question fundamentally.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG assumes that you should structure knowledge in advance so you can traverse it later.&lt;/strong&gt; The entire value proposition is &lt;strong&gt;precomputed structure&lt;/strong&gt; available at retrieval time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents introduce a different possibility:&lt;/strong&gt; &lt;strong&gt;maybe we don't need&lt;/strong&gt; to &lt;strong&gt;precompute&lt;/strong&gt; every relationship if the system can discover and assemble relevant context dynamically at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consider what an Agent can do in a single reasoning flow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt; a relational &lt;strong&gt;database&lt;/strong&gt; for organizational structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt; a &lt;strong&gt;vector&lt;/strong&gt; index for relevant documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call&lt;/strong&gt; an internal &lt;strong&gt;API&lt;/strong&gt; for live financial data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; code to analyze a dataset&lt;/li&gt;
&lt;li&gt;Synthesize all of it into a &lt;strong&gt;coherent answer&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some cases, that dynamic composition can substitute for a prebuilt knowledge graph especially when the relationships are discoverable from authoritative source systems rather than needing to be extracted and stored separately.&lt;/p&gt;
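&lt;p&gt;The dynamic composition described above can be sketched in a few lines. Every data source here is a hypothetical stand-in, and in a real system an LLM would synthesize the assembled context into an answer:&lt;/p&gt;

```python
# A toy sketch of dynamic context assembly in a single reasoning flow.
# All data sources below are invented stand-ins, not a real agent framework.

def query_org_db(name):
    # Stand-in for a relational query: who does this person report to?
    org = {"alice": "bob", "bob": "carol"}
    return org.get(name)

def vector_search(question):
    # Stand-in for a vector index lookup returning relevant documents.
    docs = {"revenue": "Q3 revenue grew 12% year over year."}
    return [text for key, text in docs.items() if key in question]

def call_finance_api():
    # Stand-in for a live internal API call.
    return {"q3_revenue_usd": 4_200_000}

def answer(question):
    """Assemble context from several sources at runtime."""
    context = {
        "manager_of_alice": query_org_db("alice"),
        "documents": vector_search(question),
        "live_data": call_finance_api(),
    }
    # A real agent would hand this context to an LLM; here we just return it.
    return context

result = answer("how did revenue change?")
```

&lt;p&gt;Notice that no relationship was precomputed: the org chart, the documents, and the live numbers were fetched from their authoritative sources on demand.&lt;/p&gt;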

&lt;h2&gt;
  
  
  Major Agentic Frameworks in Production
&lt;/h2&gt;

&lt;p&gt;Several frameworks have emerged to support this style of architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt; (from LangChain) provides a &lt;strong&gt;graph-based state machine&lt;/strong&gt; for building multi-step agent workflows with explicit control flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen&lt;/strong&gt; (Microsoft) enables multi-agent conversations where specialized agents collaborate on complex tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Agent Framework&lt;/strong&gt; combines AutoGen and Semantic Kernel into a new agentic framework for building multi-agent systems and workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt; focuses on role-based multi-agent systems for structured workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock Agents and Google Vertex AI Agents&lt;/strong&gt; offer managed agentic infrastructure at cloud scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;These frameworks don't replace retrieval; they orchestrate it.&lt;/strong&gt; An agent using LangGraph might invoke a vector search tool for semantic lookup, a graph query tool for relationship traversal, &lt;strong&gt;a SQL tool for structured data, and a web search tool for current information, all within a single reasoning chain.&lt;/strong&gt;&lt;/p&gt;
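&lt;p&gt;The orchestration idea can be illustrated with a toy router. The tools and keyword heuristic below are invented stand-ins; in a production agent the model itself decides which tool to call:&lt;/p&gt;

```python
# A minimal tool-routing sketch. Frameworks like LangGraph express this as a
# state machine; the tools and the routing heuristic here are hypothetical.

def sql_tool(q): return "rows for: " + q
def vector_tool(q): return "similar docs for: " + q
def web_tool(q): return "live results for: " + q

TOOLS = {
    "structured": sql_tool,   # counts, joins, exact records
    "semantic": vector_tool,  # fuzzy meaning-based lookup
    "current": web_tool,      # information newer than any index
}

def route(question):
    """Pick a tool from crude keyword cues (an LLM would decide in practice)."""
    if any(w in question for w in ("how many", "count", "total")):
        return "structured"
    if any(w in question for w in ("latest", "today", "news")):
        return "current"
    return "semantic"

def run(question):
    tool_name = route(question)
    return tool_name, TOOLS[tool_name](question)
```

&lt;p&gt;The retrieval systems stay exactly as they were; the new layer is only the decision of which one to use for a given question.&lt;/p&gt;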

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzjuuwtcy5exc9ngjy64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzjuuwtcy5exc9ngjy64.png" alt=" " width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Future: Composition, Not Competition
&lt;/h2&gt;

&lt;p&gt;The industry loves a clean narrative. RAG is dead. GraphRAG wins. Agents replace everything.&lt;/p&gt;

&lt;p&gt;None of that is how it actually plays out in production systems.&lt;br&gt;
What we're seeing in Microsoft's research, in enterprise AI deployments, and in the emerging architecture patterns at companies like Uber, Airbnb, and LinkedIn is convergence toward hybrid, layered systems where each approach plays to its strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  The simplest mental model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7esbljxn8md1b861nc6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7esbljxn8md1b861nc6q.png" alt=" " width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Or more concisely: RAG finds information. GraphRAG finds connections. Agents decide how to use both.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The future isn't choosing one acronym over another.&lt;/strong&gt; It's building systems smart enough to know when &lt;strong&gt;each approach applies&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A Practical Decision Framework for Teams Building AI Systems Today&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d3uhyhci0ncs2jpz43o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d3uhyhci0ncs2jpz43o.png" alt=" " width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most teams don't fail because they chose the wrong technology.&lt;/strong&gt; They &lt;strong&gt;fail because they never got clear on what they were actually trying to fix&lt;/strong&gt;. A few honest questions asked early can save months of over-engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with the failure, not the solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself: what is actually going wrong right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If users are saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;answer is incorrect&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It didn't &lt;strong&gt;pick the right document&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's a RAG quality problem, not a graph problem&lt;/strong&gt;. Fix the fundamentals first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better&lt;/strong&gt; &lt;strong&gt;chunking&lt;/strong&gt; strategies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher quality embeddings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stronger &lt;strong&gt;reranking&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But if users are saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn't &lt;strong&gt;understand how things are connected&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It misses &lt;strong&gt;relationships between entities&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's a structural gap&lt;/strong&gt;. That's where &lt;strong&gt;graphs start making sense&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not every domain is a graph domain&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some domains are naturally relational: relationships aren't optional, they're the system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drug interactions in healthcare&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational&lt;/strong&gt; hierarchies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal&lt;/strong&gt; precedents&lt;/li&gt;
&lt;li&gt;Financial dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply&lt;/strong&gt; &lt;strong&gt;chain&lt;/strong&gt; networks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many common applications are not like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Policy lookup systems&lt;/li&gt;
&lt;li&gt;Internal copilots&lt;/li&gt;
&lt;li&gt;Knowledge assistants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these, well-built RAG is often more than enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be honest about what maintenance actually costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A knowledge &lt;strong&gt;graph is not a one-time build.&lt;/strong&gt; It's a living system that requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous entity resolution&lt;/li&gt;
&lt;li&gt;Relationship validation&lt;/li&gt;
&lt;li&gt;Ongoing extraction pipelines&lt;/li&gt;
&lt;li&gt;Schema evolution as data changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the ownership isn't there to sustain this, the graph will drift from reality, and once users lose trust, no architecture can win it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sometimes the bottleneck isn't retrieval at all&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your system needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work across multiple data sources&lt;/li&gt;
&lt;li&gt;Call APIs dynamically&lt;/li&gt;
&lt;li&gt;Adapt based on intermediate results&lt;/li&gt;
&lt;li&gt;Execute multi-step reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the RAG vs. graph debate is beside the point. Your bottleneck is orchestration, and that's where agentic architectures deliver the most value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start simple. Evolve with evidence, not assumptions.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a clean, well implemented RAG pipeline&lt;/li&gt;
&lt;li&gt;Observe where it fails in real usage&lt;/li&gt;
&lt;li&gt;Then decide: does this failure require relationships (Graph) or coordination (Agents)?&lt;/li&gt;
&lt;/ul&gt;
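&lt;p&gt;That triage can be captured as a rough heuristic. The symptom labels and recommendations below are illustrative, not a standard taxonomy:&lt;/p&gt;

```python
# A hedged sketch of the decision framework above: map observed failure
# symptoms to the layer most likely at fault. Labels are invented.

def diagnose(symptoms):
    """Return which investment the evidence points to, checking the
    simplest explanation (retrieval quality) first."""
    if symptoms & {"wrong_document", "incorrect_answer"}:
        return "improve RAG fundamentals (chunking, embeddings, reranking)"
    if symptoms & {"missed_relationship", "no_entity_connections"}:
        return "consider a knowledge graph"
    if symptoms & {"multi_source", "multi_step", "needs_live_apis"}:
        return "invest in agentic orchestration"
    return "collect more evidence before changing the architecture"
```

&lt;p&gt;The ordering is deliberate: retrieval-quality fixes are checked before graphs or agents, matching the "start simple, evolve with evidence" advice.&lt;/p&gt;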

&lt;p&gt;&lt;strong&gt;Not trends. Not what worked for another team. Actual evidence from your system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You don't start with GraphRAG. You earn your way into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Edge, D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research.&lt;/li&gt;
&lt;li&gt;Mallen, A. et al. (2023). When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL 2023.&lt;/li&gt;
&lt;li&gt;Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Stanford University.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The next time someone declares a technology dead,&lt;/strong&gt; look closer: chances are it's just being absorbed into something bigger. The most resilient AI systems aren't built on a single winning bet. They're built on clarity: knowing what problem you're solving, what tool solves it best, and how to compose them intelligently when complexity demands it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG finds. Graphs connect. Agents reason. None of them wins alone&lt;/strong&gt;, but together, in the right architecture, they form something greater than the sum of their parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The engineers who will build the most capable systems aren't the ones chasing the newest headline. They're the ones who resist the hype cycle long enough to ask the harder question:&lt;/strong&gt; not "what's the best technology?" but "what does my problem actually need?"&lt;/p&gt;

&lt;p&gt;That discipline, matching tools to problems rather than problems to tools, is what separates trend followers from system builders.&lt;br&gt;
In a field that reinvents itself every six months, that kind of thinking isn't just useful.&lt;/p&gt;

&lt;p&gt;It's the only thing that ages well. And finally:&lt;/p&gt;

&lt;p&gt;The goal was never just to &lt;strong&gt;retrieve text&lt;/strong&gt;. The goal is to help &lt;strong&gt;systems understand, connect, and use knowledge&lt;/strong&gt; in a way that actually supports reasoning. We're getting closer and the path runs through all of these ideas at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>What are Pre-Trained Models, Fine-Tuning, RAG, and Prompt Engineering? A Simple Kitchen Guide</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sat, 11 Apr 2026 02:03:29 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/what-are-pre-trained-models-fine-tuning-rag-and-prompt-engineering-a-simple-kitchen-guide-594b</link>
      <guid>https://future.forem.com/sreeni5018/what-are-pre-trained-models-fine-tuning-rag-and-prompt-engineering-a-simple-kitchen-guide-594b</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explained Using Food The Analogy That Finally Makes It Click&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve been asked the same question a thousand times. It comes from &lt;strong&gt;senior engineers moving into AI&lt;/strong&gt;. It comes from &lt;strong&gt;product managers&lt;/strong&gt; in architecture reviews. It comes from &lt;em&gt;founders building their first AI product&lt;/em&gt;. And it always sounds like some version of this:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question I Hear Every Week
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“When should I &lt;strong&gt;fine-tune instead of just prompting better?&lt;/strong&gt;”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“&lt;strong&gt;What exactly is RAG&lt;/strong&gt;, and is it better than fine-tuning?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“I keep hearing about &lt;strong&gt;pre-trained models&lt;/strong&gt;; what does pre-trained actually mean in practice?”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;By the end of this blog&lt;/strong&gt;, you’ll be able to explain these three techniques to any &lt;strong&gt;colleague, technical&lt;/strong&gt; or &lt;strong&gt;non-technical&lt;/strong&gt;, in under two minutes. More importantly, you’ll know exactly which one to reach for in your own work.&lt;/p&gt;

&lt;p&gt;So I tried something different. &lt;strong&gt;I used food&lt;/strong&gt;. &lt;strong&gt;And it worked better than anything else I’ve tried.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's EAT
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ul12rl1j3ja101ot339.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ul12rl1j3ja101ot339.png" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. &lt;strong&gt;🧊  Pre-Trained Model = Frozen Food&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Walk into any &lt;strong&gt;supermarket&lt;/strong&gt; and pick up a &lt;strong&gt;bag of frozen pasta from the freezer section&lt;/strong&gt;. A factory produced it using industrial equipment, professional chefs, tested recipes, &lt;strong&gt;and enormous quantities of ingredients, all before you arrived&lt;/strong&gt;. &lt;em&gt;You don't know every detail of how it was made. But you trust it&lt;/em&gt;, it works reliably, and you can have a meal in ten minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is a pre-trained model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Companies like &lt;strong&gt;Anthropic&lt;/strong&gt;, &lt;strong&gt;OpenAI&lt;/strong&gt;, &lt;strong&gt;Google&lt;/strong&gt;, and &lt;strong&gt;Meta&lt;/strong&gt; spend &lt;strong&gt;hundreds of millions of dollars training these models on internet-scale data: billions of web pages, books, code repositories, scientific papers, and conversations&lt;/strong&gt;. The result is a model that already understands language, can write and &lt;strong&gt;debug code&lt;/strong&gt;, reason through complex problems, &lt;strong&gt;translate between languages&lt;/strong&gt;, &lt;strong&gt;summarize&lt;/strong&gt; documents, and &lt;strong&gt;answer questions&lt;/strong&gt; across hundreds of domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Industrial Scale Behind That Frozen Bag&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 was trained on over 1 trillion tokens of text, roughly 750 billion words.&lt;/li&gt;
&lt;li&gt;Meta's open-source Llama 3 was trained on 15 trillion tokens.&lt;/li&gt;
&lt;li&gt;Training a frontier model requires thousands of specialized GPUs running for weeks.&lt;/li&gt;
&lt;li&gt;The compute cost alone can exceed $50–100 million USD for a single training run.&lt;/li&gt;
&lt;li&gt;This is why 99% of developers never train from scratch. They start from a pre-trained base and work from there.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;So what do you actually do with frozen food?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You heat it and eat it&lt;/strong&gt;. That is the whole job. In AI terms, &lt;strong&gt;this means prompt engineering&lt;/strong&gt;: the craft of writing instructions that get the best possible output from the model without changing a single internal setting. Techniques like &lt;strong&gt;chain-of-thought prompting&lt;/strong&gt;, &lt;strong&gt;few-shot examples&lt;/strong&gt;, &lt;strong&gt;system instructions&lt;/strong&gt;, and &lt;strong&gt;temperature control&lt;/strong&gt; are all just &lt;strong&gt;different ways of heating&lt;/strong&gt; the food more skillfully.&lt;/p&gt;

&lt;p&gt;A well-written prompt can unlock reasoning capabilities that seem almost magical. And the important thing to understand is: you are not changing the model. You are changing the conversation you are having with it.&lt;/p&gt;
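&lt;p&gt;Few-shot prompting, one of the "heating" techniques mentioned above, is literally just text assembly. This sketch shows the shape of such a prompt; the task and examples are invented:&lt;/p&gt;

```python
# A small sketch of few-shot prompt construction: the model is unchanged;
# only the text we send changes.

def build_prompt(task, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    lines = [task, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_prompt(
    "Classify the sentiment as positive or negative.",
    [("Great service!", "positive"), ("Never again.", "negative")],
    "The pasta was wonderful.",
)
```

&lt;p&gt;The worked examples teach the format in-context; the trailing "Output:" invites the model to complete the pattern.&lt;/p&gt;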

&lt;h2&gt;
  
  
  &lt;strong&gt;Use the pre-trained model as is when…&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;General intelligence is enough for tasks like &lt;strong&gt;summarizing&lt;/strong&gt;, &lt;strong&gt;Q&amp;amp;A&lt;/strong&gt;, writing, &lt;strong&gt;code generation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You are prototyping or proving a concept and need speed over perfection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget&lt;/strong&gt; is a constraint: no training pipeline needed, just an API call&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;task doesn't require specialized private knowledge&lt;/strong&gt; or consistent brand behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Here's the transition most people miss:&lt;/strong&gt; The &lt;strong&gt;frozen food is brilliant for a quick&lt;/strong&gt;, &lt;strong&gt;satisfying meal&lt;/strong&gt;. But what if the default &lt;strong&gt;flavor doesn't taste like you?&lt;/strong&gt; &lt;strong&gt;What if your guests expect something that reflects your kitchen, your brand, your domain? That's when you reach for the seasoning.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. &lt;strong&gt;🌶️  Fine-Tuning = Adding Your Own Seasoning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You have your bag of frozen pasta&lt;/strong&gt;. But before you serve it, you add your &lt;strong&gt;own chilli oil&lt;/strong&gt;, &lt;strong&gt;roasted garlic&lt;/strong&gt;, &lt;strong&gt;fresh herbs&lt;/strong&gt;, and a &lt;strong&gt;squeeze&lt;/strong&gt; of &lt;strong&gt;lemon&lt;/strong&gt;. The pasta itself is still the same factory product. The base structure is completely intact. But now it tastes like your &lt;strong&gt;pasta, your kitchen's signature&lt;/strong&gt;. Anyone who has eaten at your table before would recognize it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuning works exactly like this.
&lt;/h2&gt;

&lt;p&gt;You take a pre-trained model and continue training it on a smaller, &lt;strong&gt;carefully curated dataset of your own&lt;/strong&gt;. You are not rebuilding from scratch; you start from those existing weight settings and nudge them in the direction you need. Think of it as turning dozens of those dials a few degrees, rather than starting from zero. &lt;strong&gt;The broad intelligence the model already has is preserved. What changes is how it behaves specifically for you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Fine-tuning changes how the model behaves.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;It does not change what the model knows.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This distinction matters enormously and it's where teams go wrong.&lt;/strong&gt; If your legal AI product needs to produce documents in the exact format your senior partners expect, fine-tune. But if your product needs to answer questions about a case filed last Tuesday, fine-tuning won't help. That filed case isn't in the training data. That's RAG's job — we'll get there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fine-tuning actually looks like in practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You collect hundreds or thousands of example input–output pairs that demonstrate the exact behavior you want. For a medical coding assistant, that might be clinical notes paired with correct ICD-10 billing codes. For a brand voice bot, it might be customer messages paired with ideal responses in your company's tone. This dataset is fed into the training process and the model updates its weights to match your examples. The process typically costs hundreds to thousands of dollars in &lt;strong&gt;GPU compute, takes hours to days&lt;/strong&gt; depending on scale, and requires careful evaluation before you deploy.&lt;/p&gt;
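&lt;p&gt;The dataset-preparation step might look roughly like this. The JSONL chat format shown is a common convention but varies by provider, and the clinical examples are invented:&lt;/p&gt;

```python
# A sketch of preparing a fine-tuning dataset as JSONL: one input-output
# pair per line. Field names follow a common chat-style convention; check
# your provider's documentation for the exact schema.

import json

pairs = [
    ("Patient reports persistent cough for 3 weeks.", "R05.3"),
    ("Type 2 diabetes without complications.", "E11.9"),
]

def to_jsonl(pairs):
    records = []
    for note, code in pairs:
        records.append(json.dumps({
            "messages": [
                {"role": "user", "content": note},
                {"role": "assistant", "content": code},
            ]
        }))
    return "\n".join(records)

dataset = to_jsonl(pairs)
```

&lt;p&gt;Real datasets need hundreds to thousands of such pairs, plus a held-out evaluation set to verify the tuned behavior before deployment.&lt;/p&gt;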

&lt;h2&gt;
  
  
  &lt;strong&gt;Fine-tuning adjusts these things well&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tone and writing style: formal, clinical, conversational, legal, brand-specific&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output structure: consistent JSON schemas&lt;/strong&gt;, report templates, specific formatting rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain vocabulary: medical codes&lt;/strong&gt;, &lt;strong&gt;legal terminology&lt;/strong&gt;, internal product names and systems&lt;/li&gt;
&lt;li&gt;Default response behavior: how the model handles edge cases and ambiguous inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt efficiency&lt;/strong&gt;: a fine-tuned model often needs shorter system prompts, saving cost at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Fine-tuning cannot do these things, so don't ask it to&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Update the model's knowledge of world events; its understanding is frozen at training time&lt;/li&gt;
&lt;li&gt;Give the model access to your private documents at query time — that is RAG&lt;/li&gt;
&lt;li&gt;Prevent hallucination on specific facts; a fine-tuned model still makes things up&lt;/li&gt;
&lt;li&gt;Replace re-training when your data changes; you must re-fine-tune, which is expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real companies using fine-tuning today&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt; is built on models fine-tuned on billions of lines of public code; that's why it produces completions that match common coding patterns and library conventions far better than a general-purpose model would.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harvey AI&lt;/strong&gt; fine-tunes on legal documents and case law so that it consistently produces output matching the precise language, structure, and citation style that lawyers expect from a junior associate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Med-PaLM 2 (Google)&lt;/strong&gt; is fine-tuned specifically on medical question-answer pairs, reaching expert-level performance on US Medical Licensing Examination questions, a benchmark on which a general-purpose model performs far worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use fine-tuning when…&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The model's default tone or output format doesn't fit your use case&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;You have hundreds or thousands of high-quality labelled examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency in style&lt;/strong&gt; and &lt;strong&gt;format matters&lt;/strong&gt; more than freshness of knowledge&lt;/li&gt;
&lt;li&gt;You are making thousands of &lt;strong&gt;API calls daily and need to reduce prompt length for cost&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The seasoning has done its job. Your dish now has a recognizable identity.&lt;/strong&gt; But there is still one problem that no amount of seasoning can solve: the frozen pasta was made months ago. What happens when your customer asks a question about something that happened last week? What happens when they need an answer based on your private internal documents that have never been part of any training dataset? For that, you need fresh ingredients, and that's &lt;strong&gt;where RAG completely changes the game&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. 🥗  RAG = Serving Fresh Side Dishes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before I explain RAG&lt;/strong&gt;, I need to explain &lt;strong&gt;the problem it solves because once you understand the problem, the solution becomes completely obvious.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hallucination problem and why it matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language models hallucinate.&lt;/strong&gt; &lt;strong&gt;That is not a bug that will eventually be fixed&lt;/strong&gt;. &lt;strong&gt;It is a fundamental property of how they work.&lt;/strong&gt; When a model is asked a question it cannot confidently answer from its training data (an event that happened last month, &lt;strong&gt;a number from your private database, a policy you updated last quarter&lt;/strong&gt;), it &lt;strong&gt;does not say 'I don't know.'&lt;/strong&gt; &lt;strong&gt;It produces a fluent, confident, completely fabricated answer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A hallucinated answer looks exactly like a correct one. &lt;strong&gt;Same tone, same confidence, same formatting. A model will tell you that a law was passed on a specific date&lt;/strong&gt;, that a case was decided a certain way, that a product specification has specific numbers and be entirely wrong. For consumer chatbots, this is annoying. In healthcare, legal, financial, and compliance contexts, it can be catastrophic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why models hallucinate in plain English&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A language model's job is to predict the most statistically likely next word or sentence given the context.&lt;/strong&gt; When the correct answer isn't in its training data, it doesn't have an 'I don't know' mode; it has only a 'generate the most plausible continuation' mode. The result is confident-sounding fabrication.&lt;/p&gt;

&lt;p&gt;This is not fixable by making the model bigger or training it longer. The only reliable solution is to give it the correct information as context at query time, which is exactly what RAG does.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Now: what is RAG?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Back to the kitchen.&lt;/strong&gt; &lt;strong&gt;You heat your frozen pasta the pasta itself is completely unchanged&lt;/strong&gt;. But tonight you serve it alongside &lt;strong&gt;a fresh caprese salad made this morning&lt;/strong&gt;, &lt;strong&gt;warm garlic bread just out of the oven, and a sauce from tomatoes picked an hour ago&lt;/strong&gt;. The pasta is still the factory's pasta. But the meal is elevated, current, and specific to tonight because you brought &lt;strong&gt;real, live ingredients to the table.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is &lt;strong&gt;R&lt;/strong&gt;etrieval-&lt;strong&gt;A&lt;/strong&gt;ugmented &lt;strong&gt;G&lt;/strong&gt;eneration (RAG). The model is not changed. Instead, at the exact moment someone asks a question, your system fetches relevant, &lt;strong&gt;up-to-date information from an external source (your documents, your database, your internal knowledge base) and places that information into the model's context window before asking it to answer.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a 'context window'?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Think of the &lt;strong&gt;context window&lt;/strong&gt; as the &lt;strong&gt;model's short-term memory (STM)&lt;/strong&gt;: everything it can see and reason about in a single conversation. It has a fixed size. When we do RAG, we use part of that window to inject the retrieved documents, essentially saying: 'Here is what you need to know to answer this question accurately. Now answer it.' The model reasons over both its trained knowledge and the fresh material we just handed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The RAG pipeline step by step&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is exactly what happens behind the scenes every time a RAG-enabled system answers a question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 6 steps of a RAG pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user submits a question, for example: 'What is our current parental leave policy?'&lt;/li&gt;
&lt;li&gt;The system converts that question into a vector embedding: a list of numbers representing its meaning in mathematical space.&lt;/li&gt;
&lt;li&gt;A similarity search runs against a &lt;strong&gt;vector database&lt;/strong&gt; (&lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Weaviate&lt;/strong&gt;, &lt;strong&gt;ChromaDB&lt;/strong&gt;, &lt;strong&gt;pgvector&lt;/strong&gt;, &lt;strong&gt;OpenSearch&lt;/strong&gt;) and retrieves the document chunks that are mathematically closest in meaning to the question.&lt;/li&gt;
&lt;li&gt;In some systems, a &lt;strong&gt;re-ranker&lt;/strong&gt; then scores these chunks by relevance and selects the best ones.&lt;/li&gt;
&lt;li&gt;Those &lt;strong&gt;chunks&lt;/strong&gt; are &lt;strong&gt;injected into the model's context window&lt;/strong&gt; alongside the original question: 'Here is relevant information. Using only this, answer the question accurately.'&lt;/li&gt;
&lt;li&gt;The model generates a response that is grounded in the retrieved content, not in its training memory, and can cite the source document by name.&lt;/li&gt;
&lt;/ul&gt;
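&lt;p&gt;The steps above can be compressed into a toy end-to-end sketch. The bag-of-words "embedding" stands in for a learned embedding model, and the final generation step is replaced by returning the grounded prompt:&lt;/p&gt;

```python
# A toy RAG pipeline: embed, retrieve by cosine similarity, assemble a
# grounded prompt. Real systems use learned embeddings and a vector DB.

from collections import Counter
import math

def embed(text):
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CHUNKS = [
    "Parental leave policy: 16 weeks paid leave for all employees.",
    "Expense policy: submit receipts within 30 days.",
]

def retrieve(question, k=1):
    # Rank chunks by similarity to the question; keep the top k.
    scored = sorted(CHUNKS, key=lambda c: cosine(embed(question), embed(c)),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(question):
    context = "\n".join(retrieve(question))
    return (f"Here is relevant information:\n{context}\n\n"
            f"Using only this, answer: {question}")

prompt = build_grounded_prompt("What is our current parental leave policy?")
```

&lt;p&gt;Everything that makes production RAG hard (chunking, embedding quality, re-ranking) lives inside &lt;code&gt;embed&lt;/code&gt; and &lt;code&gt;retrieve&lt;/code&gt;; the model only ever sees the final grounded prompt.&lt;/p&gt;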

&lt;p&gt;&lt;strong&gt;The key thing to understand:&lt;/strong&gt; the quality of your answers in a RAG system depends almost entirely on the quality of your retrieval. Naive RAG, simply dumping documents into a vector database and hoping, produces mediocre results at scale. Production RAG is an engineering discipline: thoughtful chunking strategies, the right embedding model, tuned retrieval parameters, and &lt;strong&gt;post-retrieval re-ranking. The model is the least of your concerns.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real companies using RAG today&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notion AI&lt;/strong&gt; uses RAG to let users ask questions about their own workspace content. The model has no idea what is in your Notion pages until the RAG pipeline retrieves and injects the relevant pages at query time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perplexity AI&lt;/strong&gt; is essentially a RAG system at its core: it retrieves live web pages and uses a language model to synthesize an answer with citations. No fine-tuning required for the freshness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal and compliance tools&lt;/strong&gt; at enterprise firms use RAG to answer questions about thousands of private contracts, regulations, and precedents, data that can never be used in training because of sensitivity and confidentiality requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use RAG when…&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data changes frequently: products, &lt;strong&gt;prices, policies, news, regulations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your data is &lt;strong&gt;private&lt;/strong&gt; or &lt;strong&gt;sensitive&lt;/strong&gt; and cannot be part of a training pipeline&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You need answers to be accurate and traceable; citations matter&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Updating knowledge should not require retraining; just update the database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You need to eliminate hallucination on specific factual questions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fresh side dishes served. The meal is extraordinary.&lt;/strong&gt; But here's the thing the best chefs know: a three-course meal beats any single dish. &lt;strong&gt;The future of enterprise AI is not pre-trained or fine-tuned or RAG. It's all three, deliberately layered, which is what we'll look at next.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Kitchen: When You Need All Three
&lt;/h2&gt;

&lt;p&gt;The most powerful AI products in production today combine all three techniques. The food analogy holds perfectly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A real enterprise AI assistant: all three layers working together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The frozen food (pre-trained model):&lt;/strong&gt; &lt;strong&gt;GPT-4o&lt;/strong&gt; or &lt;strong&gt;Claude 3.5 Sonnet&lt;/strong&gt; provides the base intelligence: language understanding, reasoning, code generation. No one trains this from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: The seasoning (fine-tuning):&lt;/strong&gt; The model is fine-tuned on the company's internal communication style, product naming conventions, escalation procedures, and output formats. Now it sounds like the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: The fresh sides (RAG):&lt;/strong&gt; At query time, the system retrieves the live knowledge base: current product specs, today's pricing, this week's policy updates, this customer's order history. Now the answers are both brand-consistent and factually current.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; an assistant that always talks like your company, always knows your latest information, and never makes up facts it doesn't have. That's not a single technique. That's a kitchen running three stations at once.&lt;/p&gt;
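&lt;p&gt;The layering is easy to see in code. A minimal sketch of how the three layers meet at request time: the fine-tuned style lives in the system message, and RAG injects retrieved facts into the prompt. Function and field names here are illustrative, not any particular SDK's API.&lt;/p&gt;

```python
def build_prompt(system_style, retrieved_docs, question):
    # Layers 1-2 (base model + fine-tune) are fixed before the request is made;
    # Layer 3 (RAG) injects fresh, retrieved facts at query time.
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        f"{system_style}\n\n"
        "Answer using ONLY the facts below. If the answer is not there, say so.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )
```

&lt;p&gt;The instruction to answer only from the supplied facts is what keeps the assistant from inventing answers when retrieval comes back empty.&lt;/p&gt;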

&lt;h2&gt;
  
  
  &lt;strong&gt;How agents use all three techniques together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The pre-trained model&lt;/strong&gt; is the agent's core reasoning engine: it reads the task, &lt;strong&gt;makes decisions&lt;/strong&gt;, and generates instructions for each step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; can make the agent better at following specific agentic patterns such as tool use, &lt;strong&gt;self-reflection&lt;/strong&gt;, and &lt;strong&gt;multi-step planning&lt;/strong&gt;, so it behaves more reliably in your particular workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG&lt;/strong&gt; gives the agent access to &lt;strong&gt;live information&lt;/strong&gt; at each step: it retrieves what it needs, acts on it, &lt;strong&gt;retrieves again, acts again, so the agent always works with current data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; agents are not a fourth technique that replaces the three. They are an architecture that sits on top of all three. You cannot build a reliable agent without understanding the foundations. The kitchen analogy extends: if &lt;strong&gt;pre-trained&lt;/strong&gt; is the &lt;strong&gt;frozen food&lt;/strong&gt;, &lt;strong&gt;fine-tuning is the seasoning&lt;/strong&gt;, and &lt;strong&gt;RAG is the fresh sides&lt;/strong&gt;, then &lt;strong&gt;agents are the chef who orchestrates&lt;/strong&gt; the whole meal in real time.&lt;/p&gt;
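&lt;p&gt;The retrieve, act, retrieve-again loop can be sketched in a few lines. This is a minimal ReAct-style skeleton; the &lt;code&gt;llm&lt;/code&gt; callable and the tool registry are assumptions standing in for your model client and your real tools.&lt;/p&gt;

```python
def run_agent(task, tools, llm, max_steps=5):
    # Minimal agent loop: the model decides, a tool acts, and the observation
    # feeds the next decision. `llm` returns a dict like
    # {"action": ..., "input": ...} or {"action": "finish", "answer": ...}.
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm(history)  # pre-trained (possibly fine-tuned) reasoning
        if decision["action"] == "finish":
            return decision["answer"]
        # Tool call / retrieval step: this is where RAG supplies live data.
        observation = tools[decision["action"]](decision["input"])
        history.append(f"{decision['action']} -> {observation}")
    return None  # step budget exhausted
```

&lt;p&gt;The &lt;code&gt;max_steps&lt;/code&gt; cap is a deliberate safety rail: without it, a confused model can loop forever.&lt;/p&gt;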

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Pre-trained models give you the dish.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Fine-tuning changes the taste.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;RAG brings fresh ingredients to the table.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Agents are the chef who runs the whole kitchen.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-Tuning vs RAG vs Both
&lt;/h2&gt;

&lt;p&gt;Here is the comparison most architecture conversations need. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiam1nhxvz1mnlay7j9ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiam1nhxvz1mnlay7j9ag.png" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The next time someone on your team asks 'should we fine-tune or do RAG?' you now have the full answer&lt;/strong&gt;. Not just the technique names, but the underlying reason behind each choice, the &lt;strong&gt;tradeoffs in cost and complexity, the failure modes to avoid, and the mental model that makes all of it easy to explain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've watched engineers waste months on fine-tuning jobs they never needed. I've watched &lt;strong&gt;teams deploy naive RAG and wonder&lt;/strong&gt; why their accuracy is terrible. I've watched founders spend their first $50,000 on a &lt;strong&gt;problem that a better prompt would have solved in a day&lt;/strong&gt;. I wrote this blog because those mistakes are completely avoidable if you have the right mental model before you start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with the frozen food.&lt;/strong&gt; &lt;strong&gt;Season it when you need to&lt;/strong&gt;. &lt;strong&gt;Always bring fresh ingredients to the table.&lt;/strong&gt; And when you are ready to build something truly ambitious hire the chef to orchestrate the whole kitchen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save this blog.&lt;/strong&gt; You will want it in your next architecture conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share it with one developer on your team who is confused about these techniques.&lt;/strong&gt; The clearest gift you can give them is a mental model that sticks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every comment, share, and save tells me what to write next. I read every single one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>rag</category>
    </item>
    <item>
      <title>Q, K, V : The Three Things Every Great Tech Lead Does Without Knowing It</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 06 Apr 2026 01:58:56 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/q-k-v-the-three-things-every-great-tech-lead-does-without-knowing-it-227i</link>
      <guid>https://future.forem.com/sreeni5018/q-k-v-the-three-things-every-great-tech-lead-does-without-knowing-it-227i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve been thinking about &lt;strong&gt;transformer architecture a lot lately&lt;/strong&gt;, not just as an &lt;strong&gt;ML practitioner&lt;/strong&gt;, but as someone who has spent &lt;strong&gt;years in engineering teams&lt;/strong&gt;, watching how the best tech leads operate. And one day it just clicked: &lt;strong&gt;a great tech lead behaves almost exactly like the&lt;/strong&gt; &lt;strong&gt;self attention mechanism in a transformer.&lt;/strong&gt; Not as a loose metaphor, but as a surprisingly precise structural analogy.&lt;/p&gt;

&lt;p&gt;Bear with me. Once you see it, you can’t unsee it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick refresher on self attention
&lt;/h2&gt;

&lt;p&gt;In a &lt;strong&gt;transformer&lt;/strong&gt;, each token in a sequence needs to understand its meaning in &lt;strong&gt;&lt;em&gt;context&lt;/em&gt;&lt;/strong&gt;. It can’t do that in isolation, so instead of processing itself alone, &lt;strong&gt;it looks at every other token in the sequence&lt;/strong&gt;, decides how &lt;strong&gt;relevant each one is&lt;/strong&gt;, and creates a &lt;strong&gt;weighted blend of information&lt;/strong&gt; from the whole sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This happens through three simple projections for every token:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query (Q):&lt;/strong&gt; What am I looking for right now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key (K):&lt;/strong&gt; What does each other token offer?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Value (V):&lt;/strong&gt; What should I actually take from them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention&lt;/strong&gt;(Q, K, V) = &lt;strong&gt;softmax&lt;/strong&gt;( QKᵀ / √dₖ ) · V&lt;/p&gt;

&lt;p&gt;The output isn’t just the token’s raw embedding. It’s a &lt;strong&gt;&lt;em&gt;context-aware blend&lt;/em&gt;&lt;/strong&gt;: what this token means given everything around it. The whole is smarter than the sum of its parts.&lt;/p&gt;
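&lt;p&gt;The formula above is short enough to implement directly. A plain-Python sketch of scaled dot-product attention, using lists instead of tensors for readability:&lt;/p&gt;

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) · V.
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the weighted blend of the Value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

&lt;p&gt;With a query strongly aligned to one key, the output is almost entirely that token’s Value, yet every other Value still contributes a sliver, exactly the blending behavior described above.&lt;/p&gt;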

&lt;h2&gt;
  
  
  Now map that onto your tech lead
&lt;/h2&gt;

&lt;p&gt;A team is, in this framing, a &lt;strong&gt;sequence&lt;/strong&gt; of &lt;strong&gt;people&lt;/strong&gt;, each carrying different &lt;strong&gt;skills&lt;/strong&gt;, &lt;strong&gt;contexts&lt;/strong&gt;, and domain knowledge. The tech lead’s job is to make that sequence &lt;strong&gt;produce coherent&lt;/strong&gt;, &lt;strong&gt;high-quality output&lt;/strong&gt;. Sound familiar?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;strong&gt;tech lead doesn’t process problems one person at a time.&lt;/strong&gt; They hold the whole team in mind &lt;strong&gt;simultaneously&lt;/strong&gt;, &lt;strong&gt;weighting&lt;/strong&gt; each &lt;strong&gt;person’s input&lt;/strong&gt; by its &lt;strong&gt;relevance&lt;/strong&gt; to the &lt;strong&gt;problem at hand&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tech Lead as a Transformer: Scaling Attention in Your Team
&lt;/h2&gt;

&lt;p&gt;In the world of Large Language Models, the &lt;strong&gt;Transformer&lt;/strong&gt; architecture changed everything by mastering the art of "Attention." But the mechanics of a transformer (Queries, Keys, and Values) aren't just for silicon; they are a perfect blueprint for high-performing engineering leadership.&lt;/p&gt;

&lt;p&gt;If you want to scale your team’s impact, you have to stop managing tasks and start mastering the attention operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1syzp5o0ihn5pi22h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1syzp5o0ihn5pi22h.png" alt=" " width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcbwi2f0207rhf68os02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcbwi2f0207rhf68os02.png" alt=" " width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Q: Read the problem precisely before reacting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The principle:&lt;/strong&gt; &lt;strong&gt;Before you reach for a person&lt;/strong&gt;, you must understand the exact shape of what you need. &lt;strong&gt;A vague question finds the wrong answer&lt;/strong&gt;. A precise question finds the right person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN THE TRANSFORMER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every token generates a Query vector&lt;/strong&gt;, a precise representation of the context it is searching for. The word “&lt;strong&gt;crash&lt;/strong&gt;” needs to know if it is &lt;strong&gt;financial&lt;/strong&gt; or &lt;strong&gt;physical&lt;/strong&gt;. Its Query is asking: &lt;em&gt;“what domain am I in?”&lt;/em&gt; The word “it” needs to find its antecedent. Its Query is asking: &lt;em&gt;“who am I referring to?”&lt;/em&gt; The Query gets scored against every other token’s Key. &lt;strong&gt;The more precise the Query, the more accurately the model attends to the right context.&lt;/strong&gt; A sloppy Query means the model attends to the wrong tokens and the output degrades, no matter how good the rest of the sequence is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN YOUR TECH LEAD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s 11pm on Tuesday&lt;/strong&gt;. &lt;strong&gt;API latency has spiked to 8 seconds.&lt;/strong&gt; Alerts are firing. &lt;strong&gt;A weak tech lead fires a message to the whole channel&lt;/strong&gt; &lt;em&gt;“Hey, who can look at this?”&lt;/em&gt; That is not a Query. That is a panic broadcast: the problem has not been read at all, just forwarded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A strong tech lead takes fifteen seconds before typing anything.&lt;/strong&gt; They are reading the &lt;strong&gt;problem&lt;/strong&gt; &lt;strong&gt;precisely&lt;/strong&gt;: is this a &lt;strong&gt;database&lt;/strong&gt; write &lt;strong&gt;bottleneck&lt;/strong&gt;? A &lt;strong&gt;bad&lt;/strong&gt; &lt;strong&gt;deploy&lt;/strong&gt;? A downstream dependency choking? A traffic spike? Each of those is a different Query, and each points to a different person. Reading the problem precisely before reacting is not hesitation; it is the entire foundation of what comes next. &lt;strong&gt;Get the Query wrong and everything downstream is wasted effort.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  K: Know what each engineer truly carries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The principle:&lt;/strong&gt; &lt;strong&gt;Not their job title. Not their years of experience.&lt;/strong&gt; What they &lt;em&gt;actually&lt;/em&gt; carry right now: the specific knowledge, the lived context, the warm mental model that matches this exact problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN THE TRANSFORMER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every token generates a Key vector&lt;/strong&gt;, a representation of what it holds and can offer to others. When a Query asks &lt;em&gt;“what domain am I in?”&lt;/em&gt;, the Keys from surrounding tokens compete to answer. The attention score between two tokens is the dot product of one’s Query against the other’s Key. High alignment means high attention. Low alignment means that token fades. The Key is not the same as the Value; the Key is the advertisement that says &lt;em&gt;“I am relevant to your question.”&lt;/em&gt; What gets extracted once that match is confirmed is the Value, which we will get to next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN YOUR TECH LEAD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Query is formed:&lt;/strong&gt; looks like a write contention issue in the orders table. Now the tech lead scans the team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sreeni&lt;/strong&gt; is first online. Senior, reliable, composed under pressure. &lt;strong&gt;But his background is frontend&lt;/strong&gt;. His Key what he &lt;em&gt;truly&lt;/em&gt; carries doesn’t match this problem. High score on “reliable team member,” low score on this specific database crisis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ragavan&lt;/strong&gt; wrote the orders pipeline eighteen months ago. &lt;strong&gt;He knows every design decision, every shortcut, every known failure mode. His Key is a near perfect match for the Query&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Siva&lt;/strong&gt; debugged a nearly identical write contention issue two sprints ago. The mental model is warm. The patterns are fresh. Siva’s &lt;strong&gt;Key is both relevant &lt;em&gt;and&lt;/em&gt; current.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A tech lead who knows their team only by title pages &lt;strong&gt;Sreeni because he’s available&lt;/strong&gt;. &lt;strong&gt;A tech lead who truly knows what each engineer carries reaches for Ragavan and Siva&lt;/strong&gt;. The depth of your Key knowledge is the single biggest factor in whether your team’s intelligence gets used or wasted.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;V: Extract the exact contribution that matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The principle:&lt;/strong&gt; Finding the right person is only half the job. The other half is knowing &lt;em&gt;what to pull from them&lt;/em&gt;: the specific piece of their knowledge that solves this problem right now, not everything they know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN THE TRANSFORMER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Value vector is the real payload&lt;/strong&gt;. Once the attention scores are computed and we know how much to attend to each token, what we actually pull from them is their Value not their Key. The Key said &lt;strong&gt;&lt;em&gt;“I am relevant.”&lt;/em&gt;&lt;/strong&gt; The &lt;strong&gt;Value delivers what that relevance actually contains.&lt;/strong&gt; These are two separate learned representations and they can be very different from each other.&lt;/p&gt;

&lt;p&gt;The final output for any token is a weighted sum of the Value vectors from every token in the sequence &lt;strong&gt;including itself&lt;/strong&gt;. That is the “self” in &lt;strong&gt;self attention. High attention score means a large portion of that token’s Value flows into the output&lt;/strong&gt;. Low score means a small contribution but nothing is ever fully zeroed out. The result is a single enriched representation that carries synthesized meaning from across the whole sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN YOUR TECH LEAD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tech lead has reached Ragavan and Siva.&lt;/strong&gt; The Keys matched. Now comes the part most tech leads miss: extracting the &lt;em&gt;exact&lt;/em&gt; contribution that matters, not just getting them on a call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ragavan’s Value is specific:&lt;/strong&gt; the orders table has a known write hotspot on the status column. A nearly identical incident in 2022 was resolved by switching to a queue based write pattern. The full fix takes four hours, but there is a config level workaround that buys time right now. That is his Value vector: not his presence, not his seniority, but that precise, usable knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Siva’s Value is different:&lt;/strong&gt; a step by step diagnosis approach from the recent incident, three specific queries to run against the slow query log, and a clear hunch about which index is missing based on the pattern of the spike. &lt;strong&gt;Different from Ragavan’s. Equally specific. Equally usable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;The tech lead extracts architecture insight from Ragavan and live diagnosis steps from Siva *&lt;/em&gt; then synthesizes both into a single coherent response. Neither person alone had the full answer. The weighted combination of their two Value vectors did. &lt;em&gt;That&lt;/em&gt; is what great tech leadership actually produces.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A note for the technically precise: &lt;strong&gt;in actual self attention&lt;/strong&gt;, every token generates &lt;strong&gt;Q, K, and V simultaneously; each team member would be questioner, advertiser, and content provider all at once.&lt;/strong&gt; The analogy maps these roles onto distinct actors for clarity. That’s a deliberate simplification, and the right trade-off for a blog. The structural point holds.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Softmax: decisive, not democratic&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;After the Query Key scores are computed for every token pair&lt;/strong&gt;, a &lt;strong&gt;softmax function sharpens the distribution&lt;/strong&gt;. The &lt;strong&gt;highest&lt;/strong&gt; scoring tokens get heavily weighted. &lt;strong&gt;Lower&lt;/strong&gt; scoring ones are suppressed, not erased, but pushed toward the edges. The result is focused, purposeful attention rather than diffuse averaging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Great tech leads calibrate the same way&lt;/strong&gt;. During the incident, Ragavan and Siva carry the highest weights. Sreeni’s input on how to communicate the downtime to customers still matters and still flows into the output; he’s not ignored. But he doesn’t drive the technical response. The &lt;strong&gt;softmax&lt;/strong&gt; isn’t a veto. It’s a &lt;strong&gt;&lt;em&gt;weighting&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The ability to weight confidently without dismissing is one of the hardest skills in the role. Too much sharpening and you become a dictator. Too little and you’re running a committee. The best tech leads calibrate this by problem type, stakes, and who is genuinely best positioned to contribute right now.&lt;/p&gt;
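&lt;p&gt;The "decisive, not democratic" behavior is visible in a few lines of Python: softmax sharpens the distribution toward the top score, yet no weight ever reaches exactly zero.&lt;/p&gt;

```python
import math

def softmax(scores):
    # Subtracting the max is the standard numerical-stability trick.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

&lt;p&gt;With hypothetical relevance scores like &lt;code&gt;[3.0, 1.0, 0.2]&lt;/code&gt; for Ragavan, Siva, and Sreeni, the top score dominates the output, but the smallest weight stays positive: suppressed, never erased.&lt;/p&gt;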

&lt;h2&gt;
  
  
  &lt;strong&gt;Multi-head attention: running several concerns at once&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real transformers use multi-head attention: several independent attention operations running in parallel, each learning to track a different type of relationship in the sequence. One head catches syntactic structure. Another tracks semantic similarity. &lt;strong&gt;Another handles long range dependencies&lt;/strong&gt;. The outputs are concatenated and projected into a single unified representation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch a strong tech lead manage a major incident&lt;/strong&gt; and you’ll see exactly this. One part of their mind is tracking the technical diagnosis. Another is watching team stress levels and deciding when to rotate people off the call. Another is composing the stakeholder update due in twenty minutes. &lt;strong&gt;Another is already thinking about the post-mortem structure and what process change this incident should trigger. None of those heads switches off while the others run&lt;/strong&gt;. The incident gets resolved, the team stays functional, stakeholders are informed, and the right lesson gets captured because all four heads ran and synthesized their outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MultiHead(Q, K, V) = Concat(head₁, …, headₙ) · Wᵒ&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;head₁ = technical diagnosis, head₂ = team health &amp;amp; stress&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;head₃ = stakeholder comms, head₄ = process &amp;amp; post-mortem&lt;/strong&gt;&lt;/p&gt;
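&lt;p&gt;Structurally, multi-head attention is just "run the heads independently, concatenate, project." A stub sketch, where the heads and the projection are placeholder callables rather than learned matrices:&lt;/p&gt;

```python
def multi_head(x, heads, project):
    # Concat(head_1, ..., head_n) · W^O: every head sees the same input,
    # their outputs are concatenated, then projected back into one representation.
    concat = [value for head in heads for value in head(x)]
    return project(concat)
```

&lt;p&gt;The essential property is that no head blocks another: each runs on the full input, and only the final projection combines them.&lt;/p&gt;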

&lt;h2&gt;
  
  
  &lt;strong&gt;Why the old model fails: the RNN problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before transformers, the dominant approach was recurrent neural networks — process one token at a time, pass a hidden state forward, repeat. The problem was fundamental: information from early in the sequence degraded over time, gradients vanished on long sequences, and nothing could be parallelized. Every step depended on the last.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;command-and-control manager is an RNN&lt;/strong&gt;. Every problem routes through them serially. Context from earlier conversations gets dropped. Team throughput is capped at the manager’s personal bandwidth. In a small team this is merely inefficient. &lt;strong&gt;In a scaling organization it becomes catastrophic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tech lead who operates like self-attention doesn’t become the bottleneck.&lt;/strong&gt; They become the &lt;em&gt;context layer&lt;/em&gt;: the mechanism that helps the whole team understand the situation more clearly and move together faster. &lt;strong&gt;The team’s intelligence is the output. Not the manager’s.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;So what does a great tech lead actually look like?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;They’re the one who pauses before reacting, forming the Query before reaching for a person. They’re the one who knows that Ragavan is the right call at 11pm not because he’s available, but because he wrote the system. They’re the one who doesn’t just ping the right people, but knows exactly what to extract from each of them and how to stitch those pieces into a response no single engineer could have produced alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They run multiple heads simultaneously without dropping any. Technical diagnosis, team morale, stakeholder communication, process improvement&lt;/strong&gt; &lt;strong&gt;all running in parallel, all synthesized into a single coherent output&lt;/strong&gt;. And they do it without becoming the bottleneck, without turning every decision into a committee, and without making anyone feel unseen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is self attention. Not as a metaphor. As a description of the job.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Attention is all you need. And a tech lead who truly understands that who attends broadly, weights wisely, and synthesizes instead of dictating is everything a team needs to become more than the sum of its people.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Run Open Source AI Models with Docker Model Runner</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sun, 05 Apr 2026 01:52:02 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/run-open-source-ai-modelswith-docker-model-runner-5hei</link>
      <guid>https://future.forem.com/sreeni5018/run-open-source-ai-modelswith-docker-model-runner-5hei</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you've spent any time in &lt;strong&gt;software&lt;/strong&gt; &lt;strong&gt;development&lt;/strong&gt;, &lt;strong&gt;cloud&lt;/strong&gt; engineering, or &lt;strong&gt;microservices&lt;/strong&gt; architecture, the name &lt;strong&gt;Docker&lt;/strong&gt; needs no introduction. But for those newer to the ecosystem, here's the short version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt; is an &lt;strong&gt;open platform for developing, shipping, and running applications&lt;/strong&gt;. Its core idea is elegant: separate your application from the underlying infrastructure so you can build fast, test consistently, and deploy confidently. By standardizing how code is packaged and delivered, Docker dramatically shrinks the gap between "it works on my machine" and "it works in production."&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Docker Desktop?
&lt;/h3&gt;

&lt;p&gt;Docker Desktop takes everything Docker offers and wraps it into a single, batteries-included application for macOS, Windows, and Linux. It bundles the Docker Engine, CLI, &lt;strong&gt;Docker&lt;/strong&gt; &lt;strong&gt;Compose&lt;/strong&gt;, &lt;strong&gt;Kubernetes&lt;/strong&gt;, and a &lt;strong&gt;visual dashboard&lt;/strong&gt;, giving developers a complete container workflow without ever touching low-level OS configuration.&lt;/p&gt;

&lt;p&gt;Over the years, Docker Desktop has become the &lt;strong&gt;de facto local development environment for millions of engineers worldwide&lt;/strong&gt;. Version 4.x doubled down on AI workloads, and the latest releases ship with &lt;strong&gt;Docker Model Runner&lt;/strong&gt; as a first-class, built-in feature, accessible directly from the Docker Dashboard or the CLI you already use every day.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Docker Model Runner?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker Model Runner (DMR)&lt;/strong&gt; is an inference engine embedded directly into Docker Desktop. It lets you pull, run, and interact with open-source large language models using the same familiar &lt;code&gt;docker&lt;/code&gt; CLI: no new tools, no configuration headaches, no surprises.&lt;/p&gt;

&lt;p&gt;Under the hood, DMR uses &lt;strong&gt;llama.cpp&lt;/strong&gt; as its runtime backend, delivering high performance inference on both CPU and GPU — Metal on Apple Silicon, CUDA on Linux and Windows out of the box.&lt;/p&gt;

&lt;p&gt;Models are distributed as OCI compliant artifacts through Docker Hub's &lt;strong&gt;&lt;code&gt;ai/&lt;/code&gt; namespace&lt;/strong&gt;. That means model versioning, access control, and distribution are all handled by the same battle tested infrastructure already powering your container images.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What Docker did for application packaging, Model Runner does for AI inference one pull command, consistent behavior everywhere."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When to Use Docker Model Runner
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasaj91x60ox69l30su7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasaj91x60ox69l30su7w.png" alt=" " width="720" height="794"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works Under the Hood
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lxfecn0msm15s8xxxqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lxfecn0msm15s8xxxqn.png" alt=" " width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you run a model through DMR, Docker Desktop spins up a local HTTP server exposing an &lt;strong&gt;OpenAI-compatible REST API&lt;/strong&gt; including &lt;strong&gt;&lt;code&gt;/v1/chat/completions&lt;/code&gt;, &lt;code&gt;/v1/completions&lt;/code&gt;, and &lt;code&gt;/v1/models&lt;/code&gt;.&lt;/strong&gt; Any application or SDK already speaking the OpenAI protocol works against &lt;strong&gt;DMR&lt;/strong&gt; with &lt;strong&gt;zero code changes&lt;/strong&gt;, making it a drop-in local alternative for &lt;strong&gt;AI-powered development&lt;/strong&gt;.&lt;/p&gt;
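&lt;p&gt;Because the API is OpenAI-compatible, any HTTP client can talk to it. A small helper that builds such a request; the base URL and model name in the example are assumptions (the default TCP port and path may differ depending on your Docker Desktop settings and version):&lt;/p&gt;

```python
import json

def chat_request(base_url, model, prompt):
    # Build the URL and JSON body for an OpenAI-style chat completion call.
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

# Hypothetical target: a DMR endpoint on the default port; adjust to your setup.
url, body = chat_request("http://localhost:12434/engines/v1", "ai/smollm2", "Say hello")
```

&lt;p&gt;You can POST this with any HTTP client, or simply point an OpenAI SDK's base URL at the same address and call it as if it were the hosted API.&lt;/p&gt;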

&lt;ol&gt;
&lt;li&gt;Install the latest Docker Desktop for your OS&lt;/li&gt;
&lt;li&gt;Start Docker Desktop&lt;/li&gt;
&lt;li&gt;Click the Settings icon in the top-right corner&lt;/li&gt;
&lt;li&gt;Select AI, then enable Docker Model Runner and host-side TCP support, as shown below&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: the default TCP port is 12434; you can change it to any free port on your machine. I set mine to 5018.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvsuil46rdgcut6n9eo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvsuil46rdgcut6n9eo5.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click Models in the left sidebar, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvvwic76k9j6alv0fdyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvvwic76k9j6alv0fdyl.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now click Pull to download the model, then run it.&lt;/p&gt;

&lt;p&gt;The screenshot below shows the two open-source models I pulled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvoh2k46d2j6dlsamh1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvoh2k46d2j6dlsamh1o.png" alt=" " width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test the model within Docker Desktop itself
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jq2kz1mhz637xno19v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jq2kz1mhz637xno19v9.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing GPT-OSS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0t3p4067lzo6qxp0187.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0t3p4067lzo6qxp0187.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;docker model&lt;/strong&gt; subcommand is your primary interface. Let's walk through pulling and running &lt;strong&gt;qwen3.5&lt;/strong&gt; step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pull a model from Docker Hub
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvk2a51jr11o11ym7goy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvk2a51jr11o11ym7goy.png" alt=" " width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. List available models (i.e., which models are downloaded locally)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyca2oj16b0i2mhg44rk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyca2oj16b0i2mhg44rk2.png" alt=" " width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick reference cheat sheet
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewauiqveujpg3c2cpq2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewauiqveujpg3c2cpq2y.png" alt=" " width="742" height="1093"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Docker Model Runner matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpokrx5bhlfcx6hybkzgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpokrx5bhlfcx6hybkzgl.png" alt=" " width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using DMR in your applications
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Python with the OpenAI SDK&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since DMR speaks the OpenAI protocol, swap the base URL and you're done; no model-specific library is needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:5018/engines/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bye&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-oss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
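&lt;p&gt;The same endpoint also supports streaming: pass &lt;code&gt;stream=True&lt;/code&gt; to &lt;code&gt;chat.completions.create&lt;/code&gt; and join the incremental deltas. The helper below sketches that accumulation against stand-in chunk objects; the dataclasses only mimic the shape the OpenAI SDK yields and are not part of any library:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List, Optional

# Stand-ins mimicking the chunk shape the OpenAI SDK yields when stream=True.
@dataclass
class Delta:
    content: Optional[str]

@dataclass
class Choice:
    delta: Delta

@dataclass
class Chunk:
    choices: List[Choice]

def collect_stream(chunks):
    """Join the incremental deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk typically carries content=None
            parts.append(delta)
    return "".join(parts)

# With a real client: collect_stream(client.chat.completions.create(..., stream=True))
fake = [Chunk([Choice(Delta(t))]) for t in ["Hel", "lo", None, "!"]]
print(collect_stream(fake))  # Hello!
```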



&lt;h2&gt;
  
  
  Testing the above code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6gtnovj2s4uke3pn33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6gtnovj2s4uke3pn33.png" alt=" " width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Model Runner&lt;/strong&gt; closes the gap between &lt;strong&gt;containerized application development&lt;/strong&gt; and &lt;strong&gt;AI-powered application development&lt;/strong&gt;. By treating models as &lt;strong&gt;OCI&lt;/strong&gt; (Open Container Initiative) artifacts and exposing a standard OpenAI-compatible API, DMR lets you build with local LLMs using the same mental model, the same toolchain, and the same workflows you already use for everything else.&lt;/p&gt;

&lt;p&gt;The combination of &lt;strong&gt;zero-setup inference, hardware acceleration, and Compose&lt;/strong&gt; integration makes DMR the most practical way to add local AI capabilities to any project, whether you're building a RAG pipeline, a coding assistant, or a document summarizer.&lt;/p&gt;
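&lt;p&gt;The Compose integration works through a top-level &lt;code&gt;models&lt;/code&gt; element. The sketch below is a rough shape rather than a definitive reference; the exact keys depend on your Compose version, and &lt;code&gt;my-app&lt;/code&gt; and &lt;code&gt;ai/smollm2&lt;/code&gt; are placeholder names:&lt;/p&gt;

```yaml
services:
  my-app:
    build: .
    models:
      - llm           # Compose wires this service to the model below

models:
  llm:
    model: ai/smollm2   # any model published as an OCI artifact on Docker Hub
```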

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>docker</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agent Middleware in Microsoft Agent Framework 1.0</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sat, 04 Apr 2026 19:14:54 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/agent-middleware-in-microsoft-agent-framework-10-2bm0</link>
      <guid>https://future.forem.com/sreeni5018/agent-middleware-in-microsoft-agent-framework-10-2bm0</guid>
      <description>&lt;p&gt;&lt;em&gt;A familiar pipeline pattern applied to AI agents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Covers &lt;strong&gt;three&lt;/strong&gt; middleware types, &lt;strong&gt;registration&lt;/strong&gt; scopes, &lt;strong&gt;termination&lt;/strong&gt;, result override, and &lt;strong&gt;when to use each&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Not a New Idea
&lt;/h2&gt;

&lt;p&gt;If you have used &lt;strong&gt;ASP.NET Core&lt;/strong&gt; or &lt;strong&gt;Express.js&lt;/strong&gt;, you already understand the core concept. Both frameworks let you &lt;strong&gt;register&lt;/strong&gt; a &lt;strong&gt;chain&lt;/strong&gt; of functions around every request. Each function receives a context and a &lt;strong&gt;next() delegate&lt;/strong&gt;. Calling &lt;strong&gt;next() continues&lt;/strong&gt; the chain. Not calling it &lt;strong&gt;short-circuits&lt;/strong&gt; it. That is the pipeline pattern: &lt;strong&gt;a clean way to apply cross-cutting concerns like logging, authentication, and error handling without touching any business logic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft’s Agent Framework&lt;/strong&gt; applies this exact pattern to AI agents. The next() delegate becomes call_next(), &lt;strong&gt;the context object holds the agent’s conversation instead of an HTTP request&lt;/strong&gt;, and the pipeline wraps an &lt;strong&gt;AI reasoning turn instead of a web request&lt;/strong&gt;. If you know app.Use() or app.use(), you already know the shape of what follows.&lt;/p&gt;

&lt;p&gt;What is new, and worth understanding deeply, is that an agent turn is &lt;strong&gt;not a single request/response cycle&lt;/strong&gt;. It is a &lt;strong&gt;multi-step reasoning loop&lt;/strong&gt;, and Agent Framework exposes three distinct interception points within it. The rest of this post covers all three types, how they differ, when to use each, and how they come together in a real SQL agent example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Middleware
&lt;/h2&gt;

&lt;p&gt;The Agent Framework supports three types of middleware, each intercepting a different layer of execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent middleware&lt;/strong&gt; wraps agent runs, giving you access to inputs, outputs, and overall control flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function middleware&lt;/strong&gt; wraps individual tool calls, enabling input validation, result transformation, and execution control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat middleware&lt;/strong&gt; wraps the underlying requests sent to AI models, exposing raw messages, options, and responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three types support both &lt;strong&gt;function based&lt;/strong&gt; and &lt;strong&gt;class based&lt;/strong&gt; implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chaining
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrts2avkvnzmlsp9v0fu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrts2avkvnzmlsp9v0fu.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When multiple middleware of the same type are registered, they execute as a chain: each middleware calls &lt;strong&gt;&lt;code&gt;call_next()&lt;/code&gt;&lt;/strong&gt; to hand off control to the next one in line.&lt;/p&gt;

&lt;p&gt;Rather than passing updated values into &lt;strong&gt;&lt;code&gt;call_next()&lt;/code&gt;&lt;/strong&gt; as arguments, middleware mutates the shared context object directly. This means any changes you make to the context before calling &lt;code&gt;call_next()&lt;/code&gt; &lt;strong&gt;are automatically visible to downstream middleware&lt;/strong&gt;, with no need to thread values through the call explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution Order
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;-level middleware always wraps &lt;strong&gt;run&lt;/strong&gt;-level middleware. Given agent middleware &lt;code&gt;[A1, A2]&lt;/code&gt; and run middleware &lt;code&gt;[R1, R2]&lt;/code&gt;, the execution order is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A1 → A2 → R1 → R2 → Agent → R2 → R1 → A2 → A1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Function and chat middleware follow the same wrapping principle, applied at the time of each tool call or chat request respectively.&lt;/p&gt;
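&lt;p&gt;The ordering above can be reproduced with a small framework-free sketch: each layer records an entry before awaiting the next layer and an exit while the stack unwinds. Names like &lt;code&gt;wrap&lt;/code&gt; and &lt;code&gt;trace&lt;/code&gt; are illustrative, not Agent Framework APIs:&lt;/p&gt;

```python
import asyncio

def wrap(name, inner, log):
    # Each layer logs on entry, awaits the next layer (the call_next hand-off),
    # then logs again while the stack unwinds.
    async def layer():
        log.append(name)
        await inner()
        log.append(name)
    return layer

def trace(agent_mw, run_mw):
    """Return the execution trace for agent middleware wrapping run middleware."""
    log = []
    async def agent():
        log.append("Agent")  # the actual reasoning turn sits at the center
    pipeline = agent
    # Wrap from the inside out, so the first registered middleware is outermost.
    for name in reversed(agent_mw + run_mw):
        pipeline = wrap(name, pipeline, log)
    asyncio.run(pipeline())
    return log

print(" → ".join(trace(["A1", "A2"], ["R1", "R2"])))
```

&lt;p&gt;Running the sketch prints the same trace as the diagram: A1 → A2 → R1 → R2 → Agent → R2 → R1 → A2 → A1.&lt;/p&gt;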

&lt;h2&gt;
  
  
  Why we need it
&lt;/h2&gt;

&lt;p&gt;The biggest value is not convenience; it is correctness and consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without middleware&lt;/strong&gt;, teams usually end up in one or both of these patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: policy hidden in prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example instruction:&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  "Never run destructive SQL. Never send data to personal email."
&lt;/h2&gt;

&lt;p&gt;This is useful guidance, but it is still model behavior, not a hard gate. As prompts get long, tools increase, and edge cases appear, this policy can become inconsistent. It is also hard to audit after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: policy duplicated in each tool&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;export_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gmail.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;quote_inventory_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks safe, but it creates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;duplicated logic&lt;/li&gt;
&lt;li&gt;inconsistent rules across tools&lt;/li&gt;
&lt;li&gt;expensive updates when policy changes&lt;/li&gt;
&lt;/ol&gt;
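&lt;p&gt;The duplicated guards above collapse naturally into a single table-driven check, which is exactly the kind of logic that belongs in middleware. A minimal sketch with hypothetical names (the real Agent Framework signatures appear later in this post):&lt;/p&gt;

```python
# One policy table instead of a copy of the check inside every tool.
BLOCK_RULES = [
    ("destructive SQL", lambda tool, args: tool == "run_sql" and "drop" in str(args).lower()),
    ("personal email target", lambda tool, args: tool == "export_data" and "gmail.com" in str(args).lower()),
]

def block_reason(tool, args):
    """Return why a tool call should be blocked, or None if it may proceed."""
    for reason, predicate in BLOCK_RULES:
        if predicate(tool, args):
            return reason
    return None

print(block_reason("run_sql", {"query": "DROP TABLE users"}))  # destructive SQL
print(block_reason("run_sql", {"query": "SELECT 1"}))          # None
```

&lt;p&gt;When policy changes, you edit the table once instead of hunting through every tool.&lt;/p&gt;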

&lt;h2&gt;
  
  
  Middleware fixes both
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With middleware&lt;/strong&gt;, concerns live at the right boundary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;run level checks in &lt;strong&gt;Agent middleware&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;per tool checks in &lt;strong&gt;Function middleware&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;model call telemetry/metadata in &lt;strong&gt;Chat middleware&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Result:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;cleaner tools&lt;/li&gt;
&lt;li&gt;stronger guardrails&lt;/li&gt;
&lt;li&gt;easier tests&lt;/li&gt;
&lt;li&gt;better observability&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Agent Middleware: the outermost layer
&lt;/h2&gt;

&lt;p&gt;Agent middleware is the &lt;strong&gt;outermost layer of the pipeline&lt;/strong&gt;. It fires &lt;strong&gt;once per turn&lt;/strong&gt;, before any LLM call is made and after the final response is produced, making it the right place for concerns that span the entire turn: input validation, security screening, audit logging, and output transformation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs50tvtnjc56tfbg7cbl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs50tvtnjc56tfbg7cbl.png" alt=" " width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Styles &amp;amp; Chaining
&lt;/h2&gt;

&lt;p&gt;Agent middleware supports both &lt;strong&gt;class-based&lt;/strong&gt; and &lt;strong&gt;function-based&lt;/strong&gt; implementations. The two are fully equivalent, and the choice comes down to whether you need &lt;strong&gt;instance state or prefer a lighter syntax&lt;/strong&gt;.&lt;br&gt;
When multiple middleware components are registered, they form a chain. Each component is responsible for calling call_next() to pass control to the next layer; omitting this call short-circuits the pipeline, preventing any downstream middleware or the LLM from running.&lt;/p&gt;

&lt;p&gt;Note that call_next() takes no arguments. Instead of passing updated values explicitly, middleware mutates the shared AgentContext object directly — any changes made before await call_next() are automatically visible to everything further down the chain.&lt;/p&gt;
&lt;h2&gt;
  
  
  Class-Based Implementation
&lt;/h2&gt;

&lt;p&gt;Subclass &lt;strong&gt;AgentMiddleware&lt;/strong&gt; and &lt;strong&gt;override process()&lt;/strong&gt;. The example below shows SecurityAgentMiddleware. It inspects the latest user message and short-circuits the pipeline if it detects a threat; the LLM is never invoked for blocked requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecurityAgentMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Agent-level guard: blocks risky **user chat text** before the model runs.

    Inspects ``context.messages[-1]`` (latest user turn). If :func:`_unsafe_input_reason`
    returns a reason, sets ``context.result`` to a canned assistant reply and **does not**
    call ``call_next()``, so the LLM and tools are skipped for that turn.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Only the latest user utterance is checked (typical for a single-turn REPL).
&lt;/span&gt;        &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_unsafe_input_reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[SecurityAgentMiddleware] Security Warning: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;; blocking request.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Short-circuit: set the assistant reply here; do NOT call call_next() → no LLM, no tools.
&lt;/span&gt;                &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request blocked: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[SecurityAgentMiddleware] Security check passed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Continue pipeline: model + optional run_sql; function middleware runs inside tool path.
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# here is the _unsafe_input_reason function &amp;amp; For brevity, I’ve omitted the full code.”
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_unsafe_input_reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Classify why a user message should be blocked, or ``None`` if it may proceed.

    Checks run in order: injection-style patterns first, then destructive natural language.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Order matters: catch obvious SQL fragments before broader NL patterns.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_looks_like_dangerous_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;injection-style or suspicious SQL fragment in your message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_looks_like_destructive_database_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;destructive database request (e.g. delete/drop/truncate)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws66yzcciue97cni8th3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws66yzcciue97cni8th3.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Function-Based and Decorator-Based Styles
&lt;/h2&gt;

&lt;p&gt;Agent Framework also supports function based and decorator based implementations. All three styles are equivalent; choose based on whether you need state or explicit type annotations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Function based
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

&lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;

&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Turn starting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Turn completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Decorator-based (no type annotation required)
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@agent_middleware&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simple_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Before agent execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After agent execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Registering Middleware
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Middleware is registered when constructing the agent&lt;/strong&gt;. Pass a list to the middleware argument; different middleware types can be mixed in the same list, and the framework routes each to the correct pipeline layer automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://sreeniagent.services.ai.azure.com/api/projects/sreeni_foundry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;FOUNDRY_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;AzureCliCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;project_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Your Microsoft Foundry project URL 
&lt;/span&gt;            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FOUNDRY_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# The model you deployed 
&lt;/span&gt;        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sreeni-SqlAssistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help users query a small demo database. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The only table is `customers` with columns id, name, city. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Always use the run_sql tool with a proper SELECT; explain results briefly.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent middleware wraps the turn; function middleware wraps each tool call
&lt;/span&gt;        &lt;span class="n"&gt;middleware&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SecurityAgentMiddleware&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;LoggingFunctionMiddleware&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When to Use Agent Middleware
&lt;/h2&gt;

&lt;p&gt;Agent middleware is the right choice for any concern that applies to the &lt;strong&gt;turn as a whole, rather than to a specific tool call or model request&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauwa9vp12vf96mw8boh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauwa9vp12vf96mw8boh1.png" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. FunctionMiddleware: The Tool Call Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FunctionMiddleware&lt;/strong&gt; fires inside the agent turn, but only when the &lt;strong&gt;LLM decides to invoke a tool&lt;/strong&gt;. A single agent turn can trigger multiple tool calls, and FunctionMiddleware wraps each one independently. This makes it the right place for concerns that are specific to tool execution: timing, input validation, result &lt;strong&gt;transformation, and tool call auditing.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The FunctionInvocationContext Object
&lt;/h2&gt;

&lt;p&gt;Each FunctionMiddleware component receives a FunctionInvocationContext, which is scoped to a single tool invocation: it exposes the function being called (context.function), per-turn metadata (context.metadata), and the tool result and termination flag (context.result, context.terminate).&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use FunctionMiddleware
&lt;/h2&gt;

&lt;p&gt;Use it for concerns &lt;strong&gt;specific to tool execution&lt;/strong&gt;: execution &lt;strong&gt;timing&lt;/strong&gt; and performance monitoring, &lt;strong&gt;validating&lt;/strong&gt; or sanitising tool arguments before they run, capping the number of times a tool may be called in one turn, transforming tool results before the LLM sees them, or auditing exactly which tools were called and with what arguments.&lt;/p&gt;
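&lt;p&gt;As a concrete illustration of the timing use case, here is a framework-free sketch. timing_middleware, run_tool, and the plain-dict context are hypothetical stand-ins, not Agent Framework APIs; the point is that each tool call in a turn gets its own independent wrap:&lt;/p&gt;

```python
import asyncio
import time

timings = {}

async def timing_middleware(context, next):
    # Measure just this one tool invocation.
    start = time.perf_counter()
    await next(context)
    timings[context["tool"]] = time.perf_counter() - start

async def run_tool(context):
    await asyncio.sleep(0.01)  # stand-in for real tool work
    context["result"] = "ok"

async def main():
    # One agent turn may trigger several tool calls; each is wrapped independently.
    for tool in ("run_sql", "fetch_weather"):
        await timing_middleware({"tool": tool}, run_tool)

asyncio.run(main())
print(sorted(timings))  # ['fetch_weather', 'run_sql']
```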

&lt;h2&gt;
  
  
  Terminating the Function Calling Loop
&lt;/h2&gt;

&lt;p&gt;Setting &lt;strong&gt;context.terminate = True&lt;/strong&gt; inside FunctionMiddleware does something powerful: it stops the LLM’s function calling loop entirely. The LLM will not receive the tool result and will not make any further tool calls in this turn. This is useful for enforcing tool call budgets or stopping a loop that is going in an undesirable direction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@function_middleware&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;budget_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="c1"&gt;# Allow at most one SQL query per turn
&lt;/span&gt;
 &lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query limit reached for this turn.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

 &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terminate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# stop the LLM tool-calling loop
&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt;

 &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning: Termination and Chat History&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Terminating the function calling loop can leave the chat history in an inconsistent state: a tool-call message with no corresponding tool result. This may cause errors if the same history is used in subsequent agent runs. Use termination carefully and consider clearing or repairing the history afterward.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. ChatMiddleware: The LLM Call Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ChatMiddleware&lt;/strong&gt; is the deepest layer. It wraps the actual inference call sent to the &lt;strong&gt;underlying language model&lt;/strong&gt;: the raw list of messages, the model options, and the response that comes back. This layer fires for every call to the &lt;strong&gt;LLM within a turn, which can be more than one if tools are used.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The ChatContext Object
&lt;/h2&gt;

&lt;p&gt;Each ChatMiddleware component receives a ChatContext.&lt;/p&gt;

&lt;h2&gt;
  
  
  Function Based Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_chat_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;

  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chat] Sending &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; messages to model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chat] Model response received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because &lt;strong&gt;ChatMiddleware&lt;/strong&gt; sees the exact message list going to the model, it can be used to inject system instructions, strip sensitive content, enforce token budgets, or even substitute a cached response, all without the &lt;strong&gt;AgentMiddleware&lt;/strong&gt; or &lt;strong&gt;FunctionMiddleware&lt;/strong&gt; layers knowing anything changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use ChatMiddleware
&lt;/h2&gt;

&lt;p&gt;Use it when you need access to the raw LLM call: injecting or modifying system level instructions per call, redacting PII from messages before they leave your infrastructure, enforcing token count limits, caching repeated inference calls, or monitoring every model request for compliance purposes.&lt;/p&gt;
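&lt;p&gt;As a concrete illustration of the redaction use case, here is a framework-free sketch. redact_middleware and fake_model_call are hypothetical stand-ins (not Agent Framework APIs): the middleware scrubs email addresses from the outgoing message list, and the "model call" only ever sees the redacted text:&lt;/p&gt;

```python
import asyncio
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

async def redact_middleware(context, next):
    # Rewrite the message list before it leaves our infrastructure.
    context["messages"] = [EMAIL.sub("[REDACTED]", m) for m in context["messages"]]
    await next(context)

async def fake_model_call(context):
    # Stand-in for the real inference call; it sees only redacted text.
    context["response"] = f"saw {len(context['messages'])} messages"

ctx = {"messages": ["Contact me at jane@example.com", "Summarise the thread"]}
asyncio.run(redact_middleware(ctx, fake_model_call))
print(ctx["messages"][0])  # Contact me at [REDACTED]
```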

&lt;h1&gt;
  
  
  Registration: Agent Level vs. Run Level
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Agent Framework&lt;/strong&gt; supports &lt;strong&gt;two scopes&lt;/strong&gt; for registering middleware. Understanding the difference is important for designing flexible agent systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Level Middleware
&lt;/h2&gt;

&lt;p&gt;Middleware passed in the middleware=[...] list when constructing the Agent applies to every single call to agent.run() for the lifetime of that agent. This is where you put policies that should always be enforced: security guards, mandatory audit logging, content filters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run Level Middleware
&lt;/h2&gt;

&lt;p&gt;You can also pass middleware directly to a single agent.run() call. This middleware applies only to that one invocation and is discarded afterward. It is useful for per request customisation: adding a trace ID for a specific call, applying extra validation for a sensitive operation, or attaching a debug logger without affecting every other turn.&lt;/p&gt;
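&lt;p&gt;The difference between the two scopes can be sketched with a toy agent. MiniAgent and both middleware names here are hypothetical, plain-Python stand-ins for the framework's behaviour: agent-level middleware lives with the agent, while run-level middleware is merged in for one invocation and then discarded:&lt;/p&gt;

```python
import asyncio

class MiniAgent:
    def __init__(self, middleware):
        self.middleware = list(middleware)            # agent-level scope

    async def run(self, prompt, middleware=()):
        active = self.middleware + list(middleware)   # run-level: this call only
        ctx = {"prompt": prompt, "log": []}

        async def handler(c):
            c["log"].append("turn")

        chain = handler
        for mw in reversed(active):                   # first in list = outermost
            def wrap(mw, inner):
                async def call(c):
                    await mw(c, inner)
                return call
            chain = wrap(mw, chain)
        await chain(ctx)
        return ctx["log"]

async def always_audit(ctx, next):
    ctx["log"].append("audit")
    await next(ctx)

async def debug_trace(ctx, next):
    ctx["log"].append("debug")
    await next(ctx)

agent = MiniAgent([always_audit])
print(asyncio.run(agent.run("hi", middleware=[debug_trace])))  # ['audit', 'debug', 'turn']
print(asyncio.run(agent.run("hi")))                            # ['audit', 'turn']
```

&lt;p&gt;The second run shows the run-level middleware is gone: only the agent-level policy fires on every invocation.&lt;/p&gt;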

&lt;h1&gt;
  
  
  Choosing the Right Middleware Type
&lt;/h1&gt;

&lt;p&gt;With three types available, the choice usually comes down to what you need to see and at what granularity.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hrawj5dnjxdu3e9llv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hrawj5dnjxdu3e9llv0.png" alt=" " width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Microsoft Agent Framework’s middleware brings the same pipeline contract you know from &lt;strong&gt;ASP.NET Core and Express&lt;/strong&gt; (ordered components, a context object, and a call_next() delegate) into the world of AI agents. The structural difference is that an agent turn is not a single request/response cycle but a multi-step reasoning loop, and Agent Framework exposes three separate interception points within it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentMiddleware&lt;/strong&gt; is the right home for &lt;strong&gt;turn level&lt;/strong&gt; concerns: &lt;strong&gt;security screening, content policy, and audit logging&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FunctionMiddleware&lt;/strong&gt; is the right home for &lt;strong&gt;tool level&lt;/strong&gt; concerns: execution timing, argument validation, and tool call budgets. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatMiddleware&lt;/strong&gt; is the right home for &lt;strong&gt;model level&lt;/strong&gt; concerns: raw message inspection, token enforcement, and caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Five Agent Memory Types in LangGraph: A Deep Code Walkthrough (Part 2)</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Fri, 03 Apr 2026 03:23:44 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/five-agent-memory-types-in-langgraph-a-deep-code-walkthrough-part-2-17kb</link>
      <guid>https://future.forem.com/sreeni5018/five-agent-memory-types-in-langgraph-a-deep-code-walkthrough-part-2-17kb</guid>
      <description>&lt;p&gt;In &lt;strong&gt;Part-1 [&lt;a href="https://dev.to/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn"&gt;https://dev.to/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn&lt;/a&gt;]&lt;/strong&gt; we covered the &lt;strong&gt;five memory types&lt;/strong&gt;, why the LLM is stateless by design, and why memory is always an &lt;strong&gt;infrastructure&lt;/strong&gt; concern. This post is the how. Same five types, but now we wire each one up with &lt;strong&gt;LangGraph&lt;/strong&gt;, dissect every line of code, flag the gotchas, and leave you with a single working script you can run today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before We Write a Single Line: Two Things You Must Understand
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;Context Window&lt;/strong&gt; Is the Only Reality&lt;br&gt;
&lt;strong&gt;Repeat this like a mantra&lt;/strong&gt;: the model only knows what is in the context window at inference time. Every token (your message, retrieved facts, conversation history, tool results, system instructions) has to be physically present in that window at the moment of the call. If it is not there, the model does not know it exists. Your memory infrastructure's entire job is to decide what goes in, when, and in what form.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Checkpointer ≠ Store&lt;/strong&gt;: This Confusion Breaks Designs&lt;br&gt;
LangGraph gives you two distinct persistence hooks, and mixing them up is the most common architecture mistake beginners make.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrd37pf23jlkg9mkceed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrd37pf23jlkg9mkceed.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical consequence:&lt;/strong&gt; if you store a user preference in the &lt;strong&gt;checkpointer&lt;/strong&gt; (i.e., in state["messages"]), it &lt;strong&gt;vanishes&lt;/strong&gt; the moment you start a new thread_id. If you store it in the store, it is there regardless of which thread the user returns on. Choose deliberately.&lt;/p&gt;
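&lt;p&gt;The split can be sketched with two plain dictionaries. This is a teaching model of the keying only, not the LangGraph checkpointer or store API: the checkpointer is keyed by thread_id, the store by a cross-thread user namespace, so only the store survives a new thread_id:&lt;/p&gt;

```python
checkpointer = {}   # thread_id -> message list (per-conversation state)
store = {}          # (user_id, key) -> value (cross-thread long-term memory)

def turn(thread_id, user_id, message):
    # Every message lands in this thread's checkpoint history.
    checkpointer.setdefault(thread_id, []).append(message)
    # Durable preferences are written to the store, keyed by user, not thread.
    if message.startswith("remember:"):
        store[(user_id, "preference")] = message.split(":", 1)[1].strip()

turn("thread-1", "alice", "remember: reply in French")
turn("thread-2", "alice", "hello again")

print(store[("alice", "preference")])  # reply in French  (visible from any thread)
print(checkpointer["thread-2"])        # ['hello again']  (thread-2 never saw it)
```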

&lt;p&gt;&lt;strong&gt;For local production setups you typically use SQLite for both, as two separate files:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SqliteSaver&lt;/strong&gt; → durable per thread checkpoint history&lt;br&gt;
&lt;strong&gt;SqliteStore&lt;/strong&gt; → durable cross thread LTM/episodic records&lt;/p&gt;

&lt;p&gt;The demos below use &lt;em&gt;InMemory*&lt;/em&gt; backends so you can run them with zero setup. That is a teaching choice, not a recommendation for production.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Environment Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install langgraph langchain-openai langchain-community faiss-cpu python-dotenv

export OPENAI_API_KEY=sk-...
export OPENAI_CHAT_MODEL=gpt-4o-mini   # optional, this is the default
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;macOS note:&lt;/strong&gt; If you have PyTorch installed alongside FAISS, two OpenMP runtimes may be loaded and Python will abort on import. The fix is one line: &lt;strong&gt;os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")&lt;/strong&gt; — set it before importing FAISS. The full script at the end does this automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KMP_DUPLICATE_LIB_OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRUE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Must be before FAISS import
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySaver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_store&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.store.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryStore&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Memory Type 1: Short-Term Memory (STM), the Conversation Buffer
&lt;/h2&gt;

&lt;p&gt;What is it?&lt;/p&gt;

&lt;p&gt;Short-term memory (STM) is the rolling transcript of the current conversation. It is what allows the model to understand "make it shorter" without you specifying what "it" refers to. Every prior message in the session is assembled into the context window on each subsequent call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythondef&lt;/span&gt; &lt;span class="nf"&gt;demo_short_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Short-term memory = this thread&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s message list, restored by the checkpointer.

    The same thread_id on each invoke reloads prior turns into state[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]
    so the model sees continuity without you manually merging history.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# state["messages"] already contains ALL prior turns for this thread_id,
&lt;/span&gt;        &lt;span class="c1"&gt;# restored from the checkpoint. We pass the full list to the LLM.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Compile with a checkpointer. Without this, state is not saved between invokes.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-stm-demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;

    &lt;span class="c1"&gt;# First turn: store the codename.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My codename for this session is Bluejay.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Second turn: only the new message is passed in.
&lt;/span&gt;    &lt;span class="c1"&gt;# The checkpointer reloads the first turn automatically.
&lt;/span&gt;    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What codename did I give?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[STM] Last reply:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Line-by-line breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;def chat(state: MessagesState) -&amp;gt; dict:&lt;/strong&gt;&lt;br&gt;
This is the only node in the graph. MessagesState is a TypedDict with one key: messages. By the time this function executes on the second invoke, state["messages"] already contains the full history: the original "My codename…" message, the model's reply to it, and the new "What codename…" message. The checkpointer loaded the prior checkpoint and the &lt;strong&gt;add_messages&lt;/strong&gt; reducer merged the new input on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app = graph.compile(checkpointer=InMemorySaver())&lt;/strong&gt;&lt;br&gt;
This is the critical line. Without checkpointer=, each &lt;strong&gt;invoke&lt;/strong&gt; starts with an empty state. With it, LangGraph saves a snapshot after every node completes and restores it at the start of the next invoke for the same &lt;strong&gt;thread_id&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cfg: dict = {"configurable": {"thread_id": tid}}&lt;/strong&gt;&lt;br&gt;
This config dict is how you identify which conversation thread this call belongs to. The same thread_id = same checkpoint = continuity. A different thread_id = blank slate. This is intentional — you support multiple concurrent users by giving each a unique thread_id.&lt;/p&gt;
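&lt;p&gt;As a concrete illustration of that isolation, a small helper can derive the config per user (the helper name config_for is ours, not a LangGraph API):&lt;/p&gt;

```python
def config_for(user_id: str) -> dict:
    """Build the invoke config for one user's conversation thread.

    Same user -> same thread_id -> the checkpointer restores their history.
    Different users get different thread_ids and never share a checkpoint.
    """
    return {"configurable": {"thread_id": f"user-{user_id}"}}

# app.invoke({"messages": [...]}, config_for("alice"))  # Alice's own thread
```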

&lt;p&gt;&lt;strong&gt;app.invoke({"messages": [HumanMessage("What codename did I give?")]}, cfg)&lt;/strong&gt;&lt;br&gt;
Notice we only pass the new message. We do not rebuild the history manually. The &lt;strong&gt;checkpointer&lt;/strong&gt; and the &lt;strong&gt;add_messages&lt;/strong&gt; reducer do that for us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The token budget problem and how to handle it&lt;/strong&gt;&lt;br&gt;
STM has one fundamental weakness: &lt;strong&gt;as the conversation grows, the context window fills up. For production systems you have two standard strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Truncation — &lt;strong&gt;drop the oldest messages once you exceed a token threshold&lt;/strong&gt;. Simple, but the model loses early context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Summarization — periodically ask the LLM to write a running summary of the conversation so far, then replace the old messages with that summary. More expensive, but preserves the gist.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
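&lt;p&gt;Truncation fits in a few lines. This is a minimal sketch: count_tokens is a placeholder for your real tokenizer (e.g. a tiktoken-based counter), not a LangGraph built-in:&lt;/p&gt;

```python
def truncate_history(messages: list, max_tokens: int, count_tokens) -> list:
    """Drop the oldest messages until the history fits the token budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # discard the oldest message first
    return kept
```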

&lt;p&gt;LangGraph does not do this automatically for you. You would add a summarization node that fires conditionally when len(state["messages"]) exceeds a threshold.&lt;/p&gt;
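&lt;p&gt;A sketch of that conditional routing: the node names and threshold below are illustrative, not LangGraph built-ins. The function itself is just a plain callable you would pass to graph.add_conditional_edges:&lt;/p&gt;

```python
MAX_MESSAGES = 20  # illustrative threshold, tune for your token budget

def route_after_input(state: dict) -> str:
    """Conditional edge: divert to a summarize node once history grows too long.

    Wired as: graph.add_conditional_edges(START, route_after_input), where the
    hypothetical "summarize" node compresses old turns before "model" runs.
    """
    if len(state["messages"]) > MAX_MESSAGES:
        return "summarize"
    return "model"
```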
&lt;h2&gt;
  
  
  Production upgrade
&lt;/h2&gt;

&lt;p&gt;Swap &lt;strong&gt;InMemorySaver()&lt;/strong&gt; for &lt;strong&gt;SqliteSaver.from_conn_string("checkpoints.db")&lt;/strong&gt; and thread history survives process restarts. Swap in &lt;strong&gt;AsyncPostgresSaver&lt;/strong&gt; for a cloud-deployed, multi-instance setup.&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory Type 2: Long-Term Memory (LTM) — Cross-Thread Persistence
&lt;/h2&gt;

&lt;p&gt;What is it?&lt;/p&gt;

&lt;p&gt;Long-term memory (LTM) solves the problem that checkpoints can't: persistence across different thread_id values. When a user returns next week in a new session (new thread_id), their preferences, constraints, and facts should still be available. That requires the store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_long_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Long-term memory = LangGraph Store: keyed data shared across thread_ids.

    Checkpoints reset per thread; store.put / get survives that boundary.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# get_store() is injected by LangGraph at runtime because the graph
&lt;/span&gt;        &lt;span class="c1"&gt;# was compiled with store=. Do not pass the store as a function argument.
&lt;/span&gt;        &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_store&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Namespace is a tuple of strings — like a file path for your data.
&lt;/span&gt;        &lt;span class="c1"&gt;# ("users", "demo-user", "facts") scopes this record to one user.
&lt;/span&gt;        &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remember:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Extract the fact and store it under key "profile" in this namespace.
&lt;/span&gt;            &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;

        &lt;span class="c1"&gt;# For any other query, retrieve the stored fact and inject it as context.
&lt;/span&gt;        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

        &lt;span class="c1"&gt;# The retrieved fact goes into a SystemMessage so it conditions the reply
&lt;/span&gt;        &lt;span class="c1"&gt;# without appearing as part of the user's message.
&lt;/span&gt;        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored user fact (long-term): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Thread A: Store the user's preference.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remember: I always want concise bullet answers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Thread B: Completely different thread_id. No shared checkpoint history.
&lt;/span&gt;    &lt;span class="c1"&gt;# But store.get still finds the preference stored under the same namespace.
&lt;/span&gt;    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What style do I prefer?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LTM] Reply on a *different* thread_id:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Line-by-line breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;store = get_store()&lt;/strong&gt;&lt;br&gt;
This is not get_store from a module-level import in the traditional sense; it is called inside the node function at runtime. LangGraph's execution engine makes the compiled store available via this call. Using the store object from the outer scope directly inside a node happens to work in this simple example, but get_store() is the correct pattern for production because it handles async contexts and subgraph injection correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ns = ("users", "demo-user", "facts")&lt;/strong&gt;&lt;br&gt;
Namespaces are tuples of strings. Think of them as a path in a key-value hierarchy. You could have ("users", user_id, "facts") for facts, &lt;strong&gt;("users", user_id, "episodes")&lt;/strong&gt; for events, and ("global", "config") for shared config. The store does not enforce any schema — the structure is entirely yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;store.put(ns, "profile", {"text": fact})&lt;/strong&gt;&lt;br&gt;
Three arguments: namespace tuple, key string, value dict. The value must be JSON-serializable. Here we use a single "profile" key which gets overwritten each time. For multi-fact storage you'd use a unique key per fact (perhaps the fact's text, hashed, or a UUID).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;item = store.get(ns, "profile")&lt;/strong&gt;&lt;br&gt;
Returns an Item object (or None if the key does not exist). The dict you stored is at item.value. Always check for None before accessing .value: a missing key returns None, not an exception.&lt;/p&gt;
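&lt;p&gt;For the unique-key-per-fact approach, one option (our helper, not a LangGraph API) is to derive the key from a hash of the normalized fact text, so re-storing the same fact overwrites itself instead of duplicating:&lt;/p&gt;

```python
import hashlib

def fact_key(text: str) -> str:
    """Deterministic store key for a fact: same text -> same key."""
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()[:16]

# store.put(ns, fact_key(fact), {"text": fact})  # one record per distinct fact
```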

&lt;p&gt;&lt;strong&gt;The SystemMessage injection pattern&lt;/strong&gt;&lt;br&gt;
Retrieved LTM facts almost always go into a &lt;strong&gt;SystemMessage&lt;/strong&gt;, not a &lt;strong&gt;HumanMessage&lt;/strong&gt;. This is intentional: you are giving the model background context before it reads the user's actual query. Putting it in the system prompt keeps it conceptually separate from the conversation.&lt;/p&gt;
&lt;h2&gt;
  
  
  What "vector-based LTM" looks like
&lt;/h2&gt;

&lt;p&gt;In the demo, retrieval is a direct key lookup: store.get(ns, "profile"). In production you typically want semantic retrieval — given the user's current query, find the most relevant stored facts, not all of them. The pattern is:&lt;/p&gt;

&lt;p&gt;On write: embed the fact text, store embedding + text + metadata.&lt;br&gt;
On read: embed the current query, run similarity search, inject top-k results.&lt;/p&gt;

&lt;p&gt;LangGraph's &lt;strong&gt;SqliteStore&lt;/strong&gt; and &lt;strong&gt;InMemoryStore&lt;/strong&gt; both support a search(namespace, query=..., limit=k) call when an embedding function is configured. For larger scale, swap the store backend for Pinecone, Weaviate, or ChromaDB with the same put/get/search interface pattern.&lt;/p&gt;
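&lt;p&gt;The embed-on-write / search-on-read pattern can be sketched end to end with a toy bag-of-words embedding. This is a stand-in for illustration only; production would use a real embedding model behind the store's search call:&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. Stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticFactStore:
    """Minimal embed-on-write / search-on-read fact store."""

    def __init__(self):
        self._items = []  # list of (embedding, text) pairs

    def put(self, text: str) -> None:
        # On write: embed the fact and store embedding + text together.
        self._items.append((embed(text), text))

    def search(self, query: str, limit: int = 2) -> list:
        # On read: embed the query, rank by similarity, return top-k texts.
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:limit]]
```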
&lt;h2&gt;
  
  
  Production upgrade
&lt;/h2&gt;

&lt;p&gt;Replace InMemoryStore() with SqliteStore.from_conn_string("ltm.db") for local durability, or use a cloud vector store for multi-instance deployments.&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory Type 3: Working Memory — The Reasoning Scratchpad
&lt;/h2&gt;

&lt;p&gt;What is it?&lt;/p&gt;

&lt;p&gt;Working memory is the temporary state that accumulates across multiple nodes within a single graph run. When an agent needs to research five things before answering one question, intermediate results need somewhere to live between steps. That place is an extra field in the graph state, cleared when the run ends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The code&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Custom state schema: messages + a scratchpad notes list.

    The Annotated[list[str], operator.add] declaration tells LangGraph:
    when multiple nodes return a &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; key, concatenate the lists
    rather than replacing the field. This is the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reducer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; pattern.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Simulated research/tool step.
    In a real agent this would call APIs, databases, or search tools.
    Returns a partial state update — only the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; field.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor A monthly price = $49&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor B monthly price = $39&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_working_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Working memory: research node fills notes, answer node reads them in one run.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# By the time this node runs, state["notes"] contains everything
&lt;/span&gt;        &lt;span class="c1"&gt;# appended by research_step (and any other upstream nodes).
&lt;/span&gt;        &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer using only the working notes below.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;## Working notes&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which competitor is cheaper and by how much?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# No checkpointer needed for working memory.
&lt;/span&gt;    &lt;span class="c1"&gt;# The scratchpad lives only for the duration of this single invoke call.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Working] Final:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line-by-line breakdown&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;notes: Annotated[list[str], operator.add]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the key architectural decision. Without the operator.add reducer, a later node's {"notes": [...]} write would simply replace the current value, and LangGraph rejects concurrent writes from parallel nodes to a plain channel outright. With operator.add, LangGraph merges each update by calling operator.add(current_notes, new_notes), which for lists is concatenation. Multiple research nodes can all write notes and they accumulate correctly.&lt;/p&gt;
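&lt;p&gt;Outside LangGraph, the reducer semantics are just Python's operator.add applied to the channel's old and new values. A minimal sketch (plain Python, no LangGraph required):&lt;/p&gt;

```python
import operator
from typing import Annotated, TypedDict

class WorkingState(TypedDict):
    # The Annotated metadata tells LangGraph HOW to merge writes to this key.
    notes: Annotated[list[str], operator.add]

# What the framework effectively does when two nodes each return {"notes": [...]}:
write_1 = ["competitor A charges $49/mo"]
write_2 = ["competitor B charges $40/mo"]
merged = operator.add(write_1, write_2)  # list + list = concatenation
print(merged)
# ['competitor A charges $49/mo', 'competitor B charges $40/mo']
```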

&lt;p&gt;&lt;strong&gt;graph.add_edge(START, "research") and graph.add_edge("research", "answer")&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This creates a sequential two-step pipeline: the research node runs first and populates notes, then answer runs and reads the accumulated notes. This is a simple linear chain; real agents might have fan-out (multiple parallel research nodes) feeding into a single synthesis node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app = graph.compile() (no checkpointer)&lt;/strong&gt;&lt;br&gt;
Working memory is intentionally ephemeral. You do not need a checkpointer for it. Adding one would checkpoint the scratchpad state, which is sometimes useful for debugging but not necessary for the pattern to work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app.invoke({"messages": [], "notes": []})&lt;/strong&gt;&lt;br&gt;
Both fields are initialized explicitly. The empty list is the starting point for the operator.add reducer: every note a node returns is concatenated onto it. Initializing the full state up front also keeps node code simple, since no node has to guard against a missing key.&lt;/p&gt;
&lt;h2&gt;
  
  
  The multi-node fan-out pattern
&lt;/h2&gt;

&lt;p&gt;The real power of working memory emerges when you parallelize:&lt;br&gt;
START → [research_a, research_b, research_c] → synthesize → END&lt;br&gt;
Each research node appends to notes. Because all three use operator.add, their results accumulate in whatever order they complete. The synthesize node sees all of them. You would wire this with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthesize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthesize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthesize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Working memory vs. long-term memory: the key difference&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Working Memory&lt;/th&gt;
&lt;th&gt;Long-Term Memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lifespan&lt;/td&gt;
&lt;td&gt;One invoke call&lt;/td&gt;
&lt;td&gt;Indefinite, across sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Graph state (in-process)&lt;/td&gt;
&lt;td&gt;Store backend (in-memory or durable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Accumulate intermediate results&lt;/td&gt;
&lt;td&gt;Persist user facts and preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cleared when&lt;/td&gt;
&lt;td&gt;invoke returns&lt;/td&gt;
&lt;td&gt;Explicitly deleted, or never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;
  
  
  Memory Type 4: Episodic Memory — The Event Log
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;br&gt;
Episodic memory stores what happened, not just what is true. Long-term memory holds preferences ("I like bullet points"); episodic memory holds events ("Last Tuesday we reviewed three quotes and chose Plan B"). It is the agent's diary — structured, timestamped, queryable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The code&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_episodic_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Episodic memory = append-only events (task, outcome, ...), recalled by search.

    In production: add timestamps, semantic search over episode summaries,
    and filters by date range, task type, or user ID.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Namespace: scoped to this user's episode log.
&lt;/span&gt;    &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Each episode gets a UUID so records are uniquely addressable.
&lt;/span&gt;    &lt;span class="c1"&gt;# If the same event needs to be updated later (e.g., outcome changed),
&lt;/span&gt;    &lt;span class="c1"&gt;# use the same key. For append-only logs, always generate a fresh UUID.
&lt;/span&gt;    &lt;span class="n"&gt;eid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;eid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pricing_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chose plan B after comparing three quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# In production, add: "timestamp": datetime.utcnow().isoformat()
&lt;/span&gt;            &lt;span class="c1"&gt;# and embed the outcome text for semantic search.
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve recent episodes. In production, filter by timestamp or
&lt;/span&gt;    &lt;span class="c1"&gt;# use store.search(ns, query="pricing decision", limit=5) for semantic recall.
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Episodic] Stored episodes:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line-by-line breakdown&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;eid = str(uuid.uuid4())&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each episode is a separate record with a unique key&lt;/strong&gt;. This is the append-only pattern: each event gets a fresh key, so no episode ever overwrites another. If you instead need to mark an episode as completed or update its outcome, reuse the same UUID as the key (the put call will overwrite that record). The choice depends on whether you want a full audit trail or just the latest state of each event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;store.put(ns, eid, {...})&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The value dict can contain any JSON-serializable data.&lt;/strong&gt; In production, you would always include a timestamp so you can filter by date range. You might also store the full conversation summary, the user who triggered it, the tool calls made, and structured outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;store.search(ns, limit=5)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without a query parameter, search returns the most recently written records up to limit. With a query string and an embedding function configured on the store, it performs semantic similarity search over stored records. The toy demo uses simple listing; real recall would look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Production-style episodic recall (pseudocode):
results = store.search(
    ns,
    query="what pricing decisions did we make?",
    limit=5,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The r.value access&lt;/strong&gt;&lt;br&gt;
store.search returns a list of SearchItem objects. Each has .key, .namespace, and .value (the dict you stored). Filter and process them however you need before injecting into context.&lt;/p&gt;
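&lt;p&gt;For instance, a small hypothetical helper (episodes_to_context is an illustration, not a library function) that turns the retrieved .value dicts into a context block for the prompt:&lt;/p&gt;

```python
def episodes_to_context(values: list[dict]) -> str:
    # values: the .value dicts pulled off SearchItem results.
    lines = [f"- {v['task']}: {v['outcome']}" for v in values]
    return "## Past episodes\n" + "\n".join(lines)

print(episodes_to_context([
    {"task": "pricing_review", "outcome": "Chose plan B after comparing three quotes"},
]))
```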
&lt;h2&gt;
  
  
  Connecting episodic memory to the conversation
&lt;/h2&gt;

&lt;p&gt;The episodic demo is intentionally standalone — it shows the storage pattern without a full graph. In a real agent, you'd write episodes in an after-action node that fires after every task completes, and you'd surface them in a context-building node at the start of each new session:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;START → retrieve_episodes → main_agent → [task] → log_episode → END&lt;/strong&gt;&lt;/p&gt;
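&lt;p&gt;A sketch of what the log_episode step could look like. The store argument stands in for any object with a put(namespace, key, value) method (such as a LangGraph store); the DictStore stand-in below exists only so the sketch runs without LangGraph installed:&lt;/p&gt;

```python
import uuid
from datetime import datetime, timezone

def log_episode(store, user_id: str, task: str, outcome: str) -> str:
    """After-action write: append one timestamped episode record."""
    ns = ("users", user_id, "episodes")
    eid = str(uuid.uuid4())  # fresh key per event: append-only log
    store.put(ns, eid, {
        "task": task,
        "outcome": outcome,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return eid

class DictStore:
    """Minimal stand-in with the same put() shape as a real store."""
    def __init__(self):
        self.records = {}
    def put(self, ns, key, value):
        self.records[(ns, key)] = value

store = DictStore()
log_episode(store, "demo-user", "pricing_review", "Chose plan B")
print(len(store.records))  # 1
```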
&lt;h2&gt;
  
  
  Memory Type 5: Semantic Memory (Retrieval-Augmented Generation, RAG)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Semantic memory is your agent's domain knowledge layer grounded in a corpus of verified text, retrieved dynamically rather than hallucinated from training weights. The pattern is: &lt;strong&gt;embed a query, find the most relevant document chunks&lt;/strong&gt;, inject those chunks as tool output, let the model answer from the retrieved evidence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Build a small FAISS vector index over profile documents.

    In production: load from PDFs, databases, or a web crawl.
    Use a persistent vector store (Pinecone, Weaviate, ChromaDB) instead of FAISS
    so the index survives process restarts.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seenivasa Ramadurai works at Provizient. He architects cloud-native software — &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microservices, gRPC, REST — and delivers GenAI, LLMs, and agentic patterns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At Provizient, skills include C#, Python, Java, Scala, TypeScript; LLMs, RAG, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestration; ML and MLOps; vector databases; APIs; Kubernetes and Docker.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Node factory: bind a list of tools to the LLM and return a graph node function.

    bind_tools() tells the model what tools are available and how to call them.
    The model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response may be a plain AIMessage OR an AIMessage with tool_calls populated.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Pass the full message history (including any prior tool results) to the model.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_semantic_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Semantic memory: model calls a KB search tool, ToolNode executes it,
    results are appended to messages, model reads them and answers.
    This is the standard ReAct (Reason + Act) loop.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Retrieve top-k chunks from the profile knowledge base.

        The docstring is shown to the LLM as the tool description —
        write it clearly so the model knows when and how to use this tool.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Two nodes: the LLM agent and the tool executor.
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Conditional routing: if the agent emitted tool calls → run ToolNode.
&lt;/span&gt;    &lt;span class="c1"&gt;# If the agent emitted a final answer → END.
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__end__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# After ToolNode runs, go back to the agent so it can read the tool results.
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# No checkpointer needed for this demo, but you'd add one in production.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which company does Seenivasa work for, and what are some of his skills? &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use the knowledge tool.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Semantic] Last message:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line-by-line breakdown&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAISS.from_documents([...], OpenAIEmbeddings())&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;FAISS (Facebook AI Similarity Search) builds an in-memory vector index. &lt;strong&gt;OpenAIEmbeddings()&lt;/strong&gt; calls an OpenAI embedding model (&lt;strong&gt;text-embedding-ada-002&lt;/strong&gt; by default, unless you configure a newer one) to convert each document chunk into a vector. from_documents is a class method that handles both embedding and indexing in one call. For production, prefer a persistent vector store: the index built here lives only in RAM and is rebuilt from scratch on every process start, and a managed store also handles scaling, filtering, and concurrent access.&lt;/p&gt;
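&lt;p&gt;Under the hood, similarity_search is nearest-neighbor search over embedding vectors. A toy illustration with made-up 2-dimensional "embeddings" (real embeddings have hundreds or thousands of dimensions):&lt;/p&gt;

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy 2-d vectors standing in for embedded document chunks.
docs = {
    "works_at_provizient": [0.9, 0.1],
    "skills_list": [0.1, 0.9],
}
query = [0.85, 0.15]  # toy embedding of "Which company does Seenivasa work for?"

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # works_at_provizient
```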

&lt;p&gt;&lt;strong&gt;@tool decorator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;@tool decorator from langchain_core.tools&lt;/strong&gt; does three things: (1) wraps the Python function so it can be called by ToolNode, (2) extracts the function signature to build a JSON schema for the tool parameters, and (3) uses the docstring as the tool description sent to the LLM. Write clear docstrings: the model reads them to decide which tool to call and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;model.bind_tools(tools)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This attaches the tool definitions to the model in the format required by the OpenAI function-calling API. When you call bound.invoke(messages), the model can now return an AIMessage with a populated tool_calls list in addition to (or instead of) plain text content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tools_condition&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a prebuilt LangGraph router function. It inspects the last message in state: if it has tool_calls, it returns "tools"; otherwise it returns "__end__". The conditional edge uses this to route traffic. The {"tools": "tools", "__end__": END} dict maps those return values to node names.&lt;/p&gt;
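&lt;p&gt;In spirit, the router reduces to a few lines. This is a sketch of the behavior, not the library's actual source, and FakeMessage is a stand-in for an AIMessage:&lt;/p&gt;

```python
def route(last_message) -> str:
    # AIMessage.tool_calls is a non-empty list when the model requested a tool.
    if getattr(last_message, "tool_calls", None):
        return "tools"
    return "__end__"

class FakeMessage:
    def __init__(self, tool_calls=None):
        self.tool_calls = tool_calls or []

print(route(FakeMessage(tool_calls=[{"name": "profile_kb_search"}])))  # tools
print(route(FakeMessage()))  # __end__
```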

&lt;p&gt;&lt;strong&gt;graph.add_edge("tools", "agent")&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After ToolNode executes the tool call and appends the result as a ToolMessage to state, control returns to the agent. The agent now sees the tool result in its message history and generates a final answer. This loop continues until the agent produces a response with no tool calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The execution flow, step by step
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;User: "Which company does Seenivasa work for?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. agent node runs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM sees the question + tool definition&lt;/li&gt;
&lt;li&gt;LLM responds: AIMessage(tool_calls=[{name: "profile_kb_search", args: {query: "Seenivasa company"}}])&lt;/li&gt;
&lt;li&gt;tools_condition sees tool_calls → routes to "tools"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. tools node runs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ToolNode calls profile_kb_search("Seenivasa company")&lt;/li&gt;
&lt;li&gt;FAISS returns the two most similar chunks&lt;/li&gt;
&lt;li&gt;Result appended as ToolMessage to state["messages"]&lt;/li&gt;
&lt;li&gt;Edge sends control back to "agent"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. agent node runs again:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM now sees: original question + tool call + tool result&lt;/li&gt;
&lt;li&gt;LLM produces a final AIMessage with no tool_calls&lt;/li&gt;
&lt;li&gt;tools_condition sees no tool_calls → routes to END&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Graph returns state["messages"][-1].content&lt;/strong&gt; = the grounded answer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not just put knowledge in the system prompt?&lt;/strong&gt;&lt;br&gt;
For small knowledge bases, you could. For anything non-trivial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompts have token limits&lt;/li&gt;
&lt;li&gt;You pay for all tokens even if most are irrelevant&lt;/li&gt;
&lt;li&gt;RAG retrieves only what's relevant to the current query&lt;/li&gt;
&lt;li&gt;You can update the knowledge base without redeploying the agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Complete, Runnable Script&lt;/strong&gt;&lt;br&gt;
Copy this file, set OPENAI_API_KEY, and run it. All five memory patterns execute sequentially.&lt;/p&gt;

&lt;p&gt;"""&lt;br&gt;
Five agent memory patterns with LangGraph (Part 2 companion script).&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory types demonstrated:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Short-term  : MessagesState + InMemorySaver + stable thread_id&lt;/li&gt;
&lt;li&gt;Long-term   : InMemoryStore + get_store() across different thread_ids&lt;/li&gt;
&lt;li&gt;Working     : Custom WorkingState with notes merged via operator.add&lt;/li&gt;
&lt;li&gt;Episodic    : Append-only store rows + search (toy recall)&lt;/li&gt;
&lt;li&gt;Semantic    : FAISS + @tool + ReAct loop (ToolNode / tools_condition)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All demos use InMemory* backends (zero setup required).&lt;br&gt;
For production: swap InMemorySaver → SqliteSaver, InMemoryStore → SqliteStore.&lt;/p&gt;
&lt;h2&gt;
  
  
  Dependencies:
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install langgraph langchain-openai langchain-community faiss-cpu python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Environment:
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY  (required)
OPENAI_CHAT_MODEL  (optional, defaults to gpt-4o-mini)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;"""&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Set before any FAISS import to prevent OpenMP duplicate library crash on macOS.
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KMP_DUPLICATE_LIB_OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRUE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySaver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_store&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.store.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryStore&lt;/span&gt;

&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CHAT_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_CHAT_MODEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;require_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Exit with a clear message if the OpenAI key is missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: Set OPENAI_API_KEY in the environment or in a .env file next to this script.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  1. SHORT-TERM MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_short_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;STM: conversation buffer restored per thread_id via checkpointer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-stm-demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;

    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My codename for this session is Bluejay.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What codename did I give?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[STM] Last reply:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  2. LONG-TERM MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_long_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;LTM: LangGraph Store persists facts across different thread_ids.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_store&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remember:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;

        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored user fact (long-term): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remember: I always want concise bullet answers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What style do I prefer?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LTM] Reply on a *different* thread_id:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. WORKING MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;State with a scratchpad: notes lists from all nodes are concatenated.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Simulated research node — returns structured data into working memory.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor A monthly price = $49&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor B monthly price = $39&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_working_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Working memory: research node fills notes, answer node reads them.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer using only the working notes below.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;## Working notes&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which competitor is cheaper and by how much?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Working] Final:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
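The `operator.add` reducer on `notes` is what makes the scratchpad accumulate: every node's returned list is concatenated onto the existing state instead of replacing it. A minimal sketch of that merge rule, outside of LangGraph:

```python
import operator

# Simulate how the reducer folds successive node updates into state.
notes: list[str] = []
for update in (["Competitor A monthly price = $49"],
               ["Competitor B monthly price = $39"]):
    notes = operator.add(notes, update)  # equivalent to notes + update

# Both updates survive in order, ready for the answer node to read.
print(notes)
```

This is why `research_step` can simply return its own partial list: the framework applies the reducer, and no node ever has to read-modify-write the whole scratchpad.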



&lt;h2&gt;
  
  
  4. EPISODIC MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_episodic_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Episodic memory: one logged event written to store, recalled via search.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;eid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;eid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pricing_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chose plan B after comparing three quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Episodic] Stored episodes:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
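The demo above recalls whatever sits in the namespace; a real episodic store ranks past episodes by relevance to the current task, usually with embeddings. A pure-Python sketch of that recall step (hypothetical helper, with word overlap standing in for vector similarity):

```python
def recall_episodes(episodes: list[dict], query: str, k: int = 2) -> list[dict]:
    """Rank stored episodes by word overlap with the query and return the top k.
    A production system would use embedding similarity instead."""
    q = set(query.lower().split())
    scored = sorted(
        episodes,
        key=lambda ep: len(q & set(ep["outcome"].lower().split())),
        reverse=True,
    )
    return scored[:k]


episodes = [
    {"task": "pricing_review", "outcome": "Chose plan B after comparing three quotes"},
    {"task": "onboarding", "outcome": "Set up the new workspace for the team"},
]
print(recall_episodes(episodes, "which plan did we choose", k=1))
```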



&lt;h2&gt;
  
  
  5. SEMANTIC MEMORY (RAG)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build an in-memory FAISS index over profile document chunks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seenivasa Ramadurai works at Provizient. He architects cloud-native software — &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microservices, gRPC, REST — and delivers GenAI, LLMs, and agentic patterns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At Provizient, skills include C#, Python, Java, Scala, TypeScript; LLMs, RAG, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestration; ML and MLOps; vector databases; APIs; Kubernetes and Docker.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Node factory: bind tools to the LLM and return a graph node function.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_semantic_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Semantic memory: ReAct loop with FAISS retrieval tool.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve top-k chunks from the profile knowledge base.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__end__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which company does Seenivasa work for, and what are some of his skills? &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use the knowledge tool.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Semantic] Last message:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ENTRY POINT
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run all five memory demos in sequence.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;require_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CHAT_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 1. SHORT-TERM MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_short_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 2. LONG-TERM MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_long_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 3. WORKING MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_working_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 4. EPISODIC MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_episodic_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 5. SEMANTIC MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_semantic_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The 5 Types of AI Agent Memory Every Developer Needs to Know (Part 1)</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Thu, 02 Apr 2026 04:13:24 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn</link>
      <guid>https://future.forem.com/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn</guid>
      <description>&lt;p&gt;&lt;em&gt;Because building agents without understanding memory is like hiring an employee who forgets everything by morning.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Your Agent Is Not Broken. It Was Never Built to Remember.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is something most people get wrong when they first build an AI agent. They set it up, give it context, run a few tasks, and it works great. Then they come back the next session and it has no idea who they are, what the project is, or what was decided. So they open a GitHub issue. They try different prompts. They assume something is misconfigured.&lt;/p&gt;

&lt;p&gt;Nothing is misconfigured. The agent is working exactly as designed.&lt;br&gt;
The hard truth is this: &lt;strong&gt;agent memory is not a model problem. It is an infrastructure problem.&lt;/strong&gt; The LLM at the core of your agent is stateless by design: every inference call starts completely fresh. No history, no context, no record of what happened before. That is never going to change, because statelessness is precisely what allows LLMs to scale to millions of users at once.&lt;/p&gt;

&lt;p&gt;What this means for builders is important: &lt;strong&gt;you cannot give the model memory. You have to build memory infrastructure around it.&lt;/strong&gt;&lt;br&gt;
The agent does not remember. The infrastructure remembers. The agent only knows what the infrastructure decides to place in front of it inside the context window.&lt;/p&gt;

&lt;p&gt;That distinction is the foundation of everything in this post. Once you understand it, the five memory types stop being abstract concepts and start being concrete engineering decisions you make when designing an agent system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Window: Why It's at the Center of Every Memory Decision
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before we get into the memory types&lt;/strong&gt;, you need to understand one thing clearly: &lt;strong&gt;the context window is the only reality the LLM has&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every token the model can reason about&lt;/strong&gt; (your message, the conversation history, retrieved documents, tool outputs, system instructions) must be inside the &lt;strong&gt;context window at the moment of inference&lt;/strong&gt;. If it is not in the window, the model does not know it exists. Full stop.&lt;/p&gt;

&lt;p&gt;This is why memory architecture matters so much. Context windows are finite: they have token limits, they cost money to fill, and they reset completely between sessions. You cannot just dump everything into them and call it done. You need a system that intelligently decides what information gets retrieved and injected into that window, and at what moment.&lt;/p&gt;

&lt;p&gt;That system is agent memory. And because different situations demand different kinds of information (recent conversation turns, &lt;strong&gt;user preferences, mid-task reasoning state, past interaction history,&lt;/strong&gt; domain facts), &lt;strong&gt;there is not one type of memory but five&lt;/strong&gt;, each built to retrieve and inject the right information at the right moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Memory Problem Got Serious
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI applications did not start as agents&lt;/strong&gt;. They started as simple &lt;strong&gt;request-response systems&lt;/strong&gt;: you send a message, the model replies, &lt;strong&gt;nothing is retained&lt;/strong&gt;. Each call was &lt;strong&gt;completely isolated from the last&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;first attempt&lt;/strong&gt; to &lt;strong&gt;fix this was brute force: send the entire conversation history with every request&lt;/strong&gt;. It worked well enough for short conversations, but it was never really memory; it was just a &lt;strong&gt;growing pile of text being thrown at the model each time&lt;/strong&gt;. Once conversations got long enough, older messages fell off the token limit and disappeared. &lt;strong&gt;The "memory" was already leaking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Then models gained the ability to &lt;strong&gt;call tools (APIs, databases, search engines), and the use cases jumped entirely&lt;/strong&gt;. Now you could build agents: systems that take a &lt;strong&gt;goal&lt;/strong&gt;, &lt;strong&gt;break it into steps&lt;/strong&gt;, &lt;strong&gt;call tools, observe results&lt;/strong&gt;, and &lt;strong&gt;loop until the task is complete&lt;/strong&gt;. Then &lt;strong&gt;came multi-agent systems&lt;/strong&gt;, where specialized agents work as a team, routing tasks between each other like a coordinated workforce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each step forward&lt;/strong&gt; made the &lt;strong&gt;memory problem worse&lt;/strong&gt;. A single chatbot &lt;strong&gt;forgetting context is annoying&lt;/strong&gt;. An agent losing state mid-task is a failure. A multi-agent system where no agent knows what the others have decided is a broken system. The "stuff everything into the context window" approach simply does not hold at this level of complexity.&lt;/p&gt;

&lt;p&gt;What you need instead is &lt;strong&gt;intentional memory architecture&lt;/strong&gt;: a layer that knows what to store, how long to keep it, and exactly when to surface it. &lt;strong&gt;That layer is built on five distinct memory types&lt;/strong&gt;, each designed to solve a different part of the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Types of Agent Memory
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Short-Term Memory (STM): The Conversation Buffer
&lt;/h2&gt;

&lt;p&gt;Short-Term Memory (STM) is the simplest form of agent memory and the one you are almost certainly already using without thinking about it.&lt;br&gt;
Every message the user sends and every response the agent gives gets stored in a session buffer. &lt;strong&gt;That buffer gets assembled into the context window on every subsequent request&lt;/strong&gt;. This is how the agent understands &lt;strong&gt;follow-up questions&lt;/strong&gt;: when you say "make it shorter," it knows what "it" refers to because the prior exchange is sitting in the context window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk6kgtm7c146plepqait.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk6kgtm7c146plepqait.png" alt=" " width="518" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The technical implementation is a rolling token buffer. When the buffer approaches the model's &lt;strong&gt;token limit, older messages get truncated or summarized before dropping off&lt;/strong&gt;. New inputs overwrite old ones. When the session ends, the buffer clears entirely.&lt;br&gt;
Think of it like RAM in a computer: fast, active, and useful right now. But the moment you turn it off, it's gone.&lt;/p&gt;
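&lt;p&gt;Here is a minimal sketch of such a rolling buffer. The class name, token heuristic, and budget are illustrative (a real framework would use the model's actual tokenizer), but the trimming logic is the core idea:&lt;/p&gt;

```python
# Sketch of a short-term memory buffer: oldest messages are dropped
# once the rough token budget is exceeded. Names are illustrative.

def estimate_tokens(text):
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

class ConversationBuffer:
    def __init__(self, max_tokens=200):
        self.max_tokens = max_tokens
        self.messages = []  # list of (role, content) tuples

    def add(self, role, content):
        self.messages.append((role, content))
        self._trim()

    def _trim(self):
        # Drop the oldest messages until the budget fits again.
        while self._total() > self.max_tokens and len(self.messages) > 1:
            self.messages.pop(0)

    def _total(self):
        return sum(estimate_tokens(c) for _, c in self.messages)

    def render(self):
        # Assemble the buffer into the prompt for the next request.
        return "\n".join(f"{role}: {content}" for role, content in self.messages)
```

&lt;p&gt;Production frameworks often summarize evicted messages instead of discarding them outright, but the shape of the mechanism is the same.&lt;/p&gt;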

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; Conversation coherence within a single session. &lt;strong&gt;Follow up questions&lt;/strong&gt;. Context continuity across a short interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does not solve:&lt;/strong&gt; Anything beyond the current session. Come back tomorrow, and the agent has no idea who you are.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Long-Term Memory (LTM): Persistence Across Sessions
&lt;/h2&gt;

&lt;p&gt;Long Term Memory is what makes an agent feel like it actually knows you.&lt;/p&gt;

&lt;p&gt;Instead of losing everything when a session ends, LTM stores important information in a persistent external store: user preferences, past decisions, project context, communication style, recurring constraints. The next time you interact with the agent, the most relevant pieces of that stored knowledge get retrieved and injected into the context window before the model ever sees your message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkk0wvchvjcklpo4unnu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkk0wvchvjcklpo4unnu.png" alt=" " width="560" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standard implementation &lt;strong&gt;uses a vector database&lt;/strong&gt; like &lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Weaviate&lt;/strong&gt;, or &lt;strong&gt;ChromaDB&lt;/strong&gt;. When something worth remembering happens, it gets converted into a vector embedding and stored with metadata. On future sessions, incoming queries trigger a similarity search; the &lt;strong&gt;top-k&lt;/strong&gt; most semantically relevant memories are retrieved and quietly injected into context. The model then responds as if it already knew those things about you, because from its perspective, it does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow in practice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User shares something reusable: &lt;strong&gt;preferences, goals, constraints, project structure&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;That information is embedded and stored in the vector database&lt;/li&gt;
&lt;li&gt;On every future session, a similarity search retrieves what is relevant&lt;/li&gt;
&lt;li&gt;Retrieved memories are injected into the context window before the model processes the request&lt;/li&gt;
&lt;li&gt;Memory updates when new important information is provided&lt;/li&gt;
&lt;/ul&gt;
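&lt;p&gt;The workflow above can be sketched end to end in a few lines. The bag-of-words "embedding" and in-memory store below are toy stand-ins for a real embedding model and a vector database such as Pinecone, Weaviate, or ChromaDB; only the store-search-inject shape is the point:&lt;/p&gt;

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words counts.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LongTermMemory:
    def __init__(self):
        self.records = []  # (embedding, text) pairs; a real store adds metadata

    def store(self, text):
        self.records.append((embed(text), text))

    def retrieve(self, query, top_k=2):
        # Similarity search: rank stored memories against the query.
        q = embed(query)
        scored = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [text for _, text in scored[:top_k]]

ltm = LongTermMemory()
ltm.store("User prefers cost over speed in trade-off decisions")
ltm.store("Team report format: one-page executive summary")
ltm.store("User's favorite editor is VS Code")

# The top-k relevant memories get injected into the context window.
relevant = ltm.retrieve("How should I weigh cost versus speed here?")
```

&lt;p&gt;A real implementation swaps embed() for an embedding API and the list for a vector index, but the retrieval contract is identical.&lt;/p&gt;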

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; &lt;strong&gt;Cross session personalization. User preference retention. Long running project continuity&lt;/strong&gt;. Making the agent feel like a real colleague who knows your context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An AI assistant that remembers your name, your team's preferred report format, and the fact that you always prioritize cost over speed in trade-off decisions, even when you return after weeks away.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Working Memory: The Reasoning Scratchpad
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Working Memory&lt;/strong&gt; is what the agent uses while it is actively thinking through a complex, &lt;strong&gt;multi-step task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you ask an &lt;strong&gt;agent to research five competitors&lt;/strong&gt;, extract their pricing, compare them against your product, and write a summary recommendation. That is not one step; it is a chain of steps where each result feeds into the next. Working memory is the temporary store that holds intermediate results across those steps, so the agent does not lose track of what it has already done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsuwo4q93dh72rfz6ft4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsuwo4q93dh72rfz6ft4.png" alt=" " width="531" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without working memory, each loop iteration in an agentic workflow would start with no knowledge of previous iterations. The agent would spin in circles or repeat steps it had already completed.&lt;br&gt;
The implementation is typically an in-memory structure (a dict or JSON object) maintained by the agent framework across loop iterations. At each step, the current working memory state gets injected into the context window alongside the new task, so the model can build on prior results. Once the task is complete, working memory is cleared.&lt;/p&gt;
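&lt;p&gt;That scratchpad pattern can be sketched like this. The task, the research_competitor() tool, and the prices are all invented for illustration; what matters is the dict that survives across loop iterations:&lt;/p&gt;

```python
# Sketch of working memory as a scratchpad dict carried across loop steps.
# The "tool" and its data are stand-ins for real tool calls.

def research_competitor(name):
    # Placeholder tool: returns fabricated pricing for illustration.
    prices = {"Acme": 49, "Globex": 79}
    return prices[name]

def run_task(competitors):
    working_memory = {"pricing": {}, "steps_done": []}

    for name in competitors:                       # the agentic loop
        price = research_competitor(name)          # act
        working_memory["pricing"][name] = price    # store intermediate result
        working_memory["steps_done"].append(name)  # track progress

    # The final step builds on everything accumulated so far.
    cheapest = min(working_memory["pricing"], key=working_memory["pricing"].get)
    summary = f"Cheapest competitor: {cheapest}"
    return summary, working_memory

summary, scratchpad = run_task(["Acme", "Globex"])
```

&lt;p&gt;In a framework like LangGraph, this dict becomes the graph state passed between nodes; the principle is unchanged.&lt;/p&gt;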

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; &lt;strong&gt;Multi step task execution&lt;/strong&gt;. Complex reasoning chains. Agentic loops that need to carry state from one iteration to the next without losing the thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An agent planning a travel itinerary holds flights, hotel constraints, budget limits, and date conflicts in working memory, building the full picture step by step before producing a final recommendation.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Episodic Memory: The Interaction Log
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Episodic Memory&lt;/strong&gt; gives an agent the &lt;strong&gt;ability to recall specific things that happened in the past: not just general preferences&lt;/strong&gt;, but actual events with context and outcomes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae8las0a528rt5tjq94h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae8las0a528rt5tjq94h.png" alt=" " width="450" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;strong&gt;Long-Term Memory stores what you like, Episodic Memory stores what happened&lt;/strong&gt;. It is a structured log of past interactions, each saved as an event record with a timestamp, the task that was performed, inputs, actions taken, and the outcome. Think of it as the agent's diary: specific, timestamped, retrievable.&lt;/p&gt;

&lt;p&gt;When you come back and ask &lt;strong&gt;"what did we work on last week?"&lt;/strong&gt; or "remind me of the decision we made on the pricing model," the agent queries the episodic store by timestamp, keyword, or semantic similarity, &lt;strong&gt;retrieves the relevant episodes&lt;/strong&gt;, compresses them into a summary, and injects that summary into the current context window.&lt;br&gt;
This is also what enables agents to say things like: "Last time you reviewed this type of document, you flagged the legal section first; want me to start there again?" That is episodic memory working correctly.&lt;/p&gt;
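&lt;p&gt;An episodic store can be sketched as a list of timestamped event records with keyword and time-bounded recall. The events and field names below are invented for illustration:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Sketch of an episodic store: timestamped event records
# queried by keyword or by time window.

class EpisodicMemory:
    def __init__(self):
        self.episodes = []

    def log(self, task, outcome, when=None):
        # Each episode is a structured record of what happened.
        self.episodes.append({
            "when": when or datetime.now(),
            "task": task,
            "outcome": outcome,
        })

    def recall(self, keyword=None, since=None):
        hits = self.episodes
        if keyword:
            hits = [e for e in hits if keyword.lower() in e["task"].lower()]
        if since:
            hits = [e for e in hits if e["when"] >= since]
        return hits

mem = EpisodicMemory()
mem.log("Pricing model review", "Chose Option A for budget reasons",
        when=datetime.now() - timedelta(days=10))
mem.log("Contract review", "Flagged the legal section first",
        when=datetime.now() - timedelta(days=2))

# "What did we work on last week?" becomes a time-bounded query.
last_week = mem.recall(since=datetime.now() - timedelta(days=7))
```

&lt;p&gt;Production systems typically back this with a database and add semantic search over the records, then summarize the retrieved episodes before injecting them into context.&lt;/p&gt;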

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; Specific past event recall. &lt;strong&gt;Long running project continuity&lt;/strong&gt;. Agents that learn from experience and build on prior decisions rather than repeating mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; "Last time you chose Option A over Option B because of budget; should I apply the same logic here?" That sentence could only come from an agent with episodic memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Semantic Memory: The Knowledge Layer
&lt;/h2&gt;

&lt;p&gt;Semantic Memory is the agent's understanding of the world: &lt;strong&gt;facts&lt;/strong&gt;, concepts, domain knowledge, relationships between things, independent of any specific interaction with you.&lt;/p&gt;

&lt;p&gt;It is not about your history with the agent. It is about what the agent knows to be true. That Python is a programming language. That Singapore's corporate tax rate is 17%. That a JWT token expires and must be refreshed. This kind of knowledge lives either in the model's pre-trained weights or, more usefully for &lt;strong&gt;domain-specific&lt;/strong&gt; and &lt;em&gt;up-to-date&lt;/em&gt; needs, in an external knowledge base accessed through &lt;strong&gt;RAG&lt;/strong&gt; (&lt;strong&gt;R&lt;/strong&gt;etrieval &lt;strong&gt;A&lt;/strong&gt;ugmented &lt;strong&gt;G&lt;/strong&gt;eneration).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7twiwcyl4jc034ewgwug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7twiwcyl4jc034ewgwug.png" alt=" " width="493" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you ask a factual or domain specific question, the agent does a semantic search against the knowledge base, retrieves the most &lt;strong&gt;relevant facts, injects them into the context window&lt;/strong&gt;, and generates a grounded response. This is how you build agents that give accurate answers in specialized domains without hallucinating details they were never trained on.&lt;/p&gt;
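&lt;p&gt;A toy sketch of that grounding step. The fact base and the keyword-overlap "search" below are stand-ins for a real knowledge base and a semantic retriever; the point is that the answer is constrained to retrieved facts rather than the model's memory:&lt;/p&gt;

```python
# Sketch of grounding a response in retrieved facts.
# fact_base and search() are toy stand-ins for a real knowledge base
# and a real semantic retriever.

fact_base = [
    "Singapore's corporate tax rate is 17%.",
    "JWT access tokens expire and must be refreshed.",
    "Python is a dynamically typed programming language.",
]

def search(query, facts):
    # Toy keyword-overlap scoring in place of embedding similarity.
    q_words = set(query.lower().split())
    def score(fact):
        return len(q_words.intersection(fact.lower().split()))
    best = max(facts, key=score)
    return best if score(best) > 0 else None

def build_prompt(question):
    # Retrieved facts are injected into the context window,
    # and the model is told to answer only from them.
    fact = search(question, fact_base)
    context = fact if fact else "No relevant facts found."
    return f"Use only this context to answer.\nContext: {context}\nQuestion: {question}"

prompt = build_prompt("What is the corporate tax rate in Singapore?")
```

&lt;p&gt;Swap the toy search for a vector store query and the string for a chat request, and this is the skeleton of a RAG-backed semantic memory layer.&lt;/p&gt;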

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; Factual accuracy. Domain specific expertise. Keeping agents grounded in verified knowledge beyond their training cutoff. Enterprise knowledge bases where accuracy is non negotiable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An agent asked "Is &lt;strong&gt;Bangalore&lt;/strong&gt; more populous than &lt;strong&gt;Amaravathi&lt;/strong&gt;?" does not guess from training data; it queries semantic memory, retrieves the fact, and answers with confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  How All Five Work Together
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;These memory types are not mutually exclusive&lt;/strong&gt;: a well-designed &lt;strong&gt;agent uses&lt;/strong&gt; all of them &lt;strong&gt;simultaneously&lt;/strong&gt;, each handling a different layer of the memory problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnswrxcjxl8r17vkxki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnswrxcjxl8r17vkxki.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools and Frameworks That Make This Real
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is not theoretical. The tooling is production ready right now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain&lt;/strong&gt; handles buffer memory, summary memory, and vector-based LTM out of the box. It is the most flexible starting point for composing memory types together in one agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LlamaIndex&lt;/strong&gt; is purpose-built for connecting external knowledge sources (PDFs, APIs, databases, knowledge graphs), making it the go-to for &lt;strong&gt;RAG&lt;/strong&gt;-heavy Semantic Memory implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Weaviate&lt;/strong&gt;, and &lt;strong&gt;ChromaDB&lt;/strong&gt; are dedicated vector stores that power both &lt;strong&gt;LTM&lt;/strong&gt; and &lt;strong&gt;Semantic Memory&lt;/strong&gt; with fast, scalable similarity-based retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph&lt;/strong&gt; brings graph-based orchestration to stateful, multi-step agentic workflows; this is what &lt;strong&gt;Part 2&lt;/strong&gt; uses to wire all five memory types into a real working system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Strands&lt;/strong&gt; Agents provides production-grade agent infrastructure with memory at cloud scale, also covered hands-on in &lt;strong&gt;Part 2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Retrieval Finds Candidates. Reranking Finds the Right One.</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 30 Mar 2026 05:09:53 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/retrieval-finds-candidates-reranking-finds-the-right-one-1p0i</link>
      <guid>https://future.forem.com/sreeni5018/retrieval-finds-candidates-reranking-finds-the-right-one-1p0i</guid>
      <description>&lt;p&gt;&lt;em&gt;A hiring analogy that finally makes RAG Reranking click&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  First, What Is RAG?
&lt;/h2&gt;

&lt;p&gt;Before we get into the analogy, let me give you a 30-second crash course on &lt;strong&gt;RAG&lt;/strong&gt;, because this is where reranking lives.&lt;br&gt;
&lt;strong&gt;RAG&lt;/strong&gt; stands for &lt;strong&gt;R&lt;/strong&gt;etrieval &lt;strong&gt;A&lt;/strong&gt;ugmented &lt;strong&gt;G&lt;/strong&gt;eneration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Here's the problem it solves:
&lt;/h2&gt;

&lt;p&gt;Large Language Models (LLMs) like &lt;strong&gt;GPT&lt;/strong&gt; or &lt;strong&gt;Claude&lt;/strong&gt; are &lt;strong&gt;incredibly powerful&lt;/strong&gt; but &lt;strong&gt;they only know what they were trained on&lt;/strong&gt;. They don't know about your &lt;strong&gt;company's internal documents&lt;/strong&gt;, last week's product update, or your &lt;strong&gt;customer support knowledge base&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RAG fixes that by giving the LLM a memory it can search.&lt;br&gt;
Here's how it works in three simple steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieve&lt;/strong&gt; — When a user asks a question, the system searches your document library and pulls the most relevant chunks&lt;br&gt;
&lt;strong&gt;Augment&lt;/strong&gt; — Those retrieved chunks are added to the prompt as context&lt;br&gt;
&lt;strong&gt;Generate&lt;/strong&gt; — The LLM reads the context and generates a grounded, accurate answer&lt;/p&gt;

&lt;p&gt;Think of it like an &lt;strong&gt;open book exam&lt;/strong&gt;. The LLM doesn't have to &lt;strong&gt;memorize everything&lt;/strong&gt;; it just needs to find the &lt;strong&gt;right page and read it&lt;/strong&gt;. Simple enough. &lt;strong&gt;But here's where most RAG systems quietly fail.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Hiring Analogy That Changes Everything
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqobvry52ejg2pkr5ovj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqobvry52ejg2pkr5ovj.png" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of my friends recently asked me a &lt;strong&gt;simple but powerful question:&lt;/strong&gt; "Why do we even need &lt;strong&gt;reranking&lt;/strong&gt; after &lt;strong&gt;retrieval&lt;/strong&gt;? Isn't finding the right documents enough?" Instead of going technical, I said, "&lt;strong&gt;Let me tell you about a hiring process.&lt;/strong&gt;"&lt;br&gt;
Think of embedding-based retrieval as your HR or Talent Acquisition team.&lt;/p&gt;
&lt;h2&gt;
  
  
  Their job is to:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Scan thousands of resumes&lt;/li&gt;
&lt;li&gt;Filter based on keywords, skills, and experience&lt;/li&gt;
&lt;li&gt;Shortlist candidates that look relevant&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is exactly what &lt;strong&gt;vector similarity does&lt;/strong&gt;. It retrieves documents that are "&lt;strong&gt;close enough&lt;/strong&gt;" based on &lt;strong&gt;embeddings&lt;/strong&gt;: fast, broad, and essential.&lt;/p&gt;
&lt;h2&gt;
  
  
  But here's the problem nobody talks about:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;👉 Relevance is not correctness.&lt;/li&gt;
&lt;li&gt;👉 Similarity is not suitability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Just because a resume matches keywords doesn't mean the candidate can actually solve the hiring manager's real problem&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;The same way, just because a document is topically similar doesn't mean it actually answers the user's question.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Now enters the Hiring Manager.
&lt;/h2&gt;

&lt;p&gt;The hiring manager:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reviews the shortlisted candidates deeply&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluates beyond surface level keywords&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Matches candidates against the actual needs of the role&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rejects those who don't truly fit&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Surfaces the one who genuinely belongs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This step is exactly what we call &lt;strong&gt;Reranking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vxgux9ytd0zo3tovpa5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vxgux9ytd0zo3tovpa5.png" alt=" " width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  In AI Terms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval gives you &lt;strong&gt;Top-K similar documents&lt;/strong&gt; (the shortlist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking&lt;/strong&gt; evaluates &lt;strong&gt;semantic relevance&lt;/strong&gt; to the actual question (the deep review)&lt;/li&gt;
&lt;li&gt;It pushes the most &lt;strong&gt;useful answer to the top&lt;/strong&gt; and filters out the noise&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Real World Example: Cohere Reranking Model
&lt;/h2&gt;

&lt;p&gt;One of the most popular and production ready reranking solutions today is Cohere's Rerank API.&lt;/p&gt;

&lt;p&gt;Here's how it fits into a RAG pipeline in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohere&lt;/span&gt;

&lt;span class="n"&gt;co&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cohere&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Your retrieval system fetches top-K documents
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy for enterprise customers?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;retrieved_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our refund policy allows returns within 30 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise customers get dedicated support and SLA guarantees.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise plans include custom refund terms negotiated at contract signing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refunds are processed within 5–7 business days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support is available 24/7 for enterprise accounts.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Cohere Reranker evaluates each document against the query
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank-english-v3.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# Return only the top 3 most relevant
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Most relevant documents bubble to the top
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# rank is the reranked position; result.index points back into retrieved_docs&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rank &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relevance_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Cohere Rerank does differently:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;It doesn't just compare embeddings; it reads the query and document together&lt;/li&gt;
&lt;li&gt;It uses a &lt;strong&gt;cross encoder architecture&lt;/strong&gt; that understands the relationship between the question and each document&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It returns a relevance score for each document&lt;/strong&gt; so you know exactly why something ranked higher&lt;/li&gt;
&lt;li&gt;It works on top of any retrieval system &lt;strong&gt;FAISS&lt;/strong&gt;, &lt;strong&gt;Pinecone&lt;/strong&gt;, Weaviate, you name it&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Sample Output:
&lt;/h2&gt;

&lt;p&gt;Rank 1 | Score: 0.9821&lt;br&gt;
Document: Enterprise plans include custom refund terms negotiated at contract signing.&lt;/p&gt;

&lt;p&gt;Rank 2 | Score: 0.7134&lt;br&gt;
Document: Our refund policy allows returns within 30 days.&lt;/p&gt;

&lt;p&gt;Rank 3 | Score: 0.4821&lt;br&gt;
Document: Refunds are processed within 5–7 business days.&lt;/p&gt;

&lt;p&gt;Notice how the document that specifically answers the enterprise refund question jumps to the top, even though all five documents were "&lt;strong&gt;about&lt;/strong&gt;" refunds or enterprise. That's the hiring manager effect in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Insight
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without Reranking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You get &lt;strong&gt;good&lt;/strong&gt; &lt;strong&gt;looking&lt;/strong&gt; answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;But not always correct&lt;/strong&gt; or truly useful ones&lt;/li&gt;
&lt;li&gt;Your LLM is working with &lt;strong&gt;noisy&lt;/strong&gt;, approximate inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With Reranking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You move from approximate similarity → &lt;strong&gt;precise relevance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your LLM gets exactly the &lt;strong&gt;right context to generate sharp&lt;/strong&gt;, accurate answers&lt;/li&gt;
&lt;li&gt;The difference in output quality is night and day.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One Line Takeaway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt; is about &lt;strong&gt;finding&lt;/strong&gt; options. &lt;strong&gt;Reranking&lt;/strong&gt; is about making the &lt;strong&gt;right decision.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next time someone asks &lt;strong&gt;why reranking matters&lt;/strong&gt;, skip the jargon.&lt;br&gt;
Just say: "HR shortlists the candidates. The hiring manager picks the right one. Your AI needs both."&lt;br&gt;
Because in RAG systems, just like in hiring, &lt;strong&gt;getting the right candidates in the room is only half the battle. Choosing the right one is where the magic happens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt; &lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Agent Skills: The Missing Layer That Makes AI Agents Enterprise Ready</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Wed, 18 Mar 2026 06:55:45 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/agent-skills-the-missing-layer-that-makes-ai-agents-enterprise-ready-3gc</link>
      <guid>https://future.forem.com/sreeni5018/agent-skills-the-missing-layer-that-makes-ai-agents-enterprise-ready-3gc</guid>
      <description>&lt;h2&gt;
  
  
  How Enterprises Encode Institutional Knowledge into AI Agents
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F239prkrxyw1o8uqneppo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F239prkrxyw1o8uqneppo.png" alt=" " width="800" height="1175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;An AI agent is a system that uses a large language model (LLM) to reason, plan, and act to reach a goal. Unlike a simple chatbot that only answers questions, an agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perceive&lt;/strong&gt; its environment (user input, tool results, context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide&lt;/strong&gt; what to do next (reason, plan, choose tools)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt; by calling tools, APIs, or scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate&lt;/strong&gt; until the task is done (ReAct: Reason → Act → Observe → repeat)&lt;/li&gt;
&lt;/ul&gt;
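&lt;p&gt;The Reason → Act → Observe loop can be sketched in plain Python. Here &lt;code&gt;llm_decide&lt;/code&gt; and the tool registry are hypothetical stand-ins for a real model call and real tools:&lt;/p&gt;

```python
# Skeletal ReAct loop: Reason -> Act -> Observe -> repeat.
# llm_decide is a hard-coded stand-in for an actual LLM call.

def get_time(_):
    return "14:30 UTC"

TOOLS = {"get_time": get_time}

def llm_decide(goal, observations):
    """Stand-in for the model: returns either a tool call or a final answer."""
    if observations:
        return {"final": f"The time is {observations[-1]}"}
    return {"tool": "get_time", "args": None}

def run_agent(goal, max_steps=5):
    observations = []
    for _ in range(max_steps):
        step = llm_decide(goal, observations)        # Reason
        if "final" in step:
            return step["final"]
        result = TOOLS[step["tool"]](step["args"])   # Act
        observations.append(result)                  # Observe, then repeat

print(run_agent("What time is it?"))  # → The time is 14:30 UTC
```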

&lt;p&gt;AI agents are being deployed across every department: &lt;strong&gt;HR, Finance, Legal, IT, Customer Support&lt;/strong&gt;. They can read documents, call APIs, extract data, and take actions. On paper, the capability is there.&lt;/p&gt;

&lt;p&gt;But in practice, &lt;strong&gt;something keeps going wrong&lt;/strong&gt;. The agent gives an answer that is &lt;strong&gt;technically correct but does not match how your company actually operates&lt;/strong&gt;. It follows a process that made sense in general but violates your internal policy. It retrieves the right data but does not know what to do with it next. It escalates everything when only some things need escalating or escalates nothing when everything does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem is not the model&lt;/strong&gt;. &lt;strong&gt;The problem is that the agent has no knowledge of your organization&lt;/strong&gt;: your leave policies, your approval thresholds, your GL coding rules, your escalation criteria, your vendor requirements. It has been trained on the public internet. It knows what companies do in general. It does not know what your company does specifically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the gap that Agent Skills closes&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Agent Skills?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent Skills are a simple, open format for giving AI agents domain expertise and procedural knowledge&lt;/strong&gt;. They answer one question: How should the agent approach this kind of task?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pilot, the Plane, and the Flight Manual
&lt;/h2&gt;

&lt;p&gt;The clearest way to understand how the &lt;strong&gt;LLM&lt;/strong&gt;, &lt;strong&gt;Tools&lt;/strong&gt;, and &lt;strong&gt;Agent Skills&lt;/strong&gt; work together is through a single analogy, one that maps all three, not just two.&lt;/p&gt;

&lt;p&gt;Picture a &lt;strong&gt;commercial aircraft&lt;/strong&gt; sitting at the &lt;strong&gt;gate&lt;/strong&gt;, ready for &lt;strong&gt;departure&lt;/strong&gt;. It has everything a plane is supposed to have. But before we can talk about what makes it fly safely, we need to understand &lt;strong&gt;what each part actually does&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Controls: Tools
&lt;/h2&gt;

&lt;p&gt;The aircraft has throttles, a yoke, flaps, landing gear, and dozens of other physical controls. These are the mechanisms that change the state of the world. Push the throttle forward and the engines spool up. Deploy the flaps and the lift characteristics of the wing change. Lower the landing gear and the plane is ready to touch down.&lt;/p&gt;

&lt;p&gt;The controls cannot do anything on their own. They sit inert until someone operates them. But without them, nothing can happen at all, no matter how skilled the pilot or how detailed the manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In an AI agent, Tools are the controls&lt;/strong&gt;. They are the &lt;strong&gt;callable&lt;/strong&gt; functions that interact with the outside world: &lt;strong&gt;querying&lt;/strong&gt; a database, calling a &lt;strong&gt;REST API&lt;/strong&gt;, &lt;strong&gt;reading a PDF&lt;/strong&gt;, posting to Slack, writing a record to an ERP. &lt;strong&gt;Without tools, the agent can reason about anything but change nothing. With tools&lt;/strong&gt;, every decision the agent makes can become a real action in a real system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pilot: The LLM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The pilot is what brings the aircraft to life&lt;/strong&gt;. They sit in the &lt;strong&gt;cockpit&lt;/strong&gt;, read the instruments (altitude, airspeed, weather radar, traffic alerts), and make continuous decisions: when to climb, when to level off, when to adjust course, when to call air traffic control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pilot does not invent the controls&lt;/strong&gt;. The throttle was already there. The pilot decides which control to use, when to use it, in what combination, and in what sequence. They are the reasoning layer that turns raw capability into purposeful action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In an AI agent, the LLM is the pilot&lt;/strong&gt;. It reads the &lt;strong&gt;inputs&lt;/strong&gt; (the user message, the tool results, the conversation history) and decides what to do next. &lt;strong&gt;Which tool to call&lt;/strong&gt;. What parameters to pass. Whether the task is complete or needs another step. The LLM does not execute tools directly; it decides to invoke them, just as a pilot decides to operate a control.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flight Manual: Agent Skills
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Now imagine a highly experienced pilot in an unfamiliar aircraft type, flying into an airport they have never visited before&lt;/strong&gt;, under regulations they were not trained on. They can fly. They can read instruments. They can operate controls. But they are improvising every decision because they do not have the specific procedures for this situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The flight manual, the Standard Operating Procedures, is what fills that gap.&lt;/strong&gt; It tells the pilot exactly what checklist to run before takeoff at this airport. What altitude to maintain in this specific airspace. Precisely what to do when this warning light illuminates. How to coordinate with ground control using this airline’s specific protocols.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The manual does not fly the plane&lt;/strong&gt;. It does not operate the controls. What it does is ensure that every decision the pilot makes &lt;strong&gt;is the correct decision for this context&lt;/strong&gt;, not just a reasonable guess based on general experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In an AI agent, Agent Skills are the flight manual&lt;/strong&gt;. They encode your organization’s specific rules, workflows, and policies: the leave entitlements, the approval thresholds, the invoice validation steps, the escalation criteria. The LLM still does the reasoning. The tools still take the actions. But now every decision is grounded in your actual procedures, not generic training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yq8hfujr3gk21qbuyem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yq8hfujr3gk21qbuyem.png" alt=" " width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens When One Is Missing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Controls but no pilot and no manual (Tools only)&lt;/strong&gt;: the throttle is there but nothing is operating it. The agent has APIs it can call but no reasoning to decide which one, when, or in what order. It cannot complete a task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pilot but no controls (LLM only)&lt;/strong&gt;: the pilot reads every instrument perfectly and knows exactly what to do but has no way to act. The agent reasons flawlessly but cannot retrieve data, call a system, or change anything in the world. It can only generate text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pilot and controls but no manual (LLM + Tools, no Agent Skills):&lt;/strong&gt; the pilot can fly and the controls respond but every decision is improvised from general experience. This is most enterprise agents today. They work, inconsistently. They produce plausible answers that do not match your actual policies. Each run may go differently. Nothing is auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All three together:&lt;/strong&gt; the controls take action, the pilot reasons about what to do, and the manual ensures every decision follows your organization’s exact procedures. Consistent. Auditable. Trustworthy.&lt;/p&gt;

&lt;p&gt;Tools (controls) give the agent reach. The LLM (pilot) gives the agent reasoning. Agent Skills (flight manual) give the agent organizational judgment. You need all three just like a flight needs controls, a pilot, and the procedures to fly it safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Same Pattern Across Every Skilled Domain
&lt;/h2&gt;

&lt;p&gt;The analogy holds anywhere expertise is applied through instruments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surgeon and scalpel:&lt;/strong&gt; the scalpel can cut anywhere. The surgeon’s training specifies exactly where, how deep, at what angle, and what to do if something unexpected is found. Remove the training and the scalpel is just a sharp object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chef and kitchen:&lt;/strong&gt; the kitchen has every tool ovens, knives, heat, timers. The recipe encodes the sequence, temperatures, timings, and substitutions that produce a consistent dish. Without it, two chefs produce two different meals from the same ingredients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architect and CAD tools:&lt;/strong&gt; the software can draw anything. The architect’s expertise encodes load bearing constraints, building codes, spatial relationships, and material properties that make the drawing a safe, buildable structure.&lt;/p&gt;

&lt;p&gt;In every case the pattern is identical: tools provide capability, expertise provides judgment. The tool without the expertise is hardware. The expertise without the tool cannot act. Together they produce something reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agent Skills Actually Are
&lt;/h2&gt;

&lt;p&gt;A skill is a directory containing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SKILL.md (required)&lt;/strong&gt; — instructions in plain Markdown with YAML frontmatter metadata&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scripts/ (optional)&lt;/strong&gt; — code the agent can run&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;references/ (optional)&lt;/strong&gt; — policy documents, FAQs, reference material&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;assets/ (optional)&lt;/strong&gt; — templates, schemas, examples&lt;/p&gt;
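&lt;p&gt;A minimal SKILL.md might look like the sketch below. The frontmatter carries a name and a description the agent uses to decide when to load the skill; the policy details here are invented purely for illustration:&lt;/p&gt;

```markdown
---
name: leave-policy
description: How to answer employee leave questions using company policy.
---

# Leave Policy Skill

1. Look up the employee's region before quoting any entitlement.
2. Quote entitlements only from the document in references/.
3. Escalate to HR if the request involves unpaid leave over 30 days.
```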

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmdzycv3h2pbil003qlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmdzycv3h2pbil003qlb.png" alt=" " width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5n9dzbrsv8gvrvmfvwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5n9dzbrsv8gvrvmfvwi.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c6dbdrpx7t131myd6il.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c6dbdrpx7t131myd6il.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  All Three Working Together
&lt;/h2&gt;

&lt;p&gt;In a well-designed enterprise agent, MCP and Agent Skills each play their role, and neither replaces the other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Tool:&lt;/strong&gt; fetches the vendor contract document from SharePoint&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Skill:&lt;/strong&gt; applies liability cap rules, escalation logic, and policy references&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM:&lt;/strong&gt; reads both, produces the grounded compliance response&lt;/p&gt;

&lt;p&gt;Remove any one of the three and the agent breaks. The tool without the skill fetches the document but does not know what to look for. The skill without the tool knows the rules but cannot access the data. The LLM without either produces a plausible guess.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Need to connect to something? Use MCP. Need to teach the agent how to approach something? Use a Skill. Need a policy-grounded answer? You need both.&lt;/em&gt;&lt;/p&gt;
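&lt;p&gt;One way to see how the pieces combine: the skill text travels in the prompt, while the tool travels as a callable. A minimal sketch follows; the file layout and function names are assumptions for illustration, not any specific framework’s API:&lt;/p&gt;

```python
from pathlib import Path

def load_skill(skill_dir):
    """Read SKILL.md so its instructions can ride along in the prompt."""
    return Path(skill_dir, "SKILL.md").read_text()

def fetch_contract(vendor):
    """Stand-in for an MCP tool that would fetch a document from SharePoint."""
    return f"Contract text for {vendor}"

def build_prompt(skill_text, user_message):
    # The skill becomes system-level instructions; the LLM decides when to
    # call the tool while following those instructions.
    return "SYSTEM:\n" + skill_text + "\n\nUSER:\n" + user_message
```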

&lt;h2&gt;
  
  
  Scaling Agent Skills Across the Enterprise
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One skill in one department is a proof of concept&lt;/strong&gt;. The real value is a &lt;strong&gt;skills library&lt;/strong&gt;: a &lt;strong&gt;version-controlled repository&lt;/strong&gt; of organizational expertise that any agent can draw from, on any compatible platform, across every department.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw6ociqjvzwzk02p3km8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw6ociqjvzwzk02p3km8.png" alt=" " width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Any agent on any platform, granted read access to this repository, can load skills from it&lt;/strong&gt;. When a policy changes, one &lt;strong&gt;SKILL.md&lt;/strong&gt; update propagates to every agent on every platform simultaneously.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;One skills library. Every department&lt;/strong&gt;. Every platform. Updated in one place. Governed like code. This is the enterprise value proposition of Agent Skills.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6diheob0m79s88lyxz0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6diheob0m79s88lyxz0n.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building an Agent with Tools and Agent Skills
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Folder Structure&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5px1x5v7sm5lkcgkch2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5px1x5v7sm5lkcgkch2.png" alt=" " width="617" height="739"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Sample Tool
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Incident report status tool - implements incident-report skill.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IncidentInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident ID (e.g., INC-2025-001)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INC-2025-001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API latency spike in us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resolved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-gateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-16 14:30 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-16 15:45 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SRE On-Call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INC-2025-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database replica lag exceeding threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Investigating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics-db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-17 09:00 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Platform Team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INC-2025-003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CDN cache miss rate elevated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monitoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cdn-edge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-17 11:20 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Infrastructure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lookup incident - used by tool and scripts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident ID is required. Use format INC-2025-001.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found. Known demo incidents: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A - still active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;incident_report_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check incident report status. Use when user asks about outage status, P1/P2 incidents, or specific incident ID (e.g., INC-2025-001).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;validated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IncidentInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;lookup_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resolved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A - still active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: Active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Severity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Affected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Owner: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Started: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resolved&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Invalid incident ID format. Use INC-YYYY-NNN. Details: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
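The tool above validates its argument with an `IncidentInput` model that is not shown in this section. As a rough sketch of what such a model might look like, here is a minimal standard-library version; the class name and the INC-YYYY-NNN format come from the tool code, but the implementation details below (a plain class with a regex check rather than, say, a Pydantic model) are an assumption:

```python
import re

# Hypothetical stand-in for the IncidentInput model used by the tool above.
# Assumes incident IDs follow the INC-YYYY-NNN pattern named in the tool's
# error message (e.g., INC-2025-001); raises ValueError otherwise, which the
# tool's try/except turns into a friendly message.
_INCIDENT_ID_RE = re.compile(r"^INC-\d{4}-\d{3}$")


class IncidentInput:
    """Validates an incident ID of the form INC-YYYY-NNN."""

    def __init__(self, incident_id: str):
        incident_id = incident_id.strip()
        if not _INCIDENT_ID_RE.match(incident_id):
            raise ValueError(
                f"Invalid incident ID {incident_id!r}; expected INC-YYYY-NNN"
            )
        self.incident_id = incident_id
```

With this in place, `IncidentInput(incident_id="INC-2025-001")` succeeds while a malformed ID raises, matching the control flow in `incident_report_status`.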

&lt;h2&gt;
  
  
  SKILL.md
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xvrcr22jgcwgaokudpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xvrcr22jgcwgaokudpb.png" alt=" " width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lookup incident by ID. Usage: python lookup.py &amp;lt;incident_id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills.skills._tools.incident_report&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lookup_incident&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;lookup_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  agent.py
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;LangGraph ReAct agent with AgentSkills.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills.skills&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_agent_skills&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a ReAct agent with AgentSkills tools.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_agent_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the agent and return the final response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}]})&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent_interactive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the agent with visible execution: streams tool calls and results.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolMessage&lt;/span&gt;

    &lt;span class="n"&gt;final_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
        &lt;span class="n"&gt;stream_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
                            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  🔧 AgentSkills: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  💭 Agent: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;final_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ✓ Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_content&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  main.py
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the AgentSkills LangGraph agent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_agent_skills&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_agent_interactive&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills.skills.registry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_skill_catalog&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_print_skills&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Show available Agent Skills (agentskills.io format).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_skill_catalog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  Agent Skills (agentskills.io):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    • &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_agent_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  Agent Skills:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    • &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creating AgentSkills agent...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  AgentSkills – enterprise skills for the agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;_print_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Commands: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = list AgentSkills | &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = stop&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;EOFError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;_print_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- AgentSkills executing ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent_interactive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Agent ---&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Run the agent with sample query or prompt
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;(AgentSkills) sreenir@Seenivasas-MacBook-Pro AgentSkills % uv run python main.py&lt;br&gt;
Creating AgentSkills agent...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;==================================================&lt;/p&gt;

&lt;h1&gt;
  
  
    AgentSkills – enterprise skills for the agent
&lt;/h1&gt;

&lt;p&gt;Agent Skills (agentskills.io):&lt;br&gt;
    • contract-review: Check vendor contract review status in legal or procurement. Use when ...&lt;br&gt;
    • incident-report: Check incident and outage status for P1/P2 incidents. Use when the use...&lt;br&gt;
    • jira-ticket: Fetch Jira ticket details including status, assignee, and priority. Us...&lt;/p&gt;

&lt;p&gt;Commands: 'skills' = list AgentSkills | 'quit'/'exit' = stop&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; outage status&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- AgentSkills executing ---&lt;/strong&gt;&lt;br&gt;
  🔧 AgentSkills: incident_report_status({'incident_id': 'INC-2025-001'})&lt;br&gt;
  ✓ Result: Incident INC-2025-001: API latency spike in us-east-1&lt;br&gt;
Severity: P1 | Status: Resolved&lt;br&gt;
Affected: payment-gateway | Owner: SRE On-Call&lt;br&gt;
Started: 2025-03-16 14:30 UTC | Resolved: 2025-03-16 15:45 UTC&lt;br&gt;
  💭 Agent: The current outage status is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID&lt;/strong&gt;: INC-2025-001&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: API latency spike in us-east-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity&lt;/strong&gt;: P1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status&lt;/strong&gt;: Resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affected Service&lt;/strong&gt;: Payment Ga...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;--- Agent ---&lt;/strong&gt;&lt;br&gt;
The current outage status is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID&lt;/strong&gt;: INC-2025-001&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: API latency spike in us-east-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity&lt;/strong&gt;: P1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status&lt;/strong&gt;: Resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affected Service&lt;/strong&gt;: Payment Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Owner&lt;/strong&gt;: SRE On-Call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident Start Time&lt;/strong&gt;: March 16, 2025, 14:30 UTC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident Resolved Time&lt;/strong&gt;: March 16, 2025, 15:45 UTC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need more information, feel free to ask!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; RCA Requests&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- AgentSkills executing ---&lt;/strong&gt;&lt;br&gt;
  💭 Agent: It seems you're looking for information related to Root Cause Analysis (RCA) requests. However, I can't assist with creating or managing RCA requests directly. If you have a specific incident or issue...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- Agent ---&lt;/strong&gt;&lt;br&gt;
It seems you're looking for information related to Root Cause Analysis (RCA) requests. However, I can't assist with creating or managing RCA requests directly. If you have a specific incident or issue in mind, please provide details, and I can help check the status of that incident or provide related information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; any active outages&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- AgentSkills executing ---&lt;/strong&gt;&lt;br&gt;
  🔧 AgentSkills: incident_report_status({'incident_id': 'INC-2025-001'})&lt;br&gt;
  ✓ Result: Incident INC-2025-001: API latency spike in us-east-1&lt;br&gt;
Severity: P1 | Status: Resolved&lt;br&gt;
Affected: payment-gateway | Owner: SRE On-Call&lt;br&gt;
Started: 2025-03-16 14:30 UTC | Resolved: 2025-03-16 15:45 UTC&lt;br&gt;
  🔧 AgentSkills: incident_report_status({'incident_id': 'INC-2025-002'})&lt;br&gt;
  ✓ Result: Incident INC-2025-002: Database replica lag exceeding threshold&lt;br&gt;
Severity: P2 | Status: Investigating&lt;br&gt;
Affected: analytics-db | Owner: Platform Team&lt;br&gt;
Started: 2025-03-17 09:00 UTC | Status: Active&lt;br&gt;
  💭 Agent: Currently, there is one active outage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID:&lt;/strong&gt; INC-2025-002

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Database replica lag exceeding threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; P2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status:&lt;/strong&gt; Investigating&lt;/li&gt;
&lt;li&gt;**Affected ...&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;--- Agent ---&lt;/strong&gt;&lt;br&gt;
Currently, there is one active outage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID:&lt;/strong&gt; INC-2025-002

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Database replica lag exceeding threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; P2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status:&lt;/strong&gt; Investigating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affected Service:&lt;/strong&gt; analytics-db&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Owner:&lt;/strong&gt; Platform Team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Started:&lt;/strong&gt; March 17, 2025, 09:00 UTC&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The previous incident (INC-2025-001) regarding API latency spike has been resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The core challenge in enterprise AI is not raw capability&lt;/strong&gt;. Modern LLMs are already powerful enough for most business use cases. The real challenge is reliability: &lt;strong&gt;ensuring agents consistently follow enterprise rules, policies, and standards at scale&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is where Agent Skills and MCP complement each other:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; provides access to systems: APIs, databases, and external actions (capabilities).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Skills&lt;/strong&gt; define how those capabilities should be used: &lt;strong&gt;policies&lt;/strong&gt;, &lt;strong&gt;reasoning&lt;/strong&gt; steps, and &lt;strong&gt;constraints&lt;/strong&gt; (procedures).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Used together&lt;/strong&gt;, they close the gap between what an agent can do and what it should do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A tool without a skill leads to inconsistent&lt;/strong&gt;, &lt;strong&gt;guess-based&lt;/strong&gt; behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A skill without a tool produces correct reasoning but no execution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When combined, they create agents that are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable (consistent outputs)&lt;/li&gt;
&lt;li&gt;Precise (aligned with business rules)&lt;/li&gt;
&lt;li&gt;Auditable (traceable decisions and actions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;These are not optional qualities; they are baseline requirements for production enterprise systems&lt;/strong&gt;.&lt;/p&gt;
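&lt;p&gt;A minimal sketch of this pairing (every name below is hypothetical, not a real MCP or Agent Skills API): the tool carries the capability, the skill carries the policy, and every invocation is checked against that policy before it runs.&lt;/p&gt;

```python
# Hypothetical sketch of how a skill (procedure) gates a tool (capability).
# All names here are illustrative, not a real MCP or Agent Skills API.

def incident_report_status(incident_id):
    """Capability: a stubbed MCP-style tool that fetches incident data."""
    return f"Incident {incident_id}: API latency spike (Resolved)"

# The skill encodes policy: which tools may be used and under what constraints.
SKILL = {
    "name": "incident-report",
    "allowed_tools": {"incident_report_status"},
    "constraints": ["only report P1/P2 incidents", "cite tool output verbatim"],
}

def run_with_skill(skill, tool_name, tools, **kwargs):
    """Invoke a tool only if the skill's policy permits it (auditable)."""
    if tool_name not in skill["allowed_tools"]:
        raise PermissionError(f"{tool_name} is not permitted by {skill['name']}")
    return tools[tool_name](**kwargs)

tools = {"incident_report_status": incident_report_status}
print(run_with_skill(SKILL, "incident_report_status", tools,
                     incident_id="INC-2025-001"))
```

&lt;p&gt;The capability alone would answer any caller; the skill layer is what makes the behavior predictable and auditable.&lt;/p&gt;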

&lt;h2&gt;
  
  
  &lt;strong&gt;In One Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MCP gives agents reach. Agent Skills give them discipline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write your skills once.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Apply them everywhere.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Govern them like code.
&lt;/h2&gt;

&lt;p&gt;That is enterprise-grade AI.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
    </item>
    <item>
      <title>Understanding the Model Router in Microsoft Foundry</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Fri, 13 Mar 2026 04:44:29 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/understanding-the-model-router-in-microsoft-foundry-3hg</link>
      <guid>https://future.forem.com/sreeni5018/understanding-the-model-router-in-microsoft-foundry-3hg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As &lt;strong&gt;generative AI applications move&lt;/strong&gt; from &lt;strong&gt;prototypes&lt;/strong&gt; to &lt;strong&gt;production&lt;/strong&gt; systems, &lt;strong&gt;developers increasingly&lt;/strong&gt; face a new &lt;strong&gt;architectural&lt;/strong&gt; challenge: choosing the &lt;strong&gt;right model for each task&lt;/strong&gt;. &lt;strong&gt;Modern AI platforms now offer dozens or even hundreds of models&lt;/strong&gt; with different &lt;strong&gt;strengths&lt;/strong&gt;: some &lt;strong&gt;optimized&lt;/strong&gt; for &lt;strong&gt;reasoning&lt;/strong&gt;, others for &lt;strong&gt;speed&lt;/strong&gt;, &lt;strong&gt;cost&lt;/strong&gt;, or domain specialization. Selecting the best model dynamically becomes critical for both performance and cost efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft addresses this challenge through Model Router&lt;/strong&gt;, a capability &lt;strong&gt;within Microsoft Foundry&lt;/strong&gt;, its enterprise platform for building and operating AI applications.&lt;/p&gt;

&lt;p&gt;Before exploring &lt;strong&gt;how Model Router works&lt;/strong&gt;, it is useful to understand the platform it belongs to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Router: How AI Selects Models the Way We Choose Apartments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of Model Router in &lt;strong&gt;Microsoft Foundry like an apartment finder&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;searching&lt;/strong&gt; for an &lt;strong&gt;apartment&lt;/strong&gt;, you usually consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Budget&lt;/li&gt;
&lt;li&gt;Distance to work&lt;/li&gt;
&lt;li&gt;Amenities (gym, parking, pool)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You don’t manually evaluate every apartment&lt;/strong&gt;. &lt;strong&gt;The platform analyzes your preferences and recommends the best match&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Router works the same way for AI models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an application sends a prompt, the router evaluates factors such as cost, latency, and model capabilities, and then selects the most suitable model automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcyzk3ueodlr1lgeis5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcyzk3ueodlr1lgeis5w.png" alt=" " width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just as an &lt;strong&gt;apartment finder helps you pick the best place to live&lt;/strong&gt;, &lt;strong&gt;Model Router helps your application choose the best model to answer the prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Microsoft Foundry: The AI Application Platform
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Foundry is Microsoft’s unified platform&lt;/strong&gt; for &lt;strong&gt;building&lt;/strong&gt;, &lt;strong&gt;deploying&lt;/strong&gt;, and operating AI applications and intelligent &lt;strong&gt;agents on Azure&lt;/strong&gt;. It provides a centralized environment where developers can discover models, build AI powered applications, integrate enterprise data, and deploy systems with built in &lt;strong&gt;governance&lt;/strong&gt; and &lt;strong&gt;observability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The platform brings &lt;strong&gt;together several core capabilities&lt;/strong&gt; required for modern AI systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Catalog&lt;/strong&gt; for discovering and deploying foundation models&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent development tools&lt;/strong&gt; for building AI copilots and multi-step agent workflows&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise AI services&lt;/strong&gt; such as language, vision, speech, and document intelligence&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation and monitoring&lt;/strong&gt; for measuring AI quality and reliability&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security and governance&lt;/strong&gt; through Azure’s RBAC, networking, and policy controls&lt;/p&gt;

&lt;p&gt;In practice, Microsoft Foundry acts as the development and operational layer for enterprise AI applications, enabling teams to build systems that integrate models, tools, and data while maintaining enterprise grade reliability and security.&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;once multiple models become available within a platform&lt;/strong&gt;, another question arises:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which model should handle each request?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without a router, developers would need to implement custom logic such as:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;simple_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;use_small_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;coding_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;use_reasoning_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;use_general_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Maintaining such logic quickly becomes complex&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Model Router removes this burden by allowing the &lt;strong&gt;platform to learn the routing strategy automatically&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is where Model Router comes in.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Model Selection in Multi-Model Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In most AI applications, &lt;strong&gt;developers initially choose a single model&lt;/strong&gt;, for example a &lt;strong&gt;large reasoning model&lt;/strong&gt; such as a &lt;strong&gt;GPT-4-class&lt;/strong&gt; model. &lt;strong&gt;While this approach works, it often leads to inefficiencies:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple queries do not require a large reasoning model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High quality models may introduce unnecessary latency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large&lt;/strong&gt; models significantly increase &lt;strong&gt;operational&lt;/strong&gt; &lt;strong&gt;costs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As organizations adopt &lt;strong&gt;multi-model architectures&lt;/strong&gt;, &lt;strong&gt;manually&lt;/strong&gt; &lt;strong&gt;choosing&lt;/strong&gt; the correct model becomes increasingly complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developers would need to implement logic such as:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route simple queries to small models&lt;/li&gt;
&lt;li&gt;Route complex reasoning tasks to large models&lt;/li&gt;
&lt;li&gt;Route coding tasks to specialized models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maintaining this routing logic manually quickly becomes difficult to scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Router: Intelligent Model Selection
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Router in Microsoft Foundry&lt;/strong&gt; solves this problem by acting as an &lt;strong&gt;intelligent routing layer across multiple models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of developers &lt;strong&gt;explicitly selecting a model&lt;/strong&gt;, the &lt;strong&gt;router&lt;/strong&gt; &lt;strong&gt;evaluates&lt;/strong&gt; each &lt;strong&gt;request&lt;/strong&gt; and &lt;strong&gt;automatically&lt;/strong&gt; forwards it to the &lt;strong&gt;most appropriate model in a configured pool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From the developer’s perspective, the &lt;strong&gt;application interacts with a single endpoint&lt;/strong&gt;. &lt;strong&gt;Behind&lt;/strong&gt; the scenes, the &lt;strong&gt;router&lt;/strong&gt; performs model &lt;strong&gt;selection&lt;/strong&gt; &lt;strong&gt;dynamically&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The router analyzes characteristics of the incoming prompt, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt complexity&lt;/li&gt;
&lt;li&gt;Reasoning requirements&lt;/li&gt;
&lt;li&gt;Expected response quality&lt;/li&gt;
&lt;li&gt;Latency requirements&lt;/li&gt;
&lt;li&gt;Cost considerations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on this evaluation, the router selects the most suitable model for that request.&lt;/p&gt;

&lt;h2&gt;
  
  
  For example:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt; &lt;strong&gt;informational&lt;/strong&gt; queries may be routed to &lt;strong&gt;smaller&lt;/strong&gt;, &lt;strong&gt;faster models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex&lt;/strong&gt; &lt;strong&gt;reasoning&lt;/strong&gt; tasks may be routed to &lt;strong&gt;larger&lt;/strong&gt; reasoning models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding&lt;/strong&gt; prompts may be routed to &lt;strong&gt;specialized coding models&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;strong&gt;architecture&lt;/strong&gt; allows organizations to &lt;strong&gt;optimize&lt;/strong&gt; &lt;strong&gt;cost&lt;/strong&gt;, &lt;strong&gt;performance&lt;/strong&gt;, and &lt;strong&gt;response&lt;/strong&gt; quality simultaneously.&lt;/p&gt;
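&lt;p&gt;The routing examples above can be sketched as a toy classifier. The keyword rules and model names below are invented for illustration; the real Model Router uses a trained model, not hand-written rules.&lt;/p&gt;

```python
# Toy illustration of routing by prompt type. Model names are made up;
# the keyword heuristics stand in for the router's learned evaluation.

MODEL_POOL = {
    "simple": "small-fast-model",          # low cost, low latency
    "reasoning": "large-reasoning-model",  # deeper multi-step reasoning
    "coding": "specialized-coding-model",  # code generation and repair
}

def classify_prompt(prompt):
    text = prompt.lower()
    if any(k in text for k in ("code", "function", "python", "bug")):
        return "coding"
    if any(k in text for k in ("explain why", "prove", "step by step")):
        return "reasoning"
    return "simple"

def select_model(prompt):
    """Map a prompt to the most suitable model in the configured pool."""
    return MODEL_POOL[classify_prompt(prompt)]

print(select_model("What time zone is Seattle in?"))          # small-fast-model
print(select_model("Write a Python function to parse logs"))  # specialized-coding-model
```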

&lt;h2&gt;
  
  
  How Model Router Works
&lt;/h2&gt;

&lt;p&gt;At a high level, Model Router functions as a meta-model: a model trained to evaluate prompts and determine which underlying model should handle them.&lt;/p&gt;

&lt;p&gt;The routing process typically follows these steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Client Request&lt;/strong&gt;&lt;br&gt;
The application sends a prompt to the Model Router endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prompt Analysis&lt;/strong&gt;&lt;br&gt;
The router evaluates the prompt’s complexity and characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Model Selection&lt;/strong&gt;&lt;br&gt;
Based on the evaluation, the router selects the most appropriate model from the configured model pool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Request Forwarding&lt;/strong&gt;&lt;br&gt;
The router forwards the prompt to the selected model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Response Return&lt;/strong&gt;&lt;br&gt;
The response from the selected model is returned to the client through the same endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From the application’s perspective&lt;/strong&gt;, the entire interaction appears as a &lt;strong&gt;single model invocation&lt;/strong&gt;, &lt;strong&gt;even though different models may handle different requests&lt;/strong&gt;.&lt;/p&gt;
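&lt;p&gt;The five steps can be simulated end to end with stubbed models. Everything below (the backend names, the word-count heuristic) is hypothetical; it mirrors the shape of the flow, not the Foundry API.&lt;/p&gt;

```python
# Minimal simulation of the five routing steps, with stubbed backend models.

BACKENDS = {
    "small-fast-model": lambda p: f"[small] {p}",
    "large-reasoning-model": lambda p: f"[large] {p}",
}

def analyze(prompt):
    # Step 2: crude complexity estimate; prompts under 20 words count as
    # simple. Word count stands in for the router's real prompt analysis.
    if len(prompt.split()) in range(20):
        return "simple"
    return "complex"

def router_endpoint(prompt):
    # Step 1: the client request arrives at the single router endpoint.
    label = analyze(prompt)                       # Step 2: prompt analysis
    model = ("small-fast-model" if label == "simple"
             else "large-reasoning-model")        # Step 3: model selection
    response = BACKENDS[model](prompt)            # Step 4: request forwarding
    return {"model": model, "content": response}  # Step 5: response return

result = router_endpoint("What is RSS?")
print(result["model"], "answered:", result["content"])
```

&lt;p&gt;Note that the caller only ever invokes the router function; which backend actually answered is an internal detail surfaced in the response metadata.&lt;/p&gt;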

&lt;h2&gt;
  
  
  Deploying Model Router in Microsoft Foundry
&lt;/h2&gt;

&lt;p&gt;Deploying Model Router in Microsoft Foundry is designed to be straightforward.&lt;/p&gt;

&lt;p&gt;Developers create a router deployment that references a set of available models. The router then dynamically selects among those models during inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical deployment steps include:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Create a Foundry project in Azure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy16wmtme12oywxwzb4si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy16wmtme12oywxwzb4si.png" alt=" " width="790" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Select models from the Foundry model catalog&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7pi1vhw4qsbr9k1pzqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7pi1vhw4qsbr9k1pzqw.png" alt=" " width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Model Router deployment&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp00gwdv9q4ip3nrwm6wy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp00gwdv9q4ip3nrwm6wy.png" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure the routing model set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiar45mx3vkp3x0xx15vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiar45mx3vkp3x0xx15vg.png" alt=" " width="800" height="735"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt4uc948ahsd1febum73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt4uc948ahsd1febum73.png" alt=" " width="800" height="1150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the Model Router with different prompts&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7mdmrvjw379hnj2658r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7mdmrvjw379hnj2658r.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25rrd6621hbrgsubtnu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25rrd6621hbrgsubtnu0.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expose the router as a single API endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcasi7oukys1bos312myr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcasi7oukys1bos312myr.png" alt=" " width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications&lt;/strong&gt; then &lt;strong&gt;send prompts&lt;/strong&gt; to the &lt;strong&gt;router&lt;/strong&gt; endpoint instead of directly calling individual models.&lt;/p&gt;

&lt;p&gt;This architecture simplifies multi model systems while allowing the platform to optimize routing decisions automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Model Routers Matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As AI platforms continue to expand their model catalogs, multi-model architectures will become the norm. Model routers represent an important architectural shift:&lt;/p&gt;

&lt;p&gt;Instead of building applications around a single model, systems will be designed around dynamic model orchestration.&lt;/p&gt;
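&lt;p&gt;That orchestration idea can be sketched as a toy routing function. The model names, costs, and the keyword heuristic below are invented for illustration; a real platform router uses trained classifiers rather than keyword rules.&lt;/p&gt;

```python
# Illustrative only: a toy router that picks a model per prompt.
# Model names and the complexity heuristic are made up for this sketch.

MODELS = {
    "small-fast": {"cost_per_1k_tokens": 0.1},       # cheap, low latency
    "large-reasoning": {"cost_per_1k_tokens": 2.0},  # expensive, more capable
}

def route(prompt: str) -> str:
    """Pick the cheapest model expected to handle the prompt well."""
    # Crude heuristic: long or multi-step prompts go to the larger
    # model; everything else goes to the fast, cheap one.
    complex_markers = ("step by step", "analyze", "prove", "compare")
    is_complex = len(prompt) > 400 or any(
        marker in prompt.lower() for marker in complex_markers
    )
    return "large-reasoning" if is_complex else "small-fast"

print(route("What is the capital of France?"))      # small-fast
print(route("Analyze these results step by step"))  # large-reasoning
```

&lt;p&gt;Applications call one routing function (or, on a real platform, one endpoint) and never hard-code a model choice.&lt;/p&gt;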

&lt;h2&gt;
  
  
  The benefits include:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; by avoiding unnecessary use of large models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance improvements&lt;/strong&gt; through faster models for simpler tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher quality responses&lt;/strong&gt; through specialized model selection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler application architecture&lt;/strong&gt; through a single API interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this sense, the Model Router acts as a &lt;strong&gt;control layer for multi-model&lt;/strong&gt; AI systems, enabling developers to focus on application logic while the platform handles model selection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As &lt;strong&gt;AI systems evolve&lt;/strong&gt;, applications are &lt;strong&gt;no longer built around a single model&lt;/strong&gt;. Modern platforms like &lt;strong&gt;Microsoft Foundry&lt;/strong&gt; make it possible to work &lt;strong&gt;with multiple LLMs&lt;/strong&gt;, each optimized for different capabilities such as &lt;strong&gt;reasoning&lt;/strong&gt;, &lt;strong&gt;speed&lt;/strong&gt;, cost &lt;strong&gt;efficiency&lt;/strong&gt;, or &lt;strong&gt;specialized&lt;/strong&gt; tasks.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;Model Router&lt;/strong&gt; becomes an important architectural component. Instead of developers manually deciding which model should handle each request, the router evaluates the prompt and dynamically selects the most appropriate model based on factors like &lt;strong&gt;cost&lt;/strong&gt;, &lt;strong&gt;latency&lt;/strong&gt;, and model &lt;strong&gt;capabilities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just as an apartment search platform helps you find the best place to live by balancing &lt;strong&gt;budget&lt;/strong&gt;, &lt;strong&gt;distance&lt;/strong&gt;, and &lt;strong&gt;amenities&lt;/strong&gt;, the Model Router helps &lt;strong&gt;AI applications find the best model for every prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The result is a simpler architecture, better performance, and optimized cost, allowing developers to focus on building intelligent applications while the platform handles &lt;strong&gt;model selection behind the scenes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In many ways, &lt;strong&gt;the Model Router represents the future of multi-model AI systems&lt;/strong&gt;, where intelligent routing becomes just as important as the models themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Why Production AI Agents Are Hard &amp; How Amazon Bedrock AgentCore Makes Them Production Ready</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Thu, 12 Mar 2026 06:43:00 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/why-production-ai-agents-are-hard-how-amazon-bedrock-agentcore-makes-them-production-ready-1fpn</link>
      <guid>https://future.forem.com/sreeni5018/why-production-ai-agents-are-hard-how-amazon-bedrock-agentcore-makes-them-production-ready-1fpn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Over the &lt;strong&gt;past couple of years&lt;/strong&gt;, I have &lt;strong&gt;architected&lt;/strong&gt; and &lt;strong&gt;delivered&lt;/strong&gt; a significant number of &lt;strong&gt;agentic AI applications across enterprise environments&lt;/strong&gt;. Many of these deployments ran on &lt;strong&gt;Azure infrastructure&lt;/strong&gt; using &lt;strong&gt;Azure Web Apps&lt;/strong&gt; for lightweight agent endpoints and &lt;strong&gt;Azure Container Apps&lt;/strong&gt; for more sophisticated &lt;strong&gt;multi agent&lt;/strong&gt; systems that required &lt;strong&gt;orchestration&lt;/strong&gt;, &lt;strong&gt;scaling&lt;/strong&gt;, and reliable session routing.&lt;/p&gt;

&lt;p&gt;In building these systems, I have repeatedly implemented the underlying foundations myself: &lt;strong&gt;credential vaults&lt;/strong&gt;, &lt;strong&gt;memory pipelines&lt;/strong&gt;, &lt;strong&gt;observability layers&lt;/strong&gt;, and isolation mechanisms. After doing this enough times, you develop a clear understanding of both how long these pieces take to build and where the real production challenges tend to surface.&lt;/p&gt;

&lt;p&gt;When I first evaluated &lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt;, it was the first platform I encountered that appeared to &lt;strong&gt;address many of these challenges holistically&lt;/strong&gt;. Not just through surface level &lt;strong&gt;abstractions&lt;/strong&gt;, but with &lt;strong&gt;production&lt;/strong&gt; &lt;strong&gt;grade&lt;/strong&gt; &lt;strong&gt;depth&lt;/strong&gt; designed for real world &lt;strong&gt;deployments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That practical experience is the perspective I bring to this blog.&lt;/p&gt;

&lt;p&gt;Before we talk about &lt;strong&gt;AWS Bedrock AgentCore&lt;/strong&gt;, we need to answer a more fundamental question: &lt;strong&gt;what exactly is an AI agent&lt;/strong&gt;, and why is it so different from a regular &lt;strong&gt;chatbot&lt;/strong&gt; or &lt;strong&gt;API&lt;/strong&gt; call?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;“An AI agent is a software system that uses a large language model not just to generate text, but to reason, plan, take actions, and work toward a goal often across multiple steps, over time, with minimal human involvement.”&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ppy5zqjbkilf9ac8gzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ppy5zqjbkilf9ac8gzs.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most people encounter AI through a prompt response loop&lt;/strong&gt;: type something in, get something back. That model is useful, but it is fundamentally passive. The language model sits in a box, waits to be asked, generates text, and stops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An AI agent is something entirely different&lt;/strong&gt;. Think of a brilliant expert &lt;strong&gt;locked in a room with no tools&lt;/strong&gt;. They can give extraordinary advice but they cannot act on it. Give that same expert a &lt;strong&gt;phone&lt;/strong&gt;, a &lt;strong&gt;laptop&lt;/strong&gt;, &lt;strong&gt;access to databases&lt;/strong&gt;, the &lt;strong&gt;ability&lt;/strong&gt; to &lt;strong&gt;send emails&lt;/strong&gt;, &lt;strong&gt;run code&lt;/strong&gt;, and call APIs. They no longer just advise. They &lt;strong&gt;act&lt;/strong&gt;, &lt;strong&gt;verify&lt;/strong&gt;, &lt;strong&gt;execute&lt;/strong&gt;, and &lt;strong&gt;report&lt;/strong&gt; &lt;strong&gt;back&lt;/strong&gt;. That is the &lt;strong&gt;agentic paradigm&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Formally, an AI agent is a software system that uses a large language model not just to generate text, but to reason, plan, take actions, and work toward a goal — across multiple steps, over time, with minimal human direction.&lt;/p&gt;

&lt;p&gt;“An AI agent doesn’t just answer your question. It takes on your objective, &lt;strong&gt;plans a path to achieve&lt;/strong&gt; it, &lt;strong&gt;executes that plan&lt;/strong&gt;, monitors its own progress, and self corrects when things go wrong &lt;strong&gt;without you directing each step&lt;/strong&gt;.”&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ask an agent: “Find our top three open support tickets today&lt;/strong&gt;, check each against the known issues database, draft replies, and email them to the support team.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A plain language model cannot do this: it has no access to your ticketing system, knowledge base, or email infrastructure. An AI agent handles the entire workflow end to end.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Query the ticketing &lt;strong&gt;tool&lt;/strong&gt; for today’s open critical tickets&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Search the &lt;strong&gt;knowledge&lt;/strong&gt; &lt;strong&gt;base&lt;/strong&gt; for related known issues&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; &lt;strong&gt;Reason&lt;/strong&gt; about which tickets &lt;strong&gt;match which issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Draft personalized reply emails &lt;strong&gt;using the LLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; &lt;strong&gt;Send those emails via the email API&lt;/strong&gt; (this may be a tool or an MCP server)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The LLM is the reasoning engine&lt;/strong&gt;. The &lt;strong&gt;tools are how the agent reaches into real systems&lt;/strong&gt;. And it does not stop after one response; it pursues the objective through every &lt;strong&gt;step until the goal is met.&lt;/strong&gt;&lt;/p&gt;
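&lt;p&gt;The five steps above can be sketched as a plain Python workflow. Every tool function here is a hypothetical stand-in; a real agent would call a ticketing API, a knowledge base, an LLM, and an email service (or MCP servers) instead.&lt;/p&gt;

```python
# Hypothetical stand-ins for the agent's tools; none of these names
# come from a real API.

def get_open_critical_tickets():                 # Step 1: ticketing tool
    return [{"id": 101, "text": "login fails"},
            {"id": 102, "text": "search is slow"}]

def search_known_issues(ticket):                 # Step 2: knowledge base
    return {"issue": f"known issue matching: {ticket['text']}"}

def draft_reply(ticket, issue):                  # Step 4: stands in for an LLM call
    return f"Re ticket {ticket['id']}: {issue['issue']}"

def send_email(to, body):                        # Step 5: email tool / MCP server
    return {"to": to, "body": body, "sent": True}

def run_workflow():
    results = []
    for ticket in get_open_critical_tickets():
        issue = search_known_issues(ticket)      # Step 3 (matching) folded in
        reply = draft_reply(ticket, issue)
        results.append(send_email("support-team@example.com", reply))
    return results

print(all(r["sent"] for r in run_workflow()))    # True
```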

&lt;h2&gt;
  
  
  Agents Are Goal Driven
&lt;/h2&gt;

&lt;p&gt;The most &lt;strong&gt;critical characteristic of an AI agent&lt;/strong&gt;, and the one most often glossed over, is that it is &lt;strong&gt;goal driven&lt;/strong&gt;, not &lt;strong&gt;prompt driven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt driven systems (plain LLMs)&lt;/strong&gt; &lt;strong&gt;receive&lt;/strong&gt; an &lt;strong&gt;input&lt;/strong&gt; and produce an output. The interaction is complete. No awareness of a broader objective, &lt;strong&gt;no adaptation if the first attempt fails.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal driven systems (agents)&lt;/strong&gt; receive an objective and autonomously determine the steps, &lt;strong&gt;tool calls&lt;/strong&gt;, and decisions required to achieve it. &lt;strong&gt;They persist, adapt, retry, and self-correct until the goal is met&lt;/strong&gt;, or explicitly report that it cannot be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agentic Loop:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Observe, Think, Act, Repeat&lt;/strong&gt;&lt;br&gt;
The mechanics of goal driven behaviour are captured in &lt;strong&gt;the agentic loop: the cognitive cycle every agent&lt;/strong&gt; runs until its objective is achieved. &lt;strong&gt;Strands Agents, AWS’s own open source framework, describes this as its core architecture.&lt;/strong&gt; In each loop iteration the model is invoked with the prompt, &lt;strong&gt;agent context&lt;/strong&gt;, and available &lt;strong&gt;tools&lt;/strong&gt;, and it decides whether to respond in natural language, plan next steps, reflect on prior results, or select one or more tools to use. &lt;strong&gt;This loop continues until the task is complete&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Observe
&lt;/h2&gt;

&lt;p&gt;The agent reads &lt;strong&gt;its current goal and decomposed sub goals&lt;/strong&gt;. It reviews all results from prior steps. It retrieves relevant &lt;strong&gt;short term memory&lt;/strong&gt;. It incorporates new information from the environment since the last cycle.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Think
&lt;/h2&gt;

&lt;p&gt;The LLM reasons over accumulated context and available tools to determine the single best next action. &lt;strong&gt;It outputs either a tool call with exact parameters&lt;/strong&gt;, or, if the goal is satisfied, a final answer. Planning-capable models may first decompose the goal into an explicit sub task sequence.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Act
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If a tool call was selected&lt;/strong&gt;, the &lt;strong&gt;framework&lt;/strong&gt; &lt;strong&gt;executes&lt;/strong&gt; the &lt;strong&gt;real function&lt;/strong&gt;: calling an API, querying a &lt;strong&gt;database&lt;/strong&gt;, running code, navigating a browser, or invoking any registered tool. If a &lt;strong&gt;Human-In-The-Loop (HITL)&lt;/strong&gt; checkpoint is configured, the agent pauses and &lt;strong&gt;waits for approval&lt;/strong&gt;. The result is captured and fed back into context.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The action result becomes new input to the next Observe phase&lt;/strong&gt;. Is the goal achieved? If yes, the agent produces its final answer and terminates. If no, the loop continues, potentially dozens or hundreds of iterations for complex, long running tasks.&lt;/p&gt;
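&lt;p&gt;The four phases can be condensed into a few lines of Python. This is a sketch only: the "think" step here is a hard-coded policy, whereas in a real agent it is an LLM call that returns either a tool invocation or a final answer.&lt;/p&gt;

```python
def agentic_loop(goal, tools, max_iters=10):
    results = []                          # accumulated context across cycles
    for _ in range(max_iters):
        # Observe: prior results are already in `results`.
        # Think: pick the next tool, or finish (stand-in for LLM reasoning).
        done = {name for name, _ in results}
        pending = [name for name in tools if name not in done]
        if not pending:
            return f"Done: {goal} in {len(results)} steps"
        tool = pending[0]
        # Act: execute the tool and feed the result back into context.
        results.append((tool, tools[tool]()))
    return "Stopped: iteration budget exhausted"

tools = {
    "search": lambda: "3 critical tickets found",
    "draft": lambda: "replies drafted",
}
print(agentic_loop("triage tickets", tools))  # Done: triage tickets in 2 steps
```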
&lt;h2&gt;
  
  
  ReAct: The Reasoning Pattern Behind the Loop
&lt;/h2&gt;

&lt;p&gt;ReAct (&lt;strong&gt;Reason&lt;/strong&gt; + &lt;strong&gt;Act&lt;/strong&gt;) is the &lt;strong&gt;foundational&lt;/strong&gt; prompting pattern that implements the agentic loop. The agent produces a structured Thought describing its reasoning, followed by an Action (the chosen tool call), followed by an Observation incorporating the result. This interleaved reasoning action format makes goal pursuit &lt;strong&gt;explicit&lt;/strong&gt;, &lt;strong&gt;traceable&lt;/strong&gt;, and &lt;strong&gt;debuggable&lt;/strong&gt;. It is the pattern that AWS Strands Agents uses internally to power its &lt;strong&gt;model driven approach.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Find top 3 critical GitHub bugs and draft fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; I need to query GitHub for open issues labelled critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; github_search(repo='company/api', label='critical', state='open', limit=3)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; Returned issues #16, #5018, #18.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; I have the issues. Now I need each affected code file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; github_get_file(repo='company/api', path='src/auth.py')&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; File retrieved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; Enough context. Drafting fix for issue #16.&lt;/p&gt;

&lt;p&gt;… [loop continues for all 3 bugs] …&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Answer:&lt;/strong&gt; Three bugs with proposed patches attached.&lt;/p&gt;
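&lt;p&gt;Because each step is emitted in a fixed Thought/Action/Observation format, a trace like the one above is mechanically parseable, which is what makes ReAct traceable and debuggable. A minimal parser (the field names mirror the example; the format itself is a convention, not a fixed standard):&lt;/p&gt;

```python
trace = """Thought: I need to query GitHub for open issues labelled critical.
Action: github_search(repo='company/api', label='critical', state='open', limit=3)
Observation: Returned issues #16, #5018, #18."""

def parse_react(trace: str) -> list[tuple[str, str]]:
    """Split a ReAct trace into (kind, content) steps."""
    steps = []
    for line in trace.splitlines():
        kind, sep, content = line.partition(": ")
        if sep:                      # skip lines that are not labelled steps
            steps.append((kind, content))
    return steps

for kind, content in parse_react(trace):
    print(kind, "->", content[:40])
```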

&lt;p&gt;&lt;strong&gt;Self Correction&lt;/strong&gt; and &lt;strong&gt;Adaptive Planning&lt;/strong&gt;&lt;br&gt;
Goal driven agents are not &lt;strong&gt;fragile pipelines&lt;/strong&gt;. When an action fails (&lt;strong&gt;a tool returns an error&lt;/strong&gt;, a query finds nothing, a generated solution does not compile), the agent observes the failure, reasons about the &lt;strong&gt;cause&lt;/strong&gt;, and &lt;strong&gt;adapts&lt;/strong&gt;. A failed search triggers a reformulated query. An unavailable tool triggers a fallback. A complex goal gets decomposed into smaller, achievable sub goals. When the agent determines a goal truly cannot be achieved, &lt;strong&gt;it surfaces a clear explanation rather than silently returning empty results&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  HITL Human In The Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal driven does not mean unsupervised&lt;/strong&gt;. &lt;strong&gt;Production&lt;/strong&gt; agents are designed with explicit &lt;strong&gt;human in the loop&lt;/strong&gt; checkpoints: moments where the agent pauses, presents its proposed action, and waits for approval before taking any irreversible step such as sending emails, deleting records, initiating payments, or deploying code. AgentCore Runtime’s bi-directional WebSocket streaming makes these &lt;strong&gt;pause-and-resume&lt;/strong&gt; flows practical within long running sessions, enabling &lt;strong&gt;real-time human collaboration without terminating and restarting the session.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The 4 Pillars of Every Production AI Agent
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Pillar 1
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; How Agents &lt;strong&gt;Act&lt;/strong&gt; on the Real World&lt;br&gt;
&lt;strong&gt;Without&lt;/strong&gt; tools, a &lt;strong&gt;goal driven agent has nowhere to go&lt;/strong&gt;. Tools allow agents to reach beyond language generation into &lt;strong&gt;real business systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read tools retrieve information&lt;/strong&gt;: database queries, document reads, semantic search against knowledge bases, API calls to Salesforce, GitHub, Jira, Slack, and any other SaaS tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write tools create or modify data&lt;/strong&gt;: email senders, database writers, file generators, CRM updaters, ticket creators, calendar schedulers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execution tools run processes&lt;/strong&gt;: code interpreters, browser automation for web based applications that have no API, and shell command runners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The production challenge:&lt;/strong&gt; A prototype might hard code three tools. An enterprise deployment often &lt;strong&gt;needs fifty tools across ten SaaS platforms&lt;/strong&gt;, each with its &lt;strong&gt;own authentication scheme&lt;/strong&gt;, error patterns, and schema. Tool management becomes a major engineering project on its own.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pillar 2
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memory: How Agents Remember&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Language models (LLMs)&lt;/strong&gt; are &lt;strong&gt;stateless&lt;/strong&gt;. Every API call starts blank. For an agent serving the same user across weeks of ongoing work, statelessness is a fundamental blocker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short term memory covers the active session:&lt;/strong&gt; conversation history, &lt;strong&gt;task state&lt;/strong&gt;, intermediate tool results, and reasoning steps. It requires intelligent &lt;strong&gt;summarization&lt;/strong&gt; to manage the LLM’s context window limits without losing the critical thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long term memory persists across sessions.&lt;/strong&gt; &lt;strong&gt;User preferences&lt;/strong&gt;, past project outcomes, &lt;strong&gt;accumulated domain knowledge&lt;/strong&gt;, and learned patterns must survive session end and be retrievable in future sessions. This requires extraction logic, persistent storage, and semantic retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory is the most powerful form:&lt;/strong&gt; storing specific past experiences (what the agent tried, what worked, what failed, what the outcome was) so it can recall and apply successful strategies in future similar situations. This is the mechanism by which agents genuinely improve over time.&lt;/p&gt;
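&lt;p&gt;A toy version of the first two layers, for illustration only: a bounded short-term buffer that "summarizes" its oldest turns (here just replaced by a placeholder, where a real system would call an LLM), plus a long-term key-value store meant to survive session end.&lt;/p&gt;

```python
class AgentMemory:
    def __init__(self, short_term_limit=4):
        self.short_term = []        # active-session turns and tool results
        self.long_term = {}         # persisted across sessions in a real system
        self.limit = short_term_limit

    def add_turn(self, turn):
        self.short_term.append(turn)
        if len(self.short_term) > self.limit:
            # Stand-in for LLM summarization of the two oldest entries.
            summary = f"[summary of {self.short_term[0]!r} and {self.short_term[1]!r}]"
            self.short_term = [summary] + self.short_term[2:]

    def remember(self, key, value):
        self.long_term[key] = value

mem = AgentMemory()
for i in range(6):
    mem.add_turn(f"turn {i}")
mem.remember("preferred_language", "Python")
print(len(mem.short_term))                  # stays within the limit: 4
print(mem.long_term["preferred_language"])  # Python
```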
&lt;h2&gt;
  
  
  Pillar 3
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;: &lt;strong&gt;How Agents Are Understood and Governed&lt;/strong&gt;&lt;br&gt;
When an AI agent &lt;strong&gt;produces&lt;/strong&gt; a &lt;strong&gt;wrong&lt;/strong&gt; output &lt;strong&gt;after&lt;/strong&gt; twelve reasoning steps and seven &lt;strong&gt;tool&lt;/strong&gt; &lt;strong&gt;calls&lt;/strong&gt;, traditional logs tell you almost nothing useful. You cannot search for ‘sessions where the agent called the wrong tool’ in standard APM tools.&lt;/p&gt;

&lt;p&gt;“You cannot safely govern what you cannot observe. For AI agents in enterprise production, &lt;strong&gt;observability is not optional&lt;/strong&gt;; it is the difference between a system you can audit and a black box waiting to cause a compliance incident.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent native observability&lt;/strong&gt; must capture the full reasoning chain in &lt;strong&gt;step by step&lt;/strong&gt; order, every tool invocation with exact inputs and outputs, every LLM prompt and response with token counts, decision points where the agent chose between alternatives, failure attribution pinpointing which specific step caused a wrong downstream output, and token consumption per step for cost control. Without this, AI assisted decisions in regulated environments cannot be explained, investigated, or defended.&lt;/p&gt;
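&lt;p&gt;A minimal sketch of what such agent-native tracing records per step. The schema below is invented for illustration; real stacks emit OpenTelemetry-style spans carrying similar fields.&lt;/p&gt;

```python
import time

class AgentTrace:
    """Records every LLM call and tool call with inputs, outputs, tokens."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.steps = []

    def record(self, step_type, name, inputs, output, tokens=0):
        self.steps.append({
            "step": len(self.steps) + 1,  # step-by-step order for attribution
            "type": step_type,            # "llm" or "tool"
            "name": name,
            "inputs": inputs,             # exact inputs, for failure attribution
            "output": output,
            "tokens": tokens,             # per-step cost control
            "ts": time.time(),
        })

    def total_tokens(self):
        return sum(s["tokens"] for s in self.steps)

trace = AgentTrace("sess-1")
trace.record("llm", "plan", {"prompt": "triage tickets"}, "call search", tokens=120)
trace.record("tool", "search", {"label": "critical"}, "3 tickets", tokens=0)
print(trace.total_tokens(), len(trace.steps))  # 120 2
```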
&lt;h2&gt;
  
  
  Pillar 4
&lt;/h2&gt;

&lt;p&gt;MCP servers bridge the agent and external data sources, and &lt;strong&gt;MCP solves the M×N Integration Problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP: The Universal Connectivity Standard (USB-C)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For years, every team connecting agents to external services built bespoke adapters: custom code per tool, per framework, per model. This created the classic &lt;strong&gt;M×N integration problem&lt;/strong&gt;: if there are &lt;strong&gt;M agent&lt;/strong&gt; frameworks and &lt;strong&gt;N external services&lt;/strong&gt;, teams end up building &lt;strong&gt;M × N separate integrations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A LangChain Salesforce connector did not work with a &lt;strong&gt;Strands agent&lt;/strong&gt;. Every framework switch meant rewriting all integrations. As the number of models, frameworks, and enterprise systems grew, the integration burden multiplied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP the Model Context Protocol&lt;/strong&gt; is the open standard that ended this fragmentation. Published by &lt;strong&gt;Anthropic&lt;/strong&gt; in 2024 and now adopted across the industry by &lt;strong&gt;AWS&lt;/strong&gt;, &lt;strong&gt;Microsoft&lt;/strong&gt;, &lt;strong&gt;Google&lt;/strong&gt;, and others, MCP defines a universal language for &lt;strong&gt;agent-to-tool communication.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of building M × N bespoke connectors, developers can build one MCP server for a data source, and any MCP compatible agent regardless of framework or model can connect to it immediately. In effect, MCP transforms the integration landscape from M × N complexity to reusable connectivity, much &lt;strong&gt;like USB-C&lt;/strong&gt; standardized hardware connectivity across devices.&lt;/p&gt;
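&lt;p&gt;The arithmetic behind that claim is simple enough to state directly: bespoke adapters scale multiplicatively with frameworks and services, while a shared protocol scales additively (one MCP client per framework, one MCP server per service).&lt;/p&gt;

```python
def bespoke_integrations(frameworks: int, services: int) -> int:
    # One custom adapter per (framework, service) pair.
    return frameworks * services

def mcp_integrations(frameworks: int, services: int) -> int:
    # One MCP client per framework plus one MCP server per service.
    return frameworks + services

print(bespoke_integrations(5, 20))  # 100 adapters to build and maintain
print(mcp_integrations(5, 20))      # 25 components total
```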

&lt;p&gt;&lt;strong&gt;The MCP architecture is built around three roles:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Host&lt;/strong&gt; —the agent framework that initiates connections and sends tool requests&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Server&lt;/strong&gt; —the lightweight connector process wrapping an external service&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Resources and Tools&lt;/strong&gt; —the capabilities exposed: actions the agent can invoke, data sources it can read, and prompt templates it can use&lt;/p&gt;

&lt;p&gt;By introducing a standard protocol layer, MCP removes the need to repeatedly rebuild integrations and enables true interoperability across agent frameworks, models, and enterprise systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Production Gap: Why Building Enterprise AI Agents Is Mostly an Infrastructure Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Across nearly every enterprise agent project&lt;/strong&gt;, the &lt;strong&gt;same pattern appears&lt;/strong&gt;. Before the agent logic can even be written, &lt;strong&gt;engineering&lt;/strong&gt; teams must build a large amount of &lt;strong&gt;supporting&lt;/strong&gt; &lt;strong&gt;infrastructure&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session routing&lt;/li&gt;
&lt;li&gt;Credential vaults&lt;/li&gt;
&lt;li&gt;Memory extraction pipelines&lt;/li&gt;
&lt;li&gt;Observability wiring&lt;/li&gt;
&lt;li&gt;Multi tenant context isolation&lt;/li&gt;
&lt;li&gt;Policy enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, a &lt;strong&gt;substantial portion of early development effort&lt;/strong&gt; goes into these foundations &lt;strong&gt;before the agent’s intelligence is implemented.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through the key engineering challenges that create this gap.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 1: Infrastructure for Long Running Stateful Sessions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional&lt;/strong&gt; &lt;strong&gt;serverless&lt;/strong&gt; platforms are designed for &lt;strong&gt;short&lt;/strong&gt; &lt;strong&gt;lived&lt;/strong&gt;, &lt;strong&gt;stateless&lt;/strong&gt; workloads.&lt;/p&gt;

&lt;p&gt;Agents behave very differently.&lt;/p&gt;

&lt;p&gt;They often require long running, stateful execution environments that maintain context across many tool calls and reasoning steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supporting this requires infrastructure for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session routing&lt;/li&gt;
&lt;li&gt;Per user state management&lt;/li&gt;
&lt;li&gt;Lifecycle management&lt;/li&gt;
&lt;li&gt;Dynamic scaling of execution environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Constructing this infrastructure on top of general purpose compute platforms can become a significant engineering effort before any agent logic is written.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 2: Security Isolation at Scale
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enterprise agents frequently process sensitive user data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;thousands of users&lt;/strong&gt; run &lt;strong&gt;concurrent&lt;/strong&gt; &lt;strong&gt;sessions&lt;/strong&gt;, strong isolation between sessions becomes critical. Without proper &lt;strong&gt;safeguards&lt;/strong&gt;, a defect could potentially expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One user’s data to another user&lt;/li&gt;
&lt;li&gt;Information across tenants&lt;/li&gt;
&lt;li&gt;Privileged credentials or tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Achieving secure isolation at scale requires carefully designed execution environments, container isolation, and strict identity boundaries, rather than relying solely on application level safeguards.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 3: Identity, OAuth, and Credential Management
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agents rarely operate in isolation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They interact with external services on behalf of users, which introduces the need to manage authentication and authorization flows such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth consent processes&lt;/li&gt;
&lt;li&gt;Secure token storage&lt;/li&gt;
&lt;li&gt;Automatic token refresh&lt;/li&gt;
&lt;li&gt;Fine grained permission enforcement&lt;/li&gt;
&lt;li&gt;Audit trails for every access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When agents integrate &lt;strong&gt;with multiple SaaS platforms across thousands of users&lt;/strong&gt;, credential management becomes a full platform capability, not just a small feature.&lt;/p&gt;
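&lt;p&gt;The token-lifecycle part of that capability can be sketched in a few lines. This is illustrative only, not a real OAuth implementation: tokens are keyed by (user, service), checked for expiry on every access, and refreshed through a caller-supplied function.&lt;/p&gt;

```python
import time

class TokenVault:
    def __init__(self, refresh_fn):
        self.refresh_fn = refresh_fn  # stand-in for a real OAuth refresh flow
        self.tokens = {}              # (user, service) -> (token, expires_at)

    def store(self, user, service, token, ttl_seconds):
        self.tokens[(user, service)] = (token, time.time() + ttl_seconds)

    def get(self, user, service):
        token, expires_at = self.tokens[(user, service)]
        if time.time() >= expires_at:           # expired: refresh and re-store
            token, ttl = self.refresh_fn(user, service)
            self.store(user, service, token, ttl)
        return token

vault = TokenVault(refresh_fn=lambda user, service: (f"new-{service}-token", 3600))
vault.store("alice", "crm", "old-token", ttl_seconds=-1)  # already expired
print(vault.get("alice", "crm"))  # new-crm-token
```

&lt;p&gt;A production vault adds encryption at rest, per-scope permissions, and an audit record for every access, which is exactly why this grows into a platform capability.&lt;/p&gt;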
&lt;h2&gt;
  
  
  Problem 4: Memory Infrastructure
&lt;/h2&gt;

&lt;p&gt;Agents depend heavily on memory systems to function effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maintaining conversation context across long interactions often requires summarization pipelines that compress earlier dialogue while preserving meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Persistent knowledge typically involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Information extraction pipelines&lt;/li&gt;
&lt;li&gt;Vector storage&lt;/li&gt;
&lt;li&gt;Semantic retrieval&lt;/li&gt;
&lt;li&gt;Mechanisms to reconcile new information with existing knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these components introduces potential failure modes that can gradually degrade agent behaviour if not carefully managed, particularly in multi-tenant environments.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 5: Observability for Agent Reasoning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional monitoring tools measure metrics such as:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;Throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But production AI agents require deeper visibility. Engineers often need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which reasoning step produced an incorrect output&lt;/li&gt;
&lt;li&gt;Which tool call returned unexpected data&lt;/li&gt;
&lt;li&gt;Why the agent chose a particular decision path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Achieving this level of visibility requires trace level instrumentation, structured logs, and AI aware observability dashboards.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 6: Policy Enforcement Outside the Agent
&lt;/h2&gt;

&lt;p&gt;Early agent systems often embed governance rules directly inside prompts.&lt;/p&gt;

&lt;p&gt;This approach is fragile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A carefully crafted user input can sometimes&lt;/strong&gt; influence the agent to ignore or reinterpret its own instructions.&lt;/p&gt;

&lt;p&gt;Production systems therefore require external policy enforcement layers that evaluate permissions and constraints independently of the agent’s reasoning process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This ensures governance cannot be bypassed&lt;/strong&gt;.&lt;/p&gt;
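&lt;p&gt;A sketch of the idea: the policy table and enforcement function live outside the agent and wrap every tool call, so no prompt can talk the model past them. Tool names and rules here are invented for illustration.&lt;/p&gt;

```python
POLICY = {
    "search_tickets": {"allowed": True, "needs_approval": False},
    "send_email":     {"allowed": True, "needs_approval": True},
    "delete_records": {"allowed": False},
}

def enforce(tool_name: str, approved: bool = False) -> str:
    """Evaluated outside the agent's reasoning, before any tool runs."""
    rule = POLICY.get(tool_name, {"allowed": False})  # default deny
    if not rule["allowed"]:
        return "denied"
    if rule.get("needs_approval") and not approved:
        return "pending_approval"  # pause at a human-in-the-loop checkpoint
    return "allowed"

print(enforce("search_tickets"))                 # allowed
print(enforce("send_email"))                     # pending_approval
print(enforce("delete_records", approved=True))  # denied: approval cannot override
```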
&lt;h2&gt;
  
  
  Problem 7: Multi Agent Coordination
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real enterprise workflows rarely rely on a single agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead, they often involve multiple specialized agents working together. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;research&lt;/strong&gt; agent to gather information&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;writing&lt;/strong&gt; agent to generate responses&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;verification&lt;/strong&gt; agent to validate outputs&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;approval&lt;/strong&gt; agent to enforce governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supporting these workflows requires infrastructure for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inter agent communication&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared state management&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflow orchestration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Failure handling and retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This coordination layer introduces yet another architectural component to an already complex system.&lt;/p&gt;
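&lt;p&gt;A minimal shape for that coordination layer, illustrative only: each "agent" is a function over shared state, the orchestrator runs them in sequence, and a failed step is retried before the workflow surfaces the error.&lt;/p&gt;

```python
def research(state):
    state["facts"] = ["fact A", "fact B"]  # gather information
    return state

def write(state):
    state["draft"] = f"Report on {len(state['facts'])} facts"  # generate response
    return state

def verify(state):
    if "draft" not in state:
        raise ValueError("nothing to verify")  # validation step
    state["verified"] = True
    return state

def run_pipeline(agents, state, max_retries=1):
    for agent in agents:                       # shared state flows through
        for attempt in range(max_retries + 1):
            try:
                state = agent(state)
                break
            except Exception:
                if attempt == max_retries:
                    raise                      # surface the failure, don't swallow it
    return state

result = run_pipeline([research, write, verify], {})
print(result["verified"], result["draft"])  # True Report on 2 facts
```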
&lt;h2&gt;
  
  
  Introducing Amazon Bedrock AgentCore
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wrhd6iv1c81yw79us39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wrhd6iv1c81yw79us39.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt; is an &lt;strong&gt;agentic platform&lt;/strong&gt; from &lt;strong&gt;AWS&lt;/strong&gt; designed to &lt;strong&gt;build&lt;/strong&gt;, &lt;strong&gt;deploy&lt;/strong&gt;, and &lt;strong&gt;operate AI agents securely at scale&lt;/strong&gt;. It &lt;strong&gt;provides&lt;/strong&gt; a set of &lt;strong&gt;modular&lt;/strong&gt;, &lt;strong&gt;enterprise grade services&lt;/strong&gt; that handle the infrastructure required to run &lt;strong&gt;production&lt;/strong&gt; &lt;strong&gt;grade&lt;/strong&gt; AI agents without developers having to manage the underlying systems.&lt;/p&gt;

&lt;p&gt;In real world deployments, building an agent is only a small part of the challenge. &lt;strong&gt;Production systems must manage runtime execution, memory, tool connectivity, identity, security, and observability before agents can reliably interact with enterprise data and services&lt;/strong&gt;. These infrastructure concerns often become the primary barrier to moving from prototype agents to production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt; addresses this challenge by providing fully managed services that remove the undifferentiated heavy lifting of building agent infrastructure. Developers can focus on implementing the agent’s reasoning and workflows while AgentCore manages the operational backbone required to run agents reliably in enterprise environments.&lt;/p&gt;

&lt;p&gt;AgentCore services are modular and composable, meaning they can be used together or independently depending on the architecture of the system. The platform is also &lt;strong&gt;framework agnostic&lt;/strong&gt; and &lt;strong&gt;model agnostic&lt;/strong&gt;, supporting &lt;strong&gt;popular open source agent frameworks such as LangGraph, CrewAI, LlamaIndex, and Strands Agents&lt;/strong&gt;, and it can work with foundation models from &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; or &lt;strong&gt;external providers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At a high level, AgentCore provides capabilities such as:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Runtime:&lt;/strong&gt; A secure serverless environment for running agents and tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Memory:&lt;/strong&gt; Managed &lt;strong&gt;short term&lt;/strong&gt; and &lt;strong&gt;long term memory&lt;/strong&gt; for context aware agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Gateway:&lt;/strong&gt; A service that converts APIs and services into MCP-compatible tools for agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Identity:&lt;/strong&gt; Identity and access management designed specifically for AI agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built in tools and observability:&lt;/strong&gt; Including code execution, browser automation, monitoring, and evaluation capabilities&lt;/p&gt;

&lt;p&gt;Together, these services form a production infrastructure layer for agentic systems, allowing teams to deploy AI agents that are secure, scalable, observable, and capable of interacting with real enterprise systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  AgentCore Runtime
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Runtime&lt;/strong&gt; is the &lt;strong&gt;secure&lt;/strong&gt;, &lt;strong&gt;serverless&lt;/strong&gt; execution &lt;strong&gt;environment for AI agents&lt;/strong&gt;. Each user session runs inside a &lt;strong&gt;dedicated&lt;/strong&gt;, hardware isolated &lt;strong&gt;microVM&lt;/strong&gt;, providing strong isolation of CPU, memory, and filesystem resources.&lt;/p&gt;

&lt;p&gt;Isolation is enforced at the virtualization layer, ensuring one user’s agent cannot access another user’s data. &lt;strong&gt;When a session ends, whether through 15 minutes of inactivity, user termination, or the 8 hour maximum session limit, the microVM is destroyed and memory is fully sanitized, preventing cross session data leakage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework Compatibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Runtime is framework agnostic and works with common agent frameworks such as:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strands Agents (AWS)&lt;/li&gt;
&lt;li&gt;LangChain / LangGraph&lt;/li&gt;
&lt;li&gt;LlamaIndex&lt;/li&gt;
&lt;li&gt;Microsoft Agent Framework (Autogen + Semantic Kernel)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can also host any custom agent implementation that runs inside a container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimal Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Existing agents can be deployed with a small wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bedrock_agentcore.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreApp&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;your_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;agentcore configure&lt;br&gt;
agentcore deploy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore is model agnostic and works with major foundation models including&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Nova&lt;/li&gt;
&lt;li&gt;Anthropic Claude&lt;/li&gt;
&lt;li&gt;OpenAI GPT&lt;/li&gt;
&lt;li&gt;Google Gemini&lt;/li&gt;
&lt;li&gt;Meta Llama&lt;/li&gt;
&lt;li&gt;Mistral&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your agent chooses the model; AgentCore only provides the execution environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AgentCore supports &lt;strong&gt;two interaction modes&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP API:&lt;/strong&gt; standard request/response execution&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bi directional WebSocket streaming:&lt;/strong&gt; real-time conversational and multi-turn agents&lt;/p&gt;

&lt;p&gt;Using a sessionId keeps requests routed to the same microVM session, preserving state.&lt;/p&gt;
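&lt;p&gt;The routing behavior can be illustrated with a small Python sketch. This is a toy model, not the AgentCore API: each sessionId maps to its own isolated session state, standing in for a dedicated microVM.&lt;/p&gt;

```python
# Illustrative sketch of sessionId-based routing (not the AgentCore API):
# each sessionId maps to its own isolated session state, mimicking a
# dedicated microVM per session.

class SessionRouter:
    def __init__(self):
        self._sessions = {}  # sessionId -> per-session state

    def invoke(self, session_id, prompt):
        # Route to the existing session, or create a fresh isolated one.
        state = self._sessions.setdefault(session_id, {"history": []})
        state["history"].append(prompt)
        return {"session": session_id, "turns": len(state["history"])}

    def end_session(self, session_id):
        # When the session ends, its state is destroyed, as a microVM would be.
        self._sessions.pop(session_id, None)

router = SessionRouter()
router.invoke("sess-1", "hello")
result = router.invoke("sess-1", "follow-up")   # same sessionId: state preserved
other = router.invoke("sess-2", "hi")           # different sessionId: fresh state
```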

&lt;p&gt;&lt;strong&gt;Strands Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strands Agents is &lt;strong&gt;AWS’s open source agent framework&lt;/strong&gt; designed &lt;strong&gt;around a model first approach&lt;/strong&gt;. A Strands agent is defined by three elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model&lt;/li&gt;
&lt;li&gt;Tools&lt;/li&gt;
&lt;li&gt;Prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model drives planning and tool usage. Strands agents deploy to AgentCore Runtime using the same lightweight SDK wrapper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Options
&lt;/h2&gt;

&lt;p&gt;AgentCore supports &lt;strong&gt;two deployment&lt;/strong&gt; paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct code upload&lt;/strong&gt;&lt;br&gt;
AgentCore automatically builds the container and deploys the agent — no Dockerfile required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container deployment&lt;/strong&gt;&lt;br&gt;
Provides full control over runtime dependencies and system configuration.&lt;/p&gt;

&lt;p&gt;Both use the same lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agentcore configure&lt;/li&gt;
&lt;li&gt;agentcore deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployments are immutable and versioned, allowing multiple versions and canary testing before traffic promotion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1li4m7i3wpsi3qda9uvv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1li4m7i3wpsi3qda9uvv.png" alt=" " width="800" height="1166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Gateway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Gateway converts existing APIs, AWS Lambda functions, and OpenAPI specifications into agent ready MCP tools automatically without writing custom adapters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From API to Agent Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Point Gateway to a Lambda function or OpenAPI specification and it automatically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generates the MCP tool schema&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handles protocol translation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exposes the API as a discoverable agent tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What previously required weeks of custom integration can now be done in minutes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;agentcore gateway create \&lt;br&gt;
  --name "crm-tools" \&lt;br&gt;
  --lambda-arn "arn:aws:lambda:us-east-1:123:function:crm-api" \&lt;br&gt;
  --protocol MCP&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once registered, any MCP compatible agent can discover and invoke the tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Native Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gateway is built around the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;. Registered tools become automatically usable by &lt;strong&gt;MCP compatible frameworks&lt;/strong&gt; such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strands&lt;/li&gt;
&lt;li&gt;LangGraph&lt;/li&gt;
&lt;li&gt;CrewAI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents can dynamically discover tools at runtime rather than requiring tools to be hardcoded during initialization.&lt;/p&gt;
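&lt;p&gt;The idea can be sketched with a minimal tool registry in Python. The names here are illustrative, not the Gateway API: tools are registered once, and an agent discovers them by querying the registry at runtime instead of hardcoding them at startup.&lt;/p&gt;

```python
# Illustrative sketch of runtime tool discovery (not the Gateway API):
# registered tools become discoverable by name and schema, rather than
# being wired into the agent during initialization.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, schema, fn):
        self._tools[name] = {"schema": schema, "fn": fn}

    def discover(self):
        # An agent lists the available tools at runtime.
        return {name: tool["schema"] for name, tool in self._tools.items()}

    def invoke(self, name, **kwargs):
        return self._tools[name]["fn"](**kwargs)

registry = ToolRegistry()
registry.register(
    "get_customer",                                  # hypothetical tool name
    {"input": {"customer_id": "string"}},
    lambda customer_id: {"id": customer_id, "name": "Acme Corp"},
)

available = registry.discover()                      # agent sees the tool
record = registry.invoke("get_customer", customer_id="c-1")
```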

&lt;p&gt;&lt;strong&gt;SaaS Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gateway provides built in connectors for common enterprise platforms such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub&lt;/li&gt;
&lt;li&gt;Salesforce&lt;/li&gt;
&lt;li&gt;Slack&lt;/li&gt;
&lt;li&gt;Google Workspace&lt;/li&gt;
&lt;li&gt;Microsoft 365&lt;/li&gt;
&lt;li&gt;Jira / Confluence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These connectors handle authentication, schema generation, and error handling automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent-to-Agent Communication (A2A)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gateway also supports the Agent2Agent (A2A) protocol&lt;/strong&gt;, which standardizes how agents communicate with each other.&lt;/p&gt;

&lt;p&gt;Agents built using different frameworks can delegate tasks across systems while communicating through standardized A2A messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Identity
&lt;/h2&gt;

&lt;p&gt;AgentCore Identity manages authentication and credential delegation for AI agents accessing external systems.&lt;/p&gt;

&lt;p&gt;It controls both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who can invoke the agent&lt;/li&gt;
&lt;li&gt;How the agent authenticates to external services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supported authentication mechanisms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS IAM SigV4 for internal services&lt;/li&gt;
&lt;li&gt;OAuth 2.0 and OpenID Connect for external users and applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compatible identity providers include Amazon Cognito, Okta, Microsoft Entra ID, and Auth0.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Machine-to-Machine Access (2LO)
&lt;/h2&gt;

&lt;p&gt;For system-level tasks, agents authenticate using OAuth Client Credentials without a user involved.&lt;/p&gt;

&lt;p&gt;Common scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled workflows&lt;/li&gt;
&lt;li&gt;Background analytics&lt;/li&gt;
&lt;li&gt;System integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  User Delegated Access (3LO)
&lt;/h2&gt;

&lt;p&gt;When agents act on behalf of a user, AgentCore manages the full OAuth lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User consent flow&lt;/li&gt;
&lt;li&gt;Encrypted token storage&lt;/li&gt;
&lt;li&gt;Token refresh&lt;/li&gt;
&lt;li&gt;Access auditing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All credentials are stored in an encrypted vault protected by customer managed KMS keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Memory
&lt;/h2&gt;

&lt;p&gt;AgentCore Memory provides built in memory management for agents without requiring developers to build custom vector pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It supports three types of memory:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintains session context&lt;/strong&gt;, including conversation history, tool outputs, and reasoning state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stores extracted knowledge such as &lt;strong&gt;user preferences&lt;/strong&gt;, decisions, and discovered facts so future sessions begin with relevant context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stores past experiences: what actions were attempted and which strategies succeeded, &lt;strong&gt;enabling agents to improve behavior over time&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Browser
&lt;/h2&gt;

&lt;p&gt;Some enterprise systems can only be accessed through a web interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Browser provides isolated browser instances&lt;/strong&gt; that agents can use to interact with websites and web applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents can:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate multi step workflows&lt;/li&gt;
&lt;li&gt;Fill forms&lt;/li&gt;
&lt;li&gt;Extract information from dynamic pages&lt;/li&gt;
&lt;li&gt;Interact with internal portals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each session runs in a sandboxed browser environment, which is destroyed when the session ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Code Interpreter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When agents generate code for analysis or computation, that code must execute safely&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AgentCore Code Interpreter provides an isolated execution sandbox where generated code can run securely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents can use it to:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze datasets&lt;/li&gt;
&lt;li&gt;Run calculations&lt;/li&gt;
&lt;li&gt;Generate charts and files&lt;/li&gt;
&lt;li&gt;Validate generated code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each execution occurs in a separate ephemeral sandbox with no access to other sessions or infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Platform for the Production Agent Era&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Having architected agentic systems across &lt;strong&gt;Azure Web Apps, Azure Container Apps, and custom infrastructure&lt;/strong&gt;, I know how much engineering effort goes into the layers that production agents require.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session routing, credential management, memory pipelines, observability, governance policies&lt;/strong&gt;, and multi tenant isolation are all necessary pieces of a reliable agent system. None of them is impossible to build, but they consume time that should be spent improving the reasoning, behavior, and usefulness of the agent itself.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;This is the problem Amazon Bedrock AgentCore is designed to solve.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AgentCore provides &lt;strong&gt;7 purpose built services&lt;/strong&gt; that handle the production infrastructure required for agent systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime:&lt;/strong&gt; Secure microVM execution for agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway:&lt;/strong&gt; MCP-native tool integration and API exposure&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity:&lt;/strong&gt; OAuth credential lifecycle and delegated access&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory:&lt;/strong&gt; Short term and long term persistent memory for agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser:&lt;/strong&gt; Managed browser automation for web interactions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Interpreter:&lt;/strong&gt; Isolated sandbox for executing generated code&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability:&lt;/strong&gt; CloudWatch native tracing with OpenTelemetry support&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore is framework agnostic&lt;/strong&gt; and works with common agent frameworks such as Strands, LangChain, LangGraph, LlamaIndex, CrewAI, and AutoGen, as well as custom implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is also model agnostic&lt;/strong&gt;, allowing agents to use foundation models including Amazon Nova, Anthropic Claude, OpenAI GPT models, Google Gemini, Meta Llama, and Mistral, or any model accessible through an API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question is no longer whether a production AI agent can be built&lt;/strong&gt;.&lt;br&gt;
With AgentCore, the real question becomes what agent you want to build and &lt;strong&gt;how quickly you can deliver it to the people who need it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;pip install bedrock-agentcore bedrock-agentcore-starter-toolkit&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Semantic Caching in RAG Systems &amp; AI Agents</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 09 Mar 2026 06:39:33 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/semantic-caching-in-rag-systems-ai-agents-2gal</link>
      <guid>https://future.forem.com/sreeni5018/semantic-caching-in-rag-systems-ai-agents-2gal</guid>
      <description>&lt;h2&gt;
  
  
  What Is Caching?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt; is the &lt;strong&gt;practice of storing&lt;/strong&gt; the &lt;strong&gt;result&lt;/strong&gt; of an &lt;strong&gt;expensive&lt;/strong&gt; &lt;strong&gt;operation&lt;/strong&gt; so that future requests for the same result &lt;strong&gt;can be served instantly&lt;/strong&gt; without repeating the work.&lt;/p&gt;

&lt;p&gt;The concept is foundational in computing. &lt;strong&gt;A web browser caches images so pages load faster&lt;/strong&gt;. &lt;strong&gt;A database caches query results so it does not re read the disk&lt;/strong&gt;. &lt;strong&gt;A CDN caches static files close to the user&lt;/strong&gt;. In every case the principle is the same: compute once, reuse many times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87lko9sgpx7k4fnrujzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87lko9sgpx7k4fnrujzm.png" alt=" " width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional caches work on exact matches&lt;/strong&gt;. The key is the exact input string or request. “&lt;em&gt;What is the capital of France?&lt;/em&gt;” and “&lt;strong&gt;What’s the capital of France?&lt;/strong&gt;” are &lt;strong&gt;different keys&lt;/strong&gt; the cache &lt;strong&gt;misses on the second even though the answer is identical&lt;/strong&gt;. This works fine for static web assets, but it falls apart the moment users express the same intent in different words.&lt;/p&gt;
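&lt;p&gt;A minimal sketch makes the limitation concrete: an exact-match cache keyed on the raw query string hits only on byte identical input.&lt;/p&gt;

```python
# A traditional exact-match cache keyed on the raw query string.
cache = {}

def lookup(query):
    return cache.get(query)

cache["What is the capital of France?"] = "Paris"

hit = lookup("What is the capital of France?")    # identical key -> hit
miss = lookup("What's the capital of France?")    # same meaning, different key -> miss
```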

&lt;h2&gt;
  
  
  What Is Semantic Caching?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching replaces the exact string key&lt;/strong&gt; with a &lt;strong&gt;meaning based key&lt;/strong&gt;. Instead of asking “is this input identical to a stored input?”, it asks “is this input similar enough in meaning to a stored input?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It does this using embeddings&lt;/strong&gt;. Every query is converted into a dense &lt;strong&gt;numerical vector a point in high dimensional space&lt;/strong&gt; where semantically similar sentences sit close together. The cache stores these vectors alongside their responses. &lt;strong&gt;When a new query arrives, its vector is compared to all stored vectors using cosine similarity&lt;/strong&gt;. If the closest match is above a threshold (e.g. 0.92), the cached response is returned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5b5sjuiz3nczr2ue87v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5b5sjuiz3nczr2ue87v.png" alt=" " width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; a user who asks “&lt;strong&gt;How many sick days do I get&lt;/strong&gt;?” and another who asks “&lt;strong&gt;What is our sick leave allowance?&lt;/strong&gt;” both get the same cached answer because the &lt;strong&gt;questions mean the same thing&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Why RAG Pipelines Waste Money at Scale
&lt;/h2&gt;

&lt;p&gt;A standard &lt;strong&gt;RAG pipeline does four things every time&lt;/strong&gt; a user sends a query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embed&lt;/strong&gt; the query &lt;strong&gt;convert the text to a vector&lt;/strong&gt; using an embedding model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector search&lt;/strong&gt; and find the &lt;strong&gt;most relevant chunks&lt;/strong&gt; in your document store&lt;/li&gt;
&lt;li&gt;Assemble &lt;strong&gt;context&lt;/strong&gt; + build a &lt;strong&gt;prompt&lt;/strong&gt; from the retrieved chunks and the user query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM invocation&lt;/strong&gt; send the prompt to the model and pay per token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each step adds latency and cost. An LLM call alone typically adds 1–4 seconds and &lt;strong&gt;costs money on every single request&lt;/strong&gt;. The problem is that in production, over 40% of queries are near paraphrases of questions already answered. &lt;strong&gt;Without caching, the system repeats all four steps for every one of them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;10,000 queries per day. 40% duplicates = 4,000 unnecessary LLM calls. At $0.002 per call that is $8 wasted daily, $2,920 per year before accounting for latency degradation during peak load.&lt;/em&gt;&lt;/p&gt;
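&lt;p&gt;The arithmetic above can be checked in a few lines:&lt;/p&gt;

```python
# Cost of duplicate queries from the example above.
queries_per_day = 10_000
duplicate_rate = 0.40
cost_per_call = 0.002   # dollars per LLM call

duplicate_calls = int(queries_per_day * duplicate_rate)   # unnecessary LLM calls/day
daily_waste = duplicate_calls * cost_per_call             # dollars wasted per day
yearly_waste = daily_waste * 365                          # dollars wasted per year
```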

&lt;p&gt;The solution is to intercept duplicate queries before they reach step 2, the vector search. A &lt;strong&gt;semantic cache sits&lt;/strong&gt; at the front of the &lt;strong&gt;pipeline&lt;/strong&gt;. If a semantically equivalent query has been answered before, return that answer immediately. &lt;strong&gt;The entire pipeline (retrieval, context assembly, LLM call) is bypassed.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: How Semantic Caching Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62jqcgzam5arr2y9t10x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62jqcgzam5arr2y9t10x.png" alt=" " width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every cached entry is a triple: the query embedding vector, the stored response, and a timestamp for expiry management.&lt;/p&gt;

&lt;p&gt;📄 Cache entry structure&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstqctvtqdrje81emyxjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstqctvtqdrje81emyxjz.png" alt=" " width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At inference time the flow is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embed the incoming query → vector v_q&lt;/li&gt;
&lt;li&gt;Search the cache: find the stored vector with highest cosine similarity to v_q&lt;/li&gt;
&lt;li&gt;If max similarity ≥ threshold θ → return the cached response y_i immediately&lt;/li&gt;
&lt;li&gt;Otherwise → run the full RAG pipeline, then store the new (v_q, response, timestamp) in the cache&lt;/li&gt;
&lt;/ul&gt;
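&lt;p&gt;The flow above can be sketched end to end in Python. The embedding function here is a toy bag of words vectorizer standing in for a real embedding model, and the threshold value is illustrative:&lt;/p&gt;

```python
import math
import time

def embed(text):
    # Toy bag-of-words embedding; a real system would call an embedding model.
    vec = {}
    for word in text.lower().replace("?", "").split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):   # illustrative threshold for toy vectors
        self.threshold = threshold
        self.entries = []                # (vector, response, timestamp) triples

    def get(self, query):
        v_q = embed(query)
        best = max(self.entries, key=lambda e: cosine(v_q, e[0]), default=None)
        if best and cosine(v_q, best[0]) >= self.threshold:
            return best[1]               # cache hit: skip the entire RAG pipeline
        return None                      # miss: run the pipeline, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response, time.time()))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
hit = cache.get("what is the capital of france")   # similar meaning -> hit
miss = cache.get("How do I reset my password?")    # unrelated -> miss
```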

&lt;h2&gt;
  
  
  The Three Cache Layers
&lt;/h2&gt;

&lt;p&gt;Semantic caching can be applied at three points in the pipeline, each with different hit rates and trade offs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo8c4u0kp8jvnpkqh7bh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo8c4u0kp8jvnpkqh7bh.png" alt=" " width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic Caching in the AI Agentic World
&lt;/h2&gt;

&lt;p&gt;A standard &lt;strong&gt;RAG pipeline&lt;/strong&gt; answers &lt;strong&gt;one question per request&lt;/strong&gt;. An &lt;strong&gt;AI agent does something more ambitious&lt;/strong&gt;: it &lt;strong&gt;plans&lt;/strong&gt;, decides which &lt;strong&gt;tools&lt;/strong&gt; to &lt;strong&gt;call&lt;/strong&gt;, executes those tools, reasons over the results, and repeats, sometimes across many turns, to complete a task.&lt;/p&gt;

&lt;p&gt;This makes the &lt;strong&gt;cost&lt;/strong&gt; and &lt;strong&gt;latency problem significantly worse&lt;/strong&gt;. Where a RAG system makes one LLM call per query, an agent may make 5–15. Where a RAG system makes one tool call, an agent may make the same tool call repeatedly across different user sessions, fetching the same product info, the same knowledge article, the same company record.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;🤖 In an agentic system, semantic caching is not just about saving one LLM call; it is about short circuiting entire reasoning chains. A cached tool result prevents a retrieval step, which prevents a reasoning step, which may prevent two further tool calls downstream.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Caching Fits in an Agent Loop
&lt;/h2&gt;

&lt;p&gt;An agent loop has two natural places to insert a cache:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before the first LLM call&lt;/strong&gt;: If the user’s intent has been handled before in a similar session, return the full cached final answer immediately. Bypasses the entire loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before each tool execution:&lt;/strong&gt; Before calling an external tool (&lt;strong&gt;database lookup, API call, knowledge base search&lt;/strong&gt;), check whether the same call or a semantically equivalent one was made recently. Return the cached tool result instead of executing.&lt;/p&gt;
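&lt;p&gt;The tool level cache can be sketched as a decorator. This is an illustrative exact key version with a TTL; a semantic variant would embed the arguments instead of matching them exactly:&lt;/p&gt;

```python
import functools
import time

TOOL_TTL_SECONDS = 300   # keep only recent results; tool data can go stale

_tool_cache = {}         # (tool_name, args) -> (result, stored_at)

def cached_tool(fn):
    # Checks the cache before executing the tool. Exact-key matching here
    # for brevity; a semantic variant would compare embedded arguments.
    @functools.wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, args)
        entry = _tool_cache.get(key)
        if entry and time.time() - entry[1] < TOOL_TTL_SECONDS:
            return entry[0]                  # cached result, no execution
        result = fn(*args)
        _tool_cache[key] = (result, time.time())
        return result
    return wrapper

calls = []

@cached_tool
def lookup_product(product_id):
    # Hypothetical tool; `calls` tracks real executions for the demo.
    calls.append(product_id)
    return {"id": product_id, "name": "Widget"}

first = lookup_product("p-42")    # executes the tool
second = lookup_product("p-42")   # served from cache, no second execution
```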

&lt;h2&gt;
  
  
  Real World Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HR Policy Bot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Internal HR chatbots are one of the highest value deployments for semantic caching. Employees constantly ask the same questions (sick leave, parental leave, expense claims, performance review timelines), phrased differently by every person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A 2,000 person company&lt;/strong&gt;. Employees ask &lt;strong&gt;~3 HR questions&lt;/strong&gt; each per year = 6,000 annual queries. Analysis shows 60% are near paraphrases of existing questions. Semantic caching eliminates ~3,600 LLM calls per year. HR policy changes at most quarterly, so a stale cache is rarely a risk.&lt;/p&gt;

&lt;p&gt;Example queries that resolve to the same cached answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“How many sick days do I get?”&lt;/li&gt;
&lt;li&gt;“What is our sick leave allowance?”&lt;/li&gt;
&lt;li&gt;“Can I take a sick day without a doctor’s note?”&lt;/li&gt;
&lt;li&gt;“What’s the policy on calling in sick?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All four embed to vectors with &lt;strong&gt;cosine similarity &amp;gt; 0.93&lt;/strong&gt; against the same cached entry. One LLM call. Four employees served.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;⚠️ Never cache personal HR queries. “How many sick days do I have left?” is a personal balance query; it must bypass the cache and hit the HRIS system directly. Detect these with entity patterns before the cache lookup.&lt;/em&gt;&lt;/p&gt;
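&lt;p&gt;One way to implement that bypass is a pattern check that runs before any cache lookup. The patterns below are illustrative; a real deployment would tune them to its own domain:&lt;/p&gt;

```python
import re

# Illustrative patterns flagging personal or account-specific queries.
PERSONAL_PATTERNS = [
    re.compile(r"\b(my|i have|do i have)\b", re.IGNORECASE),
    re.compile(r"\border\s*#?\d+\b", re.IGNORECASE),
]

def must_bypass_cache(query):
    # True -> route to the live system (e.g. HRIS), never serve from cache.
    return any(p.search(query) for p in PERSONAL_PATTERNS)

generic = must_bypass_cache("How many sick days do employees get?")   # cacheable
personal = must_bypass_cache("How many sick days do I have left?")    # bypass
```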

&lt;h2&gt;
  
  
  Use Case 2: Customer Support Services
&lt;/h2&gt;

&lt;p&gt;Customer support is the &lt;strong&gt;highest volume use case&lt;/strong&gt;. &lt;strong&gt;Password resets&lt;/strong&gt;, billing queries, &lt;strong&gt;refund policies&lt;/strong&gt;, and order tracking: a small set of issues accounts for the vast majority of ticket volume. The same problem, asked by thousands of different customers, in thousands of different ways.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;ecommerce&lt;/strong&gt; platform with 50,000 support queries per day. 12 issue categories account for 73% of volume. Semantic caching at the query level reduces average response time from 4.2s to 0.3s for cache hits, a 93% latency improvement for nearly three quarters of all users.&lt;/p&gt;

&lt;p&gt;Example queries resolving to the same cached response:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“I can’t log in”&lt;/li&gt;
&lt;li&gt;“How do I reset my password?”&lt;/li&gt;
&lt;li&gt;“I’m locked out of my account”&lt;/li&gt;
&lt;li&gt;“Forgot my password, what do I do?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Identical reset flow, identical cached answer. Cache hit rate for this category alone: 85%+.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;⚠️ Order specific queries must bypass the cache. Any query containing an order number, transaction ID, or account reference is personal; it gets routed to a live data lookup, never a cached response.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to Use Semantic Caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A cache that returns a wrong, stale, or contextually mismatched answer is worse than no cache&lt;/strong&gt;. Applied in the wrong contexts, semantic caching silently degrades quality and, in regulated environments, creates compliance liability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;🚫 The cache returns a past response to a new user. If that response was wrong, &lt;strong&gt;personalized&lt;/strong&gt;, or &lt;strong&gt;time sensitive&lt;/strong&gt;, the cache amplifies the mistake at scale. One bad cached entry can poison thousands of downstream responses.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh0hu1e3ep1j2tsfz2jn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh0hu1e3ep1j2tsfz2jn.png" alt=" " width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Decision Checklist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before adding any query or tool call to the cache, run through these six checks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is the answer the same regardless of who asks it?&lt;/strong&gt; If no → skip cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Could the answer change within your TTL window?&lt;/strong&gt; If yes → shorten TTL or skip cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the query contain personal identifiers?&lt;/strong&gt; If yes → bypass cache unconditionally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the domain regulated (medical, legal, financial)?&lt;/strong&gt; If yes → get explicit policy approval before caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the expected hit rate above ~15%?&lt;/strong&gt; If no → cache overhead likely exceeds the savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Has the response passed output validation?&lt;/strong&gt; If no → never cache unvalidated LLM outputs.&lt;/li&gt;
&lt;/ul&gt;
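&lt;p&gt;The six checks above compose naturally into a single gate function. A minimal sketch, with illustrative field and function names:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class CacheCandidate:
    """Illustrative record of everything the checklist needs to inspect."""
    user_independent: bool      # same answer regardless of who asks
    stable_within_ttl: bool     # answer won't change inside the TTL window
    has_personal_ids: bool      # order numbers, account refs, etc.
    regulated_domain: bool      # medical / legal / financial
    policy_approved: bool       # explicit approval for regulated content
    expected_hit_rate: float    # estimated fraction of repeat queries
    output_validated: bool      # response passed output validation

def should_cache(c: CacheCandidate) -> bool:
    """Apply the six-point decision checklist; any failed check skips the cache."""
    if not c.user_independent:
        return False
    if not c.stable_within_ttl:
        return False            # or: shorten the TTL instead of skipping
    if c.has_personal_ids:
        return False
    if c.regulated_domain and not c.policy_approved:
        return False
    if c.expected_hit_rate < 0.15:
        return False
    return c.output_validated
```

&lt;p&gt;Run the gate once per candidate entry, before the cache insert is ever called.&lt;/p&gt;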

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubocn40ymrrisss21fw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubocn40ymrrisss21fw2.png" alt=" " width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Qdrant Docker + Repository Pattern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fzqbho5iljk1786398a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fzqbho5iljk1786398a.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qdrant is a purpose-built, open-source vector database written in Rust. It is an ideal default backend for a semantic cache: it stores vectors and response payloads together, supports native TTL filtering, and runs in Docker in under 30 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Repository Pattern&lt;/strong&gt; wraps the backend behind a clean interface. Your RAG pipeline and agent loop only ever talk to that interface; they never import &lt;strong&gt;Qdrant or FAISS directly&lt;/strong&gt;. Swapping &lt;strong&gt;backends is one environment variable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw98n95hve1g0ejpdoghw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw98n95hve1g0ejpdoghw.png" alt=" " width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Qdrant Container in Docker Desktop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qf7s3xwfqy681ls642o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qf7s3xwfqy681ls642o.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Pattern — Semantic Caching Implementation
&lt;/h2&gt;

&lt;p&gt;This project uses the Repository Pattern to abstract cache storage behind a clean interface. The application code depends only on &lt;strong&gt;CacheRepository&lt;/strong&gt;, not on &lt;strong&gt;Qdrant&lt;/strong&gt;, &lt;strong&gt;FAISS&lt;/strong&gt;, or any specific backend. Backends are swappable via configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F654tr3u7761h18kg0lxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F654tr3u7761h18kg0lxn.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project applies the &lt;strong&gt;Repository Pattern&lt;/strong&gt; to decouple semantic cache storage from application logic. Every consumer (the agent loop, RAG pipeline, HR pipeline, and tool cache) depends only on the CacheRepository abstract interface, never on a concrete backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The payoff:&lt;/strong&gt; swap backends by changing a &lt;strong&gt;single environment variable&lt;/strong&gt;. &lt;strong&gt;Qdrant&lt;/strong&gt; runs in production, &lt;strong&gt;FAISS&lt;/strong&gt; runs locally for zero-Docker development, and adding Redis or Pinecone means creating one file and one factory registration. No application code changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  DESIGN PRINCIPLE
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Application code never imports&lt;/strong&gt; &lt;strong&gt;QdrantCache&lt;/strong&gt; or &lt;strong&gt;FaissCache&lt;/strong&gt; directly. Everything flows through &lt;strong&gt;get_cache()&lt;/strong&gt;, which returns a &lt;strong&gt;CacheRepository&lt;/strong&gt;. Storage concerns are fully isolated from business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python · src/cache_repository.py
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CacheRepository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Abstract base class for semantic cache backends.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return cached response if similarity &amp;gt;= threshold, else None.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_with_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                         &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Same as lookup but also returns the similarity score.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;ttl_hours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Store query embedding and response in the cache.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_age_hours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove entries older than max_age_hours. Returns count removed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
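&lt;p&gt;To make the contract concrete, here is a toy in-memory backend satisfying the same four methods. It is a sketch for illustration only: brute-force cosine similarity over stored unit vectors, no index, and a simplified &lt;strong&gt;invalidate&lt;/strong&gt; that just drops expired entries:&lt;/p&gt;

```python
import time
from typing import Optional, Tuple

import numpy as np

class InMemoryCache:
    """Toy backend implementing the CacheRepository contract (illustrative)."""

    def __init__(self) -> None:
        # Each entry: (unit vector, response, namespace, expiry timestamp)
        self._entries: list[tuple[np.ndarray, str, Optional[str], float]] = []

    def insert(self, v_q: np.ndarray, response: str, namespace: Optional[str] = None,
               ttl_hours: Optional[int] = None, query: Optional[str] = None) -> None:
        expires = time.time() + ttl_hours * 3600 if ttl_hours else float("inf")
        unit = v_q / np.linalg.norm(v_q)           # normalize once at write time
        self._entries.append((unit, response, namespace, expires))

    def lookup_with_score(self, v_q: np.ndarray, threshold: float = 0.92,
                          namespace: Optional[str] = None) -> Tuple[Optional[str], float]:
        unit = v_q / np.linalg.norm(v_q)
        best, best_score = None, 0.0
        now = time.time()
        for vec, resp, ns, expires in self._entries:
            if ns != namespace or expires < now:   # wrong namespace or expired
                continue
            score = float(vec @ unit)              # cosine similarity of unit vectors
            if score > best_score:
                best, best_score = resp, score
        return (best if best_score >= threshold else None), best_score

    def lookup(self, v_q: np.ndarray, threshold: float = 0.92,
               namespace: Optional[str] = None) -> Optional[str]:
        resp, _ = self.lookup_with_score(v_q, threshold, namespace)
        return resp

    def invalidate(self, max_age_hours: int = 24) -> int:
        # Simplified: drop already-expired entries and report how many were removed.
        before = len(self._entries)
        now = time.time()
        self._entries = [e for e in self._entries if e[3] >= now]
        return before - len(self._entries)
```

&lt;p&gt;A real backend replaces the linear scan with an ANN index, but consumers cannot tell the difference; that is the point of the interface.&lt;/p&gt;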



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eeh6i6423m0dohn7504.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eeh6i6423m0dohn7504.png" alt=" " width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  WHY LOOKUP_WITH_SCORE?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;lookup_with_score&lt;/strong&gt; exists because consumers occasionally need the &lt;strong&gt;raw similarity score&lt;/strong&gt;, not just a hit/miss boolean. The tool cache uses it to log near misses (high score, just below threshold) for threshold tuning. The agent loop uses it to decide whether to show a &lt;strong&gt;cache-hit&lt;/strong&gt; indicator in the UI.&lt;/p&gt;
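&lt;p&gt;The near-miss logging this enables can be sketched in a few lines (the threshold values here are illustrative):&lt;/p&gt;

```python
HIT_THRESHOLD = 0.92     # serve from cache at or above this similarity
NEAR_MISS_FLOOR = 0.85   # scores in [0.85, 0.92) are logged for threshold tuning

def classify_score(score: float) -> str:
    """Bucket a similarity score for cache telemetry."""
    if score >= HIT_THRESHOLD:
        return "hit"
    if score >= NEAR_MISS_FLOOR:
        return "near_miss"  # high score, just below threshold: a tuning candidate
    return "miss"
```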

&lt;h2&gt;
  
  
  Factory: cache_factory.get_cache()
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cache_factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="n"&gt;reads&lt;/span&gt; &lt;span class="n"&gt;CACHE_BACKEND&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;creates&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;appropriate&lt;/span&gt; &lt;span class="n"&gt;concrete&lt;/span&gt; &lt;span class="n"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storing&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="n"&gt;singleton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Every&lt;/span&gt; &lt;span class="n"&gt;subsequent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;same&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="n"&gt;avoiding&lt;/span&gt; &lt;span class="n"&gt;reconnections&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;Qdrant&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;repeated&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="n"&gt;construction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cache_factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.cache_repository&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt;

&lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return singleton backend based on CACHE_BACKEND env var.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;

    &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CACHE_BACKEND&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.qdrant_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantCache&lt;/span&gt;
        &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_COLLECTION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faiss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.faiss_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FaissCache&lt;/span&gt;
        &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FaissCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown CACHE_BACKEND: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Standard Consumer Pattern
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;src.cache_factory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_cache&lt;/span&gt;

&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;              &lt;span class="c1"&gt;# Always returns CacheRepository
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 1: check before running the pipeline
&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;                &lt;span class="c1"&gt;# LLM is never called on a hit
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 2: run the full pipeline
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: store for future queries
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding a New Backend
&lt;/h2&gt;

&lt;p&gt;Adding a backend (&lt;strong&gt;Redis&lt;/strong&gt;, &lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Azure Cache&lt;/strong&gt;) requires exactly &lt;strong&gt;three steps&lt;/strong&gt;. &lt;strong&gt;No changes are needed in agent_loop.py, tool_cache.py, hr_pipeline.py, or main.py.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Implement the interface
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;redis_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.cache_repository&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CacheRepository&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lookup_with_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_with_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# implement vector similarity search via Redis VSIM or custom hashing
&lt;/span&gt;        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;ttl_hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_age_hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;


&lt;span class="c1"&gt;## Step 2 — Register in the factory
&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cache_factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="k"&gt;elif&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.redis_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisCache&lt;/span&gt;
        &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RedisCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDIS_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="c1"&gt;## Step 3 — Set the environment variable
&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;
&lt;span class="n"&gt;CACHE_BACKEND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="n"&gt;REDIS_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ZERO APPLICATION CHANGES
&lt;/h2&gt;

&lt;p&gt;After these three steps, every consumer (agent_loop, tool_cache, hr_pipeline, main) automatically uses Redis. No imports to update, no signatures to change. This is the Repository Pattern payoff.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Repository Pattern separates&lt;/strong&gt; what the cache does (the interface) from how it stores data (the backend). This is the property that makes the system testable with FAISS, production ready with &lt;strong&gt;Qdrant&lt;/strong&gt;, and infinitely extensible without &lt;strong&gt;touching application code.&lt;/strong&gt;&lt;/p&gt;
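&lt;p&gt;The payoff can be sketched with a minimal, self-contained factory. The class and function names below (CacheRepository, InMemoryCache, get_cache) are illustrative stand-ins for the project's actual modules, not its exact code:&lt;/p&gt;

```python
import os
from abc import ABC, abstractmethod

class CacheRepository(ABC):
    """What the cache does: the only thing consumers depend on."""
    @abstractmethod
    def get(self, key: str): ...

    @abstractmethod
    def set(self, key: str, value: str) -> None: ...

class InMemoryCache(CacheRepository):
    """How one backend stores data; a stand-in for the FAISS dev cache."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

def get_cache() -> CacheRepository:
    """Factory: the single place that knows which backend is active."""
    backend = os.getenv("CACHE_BACKEND", "memory")
    if backend == "memory":
        return InMemoryCache()
    # elif backend == "redis": return RedisCache(...)  # one elif per backend
    raise ValueError(f"unknown CACHE_BACKEND: {backend}")

cache = get_cache()
cache.set("policy:sick leave", "15 days per year")
print(cache.get("policy:sick leave"))
```

&lt;p&gt;Adding a new backend means adding one class and one elif in the factory; no consumer ever changes.&lt;/p&gt;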

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc4evqeysjcn0l2g7u0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc4evqeysjcn0l2g7u0w.png" alt=" " width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Request flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28w9dumop4go60sexp6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28w9dumop4go60sexp6q.png" alt=" " width="800" height="861"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Output/Testing
&lt;/h2&gt;

&lt;h1&gt;
  
  
  SEMANTIC CACHING — ALL QUERIES + TOOL CACHE EXAMPLES
&lt;/h1&gt;

&lt;p&gt;TOOL CACHE TEST (same tool, same param — Call 1 then Call 2):&lt;br&gt;
  get_company_policy(topic='remote work') — Call 1: MISS, Call 2: HIT&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #1&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       What is our sick leave policy?&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  MISS  |  TOOL CACHE:  get_company_policy(topic='sick leave'): MISS  |  TIME:  2.48s&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #2&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       Tell me about our sick leave policy&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  HIT  |  TOOL CACHE:&lt;/strong&gt;  skipped (query cached)  |  TIME:  0.17s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #3&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       What is our sick leave policy?&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  HIT  |  TOOL CACHE:&lt;/strong&gt;  skipped (query cached)  |  TIME:  0.16s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #4&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       Tell me about vacation days&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Employees are entitled to 20 days of paid time off (PTO) per year for vacation. Requests for vacation days must be submitted at least 2 weeks in advance.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  MISS  |  TOOL CACHE&lt;/strong&gt;:  get_company_policy(topic='vacation days'): MISS  |  TIME:  2.59s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #5&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       Tell me about vacation&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Employees are entitled to 20 days of paid time off (PTO) per year for vacation. Requests for vacation days must be submitted at least 2 weeks in advance.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  HIT  |  TOOL CACHE:&lt;/strong&gt;  skipped (query cached)  |  TIME:  0.22s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #6&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       How many PTO days for vacation?&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     You have 20 PTO days for vacation per year. Please note that you must request your time off at least 2 weeks in advance.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  MISS  |  TOOL CACHE:&lt;/strong&gt;  get_company_policy(topic='PTO days for vacation'): HIT  |  TIME:  2.21s&lt;/p&gt;

&lt;p&gt;=============================================================&lt;/p&gt;

&lt;h2&gt;
  
  
  TOOL CACHE EXAMPLES (tool call → result, from this run)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='remote work')  →  &lt;strong&gt;Call 1: MISS, Call 2: HIT&lt;/strong&gt;&lt;br&gt;
Result:   Hybrid: 2 days in office required. Equipment provided.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='sick leave')  →  &lt;strong&gt;MISS&lt;/strong&gt;&lt;br&gt;
Result:   Employees get 15 sick days per year. Doctor's note required after 3 consecutive days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='vacation days')  →  &lt;strong&gt;MISS&lt;/strong&gt;&lt;br&gt;
Result:   20 days PTO per year. Must request 2 weeks in advance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='PTO days for vacation')  →  &lt;strong&gt;HIT&lt;/strong&gt;&lt;br&gt;
Result:   20 days PTO per year. Must request 2 weeks in advance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;==============================================================&lt;/p&gt;

&lt;p&gt;Total: 7.83s for 6 queries&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; turns &lt;strong&gt;repeated or similar questions&lt;/strong&gt; into instant answers instead of expensive LLM and tool calls. By using embeddings and vector similarity instead of exact string matching, you can cache responses for "What is our sick leave policy?" and serve them when users ask "Tell me about sick leave", "How many sick days do we get?", or other paraphrased variants.&lt;/p&gt;
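&lt;p&gt;A toy sketch of that similarity test, using bag-of-words vectors as a stand-in for real embeddings (the actual system embeds with a model and searches FAISS/Qdrant; the threshold here is illustrative):&lt;/p&gt;

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding': a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cached = "What is our sick leave policy?"
paraphrase = "Tell me about our sick leave policy"
unrelated = "How do I reset my password?"

# A paraphrase clears the similarity threshold; an unrelated query does not.
print(cosine(embed(cached), embed(paraphrase)) > 0.5)  # True
print(cosine(embed(cached), embed(unrelated)) > 0.5)   # False
```

&lt;p&gt;Exact string matching would treat all three prompts as distinct keys; vector similarity is what lets the paraphrase hit the cache.&lt;/p&gt;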

&lt;p&gt;&lt;strong&gt;The Repository Pattern keeps the implementation clean and flexible&lt;/strong&gt;: you can run with &lt;strong&gt;FAISS&lt;/strong&gt; for local development and switch to &lt;strong&gt;Qdrant&lt;/strong&gt; for &lt;strong&gt;production&lt;/strong&gt; without changing application code. The two-layer approach (a query cache for full responses and a tool cache for tool results) gives you fine-grained control over what gets cached and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meaning over exact match:&lt;/strong&gt; Embeddings capture intent, so paraphrases and synonyms hit the cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two layers:&lt;/strong&gt; Cache full responses and tool results separately; each has its own threshold and namespace.&lt;/p&gt;
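&lt;p&gt;One way to sketch the two layers, with a word-overlap (Jaccard) stand-in for embedding similarity. The CacheLayer class and the threshold values are illustrative, not the project's actual code or settings:&lt;/p&gt;

```python
def similarity(a: str, b: str) -> float:
    """Jaccard word overlap: a stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

class CacheLayer:
    """One semantic cache layer with its own namespace and threshold."""
    def __init__(self, namespace: str, threshold: float):
        self.namespace = namespace
        self.threshold = threshold
        self._entries = []  # list of (key_text, value) pairs

    def lookup(self, key: str):
        for stored_key, value in self._entries:
            if similarity(key, stored_key) >= self.threshold:
                return value  # semantic HIT
        return None  # MISS

    def store(self, key: str, value: str) -> None:
        self._entries.append((key, value))

# Looser threshold for paraphrased questions, stricter for tool calls.
query_cache = CacheLayer("query", threshold=0.5)  # full LLM responses
tool_cache = CacheLayer("tool", threshold=0.9)    # individual tool results

tool_cache.store("get_company_policy(topic='sick leave')", "15 sick days/year")
query_cache.store("What is our sick leave policy?",
                  "You get 15 sick days per year.")

# A query-cache HIT means the tool cache is never consulted at all:
print(query_cache.lookup("what is our sick leave policy?") is not None)
```

&lt;p&gt;Separate namespaces keep a cached full response from ever colliding with a cached tool result, and each layer can tune its threshold independently.&lt;/p&gt;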

&lt;p&gt;&lt;strong&gt;Swappable backends:&lt;/strong&gt; Use &lt;strong&gt;Qdrant&lt;/strong&gt; for persistence or &lt;strong&gt;FAISS&lt;/strong&gt; for fast, &lt;strong&gt;in-memory&lt;/strong&gt; testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know when to skip:&lt;/strong&gt; Avoid caching personal, entity-specific, or rapidly changing data.&lt;/p&gt;
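&lt;p&gt;A guard like the following can enforce that rule before anything is written to the cache. The patterns are an illustrative heuristic for the categories named above, not the project's actual rules:&lt;/p&gt;

```python
import re

# Illustrative patterns only: one possible way to flag queries that
# should bypass the cache (personal, entity-specific, or volatile data).
SKIP_PATTERNS = [
    r"\bmy\b", r"\bme\b", r"\bI\b",           # personal queries
    r"\btoday\b", r"\bnow\b", r"\blatest\b",  # rapidly changing data
    r"\bstock price\b", r"\bweather\b",       # volatile lookups
]

def should_cache(prompt: str) -> bool:
    """Return False for prompts whose answers should not be cached."""
    return not any(re.search(p, prompt, re.IGNORECASE) for p in SKIP_PATTERNS)

print(should_cache("What is our sick leave policy?"))    # True: shared policy
print(should_cache("How many PTO days do I have left?")) # False: personal
```

&lt;p&gt;The same check can run before lookup as well, so a personal question never accidentally matches a cached generic answer.&lt;/p&gt;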

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
