<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>Future: Seenivasa Ramadurai</title>
    <description>The latest articles on Future by Seenivasa Ramadurai (@sreeni5018).</description>
    <link>https://future.forem.com/sreeni5018</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1829954%2F564a03dc-062e-4c58-b28e-be52605aefa8.jpg</url>
      <title>Future: Seenivasa Ramadurai</title>
      <link>https://future.forem.com/sreeni5018</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://future.forem.com/feed/sreeni5018"/>
    <language>en</language>
    <item>
      <title>From RAG to Knowledge Graphs: Why the Agent Era Is Redefining AI Architecture</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sun, 12 Apr 2026 15:23:54 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/from-rag-to-knowledge-graphs-why-the-agent-era-is-redefining-ai-architecture-3fgc</link>
      <guid>https://future.forem.com/sreeni5018/from-rag-to-knowledge-graphs-why-the-agent-era-is-redefining-ai-architecture-3fgc</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawrm0mmn6ijdke4059m5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawrm0mmn6ijdke4059m5.png" alt=" " width="800" height="524"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One question is dominating AI architecture discussions right now: we already built RAG, everyone is talking about GraphRAG, so should we move?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;On the surface&lt;/strong&gt;, it looks like a &lt;strong&gt;standard tech upgrade cycle.&lt;/strong&gt; &lt;strong&gt;Underneath, something more fundamental is happening&lt;/strong&gt;: a debate about how we &lt;strong&gt;represent knowledge, how we retrieve it, and how we expect machines to reason over it&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;last two years&lt;/strong&gt;, the industry followed a predictable path. We started with &lt;strong&gt;raw Large Language Models&lt;/strong&gt;, quickly realized they could &lt;strong&gt;hallucinate&lt;/strong&gt; with &lt;strong&gt;terrifying confidence&lt;/strong&gt;, and turned to &lt;strong&gt;RAG (Retrieval-Augmented Generation)&lt;/strong&gt; to &lt;strong&gt;ground&lt;/strong&gt; them in real data. It was a genuine breakthrough. Suddenly you could connect a model to your &lt;strong&gt;PDFs&lt;/strong&gt;, internal &lt;strong&gt;portals&lt;/strong&gt;, wikis, and live &lt;strong&gt;databases&lt;/strong&gt; without the nightmare of constant retraining. For most teams, it felt like magic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Then the ceiling arrived&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Teams started noticing that &lt;strong&gt;RAG was useful&lt;/strong&gt;, but not intelligent. It &lt;strong&gt;could find relevant text&lt;/strong&gt;. It couldn't &lt;strong&gt;understand how things actually connected.&lt;/strong&gt; This &lt;strong&gt;gap between finding information&lt;/strong&gt; and &lt;strong&gt;understanding relationships&lt;/strong&gt; is what &lt;strong&gt;drove&lt;/strong&gt; the industry toward &lt;strong&gt;Knowledge Graphs and GraphRAG.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Now, just as that conversation is picking up steam, another shift is already underway: &lt;strong&gt;agentic AI.&lt;/strong&gt; &lt;strong&gt;Autonomous agents&lt;/strong&gt;, &lt;strong&gt;dynamic tool use&lt;/strong&gt;, and &lt;strong&gt;multi-step orchestration&lt;/strong&gt; are changing the very definition of what &lt;strong&gt;retrieval&lt;/strong&gt; even means. It's no longer about fetching facts; it's &lt;strong&gt;about giving machines the cognitive infrastructure to solve genuinely complex problems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Before you commit to your next infrastructure pivot, let's slow down and answer the questions that actually matter.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;What exactly is RAG, and where does it fail?&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Why did GraphRAG emerge,&lt;/strong&gt; and what is the &lt;strong&gt;real cost of building it?&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;In a world of agents&lt;/strong&gt;, do we still need it the same way?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This blog is the roadmap for that journey.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem RAG Solved (and Why It Mattered So Much)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A large language model is trained on enormous amounts of text.&lt;/strong&gt; That gives it remarkable linguistic ability and broad general knowledge, but it &lt;strong&gt;comes with a hard constraint&lt;/strong&gt;. The model doesn't know your &lt;strong&gt;enterprise data&lt;/strong&gt;, your &lt;strong&gt;latest reports&lt;/strong&gt;, your &lt;strong&gt;private&lt;/strong&gt; &lt;strong&gt;documents&lt;/strong&gt;, or the product changes that landed last Tuesday. And if it doesn't know something? It may still generate a confident, fluent answer anyway. That's &lt;strong&gt;hallucination&lt;/strong&gt;, and it's &lt;strong&gt;not a bug you can patch; it's structural.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG solves this by moving knowledge outside the model and fetching it dynamically at query time.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv3m3w9v0corobfj385u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcv3m3w9v0corobfj385u.png" alt=" " width="800" height="268"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The flow is straightforward:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Ingest your documents&lt;/strong&gt;: PDFs, emails, contracts, meeting notes, tickets, &lt;strong&gt;whatever lives in your knowledge ecosystem&lt;/strong&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Chunk&lt;/strong&gt; the text into smaller, searchable units (chunk size matters enormously: too small and you lose context, too large and retrieval gets noisy).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Embed&lt;/strong&gt; each chunk using an embedding model, converting text into dense numerical vectors that capture semantic meaning.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Index&lt;/strong&gt; those vectors in a vector database: FAISS, Qdrant, Pinecone, Chroma, Weaviate, or Milvus are common choices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;At query time, &lt;strong&gt;embed the user's question&lt;/strong&gt;, find the most semantically similar chunks, inject them into the prompt, and let the LLM answer from real evidence.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
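The five steps above can be sketched end to end. This is a toy illustration, not any particular library: character-bigram counts stand in for a real embedding model, a plain list stands in for a vector database, and the names (embed, retrieve) and sample chunks are invented for the example.

```python
import math
from collections import Counter

def embed(text):
    """Toy embedding: L2-normalized character-bigram counts.
    A real pipeline would call an embedding model here instead."""
    lowered = text.lower()
    bigrams = Counter(lowered[i:i + 2] for i in range(len(lowered) - 1))
    norm = math.sqrt(sum(v * v for v in bigrams.values()))
    return {k: v / norm for k, v in bigrams.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    return sum(a[k] * b.get(k, 0.0) for k in a)

# Steps 1-3: ingest, chunk, and embed (pre-chunked here for brevity).
chunks = [
    "The payments platform depends on the fraud detection service.",
    "Quarterly revenue grew eight percent year over year.",
    "The new VP of engineering now leads the payments platform team.",
]
# Step 4: the "vector index" is just (chunk, vector) pairs.
index = [(chunk, embed(chunk)) for chunk in chunks]

def retrieve(query, k=2):
    """Step 5: embed the query, return the top-k most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]

top = retrieve("Who leads the payments platform?")
```

The retrieved chunks would then be injected into the prompt so the LLM answers from evidence rather than memory.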

&lt;p&gt;It changed practical AI development. It gave teams a way to build &lt;strong&gt;grounded document assistants&lt;/strong&gt;, &lt;strong&gt;enterprise search tools&lt;/strong&gt;, &lt;strong&gt;Q&amp;amp;A bots&lt;/strong&gt;, and domain-specific copilots without retraining foundation models. &lt;strong&gt;And it introduced an architectural principle that remains one of the most important ideas in modern AI systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;The model doesn't need to contain all knowledge internally, if we can retrieve the right knowledge externally at the right moment.&lt;br&gt;
That idea isn't going away. But it has limits.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where RAG Starts Struggling
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The challenge with RAG isn't that it's bad&lt;/strong&gt;. The challenge is that it's optimized for similarity, not structure.&lt;/p&gt;

&lt;p&gt;That difference turns out to matter a great deal in practice.&lt;br&gt;
Imagine someone asks a question: &lt;strong&gt;Which projects are affected by the recent leadership changes?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpip4r7eau39nss2okcq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbpip4r7eau39nss2okcq.png" alt=" " width="800" height="778"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;classic RAG&lt;/strong&gt; system might &lt;strong&gt;retrieve&lt;/strong&gt; a chunk about a &lt;strong&gt;new VP appointment&lt;/strong&gt;, another about a project roadmap, another about budget realignments, and another about team restructuring. Each chunk could be individually relevant. &lt;strong&gt;But the system has no natural way to understand that the VP change affects Project A through a specific reporting line&lt;/strong&gt;, or that the &lt;strong&gt;budget change flows to Project B&lt;/strong&gt; because of a procurement dependency. &lt;strong&gt;RAG retrieved similar text. It didn't model how things connect.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This plays out in &lt;strong&gt;three structural pain points&lt;/strong&gt; that no amount of implementation tuning fully resolves.&lt;/p&gt;

&lt;h2&gt;
  
  
  Relationships Don't Live in Paragraphs
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real world knowledge is relational.&lt;/strong&gt; Drugs interact with proteins. Engineers depend on infrastructure. Transactions flow through accounts. Court rulings reference precedents. Products belong to supply chains. None of this structure lives cleanly in a paragraph and vector similarity can't reconstruct it from loose chunks.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Context Isn't the Same as Better Context
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;As context windows have grown from 4K to 128K to 1M tokens&lt;/strong&gt;, the tempting fix has been to just send more chunks. &lt;strong&gt;But flooding the LLM with additional text doesn't compensate for missing structure.&lt;/strong&gt; Research has consistently shown that LLMs are sensitive to redundant and noisy context: more text can actively degrade answer quality when the signal is buried in noise. A 2023 paper from Stanford memorably called this the &lt;strong&gt;lost-in-the-middle problem&lt;/strong&gt;: models perform worse when the relevant information is buried inside long contexts rather than positioned at the edges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Local Relevance ≠ Global Understanding
&lt;/h2&gt;

&lt;p&gt;RAG surfaces locally relevant text fragments. It doesn't provide a &lt;strong&gt;holistic view of a domain, network, or system.&lt;/strong&gt; This becomes a serious limitation in scientific literature review, financial relationship analysis, legal precedent mapping, biomedical research, and any domain where the value lies not just in what's said, but in how facts connect.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At some point, teams hit a realization: if the problem isn't finding relevant text but navigating connected knowledge, then text chunks might be the wrong unit of retrieval entirely.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What a Knowledge Graph Actually Is
&lt;/h2&gt;

&lt;p&gt;A Knowledge Graph is a way of representing knowledge as explicit entities and relationships, rather than as paragraphs that leave the model to infer structure later.&lt;br&gt;
At the heart of this is a simple but powerful idea called a &lt;strong&gt;triplet&lt;/strong&gt;:&lt;/p&gt;

&lt;h2&gt;
  
  
  (Subject → Relationship → Object)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;For example:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;(Ram → leads → Project A)&lt;br&gt;
(Project A → depends_on → Payments Platform v2)&lt;br&gt;
(Payments Platform v2 → owned_by → FinTech Division)&lt;br&gt;
(FinTech Division → reports_to → CTO Office)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ow1cf18o2xvz6l2ugq8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ow1cf18o2xvz6l2ugq8.png" alt=" " width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice what just happened. We didn't store paragraphs. &lt;strong&gt;We stored meaning in a form the system can traverse&lt;/strong&gt;, query, and reason over. &lt;/p&gt;

&lt;h2&gt;
  
  
  Now we can ask:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What does the CTO Office indirectly own?&lt;/strong&gt; and follow the chain. We can also ask:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What breaks if the Payments Platform is delayed?&lt;/strong&gt; and trace the dependencies. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;We've moved from retrieving information to navigating knowledge.&lt;/strong&gt;&lt;/p&gt;
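Once the four triplets above are loaded into an adjacency structure, a question like "What breaks if the Payments Platform is delayed?" becomes a reverse traversal over depends_on edges. A minimal sketch in plain Python, with no graph database required:

```python
from collections import defaultdict

# The triplets from the example above.
triplets = [
    ("Ram", "leads", "Project A"),
    ("Project A", "depends_on", "Payments Platform v2"),
    ("Payments Platform v2", "owned_by", "FinTech Division"),
    ("FinTech Division", "reports_to", "CTO Office"),
]

# Index edges by object so we can walk relationships backwards.
in_edges = defaultdict(list)  # object -&gt; [(relation, subject)]
for s, r, o in triplets:
    in_edges[o].append((r, s))

def impacted_by(node):
    """Everything that transitively depends on `node`:
    follow depends_on edges in reverse until the chain ends."""
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for rel, subj in in_edges[current]:
            if rel == "depends_on":
                hits.append(subj)
                stack.append(subj)
    return hits

print(impacted_by("Payments Platform v2"))  # prints ['Project A']
```

The same index, traversed over different relation types, answers the ownership and reporting-line questions too.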

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F983k09wfh68i3qz6aht1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F983k09wfh68i3qz6aht1.png" alt=" " width="800" height="751"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge graphs&lt;/strong&gt; are stored as &lt;strong&gt;directed graphs&lt;/strong&gt;: &lt;strong&gt;nodes&lt;/strong&gt; are &lt;strong&gt;entities&lt;/strong&gt;, &lt;strong&gt;edges&lt;/strong&gt; are typed &lt;strong&gt;relationships&lt;/strong&gt;. This structure enables graph traversal algorithms, &lt;strong&gt;multi-hop queries&lt;/strong&gt;, &lt;strong&gt;shortest-path analysis&lt;/strong&gt;, and network centrality calculations, none of which &lt;strong&gt;are available in a flat vector index&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Knowledge Graphs Already Live
&lt;/h2&gt;

&lt;p&gt;Knowledge graphs aren't a new invention. Google has used its Knowledge Graph to power search since 2012, and Wikidata, the structured-data backbone of Wikipedia, contains over 100 million items. The biomedical knowledge graph OpenBioLink contains millions of interactions between genes, proteins, diseases, and drugs. LinkedIn's economic graph models relationships between professionals, companies, skills, and jobs at scale. These aren't prototypes; they're production systems handling billions of queries.&lt;/p&gt;

&lt;h2&gt;
  
  
  What GraphRAG Is and How It Works
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG&lt;/strong&gt;, popularized significantly by a 2024 Microsoft Research paper, is a framework that uses a knowledge graph as the retrieval layer for an LLM, rather than a flat vector index.&lt;/p&gt;

&lt;p&gt;The core intuition: instead of retrieving &lt;strong&gt;semantically similar text chunks&lt;/strong&gt;, &lt;strong&gt;retrieve connected knowledge from a graph&lt;/strong&gt;, then provide that richer context to the model.&lt;br&gt;
GraphRAG typically involves three stages.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fja5bqngzcqiuvxm7f3rw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fja5bqngzcqiuvxm7f3rw.png" alt=" " width="800" height="507"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 1: Graph Based Indexing
&lt;/h2&gt;

&lt;p&gt;You build and index a graph. This might be an existing open knowledge graph (Wikidata, ConceptNet, UMLS for medical domains), a domain-specific proprietary graph, or a graph you construct from your own corpus using extraction pipelines. Proper indexing matters: retrieval can use text descriptions, graph topology, embeddings over graph structure, hybrid schemes, or all of the above.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 2: Graph Guided Retrieval
&lt;/h2&gt;

&lt;p&gt;When a user asks a question, the system identifies relevant entities, then traverses relationships, paths, and subgraphs to assemble a richer answer context. This may involve entity linking, k-hop neighborhood expansion, Personalized PageRank, community detection, or LLM-directed graph traversal. The Microsoft GraphRAG paper specifically introduced a community summarization approach, using graph algorithms to identify clusters of related entities and pre-generating summaries, which dramatically improved performance on global sense-making tasks like "What are the major themes in this document corpus?"&lt;/p&gt;

&lt;h2&gt;
  
  
  Stage 3: Graph Enhanced Generation
&lt;/h2&gt;

&lt;p&gt;Once relevant graph knowledge is identified, it's translated into a form the LLM can consume: raw triplets, adjacency lists, natural language descriptions of paths, or structured summaries. This translation step is critical and often underestimated: LLMs are sequence models trained on text, not graph traversal engines. The quality of this bridge between graph structure and language generation largely determines whether GraphRAG actually outperforms RAG in practice.&lt;/p&gt;
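A minimal sketch of that graph-to-text bridge, assuming each retrieved triplet is simply flattened into one sentence per edge. Real systems often use richer path descriptions or pre-generated summaries; this just shows the shape of the step.

```python
def triplets_to_text(triplets):
    """Render (subject, relation, object) triplets as plain sentences
    that can be pasted into an LLM prompt as grounding context."""
    return "\n".join(
        f"{s} {r.replace('_', ' ')} {o}." for s, r, o in triplets
    )

context = triplets_to_text([
    ("Project A", "depends_on", "Payments Platform v2"),
    ("Payments Platform v2", "owned_by", "FinTech Division"),
])
# context:
# Project A depends on Payments Platform v2.
# Payments Platform v2 owned by FinTech Division.
```

Even this naive flattening makes the retrieved structure legible to a sequence model; the design question is how much structure survives the translation.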

&lt;h2&gt;
  
  
  How Knowledge Graphs Get Built: The Extraction Pipeline
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before you can run GraphRAG, you need a graph&lt;/strong&gt;. Building one from your own data means running an information extraction pipeline over your corpus.&lt;/p&gt;

&lt;h2&gt;
  
  
  The two core tasks are:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Named Entity Recognition (NER):&lt;/strong&gt; Identifying entities in text: &lt;strong&gt;people&lt;/strong&gt;, &lt;strong&gt;organizations&lt;/strong&gt;, &lt;strong&gt;products&lt;/strong&gt;, &lt;strong&gt;locations&lt;/strong&gt;, medical conditions, financial instruments, events, and whatever entity types your domain requires.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Relation Extraction (RE):&lt;/strong&gt; Identifying the relationships between those &lt;strong&gt;entities&lt;/strong&gt;: &lt;strong&gt;works_at&lt;/strong&gt;, &lt;strong&gt;acquired&lt;/strong&gt;, &lt;strong&gt;causes&lt;/strong&gt;, &lt;strong&gt;located_in&lt;/strong&gt;, depends_on, cited_by.&lt;/p&gt;

&lt;p&gt;Historically, this required &lt;strong&gt;expensive annotated training data and domain-specific supervised models&lt;/strong&gt;. Modern LLMs have changed the economics significantly. &lt;strong&gt;You can prompt a model to extract entities and relationships from a document in a single pass, using in context examples to define your schema.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct71v21zzhnlo6rsroen.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fct71v21zzhnlo6rsroen.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Two Practical Approaches
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Custom LLM pipelines:
&lt;/h2&gt;

&lt;p&gt;You &lt;strong&gt;design&lt;/strong&gt; &lt;strong&gt;prompts&lt;/strong&gt; that &lt;strong&gt;specify exactly what entity types and relationship types to extract&lt;/strong&gt;, validate the output, handle edge cases, and write the results to your &lt;strong&gt;graph database&lt;/strong&gt;, often &lt;strong&gt;Neo4j&lt;/strong&gt;, &lt;strong&gt;which uses the Cypher query language&lt;/strong&gt;. This gives you fine-grained domain control but requires serious engineering effort: output validation, error handling, entity disambiguation (is OpenAI the same as Open AI?), conflict resolution, and ongoing maintenance. For enterprise-grade graphs that become core assets, this is usually the right investment.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. LangChain GraphTransformers / LlamaIndex graph tools
&lt;/h2&gt;

&lt;p&gt;Frameworks like &lt;strong&gt;LangChain's&lt;/strong&gt; &lt;strong&gt;LLMGraphTransformer&lt;/strong&gt; abstract much of this into a &lt;strong&gt;few lines of code&lt;/strong&gt;. You hand it documents and get back structured graph documents you can load into a graph store. This is excellent for prototyping and early validation: you can have a working graph in hours, not weeks. &lt;strong&gt;The tradeoff is less control over extraction quality and ontology design&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A pragmatic approach:&lt;/strong&gt; use LangChain tools to validate the concept and understand the data, then invest in a custom pipeline when the graph becomes a production dependency.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Costs of GraphRAG (The Part Most Bloggers Skip)
&lt;/h2&gt;

&lt;p&gt;Here's where most GraphRAG enthusiasm runs ahead of reality. The framework is genuinely powerful but it carries costs that compound at scale. Teams that discover these after committing to the architecture tend to have strong opinions about them.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Compute Cost Is a Design Constraint, Not a Detail
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Building a graph&lt;/strong&gt; from a &lt;strong&gt;large corpus&lt;/strong&gt; means running LLM-based extraction over every &lt;strong&gt;document, often multiple passes for NER, RE&lt;/strong&gt;, and disambiguation. At scale, this gets expensive fast: a corpus of 100,000 documents running extraction at &lt;strong&gt;$0.01 per document is $1,000 to build&lt;/strong&gt;. But knowledge changes. Documents get updated, entities evolve, relationships become stale. &lt;strong&gt;This isn't a one-time cost; it's an ongoing infrastructure commitment&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Microsoft GraphRAG&lt;/strong&gt; paper noted that graph construction costs can be &lt;strong&gt;10–100x higher than standard RAG indexing&lt;/strong&gt;, depending on corpus size and extraction complexity. For many use cases, that's a reasonable investment. For others, it's prohibitive.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Maintenance Is Continuous and Non Trivial
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;In a standard RAG system&lt;/strong&gt;, updating the index when data changes is relatively mechanical: process the new document, chunk it, embed it, replace the old vectors.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In GraphRAG, a new document isn't just new text. It may:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Introduce entities&lt;/strong&gt; not yet in the graph&lt;/li&gt;
&lt;li&gt;Rename or merge &lt;strong&gt;existing entities&lt;/strong&gt; (disambiguation challenge)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Add&lt;/strong&gt; &lt;strong&gt;relationships&lt;/strong&gt; that contradict previously stored ones&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Require schema updates&lt;/strong&gt; to accommodate new relationship types&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Trigger cascading updates across connected subgraphs&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Real knowledge graph maintenance involves entity resolution (merging duplicate nodes), relationship validation, conflict handling, ontology management, and quality monitoring.&lt;/strong&gt; This isn't optional: a stale or inconsistent graph produces worse answers than no graph at all. Organizations running production knowledge graphs typically have dedicated data engineering pipelines, not just an extraction script that runs once.&lt;/p&gt;
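Entity resolution is the piece teams most often underestimate. Even a crude normalization pass, sketched below with invented names, shows the shape of the problem; production systems layer embeddings, alias tables, and human review on top of anything this naive.

```python
def normalize(name):
    """Crude canonical key: lowercase and strip non-alphanumerics.
    Real entity resolution also uses embeddings, known aliases,
    and human review, since normalization alone over-merges."""
    return "".join(ch for ch in name.lower() if ch.isalnum())

def resolve_entities(names):
    """Group surface forms that collapse to the same canonical key,
    i.e. candidate duplicate nodes to merge in the graph."""
    canonical = {}
    for name in names:
        canonical.setdefault(normalize(name), []).append(name)
    return canonical

groups = resolve_entities(["OpenAI", "Open AI", "open-ai", "Neo4j"])
# {'openai': ['OpenAI', 'Open AI', 'open-ai'], 'neo4j': ['Neo4j']}
```

Every group with more than one surface form is a merge decision, and every merge decision can cascade into relationship updates across the graph.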

&lt;h2&gt;
  
  
  3. Query Complexity Is Significantly Higher
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Vector RAG retrieval is fast and conceptually simple&lt;/strong&gt;: &lt;strong&gt;embed&lt;/strong&gt; the &lt;strong&gt;query&lt;/strong&gt;, run approximate &lt;strong&gt;nearest neighbor search&lt;/strong&gt;, return the &lt;strong&gt;top-k chunks&lt;/strong&gt;. The main failure mode is retrieving the wrong chunks, &lt;strong&gt;which you address by improving chunking, embeddings, and reranking.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG retrieval involves:&lt;/strong&gt; identifying &lt;strong&gt;entities&lt;/strong&gt; in the &lt;strong&gt;query&lt;/strong&gt;, &lt;strong&gt;traversing&lt;/strong&gt; the graph, selecting relevant subgraphs, managing traversal depth (too shallow and you miss context, too deep and you hit subgraph explosion), translating graph results into LLM-consumable text, and often generating structured queries in &lt;strong&gt;Cypher&lt;/strong&gt; or &lt;strong&gt;SPARQL&lt;/strong&gt;. Each step introduces new failure modes, and a single error (the entity linker fails to identify a key node, the traversal goes in the wrong direction) can cascade into a wrong answer even if the graph itself is perfectly accurate.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. LLMs Are Not Graph Native Models
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is a foundational point that's easy to underestimate.&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; are trained on &lt;strong&gt;sequences of tokens&lt;/strong&gt;. They're &lt;strong&gt;extraordinarily good at language&lt;/strong&gt;, context, and reasoning over text. They're not naturally good at topological reasoning, deep multi-hop graph traversal, or understanding complex graph structure. As graph complexity increases (more hops, more nodes, more relationship types), LLM performance can degrade unless the &lt;strong&gt;graph-to-text translation is carefully designed.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is why active research exists on Graph Neural Networks (GNNs), Knowledge Graph Embeddings (like TransE, RotatE, ComplEx), and specialized graph reasoning models that can work alongside LLMs: language models alone aren't sufficient for the hardest graph reasoning tasks.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Subgraph Explosion Is a Real Production Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;As your graph grows, so does the number of paths between any two nodes.&lt;/strong&gt; A query that seems simple, "What does this organization depend on?", can trigger traversal over thousands of candidate subgraphs if the graph is dense. Without careful traversal bounds, relevance scoring, and pruning strategies, retrieval latency can blow past acceptable thresholds. Large-scale industrial knowledge graphs at companies like Google and Amazon contain billions of entities and trillions of relationships, and efficient retrieval over those structures requires specialized infrastructure, not just a graph database with default settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use GraphRAG (and When Not To)
&lt;/h2&gt;

&lt;p&gt;Given the &lt;strong&gt;costs&lt;/strong&gt; and &lt;strong&gt;complexity&lt;/strong&gt;, &lt;strong&gt;GraphRAG&lt;/strong&gt; deserves a clear deployment framework.&lt;/p&gt;

&lt;h2&gt;
  
  
  Use GraphRAG when:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Relationships are the core question&lt;/strong&gt;. If users routinely ask about dependencies, hierarchies, networks, chains of causation, or multi-hop connections, and your current RAG system struggles with these, a graph likely adds genuine value.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your domain has natural graph structure.&lt;/strong&gt; Biomedical research (gene-protein-disease networks), legal precedent analysis, financial transaction monitoring, &lt;strong&gt;supply chain management, security incident investigation: these domains are inherently relational, and graph structure captures meaning that flat text loses.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multi-hop reasoning is required.&lt;/strong&gt; "What companies did the CTO previously work at, and what products were they responsible for?" requires following a chain of relationships across entities. RAG retrieves disconnected chunks; a graph traverses the chain.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Global sense-making matters.&lt;/strong&gt; The Microsoft GraphRAG research showed particular strength in tasks that require understanding themes, patterns, and relationships across an entire corpus: summarization tasks where no single document contains the answer. Standard RAG performs poorly on these.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Stick with RAG when:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text retrieval is the actual problem.&lt;/strong&gt; If users are asking questions that can be answered by finding the right paragraph (policy lookup, document Q&amp;amp;A, manual search), RAG is often simpler, cheaper, and more maintainable. Don't add complexity for problems that don't require it.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Your data changes rapidly.&lt;/strong&gt; &lt;strong&gt;Fast-moving data makes graph maintenance expensive. A vector index is much easier to keep current.&lt;/strong&gt; &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Agents&lt;/strong&gt; can resolve the &lt;strong&gt;gap dynamically&lt;/strong&gt;. More on this shortly.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;You're early in your AI journey.&lt;/strong&gt; Get RAG right first. Chunking, embeddings, metadata filtering, reranking, and permissions are complex enough. Adding graph infrastructure before validating the core product is usually premature.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Then Came Agents, Changing the Game Again
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9ey3dny1ggvdchtkvce.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9ey3dny1ggvdchtkvce.png" alt=" " width="800" height="328"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;While teams were deep in RAG vs. GraphRAG debates&lt;/strong&gt;, &lt;strong&gt;agentic AI was quietly shifting the entire premise.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An agent isn't a retriever&lt;/strong&gt;. It's a &lt;strong&gt;reasoning&lt;/strong&gt; and &lt;strong&gt;orchestration&lt;/strong&gt; &lt;strong&gt;layer&lt;/strong&gt; that can choose &lt;strong&gt;tools&lt;/strong&gt;, &lt;strong&gt;call&lt;/strong&gt; &lt;strong&gt;APIs&lt;/strong&gt;, &lt;strong&gt;query&lt;/strong&gt; &lt;strong&gt;databases&lt;/strong&gt;, &lt;strong&gt;write&lt;/strong&gt; and &lt;strong&gt;execute&lt;/strong&gt; code, &lt;strong&gt;maintain state across steps&lt;/strong&gt;, and decide what to do next based on intermediate results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This changes the architectural question fundamentally.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GraphRAG assumes that you should structure knowledge in advance so you can traverse it later.&lt;/strong&gt; The entire value proposition is &lt;strong&gt;precomputed structure&lt;/strong&gt; available at retrieval time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents introduce a different possibility:&lt;/strong&gt; &lt;strong&gt;maybe we don't need&lt;/strong&gt; to &lt;strong&gt;precompute&lt;/strong&gt; every relationship if the system can discover and assemble relevant context dynamically at runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Consider what an Agent can do in a single reasoning flow
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Query&lt;/strong&gt; a relational &lt;strong&gt;database&lt;/strong&gt; for organizational structure&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Search&lt;/strong&gt; a &lt;strong&gt;vector&lt;/strong&gt; index for relevant documents&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Call&lt;/strong&gt; an internal &lt;strong&gt;API&lt;/strong&gt; for live financial data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Execute&lt;/strong&gt; code to analyze a dataset&lt;/li&gt;
&lt;li&gt;Synthesize all of it into a &lt;strong&gt;coherent answer&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In some cases, that dynamic composition can substitute for a prebuilt knowledge graph especially when the relationships are discoverable from authoritative source systems rather than needing to be extracted and stored separately.&lt;/p&gt;
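&lt;p&gt;The dynamic composition described above can be sketched in a few lines. Every data source here is a hypothetical stand-in, and in a real system an LLM would synthesize the assembled context into an answer:&lt;/p&gt;

```python
# A toy sketch of dynamic context assembly in a single reasoning flow.
# All data sources below are invented stand-ins, not a real agent framework.

def query_org_db(name):
    # Stand-in for a relational query: who does this person report to?
    org = {"alice": "bob", "bob": "carol"}
    return org.get(name)

def vector_search(question):
    # Stand-in for a vector index lookup returning relevant documents.
    docs = {"revenue": "Q3 revenue grew 12% year over year."}
    return [text for key, text in docs.items() if key in question]

def call_finance_api():
    # Stand-in for a live internal API call.
    return {"q3_revenue_usd": 4_200_000}

def answer(question):
    """Assemble context from several sources at runtime."""
    context = {
        "manager_of_alice": query_org_db("alice"),
        "documents": vector_search(question),
        "live_data": call_finance_api(),
    }
    # A real agent would hand this context to an LLM; here we just return it.
    return context

result = answer("how did revenue change?")
```

&lt;p&gt;Notice that no relationship was precomputed: the org chart, the documents, and the live numbers were fetched from their authoritative sources on demand.&lt;/p&gt;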

&lt;h2&gt;
  
  
  Major Agentic Frameworks in Production
&lt;/h2&gt;

&lt;p&gt;Several frameworks have emerged to support this style of architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LangGraph&lt;/strong&gt; (from LangChain) provides a &lt;strong&gt;graph-based state machine&lt;/strong&gt; for building multi-step agent workflows with explicit control flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;AutoGen&lt;/strong&gt; (Microsoft) enables multi-agent conversations where specialized agents collaborate on complex tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Microsoft Agent Framework&lt;/strong&gt; combines AutoGen and Semantic Kernel into a new agentic framework for building multi-agent systems and workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CrewAI&lt;/strong&gt; focuses on role-based multi-agent systems for structured workflows.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Bedrock Agents and Google Vertex AI Agents&lt;/strong&gt; offer managed agentic infrastructure at cloud scale.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;These frameworks don't replace retrieval; they orchestrate it.&lt;/strong&gt; An agent using LangGraph might invoke a vector search tool for semantic lookup, a graph query tool for relationship traversal, &lt;strong&gt;a SQL tool for structured data, and a web search tool for current information, all within a single reasoning chain.&lt;/strong&gt;&lt;/p&gt;
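&lt;p&gt;The orchestration idea can be illustrated with a toy router. The tools and keyword heuristic below are invented stand-ins; in a production agent the model itself decides which tool to call:&lt;/p&gt;

```python
# A minimal tool-routing sketch. Frameworks like LangGraph express this as a
# state machine; the tools and the routing heuristic here are hypothetical.

def sql_tool(q): return "rows for: " + q
def vector_tool(q): return "similar docs for: " + q
def web_tool(q): return "live results for: " + q

TOOLS = {
    "structured": sql_tool,   # counts, joins, exact records
    "semantic": vector_tool,  # fuzzy meaning-based lookup
    "current": web_tool,      # information newer than any index
}

def route(question):
    """Pick a tool from crude keyword cues (an LLM would decide in practice)."""
    if any(w in question for w in ("how many", "count", "total")):
        return "structured"
    if any(w in question for w in ("latest", "today", "news")):
        return "current"
    return "semantic"

def run(question):
    tool_name = route(question)
    return tool_name, TOOLS[tool_name](question)
```

&lt;p&gt;The retrieval systems stay exactly as they were; the new layer is only the decision of which one to use for a given question.&lt;/p&gt;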

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzjuuwtcy5exc9ngjy64.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzjuuwtcy5exc9ngjy64.png" alt=" " width="800" height="502"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Future: Composition, Not Competition
&lt;/h2&gt;

&lt;p&gt;The industry loves a clean narrative. RAG is dead. GraphRAG wins. Agents replace everything.&lt;/p&gt;

&lt;p&gt;None of that is how it actually plays out in production systems.&lt;br&gt;
What we're seeing in Microsoft's research, in enterprise AI deployments, and in the emerging architecture patterns at companies like Uber, Airbnb, and LinkedIn is convergence toward hybrid, layered systems where each approach plays to its strengths.&lt;/p&gt;

&lt;h2&gt;
  
  
  The simplest mental model
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7esbljxn8md1b861nc6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7esbljxn8md1b861nc6q.png" alt=" " width="800" height="243"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Or more concisely: RAG finds information. GraphRAG finds connections. Agents decide how to use both.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The future isn't choosing one acronym over another.&lt;/strong&gt; It's building systems smart enough to know when &lt;strong&gt;each approach applies&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;A Practical Decision Framework for Teams Building AI Systems Today&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d3uhyhci0ncs2jpz43o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7d3uhyhci0ncs2jpz43o.png" alt=" " width="800" height="410"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most teams don't fail because they chose the wrong technology.&lt;/strong&gt; They &lt;strong&gt;fail because they never got clear on what they were actually trying to fix&lt;/strong&gt;. A few honest questions asked early can save months of over-engineering.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with the failure, not the solution&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ask yourself: what is actually going wrong right now?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If users are saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The &lt;strong&gt;answer is incorrect&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It didn't &lt;strong&gt;pick the right document&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's a RAG quality problem, not a graph problem&lt;/strong&gt;. Fix the fundamentals first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Better&lt;/strong&gt; &lt;strong&gt;chunking&lt;/strong&gt; strategies&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Higher quality embeddings&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Stronger &lt;strong&gt;reranking&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But if users are saying:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It doesn't &lt;strong&gt;understand how things are connected&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;It misses &lt;strong&gt;relationships between entities&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;That's a structural gap&lt;/strong&gt;. That's where &lt;strong&gt;graphs start making sense&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not every domain is a graph domain&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Some domains are naturally relational: relationships aren't optional, they're the system.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Drug interactions in healthcare&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Organizational&lt;/strong&gt; hierarchies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Legal&lt;/strong&gt; precedents&lt;/li&gt;
&lt;li&gt;Financial dependencies&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Supply&lt;/strong&gt; &lt;strong&gt;chain&lt;/strong&gt; networks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Many common applications are not like this:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Document Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Policy lookup systems&lt;/li&gt;
&lt;li&gt;Internal copilots&lt;/li&gt;
&lt;li&gt;Knowledge assistants&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For these, well-built RAG is often more than enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Be honest about what maintenance actually costs&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A knowledge &lt;strong&gt;graph is not a one-time build.&lt;/strong&gt; It's a living system that requires:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Continuous entity resolution&lt;/li&gt;
&lt;li&gt;Relationship validation&lt;/li&gt;
&lt;li&gt;Ongoing extraction pipelines&lt;/li&gt;
&lt;li&gt;Schema evolution as data changes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If the ownership isn't there to sustain this, the graph will drift from reality, and once users lose trust, no architecture can win it back.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sometimes the bottleneck isn't retrieval at all&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If your system needs to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Work across multiple data sources&lt;/li&gt;
&lt;li&gt;Call APIs dynamically&lt;/li&gt;
&lt;li&gt;Adapt based on intermediate results&lt;/li&gt;
&lt;li&gt;Execute multi-step reasoning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then the RAG vs. graph debate is beside the point. Your bottleneck is orchestration, and that's where agentic architectures deliver the most value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start simple. Evolve with evidence, not assumptions.&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start with a clean, well implemented RAG pipeline&lt;/li&gt;
&lt;li&gt;Observe where it fails in real usage&lt;/li&gt;
&lt;li&gt;Then decide: does this failure require relationships (Graph) or coordination (Agents)?&lt;/li&gt;
&lt;/ul&gt;
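&lt;p&gt;That triage can be captured as a rough heuristic. The symptom labels and recommendations below are illustrative, not a standard taxonomy:&lt;/p&gt;

```python
# A hedged sketch of the decision framework above: map observed failure
# symptoms to the layer most likely at fault. Labels are invented.

def diagnose(symptoms):
    """Return which investment the evidence points to, checking the
    simplest explanation (retrieval quality) first."""
    if symptoms & {"wrong_document", "incorrect_answer"}:
        return "improve RAG fundamentals (chunking, embeddings, reranking)"
    if symptoms & {"missed_relationship", "no_entity_connections"}:
        return "consider a knowledge graph"
    if symptoms & {"multi_source", "multi_step", "needs_live_apis"}:
        return "invest in agentic orchestration"
    return "collect more evidence before changing the architecture"
```

&lt;p&gt;The ordering is deliberate: retrieval-quality fixes are checked before graphs or agents, matching the "start simple, evolve with evidence" advice.&lt;/p&gt;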

&lt;p&gt;&lt;strong&gt;Not trends. Not what worked for another team. Actual evidence from your system.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You don't start with GraphRAG. You earn your way into it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Edge, D. et al. (2024). From Local to Global: A Graph RAG Approach to Query-Focused Summarization. Microsoft Research.&lt;/li&gt;
&lt;li&gt;Mallen, A. et al. (2023). When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories. ACL 2023.&lt;/li&gt;
&lt;li&gt;Liu, N.F. et al. (2023). Lost in the Middle: How Language Models Use Long Contexts. Stanford University.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The next time someone declares a technology dead,&lt;/strong&gt; look closer: chances are it's just being absorbed into something bigger. The most resilient AI systems aren't built on a single winning bet. They're built on clarity: knowing what problem you're solving, what tool solves it best, and how to compose them intelligently when complexity demands it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG finds. Graphs connect. Agents reason. None of them wins alone&lt;/strong&gt;, but together, in the right architecture, they form something greater than the sum of their parts.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The engineers who will build the most capable systems aren't the ones chasing the newest headline. They're the ones who resist the hype cycle long enough to ask the harder question:&lt;/strong&gt; not "what's the best technology?" but "what does my problem actually need?"&lt;/p&gt;

&lt;p&gt;That discipline, matching tools to problems rather than problems to tools, is what separates trend followers from system builders.&lt;br&gt;
In a field that reinvents itself every six months, that kind of thinking isn't just useful.&lt;/p&gt;

&lt;p&gt;It's the only thing that ages well. And finally:&lt;/p&gt;

&lt;p&gt;The goal was never just to &lt;strong&gt;retrieve text&lt;/strong&gt;. The goal is to help &lt;strong&gt;systems understand, connect, and use knowledge&lt;/strong&gt; in a way that actually supports reasoning. We're getting closer and the path runs through all of these ideas at once.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>What are Pre-Trained Models, Fine-Tuning, RAG, and Prompt Engineering? A Simple Kitchen Guide</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sat, 11 Apr 2026 02:03:29 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/what-are-pre-trained-models-fine-tuning-rag-and-prompt-engineering-a-simple-kitchen-guide-594b</link>
      <guid>https://future.forem.com/sreeni5018/what-are-pre-trained-models-fine-tuning-rag-and-prompt-engineering-a-simple-kitchen-guide-594b</guid>
      <description>&lt;p&gt;&lt;strong&gt;&lt;em&gt;Explained Using Food The Analogy That Finally Makes It Click&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve been asked the same question a thousand times. It comes from &lt;strong&gt;senior engineers moving into AI&lt;/strong&gt;. It comes from &lt;strong&gt;product managers&lt;/strong&gt; in architecture reviews. It comes from &lt;em&gt;founders building their first AI product&lt;/em&gt;. And it always sounds like some version of this:&lt;/p&gt;

&lt;h2&gt;
  
  
  The Question I Hear Every Week
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;“When should I &lt;strong&gt;fine-tune instead of just prompting better?&lt;/strong&gt;”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“&lt;strong&gt;What exactly is RAG&lt;/strong&gt;, and is it better than fine-tuning?”&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;“I keep hearing about &lt;strong&gt;pre-trained models&lt;/strong&gt;; what does pre-trained actually mean in practice?”&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;By the end of this blog&lt;/strong&gt;, you’ll be able to explain these three techniques to any &lt;strong&gt;colleague, technical&lt;/strong&gt; or &lt;strong&gt;non-technical&lt;/strong&gt;, in under two minutes. More importantly, you’ll know exactly which one to reach for in your own work.&lt;/p&gt;

&lt;p&gt;So I tried something different. &lt;strong&gt;I used food&lt;/strong&gt;. &lt;strong&gt;And it worked better than anything else I’ve tried.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's EAT
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ul12rl1j3ja101ot339.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7ul12rl1j3ja101ot339.png" alt=" " width="800" height="535"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  1. &lt;strong&gt;🧊  Pre-Trained Model = Frozen Food&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Walk into any &lt;strong&gt;supermarket&lt;/strong&gt; and pick up a &lt;strong&gt;bag of frozen pasta from the freezer section&lt;/strong&gt;. A factory produced it using industrial equipment, professional chefs, tested recipes, &lt;strong&gt;and enormous quantities of ingredients, all before you arrived&lt;/strong&gt;. &lt;em&gt;You don't know every detail of how it was made. But you trust it&lt;/em&gt;, it works reliably, and you can have a meal in ten minutes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is a pre-trained model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Companies like &lt;strong&gt;Anthropic&lt;/strong&gt;, &lt;strong&gt;OpenAI&lt;/strong&gt;, &lt;strong&gt;Google&lt;/strong&gt;, and &lt;strong&gt;Meta&lt;/strong&gt; spend &lt;strong&gt;hundreds of millions of dollars training these models on internet-scale data: billions of web pages, books, code repositories, scientific papers, and conversations&lt;/strong&gt;. The result is a model that already understands language, can write and &lt;strong&gt;debug code&lt;/strong&gt;, reason through complex problems, &lt;strong&gt;translate between languages&lt;/strong&gt;, &lt;strong&gt;summarize&lt;/strong&gt; documents, and &lt;strong&gt;answer questions&lt;/strong&gt; across hundreds of domains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Industrial Scale Behind That Frozen Bag&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GPT-4 was trained on over 1 trillion tokens of text, roughly 750 billion words.&lt;/li&gt;
&lt;li&gt;Meta's open-source Llama 3 was trained on 15 trillion tokens.&lt;/li&gt;
&lt;li&gt;Training a frontier model requires thousands of specialized GPUs running for weeks.&lt;/li&gt;
&lt;li&gt;The compute cost alone can exceed $50–100 million USD for a single training run.&lt;/li&gt;
&lt;li&gt;This is why 99% of developers never train from scratch. They start from a pre-trained base and work from there.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;So what do you actually do with frozen food?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You heat it and eat it&lt;/strong&gt;. That is the whole job. In AI terms, &lt;strong&gt;this means prompt engineering&lt;/strong&gt;: the craft of writing instructions that get the best possible output from the model without changing a single internal setting. Techniques like &lt;strong&gt;chain-of-thought prompting&lt;/strong&gt;, &lt;strong&gt;few-shot examples&lt;/strong&gt;, &lt;strong&gt;system instructions&lt;/strong&gt;, and &lt;strong&gt;temperature control&lt;/strong&gt; are all just &lt;strong&gt;different ways of heating&lt;/strong&gt; the food more skillfully.&lt;/p&gt;

&lt;p&gt;A well-written prompt can unlock reasoning capabilities that seem almost magical. And the important thing to understand is: you are not changing the model. You are changing the conversation you are having with it.&lt;/p&gt;
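&lt;p&gt;Few-shot prompting, one of the "heating" techniques mentioned above, is literally just text assembly. This sketch shows the shape of such a prompt; the task and examples are invented:&lt;/p&gt;

```python
# A small sketch of few-shot prompt construction: the model is unchanged;
# only the text we send changes.

def build_prompt(task, examples, query):
    """Assemble a few-shot prompt: instruction, worked examples, then the query."""
    lines = [task, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_prompt(
    "Classify the sentiment as positive or negative.",
    [("Great service!", "positive"), ("Never again.", "negative")],
    "The pasta was wonderful.",
)
```

&lt;p&gt;The worked examples teach the format in-context; the trailing "Output:" invites the model to complete the pattern.&lt;/p&gt;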

&lt;h2&gt;
  
  
  &lt;strong&gt;Use the pre-trained model as is when…&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;General intelligence is enough for tasks like &lt;strong&gt;summarizing&lt;/strong&gt;, &lt;strong&gt;Q&amp;amp;A&lt;/strong&gt;, writing, &lt;strong&gt;code generation&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;You are prototyping or proving a concept and need speed over perfection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Budget&lt;/strong&gt; is a constraint: no training pipeline needed, just an API call&lt;/li&gt;
&lt;li&gt;The &lt;strong&gt;task doesn't require specialized private knowledge&lt;/strong&gt; or consistent brand behavior&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Here's the transition most people miss:&lt;/strong&gt; The &lt;strong&gt;frozen food is brilliant for a quick&lt;/strong&gt;, &lt;strong&gt;satisfying meal&lt;/strong&gt;. But what if the default &lt;strong&gt;flavor doesn't taste like you?&lt;/strong&gt; &lt;strong&gt;What if your guests expect something that reflects your kitchen, your brand, your domain? That's when you reach for the seasoning.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. &lt;strong&gt;🌶️  Fine-Tuning = Adding Your Own Seasoning&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;You have your bag of frozen pasta&lt;/strong&gt;. But before you serve it, you add your &lt;strong&gt;own chilli oil&lt;/strong&gt;, &lt;strong&gt;roasted garlic&lt;/strong&gt;, &lt;strong&gt;fresh herbs&lt;/strong&gt;, and a &lt;strong&gt;squeeze&lt;/strong&gt; of &lt;strong&gt;lemon&lt;/strong&gt;. The pasta itself is still the same factory product. The base structure is completely intact. But now it tastes like your &lt;strong&gt;pasta, your kitchen's signature&lt;/strong&gt;. Anyone who has eaten at your table before would recognize it.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-tuning works exactly like this.
&lt;/h2&gt;

&lt;p&gt;You take a pre-trained model and continue training it on a smaller, &lt;strong&gt;carefully curated dataset of your own&lt;/strong&gt;. You are not rebuilding from scratch; you start from those existing weight settings and nudge them in the direction you need. Think of it as turning dozens of those dials a few degrees, rather than starting from zero. &lt;strong&gt;The broad intelligence the model already has is preserved. What changes is how it behaves specifically for you.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Fine-tuning changes how the model behaves.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;It does not change what the model knows.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This distinction matters enormously and it's where teams go wrong.&lt;/strong&gt; If your legal AI product needs to produce documents in the exact format your senior partners expect, fine-tune. But if your product needs to answer questions about a case filed last Tuesday, fine-tuning won't help. That filed case isn't in the training data. That's RAG's job — we'll get there.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What fine-tuning actually looks like in practice&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;You collect hundreds or thousands of example input–output pairs that demonstrate the exact behavior you want. For a medical coding assistant, that might be clinical notes paired with correct ICD-10 billing codes. For a brand voice bot, it might be customer messages paired with ideal responses in your company's tone. This dataset is fed into the training process and the model updates its weights to match your examples. The process typically costs hundreds to thousands of dollars in &lt;strong&gt;GPU compute, takes hours to days&lt;/strong&gt; depending on scale, and requires careful evaluation before you deploy.&lt;/p&gt;
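&lt;p&gt;The dataset-preparation step might look roughly like this. The JSONL chat format shown is a common convention but varies by provider, and the clinical examples are invented:&lt;/p&gt;

```python
# A sketch of preparing a fine-tuning dataset as JSONL: one input-output
# pair per line. Field names follow a common chat-style convention; check
# your provider's documentation for the exact schema.

import json

pairs = [
    ("Patient reports persistent cough for 3 weeks.", "R05.3"),
    ("Type 2 diabetes without complications.", "E11.9"),
]

def to_jsonl(pairs):
    records = []
    for note, code in pairs:
        records.append(json.dumps({
            "messages": [
                {"role": "user", "content": note},
                {"role": "assistant", "content": code},
            ]
        }))
    return "\n".join(records)

dataset = to_jsonl(pairs)
```

&lt;p&gt;Real datasets need hundreds to thousands of such pairs, plus a held-out evaluation set to verify the tuned behavior before deployment.&lt;/p&gt;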

&lt;h2&gt;
  
  
  &lt;strong&gt;Fine-tuning adjusts these things well&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Tone and writing style: formal, clinical, conversational, legal, brand-specific&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Output structure: consistent JSON schemas&lt;/strong&gt;, report templates, specific formatting rules&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Domain vocabulary: medical codes&lt;/strong&gt;, &lt;strong&gt;legal terminology&lt;/strong&gt;, internal product names and systems&lt;/li&gt;
&lt;li&gt;Default response behavior: how the model handles edge cases and ambiguous inputs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Prompt efficiency&lt;/strong&gt;: a fine-tuned model often needs shorter system prompts, saving cost at scale&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Fine-tuning cannot do these things, so don't ask it to&lt;/strong&gt;
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Update the model's knowledge of world events; its understanding is frozen at training time&lt;/li&gt;
&lt;li&gt;Give the model access to your private documents at query time — that is RAG&lt;/li&gt;
&lt;li&gt;Prevent hallucination on specific facts; a fine-tuned model still makes things up&lt;/li&gt;
&lt;li&gt;Replace re-training when your data changes; you must re-fine-tune, which is expensive&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Real companies using fine-tuning today&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub Copilot&lt;/strong&gt; is built on models fine-tuned on billions of lines of public code; that's why it produces completions that match common coding patterns and library conventions far better than a general-purpose model would.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Harvey AI&lt;/strong&gt; fine-tunes on legal documents and case law so that it consistently produces output matching the precise language, structure, and citation style that lawyers expect from a junior associate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Med-PaLM 2 (Google)&lt;/strong&gt; is fine-tuned specifically on medical question-answer pairs, reaching expert-level performance on US Medical Licensing Examination questions, a benchmark on which a general-purpose model performs far worse.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use fine-tuning when…&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;The model's default tone or output format doesn't fit your use case&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;You have hundreds or thousands of high-quality labelled examples&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Consistency in style&lt;/strong&gt; and &lt;strong&gt;format matters&lt;/strong&gt; more than freshness of knowledge&lt;/li&gt;
&lt;li&gt;You are making thousands of &lt;strong&gt;API calls daily and need to reduce prompt length for cost&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The seasoning has done its job. Your dish now has a recognizable identity.&lt;/strong&gt; But there is still one problem that no amount of seasoning can solve: the frozen pasta was made months ago. What happens when your customer asks a question about something that happened last week? What happens when they need an answer based on your private internal documents that have never been part of any training dataset? For that, you need fresh ingredients, and that's &lt;strong&gt;where RAG completely changes the game&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;3. 🥗  RAG = Serving Fresh Side Dishes&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before I explain RAG&lt;/strong&gt;, I need to explain &lt;strong&gt;the problem it solves because once you understand the problem, the solution becomes completely obvious.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The hallucination problem and why it matters&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Language models hallucinate.&lt;/strong&gt; &lt;strong&gt;That is not a bug that will eventually be fixed&lt;/strong&gt;. &lt;strong&gt;It is a fundamental property of how they work.&lt;/strong&gt; When a model is asked a question it cannot confidently answer from its training data (an event that happened last month, &lt;strong&gt;a number from your private database, a policy you updated last quarter&lt;/strong&gt;), it &lt;strong&gt;does not say 'I don't know.'&lt;/strong&gt; &lt;strong&gt;It produces a fluent, confident, completely fabricated answer.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A hallucinated answer looks exactly like a correct one. &lt;strong&gt;Same tone, same confidence, same formatting. A model will tell you that a law was passed on a specific date&lt;/strong&gt;, that a case was decided a certain way, that a product specification has specific numbers and be entirely wrong. For consumer chatbots, this is annoying. In healthcare, legal, financial, and compliance contexts, it can be catastrophic.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why models hallucinate in plain English&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A language model's job is to predict the most statistically likely next word or sentence given the context.&lt;/strong&gt; When the correct answer isn't in its training data, it doesn't have an 'I don't know' mode; it has only a 'generate the most plausible continuation' mode. The result is confident-sounding fabrication.&lt;/p&gt;

&lt;p&gt;This is not fixable by making the model bigger or training it longer. The only reliable solution is to give it the correct information as context at query time, which is exactly what RAG does.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Now: what is RAG?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Back to the kitchen.&lt;/strong&gt; &lt;strong&gt;You heat your frozen pasta the pasta itself is completely unchanged&lt;/strong&gt;. But tonight you serve it alongside &lt;strong&gt;a fresh caprese salad made this morning&lt;/strong&gt;, &lt;strong&gt;warm garlic bread just out of the oven, and a sauce from tomatoes picked an hour ago&lt;/strong&gt;. The pasta is still the factory's pasta. But the meal is elevated, current, and specific to tonight because you brought &lt;strong&gt;real, live ingredients to the table.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;That is &lt;strong&gt;R&lt;/strong&gt;etrieval-&lt;strong&gt;A&lt;/strong&gt;ugmented &lt;strong&gt;G&lt;/strong&gt;eneration (RAG). The model is not changed. Instead, at the exact moment someone asks a question, your system fetches relevant, &lt;strong&gt;up-to-date information from an external source (your documents, your database, your internal knowledge base) and places that information into the model's context window before asking it to answer.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;What is a 'context window'?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Think of the &lt;strong&gt;context window&lt;/strong&gt; as the &lt;strong&gt;model's short-term memory (STM)&lt;/strong&gt;: everything it can see and reason about in a single conversation. It has a fixed size. When we do RAG, we use part of that window to inject the retrieved documents, essentially saying: 'Here is what you need to know to answer this question accurately. Now answer it.' The model reasons over both its trained knowledge and the fresh material we just handed it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The RAG pipeline step by step&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is exactly what happens behind the scenes every time a RAG-enabled system answers a question:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The 6 steps of a RAG pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The user submits a question, for example: 'What is our current parental leave policy?'&lt;/li&gt;
&lt;li&gt;The system converts that question into a vector embedding: a list of numbers representing its meaning in mathematical space.&lt;/li&gt;
&lt;li&gt;A similarity search runs against a &lt;strong&gt;vector database&lt;/strong&gt; (&lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Weaviate&lt;/strong&gt;, &lt;strong&gt;ChromaDB&lt;/strong&gt;, &lt;strong&gt;pgvector&lt;/strong&gt;, &lt;strong&gt;OpenSearch&lt;/strong&gt;) and retrieves the document chunks that are mathematically closest in meaning to the question.&lt;/li&gt;
&lt;li&gt;In some systems, a &lt;strong&gt;re-ranker&lt;/strong&gt; then scores these chunks by relevance and selects the best ones.&lt;/li&gt;
&lt;li&gt;Those &lt;strong&gt;chunks&lt;/strong&gt; are &lt;strong&gt;injected into the model's context window&lt;/strong&gt; alongside the original question: 'Here is relevant information. Using only this, answer the question accurately.'&lt;/li&gt;
&lt;li&gt;The model generates a response that is grounded in the retrieved content, not in its training memory, and can cite the source document by name.&lt;/li&gt;
&lt;/ul&gt;
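&lt;p&gt;The steps above can be compressed into a toy end-to-end sketch. The bag-of-words "embedding" stands in for a learned embedding model, and the final generation step is replaced by returning the grounded prompt:&lt;/p&gt;

```python
# A toy RAG pipeline: embed, retrieve by cosine similarity, assemble a
# grounded prompt. Real systems use learned embeddings and a vector DB.

from collections import Counter
import math

def embed(text):
    # Bag-of-words stand-in for a real embedding model.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

CHUNKS = [
    "Parental leave policy: 16 weeks paid leave for all employees.",
    "Expense policy: submit receipts within 30 days.",
]

def retrieve(question, k=1):
    # Rank chunks by similarity to the question; keep the top k.
    scored = sorted(CHUNKS, key=lambda c: cosine(embed(question), embed(c)),
                    reverse=True)
    return scored[:k]

def build_grounded_prompt(question):
    context = "\n".join(retrieve(question))
    return (f"Here is relevant information:\n{context}\n\n"
            f"Using only this, answer: {question}")

prompt = build_grounded_prompt("What is our current parental leave policy?")
```

&lt;p&gt;Everything that makes production RAG hard (chunking, embedding quality, re-ranking) lives inside &lt;code&gt;embed&lt;/code&gt; and &lt;code&gt;retrieve&lt;/code&gt;; the model only ever sees the final grounded prompt.&lt;/p&gt;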

&lt;p&gt;&lt;strong&gt;The key thing to understand:&lt;/strong&gt; the quality of your answers in a RAG system depends almost entirely on the quality of your retrieval. Naive RAG, simply dumping documents into a vector database and hoping, produces mediocre results at scale. Production RAG is an engineering discipline: thoughtful chunking strategies, the right embedding model, tuned retrieval parameters, and &lt;strong&gt;post-retrieval re-ranking. The model is the least of your concerns.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real companies using RAG today&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Notion AI&lt;/strong&gt; uses RAG to let users ask questions about their own workspace content. The model has no idea what is in your Notion pages until the RAG pipeline retrieves and injects the relevant pages at query time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Perplexity AI&lt;/strong&gt; is essentially a RAG system at its core: it retrieves live web pages and uses a language model to synthesize an answer with citations. No fine-tuning required for the freshness.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Legal and compliance tools&lt;/strong&gt; at enterprise firms use RAG to answer questions about thousands of private contracts, regulations, and precedents, data that can never be used in training because of sensitivity and confidentiality requirements.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Use RAG when…&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Your data changes frequently: products, &lt;strong&gt;prices, policies, news, regulations&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your data is &lt;strong&gt;private&lt;/strong&gt; or &lt;strong&gt;sensitive&lt;/strong&gt; and cannot be part of a training pipeline&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You need answers to be accurate and traceable; citations matter&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Updating knowledge should not require retraining; just update the database&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;You need to eliminate hallucination on specific factual questions&lt;/strong&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Fresh side dishes served. The meal is extraordinary.&lt;/strong&gt; But here's the thing the best chefs know: a three-course meal beats any single dish. &lt;strong&gt;The future of enterprise AI is not pre-trained or fine-tuned or RAG. It's all three, deliberately layered, which is what we'll look at next.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Full Kitchen: When You Need All Three
&lt;/h2&gt;

&lt;p&gt;The most powerful AI products in production today combine all three techniques. The food analogy holds perfectly:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A real enterprise AI assistant: all three layers working together&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 1: The frozen food (pre-trained model):&lt;/strong&gt; &lt;strong&gt;GPT-4o&lt;/strong&gt; or &lt;strong&gt;Claude 3.5 Sonnet&lt;/strong&gt; provides the base intelligence: language understanding, reasoning, code generation. No one trains this from scratch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 2: The seasoning (fine-tuning):&lt;/strong&gt; The model is fine-tuned on the company's internal communication style, product naming conventions, escalation procedures, and output formats. Now it sounds like the company.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer 3: The fresh sides (RAG):&lt;/strong&gt; At query time, the system retrieves the live knowledge base: current product specs, today's pricing, this week's policy updates, this customer's order history. Now the answers are both brand-consistent and factually current.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; an assistant that always talks like your company, always knows your latest information, and never makes up facts it doesn't have. That's not a single technique. That's a kitchen running three stations at once.&lt;/p&gt;
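&lt;p&gt;The layering is easy to see in code. A minimal sketch of how the three layers meet at request time: the fine-tuned style lives in the system message, and RAG injects retrieved facts into the prompt. Function and field names here are illustrative, not any particular SDK's API.&lt;/p&gt;

```python
def build_prompt(system_style, retrieved_docs, question):
    # Layers 1-2 (base model + fine-tune) are fixed before the request is made;
    # Layer 3 (RAG) injects fresh, retrieved facts at query time.
    context = "\n".join(f"- {doc}" for doc in retrieved_docs)
    return (
        f"{system_style}\n\n"
        "Answer using ONLY the facts below. If the answer is not there, say so.\n"
        f"Facts:\n{context}\n\n"
        f"Question: {question}"
    )
```

&lt;p&gt;The instruction to answer only from the supplied facts is what keeps the assistant from inventing answers when retrieval comes back empty.&lt;/p&gt;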

&lt;h2&gt;
  
  
  &lt;strong&gt;How agents use all three techniques together&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The pre-trained model&lt;/strong&gt; is the agent's core reasoning engine: it reads the task, &lt;strong&gt;makes decisions&lt;/strong&gt;, and generates instructions for each step.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fine-tuning&lt;/strong&gt; can make the agent better at following specific agentic patterns such as tool use, &lt;strong&gt;self-reflection&lt;/strong&gt;, and &lt;strong&gt;multi-step planning&lt;/strong&gt;, so it behaves more reliably in your particular workflow.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG&lt;/strong&gt; gives the agent access to &lt;strong&gt;live information&lt;/strong&gt; at each step: it retrieves what it needs, acts on it, &lt;strong&gt;retrieves again, acts again, so the agent always works with current data&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The bottom line:&lt;/strong&gt; agents are not a fourth technique that replaces the three. They are an architecture that sits on top of all three. You cannot build a reliable agent without understanding the foundations. The kitchen analogy extends: if &lt;strong&gt;pre-trained&lt;/strong&gt; is the &lt;strong&gt;frozen food&lt;/strong&gt;, &lt;strong&gt;fine-tuning is the seasoning&lt;/strong&gt;, and &lt;strong&gt;RAG is the fresh sides&lt;/strong&gt;, then &lt;strong&gt;agents are the chef who orchestrates&lt;/strong&gt; the whole meal in real time.&lt;/p&gt;
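&lt;p&gt;The retrieve, act, retrieve-again loop can be sketched in a few lines. This is a minimal ReAct-style skeleton; the &lt;code&gt;llm&lt;/code&gt; callable and the tool registry are assumptions standing in for your model client and your real tools.&lt;/p&gt;

```python
def run_agent(task, tools, llm, max_steps=5):
    # Minimal agent loop: the model decides, a tool acts, and the observation
    # feeds the next decision. `llm` returns a dict like
    # {"action": ..., "input": ...} or {"action": "finish", "answer": ...}.
    history = [f"Task: {task}"]
    for _ in range(max_steps):
        decision = llm(history)  # pre-trained (possibly fine-tuned) reasoning
        if decision["action"] == "finish":
            return decision["answer"]
        # Tool call / retrieval step: this is where RAG supplies live data.
        observation = tools[decision["action"]](decision["input"])
        history.append(f"{decision['action']} -> {observation}")
    return None  # step budget exhausted
```

&lt;p&gt;The &lt;code&gt;max_steps&lt;/code&gt; cap is a deliberate safety rail: without it, a confused model can loop forever.&lt;/p&gt;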

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Pre-trained models give you the dish.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Fine-tuning changes the taste.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;RAG brings fresh ingredients to the table.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Agents are the chef who runs the whole kitchen.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Fine-Tuning vs RAG vs Both
&lt;/h2&gt;

&lt;p&gt;Here is the comparison most architecture conversations need. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiam1nhxvz1mnlay7j9ag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiam1nhxvz1mnlay7j9ag.png" alt=" " width="800" height="537"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Takeaway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The next time someone on your team asks 'should we fine-tune or do RAG?' you now have the full answer&lt;/strong&gt;. Not just the technique names, but the underlying reason behind each choice, the &lt;strong&gt;tradeoffs in cost and complexity, the failure modes to avoid, and the mental model that makes all of it easy to explain.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I've watched engineers waste months on fine-tuning jobs they never needed. I've watched &lt;strong&gt;teams deploy naive RAG and wonder&lt;/strong&gt; why their accuracy is terrible. I've watched founders spend their first $50,000 on a &lt;strong&gt;problem that a better prompt would have solved in a day&lt;/strong&gt;. I wrote this blog because those mistakes are completely avoidable if you have the right mental model before you start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Start with the frozen food.&lt;/strong&gt; &lt;strong&gt;Season it when you need to&lt;/strong&gt;. &lt;strong&gt;Always bring fresh ingredients to the table.&lt;/strong&gt; And when you are ready to build something truly ambitious hire the chef to orchestrate the whole kitchen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save this blog.&lt;/strong&gt; You will want it in your next architecture conversation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Share it with one developer on your team who is confused about these techniques.&lt;/strong&gt; The clearest gift you can give them is a mental model that sticks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every comment, share, and save tells me what to write next. I read every single one.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>llm</category>
      <category>promptengineering</category>
      <category>rag</category>
    </item>
    <item>
      <title>Q, K, V : The Three Things Every Great Tech Lead Does Without Knowing It</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 06 Apr 2026 01:58:56 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/q-k-v-the-three-things-every-great-tech-lead-does-without-knowing-it-227i</link>
      <guid>https://future.forem.com/sreeni5018/q-k-v-the-three-things-every-great-tech-lead-does-without-knowing-it-227i</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;I’ve been thinking about &lt;strong&gt;transformer architecture a lot lately&lt;/strong&gt;, not just as an &lt;strong&gt;ML practitioner&lt;/strong&gt;, but as someone who has spent &lt;strong&gt;years in engineering teams&lt;/strong&gt;, watching how the best tech leads operate. And one day it just clicked: &lt;strong&gt;a great tech lead behaves almost exactly like the&lt;/strong&gt; &lt;strong&gt;self attention mechanism in a transformer.&lt;/strong&gt; Not as a loose metaphor, but as a surprisingly precise structural analogy.&lt;/p&gt;

&lt;p&gt;Bear with me. Once you see it, you can’t unsee it.&lt;/p&gt;

&lt;h2&gt;
  
  
  A quick refresher on self attention
&lt;/h2&gt;

&lt;p&gt;In a &lt;strong&gt;transformer&lt;/strong&gt;, each token in a sequence needs to understand its meaning in &lt;strong&gt;&lt;em&gt;context&lt;/em&gt;&lt;/strong&gt;. It can’t do that in isolation, so instead of processing itself alone, &lt;strong&gt;it looks at every other token in the sequence&lt;/strong&gt;, decides how &lt;strong&gt;relevant each one is&lt;/strong&gt;, and creates a &lt;strong&gt;weighted blend of information&lt;/strong&gt; from the whole sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This happens through three simple projections for every token:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query (Q):&lt;/strong&gt; What am I looking for right now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key (K):&lt;/strong&gt; What does each other token offer?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Value (V):&lt;/strong&gt; What should I actually take from them?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Attention&lt;/strong&gt;(Q, K, V) = &lt;strong&gt;softmax&lt;/strong&gt;( QKᵀ / √dₖ ) · V&lt;/p&gt;

&lt;p&gt;The output isn’t just the token’s raw embedding. It’s a &lt;strong&gt;&lt;em&gt;context-aware blend&lt;/em&gt;&lt;/strong&gt;: what this token means given everything around it. The whole is smarter than the sum of its parts.&lt;/p&gt;
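&lt;p&gt;The formula above is short enough to implement directly. A plain-Python sketch of scaled dot-product attention, using lists instead of tensors for readability:&lt;/p&gt;

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    # Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) · V.
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score this query against every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the weighted blend of the Value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out
```

&lt;p&gt;With a query strongly aligned to one key, the output is almost entirely that token’s Value, yet every other Value still contributes a sliver, exactly the blending behavior described above.&lt;/p&gt;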

&lt;h2&gt;
  
  
  Now map that onto your tech lead
&lt;/h2&gt;

&lt;p&gt;A team is, in this framing, a &lt;strong&gt;sequence&lt;/strong&gt; of &lt;strong&gt;people&lt;/strong&gt;, each carrying different &lt;strong&gt;skills&lt;/strong&gt;, &lt;strong&gt;contexts&lt;/strong&gt;, and domain knowledge. The tech lead’s job is to make that sequence &lt;strong&gt;produce coherent&lt;/strong&gt;, &lt;strong&gt;high-quality output&lt;/strong&gt;. Sound familiar?&lt;/p&gt;

&lt;p&gt;&lt;em&gt;The &lt;strong&gt;tech lead doesn’t process problems one person at a time.&lt;/strong&gt; They hold the whole team in mind &lt;strong&gt;simultaneously&lt;/strong&gt;, &lt;strong&gt;weighting&lt;/strong&gt; each &lt;strong&gt;person’s input&lt;/strong&gt; by its &lt;strong&gt;relevance&lt;/strong&gt; to the &lt;strong&gt;problem at hand&lt;/strong&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tech Lead as a Transformer: Scaling Attention in Your Team
&lt;/h2&gt;

&lt;p&gt;In the world of Large Language Models, the &lt;strong&gt;Transformer&lt;/strong&gt; architecture changed everything by mastering the art of "Attention." But the mechanics of a transformer (Queries, Keys, and Values) aren't just for silicon; they are a perfect blueprint for high-performing engineering leadership.&lt;/p&gt;

&lt;p&gt;If you want to scale your team’s impact, you have to stop managing tasks and start mastering the attention operation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1syzp5o0ihn5pi22h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F02i1syzp5o0ihn5pi22h.png" alt=" " width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcbwi2f0207rhf68os02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjcbwi2f0207rhf68os02.png" alt=" " width="800" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Q: Read the problem precisely before reacting&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The principle:&lt;/strong&gt; &lt;strong&gt;Before you reach for a person&lt;/strong&gt;, you must understand the exact shape of what you need. &lt;strong&gt;A vague question finds the wrong answer&lt;/strong&gt;. A precise question finds the right person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN THE TRANSFORMER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every token generates a Query vector&lt;/strong&gt;, a precise representation of the context it is searching for. The word “&lt;strong&gt;crash&lt;/strong&gt;” needs to know if it is &lt;strong&gt;financial&lt;/strong&gt; or &lt;strong&gt;physical&lt;/strong&gt;. Its Query is asking: &lt;em&gt;“what domain am I in?”&lt;/em&gt; The word “it” needs to find its antecedent. Its Query is asking: &lt;em&gt;“who am I referring to?”&lt;/em&gt; The Query gets scored against every other token’s Key. &lt;strong&gt;The more precise the Query, the more accurately the model attends to the right context.&lt;/strong&gt; A sloppy Query means the model attends to the wrong tokens and the output degrades, no matter how good the rest of the sequence is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN YOUR TECH LEAD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It’s 11pm on Tuesday&lt;/strong&gt;. &lt;strong&gt;API latency has spiked to 8 seconds.&lt;/strong&gt; Alerts are firing. &lt;strong&gt;A weak tech lead fires a message to the whole channel&lt;/strong&gt; &lt;em&gt;“Hey, who can look at this?”&lt;/em&gt; That is not a Query. That is a panic broadcast: the problem has not been read at all, just forwarded.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A strong tech lead takes fifteen seconds before typing anything.&lt;/strong&gt; They are reading the &lt;strong&gt;problem&lt;/strong&gt; &lt;strong&gt;precisely&lt;/strong&gt;: is this a &lt;strong&gt;database&lt;/strong&gt; write &lt;strong&gt;bottleneck&lt;/strong&gt;? A &lt;strong&gt;bad&lt;/strong&gt; &lt;strong&gt;deploy&lt;/strong&gt;? A downstream dependency choking? A traffic spike? Each of those is a different Query, and each points to a different person. Reading the problem precisely before reacting is not hesitation; it is the entire foundation of what comes next. &lt;strong&gt;Get the Query wrong and everything downstream is wasted effort.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  K: Know what each engineer truly carries
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The principle:&lt;/strong&gt; &lt;strong&gt;Not their job title. Not their years of experience.&lt;/strong&gt; What they &lt;em&gt;actually&lt;/em&gt; carry right now: the specific knowledge, the lived context, the warm mental model that matches this exact problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN THE TRANSFORMER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every token generates a Key vector&lt;/strong&gt;, a representation of what it holds and can offer to others. When a Query asks &lt;em&gt;“what domain am I in?”&lt;/em&gt;, the Keys from surrounding tokens compete to answer. The attention score between two tokens is the dot product of one’s Query against the other’s Key. High alignment means high attention. Low alignment means that token fades. The Key is not the same as the Value; the Key is the advertisement that says &lt;em&gt;“I am relevant to your question.”&lt;/em&gt; What gets extracted once that match is confirmed is the Value, which we will get to next.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN YOUR TECH LEAD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Query is formed:&lt;/strong&gt; looks like a write contention issue in the orders table. Now the tech lead scans the team.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sreeni&lt;/strong&gt; is first online. Senior, reliable, composed under pressure. &lt;strong&gt;But his background is frontend&lt;/strong&gt;. His Key what he &lt;em&gt;truly&lt;/em&gt; carries doesn’t match this problem. High score on “reliable team member,” low score on this specific database crisis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ragavan&lt;/strong&gt; wrote the orders pipeline eighteen months ago. &lt;strong&gt;He knows every design decision, every shortcut, every known failure mode. His Key is a near perfect match for the Query&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Siva&lt;/strong&gt; debugged a nearly identical write contention issue two sprints ago. The mental model is warm. The patterns are fresh. Siva’s &lt;strong&gt;Key is both relevant &lt;em&gt;and&lt;/em&gt; current.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A tech lead who knows their team only by title pages &lt;strong&gt;Sreeni because he’s available&lt;/strong&gt;. &lt;strong&gt;A tech lead who truly knows what each engineer carries reaches for Ragavan and Siva&lt;/strong&gt;. The depth of your Key knowledge is the single biggest factor in whether your team’s intelligence gets used or wasted.&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;V: Extract the exact contribution that matters&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The principle:&lt;/strong&gt; Finding the right person is only half the job. The other half is knowing &lt;em&gt;what to pull from them&lt;/em&gt;: the specific piece of their knowledge that solves this problem right now, not everything they know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN THE TRANSFORMER&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Value vector is the real payload&lt;/strong&gt;. Once the attention scores are computed and we know how much to attend to each token, what we actually pull from them is their Value not their Key. The Key said &lt;strong&gt;&lt;em&gt;“I am relevant.”&lt;/em&gt;&lt;/strong&gt; The &lt;strong&gt;Value delivers what that relevance actually contains.&lt;/strong&gt; These are two separate learned representations and they can be very different from each other.&lt;/p&gt;

&lt;p&gt;The final output for any token is a weighted sum of the Value vectors from every token in the sequence &lt;strong&gt;including itself&lt;/strong&gt;. That is the “self” in &lt;strong&gt;self attention. High attention score means a large portion of that token’s Value flows into the output&lt;/strong&gt;. Low score means a small contribution but nothing is ever fully zeroed out. The result is a single enriched representation that carries synthesized meaning from across the whole sequence.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;IN YOUR TECH LEAD&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tech lead has reached Ragavan and Siva.&lt;/strong&gt; The Keys matched. Now comes the part most tech leads miss: extracting the &lt;em&gt;exact&lt;/em&gt; contribution that matters, not just getting them on a call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ragavan’s Value is specific:&lt;/strong&gt; the orders table has a known write hotspot on the status column. A nearly identical incident in 2022 was resolved by switching to a queue based write pattern. The full fix takes four hours, but there is a config level workaround that buys time right now. That is his Value vector: not his presence, not his seniority, but that precise, usable knowledge.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Siva’s Value is different:&lt;/strong&gt; a step by step diagnosis approach from the recent incident, three specific queries to run against the slow query log, and a clear hunch about which index is missing based on the pattern of the spike. &lt;strong&gt;Different from Ragavan’s. Equally specific. Equally usable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;*&lt;em&gt;The tech lead extracts architecture insight from Ragavan and live diagnosis steps from Siva *&lt;/em&gt; then synthesizes both into a single coherent response. Neither person alone had the full answer. The weighted combination of their two Value vectors did. &lt;em&gt;That&lt;/em&gt; is what great tech leadership actually produces.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A note for the technically precise: &lt;strong&gt;in actual self attention&lt;/strong&gt;, every token generates &lt;strong&gt;Q, K, and V simultaneously; each team member would be questioner, advertiser, and content provider all at once.&lt;/strong&gt; The analogy maps these roles onto distinct actors for clarity. That’s a deliberate simplification, and the right trade-off for a blog. The structural point holds.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;Softmax: decisive, not democratic&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;After the Query Key scores are computed for every token pair&lt;/strong&gt;, a &lt;strong&gt;softmax function sharpens the distribution&lt;/strong&gt;. The &lt;strong&gt;highest&lt;/strong&gt; scoring tokens get heavily weighted. &lt;strong&gt;Lower&lt;/strong&gt; scoring ones are suppressed, not erased, but pushed toward the edges. The result is focused, purposeful attention rather than diffuse averaging.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Great tech leads calibrate the same way&lt;/strong&gt;. During the incident, Ragavan and Siva carry the highest weights. Sreeni’s input on how to communicate the downtime to customers still matters and still flows into the output; he’s not ignored. But he doesn’t drive the technical response. The &lt;strong&gt;softmax&lt;/strong&gt; isn’t a veto. It’s a &lt;strong&gt;&lt;em&gt;weighting&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The ability to weight confidently without dismissing is one of the hardest skills in the role. Too much sharpening and you become a dictator. Too little and you’re running a committee. The best tech leads calibrate this by problem type, stakes, and who is genuinely best positioned to contribute right now.&lt;/p&gt;
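&lt;p&gt;The "decisive, not democratic" behavior is visible in a few lines of Python: softmax sharpens the distribution toward the top score, yet no weight ever reaches exactly zero.&lt;/p&gt;

```python
import math

def softmax(scores):
    # Subtracting the max is the standard numerical-stability trick.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

&lt;p&gt;With hypothetical relevance scores like &lt;code&gt;[3.0, 1.0, 0.2]&lt;/code&gt; for Ragavan, Siva, and Sreeni, the top score dominates the output, but the smallest weight stays positive: suppressed, never erased.&lt;/p&gt;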

&lt;h2&gt;
  
  
  &lt;strong&gt;Multi-head attention: running several concerns at once&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Real transformers use multi-head attention: several independent attention operations running in parallel, each learning to track a different type of relationship in the sequence. One head catches syntactic structure. Another tracks semantic similarity. &lt;strong&gt;Another handles long range dependencies&lt;/strong&gt;. The outputs are concatenated and projected into a single unified representation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Watch a strong tech lead manage a major incident&lt;/strong&gt; and you’ll see exactly this. One part of their mind is tracking the technical diagnosis. Another is watching team stress levels and deciding when to rotate people off the call. Another is composing the stakeholder update due in twenty minutes. &lt;strong&gt;Another is already thinking about the post-mortem structure and what process change this incident should trigger. None of those heads switches off while the others run&lt;/strong&gt;. The incident gets resolved, the team stays functional, stakeholders are informed, and the right lesson gets captured because all four heads ran and synthesized their outputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MultiHead(Q, K, V) = Concat(head₁, …, headₙ) · Wᵒ&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;head₁ = technical diagnosis, head₂ = team health &amp;amp; stress&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;head₃ = stakeholder comms, head₄ = process &amp;amp; post-mortem&lt;/strong&gt;&lt;/p&gt;
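&lt;p&gt;Structurally, multi-head attention is just "run the heads independently, concatenate, project." A stub sketch, where the heads and the projection are placeholder callables rather than learned matrices:&lt;/p&gt;

```python
def multi_head(x, heads, project):
    # Concat(head_1, ..., head_n) · W^O: every head sees the same input,
    # their outputs are concatenated, then projected back into one representation.
    concat = [value for head in heads for value in head(x)]
    return project(concat)
```

&lt;p&gt;The essential property is that no head blocks another: each runs on the full input, and only the final projection combines them.&lt;/p&gt;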

&lt;h2&gt;
  
  
  &lt;strong&gt;Why the old model fails: the RNN problem&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Before transformers, the dominant approach was recurrent neural networks — process one token at a time, pass a hidden state forward, repeat. The problem was fundamental: information from early in the sequence degraded over time, gradients vanished on long sequences, and nothing could be parallelized. Every step depended on the last.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;command-and-control manager is an RNN&lt;/strong&gt;. Every problem routes through them serially. Context from earlier conversations gets dropped. Team throughput is capped at the manager’s personal bandwidth. In a small team this is merely inefficient. &lt;strong&gt;In a scaling organization it becomes catastrophic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The tech lead who operates like self-attention doesn’t become the bottleneck.&lt;/strong&gt; They become the &lt;em&gt;context layer&lt;/em&gt;: the mechanism that helps the whole team understand the situation more clearly and move together faster. &lt;strong&gt;The team’s intelligence is the output. Not the manager’s.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  &lt;strong&gt;So what does a great tech lead actually look like?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;They’re the one who pauses before reacting, forming the Query before reaching for a person. They’re the one who knows that Ragavan is the right call at 11pm not because he’s available, but because he wrote the system. They’re the one who doesn’t just ping the right people, but knows exactly what to extract from each of them and how to stitch those pieces into a response no single engineer could have produced alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;They run multiple heads simultaneously without dropping any. Technical diagnosis, team morale, stakeholder communication, process improvement&lt;/strong&gt; &lt;strong&gt;all running in parallel, all synthesized into a single coherent output&lt;/strong&gt;. And they do it without becoming the bottleneck, without turning every decision into a committee, and without making anyone feel unseen.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;That is self attention. Not as a metaphor. As a description of the job.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Attention is all you need. And a tech lead who truly understands that who attends broadly, weights wisely, and synthesizes instead of dictating is everything a team needs to become more than the sum of its people.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Run Open Source AI Models with Docker Model Runner</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sun, 05 Apr 2026 01:52:02 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/run-open-source-ai-modelswith-docker-model-runner-5hei</link>
      <guid>https://future.forem.com/sreeni5018/run-open-source-ai-modelswith-docker-model-runner-5hei</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you've spent any time in &lt;strong&gt;software&lt;/strong&gt; &lt;strong&gt;development&lt;/strong&gt;, &lt;strong&gt;cloud&lt;/strong&gt; engineering, or &lt;strong&gt;microservices&lt;/strong&gt; architecture, the name &lt;strong&gt;Docker&lt;/strong&gt; needs no introduction. But for those newer to the ecosystem, here's the short version.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker&lt;/strong&gt; is an &lt;strong&gt;open platform for developing, shipping, and running applications&lt;/strong&gt;. Its core idea is elegant: separate your application from the underlying infrastructure so you can build fast, test consistently, and deploy confidently. By standardizing how code is packaged and delivered, Docker dramatically shrinks the gap between "it works on my machine" and "it works in production."&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Docker Desktop?
&lt;/h3&gt;

&lt;p&gt;Docker Desktop takes everything Docker offers and wraps it into a single, batteries-included application for macOS, Windows, and Linux. It bundles the Docker Engine, CLI, &lt;strong&gt;Docker&lt;/strong&gt; &lt;strong&gt;Compose&lt;/strong&gt;, &lt;strong&gt;Kubernetes&lt;/strong&gt;, and a &lt;strong&gt;visual dashboard&lt;/strong&gt;, giving developers a complete container workflow without ever touching low-level OS configuration.&lt;/p&gt;

&lt;p&gt;Over the years, Docker Desktop has become the &lt;strong&gt;de facto local development environment for millions of engineers worldwide&lt;/strong&gt;. Version 4.x doubled down on AI workloads, and the latest releases ship with &lt;strong&gt;Docker Model Runner&lt;/strong&gt; as a first-class, built-in feature, accessible directly from the Docker Dashboard or the CLI you already use every day.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Docker Model Runner?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Docker Model Runner (DMR)&lt;/strong&gt; is an inference engine embedded directly into Docker Desktop. It lets you pull, run, and interact with open-source large language models using the same familiar &lt;code&gt;docker&lt;/code&gt; CLI: no new tools, no configuration headaches, no surprises.&lt;/p&gt;

&lt;p&gt;Under the hood, DMR uses &lt;strong&gt;llama.cpp&lt;/strong&gt; as its runtime backend, delivering high performance inference on both CPU and GPU — Metal on Apple Silicon, CUDA on Linux and Windows out of the box.&lt;/p&gt;

&lt;p&gt;Models are distributed as OCI compliant artifacts through Docker Hub's &lt;strong&gt;&lt;code&gt;ai/&lt;/code&gt; namespace&lt;/strong&gt;. That means model versioning, access control, and distribution are all handled by the same battle tested infrastructure already powering your container images.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;"What Docker did for application packaging, Model Runner does for AI inference one pull command, consistent behavior everywhere."&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  When to Use Docker Model Runner
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasaj91x60ox69l30su7w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fasaj91x60ox69l30su7w.png" alt=" " width="720" height="794"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How It Works Under the Hood
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lxfecn0msm15s8xxxqn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lxfecn0msm15s8xxxqn.png" alt=" " width="800" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you run a model through DMR, Docker Desktop spins up a local HTTP server exposing an &lt;strong&gt;OpenAI-compatible REST API&lt;/strong&gt; including &lt;strong&gt;&lt;code&gt;/v1/chat/completions&lt;/code&gt;, &lt;code&gt;/v1/completions&lt;/code&gt;, and &lt;code&gt;/v1/models&lt;/code&gt;.&lt;/strong&gt; Any application or SDK already speaking the OpenAI protocol works against &lt;strong&gt;DMR&lt;/strong&gt; with &lt;strong&gt;zero code changes&lt;/strong&gt;, making it a drop-in local alternative for &lt;strong&gt;AI-powered development&lt;/strong&gt;.&lt;/p&gt;
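&lt;p&gt;Because the API is OpenAI-compatible, any HTTP client can talk to it. A small helper that builds such a request; the base URL and model name in the example are assumptions (the default TCP port and path may differ depending on your Docker Desktop settings and version):&lt;/p&gt;

```python
import json

def chat_request(base_url, model, prompt):
    # Build the URL and JSON body for an OpenAI-style chat completion call.
    url = f"{base_url}/chat/completions"
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    })
    return url, body

# Hypothetical target: a DMR endpoint on the default port; adjust to your setup.
url, body = chat_request("http://localhost:12434/engines/v1", "ai/smollm2", "Say hello")
```

&lt;p&gt;You can POST this with any HTTP client, or simply point an OpenAI SDK's base URL at the same address and call it as if it were the hosted API.&lt;/p&gt;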

&lt;ol&gt;
&lt;li&gt;Install the latest Docker Desktop for your OS&lt;/li&gt;
&lt;li&gt;Start Docker Desktop&lt;/li&gt;
&lt;li&gt;Click the Settings icon in the top-right corner&lt;/li&gt;
&lt;li&gt;Select AI, then enable Docker Model Runner and host-side TCP support, as shown below&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Note: the default TCP port is 12434; you can change it to any free port on your machine. I set mine to 5018.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvsuil46rdgcut6n9eo5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcvsuil46rdgcut6n9eo5.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Next, click Models in the left sidebar, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvvwic76k9j6alv0fdyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvvwic76k9j6alv0fdyl.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now click Pull to download the model, then run it.&lt;/p&gt;

&lt;p&gt;The screenshot below shows the two open-source models I pulled.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvoh2k46d2j6dlsamh1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvoh2k46d2j6dlsamh1o.png" alt=" " width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Test the model within Docker Desktop itself
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jq2kz1mhz637xno19v9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3jq2kz1mhz637xno19v9.png" alt=" " width="800" height="415"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Testing GPT-OSS
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0t3p4067lzo6qxp0187.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx0t3p4067lzo6qxp0187.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;docker model&lt;/strong&gt; subcommand is your primary interface. Let's walk through pulling and running &lt;strong&gt;qwen3.5&lt;/strong&gt; step by step.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Pull a model from Docker Hub
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvk2a51jr11o11ym7goy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqvk2a51jr11o11ym7goy.png" alt=" " width="800" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. List available models (i.e., which models are downloaded locally)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyca2oj16b0i2mhg44rk2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyca2oj16b0i2mhg44rk2.png" alt=" " width="800" height="127"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick reference cheat sheet
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewauiqveujpg3c2cpq2y.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fewauiqveujpg3c2cpq2y.png" alt=" " width="742" height="1093"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Docker Model Runner matters
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpokrx5bhlfcx6hybkzgl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpokrx5bhlfcx6hybkzgl.png" alt=" " width="800" height="496"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Using DMR in your applications
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Python with the OpenAI SDK&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Since DMR speaks the OpenAI protocol, swap the base URL and you're done; no model-specific library is needed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;OpenAI&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:5018/engines/v1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;not-needed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;bye&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="k"&gt;break&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-oss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;}]&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
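&lt;p&gt;The same endpoint also supports streaming: pass &lt;code&gt;stream=True&lt;/code&gt; to &lt;code&gt;chat.completions.create&lt;/code&gt; and join the incremental deltas. The helper below sketches that accumulation against stand-in chunk objects; the dataclasses only mimic the shape the OpenAI SDK yields and are not part of any library:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List, Optional

# Stand-ins mimicking the chunk shape the OpenAI SDK yields when stream=True.
@dataclass
class Delta:
    content: Optional[str]

@dataclass
class Choice:
    delta: Delta

@dataclass
class Chunk:
    choices: List[Choice]

def collect_stream(chunks):
    """Join the incremental deltas of a streamed chat completion into one string."""
    parts = []
    for chunk in chunks:
        delta = chunk.choices[0].delta.content
        if delta:  # the final chunk typically carries content=None
            parts.append(delta)
    return "".join(parts)

# With a real client: collect_stream(client.chat.completions.create(..., stream=True))
fake = [Chunk([Choice(Delta(t))]) for t in ["Hel", "lo", None, "!"]]
print(collect_stream(fake))  # Hello!
```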



&lt;h2&gt;
  
  
  Testing the above code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6gtnovj2s4uke3pn33.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjl6gtnovj2s4uke3pn33.png" alt=" " width="800" height="299"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Docker Model Runner&lt;/strong&gt; closes the gap between &lt;strong&gt;containerized application development&lt;/strong&gt; and &lt;strong&gt;AI-powered application development&lt;/strong&gt;. By treating models as &lt;strong&gt;OCI&lt;/strong&gt; (Open Container Initiative) artifacts and exposing a standard OpenAI-compatible API, DMR lets you build with local LLMs using the same mental model, the same toolchain, and the same workflows you already use for everything else.&lt;/p&gt;

&lt;p&gt;The combination of &lt;strong&gt;zero-setup inference, hardware acceleration, and Compose&lt;/strong&gt; integration makes DMR the most practical way to add local AI capabilities to any project, whether you're building a RAG pipeline, a coding assistant, or a document summarizer.&lt;/p&gt;
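&lt;p&gt;The Compose integration works through a top-level &lt;code&gt;models&lt;/code&gt; element. The sketch below is a rough shape rather than a definitive reference; the exact keys depend on your Compose version, and &lt;code&gt;my-app&lt;/code&gt; and &lt;code&gt;ai/smollm2&lt;/code&gt; are placeholder names:&lt;/p&gt;

```yaml
services:
  my-app:
    build: .
    models:
      - llm           # Compose wires this service to the model below

models:
  llm:
    model: ai/smollm2   # any model published as an OCI artifact on Docker Hub
```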

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>docker</category>
      <category>machinelearning</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Agent Middleware in Microsoft Agent Framework 1.0</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Sat, 04 Apr 2026 19:14:54 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/agent-middleware-in-microsoft-agent-framework-10-2bm0</link>
      <guid>https://future.forem.com/sreeni5018/agent-middleware-in-microsoft-agent-framework-10-2bm0</guid>
      <description>&lt;p&gt;&lt;em&gt;A familiar pipeline pattern applied to AI agents&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Covers &lt;strong&gt;three&lt;/strong&gt; middleware types, &lt;strong&gt;registration&lt;/strong&gt; scopes, &lt;strong&gt;termination&lt;/strong&gt;, result override, and &lt;strong&gt;when to use each&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Not a New Idea
&lt;/h2&gt;

&lt;p&gt;If you have used &lt;strong&gt;ASP.NET Core&lt;/strong&gt; or &lt;strong&gt;Express.js&lt;/strong&gt;, you already understand the core concept. Both frameworks let you &lt;strong&gt;register&lt;/strong&gt; a &lt;strong&gt;chain&lt;/strong&gt; of functions around every request. Each function receives a context and a &lt;strong&gt;next() delegate&lt;/strong&gt;. Calling &lt;strong&gt;next() continues&lt;/strong&gt; the chain. Not calling it &lt;strong&gt;short-circuits&lt;/strong&gt; it. That is the pipeline pattern: &lt;strong&gt;a clean way to apply cross-cutting concerns like logging, authentication, and error handling without touching any business logic.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft’s Agent Framework&lt;/strong&gt; applies this exact pattern to AI agents. The next() delegate becomes call_next(), &lt;strong&gt;the context object holds the agent’s conversation instead of an HTTP request&lt;/strong&gt;, and the pipeline wraps an &lt;strong&gt;AI reasoning turn instead of a web request&lt;/strong&gt;. If you know app.Use() or app.use(), you already know the shape of what follows.&lt;/p&gt;

&lt;p&gt;What is new, and worth understanding deeply, is that an agent turn is &lt;strong&gt;not a single request/response cycle&lt;/strong&gt;. It is a &lt;strong&gt;multi-step reasoning loop&lt;/strong&gt;, and Agent Framework exposes three distinct interception points within it. The rest of this post covers all three types, how they differ, when to use each, and how they come together in a real SQL agent example.&lt;/p&gt;

&lt;h2&gt;
  
  
  Middleware
&lt;/h2&gt;

&lt;p&gt;The Agent Framework supports three types of middleware, each intercepting a different layer of execution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Agent middleware&lt;/strong&gt; wraps agent runs, giving you access to inputs, outputs, and overall control flow.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Function middleware&lt;/strong&gt; wraps individual tool calls, enabling input validation, result transformation, and execution control.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat middleware&lt;/strong&gt; wraps the underlying requests sent to AI models, exposing raw messages, options, and responses.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All three types support both &lt;strong&gt;function based&lt;/strong&gt; and &lt;strong&gt;class based&lt;/strong&gt; implementations.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chaining
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrts2avkvnzmlsp9v0fu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffrts2avkvnzmlsp9v0fu.png" alt=" " width="800" height="416"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When multiple middleware of the same type are registered, they execute as a chain: each middleware calls &lt;strong&gt;&lt;code&gt;call_next()&lt;/code&gt;&lt;/strong&gt; to hand off control to the next one in line.&lt;/p&gt;

&lt;p&gt;Rather than passing updated values into &lt;strong&gt;&lt;code&gt;call_next()&lt;/code&gt;&lt;/strong&gt; as arguments, middleware mutates the shared context object directly. This means any changes you make to the context before calling &lt;code&gt;call_next()&lt;/code&gt; &lt;strong&gt;are automatically visible to downstream middleware&lt;/strong&gt;, with no need to thread values through the call explicitly.&lt;/p&gt;

&lt;h3&gt;
  
  
  Execution Order
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Agent&lt;/strong&gt;-level middleware always wraps &lt;strong&gt;run&lt;/strong&gt;-level middleware. Given agent middleware &lt;code&gt;[A1, A2]&lt;/code&gt; and run middleware &lt;code&gt;[R1, R2]&lt;/code&gt;, the execution order is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;A1 → A2 → R1 → R2 → Agent → R2 → R1 → A2 → A1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Function and chat middleware follow the same wrapping principle, applied at the time of each tool call or chat request respectively.&lt;/p&gt;
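&lt;p&gt;The ordering above can be reproduced with a small framework-free sketch: each layer records an entry before awaiting the next layer and an exit while the stack unwinds. Names like &lt;code&gt;wrap&lt;/code&gt; and &lt;code&gt;trace&lt;/code&gt; are illustrative, not Agent Framework APIs:&lt;/p&gt;

```python
import asyncio

def wrap(name, inner, log):
    # Each layer logs on entry, awaits the next layer (the call_next hand-off),
    # then logs again while the stack unwinds.
    async def layer():
        log.append(name)
        await inner()
        log.append(name)
    return layer

def trace(agent_mw, run_mw):
    """Return the execution trace for agent middleware wrapping run middleware."""
    log = []
    async def agent():
        log.append("Agent")  # the actual reasoning turn sits at the center
    pipeline = agent
    # Wrap from the inside out, so the first registered middleware is outermost.
    for name in reversed(agent_mw + run_mw):
        pipeline = wrap(name, pipeline, log)
    asyncio.run(pipeline())
    return log

print(" → ".join(trace(["A1", "A2"], ["R1", "R2"])))
```

&lt;p&gt;Running the sketch prints the same trace as the diagram: A1 → A2 → R1 → R2 → Agent → R2 → R1 → A2 → A1.&lt;/p&gt;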

&lt;h2&gt;
  
  
  Why we need it
&lt;/h2&gt;

&lt;p&gt;The biggest value is not convenience; it is correctness and consistency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Without middleware&lt;/strong&gt;, teams usually end up in one or both of these patterns:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 1: policy hidden in prompts&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Example instruction:&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  "Never run destructive SQL. Never send data to personal email."
&lt;/h2&gt;

&lt;p&gt;This is useful guidance, but it is still model behavior, not a hard gate. As prompts get long, tools increase, and edge cases appear, this policy can become inconsistent. It is also hard to audit after the fact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pattern 2: policy duplicated in each tool&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;drop&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;export_data&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gmail.com&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;quote_inventory_line&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;quantity&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;quantity&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;blocked&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This looks safe, but it creates:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;duplicated logic&lt;/li&gt;
&lt;li&gt;inconsistent rules across tools&lt;/li&gt;
&lt;li&gt;expensive updates when policy changes&lt;/li&gt;
&lt;/ol&gt;
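&lt;p&gt;The duplicated guards above collapse naturally into a single table-driven check, which is exactly the kind of logic that belongs in middleware. A minimal sketch with hypothetical names (the real Agent Framework signatures appear later in this post):&lt;/p&gt;

```python
# One policy table instead of a copy of the check inside every tool.
BLOCK_RULES = [
    ("destructive SQL", lambda tool, args: tool == "run_sql" and "drop" in str(args).lower()),
    ("personal email target", lambda tool, args: tool == "export_data" and "gmail.com" in str(args).lower()),
]

def block_reason(tool, args):
    """Return why a tool call should be blocked, or None if it may proceed."""
    for reason, predicate in BLOCK_RULES:
        if predicate(tool, args):
            return reason
    return None

print(block_reason("run_sql", {"query": "DROP TABLE users"}))  # destructive SQL
print(block_reason("run_sql", {"query": "SELECT 1"}))          # None
```

&lt;p&gt;When policy changes, you edit the table once instead of hunting through every tool.&lt;/p&gt;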

&lt;h2&gt;
  
  
  Middleware fixes both
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;With middleware&lt;/strong&gt;, concerns live at the right boundary:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;run level checks in &lt;strong&gt;Agent middleware&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;per tool checks in &lt;strong&gt;Function middleware&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;model call telemetry/metadata in &lt;strong&gt;Chat middleware&lt;/strong&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Result:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;cleaner tools&lt;/li&gt;
&lt;li&gt;stronger guardrails&lt;/li&gt;
&lt;li&gt;easier tests&lt;/li&gt;
&lt;li&gt;better observability&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  1. Agent Middleware: the outermost layer
&lt;/h2&gt;

&lt;p&gt;Agent middleware is the &lt;strong&gt;outermost layer of the pipeline&lt;/strong&gt;. It fires &lt;strong&gt;once per turn&lt;/strong&gt;, before any LLM call is made and after the final response is produced, making it the right place for concerns that span the entire turn: input validation, security screening, audit logging, and output transformation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs50tvtnjc56tfbg7cbl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcs50tvtnjc56tfbg7cbl.png" alt=" " width="800" height="710"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation Styles &amp;amp; Chaining
&lt;/h2&gt;

&lt;p&gt;Agent middleware supports both &lt;strong&gt;class-based&lt;/strong&gt; and &lt;strong&gt;function-based&lt;/strong&gt; implementations. The two are fully equivalent, and the choice comes down to whether you need &lt;strong&gt;instance state or prefer a lighter syntax&lt;/strong&gt;.&lt;br&gt;
When multiple middleware components are registered, they form a chain. Each component is responsible for calling call_next() to pass control to the next layer; omitting this call short-circuits the pipeline, preventing any downstream middleware or the LLM from running.&lt;/p&gt;

&lt;p&gt;Note that call_next() takes no arguments. Instead of passing updated values explicitly, middleware mutates the shared AgentContext object directly — any changes made before await call_next() are automatically visible to everything further down the chain.&lt;/p&gt;
&lt;h2&gt;
  
  
  Class-Based Implementation
&lt;/h2&gt;

&lt;p&gt;Subclass &lt;strong&gt;AgentMiddleware&lt;/strong&gt; and &lt;strong&gt;override process()&lt;/strong&gt;. The example below shows SecurityAgentMiddleware. It inspects the latest user message and short-circuits the pipeline if it detects a threat; the LLM is never invoked for blocked requests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SecurityAgentMiddleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;AgentMiddleware&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Agent-level guard: blocks risky **user chat text** before the model runs.

    Inspects ``context.messages[-1]`` (latest user turn). If :func:`_unsafe_input_reason`
    returns a reason, sets ``context.result`` to a canned assistant reply and **does not**
    call ``call_next()``, so the LLM and tools are skipped for that turn.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Only the latest user utterance is checked (typical for a single-turn REPL).
&lt;/span&gt;        &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last_message&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;
            &lt;span class="n"&gt;reason&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;_unsafe_input_reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[SecurityAgentMiddleware] Security Warning: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;; blocking request.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="c1"&gt;# Short-circuit: set the assistant reply here; do NOT call call_next() → no LLM, no tools.
&lt;/span&gt;                &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AgentResponse&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                        &lt;span class="nc"&gt;Message&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;assistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                            &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Request blocked: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reason&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="p"&gt;]&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[SecurityAgentMiddleware] Security check passed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="c1"&gt;# Continue pipeline: model + optional run_sql; function middleware runs inside tool path.
&lt;/span&gt;        &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;call_next&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# here is the _unsafe_input_reason function &amp;amp; For brevity, I’ve omitted the full code.”
&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_unsafe_input_reason&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Classify why a user message should be blocked, or ``None`` if it may proceed.

    Checks run in order: injection-style patterns first, then destructive natural language.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="c1"&gt;# Order matters: catch obvious SQL fragments before broader NL patterns.
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_looks_like_dangerous_sql&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;injection-style or suspicious SQL fragment in your message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;_looks_like_destructive_database_intent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;destructive database request (e.g. delete/drop/truncate)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws66yzcciue97cni8th3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fws66yzcciue97cni8th3.png" alt=" " width="800" height="389"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Function-Based and Decorator-Based Styles
&lt;/h2&gt;

&lt;p&gt;Agent Framework also supports function based and decorator based implementations. All three styles are equivalent; choose based on whether you need state or explicit type annotations.&lt;/p&gt;

&lt;h1&gt;
  
  
  Function based
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;

&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

&lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;AgentContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;

&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Turn starting&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Agent] Turn completed&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Decorator-based (no type annotation required)
&lt;/h1&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@agent_middleware&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;simple_agent_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Before agent execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;After agent execution&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Registering Middleware
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Middleware is registered when constructing the agent&lt;/strong&gt;. Pass a list to the middleware argument; different middleware types can be mixed in the same list, and the framework routes each to the correct pipeline layer automatically:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://sreeniagent.services.ai.azure.com/api/projects/sreeni_foundry&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;FOUNDRY_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4.1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;


&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="nf"&gt;with &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;AzureCliCredential&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;FoundryChatClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;credential&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;project_endpoint&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FOUNDRY_PROJECT_ENDPOINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# Your Microsoft Foundry project URL 
&lt;/span&gt;            &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;FOUNDRY_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="c1"&gt;# The model you deployed 
&lt;/span&gt;        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Sreeni-SqlAssistant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;instructions&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You help users query a small demo database. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;The only table is `customers` with columns id, name, city. &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Always use the run_sql tool with a proper SELECT; explain results briefly.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;run_sql&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="c1"&gt;# Agent middleware wraps the turn; function middleware wraps each tool call
&lt;/span&gt;        &lt;span class="n"&gt;middleware&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;SecurityAgentMiddleware&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="nc"&gt;LoggingFunctionMiddleware&lt;/span&gt;&lt;span class="p"&gt;()],&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  When to Use Agent Middleware
&lt;/h2&gt;

&lt;p&gt;Agent middleware is the right choice for any concern that applies to the &lt;strong&gt;turn as a whole, rather than to a specific tool call or model request&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauwa9vp12vf96mw8boh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fauwa9vp12vf96mw8boh1.png" alt=" " width="800" height="408"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  2. FunctionMiddleware: The Tool Call Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;FunctionMiddleware&lt;/strong&gt; fires inside the agent turn, but only when the &lt;strong&gt;LLM decides to invoke a tool&lt;/strong&gt;. A single agent turn can trigger multiple tool calls, and FunctionMiddleware wraps each one independently. This makes it the right place for concerns that are specific to tool execution: timing, input validation, result &lt;strong&gt;transformation, and tool call auditing.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The FunctionInvocationContext Object
&lt;/h2&gt;

&lt;p&gt;Each FunctionMiddleware component receives a FunctionInvocationContext, which is scoped to a single tool invocation: it exposes the function being called (context.function), per-turn metadata (context.metadata), and the tool result and termination flag (context.result, context.terminate).&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use FunctionMiddleware
&lt;/h2&gt;

&lt;p&gt;Use it for concerns &lt;strong&gt;specific to tool execution&lt;/strong&gt;: execution &lt;strong&gt;timing&lt;/strong&gt; and performance monitoring, &lt;strong&gt;validating&lt;/strong&gt; or sanitising tool arguments before they run, capping the number of times a tool may be called in one turn, transforming tool results before the LLM sees them, or auditing exactly which tools were called and with what arguments.&lt;/p&gt;
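&lt;p&gt;As a concrete illustration of the timing use case, here is a framework-free sketch. timing_middleware, run_tool, and the plain-dict context are hypothetical stand-ins, not Agent Framework APIs; the point is that each tool call in a turn gets its own independent wrap:&lt;/p&gt;

```python
import asyncio
import time

timings = {}

async def timing_middleware(context, next):
    # Measure just this one tool invocation.
    start = time.perf_counter()
    await next(context)
    timings[context["tool"]] = time.perf_counter() - start

async def run_tool(context):
    await asyncio.sleep(0.01)  # stand-in for real tool work
    context["result"] = "ok"

async def main():
    # One agent turn may trigger several tool calls; each is wrapped independently.
    for tool in ("run_sql", "fetch_weather"):
        await timing_middleware({"tool": tool}, run_tool)

asyncio.run(main())
print(sorted(timings))  # ['fetch_weather', 'run_sql']
```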

&lt;h2&gt;
  
  
  Terminating the Function Calling Loop
&lt;/h2&gt;

&lt;p&gt;Setting &lt;strong&gt;context.terminate = True&lt;/strong&gt; inside FunctionMiddleware does something powerful: it stops the LLM’s function calling loop entirely. The LLM will not receive the tool result and will not make any further tool calls in this turn. This is useful for enforcing tool call budgets or stopping a loop that is going in an undesirable direction.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
&lt;span class="nd"&gt;@function_middleware&lt;/span&gt;

&lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;budget_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;function&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;run_sql&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="c1"&gt;# Allow at most one SQL query per turn
&lt;/span&gt;
 &lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

 &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

 &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Query limit reached for this turn.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

 &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;terminate&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# stop the LLM tool-calling loop
&lt;/span&gt;
 &lt;span class="k"&gt;return&lt;/span&gt;

 &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;sql_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;call_count&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;

 &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Warning: Termination and Chat History&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Terminating the function calling loop can leave the chat history in an inconsistent state: a tool-call message with no corresponding tool result. This may cause errors if the same history is used in subsequent agent runs. Use termination carefully and consider clearing or repairing the history afterward.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. ChatMiddleware: The LLM Call Layer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;ChatMiddleware&lt;/strong&gt; is the deepest layer. It wraps the actual inference call sent to the &lt;strong&gt;underlying language model&lt;/strong&gt;: the raw list of messages, the model options, and the response that comes back. This layer fires for every call to the &lt;strong&gt;LLM within a turn, which can be more than one if tools are used.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The ChatContext Object
&lt;/h2&gt;

&lt;p&gt;Each ChatMiddleware component receives a ChatContext.&lt;/p&gt;

&lt;h2&gt;
  
  
  Function Based Example
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;
  &lt;span class="k"&gt;async&lt;/span&gt; &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;logging_chat_middleware&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;

  &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;

  &lt;span class="nb"&gt;next&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Callable&lt;/span&gt;&lt;span class="p"&gt;[[&lt;/span&gt;&lt;span class="n"&gt;ChatContext&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;Awaitable&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;

  &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chat] Sending &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; messages to model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="k"&gt;await&lt;/span&gt; &lt;span class="nf"&gt;next&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

  &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Chat] Model response received&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Because &lt;strong&gt;ChatMiddleware&lt;/strong&gt; sees the exact message list going to the model, it can be used to inject system instructions, strip sensitive content, enforce token budgets, or even substitute a cached response, all without the &lt;strong&gt;AgentMiddleware&lt;/strong&gt; or &lt;strong&gt;FunctionMiddleware&lt;/strong&gt; layers knowing anything changed.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Use ChatMiddleware
&lt;/h2&gt;

&lt;p&gt;Use it when you need access to the raw LLM call: injecting or modifying system level instructions per call, redacting PII from messages before they leave your infrastructure, enforcing token count limits, caching repeated inference calls, or monitoring every model request for compliance purposes.&lt;/p&gt;
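&lt;p&gt;As a concrete illustration of the redaction use case, here is a framework-free sketch. redact_middleware and fake_model_call are hypothetical stand-ins (not Agent Framework APIs): the middleware scrubs email addresses from the outgoing message list, and the "model call" only ever sees the redacted text:&lt;/p&gt;

```python
import asyncio
import re

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

async def redact_middleware(context, next):
    # Rewrite the message list before it leaves our infrastructure.
    context["messages"] = [EMAIL.sub("[REDACTED]", m) for m in context["messages"]]
    await next(context)

async def fake_model_call(context):
    # Stand-in for the real inference call; it sees only redacted text.
    context["response"] = f"saw {len(context['messages'])} messages"

ctx = {"messages": ["Contact me at jane@example.com", "Summarise the thread"]}
asyncio.run(redact_middleware(ctx, fake_model_call))
print(ctx["messages"][0])  # Contact me at [REDACTED]
```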

&lt;h1&gt;
  
  
  Registration: Agent Level vs. Run Level
&lt;/h1&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Agent Framework&lt;/strong&gt; supports &lt;strong&gt;two scopes&lt;/strong&gt; for registering middleware. Understanding the difference is important for designing flexible agent systems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent Level Middleware
&lt;/h2&gt;

&lt;p&gt;Middleware passed in the middleware=[...] list when constructing the Agent applies to every single call to agent.run() for the lifetime of that agent. This is where you put policies that should always be enforced: security guards, mandatory audit logging, content filters.&lt;/p&gt;

&lt;h2&gt;
  
  
  Run Level Middleware
&lt;/h2&gt;

&lt;p&gt;You can also pass middleware directly to a single agent.run() call. This middleware applies only to that one invocation and is discarded afterward. It is useful for per request customisation: adding a trace ID for a specific call, applying extra validation for a sensitive operation, or attaching a debug logger without affecting every other turn.&lt;/p&gt;
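&lt;p&gt;The difference between the two scopes can be sketched with a toy agent. MiniAgent and both middleware names here are hypothetical, plain-Python stand-ins for the framework's behaviour: agent-level middleware lives with the agent, while run-level middleware is merged in for one invocation and then discarded:&lt;/p&gt;

```python
import asyncio

class MiniAgent:
    def __init__(self, middleware):
        self.middleware = list(middleware)            # agent-level scope

    async def run(self, prompt, middleware=()):
        active = self.middleware + list(middleware)   # run-level: this call only
        ctx = {"prompt": prompt, "log": []}

        async def handler(c):
            c["log"].append("turn")

        chain = handler
        for mw in reversed(active):                   # first in list = outermost
            def wrap(mw, inner):
                async def call(c):
                    await mw(c, inner)
                return call
            chain = wrap(mw, chain)
        await chain(ctx)
        return ctx["log"]

async def always_audit(ctx, next):
    ctx["log"].append("audit")
    await next(ctx)

async def debug_trace(ctx, next):
    ctx["log"].append("debug")
    await next(ctx)

agent = MiniAgent([always_audit])
print(asyncio.run(agent.run("hi", middleware=[debug_trace])))  # ['audit', 'debug', 'turn']
print(asyncio.run(agent.run("hi")))                            # ['audit', 'turn']
```

&lt;p&gt;The second run shows the run-level middleware is gone: only the agent-level policy fires on every invocation.&lt;/p&gt;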

&lt;h1&gt;
  
  
  Choosing the Right Middleware Type
&lt;/h1&gt;

&lt;p&gt;With three types available, the choice usually comes down to what you need to see and at what granularity.&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hrawj5dnjxdu3e9llv0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1hrawj5dnjxdu3e9llv0.png" alt=" " width="800" height="517"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Microsoft Agent Framework’s middleware brings the same pipeline contract you know from &lt;strong&gt;ASP.NET Core and Express&lt;/strong&gt; (ordered components, a context object, and a call_next() delegate) into the world of AI agents. The structural difference is that an agent turn is not a single request/response cycle but a multi-step reasoning loop, and Agent Framework exposes three separate interception points within it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentMiddleware&lt;/strong&gt; is the right home for &lt;strong&gt;turn level&lt;/strong&gt; concerns: &lt;strong&gt;security screening, content policy, and audit logging&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FunctionMiddleware&lt;/strong&gt; is the right home for &lt;strong&gt;tool level&lt;/strong&gt; concerns: execution timing, argument validation, and tool call budgets. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ChatMiddleware&lt;/strong&gt; is the right home for &lt;strong&gt;model level&lt;/strong&gt; concerns: raw message inspection, token enforcement, and caching.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>dotnet</category>
    </item>
    <item>
      <title>Five Agent Memory Types in LangGraph: A Deep Code Walkthrough (Part 2)</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Fri, 03 Apr 2026 03:23:44 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/five-agent-memory-types-in-langgraph-a-deep-code-walkthrough-part-2-17kb</link>
      <guid>https://future.forem.com/sreeni5018/five-agent-memory-types-in-langgraph-a-deep-code-walkthrough-part-2-17kb</guid>
      <description>&lt;p&gt;In &lt;strong&gt;Part-1 [&lt;a href="https://dev.to/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn"&gt;https://dev.to/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn&lt;/a&gt;]&lt;/strong&gt; we covered the &lt;strong&gt;five memory types&lt;/strong&gt;, why the LLM is stateless by design, and why memory is always an &lt;strong&gt;infrastructure&lt;/strong&gt; concern. This post is the how. Same five types, but now we wire each one up with &lt;strong&gt;LangGraph&lt;/strong&gt;, dissect every line of code, flag the gotchas, and leave you with a single working script you can run today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before We Write a Single Line: Two Things You Must Understand
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The &lt;strong&gt;Context Window&lt;/strong&gt; Is the Only Reality&lt;br&gt;
&lt;strong&gt;Repeat this like a mantra&lt;/strong&gt;: the model only knows what is in the context window at inference time. Every token (your message, retrieved facts, conversation history, tool results, system instructions) has to be physically present in that window at the moment of the call. If it is not there, the model does not know it exists. Your memory infrastructure's entire job is to decide what goes in, when, and in what form.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Checkpointer ≠ Store&lt;/strong&gt;: This Confusion Breaks Designs&lt;br&gt;
LangGraph gives you two distinct persistence hooks, and mixing them up is the most common architecture mistake beginners make.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrd37pf23jlkg9mkceed.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnrd37pf23jlkg9mkceed.png" alt=" " width="800" height="434"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The practical consequence:&lt;/strong&gt; if you store a user preference in the &lt;strong&gt;checkpointer&lt;/strong&gt; (i.e., in state["messages"]), it &lt;strong&gt;vanishes&lt;/strong&gt; the moment you start a new thread_id. If you store it in the store, it is there regardless of which thread the user returns on. Choose deliberately.&lt;/p&gt;
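&lt;p&gt;The split can be sketched with two plain dictionaries. This is a teaching model of the keying only, not the LangGraph checkpointer or store API: the checkpointer is keyed by thread_id, the store by a cross-thread user namespace, so only the store survives a new thread_id:&lt;/p&gt;

```python
checkpointer = {}   # thread_id -> message list (per-conversation state)
store = {}          # (user_id, key) -> value (cross-thread long-term memory)

def turn(thread_id, user_id, message):
    # Every message lands in this thread's checkpoint history.
    checkpointer.setdefault(thread_id, []).append(message)
    # Durable preferences are written to the store, keyed by user, not thread.
    if message.startswith("remember:"):
        store[(user_id, "preference")] = message.split(":", 1)[1].strip()

turn("thread-1", "alice", "remember: reply in French")
turn("thread-2", "alice", "hello again")

print(store[("alice", "preference")])  # reply in French  (visible from any thread)
print(checkpointer["thread-2"])        # ['hello again']  (thread-2 never saw it)
```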

&lt;p&gt;&lt;strong&gt;For local production setups you typically use SQLite for both, as two separate files:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SqliteSaver&lt;/strong&gt; → durable per thread checkpoint history&lt;br&gt;
&lt;strong&gt;SqliteStore&lt;/strong&gt; → durable cross thread LTM/episodic records&lt;/p&gt;

&lt;p&gt;The demos below use &lt;em&gt;InMemory*&lt;/em&gt; backends so you can run them with zero setup. That is a teaching choice, not a recommendation for production.&lt;/p&gt;
&lt;h2&gt;
  
  
  &lt;strong&gt;Environment Setup&lt;/strong&gt;
&lt;/h2&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip install langgraph langchain-openai langchain-community faiss-cpu python-dotenv

export OPENAI_API_KEY=sk-...
export OPENAI_CHAT_MODEL=gpt-4o-mini   # optional, this is the default
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;macOS note:&lt;/strong&gt; If you have PyTorch installed alongside FAISS, two OpenMP runtimes may be loaded and Python will abort on import. The fix is one line: &lt;strong&gt;os.environ.setdefault("KMP_DUPLICATE_LIB_OK", "TRUE")&lt;/strong&gt; — set it before importing FAISS. The full script at the end does this automatically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KMP_DUPLICATE_LIB_OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRUE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Must be before FAISS import
&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySaver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_store&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.store.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryStore&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Memory Type 1: Short-Term Memory (STM), the Conversation Buffer
&lt;/h2&gt;

&lt;p&gt;What is it?&lt;/p&gt;

&lt;p&gt;Short-term memory (STM) is the rolling transcript of the current conversation. It is what allows the model to understand "make it shorter" without you specifying what "it" refers to. Every prior message in the session is assembled into the context window on each subsequent call.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;pythondef&lt;/span&gt; &lt;span class="nf"&gt;demo_short_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Short-term memory = this thread&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s message list, restored by the checkpointer.

    The same thread_id on each invoke reloads prior turns into state[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;]
    so the model sees continuity without you manually merging history.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# state["messages"] already contains ALL prior turns for this thread_id,
&lt;/span&gt;        &lt;span class="c1"&gt;# restored from the checkpoint. We pass the full list to the LLM.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Compile with a checkpointer. Without this, state is not saved between invokes.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-stm-demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;

    &lt;span class="c1"&gt;# First turn: store the codename.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My codename for this session is Bluejay.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Second turn: only the new message is passed in.
&lt;/span&gt;    &lt;span class="c1"&gt;# The checkpointer reloads the first turn automatically.
&lt;/span&gt;    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What codename did I give?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[STM] Last reply:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Line-by-line breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;def chat(state: MessagesState) -&amp;gt; dict:&lt;/strong&gt;&lt;br&gt;
This is the only node in the graph. MessagesState is a TypedDict with one key: messages. By the time this function executes on the second invoke, state["messages"] already contains the full history: the original "My codename…" message, the model's reply to it, and the new "What codename…" message. The checkpointer loaded the prior checkpoint and the &lt;strong&gt;add_messages&lt;/strong&gt; reducer merged the new input on top.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app = graph.compile(checkpointer=InMemorySaver())&lt;/strong&gt;&lt;br&gt;
This is the critical line. Without checkpointer=, each &lt;strong&gt;invoke&lt;/strong&gt; starts with an empty state. With it, LangGraph saves a snapshot after every node completes and restores it at the start of the next invoke for the same &lt;strong&gt;thread_id&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;cfg: dict = {"configurable": {"thread_id": tid}}&lt;/strong&gt;&lt;br&gt;
This config dict is how you identify which conversation thread this call belongs to. The same thread_id = same checkpoint = continuity. A different thread_id = blank slate. This is intentional — you support multiple concurrent users by giving each a unique thread_id.&lt;/p&gt;
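&lt;p&gt;As a concrete illustration of that isolation, a small helper can derive the config per user (the helper name config_for is ours, not a LangGraph API):&lt;/p&gt;

```python
def config_for(user_id: str) -> dict:
    """Build the invoke config for one user's conversation thread.

    Same user -> same thread_id -> the checkpointer restores their history.
    Different users get different thread_ids and never share a checkpoint.
    """
    return {"configurable": {"thread_id": f"user-{user_id}"}}

# app.invoke({"messages": [...]}, config_for("alice"))  # Alice's own thread
```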

&lt;p&gt;&lt;strong&gt;app.invoke({"messages": [HumanMessage("What codename did I give?")]}, cfg)&lt;/strong&gt;&lt;br&gt;
Notice we only pass the new message. We do not rebuild the history manually. The &lt;strong&gt;checkpointer&lt;/strong&gt; and the &lt;strong&gt;add_messages&lt;/strong&gt; reducer do that for us.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The token budget problem and how to handle it&lt;/strong&gt;&lt;br&gt;
STM has one fundamental weakness: &lt;strong&gt;as the conversation grows, the context window fills up. For production systems you have two standard strategies:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Truncation — &lt;strong&gt;drop the oldest messages once you exceed a token threshold&lt;/strong&gt;. Simple, but the model loses early context.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Summarization — periodically ask the LLM to write a running summary of the conversation so far, then replace the old messages with that summary. More expensive, but preserves the gist.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
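&lt;p&gt;Truncation fits in a few lines. This is a minimal sketch: count_tokens is a placeholder for your real tokenizer (e.g. a tiktoken-based counter), not a LangGraph built-in:&lt;/p&gt;

```python
def truncate_history(messages: list, max_tokens: int, count_tokens) -> list:
    """Drop the oldest messages until the history fits the token budget."""
    kept = list(messages)
    while kept and sum(count_tokens(m) for m in kept) > max_tokens:
        kept.pop(0)  # discard the oldest message first
    return kept
```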

&lt;p&gt;LangGraph does not do this automatically for you. You would add a summarization node that fires conditionally when len(state["messages"]) exceeds a threshold.&lt;/p&gt;
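&lt;p&gt;A sketch of that conditional routing: the node names and threshold below are illustrative, not LangGraph built-ins. The function itself is just a plain callable you would pass to graph.add_conditional_edges:&lt;/p&gt;

```python
MAX_MESSAGES = 20  # illustrative threshold, tune for your token budget

def route_after_input(state: dict) -> str:
    """Conditional edge: divert to a summarize node once history grows too long.

    Wired as: graph.add_conditional_edges(START, route_after_input), where the
    hypothetical "summarize" node compresses old turns before "model" runs.
    """
    if len(state["messages"]) > MAX_MESSAGES:
        return "summarize"
    return "model"
```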
&lt;h2&gt;
  
  
  Production upgrade
&lt;/h2&gt;

&lt;p&gt;Swap &lt;strong&gt;InMemorySaver()&lt;/strong&gt; for &lt;strong&gt;SqliteSaver.from_conn_string("checkpoints.db")&lt;/strong&gt; and thread history survives process restarts. Swap in &lt;strong&gt;AsyncPostgresSaver&lt;/strong&gt; for a cloud-deployed, multi-instance setup.&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory Type 2: Long-Term Memory (LTM) — Cross-Thread Persistence
&lt;/h2&gt;

&lt;p&gt;What is it?&lt;/p&gt;

&lt;p&gt;Long-term memory (LTM) solves the problem that checkpoints can't: persistence across different thread_id values. When a user returns next week in a new session (new thread_id), their preferences, constraints, and facts should still be available. That requires the store.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_long_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Long-term memory = LangGraph Store: keyed data shared across thread_ids.

    Checkpoints reset per thread; store.put / get survives that boundary.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# get_store() is injected by LangGraph at runtime because the graph
&lt;/span&gt;        &lt;span class="c1"&gt;# was compiled with store=. Do not pass the store as a function argument.
&lt;/span&gt;        &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_store&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

        &lt;span class="c1"&gt;# Namespace is a tuple of strings — like a file path for your data.
&lt;/span&gt;        &lt;span class="c1"&gt;# ("users", "demo-user", "facts") scopes this record to one user.
&lt;/span&gt;        &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remember:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="c1"&gt;# Extract the fact and store it under key "profile" in this namespace.
&lt;/span&gt;            &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;

        &lt;span class="c1"&gt;# For any other query, retrieve the stored fact and inject it as context.
&lt;/span&gt;        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

        &lt;span class="c1"&gt;# The retrieved fact goes into a SystemMessage so it conditions the reply
&lt;/span&gt;        &lt;span class="c1"&gt;# without appearing as part of the user's message.
&lt;/span&gt;        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored user fact (long-term): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Thread A: Store the user's preference.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remember: I always want concise bullet answers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Thread B: Completely different thread_id. No shared checkpoint history.
&lt;/span&gt;    &lt;span class="c1"&gt;# But store.get still finds the preference stored under the same namespace.
&lt;/span&gt;    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What style do I prefer?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LTM] Reply on a *different* thread_id:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Line-by-line breakdown
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;store = get_store()&lt;/strong&gt;&lt;br&gt;
This is not get_store from a module-level import in the traditional sense; it is called inside the node function at runtime. LangGraph's execution engine makes the compiled store available via this call. Using the store object from the outer scope directly inside a node happens to work in this simple example, but get_store() is the correct pattern for production because it handles async contexts and subgraph injection correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;ns = ("users", "demo-user", "facts")&lt;/strong&gt;&lt;br&gt;
Namespaces are tuples of strings. Think of them as a path in a key-value hierarchy. You could have ("users", user_id, "facts") for facts, &lt;strong&gt;("users", user_id, "episodes")&lt;/strong&gt; for events, and ("global", "config") for shared config. The store does not enforce any schema — the structure is entirely yours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;store.put(ns, "profile", {"text": fact})&lt;/strong&gt;&lt;br&gt;
Three arguments: namespace tuple, key string, value dict. The value must be JSON-serializable. Here we use a single "profile" key which gets overwritten each time. For multi-fact storage you'd use a unique key per fact (perhaps the fact's text, hashed, or a UUID).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;item = store.get(ns, "profile")&lt;/strong&gt;&lt;br&gt;
Returns an Item object (or None if the key does not exist). The dict you stored is at item.value. Always check for None before accessing .value: a missing key returns None, not an exception.&lt;/p&gt;
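&lt;p&gt;For the unique-key-per-fact approach, one option (our helper, not a LangGraph API) is to derive the key from a hash of the normalized fact text, so re-storing the same fact overwrites itself instead of duplicating:&lt;/p&gt;

```python
import hashlib

def fact_key(text: str) -> str:
    """Deterministic store key for a fact: same text -> same key."""
    return hashlib.sha256(text.strip().lower().encode()).hexdigest()[:16]

# store.put(ns, fact_key(fact), {"text": fact})  # one record per distinct fact
```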

&lt;p&gt;&lt;strong&gt;The SystemMessage injection pattern&lt;/strong&gt;&lt;br&gt;
Retrieved LTM facts almost always go into a &lt;strong&gt;SystemMessage&lt;/strong&gt;, not a &lt;strong&gt;HumanMessage&lt;/strong&gt;. This is intentional: you are giving the model background context before it reads the user's actual query. Putting it in the system prompt keeps it conceptually separate from the conversation.&lt;/p&gt;
&lt;h2&gt;
  
  
  What "vector-based LTM" looks like
&lt;/h2&gt;

&lt;p&gt;In the demo, retrieval is a direct key lookup: store.get(ns, "profile"). In production you typically want semantic retrieval — given the user's current query, find the most relevant stored facts, not all of them. The pattern is:&lt;/p&gt;

&lt;p&gt;On write: embed the fact text, store embedding + text + metadata.&lt;br&gt;
On read: embed the current query, run similarity search, inject top-k results.&lt;/p&gt;

&lt;p&gt;LangGraph's &lt;strong&gt;SqliteStore&lt;/strong&gt; and &lt;strong&gt;InMemoryStore&lt;/strong&gt; both support a search(namespace, query=..., limit=k) call when an embedding function is configured. For larger scale, swap the store backend for Pinecone, Weaviate, or ChromaDB with the same put/get/search interface pattern.&lt;/p&gt;
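&lt;p&gt;The embed-on-write / search-on-read pattern can be sketched end to end with a toy bag-of-words embedding. This is a stand-in for illustration only; production would use a real embedding model behind the store's search call:&lt;/p&gt;

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: bag-of-words counts. Stand-in for a real embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse bag-of-words vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticFactStore:
    """Minimal embed-on-write / search-on-read fact store."""

    def __init__(self):
        self._items = []  # list of (embedding, text) pairs

    def put(self, text: str) -> None:
        # On write: embed the fact and store embedding + text together.
        self._items.append((embed(text), text))

    def search(self, query: str, limit: int = 2) -> list:
        # On read: embed the query, rank by similarity, return top-k texts.
        q = embed(query)
        ranked = sorted(self._items, key=lambda it: cosine(q, it[0]), reverse=True)
        return [text for _, text in ranked[:limit]]
```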
&lt;h2&gt;
  
  
  Production upgrade
&lt;/h2&gt;

&lt;p&gt;Replace InMemoryStore() with SqliteStore.from_conn_string("ltm.db") for local durability, or use a cloud vector store for multi-instance deployments.&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory Type 3: Working Memory — The Reasoning Scratchpad
&lt;/h2&gt;

&lt;p&gt;What is it?&lt;/p&gt;

&lt;p&gt;Working memory is the temporary state that accumulates across multiple nodes within a single graph run. When an agent needs to research five things before answering one question, intermediate results need somewhere to live between steps. That place is an extra field in the graph state, cleared when the run ends.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The code&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Custom state schema: messages + a scratchpad notes list.

    The Annotated[list[str], operator.add] declaration tells LangGraph:
    when multiple nodes return a &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; key, concatenate the lists
    rather than replacing the field. This is the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;reducer&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; pattern.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Simulated research/tool step.
    In a real agent this would call APIs, databases, or search tools.
    Returns a partial state update — only the &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; field.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor A monthly price = $49&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor B monthly price = $39&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_working_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Working memory: research node fills notes, answer node reads them in one run.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# By the time this node runs, state["notes"] contains everything
&lt;/span&gt;        &lt;span class="c1"&gt;# appended by research_step (and any other upstream nodes).
&lt;/span&gt;        &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer using only the working notes below.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;## Working notes&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which competitor is cheaper and by how much?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# No checkpointer needed for working memory.
&lt;/span&gt;    &lt;span class="c1"&gt;# The scratchpad lives only for the duration of this single invoke call.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Working] Final:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line-by-line breakdown&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;notes: Annotated[list[str], operator.add]&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is the key architectural decision. Without the operator.add reducer, a later node's {"notes": [...]} write would simply replace the current value, and LangGraph rejects concurrent writes from parallel nodes to a plain channel outright. With operator.add, LangGraph merges each update by calling operator.add(current_notes, new_notes), which for lists is concatenation. Multiple research nodes can all write notes and they accumulate correctly.&lt;/p&gt;
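&lt;p&gt;Outside LangGraph, the reducer semantics are just Python's operator.add applied to the channel's old and new values. A minimal sketch (plain Python, no LangGraph required):&lt;/p&gt;

```python
import operator
from typing import Annotated, TypedDict

class WorkingState(TypedDict):
    # The Annotated metadata tells LangGraph HOW to merge writes to this key.
    notes: Annotated[list[str], operator.add]

# What the framework effectively does when two nodes each return {"notes": [...]}:
write_1 = ["competitor A charges $49/mo"]
write_2 = ["competitor B charges $40/mo"]
merged = operator.add(write_1, write_2)  # list + list = concatenation
print(merged)
# ['competitor A charges $49/mo', 'competitor B charges $40/mo']
```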

&lt;p&gt;&lt;strong&gt;graph.add_edge(START, "research") and graph.add_edge("research", "answer")&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This creates a sequential two-step pipeline: the research node runs first and populates notes, then answer runs and reads the accumulated notes. This is a simple linear chain; real agents might have fan-out (multiple parallel research nodes) feeding into a single synthesis node.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app = graph.compile() (no checkpointer)&lt;/strong&gt;&lt;br&gt;
Working memory is intentionally ephemeral. You do not need a checkpointer for it. Adding one would checkpoint the scratchpad state, which is sometimes useful for debugging but not necessary for the pattern to work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;app.invoke({"messages": [], "notes": []})&lt;/strong&gt;&lt;br&gt;
Both fields are initialized explicitly. The empty list is the starting point for the operator.add reducer: every note a node returns is concatenated onto it. Initializing the full state up front also keeps node code simple, since no node has to guard against a missing key.&lt;/p&gt;
&lt;h2&gt;
  
  
  The multi-node fan-out pattern
&lt;/h2&gt;

&lt;p&gt;The real power of working memory emerges when you parallelize:&lt;br&gt;
START → [research_a, research_b, research_c] → synthesize → END&lt;br&gt;
Each research node appends to notes. Because all three use operator.add, their results accumulate in whatever order they complete. The synthesize node sees all of them. You would wire this with:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthesize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthesize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research_c&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;synthesize&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Working memory vs. long-term memory: the key difference&lt;/strong&gt;&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th&gt;Working Memory&lt;/th&gt;
&lt;th&gt;Long-Term Memory&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Lifespan&lt;/td&gt;
&lt;td&gt;One invoke call&lt;/td&gt;
&lt;td&gt;Indefinite, across sessions&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Storage&lt;/td&gt;
&lt;td&gt;Graph state (in-process)&lt;/td&gt;
&lt;td&gt;Store backend (in-memory or durable)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Purpose&lt;/td&gt;
&lt;td&gt;Accumulate intermediate results&lt;/td&gt;
&lt;td&gt;Persist user facts and preferences&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cleared when&lt;/td&gt;
&lt;td&gt;invoke returns&lt;/td&gt;
&lt;td&gt;Explicitly deleted, or never&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;h2&gt;
  
  
  Memory Type 4: Episodic Memory — The Event Log
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;br&gt;
Episodic memory stores what happened, not just what is true. Long-term memory holds preferences ("I like bullet points"); episodic memory holds events ("Last Tuesday we reviewed three quotes and chose Plan B"). It is the agent's diary — structured, timestamped, queryable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The code&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_episodic_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Episodic memory = append-only events (task, outcome, ...), recalled by search.

    In production: add timestamps, semantic search over episode summaries,
    and filters by date range, task type, or user ID.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Namespace: scoped to this user's episode log.
&lt;/span&gt;    &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Each episode gets a UUID so records are uniquely addressable.
&lt;/span&gt;    &lt;span class="c1"&gt;# If the same event needs to be updated later (e.g., outcome changed),
&lt;/span&gt;    &lt;span class="c1"&gt;# use the same key. For append-only logs, always generate a fresh UUID.
&lt;/span&gt;    &lt;span class="n"&gt;eid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;eid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pricing_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chose plan B after comparing three quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="c1"&gt;# In production, add: "timestamp": datetime.utcnow().isoformat()
&lt;/span&gt;            &lt;span class="c1"&gt;# and embed the outcome text for semantic search.
&lt;/span&gt;        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Retrieve recent episodes. In production, filter by timestamp or
&lt;/span&gt;    &lt;span class="c1"&gt;# use store.search(ns, query="pricing decision", limit=5) for semantic recall.
&lt;/span&gt;    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Episodic] Stored episodes:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line-by-line breakdown&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;eid = str(uuid.uuid4())&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each episode is a separate record with a unique key&lt;/strong&gt;. This is the append-only pattern: each event gets a fresh key, so no episode ever overwrites another. If you instead need to mark an episode as completed or update its outcome, reuse the same UUID as the key (the put call will overwrite that record). The choice depends on whether you want a full audit trail or just the latest state of each event.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;store.put(ns, eid, {...})&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The value dict can contain any JSON-serializable data.&lt;/strong&gt; In production, you would always include a timestamp so you can filter by date range. You might also store the full conversation summary, the user who triggered it, the tool calls made, and structured outcomes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;store.search(ns, limit=5)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Without a query parameter, search returns the most recently written records up to limit. With a query string and an embedding function configured on the store, it performs semantic similarity search over stored records. The toy demo uses simple listing; real recall would look like:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Production-style episodic recall (pseudocode):
results = store.search(
    ns,
    query="what pricing decisions did we make?",
    limit=5,
)
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;The r.value access&lt;/strong&gt;&lt;br&gt;
store.search returns a list of SearchItem objects. Each has .key, .namespace, and .value (the dict you stored). Filter and process them however you need before injecting into context.&lt;/p&gt;
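&lt;p&gt;For instance, a small hypothetical helper (episodes_to_context is an illustration, not a library function) that turns the retrieved .value dicts into a context block for the prompt:&lt;/p&gt;

```python
def episodes_to_context(values: list[dict]) -> str:
    # values: the .value dicts pulled off SearchItem results.
    lines = [f"- {v['task']}: {v['outcome']}" for v in values]
    return "## Past episodes\n" + "\n".join(lines)

print(episodes_to_context([
    {"task": "pricing_review", "outcome": "Chose plan B after comparing three quotes"},
]))
```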
&lt;h2&gt;
  
  
  Connecting episodic memory to the conversation
&lt;/h2&gt;

&lt;p&gt;The episodic demo is intentionally standalone — it shows the storage pattern without a full graph. In a real agent, you'd write episodes in an after-action node that fires after every task completes, and you'd surface them in a context-building node at the start of each new session:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;START → retrieve_episodes → main_agent → [task] → log_episode → END&lt;/strong&gt;&lt;/p&gt;
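&lt;p&gt;A sketch of what the log_episode step could look like. The store argument stands in for any object with a put(namespace, key, value) method (such as a LangGraph store); the DictStore stand-in below exists only so the sketch runs without LangGraph installed:&lt;/p&gt;

```python
import uuid
from datetime import datetime, timezone

def log_episode(store, user_id: str, task: str, outcome: str) -> str:
    """After-action write: append one timestamped episode record."""
    ns = ("users", user_id, "episodes")
    eid = str(uuid.uuid4())  # fresh key per event: append-only log
    store.put(ns, eid, {
        "task": task,
        "outcome": outcome,
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })
    return eid

class DictStore:
    """Minimal stand-in with the same put() shape as a real store."""
    def __init__(self):
        self.records = {}
    def put(self, ns, key, value):
        self.records[(ns, key)] = value

store = DictStore()
log_episode(store, "demo-user", "pricing_review", "Chose plan B")
print(len(store.records))  # 1
```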
&lt;h2&gt;
  
  
  Memory Type 5: Semantic Memory (Retrieval-Augmented Generation, RAG)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;What it is&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Semantic memory is your agent's domain knowledge layer grounded in a corpus of verified text, retrieved dynamically rather than hallucinated from training weights. The pattern is: &lt;strong&gt;embed a query, find the most relevant document chunks&lt;/strong&gt;, inject those chunks as tool output, let the model answer from the retrieved evidence.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Build a small FAISS vector index over profile documents.

    In production: load from PDFs, databases, or a web crawl.
    Use a persistent vector store (Pinecone, Weaviate, ChromaDB) instead of FAISS
    so the index survives process restarts.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seenivasa Ramadurai works at Provizient. He architects cloud-native software — &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microservices, gRPC, REST — and delivers GenAI, LLMs, and agentic patterns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At Provizient, skills include C#, Python, Java, Scala, TypeScript; LLMs, RAG, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestration; ML and MLOps; vector databases; APIs; Kubernetes and Docker.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Node factory: bind a list of tools to the LLM and return a graph node function.

    bind_tools() tells the model what tools are available and how to call them.
    The model&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s response may be a plain AIMessage OR an AIMessage with tool_calls populated.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Pass the full message history (including any prior tool results) to the model.
&lt;/span&gt;        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_semantic_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Semantic memory: model calls a KB search tool, ToolNode executes it,
    results are appended to messages, model reads them and answers.
    This is the standard ReAct (Reason + Act) loop.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
        Retrieve top-k chunks from the profile knowledge base.

        The docstring is shown to the LLM as the tool description —
        write it clearly so the model knows when and how to use this tool.
        &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Two nodes: the LLM agent and the tool executor.
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Conditional routing: if the agent emitted tool calls → run ToolNode.
&lt;/span&gt;    &lt;span class="c1"&gt;# If the agent emitted a final answer → END.
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__end__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# After ToolNode runs, go back to the agent so it can read the tool results.
&lt;/span&gt;    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# No checkpointer needed for this demo, but you'd add one in production.
&lt;/span&gt;    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which company does Seenivasa work for, and what are some of his skills? &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use the knowledge tool.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Semantic] Last message:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Line-by-line breakdown&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;FAISS.from_documents([...], OpenAIEmbeddings())&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;FAISS (Facebook AI Similarity Search) builds an in-memory vector index. &lt;strong&gt;OpenAIEmbeddings()&lt;/strong&gt; calls an OpenAI embedding model (&lt;strong&gt;text-embedding-ada-002&lt;/strong&gt; by default, unless you configure a newer one) to convert each document chunk into a vector. from_documents is a class method that handles both embedding and indexing in one call. For production, prefer a persistent vector store: the index built here lives only in RAM and is rebuilt from scratch on every process start, and a managed store also handles scaling, filtering, and concurrent access.&lt;/p&gt;
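&lt;p&gt;Under the hood, similarity_search is nearest-neighbor search over embedding vectors. A toy illustration with made-up 2-dimensional "embeddings" (real embeddings have hundreds or thousands of dimensions):&lt;/p&gt;

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    # Cosine similarity: dot product divided by the product of vector norms.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

# Toy 2-d vectors standing in for embedded document chunks.
docs = {
    "works_at_provizient": [0.9, 0.1],
    "skills_list": [0.1, 0.9],
}
query = [0.85, 0.15]  # toy embedding of "Which company does Seenivasa work for?"

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # works_at_provizient
```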

&lt;p&gt;&lt;strong&gt;@tool decorator&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;@tool decorator from langchain_core.tools&lt;/strong&gt; does three things: (1) wraps the Python function so it can be called by ToolNode, (2) extracts the function signature to build a JSON schema for the tool parameters, and (3) uses the docstring as the tool description sent to the LLM. Write clear docstrings: the model reads them to decide which tool to call and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;model.bind_tools(tools)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This attaches the tool definitions to the model in the format required by the OpenAI function-calling API. When you call bound.invoke(messages), the model can now return an AIMessage with a populated tool_calls list in addition to (or instead of) plain text content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;tools_condition&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This is a prebuilt LangGraph router function. It inspects the last message in state: if it has tool_calls, it returns "tools"; otherwise it returns "__end__". The conditional edge uses this to route traffic. The {"tools": "tools", "__end__": END} dict maps those return values to node names.&lt;/p&gt;
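&lt;p&gt;In spirit, the router reduces to a few lines. This is a sketch of the behavior, not the library's actual source, and FakeMessage is a stand-in for an AIMessage:&lt;/p&gt;

```python
def route(last_message) -> str:
    # AIMessage.tool_calls is a non-empty list when the model requested a tool.
    if getattr(last_message, "tool_calls", None):
        return "tools"
    return "__end__"

class FakeMessage:
    def __init__(self, tool_calls=None):
        self.tool_calls = tool_calls or []

print(route(FakeMessage(tool_calls=[{"name": "profile_kb_search"}])))  # tools
print(route(FakeMessage()))  # __end__
```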

&lt;p&gt;&lt;strong&gt;graph.add_edge("tools", "agent")&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;After ToolNode executes the tool call and appends the result as a ToolMessage to state, control returns to the agent. The agent now sees the tool result in its message history and generates a final answer. This loop continues until the agent produces a response with no tool calls.&lt;/p&gt;

&lt;h2&gt;
  
  
  The execution flow, step by step
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;User: "Which company does Seenivasa work for?"&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. agent node runs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM sees the question + tool definition&lt;/li&gt;
&lt;li&gt;LLM responds: AIMessage(tool_calls=[{name: "profile_kb_search", args: {query: "Seenivasa company"}}])&lt;/li&gt;
&lt;li&gt;tools_condition sees tool_calls → routes to "tools"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. tools node runs:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ToolNode calls profile_kb_search("Seenivasa company")&lt;/li&gt;
&lt;li&gt;FAISS returns the two most similar chunks&lt;/li&gt;
&lt;li&gt;Result appended as ToolMessage to state["messages"]&lt;/li&gt;
&lt;li&gt;Edge sends control back to "agent"&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. agent node runs again:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;LLM now sees: original question + tool call + tool result&lt;/li&gt;
&lt;li&gt;LLM produces a final AIMessage with no tool_calls&lt;/li&gt;
&lt;li&gt;tools_condition sees no tool_calls → routes to END&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;4. Graph returns state["messages"][-1].content&lt;/strong&gt; = the grounded answer&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why not just put knowledge in the system prompt?&lt;/strong&gt;&lt;br&gt;
For small knowledge bases, you could. For anything non-trivial:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;System prompts have token limits&lt;/li&gt;
&lt;li&gt;You pay for all tokens even if most are irrelevant&lt;/li&gt;
&lt;li&gt;RAG retrieves only what's relevant to the current query&lt;/li&gt;
&lt;li&gt;You can update the knowledge base without redeploying the agent&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;The Complete, Runnable Script&lt;/strong&gt;&lt;br&gt;
Copy this file, set OPENAI_API_KEY, and run it. All five memory patterns execute sequentially.&lt;/p&gt;

&lt;p&gt;"""&lt;br&gt;
Five agent memory patterns with LangGraph (Part 2 companion script).&lt;/p&gt;
&lt;h2&gt;
  
  
  Memory types demonstrated:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Short-term  : MessagesState + InMemorySaver + stable thread_id&lt;/li&gt;
&lt;li&gt;Long-term   : InMemoryStore + get_store() across different thread_ids&lt;/li&gt;
&lt;li&gt;Working     : Custom WorkingState with notes merged via operator.add&lt;/li&gt;
&lt;li&gt;Episodic    : Append-only store rows + search (toy recall)&lt;/li&gt;
&lt;li&gt;Semantic    : FAISS + @tool + ReAct loop (ToolNode / tools_condition)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;All demos use InMemory* backends (zero setup required).&lt;br&gt;
For production: swap InMemorySaver → SqliteSaver, InMemoryStore → SqliteStore.&lt;/p&gt;
&lt;h2&gt;
  
  
  Dependencies:
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install langgraph langchain-openai langchain-community faiss-cpu python-dotenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;h2&gt;
  
  
  Environment:
&lt;/h2&gt;
&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;OPENAI_API_KEY  (required)
OPENAI_CHAT_MODEL  (optional, defaults to gpt-4o-mini)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
&lt;p&gt;"""&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;__future__&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;annotations&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="c1"&gt;# Set before any FAISS import to prevent OpenMP duplicate library crash on macOS.
&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setdefault&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;KMP_DUPLICATE_LIB_OK&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;TRUE&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;uuid&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;TypedDict&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.documents&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Document&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;SystemMessage&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_community.vectorstores&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;OpenAIEmbeddings&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.checkpoint.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemorySaver&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.config&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_store&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;StateGraph&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.graph.message&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.store.memory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InMemoryStore&lt;/span&gt;

&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parent&lt;/span&gt;
&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_ROOT&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.env&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;CHAT_MODEL&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_CHAT_MODEL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;require_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Exit with a clear message if the OpenAI key is missing.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;OPENAI_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ERROR: Set OPENAI_API_KEY in the environment or in a .env file next to this script.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;stderr&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;exit&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  1. SHORT-TERM MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_short_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;STM: conversation buffer restored per thread_id via checkpointer.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;

    &lt;span class="n"&gt;tid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;session-stm-demo&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tid&lt;/span&gt;&lt;span class="p"&gt;}}&lt;/span&gt;

    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My codename for this session is Bluejay.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What codename did I give?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt; &lt;span class="n"&gt;cfg&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[STM] Last reply:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;h2&gt;
  
  
  2. LONG-TERM MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_long_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;LTM: LangGraph Store persists facts across different thread_ids.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_store&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;facts&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;startswith&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;remember:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]}&lt;/span&gt;

        &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;profile&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;item&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stored user fact (long-term): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;fact&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;remember_node&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;checkpointer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;InMemorySaver&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Remember: I always want concise bullet answers.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-a&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What style do I prefer?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)]},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;configurable&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;thread_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ltm-b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[LTM] Reply on a *different* thread_id:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  3. WORKING MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TypedDict&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;State with a scratchpad: notes lists from all nodes are concatenated.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;BaseMessage&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;add_messages&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;notes&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Annotated&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;operator&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Simulated research node — returns structured data into working memory.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor A monthly price = $49&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Competitor B monthly price = $39&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_working_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Working memory: research node fills notes, answer node reads them.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;notes&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;
            &lt;span class="nc"&gt;SystemMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Answer using only the working notes below.&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;## Working notes&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;notes&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which competitor is cheaper and by how much?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;])&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;]}&lt;/span&gt;

    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;WorkingState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;research_step&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer_from_notes&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;answer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[],&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;notes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[]})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Working] Final:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
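The `operator.add` reducer on `notes` is what makes the scratchpad accumulate: every node's returned list is concatenated onto the existing state instead of replacing it. A minimal sketch of that merge rule, outside of LangGraph:

```python
import operator

# Simulate how the reducer folds successive node updates into state.
notes: list[str] = []
for update in (["Competitor A monthly price = $49"],
               ["Competitor B monthly price = $39"]):
    notes = operator.add(notes, update)  # equivalent to notes + update

# Both updates survive in order, ready for the answer node to read.
print(notes)
```

This is why `research_step` can simply return its own partial list: the framework applies the reducer, and no node ever has to read-modify-write the whole scratchpad.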



&lt;h2&gt;
  
  
  4. EPISODIC MEMORY
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_episodic_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Episodic memory: one logged event written to store, recalled via search.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;InMemoryStore&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;ns&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;users&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;demo-user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;episodes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;eid&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uuid&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;uuid4&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;
    &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;eid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;task&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pricing_review&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;outcome&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Chose plan B after comparing three quotes&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;store&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ns&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Episodic] Stored episodes:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;value&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
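The demo above recalls whatever sits in the namespace; a real episodic store ranks past episodes by relevance to the current task, usually with embeddings. A pure-Python sketch of that recall step (hypothetical helper, with word overlap standing in for vector similarity):

```python
def recall_episodes(episodes: list[dict], query: str, k: int = 2) -> list[dict]:
    """Rank stored episodes by word overlap with the query and return the top k.
    A production system would use embedding similarity instead."""
    q = set(query.lower().split())
    scored = sorted(
        episodes,
        key=lambda ep: len(q & set(ep["outcome"].lower().split())),
        reverse=True,
    )
    return scored[:k]


episodes = [
    {"task": "pricing_review", "outcome": "Chose plan B after comparing three quotes"},
    {"task": "onboarding", "outcome": "Set up the new workspace for the team"},
]
print(recall_episodes(episodes, "which plan did we choose", k=1))
```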



&lt;h2&gt;
  
  
  5. SEMANTIC MEMORY (RAG)
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Build an in-memory FAISS index over profile document chunks.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Seenivasa Ramadurai works at Provizient. He architects cloud-native software — &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;microservices, gRPC, REST — and delivers GenAI, LLMs, and agentic patterns.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="nc"&gt;Document&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="n"&gt;page_content&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;At Provizient, skills include C#, Python, Java, Scala, TypeScript; LLMs, RAG, &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;orchestration; ML and MLOps; vector databases; APIs; Kubernetes and Docker.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="nc"&gt;OpenAIEmbeddings&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Node factory: bind tools to the LLM and return a graph node function.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;bound&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;bound&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])]}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;node&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;demo_semantic_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Semantic memory: ReAct loop with FAISS retrieval tool.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;kb&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;build_kb&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nd"&gt;@tool&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Retrieve top-k chunks from the profile knowledge base.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="n"&gt;docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;kb&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;similarity_search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;d&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;page_content&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;d&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;docs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;profile_kb_search&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;StateGraph&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;MessagesState&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;_bind_tools&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_node&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nc"&gt;ToolNode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;START&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_conditional_edges&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools_condition&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__end__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;END&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;add_edge&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tools&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;graph&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;compile&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="n"&gt;out&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="nc"&gt;HumanMessage&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Which company does Seenivasa work for, and what are some of his skills? &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Use the knowledge tool.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="p"&gt;})&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Semantic] Last message:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;out&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ENTRY POINT
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run all five memory demos in sequence.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="nf"&gt;require_api_key&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;CHAT_MODEL&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 1. SHORT-TERM MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_short_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 2. LONG-TERM MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_long_term_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 3. WORKING MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_working_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 4. EPISODIC MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_episodic_memory&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;=== 5. SEMANTIC MEMORY ===&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;demo_semantic_memory&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>The 5 Types of AI Agent Memory Every Developer Needs to Know (Part 1)</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Thu, 02 Apr 2026 04:13:24 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn</link>
      <guid>https://future.forem.com/sreeni5018/the-5-types-of-ai-agent-memory-every-developer-needs-to-know-part-1-52fn</guid>
      <description>&lt;p&gt;&lt;em&gt;Because building agents without understanding memory is like hiring an employee who forgets everything by morning.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Your Agent Is Not Broken. It Was Never Built to Remember.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Here is something most people get wrong when they first build an AI agent. They set it up, give it context, run a few tasks, and it works great. Then they come back the next session and it has no idea who they are, what the project is, or what was decided. So they open a GitHub issue. They try different prompts. They assume something is misconfigured.&lt;/p&gt;

&lt;p&gt;Nothing is misconfigured. The agent is working exactly as designed.&lt;br&gt;
The hard truth is this: &lt;strong&gt;agent memory is not a model problem. It is an infrastructure problem.&lt;/strong&gt; The LLM at the core of your agent is stateless by design: every inference call starts completely fresh. No history, no context, no record of what happened before. That is never going to change, because statelessness is precisely what allows LLMs to scale to millions of users at once.&lt;/p&gt;

&lt;p&gt;What this means for builders is important: &lt;strong&gt;you cannot give the model memory. You have to build memory infrastructure around it.&lt;/strong&gt;&lt;br&gt;
The agent does not remember. The infrastructure remembers. The agent only knows what the infrastructure decides to place in front of it inside the context window.&lt;/p&gt;

&lt;p&gt;That distinction is the foundation of everything in this post. Once you understand it, the five memory types stop being abstract concepts and start being concrete engineering decisions you make when designing an agent system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Context Window: Why It's at the Center of Every Memory Decision
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before we get into the memory types&lt;/strong&gt;, you need to understand one thing clearly: &lt;strong&gt;the context window is the only reality the LLM has&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Every token the model can reason about&lt;/strong&gt; (your message, the conversation history, retrieved documents, tool outputs, system instructions) must be inside the &lt;strong&gt;context window at the moment of inference&lt;/strong&gt;. If it is not in the window, the model does not know it exists. Full stop.&lt;/p&gt;

&lt;p&gt;This is why memory architecture matters so much. Context windows are finite: they have token limits, they cost money to fill, and they reset completely between sessions. You cannot just dump everything into them and call it done. You need a system that intelligently decides what information gets retrieved and injected into that window, and at what moment.&lt;/p&gt;

&lt;p&gt;That system is agent memory. And because different situations demand different kinds of information (recent conversation turns, &lt;strong&gt;user preferences, mid-task reasoning state, past interaction history,&lt;/strong&gt; domain facts), &lt;strong&gt;there is not one type of memory but five&lt;/strong&gt;, each built to retrieve and inject the right information at the right moment.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Memory Problem Got Serious
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AI applications did not start as agents&lt;/strong&gt;. They started as simple &lt;strong&gt;request-response systems&lt;/strong&gt;: you send a message, the model replies, &lt;strong&gt;nothing is retained&lt;/strong&gt;. Each call was &lt;strong&gt;completely isolated from the last&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;first attempt&lt;/strong&gt; to &lt;strong&gt;fix this was brute force: send the entire conversation history with every request&lt;/strong&gt;. It worked well enough for short conversations, but it was never really memory; it was just a &lt;strong&gt;growing pile of text being thrown at the model each time&lt;/strong&gt;. Once conversations got long enough, older messages fell off the token limit and disappeared. &lt;strong&gt;The "memory" was already leaking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Then models gained the ability to &lt;strong&gt;call tools (APIs, databases, search engines), and the use cases jumped entirely&lt;/strong&gt;. Now you could build agents: systems that take a &lt;strong&gt;goal&lt;/strong&gt;, &lt;strong&gt;break it into steps&lt;/strong&gt;, &lt;strong&gt;call tools, observe results&lt;/strong&gt;, and &lt;strong&gt;loop until the task is complete&lt;/strong&gt;. Then &lt;strong&gt;came multi-agent systems&lt;/strong&gt;, where specialized agents work as a team, routing tasks between each other like a coordinated workforce.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each step forward&lt;/strong&gt; made the &lt;strong&gt;memory problem worse&lt;/strong&gt;. A single chatbot &lt;strong&gt;forgetting context is annoying&lt;/strong&gt;. An agent losing state mid-task is a failure. A multi-agent system where no agent knows what the others have decided is a broken system. The "stuff everything into the context window" approach simply does not hold at this level of complexity.&lt;/p&gt;

&lt;p&gt;What you need instead is &lt;strong&gt;intentional memory architecture&lt;/strong&gt;: a layer that knows what to store, how long to keep it, and exactly when to surface it. &lt;strong&gt;That layer is built on five distinct memory types&lt;/strong&gt;, each designed to solve a different part of the problem.&lt;/p&gt;

&lt;h2&gt;
  
  
  The 5 Types of Agent Memory
&lt;/h2&gt;

&lt;h2&gt;
  
  
  1. Short-Term Memory (STM): The Conversation Buffer
&lt;/h2&gt;

&lt;p&gt;Short-Term Memory (STM) is the simplest form of agent memory and the one you are almost certainly already using without thinking about it.&lt;br&gt;
Every message the user sends and every response the agent gives gets stored in a session buffer. &lt;strong&gt;That buffer gets assembled into the context window on every subsequent request&lt;/strong&gt;. This is how the agent understands &lt;strong&gt;follow-up questions&lt;/strong&gt;: when you say "make it shorter," it knows what "it" refers to because the prior exchange is sitting in the context window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk6kgtm7c146plepqait.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frk6kgtm7c146plepqait.png" alt=" " width="518" height="463"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The technical implementation is a rolling token buffer. When the buffer approaches the model's &lt;strong&gt;token limit, older messages get truncated or summarized before dropping off&lt;/strong&gt;. New inputs overwrite old ones. When the session ends, the buffer clears entirely.&lt;br&gt;
Think of it like RAM in a computer: fast, active, and useful right now. But the moment you turn it off, it's gone.&lt;/p&gt;
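&lt;p&gt;Here is a minimal sketch of such a rolling buffer. The class name, token heuristic, and budget are illustrative (a real framework would use the model's actual tokenizer), but the trimming logic is the core idea:&lt;/p&gt;

```python
# Sketch of a short-term memory buffer: oldest messages are dropped
# once the rough token budget is exceeded. Names are illustrative.

def estimate_tokens(text):
    # Crude heuristic: roughly 4 characters per token.
    return max(1, len(text) // 4)

class ConversationBuffer:
    def __init__(self, max_tokens=200):
        self.max_tokens = max_tokens
        self.messages = []  # list of (role, content) tuples

    def add(self, role, content):
        self.messages.append((role, content))
        self._trim()

    def _trim(self):
        # Drop the oldest messages until the budget fits again.
        while self._total() > self.max_tokens and len(self.messages) > 1:
            self.messages.pop(0)

    def _total(self):
        return sum(estimate_tokens(c) for _, c in self.messages)

    def render(self):
        # Assemble the buffer into the prompt for the next request.
        return "\n".join(f"{role}: {content}" for role, content in self.messages)
```

&lt;p&gt;Production frameworks often summarize evicted messages instead of discarding them outright, but the shape of the mechanism is the same.&lt;/p&gt;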

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; Conversation coherence within a single session. &lt;strong&gt;Follow up questions&lt;/strong&gt;. Context continuity across a short interaction.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What it does not solve:&lt;/strong&gt; Anything beyond the current session. Come back tomorrow, and the agent has no idea who you are.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Long-Term Memory (LTM): Persistence Across Sessions
&lt;/h2&gt;

&lt;p&gt;Long Term Memory is what makes an agent feel like it actually knows you.&lt;/p&gt;

&lt;p&gt;Instead of losing everything when a session ends, LTM stores important information in a persistent external store: user preferences, past decisions, project context, communication style, recurring constraints. The next time you interact with the agent, the most relevant pieces of that stored knowledge get retrieved and injected into the context window before the model ever sees your message.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkk0wvchvjcklpo4unnu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmkk0wvchvjcklpo4unnu.png" alt=" " width="560" height="445"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The standard implementation &lt;strong&gt;uses a vector database&lt;/strong&gt; like &lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Weaviate&lt;/strong&gt;, or &lt;strong&gt;ChromaDB&lt;/strong&gt;. When something worth remembering happens, it gets converted into a vector embedding and stored with metadata. On future sessions, incoming queries trigger a similarity search; the &lt;strong&gt;top-k&lt;/strong&gt; most semantically relevant memories are retrieved and quietly injected into context. The model then responds as if it already knew those things about you, because from its perspective, it does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The workflow in practice:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User shares something reusable: &lt;strong&gt;preferences, goals, constraints, project structure&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;That information is embedded and stored in the vector database&lt;/li&gt;
&lt;li&gt;On every future session, a similarity search retrieves what is relevant&lt;/li&gt;
&lt;li&gt;Retrieved memories are injected into the context window before the model processes the request&lt;/li&gt;
&lt;li&gt;Memory updates when new important information is provided&lt;/li&gt;
&lt;/ul&gt;
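&lt;p&gt;The workflow above can be sketched end to end in a few lines. The bag-of-words "embedding" and in-memory store below are toy stand-ins for a real embedding model and a vector database such as Pinecone, Weaviate, or ChromaDB; only the store-search-inject shape is the point:&lt;/p&gt;

```python
import math
from collections import Counter

# Toy stand-in for an embedding model: bag-of-words counts.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse bag-of-words vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LongTermMemory:
    def __init__(self):
        self.records = []  # (embedding, text) pairs; a real store adds metadata

    def store(self, text):
        self.records.append((embed(text), text))

    def retrieve(self, query, top_k=2):
        # Similarity search: rank stored memories against the query.
        q = embed(query)
        scored = sorted(self.records, key=lambda r: cosine(q, r[0]), reverse=True)
        return [text for _, text in scored[:top_k]]

ltm = LongTermMemory()
ltm.store("User prefers cost over speed in trade-off decisions")
ltm.store("Team report format: one-page executive summary")
ltm.store("User's favorite editor is VS Code")

# The top-k relevant memories get injected into the context window.
relevant = ltm.retrieve("How should I weigh cost versus speed here?")
```

&lt;p&gt;A real implementation swaps embed() for an embedding API and the list for a vector index, but the retrieval contract is identical.&lt;/p&gt;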

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; &lt;strong&gt;Cross session personalization. User preference retention. Long running project continuity&lt;/strong&gt;. Making the agent feel like a real colleague who knows your context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An AI assistant that remembers your name, your team's preferred report format, and the fact that you always prioritize cost over speed in trade-off decisions, even when you return after weeks away.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Working Memory: The Reasoning Scratchpad
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Working Memory&lt;/strong&gt; is what the agent uses while it is actively thinking through a complex, &lt;strong&gt;multi-step task.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Imagine you ask an &lt;strong&gt;agent to research five competitors&lt;/strong&gt;, extract their pricing, compare them against your product, and write a summary recommendation. That is not one step; it is a chain of steps where each result feeds into the next. Working memory is the temporary store that holds intermediate results across those steps, so the agent does not lose track of what it has already done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsuwo4q93dh72rfz6ft4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgsuwo4q93dh72rfz6ft4.png" alt=" " width="531" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Without working memory, each loop iteration in an agentic workflow would start with no knowledge of previous iterations. The agent would spin in circles or repeat steps it had already completed.&lt;br&gt;
The implementation is typically an in-memory structure (a dict or JSON object) maintained by the agent framework across loop iterations. At each step, the current working memory state gets injected into the context window alongside the new task, so the model can build on prior results. Once the task is complete, working memory is cleared.&lt;/p&gt;
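&lt;p&gt;That scratchpad pattern can be sketched like this. The task, the research_competitor() tool, and the prices are all invented for illustration; what matters is the dict that survives across loop iterations:&lt;/p&gt;

```python
# Sketch of working memory as a scratchpad dict carried across loop steps.
# The "tool" and its data are stand-ins for real tool calls.

def research_competitor(name):
    # Placeholder tool: returns fabricated pricing for illustration.
    prices = {"Acme": 49, "Globex": 79}
    return prices[name]

def run_task(competitors):
    working_memory = {"pricing": {}, "steps_done": []}

    for name in competitors:                       # the agentic loop
        price = research_competitor(name)          # act
        working_memory["pricing"][name] = price    # store intermediate result
        working_memory["steps_done"].append(name)  # track progress

    # The final step builds on everything accumulated so far.
    cheapest = min(working_memory["pricing"], key=working_memory["pricing"].get)
    summary = f"Cheapest competitor: {cheapest}"
    return summary, working_memory

summary, scratchpad = run_task(["Acme", "Globex"])
```

&lt;p&gt;In a framework like LangGraph, this dict becomes the graph state passed between nodes; the principle is unchanged.&lt;/p&gt;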

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; &lt;strong&gt;Multi step task execution&lt;/strong&gt;. Complex reasoning chains. Agentic loops that need to carry state from one iteration to the next without losing the thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An agent planning a travel itinerary holds flights, hotel constraints, budget limits, and date conflicts in working memory, building the full picture step by step before producing a final recommendation.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Episodic Memory: The Interaction Log
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Episodic Memory&lt;/strong&gt; gives an agent the &lt;strong&gt;ability to recall specific things that happened in the past: not just general preferences&lt;/strong&gt;, but actual events with context and outcomes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae8las0a528rt5tjq94h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fae8las0a528rt5tjq94h.png" alt=" " width="450" height="395"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;strong&gt;Long-Term Memory stores what you like, Episodic Memory stores what happened&lt;/strong&gt;. It is a structured log of past interactions, each saved as an event record with a timestamp, the task that was performed, inputs, actions taken, and the outcome. Think of it as the agent's diary: specific, timestamped, retrievable.&lt;/p&gt;

&lt;p&gt;When you come back and ask &lt;strong&gt;"what did we work on last week?"&lt;/strong&gt; or "remind me of the decision we made on the pricing model," the agent queries the episodic store by timestamp, keyword, or semantic similarity, &lt;strong&gt;retrieves the relevant episodes&lt;/strong&gt;, compresses them into a summary, and injects that summary into the current context window.&lt;br&gt;
This is also what enables agents to say things like: "Last time you reviewed this type of document, you flagged the legal section first; want me to start there again?" That is episodic memory working correctly.&lt;/p&gt;
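&lt;p&gt;An episodic store can be sketched as a list of timestamped event records with keyword and time-bounded recall. The events and field names below are invented for illustration:&lt;/p&gt;

```python
from datetime import datetime, timedelta

# Sketch of an episodic store: timestamped event records
# queried by keyword or by time window.

class EpisodicMemory:
    def __init__(self):
        self.episodes = []

    def log(self, task, outcome, when=None):
        # Each episode is a structured record of what happened.
        self.episodes.append({
            "when": when or datetime.now(),
            "task": task,
            "outcome": outcome,
        })

    def recall(self, keyword=None, since=None):
        hits = self.episodes
        if keyword:
            hits = [e for e in hits if keyword.lower() in e["task"].lower()]
        if since:
            hits = [e for e in hits if e["when"] >= since]
        return hits

mem = EpisodicMemory()
mem.log("Pricing model review", "Chose Option A for budget reasons",
        when=datetime.now() - timedelta(days=10))
mem.log("Contract review", "Flagged the legal section first",
        when=datetime.now() - timedelta(days=2))

# "What did we work on last week?" becomes a time-bounded query.
last_week = mem.recall(since=datetime.now() - timedelta(days=7))
```

&lt;p&gt;Production systems typically back this with a database and add semantic search over the records, then summarize the retrieved episodes before injecting them into context.&lt;/p&gt;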

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; Specific past event recall. &lt;strong&gt;Long running project continuity&lt;/strong&gt;. Agents that learn from experience and build on prior decisions rather than repeating mistakes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; "Last time you chose Option A over Option B because of budget; should I apply the same logic here?" That sentence could only come from an agent with episodic memory.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Semantic Memory: The Knowledge Layer
&lt;/h2&gt;

&lt;p&gt;Semantic Memory is the agent's understanding of the world: &lt;strong&gt;facts&lt;/strong&gt;, concepts, domain knowledge, relationships between things, independent of any specific interaction with you.&lt;/p&gt;

&lt;p&gt;It is not about your history with the agent. It is about what the agent knows to be true. That Python is a programming language. That Singapore's corporate tax rate is 17%. That a JWT token expires and must be refreshed. This kind of knowledge lives either in the model's pre-trained weights or, more usefully for &lt;strong&gt;domain-specific&lt;/strong&gt; and &lt;em&gt;up-to-date&lt;/em&gt; needs, in an external knowledge base accessed through &lt;strong&gt;RAG&lt;/strong&gt; (&lt;strong&gt;R&lt;/strong&gt;etrieval &lt;strong&gt;A&lt;/strong&gt;ugmented &lt;strong&gt;G&lt;/strong&gt;eneration).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7twiwcyl4jc034ewgwug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7twiwcyl4jc034ewgwug.png" alt=" " width="493" height="390"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When you ask a factual or domain specific question, the agent does a semantic search against the knowledge base, retrieves the most &lt;strong&gt;relevant facts, injects them into the context window&lt;/strong&gt;, and generates a grounded response. This is how you build agents that give accurate answers in specialized domains without hallucinating details they were never trained on.&lt;/p&gt;
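&lt;p&gt;A toy sketch of that grounding step. The fact base and the keyword-overlap "search" below are stand-ins for a real knowledge base and a semantic retriever; the point is that the answer is constrained to retrieved facts rather than the model's memory:&lt;/p&gt;

```python
# Sketch of grounding a response in retrieved facts.
# fact_base and search() are toy stand-ins for a real knowledge base
# and a real semantic retriever.

fact_base = [
    "Singapore's corporate tax rate is 17%.",
    "JWT access tokens expire and must be refreshed.",
    "Python is a dynamically typed programming language.",
]

def search(query, facts):
    # Toy keyword-overlap scoring in place of embedding similarity.
    q_words = set(query.lower().split())
    def score(fact):
        return len(q_words.intersection(fact.lower().split()))
    best = max(facts, key=score)
    return best if score(best) > 0 else None

def build_prompt(question):
    # Retrieved facts are injected into the context window,
    # and the model is told to answer only from them.
    fact = search(question, fact_base)
    context = fact if fact else "No relevant facts found."
    return f"Use only this context to answer.\nContext: {context}\nQuestion: {question}"

prompt = build_prompt("What is the corporate tax rate in Singapore?")
```

&lt;p&gt;Swap the toy search for a vector store query and the string for a chat request, and this is the skeleton of a RAG-backed semantic memory layer.&lt;/p&gt;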

&lt;p&gt;&lt;strong&gt;What it solves:&lt;/strong&gt; Factual accuracy. Domain specific expertise. Keeping agents grounded in verified knowledge beyond their training cutoff. Enterprise knowledge bases where accuracy is non negotiable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real example:&lt;/strong&gt; An agent asked "Is &lt;strong&gt;Bangalore&lt;/strong&gt; more populous than &lt;strong&gt;Amaravathi&lt;/strong&gt;?" does not guess from training data; it queries semantic memory, retrieves the fact, and answers with confidence.&lt;/p&gt;

&lt;h2&gt;
  
  
  How All Five Work Together
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;These memory types are not mutually exclusive&lt;/strong&gt;: a well-designed &lt;strong&gt;agent uses&lt;/strong&gt; all of them &lt;strong&gt;simultaneously&lt;/strong&gt;, each handling a different layer of the memory problem.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnswrxcjxl8r17vkxki.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8nnswrxcjxl8r17vkxki.png" alt=" " width="800" height="423"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Tools and Frameworks That Make This Real
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;This is not theoretical. The tooling is production ready right now.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain&lt;/strong&gt; handles buffer memory, summary memory, and vector-based LTM out of the box. It is the most flexible starting point for composing memory types together in one agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LlamaIndex&lt;/strong&gt; is purpose-built for connecting external knowledge sources (PDFs, APIs, databases, knowledge graphs), making it the go-to for &lt;strong&gt;RAG&lt;/strong&gt;-heavy Semantic Memory implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Weaviate&lt;/strong&gt;, and &lt;strong&gt;ChromaDB&lt;/strong&gt; are dedicated vector stores that power both &lt;strong&gt;LTM&lt;/strong&gt; and &lt;strong&gt;Semantic Memory&lt;/strong&gt; with fast, scalable similarity-based retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangGraph&lt;/strong&gt; brings graph-based orchestration to stateful, multi-step agentic workflows; this is what &lt;strong&gt;Part 2&lt;/strong&gt; uses to wire all five memory types into a real working system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AWS Strands&lt;/strong&gt; Agents provides production-grade agent infrastructure with memory at cloud scale, also covered hands-on in &lt;strong&gt;Part 2&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
    </item>
    <item>
      <title>Retrieval Finds Candidates. Reranking Finds the Right One.</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 30 Mar 2026 05:09:53 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/retrieval-finds-candidates-reranking-finds-the-right-one-1p0i</link>
      <guid>https://future.forem.com/sreeni5018/retrieval-finds-candidates-reranking-finds-the-right-one-1p0i</guid>
      <description>&lt;p&gt;&lt;em&gt;A hiring analogy that finally makes RAG Reranking click&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  First, What Is RAG?
&lt;/h2&gt;

&lt;p&gt;Before we get into the analogy, let me give you a 30-second crash course on &lt;strong&gt;RAG&lt;/strong&gt;, because this is where reranking lives.&lt;br&gt;
&lt;strong&gt;RAG&lt;/strong&gt; stands for &lt;strong&gt;R&lt;/strong&gt;etrieval &lt;strong&gt;A&lt;/strong&gt;ugmented &lt;strong&gt;G&lt;/strong&gt;eneration.&lt;/p&gt;
&lt;h2&gt;
  
  
  Here's the problem it solves:
&lt;/h2&gt;

&lt;p&gt;Large Language Models (LLMs) like &lt;strong&gt;GPT&lt;/strong&gt; or &lt;strong&gt;Claude&lt;/strong&gt; are &lt;strong&gt;incredibly powerful&lt;/strong&gt; but &lt;strong&gt;they only know what they were trained on&lt;/strong&gt;. They don't know about your &lt;strong&gt;company's internal documents&lt;/strong&gt;, last week's product update, or your &lt;strong&gt;customer support knowledge base&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;RAG fixes that by giving the LLM a memory it can search.&lt;br&gt;
Here's how it works in three simple steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieve&lt;/strong&gt; — When a user asks a question, the system searches your document library and pulls the most relevant chunks&lt;br&gt;
&lt;strong&gt;Augment&lt;/strong&gt; — Those retrieved chunks are added to the prompt as context&lt;br&gt;
&lt;strong&gt;Generate&lt;/strong&gt; — The LLM reads the context and generates a grounded, accurate answer&lt;/p&gt;

&lt;p&gt;Think of it like an &lt;strong&gt;open book exam&lt;/strong&gt;. The LLM doesn't have to &lt;strong&gt;memorize everything&lt;/strong&gt;; it just needs to find the &lt;strong&gt;right page and read it&lt;/strong&gt;. Simple enough. &lt;strong&gt;But here's where most RAG systems quietly fail.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The Hiring Analogy That Changes Everything
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqobvry52ejg2pkr5ovj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foqobvry52ejg2pkr5ovj.png" alt=" " width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;One of my friends recently asked me a &lt;strong&gt;simple but powerful question:&lt;/strong&gt; "Why do we even need &lt;strong&gt;reranking&lt;/strong&gt; after &lt;strong&gt;retrieval&lt;/strong&gt;? Isn't finding the right documents enough?" Instead of going technical, I said, "&lt;strong&gt;Let me tell you about a hiring process.&lt;/strong&gt;"&lt;br&gt;
Think of embedding-based retrieval as your HR or Talent Acquisition team.&lt;/p&gt;
&lt;h2&gt;
  
  
  Their job is to:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;Scan thousands of resumes&lt;/li&gt;
&lt;li&gt;Filter based on keywords, skills, and experience&lt;/li&gt;
&lt;li&gt;Shortlist candidates that look relevant&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This is exactly what &lt;strong&gt;vector similarity does&lt;/strong&gt;. It retrieves documents that are "&lt;strong&gt;close enough&lt;/strong&gt;" based on &lt;strong&gt;embeddings&lt;/strong&gt;: fast, broad, and essential.&lt;/p&gt;
&lt;h2&gt;
  
  
  But here's the problem nobody talks about:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;👉 Relevance is not correctness.&lt;/li&gt;
&lt;li&gt;👉 Similarity is not suitability.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Just because a resume matches keywords doesn't mean the candidate can actually solve the hiring manager's real problem&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;The same way, just because a document is topically similar doesn't mean it actually answers the user's question.&lt;/em&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Now enters the Hiring Manager.
&lt;/h2&gt;

&lt;p&gt;The hiring manager:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Reviews the shortlisted candidates deeply&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Evaluates beyond surface level keywords&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Matches candidates against the actual needs of the role&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Rejects those who don't truly fit&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Surfaces the one who genuinely belongs&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This step is exactly what we call &lt;strong&gt;Reranking&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vxgux9ytd0zo3tovpa5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6vxgux9ytd0zo3tovpa5.png" alt=" " width="800" height="289"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  In AI Terms
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Retrieval gives you &lt;strong&gt;Top-K similar documents&lt;/strong&gt; (the shortlist)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reranking&lt;/strong&gt; evaluates &lt;strong&gt;semantic relevance&lt;/strong&gt; to the actual question (the deep review)&lt;/li&gt;
&lt;li&gt;It pushes the most &lt;strong&gt;useful answer to the top&lt;/strong&gt; and filters out the noise&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Real World Example: Cohere Reranking Model
&lt;/h2&gt;

&lt;p&gt;One of the most popular and production ready reranking solutions today is Cohere's Rerank API.&lt;/p&gt;

&lt;p&gt;Here's how it fits into a RAG pipeline in practice:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;cohere&lt;/span&gt;

&lt;span class="n"&gt;co&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cohere&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 1: Your retrieval system fetches top-K documents
&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What is the refund policy for enterprise customers?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;retrieved_docs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Our refund policy allows returns within 30 days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise customers get dedicated support and SLA guarantees.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Enterprise plans include custom refund terms negotiated at contract signing.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Refunds are processed within 5–7 business days.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Customer support is available 24/7 for enterprise accounts.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="c1"&gt;# Step 2: Cohere Reranker evaluates each document against the query
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;co&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;rerank&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rerank-english-v3.0&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;top_n&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;  &lt;span class="c1"&gt;# Return only the top 3 most relevant
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: Most relevant documents bubble to the top
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;start&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;  &lt;span class="c1"&gt;# rank is the reranked position; result.index points back into retrieved_docs&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rank &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;rank&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Score: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;relevance_score&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Document: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;retrieved_docs&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  What Cohere Rerank does differently:
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;It doesn't just compare embeddings; it reads the query and document together&lt;/li&gt;
&lt;li&gt;It uses a &lt;strong&gt;cross encoder architecture&lt;/strong&gt; that understands the relationship between the question and each document&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;It returns a relevance score for each document&lt;/strong&gt; so you know exactly why something ranked higher&lt;/li&gt;
&lt;li&gt;It works on top of any retrieval system &lt;strong&gt;FAISS&lt;/strong&gt;, &lt;strong&gt;Pinecone&lt;/strong&gt;, Weaviate, you name it&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Sample Output:
&lt;/h2&gt;

&lt;p&gt;Rank 1 | Score: 0.9821&lt;br&gt;
Document: Enterprise plans include custom refund terms negotiated at contract signing.&lt;/p&gt;

&lt;p&gt;Rank 2 | Score: 0.7134&lt;br&gt;
Document: Our refund policy allows returns within 30 days.&lt;/p&gt;

&lt;p&gt;Rank 3 | Score: 0.4821&lt;br&gt;
Document: Refunds are processed within 5–7 business days.&lt;/p&gt;

&lt;p&gt;Notice how the document that specifically answers the enterprise refund question jumps to the top, even though all five documents were "&lt;strong&gt;about&lt;/strong&gt;" refunds or enterprise. That's the hiring manager effect in action.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Insight
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without Reranking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You get &lt;strong&gt;good&lt;/strong&gt; &lt;strong&gt;looking&lt;/strong&gt; answers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;But not always correct&lt;/strong&gt; or truly useful ones&lt;/li&gt;
&lt;li&gt;Your LLM is working with &lt;strong&gt;noisy&lt;/strong&gt;, approximate inputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;With Reranking:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You move from approximate similarity → &lt;strong&gt;precise relevance&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Your LLM gets exactly the &lt;strong&gt;right context to generate sharp&lt;/strong&gt;, accurate answers&lt;/li&gt;
&lt;li&gt;The difference in output quality is night and day.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  One Line Takeaway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Retrieval&lt;/strong&gt; is about &lt;strong&gt;finding&lt;/strong&gt; options. &lt;strong&gt;Reranking&lt;/strong&gt; is about making the &lt;strong&gt;right decision.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The next time someone asks &lt;strong&gt;why reranking matters&lt;/strong&gt;, skip the jargon.&lt;br&gt;
Just say: "HR shortlists the candidates. The hiring manager picks the right one. Your AI needs both."&lt;br&gt;
Because in RAG systems, just like in hiring, &lt;strong&gt;getting the right candidates in the room is only half the battle. Choosing the right one is where the magic happens.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt; &lt;/p&gt;

</description>
      <category>ai</category>
      <category>beginners</category>
      <category>llm</category>
      <category>rag</category>
    </item>
    <item>
      <title>Agent Skills: The Missing Layer That Makes AI Agents Enterprise Ready</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Wed, 18 Mar 2026 06:55:45 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/agent-skills-the-missing-layer-that-makes-ai-agents-enterprise-ready-3gc</link>
      <guid>https://future.forem.com/sreeni5018/agent-skills-the-missing-layer-that-makes-ai-agents-enterprise-ready-3gc</guid>
      <description>&lt;h2&gt;
  
  
  How Enterprises Encode Institutional Knowledge into AI Agents
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F239prkrxyw1o8uqneppo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F239prkrxyw1o8uqneppo.png" alt=" " width="800" height="1175"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;An AI agent is a system that uses a large language model (LLM) to reason, plan, and act to reach a goal. Unlike a simple chatbot that only answers questions, an agent can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perceive&lt;/strong&gt; its environment (user input, tool results, context)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Decide&lt;/strong&gt; what to do next (reason, plan, choose tools)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Act&lt;/strong&gt; by calling tools, APIs, or scripts&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Iterate&lt;/strong&gt; until the task is done (ReAct: Reason → Act → Observe → repeat)&lt;/li&gt;
&lt;/ul&gt;
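&lt;p&gt;The Reason → Act → Observe loop can be sketched in plain Python. Here &lt;code&gt;llm_decide&lt;/code&gt; and the tool registry are hypothetical stand-ins for a real model call and real tools:&lt;/p&gt;

```python
# Skeletal ReAct loop: Reason -> Act -> Observe -> repeat.
# llm_decide is a hard-coded stand-in for an actual LLM call.

def get_time(_):
    return "14:30 UTC"

TOOLS = {"get_time": get_time}

def llm_decide(goal, observations):
    """Stand-in for the model: returns either a tool call or a final answer."""
    if observations:
        return {"final": f"The time is {observations[-1]}"}
    return {"tool": "get_time", "args": None}

def run_agent(goal, max_steps=5):
    observations = []
    for _ in range(max_steps):
        step = llm_decide(goal, observations)        # Reason
        if "final" in step:
            return step["final"]
        result = TOOLS[step["tool"]](step["args"])   # Act
        observations.append(result)                  # Observe, then repeat

print(run_agent("What time is it?"))  # → The time is 14:30 UTC
```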

&lt;p&gt;AI agents are being deployed across every department: &lt;strong&gt;HR, Finance, Legal, IT, Customer Support&lt;/strong&gt;. They can read documents, call APIs, extract data, and take actions. On paper, the capability is there.&lt;/p&gt;

&lt;p&gt;But in practice, &lt;strong&gt;something keeps going wrong&lt;/strong&gt;. The agent gives an answer that is &lt;strong&gt;technically correct but does not match how your company actually operates&lt;/strong&gt;. It follows a process that made sense in general but violates your internal policy. It retrieves the right data but does not know what to do with it next. It escalates everything when only some things need escalating or escalates nothing when everything does.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The problem is not the model&lt;/strong&gt;. &lt;strong&gt;The problem is that the agent has no knowledge of your organization&lt;/strong&gt;: your leave policies, your approval thresholds, your GL coding rules, your escalation criteria, your vendor requirements. It has been trained on the public internet. It knows what companies do in general. It does not know what your company does specifically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This is the gap that Agent Skills closes&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Are Agent Skills?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agent Skills are a simple, open format for giving AI agents domain expertise and procedural knowledge&lt;/strong&gt;. They answer one question: How should the agent approach this kind of task?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pilot, the Plane, and the Flight Manual
&lt;/h2&gt;

&lt;p&gt;The clearest way to understand how the &lt;strong&gt;LLM&lt;/strong&gt;, &lt;strong&gt;Tools&lt;/strong&gt;, and &lt;strong&gt;Agent Skills&lt;/strong&gt; work together is through a single analogy, one that maps all three, not just two.&lt;/p&gt;

&lt;p&gt;Picture a &lt;strong&gt;commercial aircraft&lt;/strong&gt; sitting at the &lt;strong&gt;gate&lt;/strong&gt;, ready for &lt;strong&gt;departure&lt;/strong&gt;. It has everything a plane is supposed to have. But before we can talk about what makes it fly safely, we need to understand &lt;strong&gt;what each part actually does&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Controls: Tools
&lt;/h2&gt;

&lt;p&gt;The aircraft has throttles, a yoke, flaps, landing gear, and dozens of other physical controls. These are the mechanisms that change the state of the world. Push the throttle forward and the engines spool up. Deploy the flaps and the lift characteristics of the wing change. Lower the landing gear and the plane is ready to touch down.&lt;/p&gt;

&lt;p&gt;The controls cannot do anything on their own. They sit inert until someone operates them. But without them, nothing can happen at all, no matter how skilled the pilot or how detailed the manual.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In an AI agent, Tools are the controls&lt;/strong&gt;. They are the &lt;strong&gt;callable&lt;/strong&gt; functions that interact with the outside world: &lt;strong&gt;querying&lt;/strong&gt; a database, calling a &lt;strong&gt;REST API&lt;/strong&gt;, &lt;strong&gt;reading a PDF&lt;/strong&gt;, posting to Slack, writing a record to an ERP. &lt;strong&gt;Without tools, the agent can reason about anything but change nothing. With tools&lt;/strong&gt;, every decision the agent makes can become a real action in a real system.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Pilot: The LLM
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The pilot is what brings the aircraft to life&lt;/strong&gt;. They sit in the &lt;strong&gt;cockpit&lt;/strong&gt;, read the instruments (altitude, airspeed, weather radar, traffic alerts), and make continuous decisions: when to climb, when to level off, when to adjust course, when to call air traffic control.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The pilot does not invent the controls&lt;/strong&gt;. The throttle was already there. The pilot decides which control to use, when to use it, in what combination, and in what sequence. They are the reasoning layer that turns raw capability into purposeful action.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In an AI agent, the LLM is the pilot&lt;/strong&gt;. It reads the &lt;strong&gt;inputs&lt;/strong&gt; (the user message, the tool results, the conversation history) and decides what to do next. &lt;strong&gt;Which tool to call&lt;/strong&gt;. What parameters to pass. Whether the task is complete or needs another step. The LLM does not execute tools directly; it decides to invoke them, just as a pilot decides to operate a control.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Flight Manual: Agent Skills
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Now imagine a highly experienced pilot in an unfamiliar aircraft type, flying into an airport they have never visited before&lt;/strong&gt;, under regulations they were not trained on. They can fly. They can read instruments. They can operate controls. But they are improvising every decision because they do not have the specific procedures for this situation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The flight manual, the Standard Operating Procedures, is what fills that gap.&lt;/strong&gt; It tells the pilot exactly what checklist to run before takeoff at this airport. What altitude to maintain in this specific airspace. Precisely what to do when this warning light illuminates. How to coordinate with ground control using this airline’s specific protocols.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The manual does not fly the plane&lt;/strong&gt;. It does not operate the controls. What it does is ensure that every decision the pilot makes &lt;strong&gt;is the correct decision for this context&lt;/strong&gt;, not just a reasonable guess based on general experience.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In an AI agent, Agent Skills are the flight manual&lt;/strong&gt;. They encode your organization’s specific rules, workflows, and policies: the leave entitlements, the approval thresholds, the invoice validation steps, the escalation criteria. The LLM still does the reasoning. The tools still take the actions. But now every decision is grounded in your actual procedures, not generic training data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yq8hfujr3gk21qbuyem.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8yq8hfujr3gk21qbuyem.png" alt=" " width="800" height="317"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What Happens When One Is Missing
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Controls but no pilot and no manual (Tools only)&lt;/strong&gt;: the throttle is there but nothing is operating it. The agent has APIs it can call but no reasoning to decide which one, when, or in what order. It cannot complete a task.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pilot but no controls (LLM only)&lt;/strong&gt;: the pilot reads every instrument perfectly and knows exactly what to do but has no way to act. The agent reasons flawlessly but cannot retrieve data, call a system, or change anything in the world. It can only generate text.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Pilot and controls but no manual (LLM + Tools, no Agent Skills):&lt;/strong&gt; the pilot can fly and the controls respond but every decision is improvised from general experience. This is most enterprise agents today. They work, inconsistently. They produce plausible answers that do not match your actual policies. Each run may go differently. Nothing is auditable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;All three together:&lt;/strong&gt; the controls take action, the pilot reasons about what to do, and the manual ensures every decision follows your organization’s exact procedures. Consistent. Auditable. Trustworthy.&lt;/p&gt;

&lt;p&gt;Tools (controls) give the agent reach. The LLM (pilot) gives the agent reasoning. Agent Skills (flight manual) give the agent organizational judgment. You need all three just like a flight needs controls, a pilot, and the procedures to fly it safely.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Same Pattern Across Every Skilled Domain
&lt;/h2&gt;

&lt;p&gt;The analogy holds anywhere expertise is applied through instruments:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Surgeon and scalpel:&lt;/strong&gt; the scalpel can cut anywhere. The surgeon’s training specifies exactly where, how deep, at what angle, and what to do if something unexpected is found. Remove the training and the scalpel is just a sharp object.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chef and kitchen:&lt;/strong&gt; the kitchen has every tool ovens, knives, heat, timers. The recipe encodes the sequence, temperatures, timings, and substitutions that produce a consistent dish. Without it, two chefs produce two different meals from the same ingredients.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Architect and CAD tools:&lt;/strong&gt; the software can draw anything. The architect’s expertise encodes load bearing constraints, building codes, spatial relationships, and material properties that make the drawing a safe, buildable structure.&lt;/p&gt;

&lt;p&gt;In every case the pattern is identical: tools provide capability, expertise provides judgment. The tool without the expertise is hardware. The expertise without the tool cannot act. Together they produce something reliable.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Agent Skills Actually Are
&lt;/h2&gt;

&lt;p&gt;A skill is a directory containing:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SKILL.md (required)&lt;/strong&gt; — instructions in plain Markdown with YAML frontmatter metadata&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;scripts/ (optional)&lt;/strong&gt; — code the agent can run&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;references/ (optional)&lt;/strong&gt; — policy documents, FAQs, reference material&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;assets/ (optional)&lt;/strong&gt; — templates, schemas, examples&lt;/p&gt;
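&lt;p&gt;A minimal SKILL.md might look like the sketch below. The frontmatter carries a name and a description the agent uses to decide when to load the skill; the policy details here are invented purely for illustration:&lt;/p&gt;

```markdown
---
name: leave-policy
description: How to answer employee leave questions using company policy.
---

# Leave Policy Skill

1. Look up the employee's region before quoting any entitlement.
2. Quote entitlements only from the document in references/.
3. Escalate to HR if the request involves unpaid leave over 30 days.
```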

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmdzycv3h2pbil003qlb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffmdzycv3h2pbil003qlb.png" alt=" " width="800" height="238"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5n9dzbrsv8gvrvmfvwi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5n9dzbrsv8gvrvmfvwi.png" alt=" " width="800" height="256"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c6dbdrpx7t131myd6il.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2c6dbdrpx7t131myd6il.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  All Three Working Together
&lt;/h2&gt;

&lt;p&gt;In a well-designed enterprise agent, MCP and Agent Skills each play their role, and neither replaces the other:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Tool:&lt;/strong&gt; fetches the vendor contract document from SharePoint&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Skill:&lt;/strong&gt; applies liability cap rules, escalation logic, and policy references&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LLM:&lt;/strong&gt; reads both, produces the grounded compliance response&lt;/p&gt;

&lt;p&gt;Remove any one of the three and the agent breaks. The tool without the skill fetches the document but does not know what to look for. The skill without the tool knows the rules but cannot access the data. The LLM without either produces a plausible guess.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Need to connect to something? Use MCP. Need to teach the agent how to approach something? Use a Skill. Need a policy-grounded answer? You need both.&lt;/em&gt;&lt;/p&gt;
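&lt;p&gt;One way to see how the pieces combine: the skill text travels in the prompt, while the tool travels as a callable. A minimal sketch follows; the file layout and function names are assumptions for illustration, not any specific framework’s API:&lt;/p&gt;

```python
from pathlib import Path

def load_skill(skill_dir):
    """Read SKILL.md so its instructions can ride along in the prompt."""
    return Path(skill_dir, "SKILL.md").read_text()

def fetch_contract(vendor):
    """Stand-in for an MCP tool that would fetch a document from SharePoint."""
    return f"Contract text for {vendor}"

def build_prompt(skill_text, user_message):
    # The skill becomes system-level instructions; the LLM decides when to
    # call the tool while following those instructions.
    return "SYSTEM:\n" + skill_text + "\n\nUSER:\n" + user_message
```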

&lt;h2&gt;
  
  
  Scaling Agent Skills Across the Enterprise
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;One skill in one department is a proof of concept&lt;/strong&gt;. The real value is a &lt;strong&gt;skills library&lt;/strong&gt;: a &lt;strong&gt;version-controlled repository&lt;/strong&gt; of organizational expertise that any agent can draw from, on any compatible platform, across every department.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw6ociqjvzwzk02p3km8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftw6ociqjvzwzk02p3km8.png" alt=" " width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Any agent on any platform, granted read access to this repository, can load skills from it&lt;/strong&gt;. When a policy changes, one &lt;strong&gt;SKILL.md&lt;/strong&gt; update propagates to every agent on every platform simultaneously.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;&lt;strong&gt;One skills library. Every department&lt;/strong&gt;. Every platform. Updated in one place. Governed like code. This is the enterprise value proposition of Agent Skills.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6diheob0m79s88lyxz0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6diheob0m79s88lyxz0n.png" alt=" " width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Building an Agent with Tools and Agent Skills
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Folder Structure&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5px1x5v7sm5lkcgkch2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb5px1x5v7sm5lkcgkch2.png" alt=" " width="617" height="739"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Sample Tool
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Incident report status tool - implements incident-report skill.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.tools&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;tool&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pydantic&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Field&lt;/span&gt;


&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;IncidentInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;BaseModel&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Field&lt;/span&gt;&lt;span class="p"&gt;(...,&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident ID (e.g., INC-2025-001)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;min_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INC-2025-001&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API latency spike in us-east-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resolved&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;payment-gateway&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-16 14:30 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-16 15:45 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;SRE On-Call&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INC-2025-002&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Database replica lag exceeding threshold&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Investigating&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;analytics-db&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-17 09:00 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Platform Team&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;INC-2025-003&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CDN cache miss rate elevated&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;P2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Monitoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cdn-edge&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2025-03-17 11:20 UTC&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Infrastructure&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;},&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;dict&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lookup incident - used by tool and scripts.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;upper&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident ID is required. Use format INC-2025-001.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; not found. Known demo incidents: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;, &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;())&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;_DEMO_INCIDENTS&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A - still active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;


&lt;span class="nd"&gt;@tool&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;incident_report_status&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Check incident report status. Use when user asks about outage status, P1/P2 incidents, or specific incident ID (e.g., INC-2025-001).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;validated&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;IncidentInput&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;lookup_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;validated&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="n"&gt;resolved&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Resolved: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;resolved_at&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;N/A - still active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Status: Active&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="nf"&gt;return &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Incident &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;incident_id&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Severity: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;severity&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Status: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Affected: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;affected_service&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | Owner: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;owner&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
            &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Started: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;started_at&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; | &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;resolved&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: Invalid incident ID format. Use INC-YYYY-NNN. Details: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
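The tool above validates its argument with an `IncidentInput` model that is not shown in this section. As a rough sketch of what such a model might look like, here is a minimal standard-library version; the class name and the INC-YYYY-NNN format come from the tool code, but the implementation details below (a plain class with a regex check rather than, say, a Pydantic model) are an assumption:

```python
import re

# Hypothetical stand-in for the IncidentInput model used by the tool above.
# Assumes incident IDs follow the INC-YYYY-NNN pattern named in the tool's
# error message (e.g., INC-2025-001); raises ValueError otherwise, which the
# tool's try/except turns into a friendly message.
_INCIDENT_ID_RE = re.compile(r"^INC-\d{4}-\d{3}$")


class IncidentInput:
    """Validates an incident ID of the form INC-YYYY-NNN."""

    def __init__(self, incident_id: str):
        incident_id = incident_id.strip()
        if not _INCIDENT_ID_RE.match(incident_id):
            raise ValueError(
                f"Invalid incident ID {incident_id!r}; expected INC-YYYY-NNN"
            )
        self.incident_id = incident_id
```

With this in place, `IncidentInput(incident_id="INC-2025-001")` succeeds while a malformed ID raises, matching the control flow in `incident_report_status`.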

&lt;h2&gt;
  
  
  SKILL.md
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xvrcr22jgcwgaokudpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0xvrcr22jgcwgaokudpb.png" alt=" " width="800" height="486"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Lookup incident by ID. Usage: python lookup.py &amp;lt;incident_id&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;pathlib&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Path&lt;/span&gt;

&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nc"&gt;Path&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;__file__&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="n"&gt;parents&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills.skills._tools.incident_report&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;lookup_incident&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;incident_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sys&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;argv&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;lookup_incident&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;incident_id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;indent&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  agent.py
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;LangGraph ReAct agent with AgentSkills.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langgraph.prebuilt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_react_agent&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills.skills&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_agent_skills&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4o-mini&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Create a ReAct agent with AgentSkills tools.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_agent_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_react_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the agent and return the final response.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}]})&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;hasattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run_agent_interactive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the agent with visible execution: streams tool calls and results.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.messages&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolMessage&lt;/span&gt;

    &lt;span class="n"&gt;final_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_message&lt;/span&gt;&lt;span class="p"&gt;}]},&lt;/span&gt;
        &lt;span class="n"&gt;stream_mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;updates&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;node_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;node_output&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="n"&gt;messages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;node_output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[])&lt;/span&gt;
            &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;AIMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;getattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tool_calls&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tool_calls&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                            &lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                            &lt;span class="n"&gt;args&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tc&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;args&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{})&lt;/span&gt;
                            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  🔧 AgentSkills: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;(&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  💭 Agent: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                        &lt;span class="n"&gt;final_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;
                &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="nf"&gt;isinstance&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ToolMessage&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                    &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;)[:&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;msg&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                        &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  ✓ Result: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;final_content&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  main.py
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Run the AgentSkills LangGraph agent.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dotenv&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;load_dotenv&lt;/span&gt;

&lt;span class="nf"&gt;load_dotenv&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_agent_skills&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;run_agent_interactive&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;agentskills.skills.registry&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_skill_catalog&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;_print_skills&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Show available Agent Skills (agentskills.io format).&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="n"&gt;catalog&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_skill_catalog&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  Agent Skills (agentskills.io):&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;catalog&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;desc&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    • &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;name&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;70&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;desc&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;70&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;tools&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_agent_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;  Agent Skills:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;t&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tools&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;    • &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;t&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;


&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Creating AgentSkills agent...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;agent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_agent&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;  AgentSkills – enterprise skills for the agent&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="nf"&gt;_print_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Commands: &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = list AgentSkills | &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; = stop&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;user_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;input&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="nf"&gt;except &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;EOFError&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;KeyboardInterrupt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;quit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;exit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Goodbye!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;skills&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="nf"&gt;_print_skills&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
            &lt;span class="k"&gt;continue&lt;/span&gt;

        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- AgentSkills executing ---&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_agent_interactive&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_input&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;--- Agent ---&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;


&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;__name__&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;__main__&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;main&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Run the agent with sample query or prompt
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;(AgentSkills) sreenir@Seenivasas-MacBook-Pro AgentSkills % uv run python main.py&lt;br&gt;
Creating AgentSkills agent...&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;==================================================&lt;/p&gt;

&lt;h1&gt;
  
  
    AgentSkills – enterprise skills for the agent
&lt;/h1&gt;

&lt;p&gt;Agent Skills (agentskills.io):&lt;br&gt;
    • contract-review: Check vendor contract review status in legal or procurement. Use when ...&lt;br&gt;
    • incident-report: Check incident and outage status for P1/P2 incidents. Use when the use...&lt;br&gt;
    • jira-ticket: Fetch Jira ticket details including status, assignee, and priority. Us...&lt;/p&gt;

&lt;p&gt;Commands: 'skills' = list AgentSkills | 'quit'/'exit' = stop&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; outage status&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- AgentSkills executing ---&lt;/strong&gt;&lt;br&gt;
  🔧 AgentSkills: incident_report_status({'incident_id': 'INC-2025-001'})&lt;br&gt;
  ✓ Result: Incident INC-2025-001: API latency spike in us-east-1&lt;br&gt;
Severity: P1 | Status: Resolved&lt;br&gt;
Affected: payment-gateway | Owner: SRE On-Call&lt;br&gt;
Started: 2025-03-16 14:30 UTC | Resolved: 2025-03-16 15:45 UTC&lt;br&gt;
  💭 Agent: The current outage status is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID&lt;/strong&gt;: INC-2025-001&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: API latency spike in us-east-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity&lt;/strong&gt;: P1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status&lt;/strong&gt;: Resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affected Service&lt;/strong&gt;: Payment Ga...&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;--- Agent ---&lt;/strong&gt;&lt;br&gt;
The current outage status is as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID&lt;/strong&gt;: INC-2025-001&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Description&lt;/strong&gt;: API latency spike in us-east-1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity&lt;/strong&gt;: P1&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status&lt;/strong&gt;: Resolved&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affected Service&lt;/strong&gt;: Payment Gateway&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Owner&lt;/strong&gt;: SRE On-Call&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident Start Time&lt;/strong&gt;: March 16, 2025, 14:30 UTC&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incident Resolved Time&lt;/strong&gt;: March 16, 2025, 15:45 UTC&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you need more information, feel free to ask!&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; RCA Requests&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- AgentSkills executing ---&lt;/strong&gt;&lt;br&gt;
  💭 Agent: It seems you're looking for information related to Root Cause Analysis (RCA) requests. However, I can't assist with creating or managing RCA requests directly. If you have a specific incident or issue...&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- Agent ---&lt;/strong&gt;&lt;br&gt;
It seems you're looking for information related to Root Cause Analysis (RCA) requests. However, I can't assist with creating or managing RCA requests directly. If you have a specific incident or issue in mind, please provide details, and I can help check the status of that incident or provide related information.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; any active outages&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;--- AgentSkills executing ---&lt;/strong&gt;&lt;br&gt;
  🔧 AgentSkills: incident_report_status({'incident_id': 'INC-2025-001'})&lt;br&gt;
  ✓ Result: Incident INC-2025-001: API latency spike in us-east-1&lt;br&gt;
Severity: P1 | Status: Resolved&lt;br&gt;
Affected: payment-gateway | Owner: SRE On-Call&lt;br&gt;
Started: 2025-03-16 14:30 UTC | Resolved: 2025-03-16 15:45 UTC&lt;br&gt;
  🔧 AgentSkills: incident_report_status({'incident_id': 'INC-2025-002'})&lt;br&gt;
  ✓ Result: Incident INC-2025-002: Database replica lag exceeding threshold&lt;br&gt;
Severity: P2 | Status: Investigating&lt;br&gt;
Affected: analytics-db | Owner: Platform Team&lt;br&gt;
Started: 2025-03-17 09:00 UTC | Status: Active&lt;br&gt;
  💭 Agent: Currently, there is one active outage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID:&lt;/strong&gt; INC-2025-002

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Database replica lag exceeding threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; P2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status:&lt;/strong&gt; Investigating&lt;/li&gt;
&lt;li&gt;**Affected ...&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;--- Agent ---&lt;/strong&gt;&lt;br&gt;
Currently, there is one active outage:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Incident ID:&lt;/strong&gt; INC-2025-002

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Description:&lt;/strong&gt; Database replica lag exceeding threshold&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Severity:&lt;/strong&gt; P2&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Status:&lt;/strong&gt; Investigating&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Affected Service:&lt;/strong&gt; analytics-db&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Owner:&lt;/strong&gt; Platform Team&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Started:&lt;/strong&gt; March 17, 2025, 09:00 UTC&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;The previous incident (INC-2025-001) regarding API latency spike has been resolved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;You:&lt;/strong&gt; &lt;/p&gt;

&lt;h2&gt;
  
  
  The Bottom Line
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The core challenge in enterprise AI is not raw capability&lt;/strong&gt;. Modern LLMs are already powerful enough for most business use cases. The real challenge is reliability: &lt;strong&gt;ensuring agents consistently follow enterprise rules, policies, and standards at scale&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is where Agent Skills and MCP complement each other:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;MCP&lt;/strong&gt; provides access to systems: APIs, databases, and external actions (capabilities).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent Skills&lt;/strong&gt; define how those capabilities should be used: &lt;strong&gt;policies&lt;/strong&gt;, &lt;strong&gt;reasoning&lt;/strong&gt; steps, and &lt;strong&gt;constraints&lt;/strong&gt; (procedures).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Used together&lt;/strong&gt;, they close the gap between what an agent can do and what it should do.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A tool without a skill leads to inconsistent&lt;/strong&gt;, &lt;strong&gt;guess-based&lt;/strong&gt; behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A skill without a tool produces correct reasoning but no execution.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;When combined, they create agents that are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Predictable (consistent outputs)&lt;/li&gt;
&lt;li&gt;Precise (aligned with business rules)&lt;/li&gt;
&lt;li&gt;Auditable (traceable decisions and actions)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;These are not optional qualities; they are baseline requirements for production enterprise systems&lt;/strong&gt;.&lt;/p&gt;
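&lt;p&gt;A minimal sketch of this pairing (every name below is hypothetical, not a real MCP or Agent Skills API): the tool carries the capability, the skill carries the policy, and every invocation is checked against that policy before it runs.&lt;/p&gt;

```python
# Hypothetical sketch of how a skill (procedure) gates a tool (capability).
# All names here are illustrative, not a real MCP or Agent Skills API.

def incident_report_status(incident_id):
    """Capability: a stubbed MCP-style tool that fetches incident data."""
    return f"Incident {incident_id}: API latency spike (Resolved)"

# The skill encodes policy: which tools may be used and under what constraints.
SKILL = {
    "name": "incident-report",
    "allowed_tools": {"incident_report_status"},
    "constraints": ["only report P1/P2 incidents", "cite tool output verbatim"],
}

def run_with_skill(skill, tool_name, tools, **kwargs):
    """Invoke a tool only if the skill's policy permits it (auditable)."""
    if tool_name not in skill["allowed_tools"]:
        raise PermissionError(f"{tool_name} is not permitted by {skill['name']}")
    return tools[tool_name](**kwargs)

tools = {"incident_report_status": incident_report_status}
print(run_with_skill(SKILL, "incident_report_status", tools,
                     incident_id="INC-2025-001"))
```

&lt;p&gt;The capability alone would answer any caller; the skill layer is what makes the behavior predictable and auditable.&lt;/p&gt;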

&lt;h2&gt;
  
  
  &lt;strong&gt;In One Line&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;MCP gives agents reach. Agent Skills give them discipline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Write your skills once.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Apply them everywhere.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Govern them like code.
&lt;/h2&gt;

&lt;p&gt;That is enterprise-grade AI.&lt;/p&gt;

&lt;p&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>automation</category>
      <category>llm</category>
    </item>
    <item>
      <title>Understanding the Model Router in Microsoft Foundry</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Fri, 13 Mar 2026 04:44:29 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/understanding-the-model-router-in-microsoft-foundry-3hg</link>
      <guid>https://future.forem.com/sreeni5018/understanding-the-model-router-in-microsoft-foundry-3hg</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;As &lt;strong&gt;generative AI applications move&lt;/strong&gt; from &lt;strong&gt;prototypes&lt;/strong&gt; to &lt;strong&gt;production&lt;/strong&gt; systems, &lt;strong&gt;developers increasingly&lt;/strong&gt; face a new &lt;strong&gt;architectural&lt;/strong&gt; challenge: choosing the &lt;strong&gt;right model for each task&lt;/strong&gt;. &lt;strong&gt;Modern AI platforms now offer dozens or even hundreds of models&lt;/strong&gt; with different &lt;strong&gt;strengths&lt;/strong&gt;: some &lt;strong&gt;optimized&lt;/strong&gt; for &lt;strong&gt;reasoning&lt;/strong&gt;, others for &lt;strong&gt;speed&lt;/strong&gt;, &lt;strong&gt;cost&lt;/strong&gt;, or domain specialization. Selecting the best model dynamically becomes critical for both performance and cost efficiency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Microsoft addresses this challenge through Model Router&lt;/strong&gt;, a capability &lt;strong&gt;within Microsoft Foundry&lt;/strong&gt;, its enterprise platform for building and operating AI applications.&lt;/p&gt;

&lt;p&gt;Before exploring &lt;strong&gt;how Model Router works&lt;/strong&gt;, it is useful to understand the platform it belongs to.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Router: How AI Selects Models the Way We Choose Apartments
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Analogy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Think of Model Router in &lt;strong&gt;Microsoft Foundry like an apartment finder&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;searching&lt;/strong&gt; for an &lt;strong&gt;apartment&lt;/strong&gt;, you usually consider:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Budget&lt;/li&gt;
&lt;li&gt;Distance to work&lt;/li&gt;
&lt;li&gt;Amenities (gym, parking, pool)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;You don’t manually evaluate every apartment&lt;/strong&gt;. &lt;strong&gt;The platform analyzes your preferences and recommends the best match&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Router works the same way for AI models.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When an application sends a prompt, the router evaluates factors such as cost, latency, and model capabilities, and then selects the most suitable model automatically.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcyzk3ueodlr1lgeis5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvcyzk3ueodlr1lgeis5w.png" alt=" " width="800" height="539"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Just as an &lt;strong&gt;apartment finder helps you pick the best place to live&lt;/strong&gt;, &lt;strong&gt;Model Router helps your application choose the best model to answer the prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Microsoft Foundry: The AI Application Platform
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Microsoft Foundry is Microsoft’s unified platform&lt;/strong&gt; for &lt;strong&gt;building&lt;/strong&gt;, &lt;strong&gt;deploying&lt;/strong&gt;, and operating AI applications and intelligent &lt;strong&gt;agents on Azure&lt;/strong&gt;. It provides a centralized environment where developers can discover models, build AI powered applications, integrate enterprise data, and deploy systems with built in &lt;strong&gt;governance&lt;/strong&gt; and &lt;strong&gt;observability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The platform brings &lt;strong&gt;together several core capabilities&lt;/strong&gt; required for modern AI systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Catalog&lt;/strong&gt; for discovering and deploying foundation models&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent development tools&lt;/strong&gt; for building AI copilots and multi-step agent workflows&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Enterprise AI services&lt;/strong&gt; such as language, vision, speech, and document intelligence&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Evaluation and monitoring&lt;/strong&gt; for measuring AI quality and reliability&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Security and governance&lt;/strong&gt; through Azure’s RBAC, networking, and policy controls&lt;/p&gt;

&lt;p&gt;In practice, Microsoft Foundry acts as the development and operational layer for enterprise AI applications, enabling teams to build systems that integrate models, tools, and data while maintaining enterprise grade reliability and security.&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;once multiple models become available within a platform&lt;/strong&gt;, another question arises:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which model should handle each request?&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why This Matters
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Without a router, developers would need to implement custom logic such as:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;simple_prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;use_small_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;coding_task&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;use_reasoning_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="nf"&gt;use_general_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Maintaining such logic quickly becomes complex&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Model Router removes this burden by allowing the &lt;strong&gt;platform to learn the routing strategy automatically&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  This is where Model Router comes in.
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Problem: Model Selection in Multi-Model Systems&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;In most AI applications, &lt;strong&gt;developers initially choose a single model&lt;/strong&gt;, for example a &lt;strong&gt;large reasoning model&lt;/strong&gt; such as a &lt;strong&gt;GPT-4-class&lt;/strong&gt; model. &lt;strong&gt;While this approach works, it often leads to inefficiencies:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple queries do not require a large reasoning model.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;High quality models may introduce unnecessary latency&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Large&lt;/strong&gt; models significantly increase &lt;strong&gt;operational&lt;/strong&gt; &lt;strong&gt;costs&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;As organizations adopt &lt;strong&gt;multi-model architectures&lt;/strong&gt;, &lt;strong&gt;manually&lt;/strong&gt; &lt;strong&gt;choosing&lt;/strong&gt; the correct model becomes increasingly complex.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Developers would need to implement logic such as:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route simple queries to small models&lt;/li&gt;
&lt;li&gt;Route complex reasoning tasks to large models&lt;/li&gt;
&lt;li&gt;Route coding tasks to specialized models&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Maintaining this routing logic manually quickly becomes difficult to scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  Model Router: Intelligent Model Selection
&lt;/h2&gt;

&lt;p&gt;The &lt;strong&gt;Model Router in Microsoft Foundry&lt;/strong&gt; solves this problem by acting as an &lt;strong&gt;intelligent routing layer across multiple models&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead of developers &lt;strong&gt;explicitly selecting a model&lt;/strong&gt;, the &lt;strong&gt;router&lt;/strong&gt; &lt;strong&gt;evaluates&lt;/strong&gt; each &lt;strong&gt;request&lt;/strong&gt; and &lt;strong&gt;automatically&lt;/strong&gt; forwards it to the &lt;strong&gt;most appropriate model in a configured pool&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;From the developer’s perspective, the &lt;strong&gt;application interacts with a single endpoint&lt;/strong&gt;. &lt;strong&gt;Behind&lt;/strong&gt; the scenes, the &lt;strong&gt;router&lt;/strong&gt; performs model &lt;strong&gt;selection&lt;/strong&gt; &lt;strong&gt;dynamically&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The router analyzes characteristics of the incoming prompt, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Prompt complexity&lt;/li&gt;
&lt;li&gt;Reasoning requirements&lt;/li&gt;
&lt;li&gt;Expected response quality&lt;/li&gt;
&lt;li&gt;Latency requirements&lt;/li&gt;
&lt;li&gt;Cost considerations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Based on this evaluation, the router selects the most suitable model for that request.&lt;/p&gt;

&lt;h2&gt;
  
  
  For example:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Simple&lt;/strong&gt; &lt;strong&gt;informational&lt;/strong&gt; queries may be routed to &lt;strong&gt;smaller&lt;/strong&gt;, &lt;strong&gt;faster models&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complex&lt;/strong&gt; &lt;strong&gt;reasoning&lt;/strong&gt; tasks may be routed to &lt;strong&gt;larger&lt;/strong&gt; reasoning models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Coding&lt;/strong&gt; prompts may be routed to &lt;strong&gt;specialized coding models&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This &lt;strong&gt;architecture&lt;/strong&gt; allows organizations to &lt;strong&gt;optimize&lt;/strong&gt; &lt;strong&gt;cost&lt;/strong&gt;, &lt;strong&gt;performance&lt;/strong&gt;, and &lt;strong&gt;response&lt;/strong&gt; quality simultaneously.&lt;/p&gt;
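&lt;p&gt;The routing examples above can be sketched as a toy classifier. The keyword rules and model names below are invented for illustration; the real Model Router uses a trained model, not hand-written rules.&lt;/p&gt;

```python
# Toy illustration of routing by prompt type. Model names are made up;
# the keyword heuristics stand in for the router's learned evaluation.

MODEL_POOL = {
    "simple": "small-fast-model",          # low cost, low latency
    "reasoning": "large-reasoning-model",  # deeper multi-step reasoning
    "coding": "specialized-coding-model",  # code generation and repair
}

def classify_prompt(prompt):
    text = prompt.lower()
    if any(k in text for k in ("code", "function", "python", "bug")):
        return "coding"
    if any(k in text for k in ("explain why", "prove", "step by step")):
        return "reasoning"
    return "simple"

def select_model(prompt):
    """Map a prompt to the most suitable model in the configured pool."""
    return MODEL_POOL[classify_prompt(prompt)]

print(select_model("What time zone is Seattle in?"))          # small-fast-model
print(select_model("Write a Python function to parse logs"))  # specialized-coding-model
```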

&lt;h2&gt;
  
  
  How Model Router Works
&lt;/h2&gt;

&lt;p&gt;At a high level, Model Router functions as a meta-model: a model trained to evaluate prompts and determine which underlying model should handle them.&lt;/p&gt;

&lt;p&gt;The routing process typically follows these steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Client Request&lt;/strong&gt;&lt;br&gt;
The application sends a prompt to the Model Router endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Prompt Analysis&lt;/strong&gt;&lt;br&gt;
The router evaluates the prompt’s complexity and characteristics.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Model Selection&lt;/strong&gt;&lt;br&gt;
Based on the evaluation, the router selects the most appropriate model from the configured model pool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Request Forwarding&lt;/strong&gt;&lt;br&gt;
The router forwards the prompt to the selected model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Response Return&lt;/strong&gt;&lt;br&gt;
The response from the selected model is returned to the client through the same endpoint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From the application’s perspective&lt;/strong&gt;, the entire interaction appears as a &lt;strong&gt;single model invocation&lt;/strong&gt;, &lt;strong&gt;even though different models may handle different requests&lt;/strong&gt;.&lt;/p&gt;
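&lt;p&gt;The five steps can be simulated end to end with stubbed models. Everything below (the backend names, the word-count heuristic) is hypothetical; it mirrors the shape of the flow, not the Foundry API.&lt;/p&gt;

```python
# Minimal simulation of the five routing steps, with stubbed backend models.

BACKENDS = {
    "small-fast-model": lambda p: f"[small] {p}",
    "large-reasoning-model": lambda p: f"[large] {p}",
}

def analyze(prompt):
    # Step 2: crude complexity estimate; prompts under 20 words count as
    # simple. Word count stands in for the router's real prompt analysis.
    if len(prompt.split()) in range(20):
        return "simple"
    return "complex"

def router_endpoint(prompt):
    # Step 1: the client request arrives at the single router endpoint.
    label = analyze(prompt)                       # Step 2: prompt analysis
    model = ("small-fast-model" if label == "simple"
             else "large-reasoning-model")        # Step 3: model selection
    response = BACKENDS[model](prompt)            # Step 4: request forwarding
    return {"model": model, "content": response}  # Step 5: response return

result = router_endpoint("What is RSS?")
print(result["model"], "answered:", result["content"])
```

&lt;p&gt;Note that the caller only ever invokes the router function; which backend actually answered is an internal detail surfaced in the response metadata.&lt;/p&gt;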

&lt;h2&gt;
  
  
  Deploying Model Router in Microsoft Foundry
&lt;/h2&gt;

&lt;p&gt;Deploying Model Router in Microsoft Foundry is designed to be straightforward.&lt;/p&gt;

&lt;p&gt;Developers create a router deployment that references a set of available models. The router then dynamically selects among those models during inference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Typical deployment steps include:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Create a Foundry project in Azure&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy16wmtme12oywxwzb4si.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy16wmtme12oywxwzb4si.png" alt=" " width="790" height="482"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Select models from the Foundry model catalog&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7pi1vhw4qsbr9k1pzqw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq7pi1vhw4qsbr9k1pzqw.png" alt=" " width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Create a Model Router deployment&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp00gwdv9q4ip3nrwm6wy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp00gwdv9q4ip3nrwm6wy.png" alt=" " width="800" height="370"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Configure the routing model set&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiar45mx3vkp3x0xx15vg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiar45mx3vkp3x0xx15vg.png" alt=" " width="800" height="735"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt4uc948ahsd1febum73.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqt4uc948ahsd1febum73.png" alt=" " width="800" height="1150"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Test the Model Router with different prompts&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7mdmrvjw379hnj2658r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fi7mdmrvjw379hnj2658r.png" alt=" " width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25rrd6621hbrgsubtnu0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F25rrd6621hbrgsubtnu0.png" alt=" " width="800" height="277"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Expose the router as a single API endpoint&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcasi7oukys1bos312myr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcasi7oukys1bos312myr.png" alt=" " width="800" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Applications&lt;/strong&gt; then &lt;strong&gt;send prompts&lt;/strong&gt; to the &lt;strong&gt;router&lt;/strong&gt; endpoint instead of directly calling individual models.&lt;/p&gt;

&lt;p&gt;This architecture simplifies multi model systems while allowing the platform to optimize routing decisions automatically.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why Model Routers Matter&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;As AI platforms continue to expand their model catalogs, multi-model architectures will become the norm. Model routers represent an important architectural shift:&lt;/p&gt;

&lt;p&gt;Instead of building applications around a single model, systems will be designed around dynamic model orchestration.&lt;/p&gt;
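&lt;p&gt;That orchestration idea can be sketched as a toy routing function. The model names, costs, and the keyword heuristic below are invented for illustration; a real platform router uses trained classifiers rather than keyword rules.&lt;/p&gt;

```python
# Illustrative only: a toy router that picks a model per prompt.
# Model names and the complexity heuristic are made up for this sketch.

MODELS = {
    "small-fast": {"cost_per_1k_tokens": 0.1},       # cheap, low latency
    "large-reasoning": {"cost_per_1k_tokens": 2.0},  # expensive, more capable
}

def route(prompt: str) -> str:
    """Pick the cheapest model expected to handle the prompt well."""
    # Crude heuristic: long or multi-step prompts go to the larger
    # model; everything else goes to the fast, cheap one.
    complex_markers = ("step by step", "analyze", "prove", "compare")
    is_complex = len(prompt) > 400 or any(
        marker in prompt.lower() for marker in complex_markers
    )
    return "large-reasoning" if is_complex else "small-fast"

print(route("What is the capital of France?"))      # small-fast
print(route("Analyze these results step by step"))  # large-reasoning
```

&lt;p&gt;Applications call one routing function (or, on a real platform, one endpoint) and never hard-code a model choice.&lt;/p&gt;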

&lt;h2&gt;
  
  
  The benefits include:
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Cost optimization&lt;/strong&gt; by avoiding unnecessary use of large models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Performance improvements&lt;/strong&gt; through faster models for simpler tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher quality responses&lt;/strong&gt; through specialized model selection&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simpler application architecture&lt;/strong&gt; through a single API interface&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this sense, the Model Router acts as a &lt;strong&gt;control layer for multi-model&lt;/strong&gt; AI systems, enabling developers to focus on application logic while the platform handles model selection.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As &lt;strong&gt;AI systems evolve&lt;/strong&gt;, applications are &lt;strong&gt;no longer built around a single model&lt;/strong&gt;. Modern platforms like &lt;strong&gt;Microsoft Foundry&lt;/strong&gt; make it possible to work &lt;strong&gt;with multiple LLMs&lt;/strong&gt;, each optimized for different capabilities such as &lt;strong&gt;reasoning&lt;/strong&gt;, &lt;strong&gt;speed&lt;/strong&gt;, cost &lt;strong&gt;efficiency&lt;/strong&gt;, or &lt;strong&gt;specialized&lt;/strong&gt; tasks.&lt;/p&gt;

&lt;p&gt;This is where the &lt;strong&gt;Model Router&lt;/strong&gt; becomes an important architectural component. Instead of developers manually deciding which model should handle each request, the router evaluates the prompt and dynamically selects the most appropriate model based on factors like &lt;strong&gt;cost&lt;/strong&gt;, &lt;strong&gt;latency&lt;/strong&gt;, and model &lt;strong&gt;capabilities&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Just as an apartment search platform helps you find the best place to live by balancing &lt;strong&gt;budget&lt;/strong&gt;, &lt;strong&gt;distance&lt;/strong&gt;, and &lt;strong&gt;amenities&lt;/strong&gt;, the Model Router helps &lt;strong&gt;AI applications find the best model for every prompt&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The result is a simpler architecture, better performance, and optimized cost, allowing developers to focus on building intelligent applications while the platform handles &lt;strong&gt;model selection behind the scenes&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;In many ways, &lt;strong&gt;the Model Router represents the future of multi-model AI systems&lt;/strong&gt;, where intelligent routing becomes just as important as the models themselves.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>llm</category>
      <category>microsoft</category>
    </item>
    <item>
      <title>Why Production AI Agents Are Hard &amp; How Amazon Bedrock AgentCore Makes Them Production Ready</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Thu, 12 Mar 2026 06:43:00 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/why-production-ai-agents-are-hard-how-amazon-bedrock-agentcore-makes-them-production-ready-1fpn</link>
      <guid>https://future.forem.com/sreeni5018/why-production-ai-agents-are-hard-how-amazon-bedrock-agentcore-makes-them-production-ready-1fpn</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Over the &lt;strong&gt;past couple of years&lt;/strong&gt;, I have &lt;strong&gt;architected&lt;/strong&gt; and &lt;strong&gt;delivered&lt;/strong&gt; a significant number of &lt;strong&gt;agentic AI applications across enterprise environments&lt;/strong&gt;. Many of these deployments ran on &lt;strong&gt;Azure infrastructure&lt;/strong&gt; using &lt;strong&gt;Azure Web Apps&lt;/strong&gt; for lightweight agent endpoints and &lt;strong&gt;Azure Container Apps&lt;/strong&gt; for more sophisticated &lt;strong&gt;multi agent&lt;/strong&gt; systems that required &lt;strong&gt;orchestration&lt;/strong&gt;, &lt;strong&gt;scaling&lt;/strong&gt;, and reliable session routing.&lt;/p&gt;

&lt;p&gt;In building these systems, I have repeatedly implemented the underlying foundations myself: &lt;strong&gt;credential vaults&lt;/strong&gt;, &lt;strong&gt;memory pipelines&lt;/strong&gt;, &lt;strong&gt;observability layers&lt;/strong&gt;, and isolation mechanisms. After doing this enough times, you develop a clear understanding of both how long these pieces take to build and where the real production challenges tend to surface.&lt;/p&gt;

&lt;p&gt;When I first evaluated &lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt;, it was the first platform I encountered that appeared to &lt;strong&gt;address many of these challenges holistically&lt;/strong&gt;. Not just through surface level &lt;strong&gt;abstractions&lt;/strong&gt;, but with &lt;strong&gt;production&lt;/strong&gt; &lt;strong&gt;grade&lt;/strong&gt; &lt;strong&gt;depth&lt;/strong&gt; designed for real world &lt;strong&gt;deployments&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That practical experience is the perspective I bring to this blog.&lt;/p&gt;

&lt;p&gt;Before we talk about &lt;strong&gt;AWS Bedrock AgentCore&lt;/strong&gt;, we need to answer a more fundamental question: &lt;strong&gt;what exactly is an AI agent&lt;/strong&gt;, and why is it so different from a regular &lt;strong&gt;chatbot&lt;/strong&gt; or &lt;strong&gt;API&lt;/strong&gt; call?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is an AI Agent?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;“An AI agent is a software system that uses a large language model not just to generate text, but to reason, plan, take actions, and work toward a goal often across multiple steps, over time, with minimal human involvement.”&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ppy5zqjbkilf9ac8gzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4ppy5zqjbkilf9ac8gzs.png" alt=" " width="800" height="532"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Most people encounter AI through a prompt response loop&lt;/strong&gt;: type something in, get something back. That model is useful, but it is fundamentally passive. The language model sits in a box, waits to be asked, generates text, and stops.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;An AI agent is something entirely different&lt;/strong&gt;. Think of a brilliant expert &lt;strong&gt;locked in a room with no tools&lt;/strong&gt;. They can give extraordinary advice but they cannot act on it. Give that same expert a &lt;strong&gt;phone&lt;/strong&gt;, a &lt;strong&gt;laptop&lt;/strong&gt;, &lt;strong&gt;access to databases&lt;/strong&gt;, the &lt;strong&gt;ability&lt;/strong&gt; to &lt;strong&gt;send emails&lt;/strong&gt;, &lt;strong&gt;run code&lt;/strong&gt;, and call APIs. They no longer just advise. They &lt;strong&gt;act&lt;/strong&gt;, &lt;strong&gt;verify&lt;/strong&gt;, &lt;strong&gt;execute&lt;/strong&gt;, and &lt;strong&gt;report&lt;/strong&gt; &lt;strong&gt;back&lt;/strong&gt;. That is the &lt;strong&gt;agentic paradigm&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Formally, an AI agent is a software system that uses a large language model not just to generate text, but to reason, plan, take actions, and work toward a goal — across multiple steps, over time, with minimal human direction.&lt;/p&gt;

&lt;p&gt;“An AI agent doesn’t just answer your question. It takes on your objective, &lt;strong&gt;plans a path to achieve&lt;/strong&gt; it, &lt;strong&gt;executes that plan&lt;/strong&gt;, monitors its own progress, and self corrects when things go wrong &lt;strong&gt;without you directing each step&lt;/strong&gt;.”&lt;/p&gt;

&lt;h2&gt;
  
  
  A Concrete Example
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Ask an agent: “Find our top three open support tickets today&lt;/strong&gt;, check each against the known issues database, draft replies, and email them to the support team.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A plain language model cannot do this: it has no access to your ticketing system, knowledge base, or email infrastructure. An AI agent handles the entire workflow end to end.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1:&lt;/strong&gt; Query the ticketing &lt;strong&gt;tool&lt;/strong&gt; for today’s open critical tickets&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2:&lt;/strong&gt; Search the &lt;strong&gt;knowledge&lt;/strong&gt; &lt;strong&gt;base&lt;/strong&gt; for related known issues&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3:&lt;/strong&gt; &lt;strong&gt;Reason&lt;/strong&gt; about which tickets &lt;strong&gt;match which issues&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4:&lt;/strong&gt; Draft personalized reply emails &lt;strong&gt;using the LLM&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5:&lt;/strong&gt; &lt;strong&gt;Send those emails via the email API&lt;/strong&gt; (this may be a tool or an MCP server)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The LLM is the reasoning engine&lt;/strong&gt;. The &lt;strong&gt;tools are how the agent reaches into real systems&lt;/strong&gt;. And it does not stop after one response; it pursues the objective through every &lt;strong&gt;step until the goal is met.&lt;/strong&gt;&lt;/p&gt;
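&lt;p&gt;The five steps above can be sketched as a plain Python workflow. Every tool function here is a hypothetical stand-in; a real agent would call a ticketing API, a knowledge base, an LLM, and an email service (or MCP servers) instead.&lt;/p&gt;

```python
# Hypothetical stand-ins for the agent's tools; none of these names
# come from a real API.

def get_open_critical_tickets():                 # Step 1: ticketing tool
    return [{"id": 101, "text": "login fails"},
            {"id": 102, "text": "search is slow"}]

def search_known_issues(ticket):                 # Step 2: knowledge base
    return {"issue": f"known issue matching: {ticket['text']}"}

def draft_reply(ticket, issue):                  # Step 4: stands in for an LLM call
    return f"Re ticket {ticket['id']}: {issue['issue']}"

def send_email(to, body):                        # Step 5: email tool / MCP server
    return {"to": to, "body": body, "sent": True}

def run_workflow():
    results = []
    for ticket in get_open_critical_tickets():
        issue = search_known_issues(ticket)      # Step 3 (matching) folded in
        reply = draft_reply(ticket, issue)
        results.append(send_email("support-team@example.com", reply))
    return results

print(all(r["sent"] for r in run_workflow()))    # True
```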

&lt;h2&gt;
  
  
  Agents Are Goal Driven
&lt;/h2&gt;

&lt;p&gt;The most &lt;strong&gt;critical characteristic of an AI agent&lt;/strong&gt;, and the one most often glossed over, is that it is &lt;strong&gt;goal driven&lt;/strong&gt;, not &lt;strong&gt;prompt driven&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt driven systems (plain LLMs)&lt;/strong&gt; &lt;strong&gt;receive&lt;/strong&gt; an &lt;strong&gt;input&lt;/strong&gt; and produce an output. The interaction is complete. No awareness of a broader objective, &lt;strong&gt;no adaptation if the first attempt fails.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal driven systems (agents)&lt;/strong&gt; receive an objective and autonomously determine the steps, &lt;strong&gt;tool calls&lt;/strong&gt;, and decisions required to achieve it. &lt;strong&gt;They persist, adapt, retry, and self-correct until the goal is met&lt;/strong&gt;, or explicitly report that it cannot be.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Agentic Loop:
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Observe, Think, Act, Repeat&lt;/strong&gt;&lt;br&gt;
The mechanics of goal driven behaviour are captured in &lt;strong&gt;the agentic loop: the cognitive cycle every agent&lt;/strong&gt; runs until its objective is achieved. &lt;strong&gt;Strands Agents, AWS’s own open source framework, describes this as its core architecture.&lt;/strong&gt; In each loop iteration the model is invoked with the prompt, &lt;strong&gt;agent context&lt;/strong&gt;, and available &lt;strong&gt;tools&lt;/strong&gt;, and it decides whether to respond in natural language, plan next steps, reflect on prior results, or select one or more tools to use. &lt;strong&gt;This loop continues until the task is complete&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  1. Observe
&lt;/h2&gt;

&lt;p&gt;The agent reads &lt;strong&gt;its current goal and decomposed sub goals&lt;/strong&gt;. It reviews all results from prior steps. It retrieves relevant &lt;strong&gt;short term memory&lt;/strong&gt;. It incorporates new information from the environment since the last cycle.&lt;/p&gt;
&lt;h2&gt;
  
  
  2. Think
&lt;/h2&gt;

&lt;p&gt;The LLM reasons over accumulated context and available tools to determine the single best next action. &lt;strong&gt;It outputs either a tool call with exact parameters&lt;/strong&gt;, or, if the goal is satisfied, a final answer. Planning-capable models may first decompose the goal into an explicit sub task sequence.&lt;/p&gt;
&lt;h2&gt;
  
  
  3. Act
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;If a tool call was selected&lt;/strong&gt;, the &lt;strong&gt;framework&lt;/strong&gt; &lt;strong&gt;executes&lt;/strong&gt; the &lt;strong&gt;real function&lt;/strong&gt;: calling an API, querying a &lt;strong&gt;database&lt;/strong&gt;, running code, navigating a browser, or invoking any registered tool. If a &lt;strong&gt;Human-In-The-Loop (HITL)&lt;/strong&gt; checkpoint is configured, the agent pauses and &lt;strong&gt;waits for approval&lt;/strong&gt;. The result is captured and fed back into context.&lt;/p&gt;
&lt;h2&gt;
  
  
  4. Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The action result becomes new input to the next Observe phase&lt;/strong&gt;. Is the goal achieved? If yes, the agent produces its final answer and terminates. If no, the loop continues, potentially dozens or hundreds of iterations for complex, long running tasks.&lt;/p&gt;
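&lt;p&gt;The four phases can be condensed into a few lines of Python. This is a sketch only: the "think" step here is a hard-coded policy, whereas in a real agent it is an LLM call that returns either a tool invocation or a final answer.&lt;/p&gt;

```python
def agentic_loop(goal, tools, max_iters=10):
    results = []                          # accumulated context across cycles
    for _ in range(max_iters):
        # Observe: prior results are already in `results`.
        # Think: pick the next tool, or finish (stand-in for LLM reasoning).
        done = {name for name, _ in results}
        pending = [name for name in tools if name not in done]
        if not pending:
            return f"Done: {goal} in {len(results)} steps"
        tool = pending[0]
        # Act: execute the tool and feed the result back into context.
        results.append((tool, tools[tool]()))
    return "Stopped: iteration budget exhausted"

tools = {
    "search": lambda: "3 critical tickets found",
    "draft": lambda: "replies drafted",
}
print(agentic_loop("triage tickets", tools))  # Done: triage tickets in 2 steps
```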
&lt;h2&gt;
  
  
  ReAct: The Reasoning Pattern Behind the Loop
&lt;/h2&gt;

&lt;p&gt;ReAct (&lt;strong&gt;Reason&lt;/strong&gt; + &lt;strong&gt;Act&lt;/strong&gt;) is the &lt;strong&gt;foundational&lt;/strong&gt; prompting pattern that implements the agentic loop. The agent produces a structured Thought describing its reasoning, followed by an Action (the chosen tool call), followed by an Observation incorporating the result. This interleaved reasoning action format makes goal pursuit &lt;strong&gt;explicit&lt;/strong&gt;, &lt;strong&gt;traceable&lt;/strong&gt;, and &lt;strong&gt;debuggable&lt;/strong&gt;. It is the pattern that AWS Strands Agents uses internally to power its &lt;strong&gt;model driven approach.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Goal:&lt;/strong&gt; Find top 3 critical GitHub bugs and draft fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; I need to query GitHub for open issues labelled critical.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; github_search(repo='company/api', label='critical', state='open', limit=3)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; Returned issues #16, #5018, #18.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; I have the issues. Now I need each affected code file.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Action:&lt;/strong&gt; github_get_file(repo='company/api', path='src/auth.py')&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observation:&lt;/strong&gt; File retrieved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thought:&lt;/strong&gt; Enough context. Drafting fix for issue #16.&lt;/p&gt;

&lt;p&gt;… [loop continues for all 3 bugs] …&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Answer:&lt;/strong&gt; Three bugs with proposed patches attached.&lt;/p&gt;
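&lt;p&gt;Because each step is emitted in a fixed Thought/Action/Observation format, a trace like the one above is mechanically parseable, which is what makes ReAct traceable and debuggable. A minimal parser (the field names mirror the example; the format itself is a convention, not a fixed standard):&lt;/p&gt;

```python
trace = """Thought: I need to query GitHub for open issues labelled critical.
Action: github_search(repo='company/api', label='critical', state='open', limit=3)
Observation: Returned issues #16, #5018, #18."""

def parse_react(trace: str) -> list[tuple[str, str]]:
    """Split a ReAct trace into (kind, content) steps."""
    steps = []
    for line in trace.splitlines():
        kind, sep, content = line.partition(": ")
        if sep:                      # skip lines that are not labelled steps
            steps.append((kind, content))
    return steps

for kind, content in parse_react(trace):
    print(kind, "->", content[:40])
```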

&lt;p&gt;&lt;strong&gt;Self Correction&lt;/strong&gt; and &lt;strong&gt;Adaptive Planning&lt;/strong&gt;&lt;br&gt;
Goal driven agents are not &lt;strong&gt;fragile pipelines&lt;/strong&gt;. When an action fails (&lt;strong&gt;a tool returns an error&lt;/strong&gt;, a query finds nothing, a generated solution does not compile), the agent observes the failure, reasons about the &lt;strong&gt;cause&lt;/strong&gt;, and &lt;strong&gt;adapts&lt;/strong&gt;. A failed search triggers a reformulated query. An unavailable tool triggers a fallback. A complex goal gets decomposed into smaller, achievable sub goals. When the agent determines a goal truly cannot be achieved, &lt;strong&gt;it surfaces a clear explanation rather than silently returning empty results&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  HITL Human In The Loop
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Goal driven does not mean unsupervised&lt;/strong&gt;. &lt;strong&gt;Production&lt;/strong&gt; agents are designed with explicit &lt;strong&gt;human in the loop&lt;/strong&gt; checkpoints: moments where the agent pauses, presents its proposed action, and waits for approval before taking any irreversible step such as sending emails, deleting records, initiating payments, or deploying code. AgentCore Runtime’s bi-directional WebSocket streaming makes these &lt;strong&gt;pause-and-resume&lt;/strong&gt; flows practical within long running sessions, enabling &lt;strong&gt;real-time human collaboration without terminating and restarting the session.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  The 4 Pillars of Every Production AI Agent
&lt;/h2&gt;
&lt;h2&gt;
  
  
  Pillar 1
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Tools:&lt;/strong&gt; How Agents &lt;strong&gt;Act&lt;/strong&gt; on the Real World&lt;br&gt;
&lt;strong&gt;Without&lt;/strong&gt; tools, a &lt;strong&gt;goal driven agent has nowhere to go&lt;/strong&gt;. Tools allow agents to reach beyond language generation into &lt;strong&gt;real business systems.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Read tools retrieve information&lt;/strong&gt;: database queries, document reads, semantic search against knowledge bases, API calls to Salesforce, GitHub, Jira, Slack, and any other SaaS tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Write tools create or modify data&lt;/strong&gt;: email senders, database writers, file generators, CRM updaters, ticket creators, calendar schedulers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Execution tools run processes&lt;/strong&gt;: code interpreters, browser automation for web based applications that have no API, and shell command runners.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The production challenge:&lt;/strong&gt; A prototype might hard code three tools. An enterprise deployment often &lt;strong&gt;needs fifty tools across ten SaaS platforms&lt;/strong&gt;, each with its &lt;strong&gt;own authentication scheme&lt;/strong&gt;, error patterns, and schema. Tool management becomes a major engineering project on its own.&lt;/p&gt;
&lt;h2&gt;
  
  
  Pillar 2
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Memory: How Agents Remember&lt;/strong&gt;&lt;br&gt;
&lt;strong&gt;Language models (LLMs)&lt;/strong&gt; are &lt;strong&gt;stateless&lt;/strong&gt;. Every API call starts blank. For an agent serving the same user across weeks of ongoing work, statelessness is a fundamental blocker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short term memory covers the active session:&lt;/strong&gt; conversation history, &lt;strong&gt;task state&lt;/strong&gt;, intermediate tool results, and reasoning steps. It requires intelligent &lt;strong&gt;summarization&lt;/strong&gt; to manage the LLM’s context window limits without losing the critical thread.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long term memory persists across sessions.&lt;/strong&gt; &lt;strong&gt;User preferences&lt;/strong&gt;, past project outcomes, &lt;strong&gt;accumulated domain knowledge&lt;/strong&gt;, and learned patterns must survive session end and be retrievable in future sessions. This requires extraction logic, persistent storage, and semantic retrieval.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic memory is the most powerful form:&lt;/strong&gt; storing specific past experiences (what the agent tried, what worked, what failed, what the outcome was) so it can recall and apply successful strategies in future similar situations. This is the mechanism by which agents genuinely improve over time.&lt;/p&gt;
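&lt;p&gt;A toy version of the first two layers, for illustration only: a bounded short-term buffer that "summarizes" its oldest turns (here just replaced by a placeholder, where a real system would call an LLM), plus a long-term key-value store meant to survive session end.&lt;/p&gt;

```python
class AgentMemory:
    def __init__(self, short_term_limit=4):
        self.short_term = []        # active-session turns and tool results
        self.long_term = {}         # persisted across sessions in a real system
        self.limit = short_term_limit

    def add_turn(self, turn):
        self.short_term.append(turn)
        if len(self.short_term) > self.limit:
            # Stand-in for LLM summarization of the two oldest entries.
            summary = f"[summary of {self.short_term[0]!r} and {self.short_term[1]!r}]"
            self.short_term = [summary] + self.short_term[2:]

    def remember(self, key, value):
        self.long_term[key] = value

mem = AgentMemory()
for i in range(6):
    mem.add_turn(f"turn {i}")
mem.remember("preferred_language", "Python")
print(len(mem.short_term))                  # stays within the limit: 4
print(mem.long_term["preferred_language"])  # Python
```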
&lt;h2&gt;
  
  
  Pillar 3
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Observability&lt;/strong&gt;: &lt;strong&gt;How Agents Are Understood and Governed&lt;/strong&gt;&lt;br&gt;
When an AI agent &lt;strong&gt;produces&lt;/strong&gt; a &lt;strong&gt;wrong&lt;/strong&gt; output &lt;strong&gt;after&lt;/strong&gt; twelve reasoning steps and seven &lt;strong&gt;tool&lt;/strong&gt; &lt;strong&gt;calls&lt;/strong&gt;, traditional logs tell you almost nothing useful. You cannot search for ‘sessions where the agent called the wrong tool’ in standard APM tools.&lt;/p&gt;

&lt;p&gt;“You cannot safely govern what you cannot observe. For AI agents in enterprise production, &lt;strong&gt;observability is not optional&lt;/strong&gt;; it is the difference between a system you can audit and a black box waiting to cause a compliance incident.”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agent native observability&lt;/strong&gt; must capture the full reasoning chain in &lt;strong&gt;step by step&lt;/strong&gt; order, every tool invocation with exact inputs and outputs, every LLM prompt and response with token counts, decision points where the agent chose between alternatives, failure attribution pinpointing which specific step caused a wrong downstream output, and token consumption per step for cost control. Without this, AI assisted decisions in regulated environments cannot be explained, investigated, or defended.&lt;/p&gt;
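&lt;p&gt;A minimal sketch of what such agent-native tracing records per step. The schema below is invented for illustration; real stacks emit OpenTelemetry-style spans carrying similar fields.&lt;/p&gt;

```python
import time

class AgentTrace:
    """Records every LLM call and tool call with inputs, outputs, tokens."""

    def __init__(self, session_id):
        self.session_id = session_id
        self.steps = []

    def record(self, step_type, name, inputs, output, tokens=0):
        self.steps.append({
            "step": len(self.steps) + 1,  # step-by-step order for attribution
            "type": step_type,            # "llm" or "tool"
            "name": name,
            "inputs": inputs,             # exact inputs, for failure attribution
            "output": output,
            "tokens": tokens,             # per-step cost control
            "ts": time.time(),
        })

    def total_tokens(self):
        return sum(s["tokens"] for s in self.steps)

trace = AgentTrace("sess-1")
trace.record("llm", "plan", {"prompt": "triage tickets"}, "call search", tokens=120)
trace.record("tool", "search", {"label": "critical"}, "3 tickets", tokens=0)
print(trace.total_tokens(), len(trace.steps))  # 120 2
```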
&lt;h2&gt;
  
  
  Pillar 4
&lt;/h2&gt;

&lt;p&gt;MCP servers bridge the agent and external data sources, and &lt;strong&gt;MCP solves the M×N Integration Problem&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP: The Universal Connectivity Standard (USB-C)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;For years, every team connecting agents to external services built bespoke adapters: custom code per tool, per framework, per model. This created the classic &lt;strong&gt;M×N integration problem&lt;/strong&gt;: if there are &lt;strong&gt;M agent&lt;/strong&gt; frameworks and &lt;strong&gt;N external services&lt;/strong&gt;, teams end up building &lt;strong&gt;M × N separate integrations.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;A LangChain Salesforce connector did not work with a &lt;strong&gt;Strands agent&lt;/strong&gt;. Every framework switch meant rewriting all integrations. As the number of models, frameworks, and enterprise systems grew, the integration burden multiplied.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP the Model Context Protocol&lt;/strong&gt; is the open standard that ended this fragmentation. Published by &lt;strong&gt;Anthropic&lt;/strong&gt; in 2024 and now adopted across the industry by &lt;strong&gt;AWS&lt;/strong&gt;, &lt;strong&gt;Microsoft&lt;/strong&gt;, &lt;strong&gt;Google&lt;/strong&gt;, and others, MCP defines a universal language for &lt;strong&gt;agent-to-tool communication.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of building M × N bespoke connectors, developers can build one MCP server for a data source, and any MCP compatible agent regardless of framework or model can connect to it immediately. In effect, MCP transforms the integration landscape from M × N complexity to reusable connectivity, much &lt;strong&gt;like USB-C&lt;/strong&gt; standardized hardware connectivity across devices.&lt;/p&gt;
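&lt;p&gt;The arithmetic behind that claim is simple enough to state directly: bespoke adapters scale multiplicatively with frameworks and services, while a shared protocol scales additively (one MCP client per framework, one MCP server per service).&lt;/p&gt;

```python
def bespoke_integrations(frameworks: int, services: int) -> int:
    # One custom adapter per (framework, service) pair.
    return frameworks * services

def mcp_integrations(frameworks: int, services: int) -> int:
    # One MCP client per framework plus one MCP server per service.
    return frameworks + services

print(bespoke_integrations(5, 20))  # 100 adapters to build and maintain
print(mcp_integrations(5, 20))      # 25 components total
```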

&lt;p&gt;&lt;strong&gt;The MCP architecture is built around three roles:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Host&lt;/strong&gt; —the agent framework that initiates connections and sends tool requests&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Server&lt;/strong&gt; —the lightweight connector process wrapping an external service&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Resources and Tools&lt;/strong&gt; —the capabilities exposed: actions the agent can invoke, data sources it can read, and prompt templates it can use&lt;/p&gt;

&lt;p&gt;By introducing a standard protocol layer, MCP removes the need to repeatedly rebuild integrations and enables true interoperability across agent frameworks, models, and enterprise systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  The Production Gap: Why Building Enterprise AI Agents Is Mostly an Infrastructure Problem
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Across nearly every enterprise agent project&lt;/strong&gt;, the &lt;strong&gt;same pattern appears&lt;/strong&gt;. Before the agent logic can even be written, &lt;strong&gt;engineering&lt;/strong&gt; teams must build a large amount of &lt;strong&gt;supporting&lt;/strong&gt; &lt;strong&gt;infrastructure&lt;/strong&gt;, including:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session routing&lt;/li&gt;
&lt;li&gt;Credential vaults&lt;/li&gt;
&lt;li&gt;Memory extraction pipelines&lt;/li&gt;
&lt;li&gt;Observability wiring&lt;/li&gt;
&lt;li&gt;Multi tenant context isolation&lt;/li&gt;
&lt;li&gt;Policy enforcement&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, a &lt;strong&gt;substantial portion of early development effort&lt;/strong&gt; goes into these foundations &lt;strong&gt;before the agent’s intelligence is implemented.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Let’s walk through the key engineering challenges that create this gap.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 1: Infrastructure for Long Running Stateful Sessions
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional&lt;/strong&gt; &lt;strong&gt;serverless&lt;/strong&gt; platforms are designed for &lt;strong&gt;short&lt;/strong&gt; &lt;strong&gt;lived&lt;/strong&gt;, &lt;strong&gt;stateless&lt;/strong&gt; workloads.&lt;/p&gt;

&lt;p&gt;Agents behave very differently.&lt;/p&gt;

&lt;p&gt;They often require long running, stateful execution environments that maintain context across many tool calls and reasoning steps.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Supporting this requires infrastructure for:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Session routing&lt;/li&gt;
&lt;li&gt;Per user state management&lt;/li&gt;
&lt;li&gt;Lifecycle management&lt;/li&gt;
&lt;li&gt;Dynamic scaling of execution environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Constructing this infrastructure on top of general purpose compute platforms can become a significant engineering effort before any agent logic is written.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 2: Security Isolation at Scale
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Enterprise agents frequently process sensitive user data.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When &lt;strong&gt;thousands of users&lt;/strong&gt; run &lt;strong&gt;concurrent&lt;/strong&gt; &lt;strong&gt;sessions&lt;/strong&gt;, strong isolation between sessions becomes critical. Without proper &lt;strong&gt;safeguards&lt;/strong&gt;, a defect could potentially expose:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;One user’s data to another user&lt;/li&gt;
&lt;li&gt;Information across tenants&lt;/li&gt;
&lt;li&gt;Privileged credentials or tokens&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Achieving secure isolation at scale requires carefully designed execution environments, container isolation, and strict identity boundaries, rather than relying solely on application level safeguards.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 3: Identity, OAuth, and Credential Management
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Agents rarely operate in isolation.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;They interact with external services on behalf of users, which introduces the need to manage authentication and authorization flows such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;OAuth consent processes&lt;/li&gt;
&lt;li&gt;Secure token storage&lt;/li&gt;
&lt;li&gt;Automatic token refresh&lt;/li&gt;
&lt;li&gt;Fine grained permission enforcement&lt;/li&gt;
&lt;li&gt;Audit trails for every access&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;When agents integrate &lt;strong&gt;with multiple SaaS platforms across thousands of users&lt;/strong&gt;, credential management becomes a full platform capability, not just a small feature.&lt;/p&gt;
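&lt;p&gt;The token-lifecycle part of that capability can be sketched in a few lines. This is illustrative only, not a real OAuth implementation: tokens are keyed by (user, service), checked for expiry on every access, and refreshed through a caller-supplied function.&lt;/p&gt;

```python
import time

class TokenVault:
    def __init__(self, refresh_fn):
        self.refresh_fn = refresh_fn  # stand-in for a real OAuth refresh flow
        self.tokens = {}              # (user, service) -> (token, expires_at)

    def store(self, user, service, token, ttl_seconds):
        self.tokens[(user, service)] = (token, time.time() + ttl_seconds)

    def get(self, user, service):
        token, expires_at = self.tokens[(user, service)]
        if time.time() >= expires_at:           # expired: refresh and re-store
            token, ttl = self.refresh_fn(user, service)
            self.store(user, service, token, ttl)
        return token

vault = TokenVault(refresh_fn=lambda user, service: (f"new-{service}-token", 3600))
vault.store("alice", "crm", "old-token", ttl_seconds=-1)  # already expired
print(vault.get("alice", "crm"))  # new-crm-token
```

&lt;p&gt;A production vault adds encryption at rest, per-scope permissions, and an audit record for every access, which is exactly why this grows into a platform capability.&lt;/p&gt;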
&lt;h2&gt;
  
  
  Problem 4: Memory Infrastructure
&lt;/h2&gt;

&lt;p&gt;Agents depend heavily on memory systems to function effectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maintaining conversation context across long interactions often requires summarization pipelines that compress earlier dialogue while preserving meaning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Persistent knowledge typically involves:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Information extraction pipelines&lt;/li&gt;
&lt;li&gt;Vector storage&lt;/li&gt;
&lt;li&gt;Semantic retrieval&lt;/li&gt;
&lt;li&gt;Mechanisms to reconcile new information with existing knowledge&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these components introduces potential failure modes that can gradually degrade agent behaviour if not carefully managed, particularly in multi-tenant environments.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 5: Observability for Agent Reasoning
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Traditional monitoring tools measure metrics such as:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Latency&lt;/li&gt;
&lt;li&gt;Error rates&lt;/li&gt;
&lt;li&gt;Throughput&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But production AI agents require deeper visibility. Engineers often need to understand:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Which reasoning step produced an incorrect output&lt;/li&gt;
&lt;li&gt;Which tool call returned unexpected data&lt;/li&gt;
&lt;li&gt;Why the agent chose a particular decision path&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Achieving this level of visibility requires trace level instrumentation, structured logs, and AI aware observability dashboards.&lt;/p&gt;
&lt;h2&gt;
  
  
  Problem 6: Policy Enforcement Outside the Agent
&lt;/h2&gt;

&lt;p&gt;Early agent systems often embed governance rules directly inside prompts.&lt;/p&gt;

&lt;p&gt;This approach is fragile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A carefully crafted user input can sometimes&lt;/strong&gt; influence the agent to ignore or reinterpret its own instructions.&lt;/p&gt;

&lt;p&gt;Production systems therefore require external policy enforcement layers that evaluate permissions and constraints independently of the agent’s reasoning process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;This ensures governance cannot be bypassed&lt;/strong&gt;.&lt;/p&gt;
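&lt;p&gt;A sketch of the idea: the policy table and enforcement function live outside the agent and wrap every tool call, so no prompt can talk the model past them. Tool names and rules here are invented for illustration.&lt;/p&gt;

```python
POLICY = {
    "search_tickets": {"allowed": True, "needs_approval": False},
    "send_email":     {"allowed": True, "needs_approval": True},
    "delete_records": {"allowed": False},
}

def enforce(tool_name: str, approved: bool = False) -> str:
    """Evaluated outside the agent's reasoning, before any tool runs."""
    rule = POLICY.get(tool_name, {"allowed": False})  # default deny
    if not rule["allowed"]:
        return "denied"
    if rule.get("needs_approval") and not approved:
        return "pending_approval"  # pause at a human-in-the-loop checkpoint
    return "allowed"

print(enforce("search_tickets"))                 # allowed
print(enforce("send_email"))                     # pending_approval
print(enforce("delete_records", approved=True))  # denied: approval cannot override
```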
&lt;h2&gt;
  
  
  Problem 7: Multi Agent Coordination
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Real enterprise workflows rarely rely on a single agent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Instead, they often involve multiple specialized agents working together. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;research&lt;/strong&gt; agent to gather information&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;writing&lt;/strong&gt; agent to generate responses&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;verification&lt;/strong&gt; agent to validate outputs&lt;/li&gt;
&lt;li&gt;An &lt;strong&gt;approval&lt;/strong&gt; agent to enforce governance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supporting these workflows requires infrastructure for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Inter agent communication&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Shared state management&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Workflow orchestration&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;Failure handling and retries&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This coordination layer introduces yet another architectural component to an already complex system.&lt;/p&gt;
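&lt;p&gt;A minimal shape for that coordination layer, illustrative only: each "agent" is a function over shared state, the orchestrator runs them in sequence, and a failed step is retried before the workflow surfaces the error.&lt;/p&gt;

```python
def research(state):
    state["facts"] = ["fact A", "fact B"]  # gather information
    return state

def write(state):
    state["draft"] = f"Report on {len(state['facts'])} facts"  # generate response
    return state

def verify(state):
    if "draft" not in state:
        raise ValueError("nothing to verify")  # validation step
    state["verified"] = True
    return state

def run_pipeline(agents, state, max_retries=1):
    for agent in agents:                       # shared state flows through
        for attempt in range(max_retries + 1):
            try:
                state = agent(state)
                break
            except Exception:
                if attempt == max_retries:
                    raise                      # surface the failure, don't swallow it
    return state

result = run_pipeline([research, write, verify], {})
print(result["verified"], result["draft"])  # True Report on 2 facts
```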
&lt;h2&gt;
  
  
  Introducing Amazon Bedrock AgentCore
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wrhd6iv1c81yw79us39.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wrhd6iv1c81yw79us39.png" alt=" " width="800" height="481"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt; is an &lt;strong&gt;agentic platform&lt;/strong&gt; from &lt;strong&gt;AWS&lt;/strong&gt; designed to &lt;strong&gt;build&lt;/strong&gt;, &lt;strong&gt;deploy&lt;/strong&gt;, and &lt;strong&gt;operate AI agents securely at scale&lt;/strong&gt;. It &lt;strong&gt;provides&lt;/strong&gt; a set of &lt;strong&gt;modular&lt;/strong&gt;, &lt;strong&gt;enterprise grade services&lt;/strong&gt; that handle the infrastructure required to run &lt;strong&gt;production&lt;/strong&gt; &lt;strong&gt;grade&lt;/strong&gt; AI agents without developers having to manage the underlying systems.&lt;/p&gt;

&lt;p&gt;In real world deployments, building an agent is only a small part of the challenge. &lt;strong&gt;Production systems must manage runtime execution, memory, tool connectivity, identity, security, and observability before agents can reliably interact with enterprise data and services&lt;/strong&gt;. These infrastructure concerns often become the primary barrier to moving from prototype agents to production systems.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Amazon Bedrock AgentCore&lt;/strong&gt; addresses this challenge by providing fully managed services that remove the undifferentiated heavy lifting of building agent infrastructure. Developers can focus on implementing the agent’s reasoning and workflows while AgentCore manages the operational backbone required to run agents reliably in enterprise environments.&lt;/p&gt;

&lt;p&gt;AgentCore services are modular and composable, meaning they can be used together or independently depending on the architecture of the system. The platform is also &lt;strong&gt;framework agnostic&lt;/strong&gt; and &lt;strong&gt;model agnostic&lt;/strong&gt;, supporting &lt;strong&gt;popular open source agent frameworks such as LangGraph, CrewAI, LlamaIndex, and Strands Agents&lt;/strong&gt;, and it can work with foundation models from &lt;strong&gt;Amazon Bedrock&lt;/strong&gt; or &lt;strong&gt;external providers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At a high level, AgentCore provides capabilities such as:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Runtime:&lt;/strong&gt; A secure serverless environment for running agents and tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Memory:&lt;/strong&gt; Managed &lt;strong&gt;short term&lt;/strong&gt; and &lt;strong&gt;long term memory&lt;/strong&gt; for context aware agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Gateway:&lt;/strong&gt; A service that converts APIs and services into MCP-compatible tools for agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Identity:&lt;/strong&gt; Identity and access management designed specifically for AI agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built in tools and observability:&lt;/strong&gt; Including code execution, browser automation, monitoring, and evaluation capabilities&lt;/p&gt;

&lt;p&gt;Together, these services form a production infrastructure layer for agentic systems, allowing teams to deploy AI agents that are secure, scalable, observable, and capable of interacting with real enterprise systems.&lt;/p&gt;
&lt;h2&gt;
  
  
  AgentCore Runtime
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Runtime&lt;/strong&gt; is the &lt;strong&gt;secure&lt;/strong&gt;, &lt;strong&gt;serverless&lt;/strong&gt; execution &lt;strong&gt;environment for AI agents&lt;/strong&gt;. Each user session runs inside a &lt;strong&gt;dedicated&lt;/strong&gt;, hardware isolated &lt;strong&gt;microVM&lt;/strong&gt;, providing strong isolation of CPU, memory, and filesystem resources.&lt;/p&gt;

&lt;p&gt;Isolation is enforced at the virtualization layer, ensuring one user’s agent cannot access another user’s data. &lt;strong&gt;When a session ends, whether through 15 minutes of inactivity, user termination, or the 8 hour maximum session limit, the microVM is destroyed and memory is fully sanitized, preventing cross session data leakage&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Framework Compatibility&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Runtime is framework agnostic and works with common agent frameworks such as:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strands Agents (AWS)&lt;/li&gt;
&lt;li&gt;LangChain / LangGraph&lt;/li&gt;
&lt;li&gt;LlamaIndex&lt;/li&gt;
&lt;li&gt;Microsoft Agent Framework (Autogen + Semantic Kernel)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It can also host any custom agent implementation that runs inside a container.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Minimal Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Existing agents can be deployed with a small wrapper:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;bedrock_agentcore.runtime&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BedrockAgentCoreApp&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BedrockAgentCoreApp&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nd"&gt;@app.entrypoint&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;your_agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Deployment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;agentcore configure&lt;br&gt;
agentcore deploy&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Support&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore is model agnostic and works with major foundation models including&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Amazon Nova&lt;/li&gt;
&lt;li&gt;Anthropic Claude&lt;/li&gt;
&lt;li&gt;OpenAI GPT&lt;/li&gt;
&lt;li&gt;Google Gemini&lt;/li&gt;
&lt;li&gt;Meta Llama&lt;/li&gt;
&lt;li&gt;Mistral&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your agent chooses the model; AgentCore only provides the execution environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Communication&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;AgentCore supports &lt;strong&gt;two interaction modes&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HTTP API:&lt;/strong&gt; standard request/response execution&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bi directional WebSocket streaming:&lt;/strong&gt; real-time conversational and multi-turn agents&lt;/p&gt;

&lt;p&gt;Using a sessionId keeps requests routed to the same microVM session, preserving state.&lt;/p&gt;
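&lt;p&gt;The routing behavior can be illustrated with a small Python sketch. This is a toy model, not the AgentCore API: each sessionId maps to its own isolated session state, standing in for a dedicated microVM.&lt;/p&gt;

```python
# Illustrative sketch of sessionId-based routing (not the AgentCore API):
# each sessionId maps to its own isolated session state, mimicking a
# dedicated microVM per session.

class SessionRouter:
    def __init__(self):
        self._sessions = {}  # sessionId -> per-session state

    def invoke(self, session_id, prompt):
        # Route to the existing session, or create a fresh isolated one.
        state = self._sessions.setdefault(session_id, {"history": []})
        state["history"].append(prompt)
        return {"session": session_id, "turns": len(state["history"])}

    def end_session(self, session_id):
        # When the session ends, its state is destroyed, as a microVM would be.
        self._sessions.pop(session_id, None)

router = SessionRouter()
router.invoke("sess-1", "hello")
result = router.invoke("sess-1", "follow-up")   # same sessionId: state preserved
other = router.invoke("sess-2", "hi")           # different sessionId: fresh state
```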

&lt;p&gt;&lt;strong&gt;Strands Agents&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Strands Agents is &lt;strong&gt;AWS’s open source agent framework&lt;/strong&gt; designed &lt;strong&gt;around a model first approach&lt;/strong&gt;. A Strands agent is defined by three elements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Model&lt;/li&gt;
&lt;li&gt;Tools&lt;/li&gt;
&lt;li&gt;Prompt&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The model drives planning and tool usage. Strands agents deploy to AgentCore Runtime using the same lightweight SDK wrapper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Options
&lt;/h2&gt;

&lt;p&gt;AgentCore supports &lt;strong&gt;two deployment&lt;/strong&gt; paths.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Direct code upload&lt;/strong&gt;&lt;br&gt;
AgentCore automatically builds the container and deploys the agent — no Dockerfile required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Container deployment&lt;/strong&gt;&lt;br&gt;
Provides full control over runtime dependencies and system configuration.&lt;/p&gt;

&lt;p&gt;Both use the same lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;agentcore configure&lt;/li&gt;
&lt;li&gt;agentcore deploy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployments are immutable and versioned, allowing multiple versions and canary testing before traffic promotion.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1li4m7i3wpsi3qda9uvv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1li4m7i3wpsi3qda9uvv.png" alt=" " width="800" height="1166"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Gateway
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Gateway converts existing APIs, AWS Lambda functions, and OpenAPI specifications into agent ready MCP tools automatically without writing custom adapters&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;From API to Agent Tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Point Gateway to a Lambda function or OpenAPI specification and it automatically:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generates the MCP tool schema&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Handles protocol translation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exposes the API as a discoverable agent tool&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What previously required weeks of custom integration can now be done in minutes.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;agentcore gateway create \&lt;br&gt;
  --name "crm-tools" \&lt;br&gt;
  --lambda-arn "arn:aws:lambda:us-east-1:123:function:crm-api" \&lt;br&gt;
  --protocol MCP&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Once registered, any MCP compatible agent can discover and invoke the tool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;MCP Native Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gateway is built around the &lt;strong&gt;Model Context Protocol (MCP)&lt;/strong&gt;. Registered tools become automatically usable by &lt;strong&gt;MCP compatible frameworks&lt;/strong&gt; such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strands&lt;/li&gt;
&lt;li&gt;LangGraph&lt;/li&gt;
&lt;li&gt;CrewAI&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Agents can dynamically discover tools at runtime rather than requiring tools to be hardcoded during initialization.&lt;/p&gt;
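&lt;p&gt;The idea can be sketched with a minimal tool registry in Python. The names here are illustrative, not the Gateway API: tools are registered once, and an agent discovers them by querying the registry at runtime instead of hardcoding them at startup.&lt;/p&gt;

```python
# Illustrative sketch of runtime tool discovery (not the Gateway API):
# registered tools become discoverable by name and schema, rather than
# being wired into the agent during initialization.

class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, schema, fn):
        self._tools[name] = {"schema": schema, "fn": fn}

    def discover(self):
        # An agent lists the available tools at runtime.
        return {name: tool["schema"] for name, tool in self._tools.items()}

    def invoke(self, name, **kwargs):
        return self._tools[name]["fn"](**kwargs)

registry = ToolRegistry()
registry.register(
    "get_customer",                                  # hypothetical tool name
    {"input": {"customer_id": "string"}},
    lambda customer_id: {"id": customer_id, "name": "Acme Corp"},
)

available = registry.discover()                      # agent sees the tool
record = registry.invoke("get_customer", customer_id="c-1")
```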

&lt;p&gt;&lt;strong&gt;SaaS Integration&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Gateway provides built in connectors for common enterprise platforms such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GitHub&lt;/li&gt;
&lt;li&gt;Salesforce&lt;/li&gt;
&lt;li&gt;Slack&lt;/li&gt;
&lt;li&gt;Google Workspace&lt;/li&gt;
&lt;li&gt;Microsoft 365&lt;/li&gt;
&lt;li&gt;Jira / Confluence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These connectors handle authentication, schema generation, and error handling automatically.&lt;/p&gt;

&lt;h2&gt;
  
  
  Agent-to-Agent Communication (A2A)
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Gateway also supports the Agent2Agent (A2A) protocol&lt;/strong&gt;, which standardizes how agents communicate with each other.&lt;/p&gt;

&lt;p&gt;Agents built using different frameworks can delegate tasks across systems while communicating through standardized A2A messages.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Identity
&lt;/h2&gt;

&lt;p&gt;AgentCore Identity manages authentication and credential delegation for AI agents accessing external systems.&lt;/p&gt;

&lt;p&gt;It controls both:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Who can invoke the agent&lt;/li&gt;
&lt;li&gt;How the agent authenticates to external services&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Supported authentication mechanisms include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AWS IAM SigV4 for internal services&lt;/li&gt;
&lt;li&gt;OAuth 2.0 and OpenID Connect for external users and applications&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Compatible identity providers include Amazon Cognito, Okta, Microsoft Entra ID, and Auth0.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Machine-to-Machine Access (2LO)
&lt;/h2&gt;

&lt;p&gt;For system-level tasks, agents authenticate using OAuth Client Credentials without a user involved.&lt;/p&gt;

&lt;p&gt;Common scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Scheduled workflows&lt;/li&gt;
&lt;li&gt;Background analytics&lt;/li&gt;
&lt;li&gt;System integrations&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  User Delegated Access (3LO)
&lt;/h2&gt;

&lt;p&gt;When agents act on behalf of a user, AgentCore manages the full OAuth lifecycle:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;User consent flow&lt;/li&gt;
&lt;li&gt;Encrypted token storage&lt;/li&gt;
&lt;li&gt;Token refresh&lt;/li&gt;
&lt;li&gt;Access auditing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All credentials are stored in an encrypted vault protected by customer managed KMS keys.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Memory
&lt;/h2&gt;

&lt;p&gt;AgentCore Memory provides built in memory management for agents without requiring developers to build custom vector pipelines.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It supports three types of memory:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Short Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Maintains session context&lt;/strong&gt;, including conversation history, tool outputs, and reasoning state.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Long Term Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stores extracted knowledge such as &lt;strong&gt;user preferences&lt;/strong&gt;, decisions, and discovered facts so future sessions begin with relevant context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Episodic Memory&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Stores past experiences: what actions were attempted and which strategies succeeded, &lt;strong&gt;enabling agents to improve behavior over time&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Browser
&lt;/h2&gt;

&lt;p&gt;Some enterprise systems can only be accessed through a web interface.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore Browser provides isolated browser instances&lt;/strong&gt; that agents can use to interact with websites and web applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents can:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Navigate multi step workflows&lt;/li&gt;
&lt;li&gt;Fill forms&lt;/li&gt;
&lt;li&gt;Extract information from dynamic pages&lt;/li&gt;
&lt;li&gt;Interact with internal portals&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each session runs in a sandboxed browser environment, which is destroyed when the session ends.&lt;/p&gt;

&lt;h2&gt;
  
  
  AgentCore Code Interpreter
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;When agents generate code for analysis or computation, that code must execute safely&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;AgentCore Code Interpreter provides an isolated execution sandbox where generated code can run securely.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Agents can use it to:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analyze datasets&lt;/li&gt;
&lt;li&gt;Run calculations&lt;/li&gt;
&lt;li&gt;Generate charts and files&lt;/li&gt;
&lt;li&gt;Validate generated code&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each execution occurs in a separate ephemeral sandbox with no access to other sessions or infrastructure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;The Platform for the Production Agent Era&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Having architected agentic systems across &lt;strong&gt;Azure Web Apps, Azure Container Apps, and custom infrastructure&lt;/strong&gt;, I know how much engineering effort goes into the layers that production agents require.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Session routing, credential management, memory pipelines, observability, governance policies&lt;/strong&gt;, and multi tenant isolation are all necessary pieces of a reliable agent system. None of them is impossible to build, but they consume time that should be spent improving the reasoning, behavior, and usefulness of the agent itself.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;This is the problem Amazon Bedrock AgentCore is designed to solve.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;AgentCore provides &lt;strong&gt;7 purpose built services&lt;/strong&gt; that handle the production infrastructure required for agent systems:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Runtime:&lt;/strong&gt; Secure microVM execution for agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway:&lt;/strong&gt; MCP-native tool integration and API exposure&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Identity:&lt;/strong&gt; OAuth credential lifecycle and delegated access&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Memory:&lt;/strong&gt; Short term and long term persistent memory for agents&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Browser:&lt;/strong&gt; Managed browser automation for web interactions&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Interpreter:&lt;/strong&gt; Isolated sandbox for executing generated code&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Observability:&lt;/strong&gt; CloudWatch native tracing with OpenTelemetry support&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AgentCore is framework agnostic&lt;/strong&gt; and works with common agent frameworks such as Strands, LangChain, LangGraph, LlamaIndex, CrewAI, and AutoGen, as well as custom implementations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It is also model agnostic&lt;/strong&gt;, allowing agents to use foundation models including Amazon Nova, Anthropic Claude, OpenAI GPT models, Google Gemini, Meta Llama, and Mistral, or any model accessible through an API.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The question is no longer whether a production AI agent can be built&lt;/strong&gt;.&lt;br&gt;
With AgentCore, the real question becomes what agent you want to build and &lt;strong&gt;how quickly you can deliver it to the people who need it.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;pip install bedrock-agentcore bedrock-agentcore-starter-toolkit&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>agents</category>
      <category>ai</category>
      <category>architecture</category>
      <category>aws</category>
    </item>
    <item>
      <title>Semantic Caching in RAG Systems &amp; AI Agents</title>
      <dc:creator>Seenivasa Ramadurai</dc:creator>
      <pubDate>Mon, 09 Mar 2026 06:39:33 +0000</pubDate>
      <link>https://future.forem.com/sreeni5018/semantic-caching-in-rag-systems-ai-agents-2gal</link>
      <guid>https://future.forem.com/sreeni5018/semantic-caching-in-rag-systems-ai-agents-2gal</guid>
      <description>&lt;h2&gt;
  
  
  What Is Caching?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Caching&lt;/strong&gt; is the &lt;strong&gt;practice of storing&lt;/strong&gt; the &lt;strong&gt;result&lt;/strong&gt; of an &lt;strong&gt;expensive&lt;/strong&gt; &lt;strong&gt;operation&lt;/strong&gt; so that future requests for the same result &lt;strong&gt;can be served instantly&lt;/strong&gt; without repeating the work.&lt;/p&gt;

&lt;p&gt;The concept is foundational in computing. &lt;strong&gt;A web browser caches images so pages load faster&lt;/strong&gt;. &lt;strong&gt;A database caches query results so it does not re read the disk&lt;/strong&gt;. &lt;strong&gt;A CDN caches static files close to the user&lt;/strong&gt;. In every case the principle is the same: compute once, reuse many times.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87lko9sgpx7k4fnrujzm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F87lko9sgpx7k4fnrujzm.png" alt=" " width="800" height="102"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Traditional caches work on exact matches&lt;/strong&gt;. The key is the exact input string or request. “&lt;em&gt;What is the capital of France?&lt;/em&gt;” and “&lt;strong&gt;What’s the capital of France?&lt;/strong&gt;” are &lt;strong&gt;different keys&lt;/strong&gt; the cache &lt;strong&gt;misses on the second even though the answer is identical&lt;/strong&gt;. This works fine for static web assets, but it falls apart the moment users express the same intent in different words.&lt;/p&gt;
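&lt;p&gt;A minimal sketch makes the limitation concrete: an exact-match cache keyed on the raw query string hits only on byte identical input.&lt;/p&gt;

```python
# A traditional exact-match cache keyed on the raw query string.
cache = {}

def lookup(query):
    return cache.get(query)

cache["What is the capital of France?"] = "Paris"

hit = lookup("What is the capital of France?")    # identical key -> hit
miss = lookup("What's the capital of France?")    # same meaning, different key -> miss
```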

&lt;h2&gt;
  
  
  What Is Semantic Caching?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching replaces the exact string key&lt;/strong&gt; with a &lt;strong&gt;meaning based key&lt;/strong&gt;. Instead of asking “is this input identical to a stored input?”, it asks “is this input similar enough in meaning to a stored input?”&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It does this using embeddings&lt;/strong&gt;. Every query is converted into a dense &lt;strong&gt;numerical vector a point in high dimensional space&lt;/strong&gt; where semantically similar sentences sit close together. The cache stores these vectors alongside their responses. &lt;strong&gt;When a new query arrives, its vector is compared to all stored vectors using cosine similarity&lt;/strong&gt;. If the closest match is above a threshold (e.g. 0.92), the cached response is returned.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5b5sjuiz3nczr2ue87v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw5b5sjuiz3nczr2ue87v.png" alt=" " width="800" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The result:&lt;/strong&gt; a user who asks “&lt;strong&gt;How many sick days do I get&lt;/strong&gt;?” and another who asks “&lt;strong&gt;What is our sick leave allowance?&lt;/strong&gt;” both get the same cached answer because the &lt;strong&gt;questions mean the same thing&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem: Why RAG Pipelines Waste Money at Scale
&lt;/h2&gt;

&lt;p&gt;A standard &lt;strong&gt;RAG pipeline does four things every time&lt;/strong&gt; a user sends a query:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Embed&lt;/strong&gt; the query &lt;strong&gt;convert the text to a vector&lt;/strong&gt; using an embedding model&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector search&lt;/strong&gt; and find the &lt;strong&gt;most relevant chunks&lt;/strong&gt; in your document store&lt;/li&gt;
&lt;li&gt;Assemble &lt;strong&gt;context&lt;/strong&gt; + build a &lt;strong&gt;prompt&lt;/strong&gt; from the retrieved chunks and the user query&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLM invocation&lt;/strong&gt; send the prompt to the model and pay per token&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each step adds latency and cost. An LLM call alone typically adds 1–4 seconds and &lt;strong&gt;costs money on every single request&lt;/strong&gt;. The problem is that in production, over 40% of queries are near paraphrases of questions already answered. &lt;strong&gt;Without caching, the system repeats all four steps for every one of them.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;10,000 queries per day. 40% duplicates = 4,000 unnecessary LLM calls. At $0.002 per call that is $8 wasted daily, $2,920 per year before accounting for latency degradation during peak load.&lt;/em&gt;&lt;/p&gt;
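&lt;p&gt;The arithmetic above can be checked in a few lines:&lt;/p&gt;

```python
# Cost of duplicate queries from the example above.
queries_per_day = 10_000
duplicate_rate = 0.40
cost_per_call = 0.002   # dollars per LLM call

duplicate_calls = int(queries_per_day * duplicate_rate)   # unnecessary LLM calls/day
daily_waste = duplicate_calls * cost_per_call             # dollars wasted per day
yearly_waste = daily_waste * 365                          # dollars wasted per year
```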

&lt;p&gt;The solution is to intercept duplicate queries before they reach step 2, the vector search. A &lt;strong&gt;semantic cache sits&lt;/strong&gt; at the front of the &lt;strong&gt;pipeline&lt;/strong&gt;. If a semantically equivalent query has been answered before, return that answer immediately. &lt;strong&gt;The entire pipeline (retrieval, context assembly, LLM call) is bypassed.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Solution: How Semantic Caching Works
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62jqcgzam5arr2y9t10x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F62jqcgzam5arr2y9t10x.png" alt=" " width="800" height="512"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Every cached entry is a triple: the query embedding vector, the stored response, and a timestamp for expiry management.&lt;/p&gt;

&lt;p&gt;📄 Cache entry structure&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstqctvtqdrje81emyxjz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fstqctvtqdrje81emyxjz.png" alt=" " width="800" height="322"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;At inference time the flow is:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embed the incoming query → vector v_q&lt;/li&gt;
&lt;li&gt;Search the cache: find the stored vector with highest cosine similarity to v_q&lt;/li&gt;
&lt;li&gt;If max similarity ≥ threshold θ → return the cached response y_i immediately&lt;/li&gt;
&lt;li&gt;Otherwise → run the full RAG pipeline, then store the new (v_q, response, timestamp) in the cache&lt;/li&gt;
&lt;/ul&gt;
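&lt;p&gt;The flow above can be sketched end to end in Python. The embedding function here is a toy bag of words vectorizer standing in for a real embedding model, and the threshold value is illustrative:&lt;/p&gt;

```python
import math
import time

def embed(text):
    # Toy bag-of-words embedding; a real system would call an embedding model.
    vec = {}
    for word in text.lower().replace("?", "").split():
        vec[word] = vec.get(word, 0) + 1
    return vec

def cosine(a, b):
    dot = sum(a[w] * b.get(w, 0) for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold=0.8):   # illustrative threshold for toy vectors
        self.threshold = threshold
        self.entries = []                # (vector, response, timestamp) triples

    def get(self, query):
        v_q = embed(query)
        best = max(self.entries, key=lambda e: cosine(v_q, e[0]), default=None)
        if best and cosine(v_q, best[0]) >= self.threshold:
            return best[1]               # cache hit: skip the entire RAG pipeline
        return None                      # miss: run the pipeline, then put()

    def put(self, query, response):
        self.entries.append((embed(query), response, time.time()))

cache = SemanticCache()
cache.put("What is the capital of France?", "Paris")
hit = cache.get("what is the capital of france")   # similar meaning -> hit
miss = cache.get("How do I reset my password?")    # unrelated -> miss
```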

&lt;h2&gt;
  
  
  The Three Cache Layers
&lt;/h2&gt;

&lt;p&gt;Semantic caching can be applied at three points in the pipeline, each with different hit rates and trade offs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo8c4u0kp8jvnpkqh7bh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvo8c4u0kp8jvnpkqh7bh.png" alt=" " width="800" height="621"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Semantic Caching in the AI Agentic World
&lt;/h2&gt;

&lt;p&gt;A standard &lt;strong&gt;RAG pipeline&lt;/strong&gt; answers &lt;strong&gt;one question per request&lt;/strong&gt;. An &lt;strong&gt;AI agent does something more ambitious&lt;/strong&gt;: it &lt;strong&gt;plans&lt;/strong&gt;, decides which &lt;strong&gt;tools&lt;/strong&gt; to &lt;strong&gt;call&lt;/strong&gt;, executes those tools, reasons over the results, and repeats, sometimes across many turns, to complete a task.&lt;/p&gt;

&lt;p&gt;This makes the &lt;strong&gt;cost&lt;/strong&gt; and &lt;strong&gt;latency problem significantly worse&lt;/strong&gt;. Where a RAG system makes one LLM call per query, an agent may make 5–15. Where a RAG system makes one tool call, an agent may make the same tool call repeatedly across different user sessions, fetching the same product info, the same knowledge article, the same company record.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;🤖 In an agentic system, semantic caching is not just about saving one LLM call; it is about short circuiting entire reasoning chains. A cached tool result prevents a retrieval step, which prevents a reasoning step, which may prevent two further tool calls downstream.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Where Caching Fits in an Agent Loop
&lt;/h2&gt;

&lt;p&gt;An agent loop has two natural places to insert a cache:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before the first LLM call&lt;/strong&gt;: If the user’s intent has been handled before in a similar session, return the full cached final answer immediately. Bypasses the entire loop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Before each tool execution:&lt;/strong&gt; Before calling an external tool (&lt;strong&gt;database lookup, API call, knowledge base search&lt;/strong&gt;), check whether the same call or a semantically equivalent one was made recently. Return the cached tool result instead of executing.&lt;/p&gt;
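&lt;p&gt;The tool level cache can be sketched as a decorator. This is an illustrative exact key version with a TTL; a semantic variant would embed the arguments instead of matching them exactly:&lt;/p&gt;

```python
import functools
import time

TOOL_TTL_SECONDS = 300   # keep only recent results; tool data can go stale

_tool_cache = {}         # (tool_name, args) -> (result, stored_at)

def cached_tool(fn):
    # Checks the cache before executing the tool. Exact-key matching here
    # for brevity; a semantic variant would compare embedded arguments.
    @functools.wraps(fn)
    def wrapper(*args):
        key = (fn.__name__, args)
        entry = _tool_cache.get(key)
        if entry and time.time() - entry[1] < TOOL_TTL_SECONDS:
            return entry[0]                  # cached result, no execution
        result = fn(*args)
        _tool_cache[key] = (result, time.time())
        return result
    return wrapper

calls = []

@cached_tool
def lookup_product(product_id):
    # Hypothetical tool; `calls` tracks real executions for the demo.
    calls.append(product_id)
    return {"id": product_id, "name": "Widget"}

first = lookup_product("p-42")    # executes the tool
second = lookup_product("p-42")   # served from cache, no second execution
```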

&lt;h2&gt;
  
  
  Real World Use Cases
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;HR Policy Bot&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Internal HR chatbots are one of the highest value deployments for semantic caching. Employees constantly ask the same questions (sick leave, parental leave, expense claims, performance review timelines), phrased differently by every person.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A 2,000 person company&lt;/strong&gt;. Employees ask &lt;strong&gt;~3 HR questions&lt;/strong&gt; each per year = 6,000 annual queries. Analysis shows 60% are near paraphrases of existing questions. Semantic caching eliminates ~3,600 LLM calls per year. HR policy changes at most quarterly, so a stale cache is rarely a risk.&lt;/p&gt;

&lt;p&gt;Example queries that resolve to the same cached answer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“How many sick days do I get?”&lt;/li&gt;
&lt;li&gt;“What is our sick leave allowance?”&lt;/li&gt;
&lt;li&gt;“Can I take a sick day without a doctor’s note?”&lt;/li&gt;
&lt;li&gt;“What’s the policy on calling in sick?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All four embed to vectors with &lt;strong&gt;cosine similarity &amp;gt; 0.93&lt;/strong&gt; against the same cached entry. One LLM call. Four employees served.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;⚠️ Never cache personal HR queries. “How many sick days do I have left?” is a personal balance query; it must bypass the cache and hit the HRIS system directly. Detect these with entity patterns before the cache lookup.&lt;/em&gt;&lt;/p&gt;
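&lt;p&gt;One way to implement that bypass is a pattern check that runs before any cache lookup. The patterns below are illustrative; a real deployment would tune them to its own domain:&lt;/p&gt;

```python
import re

# Illustrative patterns flagging personal or account-specific queries.
PERSONAL_PATTERNS = [
    re.compile(r"\b(my|i have|do i have)\b", re.IGNORECASE),
    re.compile(r"\border\s*#?\d+\b", re.IGNORECASE),
]

def must_bypass_cache(query):
    # True -> route to the live system (e.g. HRIS), never serve from cache.
    return any(p.search(query) for p in PERSONAL_PATTERNS)

generic = must_bypass_cache("How many sick days do employees get?")   # cacheable
personal = must_bypass_cache("How many sick days do I have left?")    # bypass
```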

&lt;h2&gt;
  
  
  Use Case 2: Customer Support Services
&lt;/h2&gt;

&lt;p&gt;Customer support is the &lt;strong&gt;highest volume use case&lt;/strong&gt;. &lt;strong&gt;Password resets&lt;/strong&gt;, billing queries, &lt;strong&gt;refund policies&lt;/strong&gt;, and order tracking: a small set of issues accounts for the vast majority of ticket volume. The same problem, asked by thousands of different customers, in thousands of different ways.&lt;/p&gt;

&lt;p&gt;An &lt;strong&gt;ecommerce&lt;/strong&gt; platform with 50,000 support queries per day. 12 issue categories account for 73% of volume. Semantic caching at the query level reduces average response time from 4.2s to 0.3s for cache hits, a 93% latency improvement for nearly three quarters of all users.&lt;/p&gt;

&lt;p&gt;Example queries resolving to the same cached response:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;“I can’t log in”&lt;/li&gt;
&lt;li&gt;“How do I reset my password?”&lt;/li&gt;
&lt;li&gt;“I’m locked out of my account”&lt;/li&gt;
&lt;li&gt;“Forgot my password, what do I do?”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Identical reset flow, identical cached answer. Cache hit rate for this category alone: 85%+.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;⚠️ Order specific queries must bypass the cache. Any query containing an order number, transaction ID, or account reference is personal; it gets routed to a live data lookup, never a cached response.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  When NOT to Use Semantic Caching
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;A cache that returns a wrong, stale, or contextually mismatched answer is worse than no cache&lt;/strong&gt;. Applied in the wrong contexts, semantic caching silently degrades quality and, in regulated environments, creates compliance liability.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;🚫 The cache returns a past response to a new user. If that response was wrong, &lt;strong&gt;personalized&lt;/strong&gt;, or &lt;strong&gt;time sensitive&lt;/strong&gt;, the cache amplifies the mistake at scale. One bad cached entry can poison thousands of downstream responses.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh0hu1e3ep1j2tsfz2jn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyh0hu1e3ep1j2tsfz2jn.png" alt=" " width="800" height="392"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Quick Decision Checklist
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Before adding any query or tool call to the cache, run through these six checks:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Is the answer the same regardless of who asks it?&lt;/strong&gt; If no → skip cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Could the answer change within your TTL window?&lt;/strong&gt; If yes → shorten TTL or skip cache.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Does the query contain personal identifiers?&lt;/strong&gt; If yes → bypass cache unconditionally.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the domain regulated (medical, legal, financial)?&lt;/strong&gt; If yes → get explicit policy approval before caching.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Is the expected hit rate above ~15%?&lt;/strong&gt; If no → cache overhead likely exceeds the savings.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Has the response passed output validation?&lt;/strong&gt; If no → never cache unvalidated LLM outputs.&lt;/li&gt;
&lt;/ul&gt;
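&lt;p&gt;The six checks above compose naturally into a single gate function. A minimal sketch, with illustrative field and function names:&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass
class CacheCandidate:
    """Illustrative record of everything the checklist needs to inspect."""
    user_independent: bool      # same answer regardless of who asks
    stable_within_ttl: bool     # answer won't change inside the TTL window
    has_personal_ids: bool      # order numbers, account refs, etc.
    regulated_domain: bool      # medical / legal / financial
    policy_approved: bool       # explicit approval for regulated content
    expected_hit_rate: float    # estimated fraction of repeat queries
    output_validated: bool      # response passed output validation

def should_cache(c: CacheCandidate) -> bool:
    """Apply the six-point decision checklist; any failed check skips the cache."""
    if not c.user_independent:
        return False
    if not c.stable_within_ttl:
        return False            # or: shorten the TTL instead of skipping
    if c.has_personal_ids:
        return False
    if c.regulated_domain and not c.policy_approved:
        return False
    if c.expected_hit_rate < 0.15:
        return False
    return c.output_validated
```

&lt;p&gt;Run the gate once per candidate entry, before the cache insert is ever called.&lt;/p&gt;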

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubocn40ymrrisss21fw2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fubocn40ymrrisss21fw2.png" alt=" " width="800" height="149"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementation: Qdrant Docker + Repository Pattern
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fzqbho5iljk1786398a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4fzqbho5iljk1786398a.png" alt=" " width="800" height="453"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Qdrant is a purpose-built, open-source vector database written in Rust. It is an ideal default backend for a semantic cache: it stores vectors and response payloads together, supports native TTL filtering, and runs in Docker in under 30 seconds.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Repository Pattern&lt;/strong&gt; wraps the backend behind a clean interface. Your RAG pipeline and agent loop only ever talk to that interface; they never import &lt;strong&gt;Qdrant or FAISS directly&lt;/strong&gt;. Swapping &lt;strong&gt;backends is one environment variable&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw98n95hve1g0ejpdoghw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw98n95hve1g0ejpdoghw.png" alt=" " width="800" height="565"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Running the Qdrant Container in Docker Desktop
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qf7s3xwfqy681ls642o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2qf7s3xwfqy681ls642o.png" alt=" " width="800" height="449"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Repository Pattern — Semantic Caching Implementation
&lt;/h2&gt;

&lt;p&gt;This project uses the Repository Pattern to abstract cache storage behind a clean interface. The application code depends only on &lt;strong&gt;CacheRepository&lt;/strong&gt;, not on &lt;strong&gt;Qdrant&lt;/strong&gt;, &lt;strong&gt;FAISS&lt;/strong&gt;, or any specific backend. Backends are swappable via configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F654tr3u7761h18kg0lxn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F654tr3u7761h18kg0lxn.png" alt=" " width="800" height="578"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This project applies the &lt;strong&gt;Repository Pattern&lt;/strong&gt; to decouple semantic cache storage from application logic. Every consumer (the agent loop, RAG pipeline, HR pipeline, and tool cache) depends only on the CacheRepository abstract interface, never on a concrete backend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The payoff:&lt;/strong&gt; swap backends by changing a &lt;strong&gt;single environment variable&lt;/strong&gt;. &lt;strong&gt;Qdrant&lt;/strong&gt; runs in production, &lt;strong&gt;FAISS&lt;/strong&gt; runs locally for zero-Docker development, and adding Redis or Pinecone means creating one file and one factory registration. No application code changes.&lt;/p&gt;

&lt;h2&gt;
  
  
  DESIGN PRINCIPLE
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Application code never imports&lt;/strong&gt; &lt;strong&gt;QdrantCache&lt;/strong&gt; or &lt;strong&gt;FaissCache&lt;/strong&gt; directly. Everything flows through &lt;strong&gt;get_cache()&lt;/strong&gt;, which returns a &lt;strong&gt;CacheRepository&lt;/strong&gt;. Storage concerns are fully isolated from business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python · src/cache_repository.py
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;abc&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;abstractmethod&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CacheRepository&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ABC&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Abstract base class for semantic cache backends.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return cached response if similarity &amp;gt;= threshold, else None.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_with_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;
                         &lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Same as lookup but also returns the similarity score.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;ndarray&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;ttl_hours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Store query embedding and response in the cache.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="nd"&gt;@abstractmethod&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_age_hours&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Remove entries older than max_age_hours. Returns count removed.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
        &lt;span class="bp"&gt;...&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
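&lt;p&gt;To make the contract concrete, here is a toy in-memory backend satisfying the same four methods. It is a sketch for illustration only: brute-force cosine similarity over stored unit vectors, no index, and a simplified &lt;strong&gt;invalidate&lt;/strong&gt; that just drops expired entries:&lt;/p&gt;

```python
import time
from typing import Optional, Tuple

import numpy as np

class InMemoryCache:
    """Toy backend implementing the CacheRepository contract (illustrative)."""

    def __init__(self) -> None:
        # Each entry: (unit vector, response, namespace, expiry timestamp)
        self._entries: list[tuple[np.ndarray, str, Optional[str], float]] = []

    def insert(self, v_q: np.ndarray, response: str, namespace: Optional[str] = None,
               ttl_hours: Optional[int] = None, query: Optional[str] = None) -> None:
        expires = time.time() + ttl_hours * 3600 if ttl_hours else float("inf")
        unit = v_q / np.linalg.norm(v_q)           # normalize once at write time
        self._entries.append((unit, response, namespace, expires))

    def lookup_with_score(self, v_q: np.ndarray, threshold: float = 0.92,
                          namespace: Optional[str] = None) -> Tuple[Optional[str], float]:
        unit = v_q / np.linalg.norm(v_q)
        best, best_score = None, 0.0
        now = time.time()
        for vec, resp, ns, expires in self._entries:
            if ns != namespace or expires < now:   # wrong namespace or expired
                continue
            score = float(vec @ unit)              # cosine similarity of unit vectors
            if score > best_score:
                best, best_score = resp, score
        return (best if best_score >= threshold else None), best_score

    def lookup(self, v_q: np.ndarray, threshold: float = 0.92,
               namespace: Optional[str] = None) -> Optional[str]:
        resp, _ = self.lookup_with_score(v_q, threshold, namespace)
        return resp

    def invalidate(self, max_age_hours: int = 24) -> int:
        # Simplified: drop already-expired entries and report how many were removed.
        before = len(self._entries)
        now = time.time()
        self._entries = [e for e in self._entries if e[3] >= now]
        return before - len(self._entries)
```

&lt;p&gt;A real backend replaces the linear scan with an ANN index, but consumers cannot tell the difference; that is the point of the interface.&lt;/p&gt;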



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eeh6i6423m0dohn7504.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9eeh6i6423m0dohn7504.png" alt=" " width="800" height="308"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  WHY LOOKUP_WITH_SCORE?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;lookup_with_score&lt;/strong&gt; exists because consumers occasionally need the &lt;strong&gt;raw similarity score&lt;/strong&gt;, not just a hit/miss boolean. The tool cache uses it to log near misses (high score, just below threshold) for threshold tuning. The agent loop uses it to decide whether to show a &lt;strong&gt;cache-hit&lt;/strong&gt; indicator in the UI.&lt;/p&gt;
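&lt;p&gt;The near-miss logging this enables can be sketched in a few lines (the threshold values here are illustrative):&lt;/p&gt;

```python
HIT_THRESHOLD = 0.92     # serve from cache at or above this similarity
NEAR_MISS_FLOOR = 0.85   # scores in [0.85, 0.92) are logged for threshold tuning

def classify_score(score: float) -> str:
    """Bucket a similarity score for cache telemetry."""
    if score >= HIT_THRESHOLD:
        return "hit"
    if score >= NEAR_MISS_FLOOR:
        return "near_miss"  # high score, just below threshold: a tuning candidate
    return "miss"
```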

&lt;h2&gt;
  
  
  Factory: cache_factory.get_cache()
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cache_factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="n"&gt;The&lt;/span&gt; &lt;span class="n"&gt;factory&lt;/span&gt; &lt;span class="n"&gt;reads&lt;/span&gt; &lt;span class="n"&gt;CACHE_BACKEND&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;creates&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;appropriate&lt;/span&gt; &lt;span class="n"&gt;concrete&lt;/span&gt; &lt;span class="n"&gt;implementation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;storing&lt;/span&gt; &lt;span class="n"&gt;it&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;level&lt;/span&gt; &lt;span class="n"&gt;singleton&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt; &lt;span class="n"&gt;Every&lt;/span&gt; &lt;span class="n"&gt;subsequent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;returns&lt;/span&gt; &lt;span class="n"&gt;the&lt;/span&gt; &lt;span class="n"&gt;same&lt;/span&gt; &lt;span class="n"&gt;instance&lt;/span&gt; &lt;span class="err"&gt;—&lt;/span&gt; &lt;span class="n"&gt;avoiding&lt;/span&gt; &lt;span class="n"&gt;reconnections&lt;/span&gt; &lt;span class="n"&gt;to&lt;/span&gt; &lt;span class="n"&gt;Qdrant&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="n"&gt;repeated&lt;/span&gt; &lt;span class="n"&gt;FAISS&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="n"&gt;construction&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;

&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cache_factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;py&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.cache_repository&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt;

&lt;span class="n"&gt;_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;Return singleton backend based on CACHE_BACKEND env var.&lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;global&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;

    &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;CACHE_BACKEND&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qdrant&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.qdrant_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;QdrantCache&lt;/span&gt;
        &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;QdrantCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;http://localhost:6333&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="n"&gt;collection_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;QDRANT_COLLECTION&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;semantic_cache&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;faiss&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.faiss_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FaissCache&lt;/span&gt;
        &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;FaissCache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;ValueError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Unknown CACHE_BACKEND: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;backend&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;_cache&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Standard Consumer Pattern
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;src.cache_factory&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;get_cache&lt;/span&gt;

&lt;span class="n"&gt;cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_cache&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;              &lt;span class="c1"&gt;# Always returns CacheRepository
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 1: check before running the pipeline
&lt;/span&gt;&lt;span class="n"&gt;cached&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.85&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cached&lt;/span&gt;                &lt;span class="c1"&gt;# LLM is never called on a hit
&lt;/span&gt;
&lt;span class="c1"&gt;# Step 2: run the full pipeline
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;run_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Step 3: store for future queries
&lt;/span&gt;&lt;span class="n"&gt;cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;user_query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Adding a New Backend
&lt;/h2&gt;

&lt;p&gt;Adding a backend (&lt;strong&gt;Redis&lt;/strong&gt;, &lt;strong&gt;Pinecone&lt;/strong&gt;, &lt;strong&gt;Azure Cache&lt;/strong&gt;) requires exactly &lt;strong&gt;three steps&lt;/strong&gt;. &lt;strong&gt;No changes are needed in agent_loop.py, tool_cache.py, hr_pipeline.py, or main.py.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1 — Implement the interface
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;redis_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.cache_repository&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CacheRepository&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;typing&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;numpy&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;np&lt;/span&gt;

&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;RedisCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;CacheRepository&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lookup_with_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;lookup_with_score&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;threshold&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.92&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                          &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="n"&gt;Tuple&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;Optional&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="nb"&gt;float&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="c1"&gt;# implement vector similarity search via Redis VSIM or custom hashing
&lt;/span&gt;        &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;insert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v_q&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;namespace&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
               &lt;span class="n"&gt;ttl_hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;invalidate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_age_hours&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;24&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="nb"&gt;int&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;...&lt;/span&gt;


&lt;span class="c1"&gt;## Step 2 — Register in the factory
&lt;/span&gt;
&lt;span class="n"&gt;Python&lt;/span&gt; &lt;span class="err"&gt;·&lt;/span&gt; &lt;span class="n"&gt;src&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;cache_factory&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;py  &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;add&lt;/span&gt; &lt;span class="n"&gt;one&lt;/span&gt; &lt;span class="k"&gt;elif&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;elif&lt;/span&gt; &lt;span class="n"&gt;backend&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;.redis_cache&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RedisCache&lt;/span&gt;
        &lt;span class="n"&gt;_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;RedisCache&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;REDIS_URL&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;redis://localhost:6379&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;


&lt;span class="c1"&gt;## Step 3 — Set the environment variable
&lt;/span&gt;
&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;env&lt;/span&gt;
&lt;span class="n"&gt;CACHE_BACKEND&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;
&lt;span class="n"&gt;REDIS_URL&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;redis&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;localhost&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;6379&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  ZERO APPLICATION CHANGES
&lt;/h2&gt;

&lt;p&gt;After these three steps, every consumer (agent_loop, tool_cache, hr_pipeline, main) automatically uses Redis. No imports to update, no signatures to change. This is the Repository Pattern payoff.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Repository Pattern separates&lt;/strong&gt; what the cache does (the interface) from how it stores data (the backend). This is the property that makes the system testable with FAISS, production ready with &lt;strong&gt;Qdrant&lt;/strong&gt;, and infinitely extensible without &lt;strong&gt;touching application code.&lt;/strong&gt;&lt;/p&gt;
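&lt;p&gt;The payoff can be sketched with a minimal, self-contained factory. The class and function names below (CacheRepository, InMemoryCache, get_cache) are illustrative stand-ins for the project's actual modules, not its exact code:&lt;/p&gt;

```python
import os
from abc import ABC, abstractmethod

class CacheRepository(ABC):
    """What the cache does: the only thing consumers depend on."""
    @abstractmethod
    def get(self, key: str): ...

    @abstractmethod
    def set(self, key: str, value: str) -> None: ...

class InMemoryCache(CacheRepository):
    """How one backend stores data; a stand-in for the FAISS dev cache."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        return self._store.get(key)

    def set(self, key, value):
        self._store[key] = value

def get_cache() -> CacheRepository:
    """Factory: the single place that knows which backend is active."""
    backend = os.getenv("CACHE_BACKEND", "memory")
    if backend == "memory":
        return InMemoryCache()
    # elif backend == "redis": return RedisCache(...)  # one elif per backend
    raise ValueError(f"unknown CACHE_BACKEND: {backend}")

cache = get_cache()
cache.set("policy:sick leave", "15 days per year")
print(cache.get("policy:sick leave"))
```

&lt;p&gt;Adding a new backend means adding one class and one elif in the factory; no consumer ever changes.&lt;/p&gt;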

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc4evqeysjcn0l2g7u0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgc4evqeysjcn0l2g7u0w.png" alt=" " width="800" height="327"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Request flow
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28w9dumop4go60sexp6q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F28w9dumop4go60sexp6q.png" alt=" " width="800" height="861"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Output/Testing
&lt;/h2&gt;

&lt;h1&gt;
  
  
  SEMANTIC CACHING — ALL QUERIES + TOOL CACHE EXAMPLES
&lt;/h1&gt;

&lt;p&gt;TOOL CACHE TEST (same tool, same param — Call 1 then Call 2):&lt;br&gt;
  get_company_policy(topic='remote work') — Call 1: MISS, Call 2: HIT&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #1&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       What is our sick leave policy?&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  MISS  |  TOOL CACHE:  get_company_policy(topic='sick leave'): MISS  |  TIME:  2.48s&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #2&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       Tell me about our sick leave policy&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  HIT  |  TOOL CACHE:&lt;/strong&gt;  skipped (query cached)  |  TIME:  0.17s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #3&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       What is our sick leave policy?&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Our sick leave policy allows employees to take 15 sick days per year. A doctor's note is required if you are absent for more than 3 consecutive days.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  HIT  |  TOOL CACHE:&lt;/strong&gt;  skipped (query cached)  |  TIME:  0.16s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #4&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       Tell me about vacation days&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Employees are entitled to 20 days of paid time off (PTO) per year for vacation. Requests for vacation days must be submitted at least 2 weeks in advance.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  MISS  |  TOOL CACHE&lt;/strong&gt;:  get_company_policy(topic='vacation days'): MISS  |  TIME:  2.59s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #5&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       Tell me about vacation&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     Employees are entitled to 20 days of paid time off (PTO) per year for vacation. Requests for vacation days must be submitted at least 2 weeks in advance.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  HIT  |  TOOL CACHE:&lt;/strong&gt;  skipped (query cached)  |  TIME:  0.22s&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query #6&lt;/strong&gt;&lt;br&gt;
  &lt;strong&gt;PROMPT&lt;/strong&gt;:       How many PTO days for vacation?&lt;br&gt;
  &lt;strong&gt;RESPONSE&lt;/strong&gt;:     You have 20 PTO days for vacation per year. Please note that you must request your time off at least 2 weeks in advance.&lt;br&gt;
  &lt;strong&gt;QUERY CACHE:  MISS  |  TOOL CACHE:&lt;/strong&gt;  get_company_policy(topic='PTO days for vacation'): HIT  |  TIME:  2.21s&lt;/p&gt;

&lt;p&gt;=============================================================&lt;/p&gt;

&lt;h2&gt;
  
  
  TOOL CACHE EXAMPLES (tool call → result, from this run)
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='remote work')  →  &lt;strong&gt;Call 1: MISS, Call 2: HIT&lt;/strong&gt;&lt;br&gt;
Result:   Hybrid: 2 days in office required. Equipment provided.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='sick leave')  →  &lt;strong&gt;MISS&lt;/strong&gt;&lt;br&gt;
Result:   Employees get 15 sick days per year. Doctor's note required after 3 consecutive days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='vacation days')  →  &lt;strong&gt;MISS&lt;/strong&gt;&lt;br&gt;
Result:   20 days PTO per year. Must request 2 weeks in advance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;get_company_policy&lt;/strong&gt;(topic='PTO days for vacation')  →  &lt;strong&gt;HIT&lt;/strong&gt;&lt;br&gt;
Result:   20 days PTO per year. Must request 2 weeks in advance.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;==============================================================&lt;/p&gt;

&lt;p&gt;Total: 7.83s for 6 queries&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Semantic caching&lt;/strong&gt; turns &lt;strong&gt;repeated or similar questions&lt;/strong&gt; into instant answers instead of expensive LLM and tool calls. By using embeddings and vector similarity instead of exact string matching, you can cache responses for "What is our sick leave policy?" and serve them when users ask "Tell me about sick leave", "How many sick days do we get?", or other paraphrased variants.&lt;/p&gt;
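&lt;p&gt;A toy sketch of that similarity test, using bag-of-words vectors as a stand-in for real embeddings (the actual system embeds with a model and searches FAISS/Qdrant; the threshold here is illustrative):&lt;/p&gt;

```python
import math
import re
from collections import Counter

def embed(text: str) -> Counter:
    """Bag-of-words 'embedding': a stand-in for a real embedding model."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

cached = "What is our sick leave policy?"
paraphrase = "Tell me about our sick leave policy"
unrelated = "How do I reset my password?"

# A paraphrase clears the similarity threshold; an unrelated query does not.
print(cosine(embed(cached), embed(paraphrase)) > 0.5)  # True
print(cosine(embed(cached), embed(unrelated)) > 0.5)   # False
```

&lt;p&gt;Exact string matching would treat all three prompts as distinct keys; vector similarity is what lets the paraphrase hit the cache.&lt;/p&gt;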

&lt;p&gt;&lt;strong&gt;The Repository Pattern keeps the implementation clean and flexible&lt;/strong&gt;: you can run with &lt;strong&gt;FAISS&lt;/strong&gt; for local development and switch to &lt;strong&gt;Qdrant&lt;/strong&gt; for &lt;strong&gt;production&lt;/strong&gt; without changing application code. The two-layer approach (a query cache for full responses and a tool cache for tool results) gives you fine-grained control over what gets cached and when.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key takeaways:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Meaning over exact match:&lt;/strong&gt; Embeddings capture intent, so paraphrases and synonyms hit the cache.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Two layers:&lt;/strong&gt; Cache full responses and tool results separately; each has its own threshold and namespace.&lt;/p&gt;
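&lt;p&gt;One way to sketch the two layers, with a word-overlap (Jaccard) stand-in for embedding similarity. The CacheLayer class and the threshold values are illustrative, not the project's actual code or settings:&lt;/p&gt;

```python
def similarity(a: str, b: str) -> float:
    """Jaccard word overlap: a stand-in for embedding cosine similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if wa | wb else 0.0

class CacheLayer:
    """One semantic cache layer with its own namespace and threshold."""
    def __init__(self, namespace: str, threshold: float):
        self.namespace = namespace
        self.threshold = threshold
        self._entries = []  # list of (key_text, value) pairs

    def lookup(self, key: str):
        for stored_key, value in self._entries:
            if similarity(key, stored_key) >= self.threshold:
                return value  # semantic HIT
        return None  # MISS

    def store(self, key: str, value: str) -> None:
        self._entries.append((key, value))

# Looser threshold for paraphrased questions, stricter for tool calls.
query_cache = CacheLayer("query", threshold=0.5)  # full LLM responses
tool_cache = CacheLayer("tool", threshold=0.9)    # individual tool results

tool_cache.store("get_company_policy(topic='sick leave')", "15 sick days/year")
query_cache.store("What is our sick leave policy?",
                  "You get 15 sick days per year.")

# A query-cache HIT means the tool cache is never consulted at all:
print(query_cache.lookup("what is our sick leave policy?") is not None)
```

&lt;p&gt;Separate namespaces keep a cached full response from ever colliding with a cached tool result, and each layer can tune its threshold independently.&lt;/p&gt;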

&lt;p&gt;&lt;strong&gt;Swappable backends:&lt;/strong&gt; Use &lt;strong&gt;Qdrant&lt;/strong&gt; for persistence or &lt;strong&gt;FAISS&lt;/strong&gt; for fast, &lt;strong&gt;in-memory&lt;/strong&gt; testing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Know when to skip:&lt;/strong&gt; Avoid caching personal, entity-specific, or rapidly changing data.&lt;/p&gt;
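&lt;p&gt;A guard like the following can enforce that rule before anything is written to the cache. The patterns are an illustrative heuristic for the categories named above, not the project's actual rules:&lt;/p&gt;

```python
import re

# Illustrative patterns only: one possible way to flag queries that
# should bypass the cache (personal, entity-specific, or volatile data).
SKIP_PATTERNS = [
    r"\bmy\b", r"\bme\b", r"\bI\b",           # personal queries
    r"\btoday\b", r"\bnow\b", r"\blatest\b",  # rapidly changing data
    r"\bstock price\b", r"\bweather\b",       # volatile lookups
]

def should_cache(prompt: str) -> bool:
    """Return False for prompts whose answers should not be cached."""
    return not any(re.search(p, prompt, re.IGNORECASE) for p in SKIP_PATTERNS)

print(should_cache("What is our sick leave policy?"))    # True: shared policy
print(should_cache("How many PTO days do I have left?")) # False: personal
```

&lt;p&gt;The same check can run before lookup as well, so a personal question never accidentally matches a cached generic answer.&lt;/p&gt;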

&lt;p&gt;&lt;strong&gt;Thanks&lt;br&gt;
Sreeni Ramadorai&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>performance</category>
      <category>rag</category>
    </item>
  </channel>
</rss>
