AI Context and Memory Landscape
Not written by me. Generated from my notes and agent research, heavily written by LLMs, for my own reference. Published so it is easy to reference.
A field guide to memory systems, retrieval architectures, code-understanding engines, semantic layers, and the patterns that connect them. Filtered for people actually building or shipping with this stuff.
1. The two axes that separate everything
Almost every "AI memory / context" product fits on two axes:
- What's being remembered? — Code structure, conversations and decisions, arbitrary documents, or structured data definitions.
- How is it represented? — Flat chunks (vector search), graphs (entities + relations), verbatim text, or model-managed scratchpads.
The systems that look like competitors usually aren't. They sit at different points on these axes and stack rather than replace each other.
2. The base categories
| System type | What it indexes | Representation | Lives where | Killer use case |
|---|---|---|---|---|
| Normal RAG | Documents/chunks | Flat vector index | External DB | Q&A over a doc corpus |
| GraphRAG | Documents → extracted entities/relations | Knowledge graph + community summaries | External DB + graph store | "Big picture" questions across a corpus |
| Code knowledge engines (GitNexus etc.) | Source code (calls, imports, inheritance, flows) | Code-aware knowledge graph | MCP server over a repo | "What breaks if I change this function?" |
| Verbatim conversation memory (MemPalace) | AI conversations, verbatim | Hierarchy + vectors + temporal KG | Local files | Persistent personal/agent memory |
| Memory tools (Anthropic, ChatGPT memory, mem0, Letta) | Model-curated facts/notes | Model-managed scratchpad/files | Provider-side or local | The model writes its own notes, no infra |
How each category actually works
Normal RAG. Chunk documents → embed → store vectors → at query time, embed the query, fetch top-k, stuff into context. Cheap, simple, the baseline. Struggles with multi-hop questions, "summarize the whole corpus," and anything requiring relationships between chunks.
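A minimal sketch of that loop, assuming sentence-transformers for embeddings and a plain in-memory index (model name, chunk size, and `top_k` are illustrative choices, not a recommendation):

```python
# Minimal RAG retrieval loop: chunk -> embed -> index -> query-time top-k.
# Sketch only; assumes the sentence-transformers package and an in-memory index.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

def chunk(text: str, size: int = 512) -> list[str]:
    # Naive fixed-size character chunking; real systems use token-aware splitters.
    return [text[i:i + size] for i in range(0, len(text), size)]

def build_index(docs: list[str]):
    chunks = [c for d in docs for c in chunk(d)]
    vecs = model.encode(chunks, normalize_embeddings=True)  # unit vectors -> dot == cosine
    return chunks, np.asarray(vecs)

def retrieve(query: str, chunks: list[str], vecs: np.ndarray, top_k: int = 5) -> list[str]:
    q = model.encode([query], normalize_embeddings=True)[0]
    scores = vecs @ q
    return [chunks[i] for i in np.argsort(-scores)[:top_k]]

# The retrieved chunks are then stuffed into the prompt ("context") for the LLM.
```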
GraphRAG (Microsoft's pattern, and the family of techniques after it). Use an LLM at index time to extract entities and relationships from documents into a knowledge graph, cluster the graph, summarize each community. At query time, do entity-centric local search ("what does X relate to?") or community-summary-based global search ("themes across the whole corpus"). Better for synthesis, worse for cost — the indexing pass is LLM-heavy.
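A sketch of the index-time half of that pattern: LLM-extracted triples go into a graph, the graph gets clustered, each community gets a summary. `extract_triples` and `summarize` are toy stand-ins for LLM calls, and networkx's Louvain clustering stands in for GraphRAG's Leiden step purely for brevity:

```python
# GraphRAG-style indexing sketch: extract entities/relations, cluster, summarize communities.
# extract_triples() and summarize() are toy stand-ins for LLM calls (not any specific API).
import networkx as nx

def extract_triples(doc: str) -> list[tuple[str, str, str]]:
    # Toy stand-in: real GraphRAG prompts an LLM for (subject, relation, object) triples.
    return [("Alice", "works_at", "Acme")] if "Alice" in doc else []

def summarize(facts: list[str]) -> str:
    # Toy stand-in for an LLM community-summary prompt.
    return "; ".join(facts)

def build_graph_index(docs: list[str]):
    g = nx.Graph()
    for doc in docs:
        for subj, rel, obj in extract_triples(doc):
            g.add_edge(subj, obj, relation=rel)
    # Cluster the entity graph (GraphRAG proper uses Leiden; Louvain is close enough here).
    communities = nx.community.louvain_communities(g)
    # One summary per community -> used later for "global" corpus-level questions.
    summaries = [summarize([f"{u} -{g[u][v]['relation']}-> {v}"
                            for u, v in g.subgraph(c).edges]) for c in communities]
    return g, communities, summaries
```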
Code knowledge engines. GraphRAG's idea, but specialized for code. Instead of an LLM extracting "Alice — works_at — Acme", a parser extracts validate() — called_by — checkout() and UserService — implements — IService. Exposed through MCP so coding agents can ask one structural question instead of grepping ten times. Not really a "memory" system — a code understanding layer.
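A toy version of the same idea using Python's own `ast` module: extract call edges from a module, then answer "who calls X?" by traversing edges instead of grepping. The single-file scope and name-only resolution are the simplification; real engines resolve calls across files, imports, and class hierarchies:

```python
# Toy code knowledge graph: callee -> called_by edges from one Python module.
import ast
from collections import defaultdict

def call_edges(source: str) -> dict[str, set[str]]:
    """Map callee name -> set of functions that call it (called_by edges)."""
    tree = ast.parse(source)
    called_by: dict[str, set[str]] = defaultdict(set)
    for fn in [n for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)]:
        for node in ast.walk(fn):
            if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                called_by[node.func.id].add(fn.name)
    return called_by

src = """
def validate(order): ...
def checkout(order):
    validate(order)
"""
print(call_edges(src)["validate"])  # {'checkout'} -- "what breaks if I change validate()?"
```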
Verbatim conversation memory. A different philosophy from RAG/GraphRAG: don't summarize, store verbatim. The bet is that LLM-summarized memory loses the thing you actually needed. Conversations get organized into a structured hierarchy (e.g., MemPalace's "method of loci": wings = projects/people, rooms = topics, halls = memory types, drawers = the raw text). Retrieval is scoped vector search over that hierarchy, often plus a small temporal knowledge graph.
Memory tools for AI. The newest category and the most different. Instead of you building an index, the model itself writes notes to a memory store and reads them later. Anthropic's memory tool, ChatGPT memory, mem0, Letta (formerly MemGPT) sit here. The model decides what's worth saving. Pros: zero infra. Cons: it forgets/hallucinates, you have less control, and it has no idea about your codebase or documents unless something else feeds it.
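Mechanically, this category is just a pair of tools the model can call against a small store it controls. A generic sketch of the pattern — the tool names, storage layout, and substring "retrieval" here are illustrative, not Anthropic's or OpenAI's actual API:

```python
# Generic "model manages its own notes" pattern: expose save/search tools to the model.
# Tool names and storage layout are illustrative; real products differ.
import json
import pathlib

STORE = pathlib.Path("agent_memory.jsonl")

def memory_save(text: str, topic: str = "general") -> str:
    """Tool the model calls when it decides something is worth remembering."""
    with STORE.open("a") as f:
        f.write(json.dumps({"topic": topic, "text": text}) + "\n")
    return "saved"

def memory_search(query: str, limit: int = 5) -> list[str]:
    """Tool the model calls to recall notes; naive substring match stands in for retrieval."""
    if not STORE.exists():
        return []
    notes = [json.loads(line) for line in STORE.read_text().splitlines()]
    return [n["text"] for n in notes if query.lower() in n["text"].lower()][:limit]

# The agent loop advertises these two functions as tools; the model decides when to call
# them -- which is exactly the control trade-off described above.
```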
When to reach for which
- Q&A over a stable doc corpus, low budget → normal RAG.
- Synthesis/themes across a corpus, multi-hop questions → GraphRAG.
- AI agent editing your code and you're tired of it breaking distant callers → a code-graph tool.
- You want your assistant to remember decisions, debugging history, and conventions across sessions → MemPalace, MemNexus, or mem0.
- You don't want to run infrastructure and trust the model to manage its own notes → built-in memory tools.
The mental model that ties it together
Think of it as a stack, not a competition:
- GraphRAG / normal RAG = retrieval over knowledge you brought in (docs).
- Code knowledge engines = retrieval over the structure of your code.
- MemPalace / memory tools = retrieval over what was said and decided in past sessions.
A serious agent setup runs all three layers at once. They feed the same context window from different angles.
The one place these genuinely overlap is verbatim memory vs. memory tools: same goal (persistent memory), different control model. MemPalace says you own the store and storage is verbatim; memory tools say the model decides. That's the real philosophical fork.
3. The Augment Code case study (vertically integrated context)
Augment Code is interesting because it spans two of the categories above and stakes out a distinct philosophy on a third.
| Augment product | Fits in category | Closest peer |
|---|---|---|
| Context Engine | Code understanding layer | GitNexus |
| Context Engine MCP | Code understanding, exposed externally | GitNexus MCP |
| Memories (in the IDE agent) | Memory tools | mem0, MemPalace, ChatGPT memory |
| Augment Agent / Auggie CLI | The agent that consumes all of the above | Cursor, Claude Code |
| Intent | Agent orchestration on top of the stack | Conductor.build, Claude Code swarms |
Context Engine vs. GitNexus
Both are "codebase understanding via MCP," but philosophically different:
- GitNexus is graph-first: parses code into an explicit knowledge graph (function calls, imports, class inheritance, execution flow). When you ask "who calls `validate()`?", you're traversing edges, not running an embedding-similarity search. Open source, structural.
- Augment Context Engine is semantic-first: a "full search engine for code" with embedding-based retrieval, plus relationship awareness on top. Closed source, hosted (with a local Auggie CLI mode), and indexes well beyond the repo — commit history, multiple repos, internal docs/wikis/runbooks via Context Connectors.
Same problem, different bet:
- GitNexus bets that explicit code structure is what you need.
- Augment bets that semantic similarity + curated retrieval + scale is what you need.
In practice, Augment leans heavier on retrieval engineering (chunking, ranking, multi-source, real-time incremental indexing); GitNexus leans heavier on guarantees that come from a typed graph ("these are all the callers, with confidence scores"). The Feb 2026 release of Context Engine MCP made this an apples-to-apples competition — you can plug Augment's index into Cursor or Claude Code the same way you plug in GitNexus.
Memories vs. MemPalace and other memory tools
Augment's Memories feature stakes out a position that's almost the philosophical opposite of MemPalace:
- MemPalace: "Store everything verbatim, never let an LLM decide what to throw away. Retrieval finds the right thing later."
- mem0 / ChatGPT memory / Anthropic memory tool: "Let the model auto-save what it thinks is important. Convenient, but you lose visibility."
- Augment Memories: "Surface every memory before it's saved. You approve, edit, or discard. Promote good ones to workspace Rules so the team gets them."
Augment is essentially saying: auto-save is dangerous (the agent keeps remembering deprecated patterns), and verbatim storage is overkill — what you want is curated, human-in-the-loop memory that can graduate into shared rules. It's the most opinionated take on the "what should we remember?" question.
Intent: orchestration on top of the stack
Intent (Mac desktop app, public beta as of early 2026) is Augment's orchestration layer — positioned as "what comes after the IDE." It runs on top of everything else they've built and is the only product in the landscape that spans orchestration + context + memory under one roof. Backed by $472M in funding (Index, Lightspeed, Eric Schmidt). SOC 2 Type II and ISO/IEC 42001 certified.
Architecturally:
- Spaces — each is an isolated git worktree (mechanically similar to Conductor.build's workspaces; see §4).
- Three-agent default team per Space:
- Coordinator — uses the Context Engine to analyze the codebase, drafts a spec, generates tasks.
- Specialists — six personas (Implementation, Architecture, Testing, etc.) plus custom; execute in parallel waves, each in their own context.
- Verifier — checks results against the spec before human review.
- BYOA (Bring Your Own Agent) — Claude Code, Codex, OpenCode, or Augment's Auggie. Augment-native gets full Context Engine; BYOA gets it via the Context Engine MCP server.
- Multi-model per task — Opus 4.5 for architecture, Sonnet 4.5 for speed, GPT-5.2 for code review, Haiku for lightweight work.
The genuinely new idea is the Living Spec — a self-updating document that replaces chat history as the coordination substrate between agents. The Coordinator drafts it from your prompt + Context Engine analysis. You approve it. As Specialists work, they read from and write to the spec. When code changes, the spec updates to reflect what was actually built. When requirements change, updates propagate to active agents.
How Intent's layers stack with what we covered:
- Per-specialist context fanning. The Coordinator queries the Context Engine to identify which files/services/patterns each Specialist needs, and the Specialist starts with that pre-computed slice. The Context Engine recognizes indirect dependencies through event systems, message queues, configuration files, and database triggers — so the Auth Specialist knows the API Specialist consumes its tokens without manual context-passing. Marketing line: "each agent gets the context it actually needs, instead of whatever you remembered to paste into a prompt." Reviewers report indexing 40K+ files in ~6 minutes with 45-second incremental updates.
- Cross-Space, cross-session memory. Augment Memories (the approval-gated feature) carries user/team conventions across all Spaces and sessions.
- In-Space coordination memory. The Living Spec is the shared mutable substrate.
- Resumable sessions. Space state persists on disk; close Intent, reopen tomorrow, every Space is exactly where you left it. Auto-commit captures work as Specialists finish.
- MCP support throughout. External tools and other memory/context MCPs plug in at the Space level.
CLAUDE.md configuration files carry over intact when BYOA agents (Claude Code etc.) run under Intent's orchestration — they read from their working directory as normal, preserving project-specific conventions inside the Augment workflow.
The vertically integrated stack
┌─────────────────────────────────────────────────────────┐
│ Orchestration │ Intent (spec-driven) │ ← also: Conductor.build, custom swarms
├─────────────────────────────────────────────────────────┤
│ Agent layer │ Augment Agent / Auggie CLI │ ← also: Cursor, Claude Code
├─────────────────────────────────────────────────────────┤
│ Code understanding │ Context Engine (semantic) │ ← also: GitNexus (graph)
├─────────────────────────────────────────────────────────┤
│ Doc retrieval │ Context Connectors (docs/wikis) │ ← also: RAG, GraphRAG
├─────────────────────────────────────────────────────────┤
│ Persistent memory │ Memories (approval-gated) │ ← also: MemPalace, mem0
└─────────────────────────────────────────────────────────┘
Augment now owns four of five layers. They're the only company in the landscape that does. The bet behind Intent is that the differentiator isn't any single layer — it's the cohesion across layers. Mix-and-match alternative: Conductor.build + Claude Code + GitNexus + a RAG system + MemPalace. More flexible, less cohesive, you're the integrator.
4. Agent orchestration: the layer above the runtime
Once you have one good coding agent, the next problem is running several of them without them stepping on each other. This is its own product category in 2026, and there are three distinct philosophies plus Augment's Intent.
The three "Conductor" projects (all different)
Confusingly, three projects share the name. They solve different problems.
Conductor.build (Melty Labs) — parallel Claude Code on git worktrees
Mac app from a YC S24 team (the people behind the Melty editor). Built on Claude Code SDK + Tauri.
- Each workspace = an isolated git worktree, set up automatically with your repo + setup script in ~10 seconds.
- Supports Claude Code and Codex; uses your existing logins (API key, Pro, or Max).
- Multi-model mode — run Claude and Codex in different tabs on the same prompt to compare.
- Spotlight testing — sync changes back to your main repo for testing without merging.
- `/resolve-merge-conflicts` slash command — Claude reads diffs from both branches and reasons about reconciliation.
- Free (uses your Claude/Codex compute).
How it deals with context: deliberately doesn't. Conductor.build sits in the orchestration/runtime layer only and leaves context/memory to Claude Code itself.
- Context = whatever's in the worktree + `CLAUDE.md` / project rules.
- Cross-agent shared state = none (worktree isolation by design).
- Cross-session memory = none — defer to `CLAUDE.md` and external MCP servers (memory, code intelligence).
How it deals with memory: checkpointing, not memory. Their engineering work is on in-session rollback:
- Captures three pieces of state per turn: HEAD, index (staged), worktree (all files including untracked).
- Stored as private git refs at `.git/refs/conductor-checkpoints/<id>` — they wrote a custom approach because `git stash create` misses untracked files and `git stash -u` modifies the user's working tree. Instead they use `git write-tree` against a temp index via `GIT_INDEX_FILE` to capture untracked files non-disruptively (see the sketch after this list).
- One-click rollback restores files, git state, and chat history — the chat rewind is the unusual part.
- Caveat: two agents in the same workspace can't be unwound separately (state is mixed). Their answer: use isolated workspaces or use subagents.
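A sketch of that capture step with git plumbing from Python — build a throwaway index so the user's real index and worktree are untouched. The ref namespace comes from their writeup; the exact commands are my reconstruction of the described approach, not Conductor's code:

```python
# Non-disruptive checkpoint sketch: snapshot tracked + untracked files without touching
# the user's index or worktree. Run from the repo root. Reconstruction, not Conductor's code.
import os
import subprocess
import tempfile
import uuid

def git(*args: str, env: dict | None = None) -> str:
    e = {**os.environ, **(env or {})}
    return subprocess.run(["git", *args], env=e, check=True,
                          capture_output=True, text=True).stdout.strip()

def checkpoint(message: str = "conductor checkpoint") -> str:
    temp_index = os.path.join(tempfile.mkdtemp(), "index")  # does not exist yet; git creates it
    env = {"GIT_INDEX_FILE": temp_index}          # keep the user's real index untouched
    git("add", "-A", env=env)                     # stage tracked + untracked into the temp index
    tree = git("write-tree", env=env)             # tree object capturing the full worktree
    head = git("rev-parse", "HEAD")
    commit = git("commit-tree", tree, "-p", head, "-m", message)
    ref = f"refs/conductor-checkpoints/{uuid.uuid4().hex[:8]}"
    git("update-ref", ref, commit)                # private ref; invisible to normal git UX
    return ref                                    # rollback = restore files/state from this ref
```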
Trade-offs reviewers flag: token costs multiply with N agents; spec discipline upstream (clearly delineated file/component boundaries) is required for parallelization to be useful.
Conductor OSS (Netflix-origin) — durable workflow execution
Completely different beast. Started at Netflix as a workflow orchestration engine for microservices, now extended with first-class AI agent primitives. Memory = persisted workflow state.
- Every task persists. If the server crashes mid-execution, the workflow resumes from the last completed task. No re-running expensive LLM calls.
- Native MCP support — `LIST_MCP_TOOLS` and `CALL_MCP_TOOL` are first-class system tasks. Agents discover tools at runtime.
- Durable human-in-the-loop — the `HUMAN` task type pauses indefinitely (across server restarts and deploys) until approval.
- `DO_WHILE` agent loops — turn a planner-executor into an autonomous loop where each iteration is a durable checkpoint. Crash at iteration 12 → resume at iteration 12.
- Full observability — every prompt, response, tool call, and decision recorded as part of workflow history.
"Memory" here is the workflow execution log — not LLM context, not user-facing memory, but a perfect replayable audit trail. If you want LLM-style semantic memory on top, add a memory MCP server as just another tool.
blueman82/conductor (GitHub OSS) — wave-based multi-agent
Less prominent third project, focused on autonomous multi-agent execution of implementation plans for Claude Code.
- Wave-based execution — parallel within waves, sequential between waves; dependency-aware.
- Quality control with GREEN/RED/YELLOW verdicts and automatic retries on failure.
- Adaptive learning — learns from history, swaps agents on failures.
- Pattern Intelligence / STOP protocol — prior-art detection, duplicate prevention. This is essentially memory of what's already been built.
- Mandatory commits — agents instructed to commit; conductor verifies via `git log`. Memory of progress lives in git history.
- Budget & rate limits with intelligent auto-resume, session resume, state persistence.
- Git rollback — task-level checkpoints with automatic rollback on QC failure.
- Slash command `/conductor` for Claude Code generates conductor-compatible YAML plans via an `implementation-planner` skill.
- Inspired by CLAWED (Claude Agent Workflow Execution Delegation).
Memory model: stateful orchestration metadata (which agents work for which tasks, what's been built, budget/rate state) + git history as the canonical record. No LLM-context memory.
Augment Intent (recap, in this category)
Covered in §3. The unique pitch in the orchestration category is the Living Spec as the shared substrate between agents — instead of agents coordinating implicitly via filesystem isolation (Conductor.build) or workflow state (Conductor OSS) or wave dependencies (blueman82), they coordinate via a structured, mutable spec document that all of them read and update.
How they stack on context and memory
Despite different products, all four handle context/memory at the orchestration level in a consistent way: they hand off to layers below them. Orchestrators handle agent state, not agent memory. The semantic-memory layer (Mem0, Supermemory, Zep, MemPalace) sits underneath them as just another MCP-exposed capability.
| Layer | Conductor.build | Conductor OSS | blueman82 | Intent |
|---|---|---|---|---|
| In-session state | Git checkpoints (HEAD + index + worktree + chat) | Workflow execution log, fully replayable | Task-level git checkpoints, QC-tied | Auto-commit + Living Spec updates |
| Cross-agent shared state | None (worktree isolation) | Workflow inputs/outputs flow between tasks | Wave outputs flow to next wave | Living Spec (mutable, shared) |
| Cross-session memory | None — defer to CLAUDE.md + external MCP | None — defer to external memory MCP | Adaptive learning over orchestration metadata | Augment Memories (cross-Space, approval-gated) |
| Context source | Whatever Claude Code grabs | Tools the workflow calls | Whatever Claude Code grabs | Context Engine fans out per-Specialist |
| LLM-style semantic memory | Not their problem | Not their problem | Not their problem | Owned via Memories |
The philosophical split: what's the coordination substrate?
This is the new fork that didn't exist a year ago:
- Git/filesystem as coordinator (Conductor.build) — agents are isolated; the only thing they share is the repo. Discipline lives in `spec.md`-style attack documents committed up front.
- Workflow state as coordinator (Conductor OSS) — agents pass typed inputs/outputs through a durable workflow graph. Coordination is encoded as DAG edges.
- Wave dependencies as coordinator (blueman82) — explicit dependency declarations between tasks. Agents are stateless executors.
- Living spec as coordinator (Intent) — agents share a structured mutable document. Coordination happens by reading and writing to it.
- Chat history as coordinator (default Claude Code / Cursor multi-agent) — agents share what's in the conversation. The model that doesn't work at scale, which is why the others exist.
Practical guidance for layering memory under orchestrators
If you use any of these and want real memory:
- Keep `CLAUDE.md` / rules files at the repo root — your stable context, loaded into every workspace automatically.
- Configure a memory MCP (Supermemory, Mem0, MemNexus, MemPalace) in each workspace's `.mcp.json` — it runs independently of the orchestrator and persists across worktrees.
- Don't expect orchestrator checkpoints to substitute for memory — they're undo, not recall.
- For codebase context, mount GitNexus MCP (or Augment Context Engine MCP) in each workspace.
- For Conductor.build specifically: spec discipline upstream is what makes parallelization useful. The specs become a form of shared "memory" between worktrees, just stored as committed files.
When orchestration is overkill
For most single-developer flows on a single feature, a single Claude Code or Cursor session is still the right tool. Orchestration pays off when:
- You have 3+ tracks ready to implement simultaneously
- Tracks have no shared code boundaries (specs/attack docs define these)
- Each track is substantial enough to benefit from isolation (>2 hours of single-agent work)
- You can spec well enough to let agents run unsupervised for stretches
If those don't all hold, the coordination overhead exceeds the parallelism win.
5. Memory frameworks: the agent / conversational memory category
The serious players, ordered by how distinctive their architecture is:
- Letta (formerly MemGPT) — the OG. UC Berkeley paper turned commercial agent runtime. Three-tier "virtual memory": core (always in context), recall (cache), archival (cold). Agent edits its own memory via tool calls. Recently added sleep-time compute — the agent reorganizes memory during idle periods. ~22K stars, Apache 2.0. ~$10M seed (Jeff Dean among investors).
- Mem0 — pluggable layer, not a runtime. Vector + optional graph (Mem0g, paywalled at $249/mo). One-line integration. ~48K stars, YC-backed ($24M Series A). Independent eval: 49% LongMemEval, 66.9% LoCoMo (84% with Mem0g).
- Zep / Graphiti — temporal knowledge graph with `valid_from` / `valid_to` / `invalid_at` on every edge. Built on Neo4j/FalkorDB/Kuzu. Strongest temporal reasoning of the bunch — 71.2% self-reported on LongMemEval (63.8% independent). Graphiti is open source and has its own life as a temporal KG library (~25K stars).
- Supermemory — currently claims #1 on all three memory benchmarks (LongMemEval 81.6%, LoCoMo, ConvoMem). 21K stars, MIT. Drop-in wrappers for Mastra, Vercel AI SDK, LangChain. Also published MemoryBench, an open eval framework.
- MemPalace — verbatim-storage philosophy, 96.6% LongMemEval R@5 raw mode. Local-first, ChromaDB + SQLite, no API key required. Inspired by the ancient method of loci. AAAK shorthand for 30x compression. 4-layer memory stack (L0 identity / L1 critical facts / L2 room recall / L3 deep semantic).
- Cognee — memory-as-pipeline (ingest → normalize → extract → graph → retrieve). Closer to a structured RAG pipeline than the others.
- MemMachine — open-source universal memory layer; the "Mem0 alternative without the paywall" pitch.
- MemNexus — hosted persistent memory with `mx setup` auto-config for Claude Code/Cursor/Copilot/Codex/Windsurf, an MCP server, and a CommitContext post-commit hook capturing the reasoning behind every commit.
Research-flavored, worth tracking
- A-MEM — Agentic Memory for LLM Agents (Xu et al., 2025). Notes-style structured memories the agent maintains.
- MIRIX — six typed memory components (Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault) with a Meta Memory Manager. Hits 85% on LoCoMo.
- LiCoMemory — hierarchical, beats Mem0/Zep/A-MEM on multi-session subsets of LongMemEval.
- ENGRAM-R — fact-card orchestration, 95% input token reduction.
Comparison snapshot
| Framework | p95 Latency | LoCoMo | LongMemEval | License | Best for |
|---|---|---|---|---|---|
| Mem0 (base) | 1.44s | 66.9% | 49.0% | Apache 2.0 | Fast integration, real-time chat |
| Mem0g (graph) | 2.59s | 68.4% | ~54% est. | Apache 2.0 (paywalled) | Multi-hop |
| Zep | ~4s avg | not pub. | 63.8% (indep.) / 71.2% (self) | -- | Temporal-heavy domains |
| Letta | model-dep. | ~83.2% | not pub. | Apache 2.0 | Long-running autonomous agents |
| Supermemory | -- | #1 | 81.6% | MIT | Production memory layer |
| MemPalace | -- | -- | 96.6% R@5 (raw) | -- | Local personal/agent memory |
| Full-context | 17.12s | 72.9% | -- | -- | Baseline reference |
Quick decision framework
- Mem0 if you need production integration in days, latency is hard, and conversations are short-to-medium.
- Zep if conversations span weeks/months, users update facts that contradict earlier statements, temporal accuracy is a hard product requirement.
- Letta if you're building memory-first from scratch and want the agent itself to manage memory; willing to adopt the full runtime.
- Supermemory if you want highest benchmarks with simple integration and MIT license.
- MemPalace if local-first, single-user, and verbatim-everything is the philosophy you want.
6. Code understanding tools: the GitNexus / Augment category
A real ecosystem, four tiers:
Knowledge-graph engines
- GitNexus — ~28K stars. MCP-native code knowledge graph. Multi-phase indexing: cross-file resolution, clustering, hybrid search (BM25 + semantic vectors + RRF), Leiden community detection.
- CodeGraphContext — MIT-licensed alternative, ~2.2K stars. Same approach, more permissive license.
MCP code search
- Octocode MCP / CodePathFinder — lightweight, no graph build.
- Sourcegraph MCP server — Stripe uses this for their internal "Minions." Built on SCIP, exhaustive over enterprise monorepos.
Context packing (no graph, just smart flattening)
- Repomix — 22K stars, the category leader. XML-structured output, tree-sitter compression, claims ~70% token reduction.
- code2prompt — 7K stars, Rust CLI, fast. Template system, Python bindings.
- Aider's repo-map — built into Aider; tree-sitter tag map dynamically tuned per chat. Often-overlooked because it's not a separate product, but architecturally the most sophisticated of the "lightweight" approaches.
- Context Hub — Andrew Ng's curated, versioned API docs CLI for coding agents.
Hosted / enterprise
- Sourcegraph Cody — full code search + AI, agentic context fetching, MCP support. SCIP-powered, claims 400K+ files indexed across multiple repos.
- Greptile — YC-backed, focused on PR review with full-codebase context.
- Augment Context Engine — covered in §3 (and is the engine powering Intent's per-Specialist context, see §4).
- DeepWiki — Cognition's cloud-only doc generator for public GitHub repos.
The pattern
- Knowledge graphs are winning — GitNexus + CodeGraphContext combined have ~30K stars.
- MCP is the integration standard — every serious tool ships an MCP server.
- Incremental, real-time graph updates are the open gap nobody has fully solved. The first tool to deliver always-current structural intelligence wins the category.
The Augment vs. Cursor benchmark gap
From a published 2026 comparison:
- Cursor + Claude Opus 4.5: 71% improvement when adding Context Engine MCP (completeness +60%, correctness +5x).
- Claude Code + Opus 4.5: 80% improvement.
- Cursor + Composer-1: 30% improvement, bringing a struggling model into viable territory.
Context architecture matters as much as, or more than, model choice. A weaker model with great context (Sonnet + good context) outperforms a stronger model with poor context (Opus without).
7. RAG / retrieval research: the actual ideas
If you want to understand what's under the hood of all these systems, these are the load-bearing papers and patterns.
Knowledge-graph RAG family
- GraphRAG (Microsoft, Edge et al. 2024) — community detection + per-community summaries, the original.
- LightRAG (Guo et al. 2024) — dual-level retrieval (low-level facts + high-level summaries), faster/cheaper than GraphRAG.
- HippoRAG / HippoRAG 2 (Gutiérrez et al., NeurIPS '24 + ICML '25) — KG + Personalized PageRank to do single-step multi-hop. 10–30× cheaper than iterative retrieval. Inspired by hippocampal indexing theory. The paper to read if you only read one. ~3.4K stars.
- NodeRAG, HopRAG, ArchRAG, MA-RAG — 2025 variants chasing different trade-offs.
- MemoRAG (Sep 2024) — global memory model that drafts a "rough answer" first to guide retrieval.
HippoRAG 2's headline: outperforms GraphRAG / LightRAG / RAPTOR / HippoRAG on associativity (multi-hop) without sacrificing factual or sense-making performance. 9M tokens for indexing vs. 115M for GraphRAG on MuSiQue. Recall@5 lift on 2Wiki: 76.5% → 90.4%.
Hierarchical / structural
- RAPTOR (Stanford, 2024) — recursive clustering + LLM summarization to build a tree. Cluster similar chunks, summarize each cluster, repeat. Still the standard reference for "summarize your way up." Pros: handles abstract questions; cons: slow to build, expensive, complex.
Query-side techniques
- HyDE — embed a hypothetical answer instead of the raw query (sketched after this list). Big win on vague queries; hurts on specific ones.
- Multi-query expansion, step-back prompting, query decomposition — variants of "rewrite the query first."
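HyDE in a few lines — the only change from plain dense retrieval is what gets embedded. `llm()` is a placeholder for whatever completion call you use, and the embedding model is illustrative:

```python
# HyDE sketch: embed a hypothetical answer instead of the raw query, then do normal
# dense retrieval with that embedding. llm() is a placeholder for any completion call.
import numpy as np
from sentence_transformers import SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative

def llm(prompt: str) -> str:
    raise NotImplementedError  # plug in your model of choice

def hyde_search(query: str, chunk_texts: list[str], chunk_vecs: np.ndarray, k: int = 5):
    hypothetical = llm(f"Write a short passage that answers: {query}")
    # The fake answer tends to land closer to real answer passages than a vague query does.
    q = embedder.encode([hypothetical], normalize_embeddings=True)[0]
    return [chunk_texts[i] for i in np.argsort(-(chunk_vecs @ q))[:k]]
```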
Embedding / chunking
- ColBERT / ColBERT v2 — late interaction. Token-level MaxSim, faster than cross-encoders, better than bi-encoders.
- Late chunking (Jina AI, Günther et al. 2024, arXiv:2409.04701) — embed the whole document, then split. Each chunk's embedding carries context from the entire text. The single most underrated recent idea (see the sketch after this list).
- Cross-encoder rerankers (Cohere `rerank-english-v3`, BGE rerankers) — the empirically biggest single ROI in production RAG. Published benchmarks credit +9.3% Recall@10 from this alone.
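Late chunking in practice: run the full document through the embedding model once, keep the token-level embeddings, then mean-pool each chunk's token span. A sketch using Hugging Face transformers — the model name and 256-token span are illustrative; the paper uses Jina's long-context embedders:

```python
# Late chunking sketch: embed the whole document first, then pool per-chunk token spans,
# so each chunk vector carries context from the entire text.
import torch
from transformers import AutoModel, AutoTokenizer

name = "sentence-transformers/all-MiniLM-L6-v2"  # illustrative; use a long-context embedder in practice
tok = AutoTokenizer.from_pretrained(name)
model = AutoModel.from_pretrained(name)

def late_chunk(document: str, span: int = 256) -> list[torch.Tensor]:
    enc = tok(document, return_tensors="pt", truncation=True)
    with torch.no_grad():
        token_vecs = model(**enc).last_hidden_state[0]  # (seq_len, dim), full-document context
    # Pool contiguous token spans AFTER encoding -- the "late" part.
    chunks = [token_vecs[i:i + span].mean(dim=0) for i in range(0, token_vecs.shape[0], span)]
    return [c / c.norm() for c in chunks]               # unit-normalize for cosine search
```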
Chunking strategies (best-of guidance)
| Strategy | Best for | Main risk |
|---|---|---|
| Fixed-size 128–256 tokens | Precise factual lookups | Insufficient context |
| Fixed-size 512–1024 tokens | General QA | Increasing noise |
| Fixed-size 2048+ tokens | Complex reasoning | Diluted embeddings |
| Recursive character (400–512, 10–20% overlap) | Default | -- |
| Document-structure-aware | Markdown/HTML/PDF | Brittle to format changes |
| Semantic | Long docs covering many topics | Cost, can produce tiny fragments |
| Parent-child (hierarchical) | Precision + rich context | Index size |
| RAPTOR | Large corpus, mixed query abstraction | Slow/expensive to build |
| Late chunking | Long-context embedders | Requires modern embedding model |
NVIDIA's benchmark crowned page-level chunking for PDFs (0.648 accuracy). Vecta's Feb 2026 benchmark put recursive 512-token first at 69%. Chroma's "context rot" research (July 2025) found a "context cliff" around 2,500 tokens where retrieval performance degrades even on simple tasks. Rule of thumb: start with recursive 512, move to semantic or page-level only if metrics demand.
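The "start with recursive 512" default, as it is typically implemented — the langchain-text-splitters package and the ~15% overlap figure are illustrative choices:

```python
# "Recursive 512-token, ~10-20% overlap" default, sketched with langchain-text-splitters.
# Splits on paragraph -> sentence -> word boundaries before resorting to hard cuts.
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",   # tokenizer used to count tokens, not to embed
    chunk_size=512,                # tokens per chunk
    chunk_overlap=75,              # ~15% overlap
)
chunks = splitter.split_text(open("doc.txt").read())
```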
The hybrid pattern that actually wins in production
Almost every serious system in 2026 converges on:
BM25 (keyword) + dense (vectors) + (optional) graph traversal
↓
RRF (reciprocal rank fusion)
↓
cross-encoder rerank (Cohere or self-hosted)
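A compact sketch of that whole pipeline. rank-bm25, sentence-transformers, and the ms-marco cross-encoder are illustrative component choices; RRF's `k=60` is the conventional constant from the original fusion paper:

```python
# Hybrid retrieval sketch: BM25 + dense, fused with reciprocal rank fusion (RRF),
# then reranked with a cross-encoder. Library and model choices are illustrative.
import numpy as np
from rank_bm25 import BM25Okapi
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def search(query: str, docs: list[str], k: int = 5) -> list[str]:
    # Sparse ranking (keyword match).
    bm25 = BM25Okapi([d.lower().split() for d in docs])
    sparse_rank = np.argsort(-bm25.get_scores(query.lower().split()))
    # Dense ranking (embedding similarity).
    vecs = embedder.encode(docs, normalize_embeddings=True)
    q = embedder.encode([query], normalize_embeddings=True)[0]
    dense_rank = np.argsort(-(vecs @ q))
    # Reciprocal rank fusion: score(d) = sum over rankers of 1 / (60 + rank_of_d).
    rrf = np.zeros(len(docs))
    for ranking in (sparse_rank, dense_rank):
        for rank, idx in enumerate(ranking):
            rrf[idx] += 1.0 / (60 + rank)
    candidates = [docs[i] for i in np.argsort(-rrf)[: k * 4]]  # over-fetch before rerank
    # Cross-encoder rerank: score each (query, doc) pair jointly -- the precision step.
    scores = reranker.predict([(query, d) for d in candidates])
    return [c for _, c in sorted(zip(scores, candidates), reverse=True)[:k]]
```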
Real-world example (IncidentFox writeup, beating RAPTOR):
- BM25 baseline: ~45% Recall@10
- Dense retrieval: ~55%
- RAPTOR (paper): ~70%
- Their full system (RAPTOR + KG + HyDE + BM25 + Cohere rerank + query decomposition): 72.89%
The single biggest lever was the Cohere reranker (+9.3 points). Reranking is where you get precision.
8. Benchmarks (this is how to evaluate any of it)
If you're picking a system, run these instead of trusting marketing.
Memory benchmarks
- LongMemEval (Wu et al., ICLR 2025) — 500 questions over ~50 sessions, ~115K tokens each. Six query types: single-session-user, single-session-assistant, single-session-preference, multi-session, temporal reasoning, knowledge update. Currently the most-cited memory benchmark.
- LoCoMo (Maharana et al., 2024, arXiv:2402.17753) — Snap/Stanford, 300 turns × up to 35 sessions, multimodal. Includes question answering, event summarization, multi-modal dialogue generation. Standard for very-long-term dialogue.
- LoCoMo-Plus (2025) — adds cue-trigger semantic disconnect: testing whether memory survives when the trigger isn't lexically/semantically similar to the cue. Current systems fail badly on this.
- ConvoMem — preference learning / personalization.
- MemoryBench (Supermemory, open-source) — head-to-head harness for memory providers.
RAG benchmarks
- MultiHop-RAG, MuSiQue, 2WikiMultiHopQA, HotpotQA — standard multi-hop QA suite.
Text-to-SQL / structured-data benchmarks
- BIRD (NeurIPS 2023) — 12,751 question-SQL pairs, 95 databases (finance, healthcare, sports, government, education). Up to 33 tables × 11K rows each. Current SOTA: ~73% (CHESS agent + GPT-5).
- Spider 2.0 (2024) — multi-database enterprise workflows including Snowflake and BigQuery. Tasks require complete data-science workflows. End-to-end execution scoring. SOTA: ~35%.
- BIRD-INTERACT — multi-turn interactive text-to-SQL with user simulator. Two test modes: (1) passive Conversational Interaction, (2) active Agentic Interaction. 600 tasks. GPT-5 solves 8.7% (constrained) / 17% (agentic). Tells you that interactive disambiguation is mostly unsolved.
- BIRD-Ent / Spider-Ent (ICLR 2026 submission) — enterprise scale: 4,000+ columns per query scope. SOTA drops to 39% / 60%. Reality check.
- LiveSQLBench-Large-v1 — ~1K columns, ~54 tables per DB, 480 tasks, ~84K avg prompt tokens, Business Rule Drift (external knowledge changes across releases).
Notable text-to-SQL leaderboard (2026)
| Rank | System | BIRD Dev EX % | Spider 2.0 % |
|---|---|---|---|
| 1 | CHESS Agent (GPT-5 backbone) | 73.0 | 35.2 |
| 2 | GPT-5 (zero-shot) | 71.8 | 31.7 |
| 3 | Claude 4 Opus (zero-shot) | 70.3 | 29.4 |
| 4 | DIN-SQL (GPT-5 backbone) | 69.9 | 27.8 |
| 5 | Gemini 2.5 Pro (zero-shot) | 68.7 | 26.3 |
| -- | Arctic-Text2SQL-R1-32B | 71.83 | -- |
| -- | Arctic-Text2SQL-R1-7B | 68.47 | -- |
The 7B Arctic model beating DeepSeek-V3 (671B) is the headline: fine-tuned small models on schema-specific data are surprisingly competitive.
Surveys worth reading
- Graph Retrieval-Augmented Generation: A Survey (Aug 2024, arXiv:2408.08921) — 58+ citations, the right starting point for the KG-RAG family.
- In-depth Analysis of Graph-based RAG in a Unified Framework (Mar 2025, arXiv:2503.04338) — apples-to-apples comparison of GraphRAG / LightRAG / HippoRAG.
9. LLM-side memory tools (model writes its own notes)
- Anthropic memory tool — beta, server-side, file-based store the model edits via tool calls.
- ChatGPT memory — consumer-facing, opaque.
- Claude Code's `CLAUDE.md` / project rules — the "AGENTS.md" pattern is now everywhere (Cursor's `.cursor/rules/`, Continue's `.continue/rules/`, Windsurf's `.windsurf/rules/`, Kiro, JetBrains' `.jetbrains/ai.md`).
- GITCORTEX — a pattern, not a product. Use git + markdown + a commit-message convention as the memory store. The boot sequence reads `BOOT.md`, then `CLAUDE_MEMORY.md`, then `git log --grep="correct"` to load corrections first as the highest-priority memory class (sketched below). Clusters of corrections get promoted to "habits" the AI watches for.
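A sketch of that boot sequence — the file names follow the pattern as described; assembling everything into a single context string is my own framing:

```python
# GITCORTEX-style boot: load standing memory files, then corrections from git history.
# Sketch of the described pattern; the output is prepended to the session's first prompt.
import pathlib
import subprocess

def boot_context(repo: str = ".") -> str:
    repo_path = pathlib.Path(repo)
    corrections = subprocess.run(
        ["git", "-C", repo, "log", "--grep=correct", "--pretty=%h %s"],
        capture_output=True, text=True, check=True,
    ).stdout
    return "\n\n".join([
        "## Boot\n" + (repo_path / "BOOT.md").read_text(),
        "## Memory\n" + (repo_path / "CLAUDE_MEMORY.md").read_text(),
        # Corrections are the highest-priority memory class.
        "## Corrections (highest priority)\n" + corrections,
    ])
```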
10. Audience-filtered guidance: who should use what
The previous lists mix products, patterns, and papers. They split sharply along three audiences.
10.1 Solo LLM power user
You live in Cursor / Claude Code / ChatGPT, you don't run infrastructure, you just want your tools to stop forgetting things and to actually understand your codebase.
Use:
- Session memory (pick one): MemPalace (most powerful, local, free, MCP), Anthropic memory tool (zero setup, in Claude), or ChatGPT memory.
- Project rules / curated memory: rules files — `AGENTS.md`, `CLAUDE.md`, `.cursor/rules/`, `.continue/rules/`. 80% of what most people actually need.
- Codebase context: Aider's repo-map if you use Aider; Repomix to dump a whole repo into a chat; GitNexus when you hit "agent broke distant callers" pain.
- GITCORTEX pattern: it's a pattern, not a tool — git + markdown + commit conventions. Worth doing for long-running personal projects.
- Parallel agents: Conductor.build (free Mac app) when you have multiple independent tasks ready to go simultaneously. Augment Intent if you want spec-driven coordination and are willing to commit to the Augment ecosystem.
Skip:
- Letta, Zep/Graphiti — both require infra. Overkill.
- Mem0, Supermemory — SDKs for products, not personal tools.
- HippoRAG, GraphRAG, LightRAG, RAPTOR — research/library-level.
- Sourcegraph Cody, Greptile, Augment — only if you have a serious codebase to justify them.
- All the benchmarks — irrelevant unless you're choosing for an org.
- Conductor OSS / blueman82 — workflow-engine territory; Conductor.build / Intent are the right tier.
Honest minimum stack: Cursor/Claude Code + rules files + MemPalace (or built-in memory) + Repomix when you need to dump a repo. Add Conductor.build when you start having "I wish I could parallelize this" moments.
10.2 Org with complex LLM infra (n8n, workflows, multi-tool integrations)
You have agents, workflows, internal tools, multiple services calling LLMs. Memory and context need to flow between systems. The constraint is integration, not novelty.
Use:
- Agent/user memory: Supermemory (#1 on the three memory benchmarks, drop-in wrappers for Mastra, LangChain, LangGraph, OpenAI Agents SDK, Vercel AI SDK, n8n; MIT). Or Mem0 for the lightest integration.
- Temporal correctness (CRM, healthcare, finance — anything where "true then but not now" matters): Zep / Graphiti. Validity-window model is unique. Worth the Neo4j/FalkorDB/Kuzu ops cost only if temporal accuracy is a product requirement.
- Codebase intelligence in workflows: Sourcegraph MCP server (Stripe's Minions story is the proof point), Augment Context Engine MCP if you also want it as a coding assistant, GitNexus MCP for self-hosted OSS.
- Pipeline-style memory (more like ETL): Cognee.
- Framework glue: Mastra and the AI SDK natively support these as processors/tools. n8n has community nodes for Mem0 and Supermemory.
- Durable agent workflows (mission-critical, audit/replay required): Conductor OSS — JSON workflow definitions, MCP-native, every step persisted, durable human-in-the-loop pauses across server restarts. The right tool when "the workflow must survive a deploy mid-execution" is a hard requirement.
- Multi-agent coding workflows for your devs: Augment Intent if your team is heavily Augment-aligned (spec-driven, Living Spec coordination, pre-computed per-Specialist context); Conductor.build if you want lighter-weight worktree parallelism without provider lock-in.
Skip:
- MemPalace, GITCORTEX — single-user designs.
- Letta — only if you're willing to replace your agent runtime. Don't bolt onto an existing stack.
- HippoRAG / RAPTOR / GraphRAG as direct dependencies — too research-grade.
- Anthropic / ChatGPT memory — locked to one provider/surface.
- blueman82/conductor — niche; if you need wave-based plan execution for Claude Code, fine, but Intent or Conductor.build cover most cases.
Watch:
- MCP server proliferation — every memory tool ships one. Build assuming MCP is the universal port.
- Sleep-time compute (Letta's pattern) — expect Mem0/Supermemory/Zep to copy this.
- Spec-as-coordinator (Intent's Living Spec) vs. workflow-as-coordinator (Conductor OSS) vs. filesystem-as-coordinator (Conductor.build) — the orchestration design space is consolidating; pick the substrate that matches how your team works.
Sensible default stack: Supermemory or Mem0 + Sourcegraph or GitNexus MCP for code + Cognee or your existing RAG (with cross-encoder rerank) for docs. Mastra or LangGraph as orchestrator for workflows; Conductor.build or Intent for engineer-facing multi-agent coding.
10.3 Org / team building LLM products
You're shipping something where memory or retrieval is a product feature. You care about benchmarks, latency, cost, control, and defensibility.
Use (memory layer):
- Letta if memory autonomy is the product (long-running autonomous agents, characters, tutors). High lock-in, most sophisticated memory model.
- Supermemory / Mem0 / Zep if memory is a feature, not the architecture. Supermemory has the benchmarks; Mem0 has the simplest API; Zep has temporal guarantees.
- Roll your own with HippoRAG 2 + your stack if you have ML talent and need to differentiate.
Use (retrieval architecture, build don't buy):
- Hybrid: BM25 + dense + RRF.
- Late chunking at the embedding stage — biggest "free" win.
- Cross-encoder reranking (Cohere rerank-v3, BGE, or self-hosted) — single largest empirical lift.
- HyDE for vague-query workloads, query decomposition for multi-hop.
- HippoRAG 2 or LightRAG if your data has dense entity relationships; RAPTOR if natural hierarchy.
Use (evaluation, table stakes):
- LongMemEval for memory features.
- LoCoMo + LoCoMo-Plus for long conversations.
- MemoryBench for head-to-head harness work.
- MultiHop-RAG / MuSiQue / 2Wiki / HotpotQA for retrieval QA.
Background reading:
- HippoRAG 2 paper (arXiv:2502.14802) — current SOTA on the KG-RAG family.
- Late Chunking paper (arXiv:2409.04701) — quick read, big lever.
- Graph RAG: A Survey (arXiv:2408.08921) — lay of the land.
- In-depth Analysis of Graph-based RAG (arXiv:2503.04338) — apples-to-apples.
Skip:
- MemPalace, GITCORTEX, ChatGPT memory, Anthropic memory tool — wrong tier.
- Repomix / code2prompt / Aider repo-map / DeepWiki — dev tools, not building blocks.
- GitNexus, Sourcegraph, Augment, Greptile, Cody — also dev tools, unless your product is a coding agent. (If it is: GitNexus is the open-source reference architecture worth studying.)
Watch:
- Temporal/validity-window memory is becoming table stakes.
- LoCoMo-Plus results across the board are bad — open research problem in cognitive memory under cue-trigger disconnect.
- MIRIX, A-MEM, LiCoMemory, ENGRAM-R — academic systems doing better than commercial ones on benchmarks. May turn into products or get folded into Letta/Zep/Mem0.
10.4 Cross-cutting summary table
| Tool / pattern | Power user | Complex infra org | Product builder |
|---|---|---|---|
| Rules files (CLAUDE.md etc.) | core | for devs | -- |
| MemPalace | best pick | -- | -- |
| Anthropic / ChatGPT memory | free | -- | -- |
| GITCORTEX pattern | if disciplined | -- | -- |
| Repomix / Aider repo-map | yes | -- | -- |
| GitNexus | if needed | MCP | study, don't depend |
| Sourcegraph Cody MCP | overkill | for monorepos | -- |
| Augment Context Engine | for devs | for devs | -- |
| Conductor.build | when parallelizing | for devs | -- |
| Augment Intent | if Augment-aligned | for devs | -- |
| Conductor OSS | -- | durable workflows | -- |
| blueman82/conductor | niche | rarely | -- |
| Mem0 | -- | yes | yes |
| Supermemory | -- | default | default |
| Zep / Graphiti | -- | if temporal | if temporal |
| Letta | -- | rewrites your stack | if memory-first product |
| Cognee | -- | for pipelines | -- |
| HippoRAG 2 | -- | -- | build-your-own |
| Late chunking, ColBERT, cross-encoder rerank | -- | yes | critical |
| GraphRAG / LightRAG / RAPTOR | -- | rarely | as references |
| LongMemEval / LoCoMo / MemoryBench | -- | to choose vendors | critical |
Pattern: as you move right across columns, relevant items get lower-level and more research-y. Power users want products. Infra orgs want APIs and MCP servers. Product builders care about benchmarks and underlying retrieval ideas, because that's where their differentiation lives.
11. Text-to-SQL and BI agents (a different problem)
Text-to-SQL over a data warehouse is fundamentally a NL-to-SQL + semantic layer + governance problem, not a memory/RAG problem. Most of the previous list doesn't apply directly.
Why semantic layers exist
From the Snowflake engineering blog: "An LLM alone is like a super-smart analyst who is new to your data org — given only the raw schema, it's challenging for any analyst to write SQL accurately." Raw schemas lack:
- Business definitions (what is "MRR"? what counts as a "customer"?).
- Metric calculation rules (correct aggregation, time grains).
- Join paths (which join is right when there are multiple FKs?).
- Synonyms / vocabulary bridges ("USA" vs "United States of America" in the actual data).
Every credible text-to-SQL system in 2026 has a semantic layer. With a well-built one, accuracy hits 90%+ on real-world use cases.
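Structurally, a semantic layer boils down to entries like the one below — business definitions, grain, join paths, synonyms, and verified queries that get retrieved and injected alongside the raw schema. This is an illustrative shape, not dbt's, Snowflake's, or Cube's actual format, and the table/column names are made up:

```python
# Illustrative semantic-layer entry -- not any vendor's actual format.
# This is the business knowledge the raw schema lacks.
MRR = {
    "metric": "mrr",
    "description": "Monthly recurring revenue, net of refunds and credits.",
    "expression": "SUM(subscriptions.amount) - SUM(refunds.amount)",
    "time_grain": "month",
    "joins": ["subscriptions.customer_id = customers.id",
              "refunds.subscription_id = subscriptions.id"],
    "synonyms": ["monthly recurring revenue", "recurring revenue"],
    "verified_queries": [
        {"question": "What was MRR in Q1 2026?",
         "sql": "SELECT DATE_TRUNC('month', s.period) AS month, "
                "SUM(s.amount) - COALESCE(SUM(r.amount), 0) AS mrr "
                "FROM subscriptions s LEFT JOIN refunds r ON r.subscription_id = s.id "
                "WHERE s.period BETWEEN '2026-01-01' AND '2026-03-31' GROUP BY 1"},
    ],
}
```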
The semantic-layer landscape
- Snowflake Semantic Views — schema-level Snowflake objects defining business concepts, metrics, relationships, verified queries, and Cortex Search hookups. Native RBAC.
- Snowflake Cortex Analyst — fully-managed NL→SQL service consuming semantic models / views. Reports 90%+ SQL accuracy.
- dbt Semantic Layer / MetricFlow — popular if you already use dbt. Metric definitions live with transformations.
- Cube.dev — warehouse-agnostic semantic layer with REST/GraphQL/SQL API.
- LookML — Looker's semantic layer.
- WrenAI's MDL — open-source semantic-modeling DSL.
Open-source NL-to-SQL platforms
- WrenAI (Canner, ~15K stars, AGPL-3.0) — full-stack open-source GenBI agent. Stack: `wren-ui` (Next.js + Apollo GraphQL), `wren-ai-service` (Python/FastAPI pipeline with intent classification, vector retrieval from Qdrant, LLM prompting, SQL correction loops), `wren-engine` (Rust + Apache DataFusion for query execution and MDL semantic resolution). Supports Snowflake + 11 other sources. Exposes an MCP server. Generates charts and SQL.
- Vanna 2.0 (~23K stars, MIT) — "SQL agent" pattern. Train it on schema descriptions, business documentation, and Q→SQL examples (verified queries). 2.0 added user-aware permissions, row-level security, audit logs, rate limiting, FastAPI integration, and a pre-built web component. Works with any LLM (including local via Ollama) and any major DB.
- Dataherald (Apache-2.0) — older, less active. Engine, Enterprise (auth/orgs), Admin Console, Slackbot. Worth knowing as comp.
- Databricks Genie — closed, only relevant if you're on Databricks. Unity Catalog metadata at the source.
The platform-native option
For Snowflake specifically:
- Cortex Analyst + Semantic Views + Cortex Search is a vertically integrated text-to-SQL stack with built-in RBAC and audit. Now exposed via the Snowflake-Managed MCP Server, configurable in Cursor with a one-line `mcp.json` entry. Supports Cortex Analyst, Cortex Search, SQL execution, and Cortex Agents as MCP tools.
- April 13, 2026 update: Cortex Agents now generate SQL directly using semantic views as tools (replacing the old `cortex_analyst_text_to_sql` block with `system_execute_sql`). Lower latency, better accuracy.
Cortex Analyst's pipeline (worth understanding even if you build your own)
- Context enrichment agent — pulls relevant verified queries (human-blessed Q→SQL pairs) and relevant literals (entity values from Cortex Search) from the semantic model.
- Multiple SQL generation agents with different LLMs in parallel — different LLMs excel at different question types (time-related vs. multi-level aggregations).
- Logical schema construction — agents first generate against a simpler logical schema, then post-process to executable physical-schema SQL.
- Synthesizer agent — receives multiple candidate SQL queries plus context, generates the final answer.
What from the broader landscape applies
| From earlier | Applies here? | Why |
|---|---|---|
| Mem0 / Supermemory | ✅ for user/team memory | "Always show me revenue net of refunds" — user-level preferences, past corrections, query history |
| Zep / Graphiti | ✅ specifically | Metric definitions change over time. Validity windows on facts like "MRR formula changed Q3" are exactly what Graphiti models |
| Hybrid retrieval (BM25 + dense + RRF) | ✅ critical | Over semantic-layer assets: descriptions, verified queries, glossary. Cortex Search does this internally |
| Cross-encoder reranking | ✅ | Picking the right view/table among many — where most NL→SQL errors come from |
| HyDE | ✅ for vague queries | "How are we doing?" → generate a hypothetical SQL, embed that against verified-query library |
| GraphRAG / HippoRAG / LightRAG | ⚠️ relevant in spirit | Schemas are graphs (FK relationships, dimensional hierarchies). KG-RAG techniques help with multi-hop join planning, but semantic-model join paths are simpler |
| Code-knowledge engines (GitNexus / Sourcegraph / Augment) | ✅ for the dbt/Looker repo | Index dbt models, MetricFlow/Cube definitions, LookML — these are code defining metrics. Useful for engineers, not the BI tool itself |
| MemPalace, GITCORTEX, Letta | ❌ | Wrong tier |
Domain-specific concerns that don't appear in memory/RAG literature
- Verified-query library — single highest-ROI investment after the semantic layer. Curated, growing set of Q→SQL pairs retrieved as few-shots. Build the workflow to capture and approve them.
- Disambiguation UX — when a question is ambiguous, the system should ask, not guess. BIRD-INTERACT is the open research direction here. Where most products are weakest.
- SQL validation pipeline — `EXPLAIN` the generated SQL, dry-run with `LIMIT 0`, check estimated cost, sandbox execution, and return a result preview before the user sees it (see the sketch after this list).
- Cost / blast-radius guards — query timeout, max bytes scanned, warehouse routing, deny-list for expensive operations.
- Feedback loop — 👍/👎 → corrections become candidate verified queries. Your moat over time.
- Result explanation — non-technical users need "I joined `orders` to `customers` and counted distinct customer IDs in Q1 2026." Both Cortex Analyst and WrenAI surface this.
- Permissions propagation through MCP — each user's PAT/OAuth must scope warehouse access. Don't use a shared service account.
- Metric-definition drift — when finance updates the MRR formula, every cached version is wrong. Centralize definitions in the semantic layer; treat user-memory definitions as overrides; version them.
- Continuous internal eval — your own 200–500 verified business questions across your actual semantic model. Run on every change to semantic model or prompt. More valuable than any public benchmark.
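A sketch of the validation pass mentioned above, against a generic DB-API connection. The cost check and `EXPLAIN` output handling are warehouse-specific; treat this as the shape of the check, not a drop-in:

```python
# Validation pass for generated SQL: EXPLAIN it, dry-run it with LIMIT 0, and only then
# fetch a small preview. Shape of the check only; cost limits and EXPLAIN output vary by warehouse.
def validate_and_preview(conn, sql: str, preview_rows: int = 20):
    sql = sql.rstrip("; \n")
    cur = conn.cursor()
    cur.execute(f"EXPLAIN {sql}")                         # 1. planner accepts it: tables/columns exist
    plan = cur.fetchall()
    cur.execute(f"SELECT * FROM ({sql}) AS q LIMIT 0")    # 2. dry run: types resolve, no data read
    columns = [c[0] for c in cur.description]
    # 3. (warehouse-specific) parse `plan` for estimated cost / bytes scanned and
    #    reject anything over budget before running for real.
    cur.execute(f"SELECT * FROM ({sql}) AS q LIMIT {preview_rows}")  # 4. small preview for the user
    return {"columns": columns, "preview": cur.fetchall(), "plan": plan}
```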
Hard problems specific to BI for non-technical users
- Cursor (or any IDE) is a poor surface for non-technical users. They bounce off the IDE chrome. A custom chat UI is usually required, with the IDE as the power-user / admin / debugging surface.
- Hallucinated column names under big schemas — Spider-Ent shows SOTA models drop ~25 points when you go from academic schemas to 4,000-column enterprise schemas. Verified queries and rerank-on-schema are the primary mitigations.
- Interactive disambiguation is mostly unsolved — BIRD-INTERACT shows GPT-5 at 8.7% (constrained) / 17% (agentic) success.
12. Things genuinely new in 2026 worth tracking
Filtering for "this changes how you'd build the system":
- HippoRAG 2 — graph + Personalized PageRank dominates GraphRAG/LightRAG/RAPTOR on associativity without tanking factual QA, with way less indexing cost. Likely the new default for KG-RAG.
- Late chunking — flips the chunking debate on its head. If your embedding model has long context, you almost always want this on.
- Temporal validity — Zep/Graphiti's idea (every fact has an end timestamp) has spread. A-MEM, MIRIX, and others now bake it in. Forgetting/invalidation is now table stakes.
- Sleep-time compute (Letta) — agents reorganizing memory between tasks. Early data is good. Expect everyone to copy.
- Cross-encoder reranking everywhere — the single biggest empirical win. If your stack doesn't have it, that's where to start.
- MCP as the universal interface — every memory and code-context tool ships an MCP server. Lock-in moats are weaker than they looked a year ago.
- Fine-tuned small SQL models (Arctic-Text2SQL-R1) — a 7B model beating 671B general-purpose models on BIRD. For organizations where latency, cost, or on-prem matter, fine-tuning beats scale.
- Interactive disambiguation as a benchmark axis (BIRD-INTERACT) — current systems can't ask good follow-up questions. Open product opportunity.
- Enterprise-scale schema benchmarks (BIRD-Ent, Spider-Ent, LiveSQLBench) — academic accuracy numbers fall ~25 points when schemas get realistically large. Plan for this in your eval.
- Verified queries as the dominant accuracy lever — Cortex Analyst, Vanna, WrenAI all converge on "a curated library of human-blessed Q→SQL pairs is what makes the system reliable."
- Multi-agent orchestration as a new layer — Conductor.build (worktree isolation), Conductor OSS (durable workflows), blueman82 (wave dependencies), and Augment Intent (Living Spec) emerged in 2025–2026 as a layer above the agent runtime. The shift from "one agent in your IDE" to "a coordinated team of agents" is happening fast.
- Spec as coordination substrate (Intent's Living Spec) — the genuinely new idea in orchestration. Replaces chat history as the shared mutable state between agents. The other fork — git/filesystem as coordinator (Conductor.build) — is older but newly productized.
- Vertically integrated context stacks — Augment now spans orchestration + agent runtime + code understanding + doc retrieval + memory. They're the only company that does. Whether vertical integration wins over best-of-breed stacks (Conductor.build + Claude Code + GitNexus + MemPalace + Mem0) is the open product question.
13. The unifying picture
The systems described here look like competitors but mostly aren't. They sit at different points in a layered context architecture:
┌─────────────────────────────────────────────────────────────┐
│ Orchestration (multi-agent coordination) │
│ Intent (spec-driven), Conductor.build (worktrees), │
│ Conductor OSS (durable workflows), blueman82 (waves) │
├─────────────────────────────────────────────────────────────┤
│ Agent layer │
│ Cursor, Claude Code, Augment Agent, Letta, custom │
├─────────────────────────────────────────────────────────────┤
│ Code understanding (over your repo) │
│ GitNexus, Augment Context Engine, Sourcegraph, Aider map │
├─────────────────────────────────────────────────────────────┤
│ Doc retrieval (over your docs/wikis) │
│ Normal RAG, GraphRAG, LightRAG, HippoRAG 2, Cognee │
├─────────────────────────────────────────────────────────────┤
│ Structured-data retrieval (over your warehouse) │
│ Cortex Analyst, WrenAI, Vanna, semantic layers │
├─────────────────────────────────────────────────────────────┤
│ Persistent memory (across sessions / users) │
│ Mem0, Supermemory, Zep, Letta, MemPalace │
├─────────────────────────────────────────────────────────────┤
│ Retrieval primitives (used by all of the above) │
│ BM25, dense embeddings, late chunking, ColBERT, │
│ cross-encoder rerankers, HyDE, RRF │
└─────────────────────────────────────────────────────────────┘
The interesting product decisions are:
- Which layers do you own vs. delegate? Augment owns four (orchestration through memory); most stacks own one or two.
- Memory: human-curated, verbatim, or model-curated? This is the philosophical fork (Augment Memories vs. MemPalace vs. Anthropic memory tool).
- Retrieval: graph-first or semantic-first? GitNexus vs. Augment Context Engine in microcosm.
- Coordination substrate: chat, spec, workflow, or git? This is the new fork at the orchestration layer (Intent's Living Spec vs. Conductor OSS's workflow state vs. Conductor.build's worktrees vs. default chat-history coordination).
- Storage: local-first or hosted? MemPalace vs. Supermemory.
- Temporal: validity windows or snapshots? Zep vs. everyone else.
Where the field is going: more layers exposed via MCP, cross-encoder reranking as universal practice, temporal validity as table stakes, multi-agent orchestration as table stakes, and the lines between "memory," "RAG," and "code understanding" blurring as everything becomes some flavor of typed retrieval over typed knowledge stores with cross-encoder rerank on top — coordinated by orchestrators that pick a substrate (spec, workflow, git, or chat) for how the agents share state.