The architectural shift nobody is marketing
If you read vendor blog posts, you would think every framework is converging on the same feature set: tool calling, memory, multi-agent handoffs, MCP support. That's true at the surface. Underneath, the field has split into three distinct philosophies, and choosing the wrong one for your problem now costs more than picking a worse model.
A fourth option deserves explicit naming because senior engineers keep choosing it: no framework at all. The "12-Factor Agents" thesis from 2025 still defines the prevailing philosophy. The most reliable production systems are well-engineered software with LLM calls placed at deterministic junctures, backed by ordinary databases, ordinary retries, and ordinary observability. Raw code is the baseline against which every framework should be judged.
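To make the no-framework baseline concrete, here is a minimal sketch of one LLM call placed at a fixed point in ordinary code, wrapped in an ordinary retry. Everything in it is an assumption for illustration: `call_with_retries`, `handle_ticket`, and the stub `make_flaky_llm` are hypothetical names, and the stub stands in for whatever provider client you actually use.

```python
import time
from typing import Callable, Optional

def call_with_retries(llm: Callable[[str], str], prompt: str,
                      attempts: int = 3, base_delay: float = 0.0) -> str:
    """Ordinary retry with exponential backoff around a single LLM call."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return llm(prompt)
        except RuntimeError as exc:  # stand-in for a transient provider error
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"LLM call failed after {attempts} attempts") from last_error

def handle_ticket(ticket: str, llm: Callable[[str], str]) -> dict:
    """The code decides the steps; the LLM fills exactly one slot."""
    prompt = f"Classify as 'billing' or 'technical': {ticket}"
    label = call_with_retries(llm, prompt).strip().lower()
    if label not in {"billing", "technical"}:
        label = "technical"  # deterministic fallback for unexpected output
    return {"ticket": ticket, "queue": label}

def make_flaky_llm(failures: int) -> Callable[[str], str]:
    """Hypothetical stub: fails `failures` times, then answers 'Billing'."""
    state = {"calls": 0}
    def llm(prompt: str) -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("transient provider error")
        return "Billing"
    return llm

result = handle_ticket("I was charged twice", make_flaky_llm(failures=2))
```

The point of the sketch is the shape, not the details: control flow, fallback, and retry policy all live in plain code that can be unit tested without a model.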
What everyone is converging on
Three protocols have effectively won, and any framework that doesn't support them is unviable for enterprise work.
Protocol Convergence — May 2026
A few other patterns have become defaults. Model tiering (cheap models triage, expensive models reason) reduces token costs by 40 to 60 percent in production. Sandboxed code execution for code-action agents has consolidated around E2B, Modal, and Docker. And the industry has quietly abandoned dynamic, LLM-driven execution graphs after the 2025 experiments showed catastrophic looping behaviour. The developer defines the graph; the LLM operates inside it.
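Model tiering is easy to sketch in plain code. The example below assumes two hypothetical tiers with made-up relative costs (1 and 20) and stub model functions; none of these names or numbers come from a real provider. The cheap tier triages, and only tasks it labels HARD are escalated.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Tier:
    name: str
    cost_per_call: float          # hypothetical relative cost, not a real price
    run: Callable[[str], str]

def route(task: str, triage: Tier, reasoner: Tier) -> tuple[str, str, float]:
    """Return (tier used, answer, total cost) for one task."""
    verdict = triage.run(f"Answer EASY or HARD: {task}").strip().upper()
    if verdict == "HARD":
        # One triage call plus one expensive call.
        return reasoner.name, reasoner.run(task), triage.cost_per_call + reasoner.cost_per_call
    # Two cheap calls: triage, then the answer itself.
    return triage.name, triage.run(task), 2 * triage.cost_per_call

def cheap_model(prompt: str) -> str:
    """Stub: treats anything mentioning 'multi-step' as hard."""
    if prompt.startswith("Answer EASY or HARD:"):
        return "HARD" if "multi-step" in prompt else "EASY"
    return f"cheap answer: {prompt}"

def expensive_model(prompt: str) -> str:
    return f"reasoned answer: {prompt}"

cheap = Tier("cheap", 1.0, cheap_model)
expensive = Tier("expensive", 20.0, expensive_model)

easy = route("reset my password", cheap, expensive)
hard = route("multi-step refund dispute", cheap, expensive)
```

Note that the routing graph here is fixed by the developer. The model only supplies a verdict at one node, which matches the "developer defines the graph" default described above.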
The frameworks worth knowing
LangGraph is the production standard for stateful workflows with human-in-the-loop steps. Native checkpointing means a workflow can pause mid-execution, wait days for human approval on a financial decision, and resume cleanly. Verified production users include Klarna, Uber, LinkedIn, BlackRock, Cisco, and JPMorgan.
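The pause-and-resume idea can be illustrated without any framework. The sketch below is not LangGraph's checkpointer API; it is a standard-library stand-in, with hypothetical names (`save_checkpoint`, `load_checkpoint`, `run_approval_workflow`), that persists workflow state to a SQLite file so a later process can pick up where the first one stopped.

```python
import json
import os
import sqlite3
import tempfile
from contextlib import closing

def _open(db_path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints "
        "(thread_id TEXT PRIMARY KEY, step TEXT NOT NULL, state TEXT NOT NULL)"
    )
    return conn

def save_checkpoint(db_path: str, thread_id: str, step: str, state: dict) -> None:
    with closing(_open(db_path)) as conn:
        # Parameterised query: values are never interpolated into the SQL string.
        conn.execute("INSERT OR REPLACE INTO checkpoints VALUES (?, ?, ?)",
                     (thread_id, step, json.dumps(state)))
        conn.commit()

def load_checkpoint(db_path: str, thread_id: str):
    with closing(_open(db_path)) as conn:
        row = conn.execute("SELECT step, state FROM checkpoints WHERE thread_id = ?",
                           (thread_id,)).fetchone()
    return None if row is None else (row[0], json.loads(row[1]))

def run_approval_workflow(db_path: str, thread_id: str, approved=None) -> dict:
    """First call pauses at the approval step; a later call resumes from disk."""
    checkpoint = load_checkpoint(db_path, thread_id)
    if checkpoint is None:
        state = {"amount_cents": 1_200_000, "status": "awaiting_approval"}
        save_checkpoint(db_path, thread_id, "await_approval", state)
        return state
    step, state = checkpoint
    if step == "await_approval" and approved is not None:
        state["status"] = "paid" if approved else "rejected"
        save_checkpoint(db_path, thread_id, "done", state)
    return state

db_path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")
first = run_approval_workflow(db_path, "wire-42")                   # pauses
resumed = run_approval_workflow(db_path, "wire-42", approved=True)  # resumes
```

Because every call reopens the database, the second call could just as well run days later in a different process. That durability is what a real checkpointer backend provides.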
The friction is real and well documented. State is a TypedDict at best, which means runtime type errors are common and debugging how data moves through nodes is painful. A 2025 Hacker News critique that "you spend so much time fixing stupid runtime type errors because the state of every graph is a JSON blob with very minimal typing" still gets recirculated in 2026 because v1.0 didn't fix it. The default in-memory checkpointer won't survive a container restart, so production deployments need Postgres, Redis, or SQLite backends.
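The TypedDict complaint is a property of Python itself, and a few lines show it. In this sketch `GraphState`, `add_fee`, and `validate` are hypothetical names, not LangGraph code. A TypedDict is a hint for static type checkers only, so a wrongly typed value is accepted at runtime and fails later, inside whichever node first uses it.

```python
from typing import TypedDict

class GraphState(TypedDict):
    invoice_id: str
    amount_cents: int

def add_fee(state: GraphState) -> GraphState:
    """A node that assumes amount_cents really is an int."""
    return {"invoice_id": state["invoice_id"],
            "amount_cents": state["amount_cents"] + 250}

# A static type checker flags this line; the interpreter accepts it silently.
bad_state: GraphState = {"invoice_id": "INV-7", "amount_cents": "1000"}  # type: ignore[typeddict-item]

try:
    add_fee(bad_state)
    late_error = ""
except TypeError as exc:
    late_error = str(exc)  # the failure surfaces inside the node, far from the cause

def validate(state: dict) -> GraphState:
    """Hand-rolled runtime check at the graph boundary (flat fields only)."""
    for key, expected in GraphState.__annotations__.items():
        if not isinstance(state.get(key), expected):
            raise TypeError(f"{key} must be {expected.__name__}, got {state.get(key)!r}")
    return state  # type: ignore[return-value]
```

Calling `validate` where state enters the graph moves the error to the boundary, which is the job a validating model layer does for you.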
March 2026 brought CVE disclosures from Cyera: CVE-2025-67644 (SQL injection in the SQLite checkpointer) and CVE-2025-68664 (high-severity deserialization). Both are patched, but they reframe LangGraph from "trust by default" to "audit before deploying."
Pydantic AI is the right answer for roughly half of agent projects that don't actually need a graph runtime. Type safety is the headline feature: outputs are validated against Pydantic models, and the framework won't return a response that fails validation. In a 90-day benchmark from early 2026, Pydantic AI caught 23 production data-integrity bugs that other frameworks silently ignored.
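The validate-before-return pattern can be sketched with the standard library alone. This is not Pydantic AI's API: the `Refund` dataclass stands in for a Pydantic model, and `parse_refund`, `validated_call`, and the stub LLM are hypothetical. The idea is that an output failing validation is never returned; the error is fed back and the call is retried.

```python
import json
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Refund:
    order_id: str
    amount_cents: int

def parse_refund(raw: str) -> Refund:
    """Parse and validate model output; raise ValueError on any problem."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    order_id, amount = data.get("order_id"), data.get("amount_cents")
    if not isinstance(order_id, str) or not order_id:
        raise ValueError("order_id must be a non-empty string")
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise ValueError("amount_cents must be a positive integer")
    return Refund(order_id, amount)

def validated_call(llm: Callable[[str], str], prompt: str, max_attempts: int = 3) -> Refund:
    """Retry with the validation error appended until the output parses."""
    feedback = ""
    for _ in range(max_attempts):
        try:
            return parse_refund(llm(prompt + feedback))
        except ValueError as exc:
            feedback = f"\nPrevious output was invalid: {exc}. Return valid JSON."
    raise ValueError(f"no valid output after {max_attempts} attempts")

# Hypothetical stub: the first reply has a string where an integer is required.
replies = iter(['{"order_id": "A1", "amount_cents": "12.50"}',
                '{"order_id": "A1", "amount_cents": 1250}'])
refund = validated_call(lambda prompt: next(replies), "Extract the refund as JSON.")
```

The first reply would pass a naive `json.loads` check, which is the kind of silent data-integrity bug schema validation is meant to catch.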
Built-in usage limits let engineers cap request tokens, response tokens, and tool calls directly on the agent definition, which prevents the cost overruns that plague more autonomous frameworks. Infrastructure costs in the same benchmark came in at around $390 over 90 days, versus over $1,000 for heavier frameworks running equivalent workloads.
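As a rough picture of what such limits do, the sketch below implements a budget guard in plain Python. It is an assumption-laden stand-in, not Pydantic AI's interface: `UsageBudget`, `UsageLimitExceeded`, and the specific caps are hypothetical.

```python
from dataclasses import dataclass

class UsageLimitExceeded(RuntimeError):
    """Raised as soon as any counter passes its cap."""

@dataclass
class UsageBudget:
    max_request_tokens: int
    max_response_tokens: int
    max_tool_calls: int
    request_tokens: int = 0
    response_tokens: int = 0
    tool_calls: int = 0

    def charge(self, request_tokens: int = 0, response_tokens: int = 0,
               tool_calls: int = 0) -> None:
        """Record usage for one step, then check every cap."""
        self.request_tokens += request_tokens
        self.response_tokens += response_tokens
        self.tool_calls += tool_calls
        for name in ("request_tokens", "response_tokens", "tool_calls"):
            used, cap = getattr(self, name), getattr(self, f"max_{name}")
            if used > cap:
                raise UsageLimitExceeded(f"{name}: used {used}, cap {cap}")

budget = UsageBudget(max_request_tokens=1000, max_response_tokens=500, max_tool_calls=2)
budget.charge(request_tokens=400, response_tokens=100, tool_calls=1)
budget.charge(request_tokens=400, response_tokens=100, tool_calls=1)

try:
    budget.charge(tool_calls=1)  # third tool call exceeds the cap of 2
    stopped = False
except UsageLimitExceeded:
    stopped = True
```

Charging the budget inside the agent loop turns a runaway run into a hard, catchable error instead of a surprise on the invoice.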
The honest weakness: Pydantic AI is stateless by default and has no native multi-agent orchestration. The emerging pattern in 2026 is LangGraph as the orchestrator with Pydantic AI inside each node — explicit graph state, type-safe agent behaviour.
CrewAI is the fastest path from blank file to working multi-agent prototype. Define agents with role, goal, and backstory, assemble them into a Crew, and you have a demoable system in an afternoon. Enterprise adoption is real, with PwC, DocuSign, and IBM among the named users.
The production picture is less flattering. GitHub issue #3154 documents agents producing valid-looking Thought-Action-Observation traces without actually calling the tool, fabricating an observation and continuing to the final answer. Reproducible on GPT-4 and Qwen. Issue #3202 documents dependency hell on macOS through ChromaDB's onnxruntime constraints. Role-based prompts add roughly 30 to 50 percent to token counts versus equivalent hand-tuned graphs.
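One framework-independent defence against fabricated observations is to record every tool call your own code actually executes and compare the agent's trace against that record. The sketch below assumes a trace represented as `(tool name, observation)` pairs; `ToolLedger` and the `get_balance` tool are hypothetical and unrelated to CrewAI's internals.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ToolLedger:
    tools: dict[str, Callable[..., Any]]
    executed: list[tuple[str, str]] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        """Run the real tool and record what it returned."""
        result = self.tools[name](**kwargs)
        self.executed.append((name, repr(result)))
        return result

    def unverified(self, trace: list[tuple[str, str]]) -> list[tuple[str, str]]:
        """Return trace steps that no recorded call can account for."""
        remaining = list(self.executed)
        missing = []
        for step in trace:
            if step in remaining:
                remaining.remove(step)  # each real call vouches for one step only
            else:
                missing.append(step)
        return missing

ledger = ToolLedger(tools={"get_balance": lambda account: 4200})
ledger.call("get_balance", account="acc-1")

honest_trace = [("get_balance", "4200")]
fabricated_trace = [("get_balance", "4200"), ("get_balance", "9999")]

honest_gaps = ledger.unverified(honest_trace)
fabricated_gaps = ledger.unverified(fabricated_trace)
```

A non-empty result means the trace claims an observation that never came from a tool, which is a reasonable signal to reject the final answer.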
The OpenAI Agents SDK is the official successor to Swarm. Minimal abstraction, clean handoff semantics, native integration with the Responses API, sessions, guardrails, and tracing. The codebase is small enough to read in an afternoon, which is unusually honest by 2026 standards. Version 0.14 added Sandbox Agents with first-class filesystem workspaces for long-horizon coding tasks.
Practitioners consistently report it "sucks at delegation" relative to mature open-source frameworks, struggling with complex hierarchical task distribution without explicit hardcoded routing. The "provider-agnostic" claim is technically true via LiteLLM but the developer experience is unmistakably tuned for OpenAI.
The Claude Agent SDK is the accuracy-first option for teams committed to Anthropic. Native support for extended thinking, desktop browser control, vision, and MCP integration. The SDK compiles organisational skills into compressed CLAUDE.md files that meaningfully improve tool invocation precision. Alice Labs ranked it the premier choice for safety-critical environments in their April 2026 assessment of 18 production deployments.
Vendor lock-in is absolute. Code written against the Claude Agent SDK requires a complete rewrite to run against any other model. A June 2026 policy change moves usage to a separate monthly credit pool, which complicates enterprise budget forecasting.
Generally available since April 2026, Microsoft Agent Framework (MAF) replaces both AutoGen (now in maintenance mode) and Semantic Kernel (now officially absorbed). Graph-based workflow runtime, declarative YAML agent definitions, native .NET telemetry, Azure Monitor integration, Azure Identity for auth. Migration assistants for AutoGen and Semantic Kernel codebases work well enough that most existing Microsoft shops can convert in weeks.
GitHub issue #2329 documents a serious design flaw in how max_invocations state is scoped, which complicates deployment in standard long-running web servers. The community goodwill Microsoft burned by deprecating AutoGen is recoverable but has not been recovered.
Mastra is the TypeScript-native framework that finally gives full-stack web teams a credible option. Created by the team behind Gatsby. Hit 1.0 in January 2026 with around 19,800 GitHub stars and 300,000 weekly npm downloads. Production references include Replit Agent 3, PayPal, Sanity, and Marsh McLennan.
Covers most of LangGraph's feature surface — cyclical graph workflows, durable execution, HITL approvals — but compatible with serverless platforms like Vercel and Cloudflare Workers, which LangGraph isn't. Native context compression via Observational Memory removes most of the token-management code that LangGraph requires you to write.
Google ADK is the only OSS framework with serious multi-language support. Python, Go, Java, and TypeScript with feature parity. ADK 2.0 added a graph-based Workflow Runtime with routing, fan-out, loops, retry, dynamic nodes, HITL, and nested workflows. The deepest A2A protocol implementation in the field. Production users include AWS Transform for .NET, Tyson Foods, and Gordon Food Service.
Gemini is the happy path. Non-Gemini models work through LiteLLM but need more configuration. Breaking changes between ADK 1.x and 2.0 are real, particularly around session schemas.
Strands Agents is AWS's model-driven SDK, already powering Amazon Q Developer, AWS Glue, and the VPC Reachability Analyzer. The pitch is radical simplicity: define an agent with a prompt and tools, let the LLM plan, and deploy on Bedrock AgentCore. Production users include Smartsheet, Swisscom, and Verisk Analytics.
"As someone who builds agents with LangGraph daily at work, Strands was a genuine surprise. The model-driven approach cut my setup from 40 lines to 3, and for the 80 percent case, it just works without sacrificing flexibility." — Tetiana Mostova, AWS Community Builder
Smolagents is the Hugging Face library that bet on code-as-action instead of JSON tool calls. The agent writes and executes Python in a sandboxed interpreter. Under 1,000 lines of core framework code, which means there's almost nowhere for hidden bugs to hide. Pooya Golchian's April 2026 benchmark on 200 medium-complexity tasks ranks LangGraph at 76%, Smolagents at 73%, CrewAI at 71%, and AutoGen at 68%, run against Qwen3 32B through Ollama.
The risk is obvious: code execution. The project explicitly warns that LocalPythonExecutor is not a security boundary. An April 2026 issue described severe RCE risk without proper isolation.
LiveKit Agents is the de facto standard for real-time voice and video agents. WebRTC infrastructure plus a streaming STT-LLM-TTS pipeline. Customers include Spotify, Meta, Microsoft, Character.AI, and Speak. Provider-agnostic across STT (Deepgram, AssemblyAI, Whisper), LLM (any), and TTS (Cartesia, ElevenLabs, Azure). Above roughly 10,000 minutes per month, self-hosting undercuts managed platforms like Vapi and Retell by 60 to 80 percent.
Voice latency is hard, and the framework doesn't change that. Hamming AI's January 2026 analysis across four million calls reports industry median latency of 1.4 to 1.7 seconds, with p99 between 3 and 5 seconds. Hitting sub-500ms perceived latency requires streaming at every stage, co-located regions, and pre-warmed contexts.
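A back-of-the-envelope latency budget shows why streaming matters at every stage. The per-stage numbers below are invented for illustration and are not measurements from Hamming AI or any provider. The model is also deliberately simple: without streaming, each stage waits for the previous stage's full output, and with streaming, each stage starts on the previous stage's first chunk.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the stage emits its first chunk
    full_output_ms: int   # time until the stage has emitted everything

def batch_latency_ms(stages: list[Stage]) -> int:
    """No streaming: the user hears audio only after every stage fully completes."""
    return sum(stage.full_output_ms for stage in stages)

def streaming_latency_ms(stages: list[Stage]) -> int:
    """Streaming everywhere: perceived latency is the chain of first chunks."""
    return sum(stage.first_output_ms for stage in stages)

# Hypothetical figures, chosen only to make the arithmetic visible.
PIPELINE = [
    Stage("stt", first_output_ms=150, full_output_ms=400),
    Stage("llm", first_output_ms=200, full_output_ms=900),
    Stage("tts", first_output_ms=100, full_output_ms=300),
]

batch_ms = batch_latency_ms(PIPELINE)          # 400 + 900 + 300
streaming_ms = streaming_latency_ms(PIPELINE)  # 150 + 200 + 100
```

Under these assumed numbers, a single non-streaming stage is enough to push the total past 500 ms, and the model ignores network hops entirely, which is where co-located regions and pre-warmed contexts come in.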
Letta is a memory-first runtime built on the MemGPT hierarchy. Three-tier memory (core, recall, archival) with self-editing memory tools. Benchmark claim: maintains task context across 500-plus interactions versus RAG baselines that fragment after 50. Letta Code, released December 2025, ranks first on Terminal-Bench among model-agnostic open-source coding agents.
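A toy data structure helps picture the three tiers. This is a loose, simplified reading and not Letta's API: it assumes core memory is a small set of facts always placed in the prompt, recall holds the most recent messages, and archival is a long-term store searched on demand. `TieredMemory` and its method names are hypothetical, and substring search stands in for whatever retrieval a real system uses.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class TieredMemory:
    core: dict[str, str] = field(default_factory=dict)              # always in context
    recall: deque = field(default_factory=lambda: deque(maxlen=4))  # recent messages
    archival: list[str] = field(default_factory=list)               # searched on demand

    def core_replace(self, key: str, value: str) -> None:
        """Self-editing tool: the agent rewrites its own core facts."""
        self.core[key] = value

    def add_message(self, message: str) -> None:
        """Append to recall; the oldest message spills into archival when full."""
        if len(self.recall) == self.recall.maxlen:
            self.archival.append(self.recall[0])
        self.recall.append(message)

    def archival_search(self, query: str) -> list[str]:
        return [m for m in self.archival if query.lower() in m.lower()]

    def build_context(self) -> str:
        """What would be sent to the model: core facts plus recent messages."""
        facts = [f"{key}: {value}" for key, value in self.core.items()]
        return "\n".join(facts + list(self.recall))

memory = TieredMemory()
memory.core_replace("user_name", "Dana")
for i in range(6):
    memory.add_message(f"msg {i}")
```

In a real memory-first runtime the agent itself decides when to call tools like `core_replace`, and that decision is made by the model on every turn.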
Lock-in is high. Every memory operation costs inference tokens because the agent reasons about what to store. Vectorize.io's 2026 analysis: "Letta owns your agent loop. Switching means rebuilding the loop, tool execution, state management, and memory logic elsewhere." For most teams, Mem0 as a memory layer on top of LangGraph or Pydantic AI is the lighter option.
Frameworks losing momentum
Real differentiators versus marketing
Genuine differentiators
- Durable execution depth — LangGraph and MAF lead; Mastra and Letta close behind
- Language ecosystem — Mastra for TypeScript, ADK for multi-language, MAF for .NET
- Memory architecture — Letta is genuinely different; most others are RAG with extra steps
- Type safety — Pydantic AI is uncontested
- Code-as-action — Smolagents and OpenAI Agents SDK Sandbox Agents
- Real-time infrastructure — LiveKit, alone in its category
Marketing, not architecture
- "Tool calling support" — every framework has it
- "Multi-agent support" — depth varies wildly; the claim is meaningless
- Benchmark percentage points within ±5 — model choice dominates at that margin
- Elaborate role-playing parameters (CrewAI backstories)
- Unverified speed claims (the Agno "10,000x" masked initialisation deferral)
- Conversational debate loops consuming tokens with diminishing returns
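The point about benchmark gaps within ±5 can be checked with basic statistics. The sketch below applies a normal approximation to the 200-task benchmark cited earlier (LangGraph at 76%, Smolagents at 73%). It assumes the tasks are independent and treats the two scores as independent samples, which is a simplification because both frameworks ran the same tasks.

```python
import math

def margin_of_error(accuracy: float, n_tasks: int, z: float = 1.96) -> float:
    """Approximate 95% margin of error for a pass rate measured on n_tasks."""
    return z * math.sqrt(accuracy * (1 - accuracy) / n_tasks)

def difference_z_score(acc_a: float, acc_b: float, n_tasks: int) -> float:
    """z-score of the gap between two pass rates, each measured on n_tasks."""
    variance = (acc_a * (1 - acc_a) + acc_b * (1 - acc_b)) / n_tasks
    return (acc_a - acc_b) / math.sqrt(variance)

moe = margin_of_error(0.76, 200)           # roughly 0.059, about 6 points
z = difference_z_score(0.76, 0.73, 200)    # roughly 0.69, well below 1.96
```

With a margin of about six points on each score, a three-point gap is inside the noise, so a different model or task sample could plausibly reverse the ranking.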
Picking a framework
The decision is rarely "which framework is best." It's "which framework matches my language, cloud, and complexity constraints." Pick the language first, the cloud and model second, the agent complexity third. Reversing them is how teams end up rewriting their stack six months in.
| Scenario | Primary | Secondary |
|---|---|---|
| Long-running, stateful, regulated with HITL | LangGraph | MAF |
| Microsoft, .NET, or Azure enterprise | Microsoft Agent Framework | LangGraph |
| GCP, multi-language, or A2A-heavy | Google ADK | MAF |
| AWS-native, Bedrock | Strands Agents | LangGraph |
| OpenAI-first, thin SDK | OpenAI Agents SDK | Pydantic AI |
| TypeScript or Next.js product | Mastra | Vercel AI SDK + Pydantic AI |
| Voice (support, telephony, telehealth) | LiveKit Agents | OpenAI Agents SDK Realtime |
| Document Q&A or enterprise search | LlamaIndex Workflows | LangGraph + LlamaParse |
| Type-safe single-agent backend | Pydantic AI | OpenAI Agents SDK |
| Persistent-memory companion | Letta | Mem0 on LangGraph |
| Fast role-based prototype (<1 week) | CrewAI | Agno |
| High-throughput swarm (content, social) | Agno | CrewAI |
| Research, data analysis, code-action | Smolagents | OpenAI Agents SDK Sandbox |
| Anthropic-mandated environment | Claude Agent SDK | — |
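The "language first, cloud and model second, complexity third" ordering can be written down as a small lookup. The rule tables below are a simplified, hypothetical reading of a few rows in the table above, and the constraint keys (`"typescript"`, `"aws"`, `"stateful_hitl"`, and so on) are invented labels. The plain-code fallback reflects the no-framework baseline from the introduction.

```python
from typing import Optional

# Each table maps one constraint value to a primary pick from the table above.
LANGUAGE_RULES = {"typescript": "Mastra", "dotnet": "Microsoft Agent Framework"}
CLOUD_OR_MODEL_RULES = {
    "aws": "Strands Agents",
    "gcp": "Google ADK",
    "azure": "Microsoft Agent Framework",
    "openai": "OpenAI Agents SDK",
    "anthropic": "Claude Agent SDK",
}
COMPLEXITY_RULES = {
    "stateful_hitl": "LangGraph",
    "typed_single_agent": "Pydantic AI",
    "role_prototype": "CrewAI",
}

def pick_framework(language: str,
                   cloud_or_model: Optional[str] = None,
                   complexity: Optional[str] = None) -> str:
    """Apply constraints in priority order; the first one that matches decides."""
    ordered = (
        (language, LANGUAGE_RULES),
        (cloud_or_model, CLOUD_OR_MODEL_RULES),
        (complexity, COMPLEXITY_RULES),
    )
    for value, rules in ordered:
        if value is not None and value in rules:
            return rules[value]
    return "no framework (plain code)"

ts_pick = pick_framework("typescript", "aws", "stateful_hitl")
aws_pick = pick_framework("python", "aws", "stateful_hitl")
graph_pick = pick_framework("python", None, "stateful_hitl")
default_pick = pick_framework("python")
```

The ordering is the design choice worth noticing: a TypeScript team on AWS with a stateful workflow still lands on Mastra, because the language constraint is evaluated before the others.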