5x Performance - Blog

The AI Agent Framework Shakeout: What Actually Works in Production, May 2026

Two years of hype have finally produced something useful: a clear picture of which AI agent frameworks survive contact with production traffic, and which were always demos in disguise.

The story is no longer about which framework has the cleverest abstraction. It's about which ones let engineers retain control of state, concurrency, and observability when the LLM inevitably misbehaves. By May 2026, that filter has narrowed the field considerably.

The architectural shift nobody is marketing

If you read vendor blog posts, you would think every framework is converging on the same feature set: tool calling, memory, multi-agent handoffs, MCP support. That's true on the surface. Underneath, the field has split into three distinct philosophies, and choosing the wrong one for your problem now costs more than picking a worse model.

  • Graph state machines: LangGraph, Microsoft Agent Framework, Google ADK 2.0, Mastra
  • Role and conversation: CrewAI, AutoGen lineage
  • Type-first lightweight SDKs: Pydantic AI, OpenAI Agents SDK, Smolagents

A fourth option deserves explicit naming because senior engineers keep choosing it: no framework at all. The "12-Factor Agents" thesis from 2025 still defines the prevailing philosophy. The most reliable production systems are well-engineered software with LLM calls placed at deterministic junctures, backed by ordinary databases, ordinary retries, and ordinary observability. Raw code is the baseline against which every framework should be judged.
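To make the "no framework" baseline concrete, here is a minimal sketch of one LLM call sitting at a fixed point in otherwise ordinary code. Everything in it is an assumption made for illustration: `classify` stands in for a real model call, `make_flaky_classifier` fakes a provider that times out, and the queue names are invented.

```python
import time
from typing import Callable, Optional


def with_retries(fn: Callable[[], str], attempts: int = 3, base_delay: float = 0.0) -> str:
    """Ordinary retry loop with exponential backoff around one flaky call."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError as exc:  # RuntimeError plays the role of a transient failure
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error


def handle_ticket(text: str, classify: Callable[[str], str]) -> str:
    """Plain code decides the flow; the model only supplies a label."""
    cleaned = text.strip()
    if not cleaned:
        return "rejected: empty ticket"  # no model call needed
    label = with_retries(lambda: classify(cleaned))
    if label not in {"billing", "bug", "other"}:  # never trust free-form output
        label = "other"
    return f"routed to {label} queue"


def make_flaky_classifier(failures: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a model call that fails `failures` times first."""
    state = {"calls": 0}

    def classify(text: str) -> str:
        state["calls"] += 1
        if state["calls"] <= failures:
            raise RuntimeError("simulated timeout")
        return "billing" if "invoice" in text.lower() else "bug"

    return classify


print(handle_ticket("Invoice is wrong", make_flaky_classifier(2)))
```

The control flow, the retry policy, and the set of allowed labels all live in code you can test without a model in the loop, which is the property the frameworks below have to match.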

What everyone is converging on

Three protocols have effectively won, and any framework that doesn't support them is unviable for enterprise work.

Protocol Convergence — May 2026

  • Model Context Protocol (MCP): for tool definitions. Build a tool once as an MCP server and it works across Claude Agent SDK, LangGraph, Pydantic AI, Mastra, MAF, and the rest.
  • Agent-to-Agent (A2A): for inter-agent communication. Google ADK leads the implementation; MAF, CrewAI, Strands, and Mastra have followed.
  • OpenTelemetry: tracing as the observability spine. LangSmith and Logfire layer on top of it; raw OTLP works everywhere.

A few other patterns have become defaults. Model tiering (cheap models triage, expensive models reason) reduces token costs by 40 to 60 percent in production. Sandboxed code execution for code-action agents has consolidated around E2B, Modal, and Docker. And the industry has quietly abandoned dynamic, LLM-driven execution graphs after the 2025 experiments showed catastrophic looping behaviour. The developer defines the graph; the LLM operates inside it.
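Model tiering is easy to reason about with a little arithmetic. The sketch below is illustrative only: the tier names, the per-token prices, and the keyword triage rule are all made up, and a real router would more likely use a small classifier or a cheap model call for triage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # hypothetical price, not a real quote


CHEAP = Tier("small-model", 0.50)
EXPENSIVE = Tier("large-model", 5.00)


def pick_tier(task: str) -> Tier:
    """Toy triage rule: short lookups stay cheap, everything else escalates."""
    needs_reasoning = len(task.split()) > 12 or "why" in task.lower()
    return EXPENSIVE if needs_reasoning else CHEAP


def blended_cost(cheap_share: float, tokens_millions: float) -> float:
    """Total cost when `cheap_share` of the tokens go to the cheap tier."""
    per_million = (
        cheap_share * CHEAP.usd_per_million_tokens
        + (1 - cheap_share) * EXPENSIVE.usd_per_million_tokens
    )
    return tokens_millions * per_million


baseline = blended_cost(0.0, 100)  # everything on the expensive tier
tiered = blended_cost(0.5, 100)    # half of the tokens triaged to the cheap tier
savings = 1 - tiered / baseline
print(f"baseline ${baseline:.0f}, tiered ${tiered:.0f}, savings {savings:.0%}")
```

With these invented prices, routing half the tokens to the cheap tier saves about 45 percent. The saving is driven almost entirely by the share of traffic that triage can keep off the expensive model, so measure that share before trusting any headline figure.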

The frameworks worth knowing

LangGraph: Production Standard

The production standard for stateful workflows with human-in-the-loop steps. Native checkpointing means a workflow can pause mid-execution, wait days for human approval on a financial decision, and resume cleanly. Verified production users include Klarna, Uber, LinkedIn, BlackRock, Cisco, and JPMorgan.

The friction is real and well documented. State is a TypedDict at best, which means runtime type errors are common and tracing how data moves between nodes is painful. A 2025 Hacker News critique that "you spend so much time fixing stupid runtime type errors because the state of every graph is a JSON blob with very minimal typing" still gets recirculated in 2026 because v1.0 didn't fix it. The default in-memory checkpointer won't survive a container restart, so production deployments need Postgres, Redis, or SQLite backends.
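The difference between an in-memory and a durable checkpointer is easiest to see in a toy version. This is a sketch using only the standard library, not LangGraph's checkpointer API: the `SqliteCheckpointer` class, the table layout, and the `loan-42` thread are all hypothetical.

```python
import json
import os
import sqlite3
import tempfile


class SqliteCheckpointer:
    """Hypothetical durable checkpoint store: one row per workflow thread."""

    def __init__(self, path: str) -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS checkpoints ("
            "thread_id TEXT PRIMARY KEY, node TEXT NOT NULL, state TEXT NOT NULL)"
        )
        self.conn.commit()

    def save(self, thread_id: str, node: str, state: dict) -> None:
        # Parameterized SQL and JSON (rather than pickle) keep user-controlled
        # data out of both the query text and the deserializer.
        self.conn.execute(
            "INSERT OR REPLACE INTO checkpoints (thread_id, node, state) VALUES (?, ?, ?)",
            (thread_id, node, json.dumps(state)),
        )
        self.conn.commit()

    def load(self, thread_id: str):
        row = self.conn.execute(
            "SELECT node, state FROM checkpoints WHERE thread_id = ?", (thread_id,)
        ).fetchone()
        return None if row is None else (row[0], json.loads(row[1]))


path = os.path.join(tempfile.mkdtemp(), "checkpoints.db")

first = SqliteCheckpointer(path)
first.save("loan-42", "await_human_approval", {"amount": 25000, "approved": None})
first.conn.close()  # simulate the container going away

second = SqliteCheckpointer(path)  # a fresh process would reconnect the same way
resumed = second.load("loan-42")
print(resumed)
```

The workflow's position and state come back after the "restart" because they were written to disk, which is exactly what a dictionary held in process memory cannot offer.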

March 2026 brought CVE disclosures from Cyera: CVE-2025-67644 (SQL injection in the SQLite checkpointer) and CVE-2025-68664 (high-severity deserialization). All patched, but they reframe LangGraph from "trust by default" to "audit before deploying."

Use for: Long-running stateful workflows, regulated industries, HITL
Avoid for: Simple RAG prototypes, serverless (Vercel/Cloudflare)

Pydantic AI: Recommended

The right answer for roughly half of agent projects that don't actually need a graph runtime. Type safety is the headline feature: outputs are validated against Pydantic models, and the framework won't return a response that fails validation. In a 90-day benchmark from early 2026, Pydantic AI caught 23 production data-integrity bugs that other frameworks silently ignored.
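The validate-or-reject idea can be shown without any framework. The sketch below uses only the standard library and is not Pydantic AI's API; the `Refund` schema, its fields, and `parse_refund` are invented for the example, and Pydantic would derive most of these checks from the model definition instead of hand-written `isinstance` tests.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Refund:
    order_id: str
    amount_cents: int


class OutputValidationError(ValueError):
    """Raised when model output does not match the expected schema."""


def parse_refund(raw: str) -> Refund:
    """Accept the model's text only if it is a well-formed Refund."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise OutputValidationError(f"not JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"order_id", "amount_cents"}:
        raise OutputValidationError("unexpected fields")
    order_id, amount = data["order_id"], data["amount_cents"]
    if not isinstance(order_id, str) or not order_id:
        raise OutputValidationError("order_id must be a non-empty string")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(amount, bool) or not isinstance(amount, int) or amount <= 0:
        raise OutputValidationError("amount_cents must be a positive integer")
    return Refund(order_id, amount)


print(parse_refund('{"order_id": "A-17", "amount_cents": 1299}'))
```

A response such as `{"order_id": "A-17", "amount_cents": "12.99"}` looks plausible to a human reviewer but raises here, which is the kind of silent data-integrity bug that typed outputs are meant to surface.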

Built-in usage limits let engineers cap request tokens, response tokens, and tool calls directly on the agent definition, which prevents the cost overruns that plague more autonomous frameworks. Infrastructure costs in the same benchmark came in at around $390 over 90 days, versus over $1,000 for heavier frameworks running equivalent workloads.
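A usage cap is simple enough to sketch by hand. The class below is a hypothetical stand-in, not Pydantic AI's API: `UsageBudget`, its field names, and the limits are invented, and the point is only that the budget is checked after every model or tool call so a runaway loop stops early.

```python
from dataclasses import dataclass


class UsageLimitExceeded(RuntimeError):
    """Raised as soon as any counter passes its cap."""


@dataclass
class UsageBudget:
    """Hypothetical per-run budget for tokens and tool calls."""
    max_request_tokens: int
    max_response_tokens: int
    max_tool_calls: int
    request_tokens: int = 0
    response_tokens: int = 0
    tool_calls: int = 0

    def record_model_call(self, request_tokens: int, response_tokens: int) -> None:
        self.request_tokens += request_tokens
        self.response_tokens += response_tokens
        self._check()

    def record_tool_call(self) -> None:
        self.tool_calls += 1
        self._check()

    def _check(self) -> None:
        if self.request_tokens > self.max_request_tokens:
            raise UsageLimitExceeded("request token budget exhausted")
        if self.response_tokens > self.max_response_tokens:
            raise UsageLimitExceeded("response token budget exhausted")
        if self.tool_calls > self.max_tool_calls:
            raise UsageLimitExceeded("tool call budget exhausted")


budget = UsageBudget(max_request_tokens=1000, max_response_tokens=500, max_tool_calls=2)
budget.record_model_call(request_tokens=400, response_tokens=120)
budget.record_tool_call()
budget.record_tool_call()
print(budget)
```

A third `record_tool_call()` on this budget raises `UsageLimitExceeded`, turning an open-ended cost into a bounded one.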

The honest weakness: Pydantic AI is stateless by default and has no native multi-agent orchestration. The emerging pattern in 2026 is LangGraph as the orchestrator with Pydantic AI inside each node — explicit graph state, type-safe agent behaviour.

Use for: Type-safe single-agent backends, cost-sensitive deployments
Avoid for: Long-running stateful orchestration without a graph layer

CrewAI: Prototype Only

The fastest path from blank file to working multi-agent prototype. Define agents with role, goal, and backstory, assemble them into a Crew, and you have a demoable system in an afternoon. Enterprise adoption is real, with PwC, DocuSign, and IBM among the named users.

The production picture is less flattering. GitHub issue #3154 documents agents producing valid-looking Thought-Action-Observation traces without actually calling the tool, fabricating an observation and continuing to the final answer. Reproducible on GPT-4 and Qwen. Issue #3202 documents dependency hell on macOS through ChromaDB's onnxruntime constraints. Role-based prompts add roughly 30 to 50 percent to token counts versus equivalent hand-tuned graphs.
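One way to catch a fabricated observation is to compare the trace the model produced against a ledger of tool calls the runtime really executed. The sketch below is an illustration under assumptions, not CrewAI code: the `ToolLedger` class, the trace format, and the `get_balance` tool are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToolLedger:
    """Records tool calls that the runtime actually executed."""
    executed: list = field(default_factory=list)

    def run(self, name: str, fn, *args):
        result = fn(*args)
        self.executed.append((name, str(result)))
        return result


def fabricated_observations(trace: list, ledger: ToolLedger) -> list:
    """Return trace steps whose observation has no matching executed call."""
    remaining = list(ledger.executed)
    suspicious = []
    for step in trace:
        key = (step["action"], step["observation"])
        if key in remaining:
            remaining.remove(key)  # each real call can vouch for one step only
        else:
            suspicious.append(step)
    return suspicious


ledger = ToolLedger()
ledger.run("get_balance", lambda account: 120, "acc-1")

trace = [
    {"action": "get_balance", "observation": "120"},  # backed by a real call
    {"action": "get_balance", "observation": "950"},  # never executed
]
flagged = fabricated_observations(trace, ledger)
print(flagged)
```

The check is crude, since it matches on exact strings, but it shows the principle: the runtime, not the model's own narration, should be the source of truth for what a tool returned.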

Use for: Rapid prototyping, content pipelines, demos
Avoid for: Compliance-sensitive workloads, latency-bound systems

OpenAI Agents SDK: Solid Choice

The official successor to Swarm. Minimal abstraction, clean handoff semantics, native integration with the Responses API, sessions, guardrails, and tracing. The codebase is small enough to read in an afternoon, which is unusually honest by 2026 standards. Version 0.14 added Sandbox Agents with first-class filesystem workspaces for long-horizon coding tasks.

Practitioners consistently report that it "sucks at delegation" relative to mature open-source frameworks: it struggles with complex hierarchical task distribution unless routing is hardcoded explicitly. The "provider-agnostic" claim is technically true via LiteLLM, but the developer experience is unmistakably tuned for OpenAI.

Use for: OpenAI-native stacks, thin orchestration layers
Avoid for: Maximal portability, multi-provider cost tiering

Claude Agent SDK: Accuracy-First

The accuracy-first option for teams committed to Anthropic. Native support for extended thinking, desktop browser control, vision, and MCP integration. The SDK compiles organisational skills into compressed CLAUDE.md files that meaningfully improve tool invocation precision. Alice Labs ranked it the premier choice for safety-critical environments in their April 2026 assessment of 18 production deployments.

Vendor lock-in is absolute. Code written against the Claude Agent SDK requires a complete rewrite to run against any other model. A June 2026 policy change moves usage to a separate monthly credit pool, which complicates enterprise budget forecasting.

Use for: Deep coding assistants, automated desktop operators, Anthropic-mandated environments
Avoid for: Cost-tiering architectures, multi-model deployments

Microsoft Agent Framework: Azure Standard

Generally available since April 2026, MAF replaces both AutoGen (now in maintenance mode) and Semantic Kernel (now officially absorbed). Graph-based workflow runtime, declarative YAML agent definitions, native .NET telemetry, Azure Monitor integration, Azure Identity for auth. Migration assistants for AutoGen and Semantic Kernel codebases work well enough that most existing Microsoft shops can convert in weeks.

GitHub issue #2329 documents a serious design flaw in the state scope of max_invocations that complicates deployment in standard long-running web servers. The community goodwill Microsoft burned by deprecating AutoGen is recoverable but has not been recovered.

Use for: Heavily regulated Azure-native environments, .NET shops
Avoid for: Greenfield Python projects with no Microsoft dependency

Mastra: TypeScript-Native

The TypeScript-native framework that finally gives full-stack web teams a credible option. Created by the team behind Gatsby. Hit 1.0 in January 2026 with around 19,800 GitHub stars and 300,000 weekly npm downloads. Production references include Replit Agent 3, PayPal, Sanity, and Marsh McLennan.

It covers most of LangGraph's feature surface (cyclical graph workflows, durable execution, HITL approvals) while staying compatible with serverless platforms like Vercel and Cloudflare Workers, which LangGraph is not. Native context compression via Observational Memory removes most of the token-management code that LangGraph requires you to write.

Use for: Next.js, SvelteKit, Remix products embedding AI
Avoid for: Heavy offline data engineering, scientific computing, Python stacks

Google ADK 2.0: Multi-Language Leader

The only OSS framework with serious multi-language support. Python, Go, Java, and TypeScript with feature parity. ADK 2.0 added a graph-based Workflow Runtime with routing, fan-out, loops, retry, dynamic nodes, HITL, and nested workflows. The deepest A2A protocol implementation in the field. Production users include AWS Transform for .NET, Tyson Foods, and Gordon Food Service.

Gemini is the happy path. Non-Gemini models work through LiteLLM but need more configuration. Breaking changes between ADK 1.x and 2.0 are real, particularly around session schemas.

Use for: GCP-native deployments, multi-language enterprise stacks, A2A-heavy architectures
Avoid for: AWS-only or Azure-only shops

Strands Agents: AWS-Native

AWS's model-driven SDK, already powering Amazon Q Developer, AWS Glue, and the VPC Reachability Analyzer. The pitch is radical simplicity: define an agent with a prompt and tools, let the LLM plan, and deploy on Bedrock AgentCore. Production users include Smartsheet, Swisscom, and Verisk Analytics.

"As someone who builds agents with LangGraph daily at work, Strands was a genuine surprise. The model-driven approach cut my setup from 40 lines to 3, and for the 80 percent case, it just works without sacrificing flexibility." — Tetiana Mostova, AWS Community Builder

Use for: AWS-native shops, Bedrock deployments
Avoid for: Multi-cloud teams, GCP-heavy architectures

Smolagents: Research / Sandboxed

The Hugging Face library that bet on code-as-action instead of JSON tool calls. The agent writes and executes Python in a sandboxed interpreter. The core framework is under 1,000 lines of code, which leaves almost nowhere for bugs to hide. Pooya Golchian's April 2026 benchmark on 200 medium-complexity tasks ranks LangGraph at 76%, Smolagents at 73%, CrewAI at 71%, and AutoGen at 68%, run against Qwen3 32B through Ollama.

The risk is obvious: code execution. The project explicitly warns that LocalPythonExecutor is not a security boundary. An April 2026 issue described severe RCE risk without proper isolation.
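To show where the isolation boundary has to sit, here is a minimal sketch that runs a generated snippet in a separate interpreter with a wall-clock limit. It is not Smolagents code, `run_untrusted` is a hypothetical helper, and, importantly, a child process is not a sandbox: it only keeps the snippet out of the agent's own process and stops infinite loops.

```python
import subprocess
import sys
from typing import Tuple


def run_untrusted(code: str, timeout_s: float = 2.0) -> Tuple[int, str]:
    """Run a snippet in a separate interpreter with a wall-clock limit.

    This is process separation, NOT a security boundary: the child can
    still read files and open sockets unless a container or microVM
    (for example Docker, E2B, or Modal) wraps it.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],  # -I: isolated mode, ignores env vars and user site
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return (-1, "timed out")
    return (proc.returncode, proc.stdout.strip())


print(run_untrusted("print(6 * 7)"))
print(run_untrusted("while True: pass", timeout_s=0.5))
```

Treat this as the innermost layer only. The hard isolation the warning calls for comes from whatever wraps this process, not from the Python call itself.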

Use for: Research agents, data analysis, self-hosted sandboxed setups
Avoid for: Compliance-bound systems, any production path without hard isolation

LiveKit Agents: Voice Standard

The de-facto standard for real-time voice and video agents. WebRTC infrastructure plus a streaming STT-LLM-TTS pipeline. Customers include Spotify, Meta, Microsoft, Character.AI, and Speak. Provider-agnostic across STT (Deepgram, AssemblyAI, Whisper), LLM (any), and TTS (Cartesia, ElevenLabs, Azure). Above roughly 10,000 minutes per month, self-hosting undercuts managed platforms like Vapi and Retell by 60 to 80 percent.

Voice latency is hard, and the framework doesn't change that. Hamming AI's January 2026 analysis across four million calls reports industry median latency of 1.4 to 1.7 seconds, with p99 between 3 and 5 seconds. Hitting sub-500ms perceived latency requires streaming at every stage, co-located regions, and pre-warmed contexts.
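Why streaming matters at every stage becomes obvious once the budget is written down. The numbers below are invented for illustration and the model is idealised: it assumes each stage can start as soon as the previous stage emits its first usable output, which real pipelines only approximate (speech-to-text, for instance, usually has to wait for the end of the utterance).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    name: str
    first_output_ms: int  # time until the first usable output reaches the next stage
    full_ms: int          # time until the stage has finished completely


# Hypothetical per-stage timings for one utterance.
PIPELINE = [
    Stage("stt", 120, 400),
    Stage("llm", 180, 900),
    Stage("tts", 100, 500),
]


def batch_latency_ms(stages) -> int:
    """Time to first audio when each stage waits for the previous one to finish."""
    return sum(s.full_ms for s in stages)


def streaming_latency_ms(stages) -> int:
    """Time to first audio when each stage starts on the previous stage's first output."""
    return sum(s.first_output_ms for s in stages)


print(batch_latency_ms(PIPELINE), streaming_latency_ms(PIPELINE))
```

With these made-up timings the batch pipeline needs 1,800 ms before the caller hears anything, while the fully streamed one needs 400 ms. A single non-streaming stage puts its full duration back into the sum, which is why the sub-500ms target leaves no room for one.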

Use for: Voice support, telehealth, real-time translation
Avoid for: General agent orchestration; pair with LangGraph or Pydantic AI for tools

Letta (formerly MemGPT): Memory-Specialist

Memory-first runtime built on the MemGPT hierarchy. Three-tier memory (core, recall, archival) with self-editing memory tools. Benchmark claim: maintains task context across 500-plus interactions versus RAG baselines that fragment after 50. Letta Code, released December 2025, ranks first on Terminal-Bench among model-agnostic open-source coding agents.
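The three tiers are easier to picture as a data structure. The toy below is a loose sketch of the idea and not Letta's API: the `TieredMemory` class, its eviction rule, and the keyword search are assumptions, and a real system would use embeddings for retrieval and let the model decide what to write.

```python
from dataclasses import dataclass, field


@dataclass
class TieredMemory:
    """Hypothetical three-tier store: core, recall, archival."""
    core_limit: int = 3
    core: dict = field(default_factory=dict)      # always in the prompt, agent-editable
    recall: list = field(default_factory=list)    # message history, searched on demand
    archival: list = field(default_factory=list)  # long-term facts evicted from core

    def remember(self, key: str, value: str) -> None:
        """Self-editing tool: write a fact to core, evicting the oldest if full."""
        if key not in self.core and len(self.core) >= self.core_limit:
            oldest = next(iter(self.core))  # dicts preserve insertion order
            self.archival.append(f"{oldest}: {self.core.pop(oldest)}")
        self.core[key] = value

    def log(self, message: str) -> None:
        self.recall.append(message)

    def search(self, term: str) -> list:
        """Keyword search over the two out-of-context tiers."""
        term = term.lower()
        return [m for m in self.recall + self.archival if term in m.lower()]

    def prompt_block(self) -> str:
        """The only part of memory that is sent with every request."""
        return "\n".join(f"{k}: {v}" for k, v in self.core.items())


memory = TieredMemory(core_limit=2)
memory.remember("name", "Ada")
memory.remember("city", "Oslo")
memory.remember("diet", "vegan")  # core is full, so "name" moves to archival
memory.log("user: can you book a table in Oslo?")
print(memory.prompt_block())
print(memory.search("ada"))
```

Only the core tier rides along in every prompt; the other two are reached through search, which keeps context size bounded as the conversation grows.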

Lock-in is high. Every memory operation costs inference tokens because the agent reasons about what to store. Vectorize.io's 2026 analysis: "Letta owns your agent loop. Switching means rebuilding the loop, tool execution, state management, and memory logic elsewhere." For most teams, Mem0 as a memory layer on top of LangGraph or Pydantic AI is the lighter option.

Use for: Persistent-memory companions, long-horizon agents where memory is the product
Avoid for: Everything else; use Mem0 on LangGraph instead

Frameworks losing momentum

LangChain core
Increasingly viewed as glue rather than orchestration. The LangChain Expression Language is widely treated as a debugging hazard. March 2026 CVEs reinforced that this is a library to audit, not trust. Still useful as integration glue. Not the right runtime for new production agents.
AutoGen
In maintenance mode per Microsoft's own README, with new users directed to MAF. The community fork AG2 continues but has not produced strong 2026 production evidence. No greenfield project should start here.
Semantic Kernel
Officially folded into MAF. The repo banner says so. Treat it as predecessor lineage only.
Bee Agent Framework
Donated by IBM to the Linux Foundation in 2025, with IBM explicitly disclaiming maintenance. Survives but is not enterprise-invested.
OpenClaw
Viral adoption in early 2026 ended with the May 2026 "Claw Chain" disclosure: CVE-2026-44112 through 44118 — TOCTOU race conditions, bearer-token spoofing, and one-click RCE through prompt injection. Over 42,000 instances were found exposed to the public internet. The framework is functionally dead for any deployment touching a network. The lesson generalises: ease of configuration is not a substitute for security hygiene.

Real differentiators versus marketing

Genuine differentiators

  • Durable execution depth — LangGraph and MAF lead; Mastra and Letta close behind
  • Language ecosystem — Mastra for TypeScript, ADK for multi-language, MAF for .NET
  • Memory architecture — Letta is genuinely different; most others are RAG with extra steps
  • Type safety — Pydantic AI is uncontested
  • Code-as-action — Smolagents and OpenAI Agents SDK Sandbox Agents
  • Real-time infrastructure — LiveKit, alone in its category

Marketing, not architecture

  • "Tool calling support" — every framework has it
  • "Multi-agent support" — depth varies wildly; the claim is meaningless
  • Benchmark percentage points within ±5 — model choice dominates at that margin
  • Elaborate role-playing parameters (CrewAI backstories)
  • Unverified speed claims (the Agno "10,000x" masked initialisation deferral)
  • Conversational debate loops consuming tokens with diminishing returns

Picking a framework

The decision is rarely "which framework is best." It's "which framework matches my language, cloud, and complexity constraints." Pick the language first, the cloud and model second, the agent complexity third. Reversing them is how teams end up rewriting their stack six months in.

Scenario | Primary | Secondary
Long-running, stateful, regulated with HITL | LangGraph | MAF
Microsoft, .NET, or Azure enterprise | Microsoft Agent Framework | LangGraph
GCP, multi-language, or A2A-heavy | Google ADK | MAF
AWS-native, Bedrock | Strands Agents | LangGraph
OpenAI-first, thin SDK | OpenAI Agents SDK | Pydantic AI
TypeScript or Next.js product | Mastra | Vercel AI SDK + Pydantic AI
Voice (support, telephony, telehealth) | LiveKit Agents | OpenAI Agents SDK Realtime
Document Q&A or enterprise search | LlamaIndex Workflows | LangGraph + LlamaParse
Type-safe single-agent backend | Pydantic AI | OpenAI Agents SDK
Persistent-memory companion | Letta | Mem0 on LangGraph
Fast role-based prototype (<1 week) | CrewAI | Agno
High-throughput swarm (content, social) | Agno | CrewAI
Research, data analysis, code-action | Smolagents | OpenAI Agents SDK Sandbox
Anthropic-mandated environment | Claude Agent SDK | (none listed)

What I would tell a team starting today

Wire observability before you wire a second agent. LangSmith, Logfire, Langfuse, Arize, or raw OpenTelemetry into Honeycomb. Retrofitting observability after the fact is rework, every time.
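What "wiring observability" means at the smallest scale is a span around every unit of work. The sketch below is a standard-library stand-in for the idea, not the OpenTelemetry API: the `span` helper, the `SPANS` list, and the attribute names are hypothetical, and it records no parent-child links, which a real tracer would.

```python
import time
from contextlib import contextmanager

SPANS = []  # stand-in for an exporter; a real setup would ship these over OTLP


@contextmanager
def span(name: str, **attributes):
    """Record duration, attributes, and error status for one unit of work."""
    record = {"name": name, "attributes": dict(attributes), "status": "ok"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = f"error: {type(exc).__name__}"
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)


def answer(question: str) -> str:
    with span("agent.run", question_chars=len(question)):
        with span("llm.call", model="hypothetical-model") as current:
            reply = question.upper()  # placeholder for a model call
            current["attributes"]["reply_chars"] = len(reply)
        return reply


print(answer("hi"))
print([s["name"] for s in SPANS])
```

Once every model call and tool call passes through something like `span`, adding a second agent means adding more spans rather than inventing a logging scheme under pressure.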

Treat the framework as an attack surface. The 2026 LangChain CVEs were a category warning, not just a LangChain warning. Audit deserialization, prompt loading, and any SQL paths in any framework you adopt.

Separate framework limitations from LLM limitations. CrewAI's tool-call hallucination bug reproduces with both GPT-4 and Qwen. The framework didn't cause the hallucination, but it didn't catch it either. Pick frameworks partly on how well they detect and recover from LLM failure modes through guardrails, validators, retries, and fallback models.
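Detection and recovery can be sketched as a validator plus an ordered list of models. This is a minimal illustration under assumptions, not any framework's API: `call_with_fallback` is hypothetical, and the two "models" are plain functions standing in for provider calls.

```python
from typing import Callable, Sequence


def call_with_fallback(
    prompt: str,
    models: Sequence[Callable[[str], str]],
    is_valid: Callable[[str], bool],
    attempts_per_model: int = 2,
) -> str:
    """Try each model in order; accept the first output that passes validation."""
    failures = []
    for index, model in enumerate(models):
        for attempt in range(attempts_per_model):
            try:
                output = model(prompt)
            except Exception as exc:  # provider error, timeout, and so on
                failures.append(f"model {index} attempt {attempt}: {exc!r}")
                continue
            if is_valid(output):
                return output
            failures.append(f"model {index} attempt {attempt}: rejected by validator")
    raise RuntimeError("all models failed: " + "; ".join(failures))


def primary(prompt: str) -> str:
    return "maybe"  # stand-in for a model that ignores the output contract


def backup(prompt: str) -> str:
    return "yes"    # stand-in for a fallback model that follows it


result = call_with_fallback(
    "Is the invoice paid? Answer yes or no.",
    [primary, backup],
    is_valid=lambda out: out in {"yes", "no"},
)
print(result)
```

The validator is what turns a silent failure into a detected one; the retry and fallback order is then an ordinary policy decision you can tune per workload.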

The winning frameworks of 2026 are the ones that admit they cannot make the underlying model reliable, and instead make its unreliability observable, controllable, and recoverable. Anything claiming more than that is selling you something.