The AI Agent Framework Shakeout: What Actually Works in Production

Architecture

The architectural shift nobody is marketing

If you read vendor blog posts, you would think every framework is converging on the same feature set: tool calling, memory, multi-agent handoffs, MCP support. That's true at the surface. Underneath, the field has split into three distinct philosophies, and choosing the wrong one for your problem now costs more than picking a worse model.

Graph State Machines: LangGraph, Microsoft Agent Framework, Google ADK 2.0, Mastra

Role & Conversation: CrewAI, AutoGen lineage

Type-First Lightweight SDKs: Pydantic AI, OpenAI Agents SDK, Smolagents

A fourth option deserves explicit naming because senior engineers keep choosing it: no framework at all. The "12-Factor Agents" thesis from 2025 still defines the prevailing philosophy. The most reliable production systems are well-engineered software with LLM calls placed at deterministic junctures, backed by ordinary databases, ordinary retries, and ordinary observability. Raw code is the baseline against which every framework should be judged.
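
As a rough illustration of what "LLM calls at deterministic junctures" can look like, the sketch below keeps a single model call at one fixed point in an otherwise ordinary function, wrapped in a plain retry loop. The names `classify_ticket`, `with_retries`, `fake_llm`, and the label set are hypothetical, and the stub stands in for whatever provider client you actually use.

```python
import time
from typing import Callable, Optional

ALLOWED_LABELS = {"billing", "technical", "other"}  # hypothetical label set


def with_retries(fn: Callable[[], str], attempts: int = 3, base_delay: float = 0.0) -> str:
    """Ordinary retry loop with exponential backoff; nothing agent-specific."""
    last_error: Optional[Exception] = None
    for attempt in range(attempts):
        try:
            return fn()
        except Exception as exc:  # real code would catch the client's specific errors
            last_error = exc
            time.sleep(base_delay * (2 ** attempt))
    raise RuntimeError(f"LLM call failed after {attempts} attempts") from last_error


def classify_ticket(text: str, call_llm: Callable[[str], str]) -> str:
    """The only LLM call sits at one fixed point; plain code decides what happens next."""
    raw = with_retries(lambda: call_llm(f"Classify as billing/technical/other: {text}"))
    label = raw.strip().lower()
    # Deterministic guard: free-form model output never drives control flow directly.
    return label if label in ALLOWED_LABELS else "other"


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real provider client (an assumption of this sketch)."""
    return " Billing "
```

The point of the design is that everything around the model call (retries, validation, fallbacks) is code you can unit test without a model in the loop.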

Standards

What everyone is converging on

Three protocols have effectively won, and any framework that doesn't support them is unviable for enterprise work.

Protocol Convergence — May 2026

Model Context Protocol

For tool definitions. Build a tool once as an MCP server and it works across Claude Agent SDK, LangGraph, Pydantic AI, Mastra, MAF, and the rest.

Agent-to-Agent (A2A)

For inter-agent communication. Google ADK leads the implementation; MAF, CrewAI, Strands, and Mastra have followed.

OpenTelemetry

Tracing as the observability spine. LangSmith and Logfire layer on top of it; raw OTLP works everywhere.
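
To make the MCP entry above a little more concrete: MCP messages are JSON-RPC 2.0, a server advertises each tool with a name, a description, and a JSON Schema for its input, and a client invokes it with a `tools/call` request. The sketch below builds such a request by hand using only the standard library. The `get_weather` tool and the `build_tool_call` helper are made up for illustration; a real project would normally use an MCP SDK instead of constructing messages manually.

```python
import json

# A tool descriptor of the kind an MCP server advertises in response to "tools/list".
# The tool itself ("get_weather") is a made-up example.
weather_tool = {
    "name": "get_weather",
    "description": "Return the current temperature for a city.",
    "inputSchema": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}


def build_tool_call(request_id: int, tool: dict, arguments: dict) -> str:
    """Build a JSON-RPC 2.0 "tools/call" request after a basic required-field check."""
    required = tool["inputSchema"].get("required", [])
    missing = [key for key in required if key not in arguments]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return json.dumps(
        {
            "jsonrpc": "2.0",
            "id": request_id,
            "method": "tools/call",
            "params": {"name": tool["name"], "arguments": arguments},
        }
    )
```

Because the descriptor is plain data, the same tool definition can be served to any client that speaks the protocol, which is the portability the convergence claim rests on.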

A few other patterns have become defaults. Model tiering (cheap models triage, expensive models reason) reduces token costs by 40 to 60 percent in production. Sandboxed code execution for code-action agents has consolidated around E2B, Modal, and Docker. And the industry has quietly abandoned dynamic, LLM-driven execution graphs after the 2025 experiments showed catastrophic looping behaviour. The developer defines the graph; the LLM operates inside it.
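
Model tiering is easy to sketch. In the example below, every request pays for a small triage call on the cheap tier, and only requests that triage judges hard go to the expensive tier. The tier names, the prices, and the token counts are invented for illustration and are not taken from any provider's price list.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_1k_tokens: float  # made-up prices, for illustration only


CHEAP = Tier("small-model", 0.001)
EXPENSIVE = Tier("large-model", 0.010)


def route(triage_says_hard: bool) -> Tier:
    """Cheap model triages; only hard requests reach the expensive model."""
    return EXPENSIVE if triage_says_hard else CHEAP


def blended_cost(requests: List[bool], tokens_per_request: int = 1000, triage_tokens: int = 100) -> float:
    """Total USD with tiering. Each item in `requests` is True if triage judged it hard."""
    total = 0.0
    for hard in requests:
        total += CHEAP.usd_per_1k_tokens * triage_tokens / 1000  # triage call
        total += route(hard).usd_per_1k_tokens * tokens_per_request / 1000  # actual work
    return total


def baseline_cost(requests: List[bool], tokens_per_request: int = 1000) -> float:
    """Total USD if every request goes straight to the expensive tier."""
    return len(requests) * EXPENSIVE.usd_per_1k_tokens * tokens_per_request / 1000
```

With these invented prices and 40 hard requests out of 100, the tiered total is about 0.47 against 1.00 for the single-tier baseline. The real saving depends entirely on your traffic mix, triage accuracy, and actual prices.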

Frameworks

The frameworks worth knowing

LangGraph Production Standard

The production standard for stateful workflows with human-in-the-loop steps. Native checkpointing means a workflow can pause mid-execution, wait days for human approval on a financial decision, and resume cleanly. Verified production users include Klarna, Uber, LinkedIn, BlackRock, Cisco, and JPMorgan.
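
The pause-and-resume idea can be sketched without any framework. The example below is not LangGraph's checkpointer API; it is a minimal stand-in using the standard library's `sqlite3`, with a hypothetical `checkpoints` table and hypothetical function names, to show what "persist state, wait for approval, resume" amounts to.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the checkpoint store. Use a file path for real durability."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS checkpoints (thread_id TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    return conn


def save(conn: sqlite3.Connection, thread_id: str, state: dict) -> None:
    # Parameterized query: values are never interpolated into the SQL string.
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (thread_id, state) VALUES (?, ?)",
        (thread_id, json.dumps(state)),
    )
    conn.commit()


def load(conn: sqlite3.Connection, thread_id: str) -> dict:
    row = conn.execute(
        "SELECT state FROM checkpoints WHERE thread_id = ?", (thread_id,)
    ).fetchone()
    if row is None:
        raise KeyError(thread_id)
    return json.loads(row[0])


def run_until_approval(conn: sqlite3.Connection, thread_id: str, amount: int) -> dict:
    """Run the workflow up to the human gate, then persist and stop."""
    state = {"step": "awaiting_approval", "amount": amount, "approved": None}
    save(conn, thread_id, state)
    return state


def resume(conn: sqlite3.Connection, thread_id: str, approved: bool) -> dict:
    """Pick the workflow back up, possibly days later and in a different process."""
    state = load(conn, thread_id)
    if state["step"] != "awaiting_approval":
        raise ValueError("nothing to resume")
    state["approved"] = approved
    state["step"] = "executed" if approved else "rejected"
    save(conn, thread_id, state)
    return state
```

Note that the `":memory:"` default disappears with the process, just like an in-memory checkpointer; pass a file path (or use a server database) when the state has to outlive a restart.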

The friction is real and well documented. State is a TypedDict at best, which means runtime type errors are common and tracing how data moves between nodes is painful. A 2025 Hacker News critique that "you spend so much time fixing stupid runtime type errors because the state of every graph is a JSON blob with very minimal typing" still gets recirculated in 2026 because v1.0 didn't fix it. The default in-memory checkpointer won't survive a container restart, so production deployments need Postgres, Redis, or SQLite backends.
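
The underlying Python behaviour is worth seeing once: a `TypedDict` is only a hint for static type checkers and is not enforced at runtime. The sketch below uses a hypothetical `GraphState` schema and a deliberately buggy node, plus a small hand-written `check_state` helper (also hypothetical, and limited to flat fields with plain types) of the kind teams end up bolting on.

```python
from typing import List, TypedDict, get_type_hints


class GraphState(TypedDict):  # hypothetical state schema
    ticket_id: int
    summary: str


def summarise_node(state: GraphState) -> GraphState:
    """A buggy node: writes a string where an int is declared.

    A static type checker flags this line; the Python runtime accepts it silently.
    """
    return {"ticket_id": "42", "summary": state["summary"].upper()}  # type: ignore[typeddict-item]


def check_state(state: dict, schema: type = GraphState) -> List[str]:
    """Minimal runtime check against the schema (flat fields with plain types only)."""
    problems = []
    for key, expected in get_type_hints(schema).items():
        if key not in state:
            problems.append(f"{key}: missing")
        elif not isinstance(state[key], expected):
            problems.append(
                f"{key}: expected {expected.__name__}, got {type(state[key]).__name__}"
            )
    return problems
```

Running `check_state` on the node's output after every step turns a confusing failure three nodes downstream into an error at the node that caused it.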

March 2026 brought CVE disclosures from Cyera: CVE-2025-67644 (SQL injection in the SQLite checkpointer) and CVE-2025-68664 (high-severity deserialization). All patched, but they reframe LangGraph from "trust by default" to "audit before deploying."

Use for: Long-running stateful workflows, regulated industries, HITL

Avoid for: Simple RAG prototypes, serverless (Vercel/Cloudflare)

Pydantic AI Recommended

The right answer for roughly half of agent projects that don't actually need a graph runtime. Type safety is the headline feature: outputs are validated against Pydantic models, and the framework won't return a response that fails validation. In a 90-day benchmark from early 2026, Pydantic AI caught 23 production data-integrity bugs that other frameworks silently ignored.
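
The validate-before-return contract can be illustrated with the standard library alone. This is not Pydantic AI's API: a frozen dataclass and a hand-written parser stand in for a Pydantic model, and `Invoice`, `parse_invoice`, and `run_validated` are hypothetical names. The shape of the loop is the part that matters: invalid output is fed back to the model, and the caller only ever receives a validated object or an exception.

```python
import json
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Invoice:
    invoice_id: str
    total_cents: int


def parse_invoice(raw: str) -> Invoice:
    """Raise ValueError unless `raw` is JSON with exactly the expected, correctly typed fields."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or set(data) != {"invoice_id", "total_cents"}:
        raise ValueError("expected exactly the fields invoice_id and total_cents")
    if not isinstance(data["invoice_id"], str):
        raise ValueError("invoice_id must be a string")
    total = data["total_cents"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(total, int) or isinstance(total, bool) or total < 0:
        raise ValueError("total_cents must be a non-negative integer")
    return Invoice(invoice_id=data["invoice_id"], total_cents=total)


def run_validated(call_llm: Callable[[str], str], prompt: str, max_attempts: int = 2) -> Invoice:
    """Return only a validated Invoice; on failure, tell the model what was wrong and retry."""
    error = ""
    for _ in range(max_attempts):
        suffix = f"\nPrevious output was invalid: {error}" if error else ""
        try:
            return parse_invoice(call_llm(prompt + suffix))
        except ValueError as exc:
            error = str(exc)
    raise ValueError(f"no valid output after {max_attempts} attempts: {error}")
```
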

Built-in usage limits let engineers cap request tokens, response tokens, and tool calls directly on the agent definition, which prevents the cost overruns that plague more autonomous frameworks. Infrastructure costs in the same benchmark came in at around $390 over 90 days, versus over $1,000 for heavier frameworks running equivalent workloads.
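
For teams not using a framework that ships such limits, the same guard is a few lines of plain Python. The sketch below is a hypothetical `UsageBudget` class, not Pydantic AI's own interface; the agent loop would call `record_model_call` and `record_tool_call` after each step and let the exception end the run.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised as soon as any configured cap is crossed."""


@dataclass
class UsageBudget:
    max_request_tokens: int
    max_response_tokens: int
    max_tool_calls: int
    request_tokens: int = 0
    response_tokens: int = 0
    tool_calls: int = 0

    def record_model_call(self, request_tokens: int, response_tokens: int) -> None:
        self.request_tokens += request_tokens
        self.response_tokens += response_tokens
        self._check()

    def record_tool_call(self) -> None:
        self.tool_calls += 1
        self._check()

    def _check(self) -> None:
        if self.request_tokens > self.max_request_tokens:
            raise BudgetExceeded("request token cap exceeded")
        if self.response_tokens > self.max_response_tokens:
            raise BudgetExceeded("response token cap exceeded")
        if self.tool_calls > self.max_tool_calls:
            raise BudgetExceeded("tool call cap exceeded")
```

A hard cap like this fails loudly instead of letting a looping agent keep spending, which is the behaviour the cost comparison depends on.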

The honest weakness: Pydantic AI is stateless by default and has no native multi-agent orchestration. The emerging pattern in 2026 is LangGraph as the orchestrator with Pydantic AI inside each node — explicit graph state, type-safe agent behaviour.

Use for: Type-safe single-agent backends, cost-sensitive deployments

Avoid for: Long-running stateful orchestration without a graph layer

CrewAI Prototype Only

The fastest path from blank file to working multi-agent prototype. Define agents with role, goal, and backstory, assemble them into a Crew, and you have a demoable system in an afternoon. Enterprise adoption is real, with PwC, DocuSign, and IBM among the named users.

The production picture is less flattering. GitHub issue #3154 documents agents producing valid-looking Thought-Action-Observation traces without actually calling the tool, fabricating an observation and continuing to the final answer. Reproducible on GPT-4 and Qwen. Issue #3202 documents dependency hell on macOS through ChromaDB's onnxruntime constraints. Role-based prompts add roughly 30 to 50 percent to token counts versus equivalent hand-tuned graphs.
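
One framework-independent defence against fabricated observations is to execute tools through a wrapper that records what actually ran, then reject any trace step that has no matching record. The sketch below is an illustration of that idea only; `ToolLedger` and `unverified_steps` are hypothetical names, and nothing here is CrewAI's API.

```python
from typing import Callable, Dict, List, Tuple

Step = Tuple[str, str, str]  # (tool name, tool input, observation)


class ToolLedger:
    """Runs tools and records every execution that really happened."""

    def __init__(self, tools: Dict[str, Callable[[str], str]]) -> None:
        self._tools = tools
        self.executed: List[Step] = []

    def call(self, name: str, arg: str) -> str:
        output = self._tools[name](arg)
        self.executed.append((name, arg, output))
        return output


def unverified_steps(trace: List[Step], ledger: ToolLedger) -> List[Step]:
    """Return the trace steps whose claimed observation has no matching real execution."""
    remaining = list(ledger.executed)
    suspicious = []
    for step in trace:
        if step in remaining:
            remaining.remove(step)  # each real execution can vouch for one step only
        else:
            suspicious.append(step)
    return suspicious
```

If `unverified_steps` returns anything, the run should be failed or retried, not passed on as a final answer.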

Use for: Rapid prototyping, content pipelines, demos

Avoid for: Compliance-sensitive workloads, latency-bound systems

OpenAI Agents SDK Solid Choice

The official successor to Swarm. Minimal abstraction, clean handoff semantics, native integration with the Responses API, sessions, guardrails, and tracing. The codebase is small enough to read in an afternoon, which is unusually honest by 2026 standards. Version 0.14 added Sandbox Agents with first-class filesystem workspaces for long-horizon coding tasks.

Practitioners consistently report that it "sucks at delegation" relative to mature open-source frameworks: it struggles with complex hierarchical task distribution unless routing is hardcoded explicitly. The "provider-agnostic" claim is technically true via LiteLLM, but the developer experience is unmistakably tuned for OpenAI.

Use for: OpenAI-native stacks, thin orchestration layers

Avoid for: Maximal portability, multi-provider cost tiering

Claude Agent SDK Accuracy-First

The accuracy-first option for teams committed to Anthropic. Native support for extended thinking, desktop browser control, vision, and MCP integration. The SDK compiles organisational skills into compressed CLAUDE.md files that meaningfully improve tool invocation precision. Alice Labs ranked it the premier choice for safety-critical environments in their April 2026 assessment of 18 production deployments.

Vendor lock-in is absolute. Code written against the Claude Agent SDK requires a complete rewrite to run against any other model. A June 2026 policy change moves usage to a separate monthly credit pool, which complicates enterprise budget forecasting.

Use for: Deep coding assistants, automated desktop operators, Anthropic-mandated environments

Avoid for: Cost-tiering architectures, multi-model deployments

Microsoft Agent Framework Azure Standard

Generally available since April 2026, MAF replaces both AutoGen (now in maintenance mode) and Semantic Kernel (now officially absorbed). Graph-based workflow runtime, declarative YAML agent definitions, native .NET telemetry, Azure Monitor integration, Azure Identity for auth. Migration assistants for AutoGen and Semantic Kernel codebases work well enough that most existing Microsoft shops can convert in weeks.

GitHub Issue #2329 documents a serious design flaw in max_invocations state scope that complicates deployment in standard long-running web servers. The community goodwill Microsoft burned by deprecating AutoGen is recoverable but has not been recovered.

Use for: Heavily regulated Azure-native environments, .NET shops

Avoid for: Greenfield Python projects with no Microsoft dependency

Mastra TypeScript-Native

The TypeScript-native framework that finally gives full-stack web teams a credible option. Created by the team behind Gatsby. Hit 1.0 in January 2026 with around 19,800 GitHub stars and 300,000 weekly npm downloads. Production references include Replit Agent 3, PayPal, Sanity, and Marsh McLennan.

Mastra covers most of LangGraph's feature surface (cyclical graph workflows, durable execution, HITL approvals) while remaining compatible with serverless platforms like Vercel and Cloudflare Workers, which LangGraph is not. Native context compression via Observational Memory removes most of the token-management code that LangGraph requires you to write.

Use for: Next.js, SvelteKit, Remix products embedding AI

Avoid for: Heavy offline data engineering, scientific computing, Python stacks

Google ADK 2.0 Multi-Language Leader

The only OSS framework with serious multi-language support. Python, Go, Java, and TypeScript with feature parity. ADK 2.0 added a graph-based Workflow Runtime with routing, fan-out, loops, retry, dynamic nodes, HITL, and nested workflows. The deepest A2A protocol implementation in the field. Production users include AWS Transform for .NET, Tyson Foods, and Gordon Food Service.

Gemini is the happy path. Non-Gemini models work through LiteLLM but need more configuration. Breaking changes between ADK 1.x and 2.0 are real, particularly around session schemas.

Use for: GCP-native deployments, multi-language enterprise stacks, A2A-heavy architectures

Avoid for: AWS-only or Azure-only shops

Strands Agents AWS-Native

AWS's model-driven SDK, already powering Amazon Q Developer, AWS Glue, and the VPC Reachability Analyzer. The pitch is radical simplicity: define an agent with a prompt and tools, let the LLM plan, and deploy on Bedrock AgentCore. Production users include Smartsheet, Swisscom, and Verisk Analytics.

"As someone who builds agents with LangGraph daily at work, Strands was a genuine surprise. The model-driven approach cut my setup from 40 lines to 3, and for the 80 percent case, it just works without sacrificing flexibility." — Tetiana Mostova, AWS Community Builder

Use for: AWS-native shops, Bedrock deployments

Avoid for: Multi-cloud teams, GCP-heavy architectures

Smolagents Research / Sandboxed

The Hugging Face library that bet on code-as-action instead of JSON tool calls. The agent writes and executes Python in a sandboxed interpreter. Under 1,000 lines of core framework code, which means there's almost nowhere for hidden bugs to hide. Pooya Golchian's April 2026 benchmark on 200 medium-complexity tasks ranks LangGraph at 76%, Smolagents at 73%, CrewAI at 71%, and AutoGen at 68%, run against Qwen3 32B through Ollama.

The risk is obvious: code execution. The project explicitly warns that LocalPythonExecutor is not a security boundary. An April 2026 issue described severe RCE risk without proper isolation.

Use for: Research agents, data analysis, self-hosted sandboxed setups

Avoid for: Compliance-bound systems, any production path without hard isolation

LiveKit Agents Voice Standard

The de-facto standard for real-time voice and video agents. WebRTC infrastructure plus a streaming STT-LLM-TTS pipeline. Customers include Spotify, Meta, Microsoft, Character.AI, and Speak. Provider-agnostic across STT (Deepgram, AssemblyAI, Whisper), LLM (any), and TTS (Cartesia, ElevenLabs, Azure). Above roughly 10,000 minutes per month, self-hosting undercuts managed platforms like Vapi and Retell by 60 to 80 percent.

Voice latency is hard, and the framework doesn't change that. Hamming AI's January 2026 analysis across four million calls reports industry median latency of 1.4 to 1.7 seconds, with p99 between 3 and 5 seconds. Hitting sub-500ms perceived latency requires streaming at every stage, co-located regions, and pre-warmed contexts.
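
A toy model shows why streaming at every stage matters so much. Without streaming, the caller waits for each stage to finish completely; with streaming, each stage starts as soon as the previous one emits its first chunk. The stage timings below are invented for illustration, and the model ignores network hops and endpointing delays.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Stage:
    name: str
    first_chunk_ms: int  # time until the stage emits its first usable output
    total_ms: int        # time until the stage has finished completely


# Invented timings, for illustration only.
PIPELINE = [
    Stage("stt", first_chunk_ms=150, total_ms=400),
    Stage("llm", first_chunk_ms=200, total_ms=900),
    Stage("tts", first_chunk_ms=100, total_ms=600),
]


def sequential_latency_ms(stages: List[Stage]) -> int:
    """No streaming: audio is delivered only after every stage has fully finished."""
    return sum(stage.total_ms for stage in stages)


def streaming_latency_ms(stages: List[Stage]) -> int:
    """Full streaming: each stage starts on the previous stage's first chunk."""
    return sum(stage.first_chunk_ms for stage in stages)
```

With these made-up numbers the non-streaming path takes 1,900 ms before the caller hears anything, while the fully streamed path takes 450 ms. A single non-streaming stage in the middle adds its whole duration back onto the perceived latency.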

Use for: Voice support, telehealth, real-time translation

Avoid for: General agent orchestration; pair with LangGraph or Pydantic AI for tools

Letta (formerly MemGPT) Memory-Specialist

Memory-first runtime built on the MemGPT hierarchy. Three-tier memory (core, recall, archival) with self-editing memory tools. Benchmark claim: maintains task context across 500-plus interactions versus RAG baselines that fragment after 50. Letta Code, released December 2025, ranks first on Terminal-Bench among model-agnostic open-source coding agents.
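
The three tiers are easier to reason about as a data structure. The sketch below is a deliberately simplified toy, not Letta's implementation or API: core memory is a small dictionary that is always placed in the prompt, recall memory is the full message history, and archival memory is a long-term store the agent writes to on purpose. All class and method names are hypothetical, and substring search stands in for whatever retrieval a real system would use.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TieredMemory:
    core: Dict[str, str] = field(default_factory=dict)  # always in the prompt
    recall: List[str] = field(default_factory=list)     # full conversation history
    archival: List[str] = field(default_factory=list)   # long-term notes, searched on demand

    # The next two methods play the role of "self-editing memory tools":
    # the agent, not the developer, decides when to call them.
    def remember_fact(self, key: str, value: str) -> None:
        self.core[key] = value

    def archive(self, note: str) -> None:
        self.archival.append(note)

    def add_message(self, message: str) -> None:
        self.recall.append(message)

    def search_recall(self, term: str) -> List[str]:
        return [m for m in self.recall if term.lower() in m.lower()]

    def search_archival(self, term: str) -> List[str]:
        return [n for n in self.archival if term.lower() in n.lower()]

    def prompt_context(self, window: int = 3) -> str:
        """Core facts plus only the most recent messages; older ones stay searchable."""
        facts = "\n".join(f"{k}: {v}" for k, v in sorted(self.core.items()))
        recent = "\n".join(self.recall[-window:])
        return f"{facts}\n---\n{recent}"
```

Even in this toy form the cost model is visible: every call to `remember_fact` or `archive` is a decision the model has to make, which is where the extra inference tokens come from.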

Lock-in is high. Every memory operation costs inference tokens because the agent reasons about what to store. Vectorize.io's 2026 analysis: "Letta owns your agent loop. Switching means rebuilding the loop, tool execution, state management, and memory logic elsewhere." For most teams, Mem0 as a memory layer on top of LangGraph or Pydantic AI is the lighter option.

Use for: Persistent-memory companions, long-horizon agents where memory is the product

Avoid for: Everything else; use Mem0 on LangGraph instead

Legacy

Frameworks losing momentum

LangChain core

Increasingly viewed as glue rather than orchestration. The LangChain Expression Language is widely treated as a debugging hazard. March 2026 CVEs reinforced that this is a library to audit, not trust. Still useful as integration glue. Not the right runtime for new production agents.

AutoGen

In maintenance mode per Microsoft's own README, with new users directed to MAF. The community fork AG2 continues but has not produced strong 2026 production evidence. No greenfield project should start here.

Semantic Kernel

Officially folded into MAF. The repo banner says so. Treat it as predecessor lineage only.

Bee Agent Framework

Donated by IBM to the Linux Foundation in 2025, with IBM explicitly disclaiming maintenance. Survives but is not enterprise-invested.

OpenClaw

Viral adoption in early 2026 ended with the May 2026 "Claw Chain" disclosure: CVE-2026-44112 through 44118 — TOCTOU race conditions, bearer-token spoofing, and one-click RCE through prompt injection. Over 42,000 instances were found exposed to the public internet. The framework is functionally dead for any deployment touching a network. The lesson generalises: ease of configuration is not a substitute for security hygiene.

Analysis

Real differentiators versus marketing

Genuine differentiators

Durable execution depth — LangGraph and MAF lead; Mastra and Letta close behind
Language ecosystem — Mastra for TypeScript, ADK for multi-language, MAF for .NET
Memory architecture — Letta is genuinely different; most others are RAG with extra steps
Type safety — Pydantic AI is uncontested
Code-as-action — Smolagents and OpenAI Agents SDK Sandbox Agents
Real-time infrastructure — LiveKit, alone in its category

Marketing, not architecture

"Tool calling support" — every framework has it
"Multi-agent support" — depth varies wildly; the claim is meaningless
Benchmark percentage points within ±5 — model choice dominates at that margin
Elaborate role-playing parameters (CrewAI backstories)
Unverified speed claims (the Agno "10,000x" masked initialisation deferral)
Conversational debate loops consuming tokens with diminishing returns
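
The point about benchmark margins can be checked with a little arithmetic. Taking the 200-task benchmark cited earlier as the example, the sketch below computes an approximate 95 percent margin of error for a pass rate using the normal approximation to the binomial. It assumes tasks are independent and that each framework's run is an independent sample, which is a simplification.

```python
import math


def margin_of_error(pass_rate: float, n_tasks: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% confidence interval for a pass rate."""
    return z * math.sqrt(pass_rate * (1.0 - pass_rate) / n_tasks)


def distinguishable(rate_a: float, rate_b: float, n_tasks: int, z: float = 1.96) -> bool:
    """True if the gap exceeds z standard errors of the difference of two independent rates."""
    standard_error = math.sqrt(
        rate_a * (1.0 - rate_a) / n_tasks + rate_b * (1.0 - rate_b) / n_tasks
    )
    return abs(rate_a - rate_b) > z * standard_error
```

For a 76 percent pass rate on 200 tasks, `margin_of_error(0.76, 200)` comes out at roughly 0.059, or about six percentage points either side. Under these assumptions, neither the 76 versus 73 gap nor the 76 versus 68 gap passes the `distinguishable` check.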

Decision Guide

Picking a framework

The decision is rarely "which framework is best." It's "which framework matches my language, cloud, and complexity constraints." Pick the language first, the cloud and model second, the agent complexity third. Reversing them is how teams end up rewriting their stack six months in.

Scenario	Primary	Secondary
Long-running, stateful, regulated with HITL	LangGraph	MAF
Microsoft, .NET, or Azure enterprise	Microsoft Agent Framework	LangGraph
GCP, multi-language, or A2A-heavy	Google ADK	MAF
AWS-native, Bedrock	Strands Agents	LangGraph
OpenAI-first, thin SDK	OpenAI Agents SDK	Pydantic AI
TypeScript or Next.js product	Mastra	Vercel AI SDK + Pydantic AI
Voice (support, telephony, telehealth)	LiveKit Agents	OpenAI Agents SDK Realtime
Document Q&A or enterprise search	LlamaIndex Workflows	LangGraph + LlamaParse
Type-safe single-agent backend	Pydantic AI	OpenAI Agents SDK
Persistent-memory companion	Letta	Mem0 on LangGraph
Fast role-based prototype (<1 week)	CrewAI	Agno
High-throughput swarm (content, social)	Agno	CrewAI
Research, data analysis, code-action	Smolagents	OpenAI Agents SDK Sandbox
Anthropic-mandated environment	Claude Agent SDK	—
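
The ordering rule (language first, cloud second, complexity third) can be written down as a function. The sketch below is a deliberately coarse reading of the table above that covers only a handful of its rows; the function name and parameter names are hypothetical, and it is meant to show the order of the checks, not to replace the table.

```python
from typing import Optional


def pick_framework(
    language: str,
    cloud: Optional[str] = None,
    needs_stateful_hitl: bool = False,
) -> str:
    """Apply the constraints in order: language, then cloud, then agent complexity."""
    language = language.lower()

    # 1. Language decides first.
    if language == "typescript":
        return "Mastra"
    if language in {".net", "c#"}:
        return "Microsoft Agent Framework"

    # 2. Then the cloud the team is committed to.
    by_cloud = {
        "azure": "Microsoft Agent Framework",
        "gcp": "Google ADK",
        "aws": "Strands Agents",
    }
    if cloud is not None and cloud.lower() in by_cloud:
        return by_cloud[cloud.lower()]

    # 3. Only then does agent complexity break the tie.
    return "LangGraph" if needs_stateful_hitl else "Pydantic AI"
```

For example, `pick_framework("python", needs_stateful_hitl=True)` returns "LangGraph", while `pick_framework("typescript", cloud="aws")` returns "Mastra" because the language check runs before the cloud check.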

5x Performance - Blog

The AI Agent Framework Shakeout: What Actually Works in Production, May 2026
