The AI Agent Stack in 2026: Every Layer, Who's Winning, and Where the Margin Actually Lives

The agentic AI market hit $47 billion in 2025 spending, and most of it went to infrastructure nobody can name. Behind every AI agent demo is a seven-layer stack of orchestration frameworks, memory systems, tool integrations, guardrails, and observability platforms — each layer with its own margin structure and competitive dynamics. Here's the definitive map.

By Erik Sundberg, Developer Tools · Apr 9, 2026 · 18 min read

The agentic AI market hit $47 billion in 2025 spending. You read that number and picture OpenAI and Anthropic splitting the spoils. In reality, the bulk of most enterprise AI budgets goes not to the model API bill but to the infrastructure around it: the orchestration frameworks, vector databases, guardrail layers, and observability platforms that turn a prototype agent into something you can actually ship to customers.

I have spent the last 14 months building production AI agents across three companies. The most important thing I learned is that the model is the least interesting part of the stack. It is also, increasingly, the lowest-margin part. The real money — and the real competitive moats — live in layers most executives cannot name.

This is the definitive map of the AI agent stack in 2026: seven layers, the vendors competing in each, where the margin actually accrues, and where you should build versus buy.

The Seven-Layer Agent Stack

Before we go layer by layer, here is the full picture. Every production AI agent, whether it is an internal copilot routing support tickets or a customer-facing research assistant, touches all seven of these layers. Skip one and you either cannot ship or cannot scale.

Layer	Function	Key Vendors	Margin Profile
1. Foundation Models	Core reasoning and generation	OpenAI, Anthropic, Google, Meta (Llama), Mistral	15-25% (declining)
2. Orchestration	Agent logic, routing, multi-step workflows	LangChain, CrewAI, AutoGen, custom builds	55-70% (expanding)
3. Memory & State	Context persistence, retrieval, vector search	Pinecone, Weaviate, Chroma, pgvector	40-55%
4. Tool Use & MCP	External system integration, API calls	Anthropic MCP, OpenAI function calling, Composio	35-50%
5. Guardrails & Safety	Output validation, policy enforcement, filtering	Guardrails AI, Lakera, custom validation layers	60-75% (highest)
6. Observability	Tracing, evaluation, debugging, cost tracking	LangSmith, Braintrust, Helicone, Arize	50-65%
7. Deployment & Infra	Hosting, scaling, serverless execution	Modal, Fly.io, AWS Lambda, Cloudflare Workers	30-45%

The thesis of this piece is simple: most margin accrues to the orchestration and guardrails layers, not the model layer. The foundation model providers are in a brutal price war that erodes margins quarterly. The companies wrapping those models in workflow logic and safety validation are building the actual defensible businesses.

Let us go layer by layer.

Layer 1: Foundation Models — The Commodity Engine

The foundation model layer is where 90% of the press coverage goes and where the least interesting economics live. OpenAI, Anthropic, Google, and the open-source community (Meta's Llama, Mistral, Cohere) are in a race that looks increasingly like cloud infrastructure circa 2015: differentiation is narrowing, prices are falling, and the winner is determined by distribution and ecosystem lock-in, not raw capability.

The price collapse is real. GPT-4 inference cost approximately $30 per million input tokens (and $60 per million output tokens) at launch in March 2023. By Q1 2026, equivalent capability costs $0.80 per million tokens from multiple providers. Measured against the $30 figure, that is a 97% decline in three years. Anthropic's Claude Sonnet 4 and Google's Gemini 2.5 Flash have pushed the price floor even lower for high-volume applications.
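For readers who want to check the arithmetic, here is a small sketch that takes the two price points quoted above at face value and derives both the total decline and the constant yearly rate that would compound to it. The three-year window is the article's rounding, not an exact date difference.

```python
# Price points quoted in the text, in USD per million tokens.
launch_price = 30.00   # GPT-4-class inference, March 2023
current_price = 0.80   # equivalent capability, Q1 2026
years = 3              # rounded window used in the text

# Total decline over the whole period.
total_decline = 1 - current_price / launch_price

# Constant yearly rate of decline that compounds to the same total.
annual_decline = 1 - (current_price / launch_price) ** (1 / years)

print(f"total decline:  {total_decline:.1%}")
print(f"annual decline: {annual_decline:.1%}")
```

The total works out to just over 97%, which implies prices falling by roughly 70% per year if the decline had been steady.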

Model	Provider	Cost per 1M Output Tokens (Q1 2026)	Context Window	Agent Suitability
GPT-4.1	OpenAI	$2.00	1M tokens	High — strong tool use
Claude Sonnet 4	Anthropic	$1.50	200K tokens	Very high — best agentic behavior
Claude Opus 4	Anthropic	$6.00	200K tokens	Very high — complex reasoning
Gemini 2.5 Pro	Google	$1.25	1M tokens	High — multimodal strength
Gemini 2.5 Flash	Google	$0.30	1M tokens	Moderate — speed optimized
Llama 4 Maverick	Meta (self-hosted)	$0.40-0.80 (infra cost)	1M tokens	Moderate — improving rapidly
Mistral Large 2	Mistral	$1.10	128K tokens	Moderate

Build vs. buy at this layer: Buy. Unless you are a foundation model lab or have extremely specialized domain requirements (and even then, fine-tuning is usually sufficient), training your own model is a $50M+ bet with uncertain payoff. The buy decision here is straightforward — the real question is which provider and whether you architect for multi-model routing.

Where lock-in is real vs. imagined: Largely imagined. Model switching costs are falling fast. The real lock-in is not to a model but to a model's tool-calling format, system prompt conventions, and latency profile. If you abstract the model behind an orchestration layer (Layer 2), switching providers is a configuration change, not a rewrite.

The Multi-Model Reality

The smartest teams in production are not loyal to one provider. They route queries dynamically: Gemini Flash for simple classification, Claude Sonnet for complex reasoning, GPT-4.1 for structured output generation. This multi-model approach reduces cost by 40-60% compared to routing everything through a single frontier model and improves reliability through fallback chains.
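A minimal sketch of routing with a fallback chain, assuming three hypothetical model tiers ("fast", "reasoning", "structured") and stub functions in place of real provider SDK calls. The `CHAINS` table, the `Route` type, and the tier names are illustrative assumptions, not any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Route:
    name: str                    # label for a model tier, e.g. "fast"
    call: Callable[[str], str]   # sends a prompt to a provider, returns text


class AllRoutesFailed(RuntimeError):
    """Raised when every model in the fallback chain errors out."""


# Hypothetical routing table: task type -> ordered fallback chain of tiers.
CHAINS = {
    "classification": ["fast", "reasoning"],
    "reasoning": ["reasoning", "structured"],
    "structured_output": ["structured", "reasoning"],
}


def run_with_fallback(task_type: str, prompt: str, routes: dict[str, Route]) -> tuple[str, str]:
    """Try each tier in order; return (tier_name, response) from the first success."""
    errors = []
    for tier in CHAINS.get(task_type, ["reasoning"]):
        try:
            return tier, routes[tier].call(prompt)
        except Exception as exc:  # in production, catch provider-specific errors
            errors.append(f"{tier}: {exc}")
    raise AllRoutesFailed("; ".join(errors))


# Stub providers stand in for real SDK calls.
def flaky_fast(prompt: str) -> str:
    raise TimeoutError("simulated provider timeout")


def steady_reasoning(prompt: str) -> str:
    return f"[reasoning] {prompt}"


routes = {
    "fast": Route("fast", flaky_fast),
    "reasoning": Route("reasoning", steady_reasoning),
    "structured": Route("structured", lambda p: f"[structured] {p}"),
}

tier, answer = run_with_fallback("classification", "Is this ticket urgent?", routes)
print(tier, answer)
```

Here the cheap tier times out and the request lands on the next tier in the chain. Keeping the chain in a plain table is one way to make a provider swap a configuration change rather than a code change.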

The catch: multi-model routing requires a robust orchestration layer, which brings us to the most underrated part of the stack.

Layer 2: Orchestration — Where the Real Product Lives

If the foundation model is the engine, orchestration is the car. It is the layer that determines what agents actually do: how they plan multi-step tasks, when they call tools, how they handle failures, and how they coordinate with other agents. This is where product differentiation lives, and it is where margins are highest after guardrails.

The vendor landscape is fragmented and moving fast:

LangChain / LangGraph — The incumbent. LangChain's original chaining abstraction was the right idea at the wrong level of abstraction. LangGraph, their graph-based agent framework, is substantially better. Approximately 68% of production agent deployments in 2025 touched LangChain at some point, though many teams are migrating to LangGraph or building custom. LangChain Inc. (the company) has raised $45M and generates revenue through LangSmith (observability, covered in Layer 6).
CrewAI — Multi-agent orchestration with a focus on role-based agent design. Strong for use cases where you need specialized agents collaborating — a researcher agent, a writer agent, a reviewer agent. Growing fast in content and research automation. Open-source core with a commercial platform.
Microsoft AutoGen — Microsoft's multi-agent framework, tightly integrated with Azure. Strong enterprise distribution but opinionated architecture that works best within the Microsoft ecosystem. AutoGen 0.4's event-driven architecture was a significant improvement.
Custom builds — An increasing number of mature teams (Stripe, Notion, Replit) are building custom orchestration from scratch. The argument: frameworks add abstraction overhead, and when your agent's logic is your product's core IP, you do not want to depend on a third party's architectural decisions.

Orchestration Option	Strengths	Weaknesses	Best For
LangGraph	Mature ecosystem, large community, flexible graph model	Abstraction complexity, steep learning curve	Teams scaling existing LangChain code
CrewAI	Intuitive multi-agent design, rapid prototyping	Less battle-tested at scale, smaller community	Multi-agent workflows, content automation
AutoGen	Enterprise-ready, Azure integration, event-driven	Microsoft ecosystem dependency, verbose config	Enterprise teams on Azure
Custom	Full control, no abstraction tax, fits exact needs	High engineering cost, maintenance burden	Companies where agent logic is core IP

Build vs. buy: This is the most nuanced decision in the stack. If your agent's orchestration logic is undifferentiated (e.g., a simple RAG chatbot), use a framework. If the orchestration logic is your product — how you route, retry, decompose tasks, and coordinate agents — build custom. The framework's abstractions will eventually become constraints.
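To make "build custom" concrete, here is a deliberately small sketch of a graph-style orchestrator in plain Python. It is not LangGraph's API or any framework's; the `run_graph` function, the node names, and the step budget are assumptions chosen to show the shape of the idea: nodes mutate shared state and name their successor.

```python
from typing import Callable, Optional

State = dict
# A node mutates the shared state and returns the next node's name, or None to stop.
Node = Callable[[State], Optional[str]]


def run_graph(nodes: dict[str, Node], start: str, state: State, max_steps: int = 20) -> State:
    """Walk the graph from `start` until a node returns None or the step budget runs out."""
    current: Optional[str] = start
    for _ in range(max_steps):
        current = nodes[current](state)
        if current is None:
            return state
    raise RuntimeError(f"step budget of {max_steps} exhausted")


def plan(state: State) -> Optional[str]:
    state["todo"] = ["search", "summarize"]
    state["done"] = []
    return "act"


def act(state: State) -> Optional[str]:
    # Take the next planned step and mark it as done.
    state["done"].append(state["todo"].pop(0))
    return "review"


def review(state: State) -> Optional[str]:
    # Loop back to "act" while work remains; otherwise stop.
    return "act" if state["todo"] else None


final = run_graph({"plan": plan, "act": act, "review": review}, "plan", {})
print(final["done"])
```

The step budget matters more than it looks: without it, a review node that never stops sending work back becomes an unbounded bill.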

Where lock-in is real vs. imagined: Real and significant. Orchestration frameworks impose architectural patterns. LangGraph's state graph model, CrewAI's role-agent-task hierarchy, AutoGen's event-driven agent model — these are not interchangeable. Migrating from one to another is effectively a rewrite of your agent logic. Choose carefully.

> The framework you pick at the orchestration layer is a two-year commitment. Treat it like choosing a database, not choosing a library.

Layer 3: Memory & State — The Agent's Brain

Stateless agents are demos. Production agents need memory: what happened in previous conversations, what documents they have indexed, what user preferences they have learned, what tasks are in progress. The memory layer is where agents go from interesting to useful.

This layer has two components: vector databases for semantic retrieval (the "long-term memory") and state management for conversation and task state (the "working memory").

The vector database market has consolidated faster than expected:

Pinecone — The market leader in managed vector search. Simple API, reliable performance, strong enterprise features. Raised $138M at a $750M valuation. The default choice for teams that want managed infrastructure.
Weaviate — Open-source with a strong managed offering. Differentiates on hybrid search (combining vector and keyword search) and built-in multi-tenancy. Strong in European markets.
Chroma — The developer favorite. Open-source, embeddable, simple. Excellent for prototyping and small-to-medium scale. Raised $25M in 2024.
pgvector — PostgreSQL extension for vector similarity search. The "good enough" option for teams already running Postgres. No new infrastructure, no new vendor, no new bill. Increasingly popular as its performance has improved.
Qdrant — Open-source, Rust-based, optimized for performance. Growing developer community and a strong managed cloud offering.

Vector DB	Hosted/Self-Hosted	Query Latency (p99)	Max Vectors	Starting Price
Pinecone	Hosted	~50ms	1B+	$70/mo (Starter)
Weaviate Cloud	Both	~65ms	1B+	$25/mo
Chroma	Both	~40ms (embedded)	~10M (practical)	Free (OSS)
pgvector	Self-hosted	~80ms	~50M (practical)	Free (extension)
Qdrant	Both	~45ms	1B+	Free (OSS)

Build vs. buy: Buy a vector database, build your memory architecture. The vector DB is a commodity storage layer — what matters is how you chunk documents, when you retrieve, how you rank results, and how you manage context windows. Those decisions are product decisions that no vendor will make well for you.
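The decisions listed above (chunking, retrieval, ranking) can be sketched in a few lines. This is a toy under stated assumptions: `embed` is a bag-of-words stand-in for a real embedding model, the chunk size and overlap are arbitrary, and a real system would query a vector database instead of sorting a Python list.

```python
import math
from collections import Counter


def chunk_words(text: str, size: int = 40, overlap: int = 10) -> list[str]:
    """Split text into word windows of `size`, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    words = text.split()
    if not words:
        return []
    step = size - overlap
    last_start = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, last_start, step)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank chunks by similarity to the query and keep the best k."""
    q = embed(query)
    return sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)[:k]


docs = [
    "refund policy allows returns within thirty days",
    "shipping takes five business days",
    "refund requests need an order number",
]
print(top_k("how do refund returns work", docs, k=1))
```

Every parameter here (window size, overlap, k, the similarity function) is a product decision that changes what the agent sees, which is the argument for owning this code even when the storage underneath is rented.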

Where lock-in is real vs. imagined: Mostly imagined. Vector databases have converging APIs and similar capabilities. The real lock-in risk is in your chunking and embedding strategy — if you change embedding models, you need to re-embed your entire corpus regardless of which vector DB you use.

The State Management Gap

The less discussed but equally important half of this layer is state management for agentic workflows. When an agent is executing a multi-step task (researching a topic, generating a report, iterating based on feedback), it needs working memory: what steps have been completed, what intermediate results exist, what the current plan is.

Most teams are hacking this with Redis, DynamoDB, or plain JSON files. There is no dominant solution yet, which is why companies like Letta (formerly MemGPT) and Zep are gaining traction with purpose-built agent memory systems that handle both long-term retrieval and short-term state.
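A sketch of what that working memory can look like when it is made explicit rather than scattered across ad hoc keys. The `TaskState` fields and the `save`/`load` helpers are hypothetical, and the plain dict `store` stands in for Redis, DynamoDB, or whatever key-value store a team already runs.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class TaskState:
    task_id: str
    plan: list[str]                                         # ordered steps for this task
    completed: list[str] = field(default_factory=list)      # steps finished so far
    results: dict[str, str] = field(default_factory=dict)   # intermediate outputs by step

    def record(self, step: str, result: str) -> None:
        """Mark a planned step as done and keep its intermediate result."""
        if step not in self.plan:
            raise ValueError(f"unknown step: {step}")
        self.completed.append(step)
        self.results[step] = result

    def next_step(self) -> Optional[str]:
        """Return the first planned step not yet completed, or None when finished."""
        for step in self.plan:
            if step not in self.completed:
                return step
        return None


def save(store: dict, state: TaskState) -> None:
    # Serialize to JSON so the state survives a process restart.
    store[state.task_id] = json.dumps(asdict(state))


def load(store: dict, task_id: str) -> TaskState:
    return TaskState(**json.loads(store[task_id]))


store: dict = {}  # stand-in for an external key-value store
state = TaskState("report-1", ["research", "draft", "revise"])
state.record("research", "3 sources collected")
save(store, state)

resumed = load(store, "report-1")
print(resumed.next_step())
```

The point of the round trip is resumability: a fresh worker can pick up the task from the stored state and continue with the drafting step.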

Layer 4: Tool Use & MCP — The Hands of the Agent

An agent that can only generate text is a chatbot. An agent that can call tools — search the web, query databases, create documents, send emails, update CRMs — is actually useful. The tool use layer is where agents interact with the real world, and it is the layer undergoing the most rapid standardization thanks to one protocol: MCP.

The Rise of MCP

Anthropic's Model Context Protocol (MCP) has emerged as the de facto standard for connecting AI agents to external tools and data sources. Released as an open standard in late 2024, MCP defines a universal interface between AI agents and the systems they interact with — similar to how USB standardized hardware connections or how REST standardized web APIs.

By Q1 2026, MCP adoption has reached critical mass:

- Over 3,000 MCP servers published in registries
- Native MCP support in Claude, GPT-4.1, Gemini, and most major orchestration frameworks
- Enterprise adoption accelerating, with Salesforce, Atlassian, Slack, and GitHub all shipping official MCP servers

Why MCP matters: Before MCP, every agent-tool integration was bespoke. Connecting an agent to Salesforce required custom code. Connecting to Jira required different custom code. Connecting to a database required yet more custom code. Each integration was fragile, undocumented, and incompatible with other agents.

MCP changes this by providing a standard protocol that any tool can implement once and any agent can consume. The same Salesforce MCP server works with Claude, with a LangGraph agent, with a CrewAI crew, and with your custom-built agent. This is a genuine interoperability breakthrough.
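To show what "implement once, consume anywhere" means on the wire, here is a sketch of the request messages an MCP client sends. As I understand the published specification, MCP messages are JSON-RPC 2.0 and tools are discovered and invoked through the `tools/list` and `tools/call` methods. The tool name `crm_lookup_account` and its argument are hypothetical, and transport, initialization, and response handling are left out.

```python
import json
from itertools import count
from typing import Optional

_ids = count(1)  # JSON-RPC request ids; each request needs a unique one


def jsonrpc_request(method: str, params: Optional[dict] = None) -> dict:
    """Build a JSON-RPC 2.0 request envelope."""
    message = {"jsonrpc": "2.0", "id": next(_ids), "method": method}
    if params is not None:
        message["params"] = params
    return message


def list_tools() -> dict:
    """Ask the server which tools it exposes."""
    return jsonrpc_request("tools/list")


def call_tool(name: str, arguments: dict) -> dict:
    """Invoke one tool by name with JSON arguments."""
    return jsonrpc_request("tools/call", {"name": name, "arguments": arguments})


# Hypothetical tool exposed by a CRM server.
request = call_tool("crm_lookup_account", {"account_id": "A-1001"})
wire = json.dumps(request)
print(wire)
```

Nothing in the message names a model or an orchestration framework, which is why the same server can sit behind any of them.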

The alternative approaches still in play:

- OpenAI function calling: OpenAI's proprietary approach to tool use. Well-designed but OpenAI-specific. Teams that standardize on function calling are locked into OpenAI's format.
- Composio: A platform that provides pre-built tool integrations as a service. Over 250 integrations available. Useful for rapid prototyping but adds a dependency and latency hop.
- Custom API integrations: Direct HTTP calls managed by your orchestration layer. Maximum control, maximum maintenance burden.

Build vs. buy: Adopt MCP as your standard, buy pre-built MCP servers for common integrations (CRMs, ticketing systems, databases), and build custom MCP servers for your proprietary systems. This is the one layer where there is a clear right answer in 2026.

Where lock-in is real vs. imagined: This is where MCP's open standard nature matters most. If you build on MCP, your tool integrations are portable across models and orchestration frameworks. If you build on OpenAI function calling exclusively, you are locked to OpenAI. The lock-in difference is binary and significant.

Layer 5: Guardrails & Safety — The Highest-Margin Layer Nobody Talks About

Here is a number that should get your attention: guardrails and safety infrastructure has the highest gross margins of any layer in the agent stack, routinely 60-75%. Why? Because the cost of failure is existential. An agent that hallucinates a wrong answer is embarrassing. An agent that leaks PII, generates harmful content, or takes unauthorized actions in production systems is a lawsuit, a front-page story, and potentially a company-ending event.

Every enterprise deploying AI agents is spending more on guardrails than it planned and less than it should.

The vendor landscape:

Guardrails AI — Open-source framework for validating LLM outputs. Define validators (is the output valid JSON? Does it contain PII? Is it factually consistent with the source material?) and enforce them in your agent pipeline. Has a growing hub of community-contributed validators. The fastest-growing company in this layer.
Lakera — Enterprise-focused AI security platform. Specializes in prompt injection detection, data leakage prevention, and content safety. Strong SOC 2 and HIPAA compliance story. Used by several Fortune 500 companies.
Arthur AI — AI monitoring and validation platform. Combines guardrails with model performance monitoring. Enterprise-oriented with a focus on regulated industries.
Custom validation layers — Many teams build custom guardrails using regex, classification models, and rule-based systems. This works for simple cases but becomes a maintenance nightmare as the agent's capabilities expand.

Why margins are high in this layer: Guardrails vendors sell to the compliance and risk functions, not just engineering. The buyer is the CISO, the general counsel, the VP of risk management. These buyers have larger budgets, longer contracts, and less price sensitivity than engineering teams. They are buying insurance against catastrophic outcomes, and insurance commands premium pricing.

Guardrails Vendor	Open Source	Key Capability	Enterprise Pricing
Guardrails AI	Yes (core)	Output validation, structural enforcement	$2,000-15,000/mo
Lakera	No	Prompt injection detection, content safety	$5,000-50,000/mo
Arthur AI	No	Model monitoring + guardrails	$10,000-75,000/mo
Custom build	N/A	Exactly what you need, nothing more	Engineering time (3-6 months)

Build vs. buy: Start with a framework like Guardrails AI for structural validation (JSON schema enforcement, output format checks), buy a commercial solution like Lakera for security-critical guardrails (prompt injection, PII detection), and build custom validators for domain-specific rules that no vendor will understand. Most production deployments use all three.
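A sketch of the custom-validator end of that mix, written in plain Python rather than against any vendor's API. The validator names are hypothetical, and the email regex is a deliberately crude PII check for illustration; real PII detection needs far more than one pattern.

```python
import json
import re
from typing import Callable

# A validator takes the agent's raw output and returns a list of violation messages.
Validator = Callable[[str], list[str]]

# Crude email pattern, for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def valid_json(output: str) -> list[str]:
    """Structural check: the output must parse as JSON."""
    try:
        json.loads(output)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    return []


def no_email_addresses(output: str) -> list[str]:
    """Flag anything that looks like an email address."""
    return [f"possible PII (email): {match}" for match in EMAIL_RE.findall(output)]


def max_length(limit: int) -> Validator:
    """Build a validator that rejects outputs longer than `limit` characters."""
    def check(output: str) -> list[str]:
        return [f"output exceeds {limit} characters"] if len(output) > limit else []
    return check


def validate(output: str, validators: list[Validator]) -> list[str]:
    """Run every validator and collect all violations; an empty list means pass."""
    violations: list[str] = []
    for validator in validators:
        violations.extend(validator(output))
    return violations


checks = [valid_json, no_email_addresses, max_length(500)]
good = '{"status": "resolved"}'
bad = '{"contact": "jane@example.com"'  # missing closing brace, contains an email

print(validate(good, checks))
print(validate(bad, checks))
```

Collecting every violation instead of stopping at the first gives the audit trail that compliance buyers tend to ask for.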

Where lock-in is real vs. imagined: Low lock-in. Guardrails are typically implemented as middleware in your agent pipeline — they inspect and validate inputs and outputs without deeply coupling to your architecture. Swapping one guardrails vendor for another is usually a matter of changing an API call, not restructuring your system.

> The irony of the agent stack: the layer with the lowest lock-in has the highest margins. Guardrails vendors maintain pricing power through trust and compliance certifications, not technical lock-in. This is unusual in software and worth studying.

Layer 6: Observability — You Cannot Improve What You Cannot Measure

When your AI agent gives a bad answer, you need to know why. When costs spike, you need to trace which queries caused it. When latency degrades, you need to identify the bottleneck. Observability for AI agents is fundamentally different from traditional application monitoring because the failure modes are different — an agent does not crash with a stack trace, it fails by giving a confident wrong answer or taking an expensive, circuitous path to the right one.

The emerging leaders:

LangSmith (LangChain Inc.) — The market leader by usage, largely because of LangChain's distribution advantage. Provides tracing, evaluation, prompt management, and dataset curation. Tightly integrated with the LangChain ecosystem but usable standalone. The company's primary revenue source.
Braintrust — Evaluation-first observability. Focuses on systematic eval of LLM outputs rather than just tracing. Strong among teams that take evaluation seriously (which, in 2026, should be every team). Raised $36M.
Helicone — Developer-focused LLM observability. Clean UI, fast setup, strong cost tracking. Popular with startups and individual developers. Open-source core.
Arize AI — Enterprise ML observability platform that has expanded into LLM monitoring. Strong in regulated industries where model monitoring is a compliance requirement.

Build vs. buy: Buy. Observability is not your core competency, and building a production-quality tracing and evaluation system from scratch is 6-12 months of engineering time that adds no product value. The buy decision is easy. The harder question is how much to invest in evaluation — most teams underinvest dramatically.

Where lock-in is real vs. imagined: Moderate. LangSmith has the highest lock-in because of its integration with LangChain. If you are already on LangGraph, LangSmith is nearly frictionless to adopt and somewhat painful to leave. Braintrust and Helicone are lower lock-in because they integrate at the API call level rather than the framework level.
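Integrating "at the API call level" roughly means wrapping each model call and recording what it cost. The sketch below assumes a hypothetical in-memory `SPANS` list and illustrative per-token prices; it is not how any of the vendors above implement tracing, only the smallest version of the idea.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str        # groups all calls made during one agent run
    model: str
    latency_s: float
    output_tokens: int
    cost_usd: float


# Illustrative prices in USD per 1M output tokens (assumed, not quoted from a price list).
PRICE_PER_M = {"small": 0.30, "large": 6.00}
SPANS: list[Span] = []  # stand-in for an observability backend


@contextmanager
def traced_call(trace_id: str, model: str):
    """Time one model call and record its token usage and cost when it finishes."""
    usage = {"output_tokens": 0}  # the caller fills this in from the provider response
    start = time.perf_counter()
    try:
        yield usage
    finally:
        cost = usage["output_tokens"] / 1_000_000 * PRICE_PER_M[model]
        SPANS.append(Span(trace_id, model, time.perf_counter() - start, usage["output_tokens"], cost))


def cost_by_trace(spans: list[Span]) -> dict[str, float]:
    """Sum cost per agent run, which is what you need when the bill spikes."""
    totals: dict[str, float] = defaultdict(float)
    for span in spans:
        totals[span.trace_id] += span.cost_usd
    return dict(totals)


with traced_call("run-1", "small") as usage:
    usage["output_tokens"] = 2_000   # pretend the provider reported this
with traced_call("run-1", "large") as usage:
    usage["output_tokens"] = 1_000
with traced_call("run-2", "small") as usage:
    usage["output_tokens"] = 500

print(cost_by_trace(SPANS))
```

Grouping by run rather than by call is the design choice that matters: agent costs are driven by how many calls a single task fans out into, not by the price of any one of them.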

Layer 7: Deployment & Infrastructure — The Plumbing

The final layer is where agents actually run. This layer has received less attention than it deserves because deployment patterns for AI agents are genuinely different from traditional application deployment. Agents are bursty (idle for hours, then consuming massive compute for minutes), stateful (maintaining conversation and task context), and unpredictably expensive (a single agent run can trigger dozens of LLM calls and tool invocations).

The options:

Modal — Purpose-built for AI workloads. Serverless GPU and CPU execution with container-level isolation. Excellent cold start times and a developer experience that feels like magic. The favorite of AI-native startups.
Fly.io — Edge-first deployment platform. Strong for agents that need low-latency global distribution. Less AI-specific but highly capable.
AWS Lambda / Azure Functions / Google Cloud Functions — The default serverless options. Work fine for simple agents but struggle with long-running agentic tasks (Lambda's 15-minute timeout is a real constraint for complex agent workflows).
Cloudflare Workers — Edge compute with increasingly strong AI capabilities (Workers AI). Excellent for lightweight agent tasks but limited for GPU-intensive workloads.
Kubernetes (self-managed) — The enterprise default. Maximum control, maximum operational overhead. Makes sense at scale but is overkill for most teams.

Build vs. buy: Buy infrastructure, build your deployment patterns. No one should be managing Kubernetes clusters to run AI agents in 2026 unless they have specific compliance or data residency requirements. Use Modal or a cloud serverless platform and invest your engineering time in the higher layers.
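One deployment pattern worth owning is a run budget: a cap on calls and wall-clock time so that a single agent run stops cleanly before a platform timeout or a runaway bill. This is a minimal sketch; `RunBudget`, `run_agent`, and the 840-second figure (chosen to leave headroom under a 15-minute limit) are assumptions for illustration.

```python
import time
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a run is out of calls or out of time."""


class RunBudget:
    def __init__(self, max_calls: int, max_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.max_calls = max_calls
        self.max_seconds = max_seconds
        self._clock = clock          # injectable so the budget can be tested without sleeping
        self._started = clock()
        self.calls = 0

    def charge(self) -> None:
        """Call before each LLM or tool invocation."""
        if self._clock() - self._started >= self.max_seconds:
            raise BudgetExceeded("time budget exhausted")
        if self.calls >= self.max_calls:
            raise BudgetExceeded("call budget exhausted")
        self.calls += 1


def run_agent(steps: list[Callable[[], str]], budget: RunBudget) -> list[str]:
    """Run steps in order until they finish or the budget runs out."""
    done = []
    for step in steps:
        try:
            budget.charge()
        except BudgetExceeded:
            break  # stop cleanly; the caller can checkpoint `done` and resume later
        done.append(step())
    return done


steps = [lambda i=i: f"step-{i}" for i in range(5)]
budget = RunBudget(max_calls=3, max_seconds=840)  # 14 minutes of headroom
print(run_agent(steps, budget))
```

Stopping with partial results, rather than being killed mid-step by the platform, is what makes a long task resumable in a fresh invocation.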

Where lock-in is real vs. imagined: Moderate. Modal and Fly.io have proprietary deployment formats but the underlying agent code is portable. The real lock-in risk is in your compute cost structure — if you optimize heavily for Modal's pricing model (per-second billing, GPU sharing), migrating to AWS means restructuring your cost model.

The Margin Map: Where Money Actually Accrues

Now that we have covered all seven layers, let us look at the economics in aggregate. This table shows where venture capital is flowing versus where margins actually live — and the divergence is striking.

Layer	2025 VC Funding	2025 Market Size (Est.)	Avg. Gross Margin	2027 Projected Market
Foundation Models	$28B	$18B	15-25%	$32B
Orchestration	$450M	$2.8B	55-70%	$8.5B
Memory & State	$620M	$3.2B	40-55%	$7.1B
Tool Use & MCP	$280M	$1.8B	35-50%	$5.4B
Guardrails & Safety	$340M	$2.1B	60-75%	$6.8B
Observability	$310M	$1.9B	50-65%	$5.2B
Deployment & Infra	$520M	$4.1B	30-45%	$9.0B

The foundation model layer attracted about 92% of the VC dollars in this table ($28B of roughly $30.5B) but has the lowest margins and the most brutal competitive dynamics. The orchestration and guardrails layers together attracted about 2.6% of the VC dollars but have margins 3-4x higher. This is the classic picks-and-shovels pattern, but even more extreme than in previous platform shifts because the model layer's commoditization is happening faster than anyone expected.
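The funding shares and margin multiples can be recomputed directly from the table rows. The sketch below copies the 2025 VC funding column and three of the margin ranges, and uses range midpoints as a simplifying assumption for the margin comparison.

```python
# 2025 VC funding per layer from the table above, in millions of USD.
vc_funding_musd = {
    "Foundation Models": 28_000,
    "Orchestration": 450,
    "Memory & State": 620,
    "Tool Use & MCP": 280,
    "Guardrails & Safety": 340,
    "Observability": 310,
    "Deployment & Infra": 520,
}

# Gross margin ranges from the same table, in percent.
margins_pct = {
    "Foundation Models": (15, 25),
    "Orchestration": (55, 70),
    "Guardrails & Safety": (60, 75),
}

total = sum(vc_funding_musd.values())
foundation_share = vc_funding_musd["Foundation Models"] / total
middle_share = (vc_funding_musd["Orchestration"] + vc_funding_musd["Guardrails & Safety"]) / total


def midpoint(bounds: tuple[int, int]) -> float:
    return sum(bounds) / 2


# How many times higher each layer's midpoint margin is than the model layer's.
margin_multiples = {
    layer: midpoint(margins_pct[layer]) / midpoint(margins_pct["Foundation Models"])
    for layer in ("Orchestration", "Guardrails & Safety")
}

print(f"total funding: ${total / 1000:.2f}B")
print(f"foundation model share: {foundation_share:.1%}")
print(f"orchestration + guardrails share: {middle_share:.1%}")
print(margin_multiples)
```

On these rows the model layer takes a little under 92% of the funding, orchestration and guardrails together take about 2.6%, and their midpoint margins come out a bit over three times the model layer's.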

The implication for operators and builders: If you are starting an AI infrastructure company in 2026, do not build a foundation model. Build orchestration tooling, guardrails, or evaluation infrastructure. If you are an enterprise deploying agents, your vendor budget should be weighted toward these middle layers, not toward model API costs (which are falling anyway).

Where the Stack Is Heading: Five Predictions for 2027

1. The Orchestration Layer Eats the Memory Layer

The distinction between orchestration and memory is already blurring. LangGraph's state management increasingly handles what vector databases used to own. Agent frameworks are building memory primitives directly into their workflow engines. By 2027, purpose-built agent memory (not general vector search, but agent-specific context management) will be a feature of orchestration platforms, not a separate product category.

2. MCP Becomes the USB of AI

MCP's adoption trajectory mirrors USB in the late 1990s. By the end of 2027, every major SaaS application will ship an MCP server. Tool integration will stop being an engineering problem and become a configuration problem. The companies that built businesses on bespoke API integrations (Zapier-style) will either adopt MCP or face displacement.

3. Guardrails Become Regulated

As AI agents move from internal tools to customer-facing products, governments will mandate safety standards. The EU AI Act already requires risk management for high-risk AI systems. By 2027, guardrails will not be optional infrastructure — they will be compliance requirements with audit trails, and the vendors that have built certifiable platforms will command even higher premiums.

4. Observability and Evaluation Merge

The distinction between "observability" (watching what your agent does) and "evaluation" (measuring whether it does it well) is artificial. By 2027, these will be a single product category. The winner will be the platform that makes continuous evaluation as easy as log aggregation — every agent interaction automatically scored, every degradation automatically flagged.

5. The Stack Compresses

Seven layers is too many for most teams. We will see platform plays that bundle 3-4 layers into integrated offerings. LangChain is already doing this (orchestration + observability). Expect foundation model providers to bundle orchestration and tool use. Expect cloud providers to bundle deployment + observability + guardrails. The standalone best-of-breed era will give way to integrated platforms, as it does in every maturing market.

The Build vs. Buy Cheat Sheet

For teams deploying production AI agents today, here is the practical summary:

Layer	Recommendation	Why
Foundation Models	Buy (multi-model)	Commodity, falling prices, no moat in single-model dependency
Orchestration	Build if core IP, buy if commodity workflow	This is where your product differentiation lives
Memory & State	Buy vector DB, build memory architecture	Storage is commodity, retrieval strategy is competitive advantage
Tool Use & MCP	Adopt MCP, buy common integrations, build proprietary	MCP eliminates the build-vs-buy tension for standard tools
Guardrails & Safety	Buy commercial + build domain-specific	Compliance risk too high for pure DIY
Observability	Buy	Not your core competency, mature vendor options exist
Deployment & Infra	Buy (Modal or serverless)	Undifferentiated operational overhead

The agent stack in 2026 is complex, expensive, and evolving weekly. But the companies that understand where value accrues — orchestration logic, safety guarantees, evaluation rigor — and where it does not — raw model capability, generic infrastructure — are building the AI products that will actually work in production.

The model is the least interesting part of your agent. Everything around it is where the real product lives.

Frequently Asked Questions

What is the AI agent stack?

The AI agent stack is the complete set of technology layers required to build, deploy, and operate AI agents in production. It consists of seven layers: foundation models (the core AI reasoning engine), orchestration (workflow and logic management), memory and state (context persistence), tool use and MCP (external system integration), guardrails and safety (output validation and risk management), observability (monitoring and evaluation), and deployment infrastructure (hosting and scaling). Each layer has distinct vendors, margin structures, and build-vs-buy dynamics.

Why do orchestration and guardrails have higher margins than foundation models?

Foundation models are in a brutal commodity price war — inference costs have dropped 97% in three years as OpenAI, Anthropic, Google, and open-source alternatives compete on price. Orchestration vendors maintain high margins because their products become deeply embedded in engineering workflows and agent architectures, creating high switching costs. Guardrails vendors sell to risk and compliance buyers who have larger budgets and less price sensitivity than engineering teams. Both layers benefit from being less capital-intensive to build than foundation models, which require billions in compute investment.

What is MCP (Model Context Protocol) and why does it matter?

MCP is an open standard created by Anthropic that defines how AI agents connect to external tools and data sources. Think of it as a universal adapter — any tool that implements an MCP server can be used by any agent that supports MCP, regardless of which foundation model or orchestration framework that agent uses. MCP matters because it eliminates the bespoke integration work that previously consumed 30-40% of agent development time. By Q1 2026, over 3,000 MCP servers exist in public registries, and every major model provider supports the protocol natively.

Should I build a custom orchestration layer or use a framework like LangGraph?

The decision depends on whether your agent's orchestration logic is your competitive advantage. If you are building a customer-facing AI product where the agent's reasoning, routing, and multi-step behavior is what differentiates you from competitors, build custom — frameworks will eventually constrain you. If your agent is an internal tool or if the orchestration is straightforward (simple RAG, basic multi-step workflows), use LangGraph or CrewAI. The framework saves months of engineering time and benefits from community-tested patterns. Most teams start with a framework and migrate critical paths to custom code as they scale.

How much does a production AI agent stack cost to operate?

Total cost of ownership varies enormously by scale, but a representative mid-scale deployment (processing 100,000 agent interactions per month) typically costs $8,000-25,000 monthly across all seven layers. The breakdown is roughly: 30% model API costs, 20% infrastructure and compute, 15% vector database and storage, 15% observability and evaluation tooling, 10% guardrails and safety, and 10% tool integration services. Model API costs are the largest single item but also the fastest declining. Teams that implement multi-model routing and aggressive caching typically reduce total costs by 35-50%.
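For budgeting, the percentages above can be turned into dollar figures at both ends of the quoted range. This sketch only applies the stated shares to the stated totals; the function name and labels are mine.

```python
# Cost shares by layer, in percent, as given in the answer above.
shares_pct = {
    "model API": 30,
    "infrastructure and compute": 20,
    "vector database and storage": 15,
    "observability and evaluation": 15,
    "guardrails and safety": 10,
    "tool integration": 10,
}
assert sum(shares_pct.values()) == 100  # the breakdown should cover the whole budget

INTERACTIONS_PER_MONTH = 100_000


def monthly_breakdown(total_usd: float) -> dict[str, float]:
    """Split a monthly total across layers according to the shares."""
    return {layer: total_usd * pct / 100 for layer, pct in shares_pct.items()}


low = monthly_breakdown(8_000)
high = monthly_breakdown(25_000)

for layer in shares_pct:
    print(f"{layer:30s} ${low[layer]:>8,.0f} to ${high[layer]:>8,.0f}")

cost_per_interaction = (8_000 / INTERACTIONS_PER_MONTH, 25_000 / INTERACTIONS_PER_MONTH)
print(f"per interaction: ${cost_per_interaction[0]:.2f} to ${cost_per_interaction[1]:.2f}")
```

At this scale the range works out to between 8 and 25 cents per agent interaction, with model API spend between $2,400 and $7,500 a month.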

What is the biggest mistake teams make when building the AI agent stack?

The most common and expensive mistake is over-investing in the model layer and under-investing in evaluation and guardrails. Teams spend weeks optimizing prompts and benchmarking model providers for marginal performance gains while shipping agents with no systematic evaluation framework, no guardrails against harmful outputs, and no observability into failure modes. The second most common mistake is premature custom building — teams that build custom orchestration, custom vector search, and custom observability from day one when frameworks and vendors would have gotten them to production in a quarter of the time.
