The Hidden Cost of AI Agents: Unit Economics Nobody Is Talking About

Reflexion loops can consume up to 50x the tokens of a single completion. Agents fail 50-75% of real-world tasks. Gartner says more than 40% of agentic projects will be canceled by 2027. Inside the cost structure that's breaking AI business models.

By Nina Okafor, Marketing Ops · Mar 9, 2026 · 14 min read

In September 2024, Klarna announced that its AI assistant was handling two-thirds of all customer service chats in its first month. The company claimed the AI was doing the equivalent work of 700 full-time human agents, resolving issues in under 2 minutes versus the previous 11-minute average. CEO Sebastian Siemiatkowski called it "a revolution in productivity."

Eleven months later, Klarna began rehiring human agents. The AI agent that was supposed to replace 700 people couldn't maintain quality on complex interactions — refund disputes, multi-product issues, escalations that required judgment. Siemiatkowski acknowledged publicly that AI "cannot fully replace humans" for customer service.

Klarna's reversal is not an anomaly. It's a preview of what happens when the demo performance of AI agents meets the cost structure of running them at production scale. And the cost structure is worse than almost anyone in the industry is willing to discuss publicly.

The Inference Cost Iceberg

The headline cost of an AI agent interaction seems manageable. A single GPT-4o API call costs roughly $2.50 per million input tokens and $10 per million output tokens. A typical customer service interaction might use 2,000-5,000 tokens total. At those rates, the raw inference cost per interaction is $0.01-0.05.
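The per-call arithmetic behind that figure can be sketched as follows. This is a minimal illustration that uses the GPT-4o prices quoted above; the 80/20 split between input and output tokens is a hypothetical assumption, not a measured figure.

```python
# Prices quoted in the text, in USD per million tokens.
INPUT_PRICE_PER_M = 2.50
OUTPUT_PRICE_PER_M = 10.00


def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Raw inference cost in USD for a single LLM call."""
    return (input_tokens * INPUT_PRICE_PER_M
            + output_tokens * OUTPUT_PRICE_PER_M) / 1_000_000


# Hypothetical split: 80% of the tokens are input, 20% are output.
low = call_cost(1_600, 400)      # 2,000 tokens in total
high = call_cost(4_000, 1_000)   # 5,000 tokens in total
print(f"${low:.3f} to ${high:.3f} per single-call interaction")
```

Under this split the result lands at the low end of the quoted range. The $0.05 upper bound corresponds to the extreme case where all 5,000 tokens are billed at the output rate.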

But agents don't make one API call. That's the fundamental misconception that distorts every business case built for agentic AI.

An agent completing a customer service resolution might:

Parse the customer's initial message (1 call)
Retrieve relevant account information via tool calls (2-3 calls)
Analyze the account history for context (1 call)
Determine the appropriate resolution path (1 call)
Execute the resolution action via API (1-2 calls)
Verify the action completed correctly (1 call)
Generate a customer-facing response (1 call)
Log the interaction for compliance (1 call)

That's 9-11 LLM calls for a straightforward resolution. A complex interaction (one requiring clarification, error correction, or escalation logic) can require 25-50 calls. And each call re-sends the full conversation context, so cumulative token consumption grows roughly quadratically with the number of calls.
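The quadratic growth is easy to see with a small sketch. It assumes, hypothetically, that every call adds a fixed 500 tokens to the history and that the whole history is re-sent each time; it ignores any provider-side prompt caching discounts.

```python
def cumulative_input_tokens(num_calls: int, tokens_added_per_call: int) -> int:
    """Total input tokens billed when every call re-sends the full history.

    Call k carries k * tokens_added_per_call tokens of context, so the
    total is tokens_added_per_call * (1 + 2 + ... + num_calls).
    """
    return tokens_added_per_call * num_calls * (num_calls + 1) // 2


# Hypothetical: each call adds 500 tokens of new context.
for calls in (10, 20, 40):
    print(f"{calls} calls: {cumulative_input_tokens(calls, 500):,} input tokens")
```

Doubling the number of calls roughly quadruples the billed input tokens, which is why a 40-call interaction costs far more than four times a 10-call one.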

Research on reflexion-based agent architectures — where agents review their own outputs and iterate — shows token consumption of up to 50x a single completion. An agent that checks its work, reconsiders its approach, and tries again is doing exactly what makes it more capable. It's also consuming tokens at a rate that demolishes the unit economics of the simple "cost per call" projection.

The real math looks like this:

Interaction type	LLM calls	Tokens consumed	Inference cost
Simple FAQ response	1-2	1,000-3,000	$0.01-0.03
Standard resolution	8-12	15,000-40,000	$0.10-0.50
Complex multi-step	20-40	80,000-200,000	$1.00-5.00
Error recovery + retry	40-80	200,000-500,000	$5.00-15.00
Multi-agent orchestration	50-100+	500,000-2,000,000	$15.00-50.00+

The bottom rows of that table are where agent economics break down. When an agent encounters an edge case, fails, retries, and escalates — a scenario that occurs in 50-75% of real-world tasks according to multiple agent benchmarks — the cost per interaction can exceed what a human agent costs for the same resolution.

The Error Amplification Problem

Single-step AI interactions have a straightforward error profile: the model either gets it right or it doesn't. Agent workflows have a compound error profile that most teams dramatically underestimate.

Consider a 10-step agent workflow where each step has a 95% success rate — a reasonable assumption for a well-tuned model on structured tasks. The probability of completing all 10 steps correctly is 0.95^10 = 0.60. A 5% per-step error rate produces a 40% end-to-end failure rate.
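The same calculation for other workflow lengths, as a minimal sketch that keeps the independence assumption from the paragraph above:

```python
def end_to_end_success(per_step_success: float, steps: int) -> float:
    """Probability that every step succeeds, assuming independent steps."""
    return per_step_success ** steps


for steps in (1, 5, 10, 20, 30):
    p = end_to_end_success(0.95, steps)
    print(f"{steps:>2} steps: {p:.0%} success, {1 - p:.0%} failure")
```

At 30 steps, the same 95% per-step rate leaves only about a 21% chance of a clean run.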

In practice, the compounding is worse because errors aren't independent. A mistake in step 3 doesn't just fail step 3 — it corrupts the context for steps 4 through 10. Research from Microsoft on compositional AI systems found that multi-step agent error rates are approximately 17.2x higher than single-step error rates when accounting for error propagation and context corruption.

This is the error amplification problem: agents that are impressively reliable on individual tasks become unacceptably unreliable when those tasks are chained together. And every retry to fix an error generates more inference costs, creating a cost-error spiral:

Agent attempts task → fails at step 6
Agent retries from step 5 with modified approach → fails at step 8
Agent retries from step 7 → succeeds but with degraded quality
Total cost: 3x the planned inference budget

Enterprise environments typically require less than 1% error rates for automated processes that touch customers, financial data, or compliance-relevant workflows. Current agents operate at 25-50% success rates on complex tasks. Bridging that gap — from 50% to 99% reliability — is not a linear engineering problem. It requires either dramatically better models, dramatically better error correction (which dramatically increases cost), or dramatically narrower task scopes (which dramatically reduces value).
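Inverting the compounding formula shows why the gap is so hard to close. This sketch assumes independent steps, which is the optimistic case given the error propagation described earlier.

```python
def required_per_step_success(target_end_to_end: float, steps: int) -> float:
    """Per-step success rate needed to hit an end-to-end target,
    assuming independent steps (an optimistic simplification)."""
    return target_end_to_end ** (1.0 / steps)


for steps in (5, 10, 30):
    p = required_per_step_success(0.99, steps)
    print(f"{steps:>2} steps: per-step success {p:.4%}, error budget {1 - p:.4%}")
```

For a 10-step workflow, 99% end-to-end reliability leaves a per-step error budget of about 0.1%, roughly 50 times tighter than the 5% per-step error rate used in the earlier example.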

The Klarna Postmortem: A Detailed Look

Klarna's AI agent journey deserves forensic examination because the company was more transparent about its AI strategy than most, providing enough data points to reconstruct what happened.

Phase 1: The impressive launch (September 2024). Klarna's AI assistant, built on OpenAI's technology, launched and immediately handled 2.3 million conversations in its first month. The company reported a 25% reduction in repeat inquiries and customer satisfaction scores on par with human agents. These numbers were real and impressive.

Phase 2: The quiet scaling problems (Late 2024 - Early 2025). As the AI agent handled more interactions, edge case frequency increased. Refund disputes involving multiple products. Account issues spanning multiple countries with different regulations. Complaints requiring empathy and nuanced judgment. Each edge case required more inference calls (increasing cost) and produced worse outcomes (decreasing quality). The company did not publicly discuss these issues during this phase.

Phase 3: Quality degradation becomes visible (Mid 2025). Customer complaints about AI interactions increased. Social media reports of frustrating bot loops — where the AI couldn't resolve an issue but also couldn't effectively escalate — began appearing. Klarna's customer satisfaction scores for AI-handled interactions reportedly diverged from human-handled scores, particularly for complex issues.

Phase 4: The reversal (Late 2025). Klarna began rehiring human agents. Siemiatkowski acknowledged the limitations. The company shifted to a hybrid model where AI handles straightforward interactions and humans handle anything requiring judgment. The 700 agents the AI was supposed to replace? Klarna now needed roughly half of them back.

The unit economics of Klarna's reversal tell the real story. The initial business case assumed an average cost per AI resolution of approximately $0.50, compared to roughly $5 for a human agent. The actual average cost, including error correction loops, escalation handling, and quality remediation, was closer to $3-4 for the AI, plus the residual cost of human agents needed for escalations. The savings were real, but at $3-4 against $5 they came to roughly 20-40% per resolution, not the 90%+ that was marketed.

Why Initial Cost Projections Are Off by 10x

The Klarna example illustrates a broader pattern: initial cost projections for agentic AI are systematically too optimistic by approximately an order of magnitude.

The projections fail for consistent reasons:

Reason 1: Demo bias. Cost projections are built from demonstration scenarios: carefully chosen tasks where the agent performs well. Production environments include the full distribution of tasks, including the 20% of interactions that are 10x more complex and 50x more expensive than the typical one. This long tail of complex interactions dominates actual costs.

Reason 2: Ignoring human oversight costs. Every agentic system requires human oversight for quality assurance, exception handling, and compliance review. These human costs don't disappear; they shift from "doing the work" to "monitoring and correcting the AI doing the work." BCG research found that human oversight costs average 30-50% of the pre-automation human cost, meaning the net saving is at most 50-70% before inference and infrastructure costs are counted, not the 90%+ typically projected.

Reason 3: Infrastructure costs beyond inference. Running agents at scale requires vector databases for retrieval, logging infrastructure for compliance, monitoring systems for quality assurance, and orchestration platforms for multi-agent coordination. These infrastructure costs are typically excluded from initial projections but can equal or exceed raw inference costs. Replit's margin swing to -14% was driven largely by infrastructure costs scaling faster than revenue.

Reason 4: The cost of being wrong. When a human agent makes a mistake, it costs the company one remediation interaction. When an AI agent makes a mistake, it can cost the company a customer — because the customer already tried the automated system, failed, and now has to start over with a human. The brand damage and customer lifetime value impact of AI errors is systematically excluded from cost projections but is the primary reason Klarna reversed course.
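The first three reasons can be combined into a rough cost model. The sketch below is illustrative only: the field names are hypothetical, the $0.05 typical inference cost and the 1.0 infrastructure ratio are assumptions, and the other inputs are the figures quoted above (20% tail at 50x, oversight at 30% of human cost, $5 human cost). The cost of lost customers from Reason 4 is not modeled.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    human_cost: float         # USD per task when a human handles it
    typical_inference: float  # USD of inference for a typical task
    tail_share: float         # fraction of tasks in the complex long tail
    tail_multiplier: float    # tail task cost relative to a typical task
    infra_ratio: float        # infrastructure cost as a multiple of inference
    oversight_ratio: float    # oversight cost as a fraction of human cost


def cost_per_task(s: Scenario) -> float:
    """Blended cost per task: inference (with long tail) + infra + oversight."""
    mix = (1 - s.tail_share) + s.tail_share * s.tail_multiplier
    inference = s.typical_inference * mix
    infra = inference * s.infra_ratio
    oversight = s.human_cost * s.oversight_ratio
    return inference + infra + oversight


scenario = Scenario(human_cost=5.00, typical_inference=0.05, tail_share=0.20,
                    tail_multiplier=50, infra_ratio=1.0, oversight_ratio=0.30)
actual = cost_per_task(scenario)
demo_savings = 1 - scenario.typical_inference / scenario.human_cost
real_savings = 1 - actual / scenario.human_cost
print(f"demo-based savings: {demo_savings:.0%}, modeled savings: {real_savings:.0%}")
```

With these inputs the demo-based projection shows a 99% saving, while the blended cost comes to about $2.58 per task, a saving of roughly 48%.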

The Platform Provider Problem

The companies building the foundation models — OpenAI, Anthropic, Google — face their own version of the cost problem, and it cascades to everyone building on top.

OpenAI reportedly burns approximately $2 for every $1 earned on inference across its product suite. This ratio has likely improved with model efficiency gains, but the company's losses — projected at $5 billion for 2024 on $3.7 billion in revenue — indicate that inference costs remain structurally above revenue for the products driving the most usage.

This matters because every company building AI agents on top of OpenAI, Anthropic, or Google APIs is implicitly betting that inference costs will continue to decline. If they do — and the historical trend supports this, with costs dropping roughly 10x every 18 months — then today's negative unit economics can turn positive at future cost structures. If cost declines stall because of energy constraints, chip supply limitations, or model capability plateaus, the entire agentic AI stack faces a sustainability crisis.

The dependency chain creates a peculiar dynamic: AI agent companies need inference costs to decline to achieve positive unit economics, but they also need to use more tokens per interaction (for better quality, more complex tasks, and agent autonomy) as their products mature. These two forces partially offset each other, and it's not clear which one wins.
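A small sketch makes the tug-of-war concrete. The 10x price drop every 18 months is the trend cited above; the assumption that tokens per interaction double every 12 months is purely hypothetical and only there to show how the two curves interact.

```python
def relative_cost(months: float,
                  price_drop_factor: float = 10.0,
                  price_drop_period: float = 18.0,
                  token_growth_factor: float = 2.0,
                  token_growth_period: float = 12.0) -> float:
    """Cost per interaction relative to today (1.0 = unchanged).

    Price per token falls by price_drop_factor every price_drop_period
    months, while tokens per interaction grow by token_growth_factor
    every token_growth_period months (a hypothetical growth rate).
    """
    price = price_drop_factor ** (-months / price_drop_period)
    tokens = token_growth_factor ** (months / token_growth_period)
    return price * tokens


for m in (0, 6, 12, 18, 36):
    print(f"month {m:>2}: {relative_cost(m):.2f}x today's cost per interaction")
```

Under these assumptions the price decline wins, but a 10x price drop yields only about a 3.5x drop in cost per interaction after 18 months. If the decline stalls while token usage keeps growing, the ratio moves above 1.0.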

Gartner projects that more than 40% of agentic AI projects initiated in 2025-2026 will be canceled, scaled back, or fundamentally restructured by 2027. The primary cited reasons are escalating costs that exceed initial projections and inability to achieve reliability targets. This is not a prediction about AI's long-term potential — it's a prediction about the gap between current capabilities, current costs, and current enterprise expectations.

The Scaling Trap

The most insidious aspect of AI agent economics is what I call the scaling trap: agents get more expensive per interaction as they get more capable.

In traditional software, scaling reduces marginal cost. Serve 10x more users and your per-user infrastructure cost drops. This is the fundamental economics behind SaaS margins.

AI agents work in reverse. Making an agent more capable requires:

Better models (more expensive per token)
More tool access (more API calls per interaction)
Longer context (more tokens per call)
More reasoning steps (more calls per task)
Better error handling (more retry loops)

Each improvement increases the inference cost per interaction. A basic chatbot that answers FAQs might cost $0.01 per interaction. An agent that can navigate your systems, take actions, and verify outcomes might cost $1-5 per interaction. An autonomous agent that can handle multi-step workflows with error recovery might cost $10-50 per interaction.
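Because these factors multiply rather than add, the per-interaction cost climbs quickly. The tiers below use hypothetical call counts, context sizes, and blended per-token prices chosen only to land inside the ranges quoted above.

```python
def interaction_cost(calls: int, tokens_per_call: int, usd_per_m_tokens: float) -> float:
    """Cost per interaction = calls x tokens per call x price per token."""
    return calls * tokens_per_call * usd_per_m_tokens / 1_000_000


# Hypothetical parameters for each capability tier.
tiers = {
    "FAQ chatbot":      interaction_cost(calls=1,  tokens_per_call=2_000,  usd_per_m_tokens=5.0),
    "action agent":     interaction_cost(calls=20, tokens_per_call=10_000, usd_per_m_tokens=10.0),
    "autonomous agent": interaction_cost(calls=60, tokens_per_call=25_000, usd_per_m_tokens=15.0),
}
for name, cost in tiers.items():
    print(f"{name}: ${cost:.2f} per interaction")
```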

The scaling trap means that the agents capable enough to replace human workers are often expensive enough to make the replacement economics marginal. The agents cheap enough to run profitably at scale are often too limited to handle the tasks humans are most expensive to employ for.

This creates a narrow viability window: tasks that are complex enough to justify automation but simple enough that an agent can complete them reliably without excessive retry loops. That window is real — it's where Intercom's $0.99/resolution model works, where structured customer service interactions have well-defined resolution paths. But it's narrower than the market narrative suggests.

What the Smart Money Is Actually Building

Companies with the healthiest agentic AI economics share several characteristics:

Narrow task scopes. Rather than building general-purpose agents that attempt any task, successful deployments focus agents on specific, well-defined workflows. Harvey doesn't build a "legal AI agent." It builds specific agents for contract review, due diligence, and regulatory analysis — each optimized for a narrow task where reliability can exceed 95%.

Aggressive model routing. Not every step in an agent workflow requires a frontier model. Smart architectures route simple tasks (parsing, extraction, classification) to cheap, fast models and reserve expensive models for reasoning-heavy steps. Companies implementing intelligent routing report 40-60% inference cost reductions without meaningful quality degradation.
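A minimal routing sketch, under stated assumptions: the model names, the blended per-token prices, and the step taxonomy are all hypothetical, and a real router would also need a quality check on the cheap model's output.

```python
CHEAP, FRONTIER = "small-model", "frontier-model"  # hypothetical model names
PRICE_PER_M = {CHEAP: 0.50, FRONTIER: 10.00}       # hypothetical USD per million tokens
SIMPLE_STEPS = {"parse", "extract", "classify"}


def route(step_kind: str) -> str:
    """Send simple steps to the cheap model, everything else to the frontier model."""
    return CHEAP if step_kind in SIMPLE_STEPS else FRONTIER


def workflow_cost(steps, router) -> float:
    """Total USD cost of a workflow given (step_kind, tokens) pairs and a router."""
    return sum(tokens * PRICE_PER_M[router(kind)] for kind, tokens in steps) / 1_000_000


steps = [("parse", 2_000), ("extract", 3_000), ("classify", 1_000),
         ("reason", 4_000), ("respond", 2_000)]
baseline = workflow_cost(steps, lambda kind: FRONTIER)  # everything on the frontier model
routed = workflow_cost(steps, route)
reduction = 1 - routed / baseline
print(f"baseline ${baseline:.3f}, routed ${routed:.3f}, reduction {reduction:.0%}")
```

In this toy workflow half of the tokens go to the cheap model, which cuts cost by a little under half. The achievable reduction depends entirely on how much of the token volume sits in steps that a small model can handle.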

Human-in-the-loop by design, not by failure. Rather than deploying fully autonomous agents and adding human oversight when they fail, the best implementations design human checkpoints into the workflow from the start. This is not an admission of AI inadequacy — it's an acknowledgment that the cost of uncaught errors exceeds the cost of human review for high-stakes tasks. The human doesn't do every task — they verify the 10-20% of tasks where the agent's confidence is below a threshold.
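A confidence gate of this kind can be sketched in a few lines. The confidence scores, the 0.80 threshold, and the $5 review cost are hypothetical; in practice the score would have to come from a calibrated signal, since a model's self-reported confidence is not necessarily reliable.

```python
def needs_human_review(confidence: float, threshold: float) -> bool:
    """Flag a task for human review when the agent's confidence is below threshold."""
    return confidence < threshold


def expected_review_cost(confidences, threshold: float, review_cost_usd: float) -> float:
    """Average human review cost per task, spread across all tasks."""
    flagged = sum(needs_human_review(c, threshold) for c in confidences)
    return flagged / len(confidences) * review_cost_usd


# Hypothetical confidence scores for ten completed tasks.
scores = [0.99, 0.97, 0.95, 0.93, 0.91, 0.90, 0.88, 0.86, 0.72, 0.55]
rate = sum(needs_human_review(c, 0.80) for c in scores) / len(scores)
print(f"review rate: {rate:.0%}")
print(f"review cost per task: ${expected_review_cost(scores, 0.80, 5.0):.2f}")
```

Raising the threshold buys more safety at a higher per-task review cost, so the threshold is effectively a pricing decision as much as a quality one.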

Caching and determinism layers. Many agent interactions are variations of previously seen requests. Building a caching layer that recognizes similar inputs and reuses previous successful outputs — rather than running the full agent pipeline every time — can reduce average inference costs by 50-70%. This requires upfront investment in embedding-based similarity matching but pays back quickly at scale.
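The core of such a cache is a similarity lookup before the agent pipeline runs. The sketch below is a simplified illustration: the class name, the 0.95 threshold, and the hand-made three-dimensional vectors are all hypothetical, and a production system would use real embeddings and an indexed vector store rather than a linear scan.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """Reuse a stored response when a new request is close enough to an old one."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.entries = []  # list of (embedding, response) pairs

    def store(self, embedding, response: str) -> None:
        self.entries.append((list(embedding), response))

    def lookup(self, embedding):
        """Return the best cached response, or None when nothing is similar enough."""
        best = max(self.entries, key=lambda e: cosine(e[0], embedding), default=None)
        if best is not None and cosine(best[0], embedding) >= self.threshold:
            return best[1]
        return None


cache = SemanticCache(threshold=0.95)
cache.store([1.0, 0.0, 0.0], "Refunds are processed within 5 business days.")
hit = cache.lookup([0.99, 0.05, 0.0])   # near-duplicate request
miss = cache.lookup([0.0, 1.0, 0.0])    # unrelated request
print("hit:", hit)
print("miss:", miss)
```

A cache hit skips the agent pipeline entirely, so the saving scales with the hit rate. The threshold controls the trade-off: too low and customers get answers to questions they did not ask, too high and the cache rarely fires.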

The Honest Math: When AI Agents Make Economic Sense

Stripping away the hype and the pessimism, the data points to a clear framework for when AI agents are and aren't economically viable:

Agents make sense when:

The task has a clear success/failure criterion (enabling outcome-based measurement)
The average task requires fewer than 15 agent steps (keeping error amplification manageable)
The human cost of the task exceeds $5 per instance (providing enough margin to cover inference costs)
Task volume exceeds 10,000 instances per month (justifying the infrastructure investment)
Error consequences are limited and recoverable (keeping remediation costs low)

Agents don't make sense when:

Tasks require judgment that varies by context (high error rates, expensive retries)
The average task requires more than 30 agent steps (error amplification makes reliability impractical)
The human cost of the task is under $2 per instance (inference costs eat the entire saving)
Task volume is under 1,000 per month (infrastructure costs can't be amortized)
Errors have regulatory, legal, or reputational consequences (human oversight costs eliminate savings)
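The two checklists can be turned into a rough screening function. The thresholds come from the criteria above; the field names and the aggregation rule (all favorable criteria must hold, any single unfavorable one disqualifies, everything else is borderline) are assumptions added for illustration. The two boolean fields are coarse stand-ins for the judgment and error-consequence criteria.

```python
from dataclasses import dataclass


@dataclass
class Task:
    clear_success_criterion: bool  # False if the task needs context-dependent judgment
    avg_steps: int                 # average agent steps per task
    human_cost_usd: float          # human cost per instance
    monthly_volume: int            # instances per month
    errors_recoverable: bool       # False if errors carry legal or reputational risk


def assess(task: Task) -> str:
    """Screen a task against the thresholds listed above."""
    unfavorable = [
        not task.clear_success_criterion,
        task.avg_steps > 30,
        task.human_cost_usd < 2,
        task.monthly_volume < 1_000,
        not task.errors_recoverable,
    ]
    favorable = [
        task.clear_success_criterion,
        task.avg_steps < 15,
        task.human_cost_usd > 5,
        task.monthly_volume > 10_000,
        task.errors_recoverable,
    ]
    if any(unfavorable):
        return "does not make sense"
    if all(favorable):
        return "makes sense"
    return "borderline"


print(assess(Task(True, 8, 6.0, 50_000, True)))
print(assess(Task(True, 40, 6.0, 50_000, True)))
print(assess(Task(True, 20, 3.0, 5_000, True)))
```

The third example falls between the two lists on steps, cost, and volume, which is where a pilot with measured costs is more informative than any checklist.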

The companies generating real returns on AI agents — Intercom, Sierra, Ironclad — all operate in the "makes sense" zone. Structured tasks, clear success criteria, high volume, moderate complexity, limited error consequences.

The companies announcing AI agent initiatives and then quietly scaling them back are typically operating outside that zone. There are more of these than the industry acknowledges: BCG reports that 60% of enterprises deploying AI broadly see no material business value. They're attempting to automate judgment-heavy, multi-step workflows where agents fail 50-75% of the time and human oversight eliminates most of the projected savings.

What Happens Next

The AI agent cost problem will improve. Models will get cheaper. Architectures will get more efficient. Caching will get smarter. Error rates will decline. The question is not whether AI agents will become economically viable at scale — they almost certainly will — but whether the timeline matches the current investment thesis.

If inference costs follow their historical trajectory and drop another 10x by 2028, many agent deployments that are marginally negative today become solidly positive. If the decline stalls — due to energy constraints, chip supply issues, or the diminishing returns of model distillation — the shakeout will be severe.

The $2 trillion in enterprise AI spending projected through 2028 is premised on the assumption that costs decline and reliability improves on a curve that makes current investment rational. If the curve flattens, the cancellation rate will exceed Gartner's 40% estimate.

For operators evaluating AI agents today, the actionable advice is: build your business case on today's costs, not projected future costs. If the unit economics work at current inference rates with a 30% reliability buffer, proceed. If the business case requires 5x cost reduction and 2x reliability improvement to break even, wait. The technology will get there. The question is whether your budget and your board's patience will too.

The hidden cost of AI agents isn't hidden because companies are trying to obscure it. It's hidden because the cost structure — variable inference, error amplification, infrastructure overhead, human oversight — is genuinely difficult to measure before you run the system at scale. The companies discovering this in production are the ones generating the data that will eventually make agent economics predictable. Until then, the gap between the pitch deck and the P&L will remain the defining tension of the agentic AI era.

Frequently Asked Questions

Why are AI agents so expensive to run?

AI agents are expensive because they require multiple inference calls per task (a straightforward resolution takes roughly 9-11 LLM calls, and complex or multi-agent tasks can take 25-100 or more), use reflexion loops that consume up to 50x the tokens of a single completion, and need expensive frontier models for reasoning-heavy steps. Unlike simple chatbot interactions, agents have compute costs that can't be predicted in advance, because the number of steps varies with task complexity and error correction needs.

What is the failure rate of AI agents?

Current AI agents fail 50-75% of real-world tasks according to multiple benchmarks and production deployments. Enterprise environments typically require error rates below 1% for automated processes, creating a massive gap between agent capabilities and enterprise requirements. Multi-step workflows also face error amplification: a 5% error rate per step compounds to roughly a 40% failure rate across a 10-step workflow, and research from Microsoft puts multi-step error rates at approximately 17.2x single-step rates once error propagation and context corruption are accounted for.

Why did Klarna reverse its AI agent strategy?

Klarna initially claimed its AI agent handled two-thirds of customer service chats and did the equivalent work of 700 full-time human agents. The company later reversed course and began rehiring human agents after discovering quality degradation in complex customer interactions. CEO Sebastian Siemiatkowski acknowledged that AI could not fully replace humans for nuanced customer service. The reversal illustrates the gap between AI agent demo performance and production reliability at scale.

What percentage of AI agent projects will be canceled?

Gartner projects that more than 40% of agentic AI projects will be canceled, scaled back, or restructured by 2027 due to escalating costs, unclear ROI, and implementation complexity. BCG research found that 60% of enterprises deploying AI broadly see no material business value. Initial cost projections for agentic AI implementations are typically off by a factor of roughly 10 once error correction, human oversight, and infrastructure costs are counted.

How do AI agent costs compare to traditional software?

Traditional SaaS has near-zero marginal cost per transaction. AI agents have variable, unpredictable costs that scale with task complexity. A simple customer service interaction might cost $0.05 in inference, but a complex multi-step resolution with error correction can cost $5-50. OpenAI reportedly spends $2 for every $1 earned on inference across its product suite. Replit's margins swung to -14% when AI usage spiked, illustrating how agent-heavy products face margin volatility that traditional software never experienced.