Are the costs of AI agents also rising exponentially? (2025)

easygenes · 2026-04-18T05:20:41.000Z 1776489641

While I understand why they used the METR data, a cleaner look would be against the current cost-optimal frontier of open models (e.g. GLM-5.1 and MiniMax-M2.7). That paints a very different picture. Comparing just the frontier models at the time of the METR report invariably leads to looking at providers who are pushing the limits of cost at the time of the report.

GPT-5 was shown as being on the costly end, surpassed by o3 at over $100/hr. I can't directly compare to METR's metrics, but a good proxy is the cost of running the Artificial Analysis suite. GLM-5.1 costs less than half as much as GPT-5 to complete the suite and is dramatically more capable than both GPT-5 and o3.

So while their analysis is interesting, it points towards the frontier continuing to test the limits of acceptable pricing (as Mythos is clearly reinforcing) and the lagging 6-12 months of distillation and refinement continuing to bring the cost of comparable capabilities to much more reasonable levels.

thelastgallon · 2026-04-18T01:31:44.000Z 1776475904

> On many task lengths (including those near their plateau) they cost 10 to 100 times as much per hour. For instance, Grok 4 is at $0.40 per hour at its sweet spot, but $13 per hour at the start of its final plateau. GPT-5 is about $13 per hour for tasks that take about 45 minutes, but $120 per hour for tasks that take 2 hours. And o3 actually costs $350 per hour (more than the human price) to achieve tasks at its full 1.5 hour task horizon. This is a lot of money to pay for an agent that fails at the task you’ve just paid for 50% of the time — especially in cases where failure is much worse than not having tried at all.

nopinsight · 2026-04-18T05:09:13.000Z 1776488953

Ord's frontier-cost argument is right as far as it goes, but the piece doesn't engage with the counter-trend: inference cost for a fixed capability level has been falling faster than Moore's law. Pushing the frontier will likely keep getting more expensive and concentrated among a few players, while the intelligence needed for more mundane tasks keeps getting cheaper.

That raises a question: if practical-tier inference commoditizes, how does any company justify the ever-larger capex to push the frontier?

OpenAI's pitch is that their business model should "scale with the value intelligence delivers." Concretely, that means moving beyond API fees into licensing and outcome-based pricing in high-value R&D sectors like drug discovery and materials science, where a single breakthrough dwarfs compute cost. That's one possible answer, though it's unclear whether the mechanism will work in practice.

zozbot234 · 2026-04-18T05:21:28.000Z 1776489688

This effect is likely even larger when you consider that the raw cost per inferred token grows linearly with context, rather than being constant. So longer tasks performed with higher-context models will cost quadratically more. The computational cost also grows super-linearly with model parameter size: a 20B-active model is more than four times the cost of a 5B-active model.
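A rough sketch of why a per-token cost that grows linearly with context gives a quadratic total. The `fixed` and `per_context_token` constants are made-up units, not measured prices, and the function names are only illustrative.

```python
def per_token_cost(context_len: int, fixed: float, per_context_token: float) -> float:
    # Cost of producing one token when `context_len` tokens are already in context.
    return fixed + per_context_token * context_len


def task_cost(n_tokens: int, fixed: float = 1.0, per_context_token: float = 0.01) -> float:
    # Sum the per-token cost as the context grows from 0 to n_tokens - 1.
    return sum(per_token_cost(i, fixed, per_context_token) for i in range(n_tokens))


def task_cost_closed_form(n_tokens: int, fixed: float = 1.0, per_context_token: float = 0.01) -> float:
    # Same sum in closed form: a linear term plus a quadratic term in n_tokens.
    return fixed * n_tokens + per_context_token * n_tokens * (n_tokens - 1) / 2


if __name__ == "__main__":
    for n in (1_000, 10_000, 100_000):
        # The cost per token (total / n) keeps rising as the task gets longer.
        print(n, round(task_cost(n)), round(task_cost(n) / n, 2))
```

Once the quadratic term dominates, doubling the task length roughly quadruples the cost in this toy model.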

tibbar · 2026-04-18T05:28:24.000Z 1776490104

Doesn't context caching mostly eliminate this problem? (I suppose for enough context the 90% discount is eventually a lot anyway)

zozbot234 · 2026-04-18T05:29:52.000Z 1776490192

Context caching is really storing the KV-cache for reuse. It saves running prefill for that part of the context, but tokens referencing that KV-cache will still cost more.
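A toy model of that distinction, assuming arbitrary cost units and reusing the 90% cached-input discount mentioned upthread. None of these names or constants come from a real provider's pricing; they only illustrate that a cache hit shrinks the prefill term while the decode term still scales with the full context.

```python
def decode_attention_cost(prompt_len: int, gen_len: int, attn_unit: float = 1.0) -> float:
    # Each generated token attends over the whole prompt plus the tokens generated so far,
    # whether or not the prompt's KV entries came from a cache.
    return sum(attn_unit * (prompt_len + i) for i in range(gen_len))


def request_cost(
    prompt_len: int,
    gen_len: int,
    prefill_unit: float = 1.0,
    cache_hit: bool = False,
    cached_discount: float = 0.9,  # hypothetical discount on cached prompt tokens
) -> float:
    prefill = prefill_unit * prompt_len
    if cache_hit:
        # A cache hit skips most of the prefill work for the cached prefix.
        prefill *= 1.0 - cached_discount
    return prefill + decode_attention_cost(prompt_len, gen_len)


if __name__ == "__main__":
    for prompt_len in (1_000, 100_000):
        cold = request_cost(prompt_len, gen_len=500)
        warm = request_cost(prompt_len, gen_len=500, cache_hit=True)
        print(prompt_len, round(cold), round(warm))
```

In this sketch the warm request is cheaper than the cold one, but a warm request against a long context still costs far more than one against a short context.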

boxedemp · 2026-04-18T04:45:18.000Z 1776487518

If you gave me an agent that succeeded 50% of tasks I gave it, I could take over the world in a week. Faster if I wasn't so lazy.

I think you're overestimating, or oversimplifying. Maybe both.

jurgenburgen · 2026-04-18T07:42:22.000Z 1776498142

> If you gave me an agent that succeeded 50% of tasks I gave it, I could take over the world in a week. Faster if I wasn't so lazy.

Assuming you used o3, that would cost $58800 per week. That’s an expensive bet for only 50% odds in your favor.

Of course the agents are only that good on benchmarks, in reality your odds are worse. Maybe roulette instead?
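The weekly figure above is just the quoted $350/hour o3 rate run around the clock. A small sketch of that arithmetic, plus the expected spend per successful task if you assume failed attempts are simply retried and each attempt succeeds independently (that independence is an assumption, not something the benchmark guarantees):

```python
HOURS_PER_WEEK = 7 * 24
O3_COST_PER_HOUR = 350  # USD/hour, the figure quoted from the article for o3

weekly_cost = O3_COST_PER_HOUR * HOURS_PER_WEEK


def expected_cost_per_success(cost_per_attempt: float, success_rate: float) -> float:
    # With independent retries, the number of attempts until the first success
    # is geometric with mean 1 / success_rate.
    if not 0 < success_rate <= 1:
        raise ValueError("success_rate must be in (0, 1]")
    return cost_per_attempt / success_rate


if __name__ == "__main__":
    print(weekly_cost)                               # 58800
    print(expected_cost_per_success(350.0, 0.5))     # 700.0
```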

raincole · 2026-04-18T05:12:42.000Z 1776489162

No one is claiming an agent can do 50% of arbitrary tasks. It's just 50% of METR's benchmark set.

> I think you're overestimating, or oversimplifying

Yeah if you only read comments on HN but not the actual linked article you will get oversimplified conclusion. Like, duh?

TeMPOraL · 2026-04-18T08:52:25.000Z 1776502345

> Yeah if you only read comments on HN but not the actual linked article you will get oversimplified conclusion. Like, duh?

Curiously, for most submissions it's the opposite - comments are much more useful and nuanced than the source being discussed.

boxedemp · 2026-04-18T05:14:10.000Z 1776489250

Sorry for stating something so obvious. I'll comment less from now on.

EdvinPL · 2026-04-18T09:50:35.000Z 1776505835

AI feels more like a gamble. People like gambling. From casinos (win-lose), to lootboxes (uncertainty) or even extramarital sex (whose baby is it?).

This way - AI work is like a slot machine - will this work or not? Either way - casino gets paid and casino always wins.

Nevertheless - if the idea or product is very good (filling high market pain) and not that difficult to build - it can enable non-coders to "gamble" for the outcome with AI for $.

Sadly - from my experience hiring devs - hiring people is also a gamble...

ketzu · 2026-04-18T10:06:42.000Z 1776506802

> or even extramarital sex (whose baby is it?).

This is the weirdest example of "gambling" I have seen in my life. If you'd've written "unprotected sex" I'd see the gambling part, but "extramarital sex" covers so much more than the tiny subset of "whose baby is it" (how many people are there having sex to gamble on who will be the father of a baby? 10?).

This made my day.

ting0 · 2026-04-18T09:04:15.000Z 1776503055

No, but the AI labs would love to frame it this way so they can continue to nerf models and increase prices while they use the cheap, highly performant, highly powerful models internally to replace all of your businesses.

dang · 2026-04-17T21:42:28.000Z 1776462148

Related ongoing thread:

Measuring Claude 4.7's tokenizer costs - https://news.ycombinator.com/item?id=47807006 (309 comments)

greenmilk · 2026-04-17T23:26:40.000Z 1776468400

Are any inference providers currently making a profit (on inference, I know Google makes money)?

wsun19 · 2026-04-18T00:05:52.000Z 1776470752

Pretty much every major American inference provider claims to make a profit on API-based inference. Consumer plans might be subsidized overall, but it's hard to say since they're a black box and some consumers don't fully use their plans

henry2023 · 2026-04-18T01:46:26.000Z 1776476786

Third parties selling open-weight inference on OpenRouter are surely selling at a profit. Zero reason to subsidize it.

dannersy · 2026-04-18T08:37:09.000Z 1776501429

If they were they would show evidence because they'd pull in more investment. I don't believe their claim that they make profits on inference, especially not with reports like this coming out.

wavemode · 2026-04-18T00:44:27.000Z 1776473067

Selling inference is not fundamentally different from selling compute - you amortize the lifetime cost of owning and operating the GPUs and then turn that into a per-token price. The risk of loss would be if there is low demand (and thus your facilities run underutilized), but I doubt inference providers are suffering from this.

Where the long-term payoff still seems speculative is for companies doing training rather than just inference.
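A minimal sketch of that amortization, to show how much the break-even price depends on utilization and on the assumed hardware lifetime. Every number below is a placeholder, not real GPU pricing or throughput, and the function name is made up for illustration.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600


def price_per_million_tokens(
    capex_usd: float,            # purchase cost of the hardware (hypothetical)
    lifetime_years: float,       # assumed useful life
    opex_usd_per_year: float,    # power, cooling, hosting, staff (hypothetical)
    tokens_per_second: float,    # throughput at full load (hypothetical)
    utilization: float,          # fraction of time the hardware is serving paid traffic
) -> float:
    # Break-even price: lifetime cost divided by lifetime tokens served.
    total_cost = capex_usd + opex_usd_per_year * lifetime_years
    total_tokens = tokens_per_second * utilization * lifetime_years * SECONDS_PER_YEAR
    return total_cost / total_tokens * 1_000_000


if __name__ == "__main__":
    for util in (1.0, 0.5, 0.25):
        for years in (2, 4, 6):
            p = price_per_million_tokens(30_000, years, 5_000, 1_000, util)
            print(f"utilization={util:.2f} lifetime={years}y -> ${p:.3f} per 1M tokens")
```

Halving utilization doubles the break-even price, and a shorter assumed lifetime raises it because the capex is spread over fewer tokens.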

Gigachad · 2026-04-18T00:56:33.000Z 1776473793

There’s a lot of debate over what the useful lifespan of the hardware is though. A number that seems very vibes based determines if these datacenters are a good investment or disastrous.

hypercube33 · 2026-04-18T02:20:31.000Z 1776478831

I specifically remember this debate coming up when the H100 was the only player on the table and AMD came out with a card that was almost as fast, at least in benchmarks, at about half the cost. I haven't seen a follow-up with real-world use though, and as a home labber I know that in the last three weeks the support for AMD stuff has gotten impressively useful, even covering CUDA if you enjoy pain and suffering.

What I'm curious about is the other stuff out there, such as the ARM and tensor chips.

raincole · 2026-04-18T02:53:01.000Z 1776480781

All of them. It's simply impossible to sell tokens by usage at a loss now. You'll be arbitraged to death in a few days. It only makes sense to subsidize cost if you're selling a subscription.

jagged-chisel · 2026-04-17T23:44:04.000Z 1776469444

Google definitely makes money in other areas. Do they make money on inference?

quicklywilliam · 2026-04-18T00:16:07.000Z 1776471367

Interesting read. I don't know if I quite buy the evidence, but it's definitely enough to warrant further investigation. It also matches up with my personal experience, which is that tools like Claude Code are burning through more and more tokens as we push them to do bigger and bigger work. But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.

So: I buy that the cost of frontier performance is going up exponentially, but that doesn't mean there is a fundamental link. We also know that benchmark performance of much smaller/cheaper models has been increasing (as far as I know METR only looks at frontier models), so that makes me wonder if the exponential cost/time horizon relationship is only for the frontier models.

esperent · 2026-04-18T03:05:37.000Z 1776481537

> But we all know the frontier model companies are burning through money in an unsustainable race to get you and your company hooked on their tools.

Do we? Because elsewhere in the thread there are people claiming they are profitable on API billing and might be at least close to break-even on subscriptions, given that many people don't use all of their allowance.

ai-x · 2026-04-18T05:15:16.000Z 1776489316

Anthropic has 50% gross margins on their tokens.

Step 1) Bubble callers will be proven wrong in 2026 if not already (no excess capacity)

Step 2) The "models are not profitable" claim is proven wrong (when Anthropic files their S-1)

Step 3) FOMO and actual bubble (say around 2028/29)

dminik · 2026-04-18T09:38:12.000Z 1776505092

If they had such a high margin, they wouldn't need to fuck around with token usage/pricing every three days.

I have no data to support this, but I think they just about break even on API usage and take an overall loss on subscriptions/free plans.

2848484995 · 2026-04-18T07:54:53.000Z 1776498893

Can we see them?

lwhi · 2026-04-18T09:00:53.000Z 1776502853

I think an interesting counterpoint is whether the value obtained is decreasing.

agentifysh · 2026-04-18T00:42:52.000Z 1776472972

Until there is some drastic new hardware, we are going to see a similar situation to proof of work, where a small group hoards the hardware and can collude on prices.

The difference is that current prices are heavily subsidized by OPM (other people's money).

Once the narrative changes to something more realistic, I can see prices increasing across the board. Forget $200/month for codex pro; expect $1000/month or something similar.

So it's a race between new hardware supply and paradigm shifts reaching the market vs. the tide going out in the financial markets.

jiggawatts · 2026-04-18T04:50:24.000Z 1776487824

> Until there is some drastic new hardware

For inference, there is already a 10x improvement possible over a setup based on NVIDIA server GPUs, but volume production, etc... will take a while to catch up.

During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.

New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.

There are also algorithmic improvements like the recently announced Google TurboQuant.

Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.

zozbot234 · 2026-04-18T05:08:58.000Z 1776488938

> Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.

Isn't reading from flash significantly more power intensive than reading DRAM? Anyway, the overhead of keeping weights in memory becomes negligible at scale because you're running large batches and sharding a single model over large amounts of GPUs. (And that needs the crazy fast networking to make it work, you get too much latency otherwise.)

jiggawatts · 2026-04-18T06:59:35.000Z 1776495575

For a given capacity of memory, Flash uses far less power than DRAM, especially when used mostly for reads.

> becomes negligible at scale

Nothing is negligible at scale! Both the cost and the power draw of HBM are limiting factors for the hyperscalers, to the point that Sam Altman (famously!) cornered the market and locked in something like 40% of global RAM production, driving up prices for everyone.

> sharding a single model over large amounts of GPUs

A single host server typically has 4-16 GPUs directly connected to the motherboard.

Part of the reason for sharding models between multiple GPUs is that their weights don't fit into the memory of any one card! HBF could be used to give each GPU/TPU well over a terabyte of capacity for weights.

Last but not least, the context cache needs to be stored somewhere "close" to the GPUs. Across millions of users, that's a lot of unique data with a high churn rate. HBF would allow the GPUs to keep that "warm" and ready to go for the next prompt at a much lower cost than keeping it around in DRAM and having to constantly refresh it.

zozbot234 · 2026-04-18T07:24:37.000Z 1776497077

> For a given capacity of memory, Flash uses far less power than DRAM, especially when used mostly for reads.

Flash has no idle power draw because it is non-volatile (whereas DRAM needs refresh), but the active power for reading a fixed-size block is significantly larger for Flash. You can still use Flash profitably, but only for rather sparse and/or low-intensity reads. That probably fits things like MoE layers if the MoE is sparse enough.

Also, you can't really use flash memory (especially soldered-in HBF) for ephemeral data like the KV context for a single inference, it wears out way too quickly.

adrian_b · 2026-04-18T08:09:12.000Z 1776499752

Modern flash memory, with multi-bit cells, indeed requires more power for reading than DRAM, for the same amount of data.

However, for old-style 1-bit per cell flash memory I do not see any reason for differences in power consumption for reading.

Different array designs and sense amplifier designs and CMOS fabrication processes can result in different power consumptions, but similar techniques can be applied to both kinds of memories for reducing the power consumption.

Of course, storing only 1 bit per cell instead of 3 or 4 greatly reduces the density and cost advantages of flash memory, but what remains may still be enough for what inference needs.

colechristensen · 2026-04-18T01:02:04.000Z 1776474124

Doubtful, local models are the competitive future that will keep prices down.

128GB is all you need.

A few more generations of hardware and open models will find people pretty happy doing whatever they need to on their laptop locally with big SOTA models left for special purposes. There will be a pretty big bubble burst when there aren't enough customers for $1000/month per seat needed to sustain the enormous datacenter models.

Apple will win this battle and nvidia will be second when their goals shift to workstations instead of servers.

hypercube33 · 2026-04-18T02:16:28.000Z 1776478588

Weird how you're leaving stuff like Strix Halo out. Also weird that you think 128GB is the future, given all the research and papers out now targeting something around 12GB. I assume we'll end up with less general purpose models and more specific small ones swapped out for whatever work you are asking to do.

MrBuddyCasino · 2026-04-18T04:37:53.000Z 1776487073

Strix Halo hasn't got nearly enough bandwidth, it's just 256-bit.

Tepix · 2026-04-18T05:07:18.000Z 1776488838

It's sufficient for some MoE models.

lookaround · 2026-04-18T01:42:15.000Z 1776476535

> 128GB is all you need.

My guy, look around.

They are coming for personal compute.

Where are you going to get these 128GBs? Aquaman? [0]

The ones who make RAM are inexplicably attaching their fate to the future being all LLMs only everywhere.

[0] https://www.youtube.com/watch?v=0-w-pdqwiBw

naveen99 · 2026-04-18T01:49:14.000Z 1776476954

Cloud can’t make money off of you and pay more than you for the hardware at the same time.

adrianN · 2026-04-18T03:49:18.000Z 1776484158

Batch inference is much more efficient. Using the hardware round the clock is much more efficient. Cloud can absolutely pay more for hardware and still make money off you.
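A back-of-the-envelope roofline sketch of why batching helps during decoding: the weights are streamed from memory once per step and shared by every request in the batch, so the per-token time falls until compute becomes the limit. The hardware numbers are invented placeholders, and the model ignores KV-cache reads and networking entirely.

```python
def seconds_per_token(
    batch: int,
    weight_bytes: float = 1e11,     # hypothetical: ~100 GB of weights read per step
    mem_bandwidth: float = 1e12,    # hypothetical: 1 TB/s memory bandwidth
    flops_per_token: float = 1e11,  # hypothetical: compute needed per generated token
    peak_flops: float = 1e15,       # hypothetical: 1 PFLOP/s peak compute
) -> float:
    # One decode step produces one token for each request in the batch.
    memory_time = weight_bytes / mem_bandwidth          # paid once per step
    compute_time = batch * flops_per_token / peak_flops  # grows with batch size
    step_time = max(memory_time, compute_time)
    return step_time / batch


if __name__ == "__main__":
    for batch in (1, 10, 100, 1_000, 10_000):
        print(batch, seconds_per_token(batch))
```

With these placeholder numbers, per-token time drops in proportion to batch size until the step becomes compute-bound, after which larger batches stop helping.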

bitwize · 2026-04-18T03:32:18.000Z 1776483138

Cloud can pay more for RAM until all the RAM producers withdraw from the consumer market, then prices will go back down.

End users will still get access to RAM. The cloud terminal they purchase from Apple, Google, Samsung, or HP will have all the RAM it will ever need directly soldered onto it.

seanmcdirmid · 2026-04-18T03:36:37.000Z 1776483397

Doesn’t Apple place RAM directly into the SoC package? We aren’t even talking about soldering it to motherboards anymore; it comes packaged with the CPU, like it would with a GPU.

xantronix · 2026-04-18T04:14:56.000Z 1776485696

I was really fucking hoping we weren't at the part where "cloud terminals" doesn't seem farfetched and paranoid and yet here we are. Jesus Christ.

bitwize · 2026-04-18T06:05:42.000Z 1776492342

The next step, I think, will be a "cash for clunkers" program to permit people to trade in old computer hardware to the government—especially since operating systems that do not collect KYC data on their users will soon be illegal to operate.

foota · 2026-04-18T01:51:54.000Z 1776477114

More like RAM producers are providing supplies to the highest bidder, no? If this doesn't peter out supply will normalize at a higher but less insane price eventually.

matt3210 · 2026-04-18T01:02:29.000Z 1776474149

I took a month-long break and my side project took 2x as many tokens.

noosphr · 2026-04-18T02:50:58.000Z 1776480658

Yet again: Transformers are fundamentally quadratic.

If they can do a task that takes 1 unit of computation for 1 dollar they will cost 100 dollars for a 10 unit task and 10,000 for a 100 unit task.

Project costs from Claude Code bear this out in the real world.

twaldin · 2026-04-18T08:03:16.000Z 1776499396

idk over my testing, glm-5 inside opencode beats all other agents head to head

keepamovin · 2026-04-18T05:35:56.000Z 1776490556

My expectation: demand going up, prices will rise, supply will saturate to the point of ubiquitous "utility" status, and prices will drop, probably a bell curve shape with sine-wave undulations along the way.

chii · 2026-04-18T06:17:17.000Z 1776493037

> supply will saturate

that depends on the ability to produce supply at a saturation rate.

It did work for internet backhaul links, à la those dark fibres. However, I reckon those fibres are easier to manufacture than silicon chips.

I wonder if saturation is possible for ai capable chips.

siliconc0w · 2026-04-18T03:08:03.000Z 1776481683

Working on an OSS tool to help orgs identify where they can save on token costs: https://repogauge.org

Happy to run it on your repos for a free report: hi@repogauge.org