Hacker News
Are the costs of AI agents also rising exponentially? (2025)
thelastgallon
nopinsight
That raises a question: if practical-tier inference commoditizes, how does any company justify the ever-larger capex to push the frontier?
OpenAI's pitch is that their business model should "scale with the value intelligence delivers." Concretely, that means moving beyond API fees into licensing and outcome-based pricing in high-value R&D sectors like drug discovery and materials science, where a single breakthrough dwarfs compute cost. That's one possible answer, though it's unclear whether the mechanism will work in practice.
boxedemp
I think you're overestimating, or oversimplifying. Maybe both.
dang
Measuring Claude 4.7's tokenizer costs - https://news.ycombinator.com/item?id=47807006 (309 comments)
quicklywilliam
So: I buy that the cost of frontier performance is going up exponentially, but that doesn't mean there is a fundamental link. We also know that benchmark performance of much smaller/cheaper models has been increasing (as far as I know METR only looks at frontier models), so that makes me wonder if the exponential cost/time horizon relationship is only for the frontier models.
ai-x
Step 1) Bubble callers are proven wrong in 2026, if not already (no excess capacity).
Step 2) The "models aren't profitable" claim is proven wrong (when Anthropic files its S-1).
Step 3) FOMO and an actual bubble (say around 2028/29).
esperent
Do we? Because elsewhere in the thread there are people claiming they are profitable on API billing and may be at least close to break-even on subscriptions, given that many people don't use their full allowance.
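The break-even question is just arithmetic over the usage distribution. A minimal sketch, with every number invented for illustration (none come from the thread):

```python
# Can a flat subscription break even when usage is skewed?
# All figures below are hypothetical assumptions, not real lab data.
PRICE = 200.0            # $/month flat subscription (assumed)
COST_PER_M_TOKENS = 2.0  # assumed marginal inference cost, $ per million tokens

# Assumed usage across ten subscribers (millions of tokens/month):
# most users consume far less than the heaviest users.
users = [1, 2, 5, 5, 10, 20, 40, 80, 150, 300]

serving_cost = sum(u * COST_PER_M_TOKENS for u in users)
revenue = PRICE * len(users)
print(f"revenue ${revenue:.0f}, cost ${serving_cost:.0f}, "
      f"margin ${revenue - serving_cost:.0f}")
```

Under these made-up numbers the light users subsidize the heavy ones and the cohort is profitable overall; shift the distribution toward heavy users (or raise the marginal cost) and the margin flips negative.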
agentifysh
The difference is that current prices are heavily subsidized by OPM (other people's money).
Once the narrative shifts to something more realistic, I can see prices increasing across the board. Forget $200/month for Codex Pro; expect $1000/month or something similar.
So it's a race between new hardware supply (and new paradigm shifts) reaching the market versus the tide going out in the financial markets.
jiggawatts
For inference, a 10x improvement over setups based on NVIDIA server GPUs is already possible, but volume production and the rest of the supply chain will take a while to catch up.
During inference the model weights are static, so they can be stored in High Bandwidth Flash (HBF) instead of High Bandwidth Memory (HBM). Flash chips are being made with over 300 layers and they use a fraction of the power compared to DRAM.
NVIDIA GPUs are general purpose. Sure, they have "tensor cores", but that's a fraction of the die area. Google's TPUs are much more efficient for inference because they're mostly tensor cores by area, which is why Gemini's pricing is undercutting everybody else despite being a frontier model.
New silicon process nodes are coming from TSMC, Intel, and Samsung that should roughly double the transistor density.
There are also algorithmic improvements, like the recently announced Google TurboQuant.
Not to mention that pure inference doesn't need the crazy fast networking that training does, or the storage, or pretty much anything other than the tensor units and a relatively small host server that can send a bit of text back and forth.
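The memory-bound framing behind this comment can be made concrete with the standard decode bound: each generated token streams the full weights once, so batch-1 throughput is capped by memory bandwidth divided by weight bytes. A rough sketch with illustrative numbers (assumed, not measured from any specific part):

```python
# Single-stream decode bound: tokens/s <= bandwidth / weight_bytes,
# since every generated token must read all weights once.
# Numbers are illustrative assumptions only.
params_b = 70            # model size, billions of parameters (assumed)
bytes_per_param = 2      # fp16/bf16 weights
bandwidth_gb_s = 3350    # HBM-class bandwidth in GB/s (illustrative)

weight_gb = params_b * bytes_per_param          # 140 GB of weights
tokens_per_s = bandwidth_gb_s / weight_gb       # ~24 tokens/s ceiling
print(f"batch-1 ceiling: {tokens_per_s:.1f} tokens/s")
```

The same arithmetic shows why slower-but-denser storage for static weights is attractive: the bound depends only on read bandwidth, not on write performance or general-purpose compute.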
zozbot234
Isn't reading from flash significantly more power-intensive than reading DRAM? In any case, the overhead of keeping weights in memory becomes negligible at scale, because you're running large batches and sharding a single model over a large number of GPUs. (And that needs the crazy-fast networking to make it work; otherwise you get too much latency.)
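The amortization argument here can be sketched numerically: one streaming pass over the weights serves every request in the batch, so the per-token weight-read cost falls roughly as 1/batch. Figures are illustrative only:

```python
# Sketch of how batching amortizes the cost of streaming static weights:
# one pass over the weights serves the whole batch, so per-token cost
# shrinks ~1/batch. The energy figure is an arbitrary placeholder.
weight_read_joules = 50.0   # assumed energy to stream all weights once
for batch in (1, 8, 64, 256):
    per_token = weight_read_joules / batch
    print(f"batch {batch:>3}: {per_token:.3f} J of weight reads per token")
```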
colechristensen
128GB is all you need.
A few more generations of hardware and open models, and people will be pretty happy doing whatever they need locally on their laptops, with big SOTA models reserved for special purposes. There will be a pretty big bubble burst when there aren't enough customers paying the $1000/month per seat needed to sustain the enormous datacenter models.
Apple will win this battle and nvidia will be second when their goals shift to workstations instead of servers.
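Whether a big open model actually fits in 128 GB comes down to the quantized weight footprint, roughly params × bits / 8, ignoring KV cache and runtime overhead. A quick sketch with hypothetical model sizes:

```python
# Quantized weight footprint in GB, ignoring KV cache and overhead.
# Model sizes and bit widths below are hypothetical examples.
def weight_gb(params_b, bits):
    return params_b * 1e9 * bits / 8 / 1e9

for params_b, bits in [(70, 4), (120, 4), (120, 8), (405, 4)]:
    gb = weight_gb(params_b, bits)
    verdict = "fits" if gb < 128 else "does not fit"
    print(f"{params_b}B @ {bits}-bit: {gb:.0f} GB -> {verdict} in 128 GB")
```

Under these assumptions a 70B model at 4-bit needs about 35 GB, so 128 GB leaves headroom for context; only the largest models at higher precision spill over.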
lookaround
My guy, look around.
They are coming for personal compute.
Where are you going to get these 128GBs? Aquaman? [0]
The RAM manufacturers are inexplicably tying their fate to a future that is all LLMs, everywhere.
bitwize
End users will still get access to RAM. The cloud terminal they purchase from Apple, Google, Samsung, or HP will have all the RAM it will ever need directly soldered onto it.
wavemode
Where the long-term payoff still seems speculative is for companies doing training rather than just inference.
hypercube33
What I'm curious about is the other stuff out there, such as the ARM and tensor chips.
siliconc0w
Happy to run it on your repos for a free report: hi@repogauge.org