AI Cost Is Not Just Compute

Compute shows up clearly on the bill. But inference cost is usually shaped earlier, by technical decisions that compound over time. Once an AI feature is live, cost is driven by a set of levers that finance rarely controls directly, yet feels immediately.

1. Context length raises unit cost

Tokens are the chunks of text a model processes. More tokens mean more work, which translates directly into higher cost per request.

Context tends to expand as systems mature. Prompts grow as teams add instructions, guardrails, and examples. Products introduce memory or conversation history. Agents begin stitching together multiple steps and tool calls. Each of those improvements makes the product better and the unit cost higher.
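A back-of-envelope sketch of how that expansion translates into per-request cost. The prices and token counts below are illustrative placeholders, not any provider's real rates:

```python
# Hedged sketch: how per-request cost scales as context grows.
# Prices are assumed examples, not a specific provider's rates.
PRICE_PER_1K_INPUT = 0.0025   # assumed $/1K input tokens
PRICE_PER_1K_OUTPUT = 0.0100  # assumed $/1K output tokens

def request_cost(system_tokens, history_tokens, user_tokens, output_tokens):
    """Estimate one request's cost from its token counts."""
    input_tokens = system_tokens + history_tokens + user_tokens
    return (input_tokens / 1000) * PRICE_PER_1K_INPUT + \
           (output_tokens / 1000) * PRICE_PER_1K_OUTPUT

# Same feature, before and after adding guardrails and conversation memory:
lean = request_cost(system_tokens=400, history_tokens=0, user_tokens=200, output_tokens=300)
rich = request_cost(system_tokens=2000, history_tokens=6000, user_tokens=200, output_tokens=300)
```

With these assumed numbers, the richer prompt costs roughly five times as much per request, even though the user-visible interaction is identical.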

Underneath the surface, long context creates real infrastructure pressure. The key-value cache that models use to process context grows with length, consuming GPU (graphics processing unit) memory and bandwidth. That reduces throughput and pushes up cost per output. NVIDIA's technical work on inference optimization shows how memory pressure, not raw compute alone, becomes the limiting factor at scale.
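To make the memory pressure concrete, here is a rough estimate of KV-cache size for one sequence. The model shape (layer count, head count, head dimension, precision) is an assumed example, not any specific model's architecture:

```python
# Hedged back-of-envelope: KV-cache memory for a single sequence.
# Model dimensions below are illustrative assumptions.
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    # Factor of 2 covers the key and value tensors stored at every layer.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

short_ctx = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=4_000)
long_ctx = kv_cache_bytes(n_layers=32, n_kv_heads=8, head_dim=128, seq_len=128_000)
```

Under these assumptions, a 4K-token context holds about 0.5 GB of cache per sequence, and a 128K-token context holds 32 times that. Memory that is holding cache is memory that cannot serve other requests, which is why throughput falls as context grows.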

One helpful way to think about this: as the product experience improves through richer context, unit economics often shift underneath it, unless the system is designed with cost boundaries in mind from the start.

2. Throughput becomes a procurement problem

As soon as an AI feature needs consistent performance, the conversation shifts from individual requests to sustained capacity. That is where procurement enters the picture, even if it is not labeled that way.

Many organizations move to provisioned capacity models to keep latency predictable and prevent rate limits from disrupting production workflows. Azure's documentation on Provisioned Throughput Unit (PTU) billing is explicit that PTU deployments are charged hourly based on the number of deployed units, and that reservation structures materially change the effective rate. The cost is no longer purely tied to what you use. It is tied to what you committed to have available.
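The billing mechanics can be sketched in a few lines. The unit count, hourly rate, and reservation discount below are placeholders for illustration, not Azure's published prices:

```python
# Hedged sketch of capacity-backed billing: cost is driven by
# deployed units x hours, not by tokens actually consumed.
# All rates and discounts here are assumed, not published prices.
HOURS_PER_MONTH = 730

def monthly_capacity_cost(units, hourly_rate_per_unit, reservation_discount=0.0):
    return units * hourly_rate_per_unit * HOURS_PER_MONTH * (1 - reservation_discount)

pay_hourly = monthly_capacity_cost(units=100, hourly_rate_per_unit=1.0)
reserved = monthly_capacity_cost(units=100, hourly_rate_per_unit=1.0,
                                 reservation_discount=0.30)
```

Note what is absent from the function: usage. That is the structural shift finance has to model, and the reservation discount is the lever procurement negotiates.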

For finance, that means spend shifts from variable to capacity-backed, which changes how forecasts and commitments behave. For procurement, contract structure becomes an optimization lever. Whether you are on hourly capacity, a reservation, or a blended approach shapes whether cost stays predictable or drifts over time.

3. Reliability overhead compounds spend

This one is easy to miss because it looks like standard engineering practice.

Production AI systems handle rate limits with retries. They route to fallback models when a primary call fails. They run parallel calls to hit latency targets. They layer in safety and policy checks that require additional model invocations. One user-facing request can quietly become two, three, or six model calls depending on how the system is built.
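The fan-out described above can be estimated directly. The retry, fallback, and parallelism numbers below are invented for illustration:

```python
# Hedged sketch: expected model calls generated by one user-facing request.
# All policy parameters are illustrative assumptions.
def expected_model_calls(retry_rate, fallback_rate, parallel_calls, safety_checks):
    primary = parallel_calls * (1 + retry_rate)   # retries on rate limits / failures
    fallback = parallel_calls * fallback_rate     # rerouted to a fallback model
    return primary + fallback + safety_checks     # plus policy/safety invocations

calls = expected_model_calls(retry_rate=0.10, fallback_rate=0.05,
                             parallel_calls=2, safety_checks=1)
```

With a 10% retry rate, a 5% fallback rate, two parallel calls, and one safety check, each user request averages about 3.3 model calls. That multiplier, not user traffic alone, is what the invoice reflects.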

Finance sees one feature on the roadmap; engineering sees a call graph. What often gets missed is that reliability patterns are designed in early and rarely revisited for cost impact once they are running. The call volume driving your invoice is not the same as the interaction volume your product team is reporting. The gap between those two numbers is worth understanding.

4. Observability and governance are part of the bill

AI systems drift and regress. They produce inconsistent outputs under conditions that are hard to predict in advance. So teams instrument them carefully. They log prompts, trace chains, measure quality, and run continuous evaluations to catch problems before users do.

That instrumentation is not free. Logging, storage, and data retention add cost. The basics teams need to capture, including model version, prompt details, token counts, and cost attribution, all generate data that lives somewhere and gets billed accordingly.
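A minimal per-call record makes the point. The field names below are assumptions for illustration, not a standard schema; the key observation is that every model call produces a row that is stored, retained, and billed:

```python
# Hedged sketch: the minimal per-call record many teams log for
# cost attribution. Field names are assumed, not a standard schema.
def call_record(model_version, prompt_id, input_tokens, output_tokens, unit_cost_usd):
    return {
        "model_version": model_version,
        "prompt_id": prompt_id,
        "input_tokens": input_tokens,
        "output_tokens": output_tokens,
        "cost_usd": round(unit_cost_usd, 6),
    }

record = call_record("model-2025-01", "support-triage-v3",
                     input_tokens=8200, output_tokens=300, unit_cost_usd=0.0235)
```

Multiply one record by millions of calls and months of retention, and the observability line item becomes part of the unit cost, exactly as the paragraph above argues.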

What often gets treated as engineering overhead is, from a finance perspective, part of the cost of operating the AI feature. It belongs in the unit cost model alongside inference usage, not in a separate bucket that never makes it into the forecast.

5. Caching and batching change unit economics fast

This is the most underrated bridge between engineering design and financial outcomes.

If prompts share repeatable prefixes, such as system instructions, standard formatting, or shared context blocks, caching can dramatically reduce both cost and latency. OpenAI's documentation on prompt caching notes it can cut input token costs by half or more and latency by up to 80% when requests share an exact prefix match.

Some teams start to notice a significant spread in unit economics between similar AI features, not because of volume differences but because of how the underlying prompts are structured. Two teams can ship the same capability and end up with wildly different cost profiles depending on whether caching was designed in from the beginning.
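The spread is easy to see in a blended-cost calculation. The base price, hit rate, and discount below are assumed for illustration:

```python
# Hedged sketch: how cache hit rate changes blended input cost.
# Base price, discount, and hit rate are illustrative assumptions.
def blended_input_cost(base_price_per_1k, tokens, hit_rate, cache_discount=0.50):
    cached = tokens * hit_rate * (1 - cache_discount)  # discounted cached tokens
    uncached = tokens * (1 - hit_rate)                 # full-price tokens
    return (cached + uncached) / 1000 * base_price_per_1k

no_cache = blended_input_cost(base_price_per_1k=0.0025, tokens=8000, hit_rate=0.0)
with_cache = blended_input_cost(base_price_per_1k=0.0025, tokens=8000, hit_rate=0.75)
```

With a 75% hit rate and a 50% discount on cached tokens, input cost drops by more than a third, with zero product change. A prompt that buries variable content ahead of the stable prefix forfeits that entire saving, which is how two identical features end up with different cost profiles.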

Where This Leaves Finance, Procurement, and Engineering

Finance feels the impact first, through unit costs that move without obvious cause and forecasts that age quickly. Modeling AI spend requires visibility into technical drivers that were never part of traditional budgeting conversations.

Engineering owns many of the levers that shape cost behavior, often without a clear line of sight into how quickly small design choices accumulate into financial exposure.

Procurement increasingly sits upstream of both. Capacity models, reservation structures, and contract terms influence architecture long before an invoice exists. When intake does not capture technical cost drivers early, the leverage available at negotiation is already reduced.

AI spend is not decided at the bill. It is shaped across design, procurement, and operation. Once those perspectives are in the same conversation, the work shifts away from explaining variance after the fact and toward building systems that scale with intention.

Where FinOps Fits

AI is forcing cost conversations upstream, and that is not a bad thing. FinOps sits at the intersection of these three perspectives, translating technical behavior into financial context and pulling procurement into the loop early enough to change outcomes.

The organizations getting ahead of AI cost are not necessarily the ones spending the most on tooling. They are the ones where finance understands what tokens and context windows mean for a budget, where engineering knows that a retry policy has a dollar value, and where procurement is in the room when capacity decisions get made.

That coordination is what turns AI from a budgeting surprise into a managed system. The cost was always there. It just needed a shared language.



That’s all for this week. See you next Tuesday!
