Making Inference Cost Measurable and Routable
GPU cost unpredictable is a symptom. The disease is unlabeled spend — you see a bill, not a decision tree. Fix measurement first; routing second.
CTOs describe the same week: finance asks for a forecast, engineering says "it depends on traffic," and the chart of GPU cost unpredictable spikes looks like weather. That is not a capacity-planning failure alone. It is missing attribution between what the product did and what infra it burned.
Three layers of inference cost
Production AI spend usually mixes three buckets. Treating them as one line item guarantees surprise.
- Provider inference. Per-token API charges or self-hosted GPU hours.
- Retrieval and orchestration. Embeddings, re-ranking, agent tool loops — often more calls than the "main" generation.
- Retry and quality tax. Failed RAG pulls, guardrail rewrites, user-triggered regenerations. Invisible in demos, massive in production.
Dashboards that only show "OpenAI bill" miss two-thirds of the story. AI inference cost reduction starts when retrieval and retries carry the same labels as chat completions.
Make cost measurable: the minimum schema
You do not need a data warehouse on day one. You need consistent fields on every inference event:
workflow_id— user-facing job, not internal microservice namestep— retrieve, classify, generate, validatemodel— provider + model ID actually invokedtokens_in/tokens_out— or GPU-seconds for self-hostedaccount_id— for B2B, tie to customer before useroutcome— success, retry, fallback model
With that schema, questions become answerable: Which workflow blew the budget Tuesday? Did the new prompt increase output tokens? Are retries doubling cost on one customer integration?
This is the foundation of token cost management — credits on your pricing page should map back to these rows, or you are guessing margin.
Make cost routable: cheapest capable model
LLM routing is not "send easy stuff to the small model" as a slogan. It is a policy per step:
- Define the quality bar for the step (factual grounding, tone, format compliance).
- Run evals on a fixed set of production-shaped inputs — not cherry-picked demos.
- Route to the cheapest model that clears the bar; escalate only on low confidence.
- Log escalations as first-class events — they are your roadmap for fine-tuning or better retrieval.
Without per-step evals, routing is superstition. You will either over-spend on premium models or under-deliver on the steps users notice.
What "routable" looks like in ops
Measurable + routable infra means an on-call engineer can answer in minutes:
- We shifted 38% of classification calls to a smaller model last release — quality held, cost down.
- Account X's integration triggers double retrieval — not a provider outage.
- Latency spike correlates with a routing fallback, not GPU saturation.
That is when inference stops feeling like weather and starts feeling like a system you operate.
Start this week
Instrument one workflow end to end. Add routing to exactly one step where evals already exist. Resist boiling the ocean.
Predictable GPU and API bills are not about locking spend — they are about tying every dollar to a product decision you can change.
Want cost per workflow, not just a provider invoice? Early access to ozDNA's routing and governance layer is open for vertical AI teams.
Get Early Access