When Every Active User Destroys Your Margin
Your AI product does not have a margin problem because users love it. It has a margin problem because every workflow calls the most expensive model too often — and you only notice after activation curves look great.
This is the pattern we keep seeing in vertical AI teams: AI API bills scale faster than revenue. Not because the product is broken. Because inference cost is tied to usage intensity, and usage intensity rises with the features that make the product sticky.
A compliance assistant that re-reads twelve documents per query. An academic tool that runs detection, humanization, and summarization in one session. A support agent that chains three LLM calls before a human ever sees the ticket. Each active user is doing more work per visit than your spreadsheet assumed.
The trap: confusing growth metrics with unit economics
Founders optimize what dashboards show first: signups, weekly actives, time-in-product. Investors ask about those early, too. But the number that quietly kills seed-stage AI companies is gross margin per active user, and most teams do not instrument it until Series A conversations get awkward.
Symptoms look like familiar SaaS problems:
- Revenue grows 15% month over month; infra grows 40%.
- Support tickets about "slow" responses — often expensive-model latency, not bad UX.
- Power users you love in case studies are underwater on variable cost.
- You add credits or caps, and churn spikes among the users who were most engaged.
The difference from classic SaaS: your COGS scales with intelligence per action, not just storage and bandwidth. Two users on the same plan can have 10× different token burn.
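To make the 10× gap concrete, here is a minimal sketch of gross margin per active user. The plan price, token counts, and the blended price per 1,000 tokens are hypothetical numbers chosen for illustration, and inference is treated as the only variable cost.

```python
from dataclasses import dataclass

# Hypothetical blended price in USD per 1,000 tokens; not a real provider rate.
PRICE_PER_1K_TOKENS = 0.01


@dataclass
class UserMonth:
    user_id: str
    plan_price: float  # monthly subscription revenue in USD
    tokens_used: int   # total input + output tokens for the month


def gross_margin(u: UserMonth) -> float:
    """Gross margin as a fraction of revenue, counting inference as the only COGS."""
    inference_cost = u.tokens_used / 1000 * PRICE_PER_1K_TOKENS
    return (u.plan_price - inference_cost) / u.plan_price


# Two users on the same plan, with a 10x difference in token burn.
light = UserMonth("light", plan_price=30.0, tokens_used=400_000)
heavy = UserMonth("heavy", plan_price=30.0, tokens_used=4_000_000)

for u in (light, heavy):
    print(f"{u.user_id}: margin {gross_margin(u):.0%}")
```

Under these assumed numbers the light user is comfortably profitable and the heavy user costs more to serve than the plan brings in, even though both show up identically in a seat count.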
What to measure before you change pricing
Before you raise prices or throttle features, get three views of cost, not one blended line item on a cloud or API invoice.
- Cost per workflow. Name the jobs users actually hire your product to do (draft report, scan regulation, rewrite paragraph). Attribute tokens and dollars to each.
- Cost per account. Which customer segments drive margin? Enterprise pilots often look small in seat count but enormous in inference.
- Cost per model call type. Retrieval, classification, generation, and re-ranking have different economics. Lumping them hides routing opportunities.
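The three views above can come from one call log, grouped three ways. The sketch below assumes each logged call already carries a workflow name, an account ID, a call type, and a dollar cost; the field names and sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical call log. In practice this would come from your own request
# logging, with cost_usd derived from token counts and provider prices.
calls = [
    {"workflow": "draft_report", "account": "acme", "call_type": "retrieval", "cost_usd": 0.002},
    {"workflow": "draft_report", "account": "acme", "call_type": "generation", "cost_usd": 0.060},
    {"workflow": "scan_regulation", "account": "acme", "call_type": "reranking", "cost_usd": 0.004},
    {"workflow": "scan_regulation", "account": "pilot_co", "call_type": "generation", "cost_usd": 0.090},
    {"workflow": "rewrite_paragraph", "account": "pilot_co", "call_type": "classification", "cost_usd": 0.001},
    {"workflow": "rewrite_paragraph", "account": "pilot_co", "call_type": "generation", "cost_usd": 0.020},
]


def cost_by(records, key):
    """Sum cost_usd per distinct value of `key`, most expensive first."""
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r["cost_usd"]
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


by_workflow = cost_by(calls, "workflow")
by_account = cost_by(calls, "account")
by_call_type = cost_by(calls, "call_type")

for name, view in [("workflow", by_workflow), ("account", by_account), ("call type", by_call_type)]:
    print(name, view)
```

The point of the design is that attribution happens at logging time: if a call is not tagged with its workflow and account when it is made, none of these views can be rebuilt later from an invoice.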
Until those exist, "we need cheaper models" is a guess. You might need cheaper models for step two of five — and a premium model only where quality actually moves retention.
Why gateways alone do not fix this
LLM gateways and routers solve integration: one API surface, failover, spend dashboards. That is necessary. It is not the same as LLM cost optimization at the workflow layer.
Optimization asks: for this specific vertical task, which model clears the quality bar at the lowest cost and latency? That question depends on domain evals, not generic benchmarks. A router that does not know your RAG pipeline's failure modes cannot tell which calls are safe to downgrade, so it will keep sending them to the expensive model by default.
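One way to frame that question in code: given eval results per task, pick the cheapest model that meets the quality and latency bar. This is a sketch under stated assumptions, not a description of how any particular gateway works. The task names, model names, scores, costs, and latencies are all hypothetical, and the quality score is assumed to come from your own domain evals.

```python
from typing import Optional

# Hypothetical domain eval results per task. Quality is a score from your own
# eval set; cost and latency are measured on the same runs.
EVALS = {
    "classify_clause": [
        {"model": "small", "quality": 0.93, "cost_per_call": 0.0004, "p95_latency_s": 0.4},
        {"model": "large", "quality": 0.96, "cost_per_call": 0.0120, "p95_latency_s": 1.8},
    ],
    "draft_summary": [
        {"model": "small", "quality": 0.71, "cost_per_call": 0.0010, "p95_latency_s": 0.9},
        {"model": "large", "quality": 0.90, "cost_per_call": 0.0300, "p95_latency_s": 2.5},
    ],
}


def pick_model(task: str, min_quality: float, max_latency_s: float) -> Optional[str]:
    """Return the cheapest model that clears both bars, or None if none does."""
    candidates = [
        e for e in EVALS[task]
        if e["quality"] >= min_quality and e["p95_latency_s"] <= max_latency_s
    ]
    if not candidates:
        return None
    return min(candidates, key=lambda e: e["cost_per_call"])["model"]


print(pick_model("classify_clause", min_quality=0.90, max_latency_s=2.0))
print(pick_model("draft_summary", min_quality=0.85, max_latency_s=3.0))
```

With these assumed numbers, classification can drop to the small model while summary drafting stays on the large one. Returning None when nothing clears the bar is deliberate: it flags a task that needs a better model or a lower bar instead of silently falling back.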
A practical first move
Pick your highest-volume workflow. Log input tokens, output tokens, model ID, and latency for every call for one week. Plot cost per completed user outcome — not per API request.
You will almost always find a minority of steps consuming a majority of spend. That is where routing, caching, or smaller models earn their keep — without hurting the outcome users pay for.
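The one-week exercise can be sketched as follows. The log fields match the ones named above (input tokens, output tokens, model ID, latency), plus an outcome ID and a step name so calls can be grouped. Prices, model names, and sample records are hypothetical.

```python
from collections import defaultdict

# Hypothetical prices in USD per 1,000 tokens: (input, output).
PRICES = {"small": (0.0005, 0.0015), "large": (0.01, 0.03)}

# Hypothetical log: one record per API call, tagged with the user outcome it served.
log = [
    {"outcome_id": "r1", "step": "retrieve", "model": "small", "in_tok": 2000, "out_tok": 100, "latency_s": 0.3},
    {"outcome_id": "r1", "step": "generate", "model": "large", "in_tok": 6000, "out_tok": 800, "latency_s": 2.1},
    {"outcome_id": "r2", "step": "retrieve", "model": "small", "in_tok": 2000, "out_tok": 100, "latency_s": 0.3},
    {"outcome_id": "r2", "step": "generate", "model": "large", "in_tok": 5000, "out_tok": 1000, "latency_s": 2.4},
    {"outcome_id": "r3", "step": "retrieve", "model": "small", "in_tok": 2000, "out_tok": 100, "latency_s": 0.3},
    {"outcome_id": "r3", "step": "generate", "model": "large", "in_tok": 6000, "out_tok": 500, "latency_s": 1.9},
]

# Outcomes the user actually finished; r3 is assumed to have been abandoned.
completed_ids = {"r1", "r2"}


def call_cost(c):
    """Dollar cost of one call from its token counts and model price."""
    p_in, p_out = PRICES[c["model"]]
    return c["in_tok"] / 1000 * p_in + c["out_tok"] / 1000 * p_out


def cost_per_completed_outcome(records, completed):
    """Total spend, including abandoned attempts, divided by completed outcomes."""
    total = sum(call_cost(c) for c in records)
    return total / len(completed) if completed else float("inf")


def spend_share_by_step(records):
    """Fraction of total spend consumed by each step, largest first."""
    totals = defaultdict(float)
    for c in records:
        totals[c["step"]] += call_cost(c)
    grand_total = sum(totals.values())
    shares = {step: cost / grand_total for step, cost in totals.items()}
    return dict(sorted(shares.items(), key=lambda kv: kv[1], reverse=True))


print(f"cost per completed outcome: ${cost_per_completed_outcome(log, completed_ids):.4f}")
print(spend_share_by_step(log))
```

Dividing by completed outcomes, not by requests, means spend on abandoned attempts still counts against the outcomes users paid for. In this made-up log, the generation step accounts for nearly all of the spend, which is the kind of concentration the exercise is meant to surface.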
Control token burn before it becomes structural. That is the job of infrastructure, not a pricing-page patch after the fact.
Building vertical AI and watching infra outrun revenue? Join the ozDNA early access list — GPU Bill Bodyguard series for founders and CTOs.