
Inference

The act of running a trained AI model to generate output for a given input.

Inference is the production-time use of an AI model: the part where you actually pay for tokens and serve users. It is distinct from training, the much more expensive process of teaching the model in the first place.

For operators, inference cost is the line item that matters month to month. Frontier-model inference prices fell by 10x or more between 2023 and 2026 as competition increased, but the math still matters at scale: an agentic workflow that uses 100K tokens per session consumes 50x the tokens of a single-prompt one that uses 2K, which at the same per-token price is a 50x cost difference.
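The arithmetic behind that 50x figure can be sketched in a few lines. This is a minimal illustration, not a pricing tool: the price per million tokens and the monthly session count are made-up placeholders, and a single blended price is assumed for input and output tokens, which real providers usually price separately.

```python
def session_cost(tokens_per_session: int, price_per_million_tokens: float) -> float:
    """Cost of one session in dollars at a flat per-token price."""
    return tokens_per_session * price_per_million_tokens / 1_000_000


# Hypothetical values, chosen only for illustration.
PRICE_PER_MILLION = 3.00      # dollars per 1M tokens (assumed blended rate)
SESSIONS_PER_MONTH = 10_000   # assumed traffic

agentic = session_cost(100_000, PRICE_PER_MILLION)      # 100K-token agentic session
single_prompt = session_cost(2_000, PRICE_PER_MILLION)  # 2K-token single prompt

ratio = agentic / single_prompt
monthly_gap = (agentic - single_prompt) * SESSIONS_PER_MONTH

print(f"agentic session:       ${agentic:.4f}")
print(f"single-prompt session: ${single_prompt:.4f}")
print(f"cost ratio:            {ratio:.0f}x")
print(f"monthly difference:    ${monthly_gap:,.2f}")
```

The ratio depends only on the token counts, so it stays at 50x whatever price is plugged in. The absolute monthly gap scales with both price and traffic.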

Key levers operators control: model selection (cheap models for easy tasks, expensive models for hard ones), prompt caching (reuse common prefixes), batch processing (group requests when the workload can tolerate latency), and guarding against accidental loops in agent workflows.
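Two of these levers, model selection and loop avoidance, can be sketched in plain Python. Everything here is hypothetical: the model names `small-model` and `large-model`, the `pick_model` helper, and the `SessionBudget` class are illustrative and do not come from any provider's API. Prompt caching and batching are provider-specific features and are not shown.

```python
from dataclasses import dataclass


class BudgetExceeded(RuntimeError):
    """Raised when an agent session goes past its step or token cap."""


@dataclass
class SessionBudget:
    """Hard caps that stop a runaway agent loop (hypothetical helper)."""
    max_steps: int
    max_tokens: int
    steps_used: int = 0
    tokens_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one model call; raise once either cap is exceeded."""
        self.steps_used += 1
        self.tokens_used += tokens
        if self.steps_used > self.max_steps:
            raise BudgetExceeded(f"step cap of {self.max_steps} exceeded")
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(f"token cap of {self.max_tokens} exceeded")


def pick_model(task_is_hard: bool) -> str:
    """Route easy tasks to a cheap model and hard ones to an expensive one.

    The model names are placeholders, not real model identifiers.
    """
    return "large-model" if task_is_hard else "small-model"


if __name__ == "__main__":
    budget = SessionBudget(max_steps=3, max_tokens=10_000)
    try:
        while True:  # stands in for an agent loop that never decides to stop
            model = pick_model(task_is_hard=False)
            budget.charge(tokens=2_000)  # assumed token usage per call
            print(f"step {budget.steps_used}: called {model}")
    except BudgetExceeded as err:
        print(f"session stopped: {err}")
```

A hard cap is a blunt instrument, but it turns an unbounded loop into a bounded, predictable cost. In practice the cap would be set somewhat above what a legitimate session needs.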

