AI inference costs vary dramatically depending on the model, provider, and optimization techniques used. Understanding the cost structure is essential for budgeting AI projects and choosing the right deployment approach. Here's a comprehensive breakdown.
API pricing (pay-per-use):
Major language model APIs charge per token (a token is roughly three-quarters of an English word, so 1,000 words is about 1,300 tokens). Approximate rates as of early 2026:
Frontier models (highest capability):
- GPT-4o: ~$2.50/million input tokens, ~$10/million output tokens
- Claude Sonnet: ~$3/million input, ~$15/million output
- Gemini 1.5 Pro: ~$1.25/million input, ~$5/million output
Mid-tier models (good capability, lower cost):
- GPT-4o mini: ~$0.15/million input, ~$0.60/million output
- Claude Haiku: ~$0.25/million input, ~$1.25/million output
- Gemini 1.5 Flash: ~$0.075/million input, ~$0.30/million output
What this means in practice:
- Processing a 10-page document (~3,000 words, ~4,000 tokens) with GPT-4o costs about $0.01 for input, plus roughly $0.01 per 1,000 tokens of response
- A customer service chatbot handling 1,000 conversations/day (averaging 500 tokens each) costs roughly $0.15-$4/day depending on model choice
- Analyzing 10,000 such documents (~40 million input tokens) costs roughly $50-$120 with frontier models, or $3-$10 with mid-tier models, before output costs
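Estimates like these come from one formula: tokens times the per-million rate, summed over input and output. A minimal sketch, using the approximate rates quoted above as placeholder values (update them before relying on the numbers):

```python
# Rough API cost estimator. Rates are the approximate per-million-token
# prices quoted in this section, not authoritative figures.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-4o":        (2.50, 10.00),
    "gpt-4o-mini":   (0.15, 0.60),
    "claude-sonnet": (3.00, 15.00),
    "gemini-flash":  (0.075, 0.30),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request: tokens x per-million rate."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# The 10-page document above (~4,000 tokens in, ~1,000 tokens out) on GPT-4o:
print(f"${query_cost('gpt-4o', 4_000, 1_000):.3f}")  # $0.020
```

Swapping the model name against the same token counts is the quickest way to see the frontier-versus-mid-tier gap before committing to a provider.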
Self-hosted model costs:
Running your own models eliminates per-token fees but introduces infrastructure costs:
GPU rental (cloud):
- NVIDIA A100 (80GB): $2-4/hour (AWS, GCP, Azure)
- NVIDIA H100: $3-8/hour
- NVIDIA A10G: $1-2/hour (sufficient for smaller models)
Cost per query (self-hosted):
- LLaMA 70B on 4x A100: ~$0.001-0.003 per query (at sustained throughput of several thousand queries/hour)
- LLaMA 8B on 1x A10G: ~$0.0001-0.0003 per query (again assuming high utilization)
- Mistral 7B quantized on a consumer GPU: marginal cost near zero once the hardware is paid for (electricity aside)
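The self-hosted figures above reduce to one relation: per-query cost is the total hourly GPU bill divided by sustained throughput. A sketch with illustrative numbers (the $3/hour A100 rate and 5,000 queries/hour throughput are assumptions, not benchmarks):

```python
def self_hosted_cost_per_query(gpu_hourly_rate: float, num_gpus: int,
                               queries_per_hour: float) -> float:
    """Per-query cost = total GPU rental per hour / sustained throughput."""
    return gpu_hourly_rate * num_gpus / queries_per_hour

# 4x A100 at an assumed $3/hr each, serving 5,000 queries/hour:
print(self_hosted_cost_per_query(3.0, 4, 5_000))  # 0.0024
```

Note the denominator: idle GPUs still bill by the hour, so a half-utilized cluster doubles the per-query cost.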
The crossover point: Self-hosting typically becomes cheaper than API pricing at 50,000-100,000+ queries per day for mid-tier models. Below that volume, APIs are more cost-effective when you factor in engineering time for deployment and maintenance.
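The crossover volume can be checked directly: it is the daily volume at which a flat GPU bill undercuts per-query API pricing. A sketch with assumed inputs (a $2/hour GPU running around the clock versus ~$0.0005/query API pricing, both illustrative):

```python
def break_even_queries_per_day(gpu_hourly_rate: float, hours_per_day: float,
                               api_cost_per_query: float) -> float:
    """Daily volume above which self-hosting beats the API (hardware cost only,
    ignoring engineering and maintenance time)."""
    return gpu_hourly_rate * hours_per_day / api_cost_per_query

# $2/hr GPU, 24h/day, vs. an assumed $0.0005/query API rate:
print(break_even_queries_per_day(2.0, 24, 0.0005))  # roughly 96,000
```

This lands in the 50,000-100,000/day band quoted above, and shows why the crossover moves: cheaper API rates push the break-even volume up, cheaper GPUs pull it down.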
Image generation costs:
- DALL-E 3: $0.04-0.08 per image
- Stable Diffusion (self-hosted): $0.001-0.005 per image
- Midjourney: $0.01-0.04 per image (subscription-based)
Optimization techniques that cut costs:
- Model selection: Use the smallest model that meets your quality requirements. Most routine tasks don't need a frontier model.
- Prompt optimization: Shorter, more efficient prompts reduce token costs. Eliminating unnecessary context can cut costs 30-50%.
- Caching: Store results for repeated or similar queries. Can reduce costs 40-80% for applications with repetitive patterns.
- Batching: Process multiple requests together for efficiency gains.
- Quantization: Run 4-bit or 8-bit quantized models for 70-80% cost reduction with minimal quality loss.
- Routing: Use cheap models for simple queries and expensive models only for complex ones. A router model can cut average costs 60%.
Budget planning rule of thumb: For most business applications, plan $500-5,000/month for AI API costs at moderate scale (10,000-100,000 queries/month). High-volume applications can spend $10,000-50,000/month but should consider self-hosting or hybrid approaches.