AI inference costs vary dramatically depending on the model, provider, and optimization techniques used. Understanding the cost structure is essential for budgeting AI projects and choosing the right deployment approach. Here's a comprehensive breakdown.
API pricing (pay-per-use):
Major language model APIs charge per token (a token is roughly three-quarters of an English word, so 1,000 words is about 1,300 tokens). Approximate rates as of early 2026:
Frontier models (highest capability):
- GPT-4o: ~$2.50/million input tokens, ~$10/million output tokens
- Claude Sonnet: ~$3/million input, ~$15/million output
- Gemini 1.5 Pro: ~$1.25/million input, ~$5/million output
Mid-tier models (good capability, lower cost):
- GPT-4o mini: ~$0.15/million input, ~$0.60/million output
- Claude Haiku: ~$0.25/million input, ~$1.25/million output
- Gemini 1.5 Flash: ~$0.075/million input, ~$0.30/million output
What this means in practice:
- Processing a 10-page document (~3,000 words, ~4,000 tokens) with GPT-4o costs about $0.01 for input, plus roughly $0.01 per 1,000 tokens of response
- A customer service chatbot handling 1,000 conversations/day (averaging 500 tokens each) costs roughly $0.15-$4/day depending on model choice
- Analyzing 10,000 such documents (~40 million input tokens) costs roughly $50-$120 with frontier models, or $3-$10 with mid-tier models, before output costs
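Estimates like these come from one formula: tokens times the per-million rate, summed over input and output. A minimal sketch, using the approximate rates quoted above as placeholder values (update them before relying on the numbers):

```python
# Rough API cost estimator. Rates are the approximate per-million-token
# prices quoted in this section, not authoritative figures.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-4o":        (2.50, 10.00),
    "gpt-4o-mini":   (0.15, 0.60),
    "claude-sonnet": (3.00, 15.00),
    "gemini-flash":  (0.075, 0.30),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request: tokens x per-million rate."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# The 10-page document above (~4,000 tokens in, ~1,000 tokens out) on GPT-4o:
print(f"${query_cost('gpt-4o', 4_000, 1_000):.3f}")  # $0.020
```

Swapping the model name against the same token counts is the quickest way to see the frontier-versus-mid-tier gap before committing to a provider.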
Self-hosted model costs:
Running your own models eliminates per-token fees but introduces infrastructure costs:
GPU rental (cloud):
- NVIDIA A100 (80GB): $2-4/hour (AWS, GCP, Azure)
- NVIDIA H100: $3-8/hour
- NVIDIA A10G: $1-2/hour (sufficient for smaller models)
Cost per query (self-hosted):
- LLaMA 70B on 4x A100: ~$0.001-0.003 per query (at sustained throughput of several thousand queries/hour)
- LLaMA 8B on 1x A10G: ~$0.0001-0.0003 per query (again assuming high utilization)
- Mistral 7B quantized on a consumer GPU: marginal cost near zero once the hardware is paid for (electricity aside)
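The self-hosted figures above reduce to one relation: per-query cost is the total hourly GPU bill divided by sustained throughput. A sketch with illustrative numbers (the $3/hour A100 rate and 5,000 queries/hour throughput are assumptions, not benchmarks):

```python
def self_hosted_cost_per_query(gpu_hourly_rate: float, num_gpus: int,
                               queries_per_hour: float) -> float:
    """Per-query cost = total GPU rental per hour / sustained throughput."""
    return gpu_hourly_rate * num_gpus / queries_per_hour

# 4x A100 at an assumed $3/hr each, serving 5,000 queries/hour:
print(self_hosted_cost_per_query(3.0, 4, 5_000))  # 0.0024
```

Note the denominator: idle GPUs still bill by the hour, so a half-utilized cluster doubles the per-query cost.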
The crossover point: Self-hosting typically becomes cheaper than API pricing at 50,000-100,000+ queries per day for mid-tier models. Below that volume, APIs are more cost-effective when you factor in engineering time for deployment and maintenance.
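The crossover volume can be checked directly: it is the daily volume at which a flat GPU bill undercuts per-query API pricing. A sketch with assumed inputs (a $2/hour GPU running around the clock versus ~$0.0005/query API pricing, both illustrative):

```python
def break_even_queries_per_day(gpu_hourly_rate: float, hours_per_day: float,
                               api_cost_per_query: float) -> float:
    """Daily volume above which self-hosting beats the API (hardware cost only,
    ignoring engineering and maintenance time)."""
    return gpu_hourly_rate * hours_per_day / api_cost_per_query

# $2/hr GPU, 24h/day, vs. an assumed $0.0005/query API rate:
print(break_even_queries_per_day(2.0, 24, 0.0005))  # roughly 96,000
```

This lands in the 50,000-100,000/day band quoted above, and shows why the crossover moves: cheaper API rates push the break-even volume up, cheaper GPUs pull it down.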
Image generation costs:
- DALL-E 3: $0.04-0.08 per image
- Stable Diffusion (self-hosted): $0.001-0.005 per image
- Midjourney: $0.01-0.04 per image (subscription-based)
Optimization techniques that cut costs:
- Model selection: Use the smallest model that meets your quality requirements. Most routine tasks don't need a frontier model.
- Prompt optimization: Shorter, more efficient prompts reduce token costs. Eliminating unnecessary context can cut costs 30-50%.
- Caching: Store results for repeated or similar queries. Can reduce costs 40-80% for applications with repetitive patterns.
- Batching: Process multiple requests together for efficiency gains.
- Quantization: Run 4-bit or 8-bit quantized models for 70-80% cost reduction with minimal quality loss.
- Routing: Use cheap models for simple queries and expensive models only for complex ones. A router model can cut average costs 60%.
Budget planning rule of thumb: For most business applications, plan $500-5,000/month for AI API costs at moderate scale (10,000-100,000 queries/month). High-volume applications can spend $10,000-50,000/month but should consider self-hosting or hybrid approaches.