AI inference costs vary dramatically depending on the model, provider, and optimization techniques used. Understanding the cost structure is essential for budgeting AI projects and choosing the right deployment approach. Here's a comprehensive breakdown.

API pricing (pay-per-use):

Major language model APIs charge per token (a token is roughly three-quarters of an English word). As of early 2026:

Frontier models (highest capability):

  • GPT-4o: ~$2.50/million input tokens, ~$10/million output tokens
  • Claude Sonnet: ~$3/million input, ~$15/million output
  • Gemini 1.5 Pro: ~$1.25/million input, ~$5/million output

Mid-tier models (good capability, lower cost):

  • GPT-4o mini: ~$0.15/million input, ~$0.60/million output
  • Claude Haiku: ~$0.25/million input, ~$1.25/million output
  • Gemini 1.5 Flash: ~$0.075/million input, ~$0.30/million output

What this means in practice:

  • Processing a 10-page document (~3,000 words, ~4,000 tokens) with GPT-4o costs about $0.01 input + a few cents for the response
  • A customer service chatbot handling 1,000 conversations/day (averaging 500 tokens each) costs roughly $1.25-$15/day depending on model choice and the input/output token mix
  • Analyzing 10,000 ten-page documents (~40 million input tokens) costs roughly $50-$120 with frontier models, or $3-$10 with mid-tier models, counting input tokens only; summary-length outputs add modestly to this
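Since cost is a simple linear function of token counts at these rates, the examples above can be reproduced with a few lines of Python. A minimal sketch (the 500-token response length is an illustrative assumption):

```python
# Per-request cost estimator using the published per-million-token rates above.
PRICES = {  # model: (input $/M tokens, output $/M tokens)
    "gpt-4o": (2.50, 10.00),
    "claude-sonnet": (3.00, 15.00),
    "gemini-1.5-flash": (0.075, 0.30),
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed rates."""
    in_rate, out_rate = PRICES[model]
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# The 10-page-document example: ~4,000 input tokens plus a ~500-token response.
print(round(request_cost("gpt-4o", 4_000, 500), 4))  # ≈ $0.015
```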

Self-hosted model costs:

Running your own models eliminates per-token fees but introduces infrastructure costs:

GPU rental (cloud):

  • NVIDIA A100 (80GB): $2-4/hour (AWS, GCP, Azure)
  • NVIDIA H100: $3-8/hour
  • NVIDIA A10G: $1-2/hour (sufficient for smaller models)

Cost per query (self-hosted):

  • LLaMA 70B on 4x A100: ~$0.001-0.003 per query at high utilization (several thousand queries/hour)
  • LLaMA 8B on 1x A10G: ~$0.0001-0.0003 per query at similar utilization
  • Mistral 7B quantized on a consumer GPU: marginal cost near zero (electricity only) once the hardware is paid for
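Self-hosted cost per query is just the hourly GPU bill amortized over throughput. A sketch of the arithmetic (the $2.50/GPU-hour rate and 5,000 queries/hour throughput are illustrative assumptions within the ranges above):

```python
def self_hosted_cost_per_query(gpu_hourly_rate: float, num_gpus: int,
                               queries_per_hour: float) -> float:
    """Amortized dollar cost per query for a rented-GPU deployment."""
    return (gpu_hourly_rate * num_gpus) / queries_per_hour

# LLaMA 70B on 4x A100 at $2.50/GPU-hour, serving 5,000 queries/hour:
print(self_hosted_cost_per_query(2.50, 4, 5_000))  # ≈ $0.002 per query
```

Note how sensitive this is to utilization: at only 100 queries/hour, the same cluster costs $0.10 per query, well above API pricing.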

The crossover point: Self-hosting typically becomes cheaper than API pricing at 50,000-100,000+ queries per day for mid-tier models. Below that volume, APIs are more cost-effective when you factor in engineering time for deployment and maintenance.
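The crossover volume can be estimated by dividing the fixed daily GPU cost by the API's per-query cost. A sketch under illustrative assumptions (a mid-tier API query at ~$0.0005 and one A10G rented around the clock):

```python
def break_even_queries_per_day(api_cost_per_query: float,
                               gpu_daily_cost: float) -> float:
    """Daily volume above which self-hosting beats the API on raw compute
    cost. Ignores engineering and maintenance overhead, which raises the
    real-world crossover point."""
    return gpu_daily_cost / api_cost_per_query

# Mid-tier API at ~$0.0005/query vs. 1x A10G at $1.50/hour, 24 hours/day:
print(round(break_even_queries_per_day(0.0005, 1.50 * 24)))  # ≈ 72,000/day
```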

Image generation costs:

  • DALL-E 3: $0.04-0.08 per image
  • Stable Diffusion (self-hosted): $0.001-0.005 per image
  • Midjourney: $0.01-0.04 effective per-image cost (subscription-based, depending on plan and usage)

Optimization techniques that cut costs:

  1. Model selection: Use the smallest model that meets your quality requirements. Most routine tasks don't need a frontier model.
  2. Prompt optimization: Shorter, more efficient prompts reduce token costs. Eliminating unnecessary context can cut costs 30-50%.
  3. Caching: Store results for repeated or similar queries. Can reduce costs 40-80% for applications with repetitive patterns.
  4. Batching: Process multiple requests together for efficiency gains.
  5. Quantization: Run 4-bit or 8-bit quantized models for 70-80% cost reduction with minimal quality loss.
  6. Routing: Use cheap models for simple queries and expensive models only for complex ones. A router model can cut average costs 60%.
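Techniques 3 and 6 can be combined in a thin wrapper around your API calls. A minimal sketch: `call_cheap_model` and `call_frontier_model` are hypothetical stand-ins for real provider clients, and the length heuristic is a placeholder for a proper router (production routers typically use a small classifier model):

```python
import hashlib

# Cache of prompt hash -> response (technique 3).
_cache: dict[str, str] = {}

def answer(prompt: str, call_cheap_model, call_frontier_model) -> str:
    """Serve repeated prompts from cache; route new ones by complexity."""
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key in _cache:
        return _cache[key]          # repeated query: zero API cost
    # Crude router (technique 6): short prompts go to the cheap model.
    if len(prompt) < 500:
        result = call_cheap_model(prompt)
    else:
        result = call_frontier_model(prompt)
    _cache[key] = result
    return result
```

In practice the cache key should also include the model name and any system prompt, so that a prompt cached under one configuration isn't served under another.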

Budget planning rule of thumb: For most business applications, plan $500-5,000/month for AI API costs at moderate scale (10,000-100,000 queries/month). High-volume applications can spend $10,000-50,000/month but should consider self-hosting or hybrid approaches.
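The rule of thumb above is just volume times blended per-query cost, which makes it easy to sanity-check against your own numbers. A sketch (the ~$0.01 blended per-query figure is an illustrative assumption for a frontier-heavy mix):

```python
def monthly_api_budget(queries_per_month: int, avg_cost_per_query: float) -> float:
    """Expected monthly API spend at a blended per-query cost."""
    return queries_per_month * avg_cost_per_query

# 100,000 queries/month at a blended ~$0.01/query:
print(monthly_api_budget(100_000, 0.01))  # ≈ $1,000/month
```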