Quantization is a technique that reduces the precision of an AI model's numerical weights to make it smaller, faster, and cheaper to run — often with surprisingly little loss in quality. It's one of the most practical techniques for making AI accessible and affordable.
The core concept: Neural network weights are normally stored as 32-bit floating point numbers (FP32). Quantization converts them to lower precision — 16-bit (FP16), 8-bit (INT8), or even 4-bit (INT4). This is like the difference between storing a measurement as 3.14159265 versus 3.14 — you lose some precision, but the practical result is often nearly identical.
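The round trip behind this idea fits in a few lines. Below is a minimal sketch of symmetric per-tensor INT8 quantization in NumPy; the weights are random stand-ins and the helper names are illustrative, not from any particular library:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor quantization: map the largest |weight| to +/-127."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover approximate FP32 values from the INT8 codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)

q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# Worst-case rounding error per weight is half a quantization step (scale / 2).
print(np.abs(w - w_hat).max())
```

Each INT8 weight takes one byte instead of four, and the only extra storage is a single FP32 scale for the whole tensor.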
Why quantization matters:
Memory reduction: A 70-billion-parameter model in FP32 requires ~280 GB of GPU memory for its weights alone. Quantized to INT4, it needs ~35 GB, which fits on a single high-end workstation GPU instead of a server cluster.
Speed improvement: Lower precision means faster arithmetic. INT8 quantization typically speeds up inference by 2-4x on supported hardware. This translates directly to lower latency and higher throughput.
Cost savings: Smaller models need fewer GPUs. If you can run a model on 1 GPU instead of 4, you cut inference costs by 75%.
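The memory numbers above follow directly from bytes per parameter. A quick sketch (weights only; real deployments also need memory for activations and the KV cache, so treat these as lower bounds):

```python
params = 70e9  # a 70-billion-parameter model

# Bytes per parameter at each precision (INT4 packs two weights per byte).
bytes_per_param = {"FP32": 4, "FP16": 2, "INT8": 1, "INT4": 0.5}

for fmt, nbytes in bytes_per_param.items():
    print(f"{fmt}: ~{params * nbytes / 1e9:.0f} GB")
# FP32: ~280 GB, FP16: ~140 GB, INT8: ~70 GB, INT4: ~35 GB
```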
Types of quantization:
Post-training quantization (PTQ): Applied after training is complete. No additional training required. Quick to implement but may lose more accuracy than other methods. Most commonly used for deploying existing models.
Quantization-aware training (QAT): Simulates quantization effects during training, so the model learns to be robust to lower precision. Produces better results than PTQ but requires retraining.
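The "simulates quantization during training" idea can be sketched with a fake-quantize step in the forward pass and a straight-through gradient that ignores the rounding. This is a toy linear model in NumPy, not a production QAT recipe:

```python
import numpy as np

def fake_quant(w, scale):
    """Quantize-dequantize: the forward pass sees INT8-rounded weights."""
    return np.clip(np.round(w / scale), -127, 127) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(256, 4))
w_true = rng.normal(size=4)
y = x @ w_true

w = np.zeros(4)
for _ in range(200):
    scale = max(np.abs(w).max(), 1e-3) / 127.0
    w_q = fake_quant(w, scale)            # forward uses quantized weights
    grad = x.T @ (x @ w_q - y) / len(x)   # straight-through estimator:
    w -= 0.1 * grad                       # gradient treats rounding as identity

scale = max(np.abs(w).max(), 1e-3) / 127.0
mse = np.mean((x @ fake_quant(w, scale) - y) ** 2)
print(mse)  # small: the model fit the data while seeing quantized weights
```

Because the loss is always computed through the quantized weights, the model settles at a point that still works after rounding.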
GPTQ: A popular method for quantizing large language models that calibrates on a small dataset to minimize accuracy loss. Widely used in the open-source community.
AWQ (Activation-Aware Weight Quantization): Preserves the most important weights at higher precision based on activation patterns. Often produces the best quality-to-size tradeoff.
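A toy illustration of why weight importance matters: a single outlier channel forces one shared scale to waste precision on everything else. Per-channel scales, shown here as a simple stand-in for what methods like GPTQ and AWQ do more cleverly with calibration data, recover most of the loss:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(8, 64)).astype(np.float32)
w[0] *= 10.0  # one outlier output channel, a common pattern in real LLM layers

def fake_quant(w, scale):
    return np.clip(np.round(w / scale), -127, 127) * scale

# Per-tensor: the outlier channel dictates a single scale for the whole matrix.
pt = fake_quant(w, np.abs(w).max() / 127.0)

# Per-channel: each output row gets its own scale (broadcast over columns).
pc = fake_quant(w, np.abs(w).max(axis=1, keepdims=True) / 127.0)

print("per-tensor  MSE:", np.mean((w - pt) ** 2))
print("per-channel MSE:", np.mean((w - pc) ** 2))
```

The per-channel error is markedly lower because the small-magnitude channels no longer share a scale with the outlier.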
Real-world impact:
- LLaMA 70B quantized to 4-bit runs on a single 48GB GPU (A6000) instead of requiring 4x A100s
- Inference cost drops from ~$4/hour to ~$1/hour
- Response speed improves by 2-3x
- Quality loss is typically 1-3% on benchmarks — imperceptible for most applications
When quantization hurts: Tasks requiring precise numerical reasoning, complex multi-step logic, or maximum factual accuracy can show degradation with aggressive quantization (4-bit or lower). For these tasks, FP16 or INT8 is safer. Always benchmark your specific use case.
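"Benchmark your specific use case" can start as simply as comparing a quantized layer's output against full precision on your own inputs. A sketch at two bit widths, with random stand-ins for real weights and evaluation data:

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(64, 64))  # stand-in weight matrix
x = rng.normal(size=(32, 64))             # stand-in for your eval inputs

def fake_quant(w, bits):
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for INT8, 7 for INT4
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

ref = x @ w
for bits in (8, 4):
    out = x @ fake_quant(w, bits)
    rel = np.linalg.norm(out - ref) / np.linalg.norm(ref)
    print(f"INT{bits}: relative output error {rel:.2%}")
```

Layer-level error like this is only a proxy; end-task metrics on your real evaluation set are what ultimately decide whether a given precision is acceptable.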
Practical guidance: If you're deploying open-source models, start with INT8 quantization: it's the sweet spot of speed improvement with minimal quality loss. Only move to INT4 if you need further memory or cost reduction and can tolerate slightly lower quality. Most cloud providers now offer quantized model serving as a standard option.