In Depth
Perplexity is the exponential of the average negative log-likelihood per token: for a sequence of N tokens, PPL = exp(−(1/N) Σᵢ log p(xᵢ | x<ᵢ)), so lower values indicate a better fit. It is used to compare models trained on the same data distribution (perplexities over different tokenizers or vocabularies are not directly comparable) and to track training progress. While perplexity correlates loosely with downstream task performance, it does not capture instruction-following ability, factual accuracy, or safety — which is why benchmark suites and human evaluation remain essential complements.
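The definition above can be sketched in a few lines; the function name and the toy uniform-model inputs below are illustrative, not from any particular library.

```python
import math

def perplexity(token_log_probs):
    # Perplexity = exp of the average negative log-likelihood per token.
    avg_nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(avg_nll)

# Sanity check: a model that assigns uniform probability over a
# 4-symbol vocabulary gives each token probability 0.25, so its
# perplexity is 4 — it is "as confused as" a 4-way coin flip.
uniform_log_probs = [math.log(0.25)] * 10
print(perplexity(uniform_log_probs))
```

In practice the per-token log-probabilities come from the model's output distribution over a held-out corpus; averaging in log space before exponentiating avoids numerical underflow on long sequences.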