In Depth

Perplexity is calculated on a held-out test dataset the model hasn't seen during training. Formally, it is the exponential of the average negative log-likelihood per token, so a model with a perplexity of 10 is, on average, as uncertain as if it were choosing among 10 equally likely next words. Lower is better. It is one of the most fundamental metrics for evaluating language models during training.
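As a minimal sketch of that calculation, here is perplexity computed from a list of per-token probabilities (the probability the model assigned to each correct next token); the function name and input format are illustrative, not from any particular library:

```python
import math

def perplexity(token_probs):
    """Perplexity = exp(average negative log-likelihood per token)."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# If the model assigns every correct token probability 0.1,
# perplexity comes out to 10 -- the "10 equally likely words" intuition.
print(perplexity([0.1, 0.1, 0.1, 0.1, 0.1]))
```

In practice, frameworks compute the average negative log-likelihood (the cross-entropy loss) directly from logits and exponentiate it, which avoids underflow from multiplying many small probabilities.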