In Depth

Well-known benchmarks include MMLU (knowledge), HumanEval (coding), GSM8K (math), HELM (holistic evaluation of language models), and BIG-bench. Benchmark results should be interpreted carefully: models can be inadvertently or deliberately optimized for benchmark performance in ways that don't reflect real-world capability ("benchmark overfitting"). A related risk is data contamination, where benchmark items leak into a model's training data and inflate its scores. A healthy benchmark ecosystem therefore requires continuous creation of new, contamination-free evaluations.
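
To make the contamination concern concrete, here is a minimal sketch of a word-level n-gram overlap check between a benchmark question and a candidate training document; high overlap suggests the item may have leaked into training data. The function names, the 8-gram window, and the sample strings are illustrative assumptions, not any benchmark's official decontamination procedure.

```python
# Minimal sketch of an n-gram overlap contamination check.
# The 8-gram window and the example strings below are illustrative assumptions.

def ngrams(text: str, n: int = 8) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def contamination_score(benchmark_item: str, training_doc: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also appear in a training document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_doc, n)) / len(item_grams)

if __name__ == "__main__":
    # Hypothetical benchmark question and a suspect training document (made-up examples).
    question = (
        "A bakery sells 48 loaves on Monday and half as many on Tuesday. "
        "How many loaves does it sell in total over the two days?"
    )
    suspect_doc = (
        "Worked example: A bakery sells 48 loaves on Monday and half as many on Tuesday. "
        "How many loaves does it sell in total over the two days? Answer: 72."
    )
    score = contamination_score(question, suspect_doc)
    print(f"n-gram overlap: {score:.2f}")  # high overlap flags possible contamination
```

In practice, decontamination pipelines run checks like this (often with longer n-grams or fuzzy matching) over the full training corpus and either drop matching benchmark items from evaluation or flag the affected scores.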