A comprehensive evaluation framework used by researchers and developers to benchmark LLMs on diverse tasks. Powers the Open LLM Leaderboard and supports multiple backends (HuggingFace, vLLM, SGLang). Features few-shot evaluation, multimodal support, and config-based task creation with Jinja2 prompts.
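
To make the few-shot and Jinja2 pieces concrete, here is a minimal sketch of how a config-defined task might turn documents into a few-shot prompt. It uses only the `jinja2` package itself. The config keys (`doc_to_text`, `doc_to_target`, `num_fewshot`), the `build_prompt` helper, and the sample documents are illustrative assumptions, not the framework's confirmed schema or API.

```python
from jinja2 import Template

# Hypothetical task config: the key names are illustrative assumptions,
# not a confirmed schema of the framework.
task_config = {
    "doc_to_text": "Question: {{ question }}\nAnswer:",
    "doc_to_target": " {{ answer }}",
    "num_fewshot": 2,
}

# Made-up documents standing in for a dataset split used for few-shot examples.
fewshot_docs = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]


def build_prompt(config, fewshot_pool, query_doc):
    """Render num_fewshot solved examples, then the unsolved query document."""
    text_tmpl = Template(config["doc_to_text"])
    target_tmpl = Template(config["doc_to_target"])

    # Assumption: take the first k documents; a real harness may sample instead.
    shots = fewshot_pool[: config["num_fewshot"]]

    # Each shot is the rendered question followed by its rendered answer.
    parts = [text_tmpl.render(**doc) + target_tmpl.render(**doc) for doc in shots]

    # The query is rendered without a target so the model has to complete it.
    parts.append(text_tmpl.render(**query_doc))
    return "\n\n".join(parts)


prompt = build_prompt(task_config, fewshot_docs, {"question": "What is 3 + 5?"})
print(prompt)
```

Keeping the prompt format in templates rather than in code means a new task can be described by changing configuration only, which is presumably the appeal of config-based task creation.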