LM Evaluation Harness

The standard framework for evaluating and benchmarking language models across hundreds of tasks

Agent: Cursor · LLM: Claude 3.5, GPT-4 · #llm-evaluation #benchmarking #open-llm-leaderboard #research-tool #model-testing

An evaluation framework that researchers and developers use to benchmark LLMs on a wide range of tasks. It powers the Open LLM Leaderboard and supports multiple inference backends (Hugging Face, vLLM, SGLang). Features include few-shot evaluation, multimodal support, and config-based task creation with Jinja2 prompt templates.
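To make the idea of few-shot evaluation concrete, here is a minimal, standard-library-only sketch of what such a loop does: it prepends k solved examples to each query, asks a model for a completion, and scores exact matches. This is not the harness's API. The names `build_fewshot_prompt`, `exact_match_accuracy`, and `toy_model` are hypothetical, and plain `str.format` templates stand in for the Jinja2 templates that the description mentions.

```python
from typing import Callable, Dict, List

# Hypothetical prompt templates (assumption: str.format stands in for Jinja2).
DOC_TEMPLATE = "Question: {question}\nAnswer:"
TARGET_TEMPLATE = " {answer}"


def build_fewshot_prompt(shots: List[Dict[str, str]], doc: Dict[str, str]) -> str:
    """Concatenate k solved examples, then the unsolved query document."""
    parts = [DOC_TEMPLATE.format(**s) + TARGET_TEMPLATE.format(**s) for s in shots]
    parts.append(DOC_TEMPLATE.format(**doc))
    return "\n\n".join(parts)


def exact_match_accuracy(
    model: Callable[[str], str],
    shots: List[Dict[str, str]],
    docs: List[Dict[str, str]],
) -> float:
    """Fraction of documents whose stripped completion equals the reference answer."""
    correct = 0
    for doc in docs:
        prediction = model(build_fewshot_prompt(shots, doc)).strip()
        correct += prediction == doc["answer"]
    return correct / len(docs)


def toy_model(prompt: str) -> str:
    """Stand-in for a real LLM: reads the last question and adds its two numbers."""
    last_question = prompt.rsplit("Question: ", 1)[1].split("\n")[0]
    a, b = last_question.split(" + ")
    return f" {int(a) + int(b)}"


# Two in-context examples (k = 2) and two evaluation documents, all made up.
shots = [
    {"question": "1 + 1", "answer": "2"},
    {"question": "2 + 5", "answer": "7"},
]
docs = [
    {"question": "3 + 4", "answer": "7"},
    {"question": "10 + 5", "answer": "15"},
]

print(build_fewshot_prompt(shots, docs[0]))
print("accuracy:", exact_match_accuracy(toy_model, shots, docs))
```

In a real evaluation, `toy_model` would be replaced by a call into one of the supported backends, and the templates and metric would come from a task config rather than module-level constants.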

Made by EleutherAI · Shared by @github-trending-bot · 5/13/2026
