About galaxy-brain
This repo is a personal set of evals: prompts and tasks I care about, know well, and can judge consistently. The site is for comparing submissions side by side and tracking how outcomes change over time as models and harnesses improve.
People can send pull requests with solutions; I run the evals myself (including the open-ended or subjective parts) rather than outsourcing scoring to a crowd or relying on an automated metric alone. That keeps the bar aligned with what I actually want from agents, not only with what is easy to grade automatically.
Comparable efforts (larger or different in spirit)
If you are looking for large-scale subjective or “quality in the wild” comparisons, these are well-known references. They differ from this project in scale and governance, but they answer a similar “which model feels better on hard tasks?” question.
- LMSYS Chatbot Arena
Large-scale pairwise human preferences over model outputs; Elo-style leaderboards from crowd votes. The canonical example of subjective eval at scale.
- MT-Bench
Multi-turn dialogue benchmark scored with strong models (and originally designed to align with human judgment). Good reference for structured-but-still-quality-focused evaluation.
- AlpacaEval
Automatic pairwise comparisons (often via a strong judge model) against reference outputs; correlates with human preferences on instruction-following.
- WildBench
Tasks mined from real user-chatbot logs, with model-based pairwise scoring designed to track human Arena rankings.
- HELM (Holistic Evaluation of Language Models)
Broad, scenario-based reporting across accuracy, calibration, robustness, fairness, toxicity, and efficiency. Not purely “vibes,” but a major effort to make comparisons transparent and multi-dimensional.
- SWE-bench
Real GitHub issues patched end-to-end; the flagship objective benchmark for coding agents (pass/fail on the repository's tests once the generated patch is applied). Complements subjective build-quality reviews.
- Google IFEval
Verifiable instruction-following checks (counts, formatting, constraints); a useful contrast to open-ended “how good does this feel?” grading.
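
For readers unfamiliar with the "Elo-style leaderboards from crowd votes" mentioned for Chatbot Arena, here is a minimal sketch of how a stream of pairwise votes can be turned into a ranking. It is the textbook online Elo update, not the Arena's actual pipeline (which may use a different estimator), and it is not how this site scores submissions. The model names, the starting rating of 1000, and the K-factor of 32 are illustrative assumptions.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the Elo model (400-point logistic scale)."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(ratings: dict[str, float], winner: str, loser: str, k: float = 32.0) -> None:
    """Apply one pairwise vote in place. Unseen models start at 1000 (an assumption)."""
    r_w = ratings.setdefault(winner, 1000.0)
    r_l = ratings.setdefault(loser, 1000.0)
    e_w = expected_score(r_w, r_l)
    # The winner gains exactly what the loser gives up, so the total is conserved.
    delta = k * (1.0 - e_w)
    ratings[winner] = r_w + delta
    ratings[loser] = r_l - delta


# Hypothetical votes as (winner, loser) pairs; the model names are made up.
votes = [("model-a", "model-b"), ("model-a", "model-c"), ("model-b", "model-c")]

ratings: dict[str, float] = {}
for winner, loser in votes:
    update_elo(ratings, winner, loser)

# Highest rating first.
leaderboard = sorted(ratings.items(), key=lambda kv: kv[1], reverse=True)
for name, rating in leaderboard:
    print(f"{name}: {rating:.1f}")
```

Because the update is sequential, the final ratings depend on vote order. That is one reason large leaderboards often fit all votes jointly instead of updating one vote at a time.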
Source
Repository on GitHub: eval prompts live next to submitted solutions; the static build embeds docs/data.json for the browser.