Authoritative LLM Benchmarks and Super-Benchmarkers
There are thousands of LLM benchmarks out there, covering conversation, reasoning, coding, safety, and more. But which ones actually matter? Instead of digging through endless papers and GitHub repos, let’s take a bird’s-eye view and try to rank them (and visualize some insights!).
This is a first-pass approach to identifying the most influential benchmarks. It’s simple, rough, and evolving. I’ll keep updating it with more benchmarks and better metrics over time.
What we’re trying to answer:
- How do we see the big picture without reading every single paper or analyzing every GitHub repo individually?
- Which benchmarks are leading the pack? Are they widely cited, starred on GitHub, or built by large research teams?
- Who are the Super-Benchmarkers? Behind every benchmark, there’s a group of researchers, institutions, and labs driving the evaluation landscape. Some of these names keep showing up across multiple benchmarks.
To answer this, I built a straightforward ranking pipeline that calculates an “authority score” for each benchmark (for lack of a better term!). The formula is highly opinionated but adjustable (there’s a Google Colab if you want to tweak it).
The score blends four factors:
- Citations per month (normalized for time since publication; a quick sketch follows this list)
- GitHub stars (a proxy for community traction)
- Number of samples (bigger = more comprehensive?)
- Number of authors (more collaboration = more credibility?)
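For the first factor, here is a minimal sketch of what citations per month could look like. This is my assumption about the normalization, not necessarily the exact formula behind the “Normalized Citation” column used below; the citations_per_month helper and the as_of cutoff date are purely illustrative.

import pandas as pd

def citations_per_month(citations, published, as_of="2025-03-01"):
    # Rough normalization: total citations divided by months since publication
    months = max((pd.Timestamp(as_of) - pd.Timestamp(published)).days / 30.44, 1.0)
    return citations / months

# e.g. citations_per_month(1200, "2021-06-01") -> roughly 27 citations per month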
import numpy as np  # df (a pandas DataFrame with one row per benchmark) is assumed to be loaded already

# ----------------------------------------------------------------
# Compute Authority Score
# ----------------------------------------------------------------
# Weights: citations dominate, stars come next, samples and authors are minor signals
alpha = 0.5   # citations per month
beta = 0.3    # GitHub stars
gamma = 0.1   # number of samples
delta = 0.1   # number of authors

def comp_cite(row):
    c = row["Normalized Citation"]
    return alpha * np.log1p(c)

def comp_stars(row):
    s = row["Stars"]
    return beta * np.log1p(s)

def comp_samples(row):
    sm = row["Number of Samples"]
    return gamma * np.log1p(sm)

def comp_authors(row):
    na = row["Number of Authors"]
    return delta * np.log1p(na)

# log1p dampens outliers before the weighted sum
df["CitePart"] = df.apply(comp_cite, axis=1)
df["StarsPart"] = df.apply(comp_stars, axis=1)
df["SamplePart"] = df.apply(comp_samples, axis=1)
df["AuthorPart"] = df.apply(comp_authors, axis=1)
df["Authority Score"] = df["CitePart"] + df["StarsPart"] + df["SamplePart"] + df["AuthorPart"]

# Fill in Type if missing
df["Type"] = df["Type"].fillna("Uncategorized").astype(str)
To build the dataset, I started with the EvidentlyAI LLM Evaluation Benchmarks list and expanded it iteratively with a snowballing approach until the rankings started to stabilize.
Some benchmarks score consistently high across multiple dimensions, making them clear leaders.
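To eyeball the leaders, a quick sort on the score is enough. A sketch (the “Benchmark” name column is my assumption; adjust it to whatever your sheet actually uses):

# Top 10 benchmarks by Authority Score, with the two biggest score components
top10 = df.sort_values("Authority Score", ascending=False).head(10)
print(top10[["Benchmark", "Authority Score", "CitePart", "StarsPart"]])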
Breaking benchmarks down by type, we can see how authority is distributed across different research areas.
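A rough way to slice this (a sketch using the raw Type column filled in above, before the cleaned-up types introduced later):

# Benchmarks per type, with average and total authority
type_summary = (
    df.groupby("Type")["Authority Score"]
      .agg(["count", "mean", "sum"])
      .sort_values("sum", ascending=False)
)
print(type_summary)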
Some researchers appear over and over again. These are the key people shaping AI evaluation and AI safety, showing up on multiple high-impact benchmarks.
# ---------------------------------------------------------------------------
# 3. Compute Authority Score
# ---------------------------------------------------------------------------
alpha, beta, gamma, delta = 0.5, 0.3, 0.1, 0.1

# Vectorized version of the same weighted log-sum as above
df["CitePart"] = alpha * np.log1p(df["Normalized Citation"])
df["StarsPart"] = beta * np.log1p(df["Stars"])
df["SamplePart"] = gamma * np.log1p(df["Number of Samples"])
df["AuthorPart"] = delta * np.log1p(df["Number of Authors"])
df["Authority Score"] = df["CitePart"] + df["StarsPart"] + df["SamplePart"] + df["AuthorPart"]

# Rescale so the top benchmark sits at 100
df["Authority Score (0-100)"] = 100 * df["Authority Score"] / df["Authority Score"].max()
df["CleanTypes"] = df["Type"].fillna("").apply(unify_types)
df["CleanTypeStr"] = df["CleanTypes"].apply(lambda lst: ",".join(lst))
# ---------------------------------------------------------------------------
# 4. Aggregate Authority Score by Author & Type
# ---------------------------------------------------------------------------
from collections import defaultdict
import pandas as pd

# Accumulate each benchmark's Authority Score per (author, type) pair
author_type_contrib = defaultdict(float)
for idx, row in df.iterrows():
    score = row["Authority Score"]
    authors_str = row.get("Authors", "")
    authors_list = [a.strip() for a in authors_str.split(",") if a.strip() and a.lower() not in ["invalid url", "unknown"]]
    types_list = row["CleanTypes"]
    if not authors_list:
        continue
    # Give each author the FULL "score" (no division), for each type
    for author in authors_list:
        for benchmark_type in types_list:
            author_type_contrib[(author, benchmark_type)] += score

# Convert to DataFrame
df_author_type = pd.DataFrame(
    [(author, btype, contrib) for (author, btype), contrib in author_type_contrib.items()],
    columns=["Author", "Type", "Authority Contribution"]
)

# For each author, turn authority by type into a relative percentage (0-100)
df_author_type["Relative Contribution (0-100)"] = (
    df_author_type.groupby("Author")["Authority Contribution"]
    .transform(lambda x: 100 * x / x.sum())
)

# Compute total authority per author
df_author_total = (
    df_author_type.groupby("Author")["Authority Contribution"]
    .sum()
    .reset_index()
    .rename(columns={"Authority Contribution": "Total Authority"})
)

# Get top 15 authors by total authority
df_author_total = df_author_total.sort_values("Total Authority", ascending=False)
top_authors = df_author_total["Author"].head(15).tolist()

# Filter only top authors
df_top_authors = df_author_type[df_author_type["Author"].isin(top_authors)].copy()

# Attach each author's "Total Authority" for sorting
df_top_authors = pd.merge(df_top_authors, df_author_total, on="Author", how="left")

# Sort by descending total authority
df_top_authors = df_top_authors.sort_values("Total Authority", ascending=False)

# Make Author an ordered categorical that matches the sort order
ordered_authors = df_top_authors["Author"].unique().tolist()  # largest first
df_top_authors["Author"] = pd.Categorical(
    df_top_authors["Author"],
    categories=ordered_authors,
    ordered=True
)
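As a visualization sketch (my own matplotlib take; the original charts may have been built differently), the ordered categorical above makes it easy to draw one stacked bar per author, split by benchmark type:

import matplotlib.pyplot as plt

# Stacked horizontal bars: each top author's relative contribution by benchmark type
pivot = df_top_authors.pivot_table(
    index="Author",
    columns="Type",
    values="Relative Contribution (0-100)",
    aggfunc="sum",
    fill_value=0,
    observed=False,
)
# Reverse so the highest-authority author ends up at the top of the chart
pivot.loc[ordered_authors[::-1]].plot(kind="barh", stacked=True, figsize=(10, 6))
plt.xlabel("Relative contribution to total authority (0-100)")
plt.title("Top 15 Super-Benchmarkers by benchmark type")
plt.tight_layout()
plt.show()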
This makes sense: these are some of the best-known figures in the space, and it’s interesting to see which types of benchmarks they invest in.
Of course, this only measures authority, not actual quality, and it is a simplified, biased view. If you’re interested in a deeper take on benchmark quality, I highly recommend checking out The AI Evaluation Substack.
What’s Next? Future versions will include:
- More benchmarks (better crawling, especially for newer datasets and more exotic tests).
- Better weighting strategies (normalizing for recency, adjusting influence metrics).
- Fine-grained influence tracking (which benchmarks drive downstream research or get cited in policy papers?).
Feedback welcome!
P.S. This analysis is part of an ongoing brainstorming with Raul Castro Fernandez.