[Image: folklore cabinet of charts, model score tokens, and public AI rumors]
Folklore Ledger

Benchmarks become stories before they become understanding

LLM capability claims rarely travel as raw tables. They become shorthand: a score mentioned in a launch post, a demo retold without its constraints, a leaderboard screenshot detached from its date, or a benchmark name used as proof of general intelligence. Benchmark Folklore keeps those stories visible while asking what they actually support. The page’s stance is simple: public lore matters because it changes adoption, but lore should never be confused with durable evidence.

Score

What was measured, on which version, under which evaluation setup, and with what known limitations?

Story

How was the score repeated across articles, posts, slide decks, product pages, and procurement conversations?

Afterlife

What later model, paper, audit, or failure case changed the meaning of the original claim?
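
One way to keep the three questions attached to a single claim is to store them together as one record. The sketch below is only an illustration, not a schema the page prescribes: the class names, field names, and the `missing_context` helper are all hypothetical, and the sample values are placeholders rather than real results.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Score:
    """What was measured, on which version, under which setup."""
    benchmark: str
    model_version: str
    value: float
    eval_setup: str
    limitations: list[str] = field(default_factory=list)


@dataclass
class Retelling:
    """One place the score was repeated."""
    venue: str      # e.g. "launch post", "slide deck", "product page"
    seen_on: date
    wording: str    # how the claim was phrased there


@dataclass
class Followup:
    """A later event that changed what the claim means."""
    kind: str       # "model", "paper", "audit", or "failure case"
    seen_on: date
    effect: str


@dataclass
class LedgerEntry:
    score: Score
    story: list[Retelling] = field(default_factory=list)
    afterlife: list[Followup] = field(default_factory=list)

    def missing_context(self) -> list[str]:
        """Return the names of Score fields that were left empty."""
        gaps = []
        if not self.score.model_version:
            gaps.append("model_version")
        if not self.score.eval_setup:
            gaps.append("eval_setup")
        if not self.score.limitations:
            gaps.append("limitations")
        return gaps

    def is_revised(self) -> bool:
        """True once any follow-up has been recorded."""
        return bool(self.afterlife)


# Placeholder values only; no real benchmark or model is implied.
entry = LedgerEntry(
    score=Score(
        benchmark="ExampleBench",
        model_version="example-model-v1",
        value=0.0,
        eval_setup="",
    )
)
print(entry.missing_context())
print(entry.is_revised())
```

An entry whose `missing_context()` list is non-empty is a score circulating without the details needed to interpret it, which is roughly the gap between lore and evidence described above.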