Score
What was measured, on which version, under which evaluation setup, and with what known limitations?

LLM capability claims rarely travel as raw tables. They become shorthand: a score mentioned in a launch post, a demo retold without its constraints, a leaderboard screenshot detached from its date, or a benchmark name used as proof of general intelligence. Benchmark Folklore keeps those stories visible while asking what they actually support. The page’s stance is simple: public lore matters because it changes adoption, but lore should never be confused with durable evidence.
What was measured, on which version, under which evaluation setup, and with what known limitations?
How did the score get repeated in articles, posts, slide decks, product pages, and procurement conversations?
What later model, paper, audit, or failure case changed the meaning of the original claim?