Will ‘Humanity’s Last Exam’ be able to stump expert-level AI?

A team of technology experts issued a global call on Monday seeking the toughest questions to pose to artificial intelligence systems, which have increasingly made child’s play of popular benchmark tests.

Dubbed “Humanity’s Last Exam,” the project seeks to determine when expert-level AI has arrived. It aims to stay relevant even as capabilities advance in future years, according to the organizers, a non-profit called the Center for AI Safety (CAIS) and the startup Scale AI.

The call comes days after the maker of ChatGPT previewed a new model, known as OpenAI o1, which “destroyed the most popular reasoning benchmarks,” said Dan Hendrycks, executive director of CAIS and an advisor to Elon Musk’s xAI startup.

Hendrycks co-authored two 2021 papers that proposed tests of AI systems that are now widely used, one quizzing them on undergraduate-level knowledge of topics like U.S. history, the other probing models’ ability to reason through competition-level math. The undergraduate-style test has more downloads from the online AI hub Hugging Face than any such dataset.

At the time of those papers, AI was giving almost random answers to questions on the exams. “They’re now crushed,” Hendrycks told Reuters.

As one example, the Claude models from the AI lab Anthropic have gone from scoring about 77% on the undergraduate-level test in 2023, to nearly 89% a year later, according to a prominent capabilities leaderboard.

These common benchmarks have less meaning as a result.

AI has appeared to score poorly on lesser-used tests involving plan formulation and visual pattern-recognition puzzles, according to Stanford University’s AI Index Report from April. OpenAI o1 scored around 21% on one version of the pattern-recognition ARC-AGI test, for instance, the ARC organizers said on Friday.

Some AI researchers argue that results like this show planning and abstract reasoning to be better measures of intelligence, though Hendrycks said the visual aspect of ARC makes it less suited to assessing language models. “Humanity’s Last Exam” will require abstract reasoning, he said.

Answers from common benchmarks may also have ended up in data used to train AI systems, industry observers have said. Hendrycks said some questions on “Humanity’s Last Exam” will remain private to make sure AI systems’ answers are not from memorization.

The exam will include at least 1,000 crowd-sourced questions, due by November 1, that are hard for non-experts to answer. Submissions will undergo peer review, with winning contributors offered co-authorship and prizes of up to $5,000 sponsored by Scale AI.

“We desperately need harder tests for expert-level models to measure the rapid progress of AI,” said Alexandr Wang, Scale’s CEO.

One restriction: the organizers want no questions about weapons, which some say would be too dangerous for AI to study.

—Jeffrey Dastin and Katie Paul, Reuters

https://www.fastcompany.com/91192231/will-humanitys-last-exam-stump-expert-level-ai

Published September 17, 2024

