The generative AI revolution has turned into a global race, with a mix of models from private companies and open-source initiatives all competing to become the most popular and powerful. Many promote their prowess by pointing to their performance on common tests and their positions in regularly updated rankings.
But the legitimacy of those rankings has been thrown into question as new research published on Cornell University’s preprint server arXiv shows it’s possible to rig a model’s results with just a few hundred votes.
“When we talk about large language models, their performance on benchmarks is very important,” says study author Tianyu Pang, a researcher at Sea AI Lab, a Singapore-based research group. It helps promote startups looking to tout the abilities of their models, “which makes some startups motivated to get or manipulate the benchmark,” he says.
To test whether manipulation of the rankings was possible, Pang and his colleagues looked at Chatbot Arena, a crowdsourced AI benchmarking platform developed by researchers at the University of California, Berkeley, and LMArena. On Chatbot Arena, users state their preference for one chatbot’s output over another’s when the two are put through a battery of tests. The results of those votes feed into the wider rankings that the platform shares publicly, which are often regarded as definitive.
But Pang and his colleagues found that it’s possible to sway the ranking position of models with just a few hundred votes. “We just need to take hundreds of new votes to improve a single ranking position,” he says. “The technique is very simple.”
While Chatbot Arena keeps the identities of its models secret when they’re pitted against one another, Pang and his colleagues trained a classifier to identify which model is being used based on its outputs, with a high accuracy level. “Then we can utilize the rating system to more efficiently improve the model ranking with the least number of new votes,” he explains.
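To make the mechanics concrete, here is a minimal sketch of how targeted votes can move a model up a pairwise-rating leaderboard. It is an illustration under stated assumptions, not the researchers’ method or Chatbot Arena’s actual scoring: it assumes a plain Elo-style update, a hypothetical K-factor of 4, made-up model names and ratings, and a classifier that always identifies the target model correctly. The attacker simply votes for the target whenever it appears in a battle.

```python
def expected_score(r_a, r_b):
    """Standard Elo expectation that a model rated r_a beats one rated r_b."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def elo_update(ratings, winner, loser, k=4.0):
    """Apply one vote in place: the winner gains what the loser gives up."""
    delta = k * (1.0 - expected_score(ratings[winner], ratings[loser]))
    ratings[winner] += delta
    ratings[loser] -= delta


def rank_of(ratings, model):
    """1-based leaderboard position, highest rating first."""
    ordered = sorted(ratings, key=ratings.get, reverse=True)
    return ordered.index(model) + 1


def votes_to_climb_one_rank(ratings, target, k=4.0, max_votes=10_000):
    """Count the rigged votes needed for `target` to gain one position.

    Assumption: opponents show up round-robin, and the attacker's
    classifier always recognizes the target, so every vote favors it.
    Works on a copy, so the caller's ratings are left untouched.
    """
    ratings = dict(ratings)
    start_rank = rank_of(ratings, target)
    if start_rank == 1:
        return 0  # already on top, nothing to gain
    opponents = [m for m in ratings if m != target]
    for n in range(1, max_votes + 1):
        opponent = opponents[(n - 1) % len(opponents)]
        elo_update(ratings, target, opponent, k)
        if rank_of(ratings, target) < start_rank:
            return n
    return None  # did not climb within max_votes


if __name__ == "__main__":
    # Hypothetical leaderboard; the names and numbers are invented.
    board = {"model_a": 1250.0, "model_b": 1240.0,
             "model_c": 1200.0, "target": 1230.0}
    print(votes_to_climb_one_rank(board, "target"))
```

On a toy board like this, a handful of votes is enough because no honest voters push back. A real leaderboard carries many thousands of existing votes and a different scoring setup, so the count from this sketch says nothing about the figures reported in the study.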
The vote-rigging experiment was run on historical data from the ranking platform rather than on the live version of Chatbot Arena, so as not to poison the results of the real website. Even so, Pang says the same manipulation would be possible on the live version of Chatbot Arena.
The team behind the ranking platform did not respond to Fast Company’s request for comment. Pang says his last contact with Chatbot Arena came in September 2024 (before he conducted the experiment), when he flagged the potential technique to manipulate the results. According to Pang, the Chatbot Arena team responded by recommending that the researchers sandbox-test the principle on historical data. Pang says that Chatbot Arena does have multiple anti-cheating mechanisms in place to prevent vote flooding, but that they don’t mitigate his team’s technique.
“From the user side, for now, we cannot make sure the rankings are reliable,” says Pang. “It’s the responsibility of the Chatbot Arena team to implement some anti-cheating mechanism to make sure the benchmark is the real level.”