Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia's servers have been struggling under the load from AI crawlers, bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models. The traffic has increased costs and, in some cases, slowed load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says the content in the dataset is freely licensed (under Creative Commons, the public domain and so on) since it all comes from Wikipedia.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created Apr 17, 2025, 15:30:20


