Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the impact of AI crawlers (bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models) on its servers, which has led to increased costs and, in some cases, slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make the issue of attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that the content in the dataset is freely licensed under Creative Commons, the public domain and so on since it's all from Wikipedia.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Posted 9d ago | Apr 17, 2025, 15:30:20

