Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the load that AI crawlers (bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models) place on its servers, which has raised costs and, in some cases, slowed load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make the issue of attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that the content in the dataset is freely licensed under Creative Commons, the public domain and so on since it's all from Wikipedia.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created 6d ago | Apr 17, 2025, 15:30:20


