Wikipedia offers AI developers a training dataset to maybe get scraper bots off its back

Wikipedia has been struggling with the impact of AI crawlers (bots that scrape text and multimedia from the encyclopedia to train generative artificial intelligence models) on its servers, which in some cases has meant increased costs and slower load times for human users. Perhaps in an effort to stop the bots from pummeling the public Wikipedia website and soaking up too much bandwidth, the Wikimedia Foundation (which manages Wikipedia's data) is offering AI developers a dataset they can freely use.

The organization has teamed up with Kaggle, a data science platform, to offer up a beta release of a structured dataset in both English and French. According to Google — which owns Kaggle — the dataset is formatted for machine learning to make it more useful for training, development and data science.

Wikimedia Enterprise notes that the dataset includes "abstracts, short descriptions, infobox-style key-value data, image links and clearly segmented article sections." There are no references or other "non-prose elements," such as video clips. The lack of references could make the issue of attribution for information in the dataset somewhat foggy. However, Wikimedia Enterprise (a part of the Wikimedia Foundation that seeks to make Wikipedia data available through APIs) says that the content in the dataset is freely licensed under Creative Commons, the public domain and so on since it's all from Wikipedia.

This article originally appeared on Engadget at https://www.engadget.com/ai/wikipedia-offers-ai-developers-a-training-dataset-to-maybe-get-scraper-bots-off-its-back-143255593.html?src=rss
Created 17.04.2025, 15:30:20


