Since 2017, Mozilla’s Common Voice project has collected more than 30,000 hours of recordings of people from around the world speaking their languages.
The project’s goal is to provide a free, publicly available dataset that anyone can use for training voice recognition AI software and other projects, while ensuring that all the material is provided with the informed consent of the people being recorded. Common Voice now includes recorded material and corresponding transcripts in roughly 180 languages, all available under the public domain-like Creative Commons CC0 license, with volunteers from communities worldwide working to add their own languages to the mix.
“We don’t add languages to the platform without communities,” says EM Lewis-Jong, product director at Mozilla. “It sounds like a small thing, but I think in the current AI age, it actually is weirdly radical to be consent-centered.”
![](https://images.fastcompany.com/image/upload/f_webp,q_auto,c_fit,w_1024,h_1024/wp-cms-2/2024/11/Spontaneous-Speech-Platfrom-Beta-Oct-2024.jpg)
And while Mozilla doesn’t disclose, or in some cases even necessarily know, exactly who’s using the data, Lewis-Jong says it’s been used by Big Tech companies, small independent operations, and plenty of projects in between. The dataset has been downloaded from Mozilla millions of times, and it’s also available through the AI development platform Hugging Face, which hosts speech recognition models trained on the Common Voice data.
In some cases, the dataset has been used by smaller projects focused on specific tasks, like delivering multilingual legal advice, providing information about governance, or building voice-powered chatbots with local agricultural information.
“I think it’s fair to say that from the largest and most famous technology organizations to really small civil society projects and individual developers, we really do see the full range,” Lewis-Jong says.
![](https://images.fastcompany.com/image/upload/f_webp,q_auto,c_fit,w_1024,h_1024/wp-cms-2/2024/11/spont_fr.png)
Common Voice continues to grow as new material gets recorded in existing languages and new volunteers approach Mozilla to localize the contribution platform for their own languages, letting contributors record, validate, and transcribe material that gets added to future releases.
At the start, Common Voice mostly prompted people to read aloud from public domain texts. But as it has expanded from languages with a large body of such texts to others with less of that material, the project has added open-ended prompts, inviting people from different communities to answer general questions. Their answers, which Mozilla refers to as “spontaneous speech” about the topic, are then transcribed by volunteers.
Volunteers have often included people interested in providing open datasets, people wanting to see more AI tools available in their languages, and those interested in language preservation. For example, government advocates for the Welsh language have urged speakers to contribute to the project, and speakers of other languages around the world have contributed out of linguistic pride and a desire to have their language included in the voice assistants and tech tools of the future.
Irvin Chen, a language community organizer for the project in Taiwan, says that prior to Common Voice, there was no available dataset of spoken languages in Taiwan besides Mandarin. A Taiwanese language project launched on Common Voice in 2022, and Chen and others are now working with speakers of indigenous languages from across Taiwan to get them added as well.
“Once we start recording and once we have the result, it will be a resource that benefits forever all future research and all future technologies,” Chen says, noting that having accurate Taiwanese speech-to-text systems could benefit him and other people who can speak the language but can’t write it fluently.
![](https://images.fastcompany.com/image/upload/f_webp,q_auto,c_fit,w_1024,h_1024/wp-cms-2/2024/11/Taiwanese_scripted.png)
“For that, it’s very important that we have a database for researchers to build the voice-to-text information,” he says.
As Common Voice expands, Lewis-Jong explains, the project has to evolve technically to support languages that don’t match all of its initial assumptions about available resources.
“Increasingly what we see now is lower and lower resource languages joining, which has been a really interesting technology challenge, because so much of the way that platforms are built and technology is built is really optimized for high resource languages,” she says. “So now we’re having to adapt on the fly to try and make sure that we can accommodate people who maybe speak a language that doesn’t have any standardized writing system, or support people whose languages don’t exist in isolation but have lots of other languages mixed in.”
![](https://images.fastcompany.com/image/upload/f_webp,q_auto,c_fit,w_1024,h_1024/wp-cms-2/2024/11/spont_malay.png)
Mozilla is also working on a pilot program with some African language communities to allow for language datasets released under more restrictive licenses. It’s a potential response to contributors who are concerned about making data available completely free, even to big, wealthy tech companies or to projects to which they might be opposed. And while the details are still being hammered out, Lewis-Jong says licenses might be designed to allow certain types of organizations to use data free of charge while expecting others to make a contribution to the community, or to enforce certain rules for attributing data.
“This is all new territory for us, because we have historically only allowed people to create datasets and release them under the most open possible licensing,” she says. “So it’s going to be a really interesting experiment for the communities, for the data consumers, and for us.”