MLCommons presents free datasets for speech recognition

Dec 17, 2021

Dmitry Spodarets

MLCommons, an organization that aims to build free, open-source AI development tools and resources, has just released the People’s Speech Dataset and the Multilingual Spoken Words Corpus (MSWC). It is a significant release, given the amount of work the team put into assembling the two datasets.

According to the MLCommons team, the People’s Speech Dataset is one of the world’s most complex English speech datasets licensed for academic and commercial usage, with a huge number of hours of recorded material. Meanwhile, the Multilingual Spoken Words Corpus ranks among the biggest and best audio speech datasets, featuring keywords in 50+ languages.

With the release of the People’s Speech Dataset and the MSWC, developers have new tools to add to their toolbox, enabling them to design and build their own speech recognition systems on a smaller budget and with fewer technical challenges than before.

Is the future of AI/ML open-source, then? Let us know in the comments!
