The open-source multilingual LLM Aya supports over 100 languages

Cohere for AI's Aya Project aims to close the language representation gap in AI by extending coverage to unserved and underrepresented languages. The Aya model and its dataset are available under a fully permissive Apache 2.0 license.

The Cohere for AI (C4AI) team recently released the Aya model and datasets under a fully permissive Apache 2.0 license to broaden access to multilingual resources for AI development. Aya currently covers 101 languages, more than twice as many as existing open-source multilingual models cover. With this release, C4AI hopes to bring underrepresented and underserved communities closer to fair representation in the current technological landscape. It also hopes to help mitigate the cultural bias that stems from the deliberate choice to focus AI and LLM development on English-speaking communities and a few other popular languages. To this end, the Aya Project brought together over 3,000 independent researchers from 119 countries.

Achieving fair representation means ensuring that models perform comparably across languages. Aya improves performance for underserved languages on tasks including natural language understanding, summarization, and translation. C4AI benchmarked Aya against other leading open multilingual models, including mT0 and Bloomz, and found that Aya surpasses them by a wide margin. Aya averaged a 75% score in human evaluations against other open-source models and 80-90% in simulated win rates. More generally, Aya expands access to 50 previously unserved languages, such as Somali and Uzbek, providing an open-source model for many underrepresented languages.

The Aya Collection, released jointly with the Aya model, is the world's largest multilingual prompt and completion collection. It contains 513 million prompts and completions covering 114 languages, created by fluent speakers who built templates for a selection of datasets and augmented a curated list of datasets. The released data also includes the Aya Dataset, which C4AI describes as "the most extensive human-annotated, multilingual, instruction fine-tuning dataset to date," containing 204,000 human-curated annotations by fluent speakers of 67 languages. The dataset includes languages that had never before appeared in multilingual instruction-style datasets. Both the Aya Collection and the Aya Dataset contain examples in many languages that reflect natural, organic language use, including dialects and original contributions.

C4AI expects Aya to provide a foundation for other open science projects. Those who want to ensure that their languages are represented in the Aya Project can sign up to contribute and connect with others working toward fair language representation in AI.
