Posted by Isaac Caswell and Ankur Bapna, Research Scientists, Google Translate

Machine translation (MT) technology has made significant advances in recent years, as deep learning has been integrated with natural language processing (NLP). Performance on research benchmarks like WMT has soared, and translation services have improved in quality and expanded to include new languages. Nevertheless, while existing translation services cover languages spoken by the majority of people worldwide, they only include around 100 languages in total, just over 1% of those actively spoken globally. Moreover, the languages that are currently represented are overwhelmingly European, largely overlooking regions of high linguistic diversity, like Africa and the Americas.

There are two key bottlenecks to building functioning translation models for the long tail of languages. The first arises from data scarcity: digitized data for many languages is limited and can be difficult to find on the web due to quality issues with Language Identification (LangID) models. The second arises from modeling limitations: MT models usually train on large amounts of parallel (translated) text, but without such data, models must learn to translate from limited amounts of monolingual text, which is a novel area of research. Both of these challenges need to be addressed for translation models to reach sufficient quality.

In "Building Machine Translation Systems for the Next Thousand Languages", we describe how to build high-quality monolingual datasets for over a thousand languages that do not have translation datasets available, and demonstrate how one can use monolingual data alone to train MT models. As part of this effort, we are expanding Google Translate to include 24 under-resourced languages. For these languages, we created monolingual datasets by developing and using specialized neural language identification models combined with novel filtering approaches. The techniques we introduce supplement massively multilingual models with a self-supervised task to enable zero-resource translation. Finally, we highlight how native speakers have helped us realize this accomplishment.

Automatically gathering usable textual data for under-resourced languages is much more difficult than it may seem. Tasks like LangID, which work well for high-resource languages, are unsuccessful for under-resourced languages, and many publicly available datasets crawled from the web often contain more noise than usable data for the languages they attempt to support. In our early attempts to identify under-resourced languages on the web by training a standard Compact Language Detector v3 (CLD3) LangID model, we too found that the dataset was too noisy to be usable. As an alternative, we trained a Transformer-based, semi-supervised LangID model on over 1000 languages. This model supplements the LangID task with the MAsked Sequence-to-Sequence (MASS) task to better generalize over noisy web data. MASS simply garbles the input by randomly removing sequences of tokens from it, and trains the model to predict these sequences. We applied the Transformer-based model to a dataset that had been filtered with a CLD3 model and trained to recognize clusters of similar languages.
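The MASS-style corruption described above can be sketched in a few lines: hide one contiguous span of tokens from the input and train the sequence-to-sequence model to reconstruct that span. The function below is a minimal, hypothetical illustration (whole-word tokens, a single masked span, and a fixed mask ratio are assumptions for clarity; the actual model operates on subword tokens inside a Transformer training loop).

```python
import random

def mass_mask(tokens, mask_ratio=0.5, mask_token="[MASK]", seed=None):
    """MASS-style masking sketch: hide one contiguous span of tokens.

    Returns the corrupted input (span replaced by mask tokens) and the
    hidden span, which a seq2seq model would be trained to reconstruct.
    """
    rng = random.Random(seed)
    span_len = max(1, int(len(tokens) * mask_ratio))
    start = rng.randint(0, len(tokens) - span_len)
    masked = tokens[:start] + [mask_token] * span_len + tokens[start + span_len:]
    target = tokens[start:start + span_len]
    return masked, target

sentence = "the model learns to reconstruct missing spans".split()
corrupted, hidden = mass_mask(sentence, mask_ratio=0.5, seed=0)
print(corrupted)  # input with a contiguous span replaced by [MASK] tokens
print(hidden)     # the removed span the model must predict
```

Training on this reconstruction objective forces the model to use surrounding context, which is what helps the LangID model generalize over noisy web text.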