Researchers at the Amazon Web Services (AWS) AI lab have found that a large amount of online content comes from machine translation (MT).
This content, translated into many different languages, is often of low quality, and the team says this highlights the critical need for data quality and resource considerations when training large language models (LLMs).
The researchers also found that machine-generated content is common among translations into lower-resource languages and makes up a significant portion of all web content in those languages.
Selection bias
“We actually became interested in this topic because several colleagues working in MT who are native speakers of low-resource languages noticed that much of the Internet in their native language seemed to be MT-generated,” Mehak Dhaliwal, a former applied science intern at AWS and current doctoral candidate at the University of California, Santa Barbara, told Motherboard.
“So the insight really came from the low-resourced language speakers, and we did the research to better understand the problem and see how widespread it was.”
To better understand the characteristics of machine-translated content, the team built a large resource called the Multi-Way ccMatrix (MWccMatrix). It contains 6.4 billion unique sentences in 90 languages, organized into translation tuples: sets of sentences in different languages that are translations of each other.
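To make the idea of a translation tuple concrete, here is a minimal Python sketch of one way pairwise translation links could be grouped into multi-way sets. It is an illustration under stated assumptions, not the researchers' actual pipeline: the function name `build_translation_tuples`, the input format of `(language, sentence)` pairs, and the sample sentences are all hypothetical.

```python
from collections import defaultdict


def build_translation_tuples(pairs):
    """Group (lang, sentence) items linked by pairwise translations.

    Hypothetical helper: sentences connected through any chain of
    pairwise translation links end up in the same tuple (union-find).
    """
    parent = {}

    def find(item):
        parent.setdefault(item, item)
        # Walk up to the root, shortening the path along the way.
        while parent[item] != item:
            parent[item] = parent[parent[item]]
            item = parent[item]
        return item

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    for left, right in pairs:
        union(left, right)

    groups = defaultdict(set)
    for item in list(parent):
        groups[find(item)].add(item)
    return [sorted(group) for group in groups.values()]


# Made-up sample data, for illustration only.
pairs = [
    (("en", "Good morning."), ("fr", "Bonjour.")),
    (("fr", "Bonjour."), ("de", "Guten Morgen.")),
    (("en", "Thank you."), ("es", "Gracias.")),
]
tuples = build_translation_tuples(pairs)
for group in tuples:
    print(len(group), group)
```

In this sketch, the size of each tuple indicates how many language versions of a sentence were linked together, which is the kind of property that makes heavily multi-way translated content easy to spot.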
The study, posted to the arXiv preprint server hosted by Cornell University, found that a large amount of web content is translated into numerous languages, usually through machine translation. This content not only appears among translations into languages with fewer resources, but also makes up a significant portion of all web content in those languages.
The researchers additionally noted a selection bias in the type of content that is translated into multiple languages, likely for the purpose of generating advertising revenue.
The paper concludes that “MT technology has improved dramatically over the past decade, but still falls short of human quality. MT content has been added to the Web over many years using whatever MT systems were available at the time, so much of the MT on the Internet is likely to be of very low quality by modern standards. This could produce less smooth LLM models with more hallucinations, and the selection bias indicates that the data may be of lower quality even before MT errors are taken into account. Data quality is crucial in LLM training, where high-quality corpora, such as books and Wikipedia articles, are typically upsampled multiple times.”