The Internet Archive will provide portions of its web archive to the University of Edinburgh to support the School of Informatics’ work building open data and tools for advancing machine translation, especially for low-resource languages. Machine translation is the process of automatically converting text in one language to another.
The ParaCrawl project is mining translated text from the web in 29 languages. With over 1 million translated sentences available for several languages, ParaCrawl is often the largest open collection of translations for each language. The project is a collaboration between the University of Edinburgh, University of Alicante, Prompsit, TAUS, and Omniscien with funding from the EU’s Connecting Europe Facility. Internet Archive data is vastly expanding the data mined by ParaCrawl and therefore the amount of translated sentences collected. Lead by Kenneth Heafield of the University of Edinburgh, the overall project will yield open corpora and open-source tools for machine translation as well as the processing pipeline.
Archived web data from IA’s general web collections will be used in the project. Because translations are particularly scarce for Icelandic, Croatian, Norwegian, and Irish, the IA will also use customized internal language classification tools to prioritize and extract data in these languages from archived websites in its collections.
The partnership expands on IA’s ongoing effort to provide computational research services to large-scale data mining projects focusing on open-source technical developments for furthering the public good and open access to information and data. Other recent collaborations include providing web data for assessing the state of local online news nationwide, analyzing historical corporate industry classifications, and mapping online social communities. As well, IA is expanding its work in making available custom extractions and datasets from its 20+ years of historical web data. For further information on IA’s web and data services, contact webservices at archive dot org.