Internet Archive Partners with University of Edinburgh to Provide Historical Web Data Supporting Machine Translation

The Internet Archive will provide portions of its web archive to the University of Edinburgh to support the School of Informatics’ work building open data and tools for advancing machine translation, especially for low-resource languages. Machine translation is the process of automatically converting text in one language to another.

The ParaCrawl project is mining translated text from the web in 29 languages. With over 1 million translated sentences available for several languages, ParaCrawl is often the largest open collection of translations for a given language. The project is a collaboration between the University of Edinburgh, University of Alicante, Prompsit, TAUS, and Omniscien with funding from the EU’s Connecting Europe Facility. Internet Archive data is vastly expanding the data mined by ParaCrawl and therefore the number of translated sentences collected. Led by Kenneth Heafield of the University of Edinburgh, the overall project will yield open corpora and open-source tools for machine translation as well as the processing pipeline.

Archived web data from IA’s general web collections will be used in the project. Because translations are particularly scarce for Icelandic, Croatian, Norwegian, and Irish, the IA will also use customized internal language classification tools to prioritize and extract data in these languages from archived websites in its collections.

The partnership expands on IA’s ongoing effort to provide computational research services to large-scale data mining projects focusing on open-source technical developments for furthering the public good and open access to information and data. Other recent collaborations include providing web data for assessing the state of local online news nationwide, analyzing historical corporate industry classifications, and mapping online social communities. IA is also expanding its work on making custom extractions and datasets available from its 20+ years of historical web data. For further information on IA’s web and data services, contact webservices at archive dot org.

3 thoughts on “Internet Archive Partners with University of Edinburgh to Provide Historical Web Data Supporting Machine Translation”

  1. Christopher Cardoen

    Hi there,

    Beautiful Article!

    So if I understand it right, archived websites will be translated in the future?

    I’m not a native English speaker, so pardon me if I didn’t get it.

    Thank you for your reply.

    Kind regards,

  2. Beth Stevens Breen

    This is Beth Stevens Breen. I just read the above reply about the Internet Archive. It is an amazing place filled with information. The “WayBack Time Machine” is historical to say the least. It is a very familiar place to me and I’m glad you enjoy the information as well. I like to read various articles and compare information from the past and present. I like to question and challenge ideas and information. I am glad you like it as well :) The internet system is a very interesting thing.
