Tag Archives: research

Internet Archive and Center for Open Science Collaborate to Preserve Open Science Data

Open Science and research reproducibility rely on ongoing access to research data. With funding from the Institute of Museum and Library Services’ National Leadership Grants for Libraries program, the Internet Archive (IA) and Center for Open Science (COS) will work together to ensure that open data related to the scientific research process is archived for perpetual access, redistribution, and reuse. The project aims to leverage the intersection between open research data, the long-term stewardship activities of libraries, and distributed data sharing and preservation networks. By focusing on these three areas of work, the project will test and implement infrastructure for improved data sharing in further support of open science and data curation. Building out interoperability between open data platforms like the Open Science Framework (OSF) of COS, large scale digital archives like IA, and collaborative preservation networks has the potential to enable more seamless distribution of open research data and enable new forms of custody and use. See also the press release from COS announcing this project.

OSF supports the research lifecycle by enabling researchers to produce and manage registrations and data artifacts for further curation to foster adoption and discovery. The Internet Archive works with 700+ institutions to collect, archive, and provide access to born-digital and web-published resources and data. Preservation at IA of open data on OSF will make this data available to other preservation networks and curatorial partners for distributed long-term stewardship and local custody by research institutions using both COS and IA services. The project will also partner with a number of preservation networks and repositories to mirror portions of this data and test interoperability among additional stewardship organizations and digital preservation systems.

Beyond the prototyping and technical work of data archiving, the teams will also be conducting training, including the development of open education resources, webinars, and similar materials to ensure data librarians can incorporate the project deliverables into their local research data management workflows. The two-year project will first focus on OSF Registrations data and expand to include other open access materials hosted on OSF. Later stage work will test interoperable approaches to sharing subsets of this data with other preservation networks such as LOCKSS, AP Trust, and individual university libraries. Together, IA and COS aim to lay the groundwork for seamless technical integration supporting the full lifecycle of data publishing, distribution, preservation, and perpetual access.

Project contacts:
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
COS – Nici Pfeiffer, Director of Product, nici [at] cos.io

Internet Archive Partners with University of Edinburgh to Provide Historical Web Data Supporting Machine Translation

The Internet Archive will provide portions of its web archive to the University of Edinburgh to support the School of Informatics’ work building open data and tools for advancing machine translation, especially for low-resource languages. Machine translation is the process of automatically converting text in one language to another.

The ParaCrawl project is mining translated text from the web in 29 languages. With over 1 million translated sentences available for several languages, ParaCrawl is often the largest open collection of translations for each language. The project is a collaboration between the University of Edinburgh, University of Alicante, Prompsit, TAUS, and Omniscien with funding from the EU’s Connecting Europe Facility. Internet Archive data is vastly expanding the data mined by ParaCrawl and therefore the number of translated sentences collected. Led by Kenneth Heafield of the University of Edinburgh, the overall project will yield open corpora and open-source tools for machine translation as well as the processing pipeline.

Archived web data from IA’s general web collections will be used in the project.  Because translations are particularly scarce for Icelandic, Croatian, Norwegian, and Irish, the IA will also use customized internal language classification tools to prioritize and extract data in these languages from archived websites in its collections.

The partnership expands on IA’s ongoing effort to provide computational research services to large-scale data mining projects focusing on open-source technical developments for furthering the public good and open access to information and data. Other recent collaborations include providing web data for assessing the state of local online news nationwide, analyzing historical corporate industry classifications, and mapping online social communities. IA is also expanding its work to make custom extractions and datasets available from its 20+ years of historical web data. For further information on IA’s web and data services, contact webservices at archive dot org.

Hacking Web Archives

The awkward teenage years of the web archive are over. It is now 27 years since Tim Berners-Lee created the web and 20 years since we at Internet Archive set out to systematically archive web content. As the web gains ever more “historicity” (i.e., it’s old and getting older, just like you!), it is increasingly recognized as a valuable historical record of interest to researchers and others working to study it at scale.

Thus, it has been exciting to see, and for us to support and participate in, a number of recent efforts in the scholarly and library/archives communities to hold hackathons and datathons focused on getting web archives into the hands of researchers and users. The events have served to help build a collaborative framework to encourage more use, more exploration, more tools and services, and more hacking (and similar levels of the sometimes-maligned-but-ever-valuable yacking) to support research use of web archives. Get the data to the people!

First, in May, in partnership with the Alexandria Project of L3S at University of Hannover in Germany, we helped sponsor “Exploring the Past of the Web: Alexandria & Archive-It Hackathon” alongside the Web Science 2016 conference. Over 15 researchers came together to analyze almost two dozen subject-based web archives created by institutions using our Archive-It service. Universities, archives, museums, and others contributed web archive collections on topics ranging from the Occupy Movement to Human Rights to Contemporary Women Artists on the Web. Hackathon teams geo-located IP addresses, analyzed sentiments and entities in webpage text, and studied MIME type distributions.

Similarly, in June, our friends at Library of Congress hosted the second Archives Unleashed datathon, a follow-on to a previous event held at University of Toronto in March 2016. The fantastic team organizing these two Archives Unleashed hackathons has created an excellent model for bringing together transdisciplinary researchers and librarians/archivists to foster work with web data. In both Archives Unleashed events, attendees formed into self-selecting teams to work together on specific analytical approaches and with specific web archive collections and datasets provided by Library of Congress, Internet Archive, University of Toronto, GWU’s Social Feed Manager, and others. The #hackarchives tweet stream gives some insight into the hacktivities, and the top projects were presented at the Save The Web symposium held at LC’s Kluge Center the day after the event.

Both events show a bright future for expanding new access models, scholarship, and collaborations around building and using web archives. Plus, nobody crashed the wi-fi at any of these events! Yay!

Special thanks go to Altiscale (and Start Smart Labs) and ComputeCanada for providing cluster computing services to support these events. Thanks also go to the multiple funding agencies, including NSF and SSHRC, that provided funding, and to the many co-sponsoring and hosting institutions. Super special thanks go to key organizers, Helge Holzmann and Avishek Anand at L3S and Matt Weber, Ian Milligan, and Jimmy Lin at Archives Unleashed, who made these events a rollicking success.

For those interested in participating in a web archives hackathon/datathon, more are in the works, so stay tuned to the usual social media channels. If you are interested in helping host an event, please let us know. And for those who can’t make an event but are interested in working with web archives data, check out our Archives Research Services Workshop.

Lastly, some links to blog posts, projects, and tools from these events:

Some related blog posts:

Some hackathon projects:

Some web archive analysis tools:

Here’s to more happy web archives hacking in the future!