Tag Archives: research

Archive-It and Archives Unleashed Join Forces to Scale Research Use of Web Archives

Archived web data and collections are increasingly important to scholarly practice, especially to those scholars interested in data mining and computational approaches to analyzing large sets of data, text, and records from the web. For over a decade Internet Archive has worked to support computational use of its web collections through a variety of services, from making raw crawl data available to researchers, performing customized extraction and analytic services supporting network or language analysis, to hosting web data hackathons and having dataset download features in our popular suite of web archiving services in Archive-It. Since 2016, we have also collaborated with the Archives Unleashed project to support their efforts to build tools, platforms, and learning materials for social science and humanities scholars to study web collections, including those curated by the 700+ institutions using Archive-It

We are excited to announce a significant expansion of our partnership. With a generous award of $800,000 (USD) to the University of Waterloo from The Andrew W. Mellon Foundation, Archives Unleashed and Archive-It will broaden our collaboration and further integrate our services to provide easy-to-use, scalable tools to scholars, researchers, librarians, and archivists studying and stewarding web archives.  Further integration of Archives Unleashed and Archive-It’s Research Services (and IA’s Web & Data Services more broadly) will simplify the ability of scholars to analyze archived web data and give digital archivists and librarians expanded tools for making their collections available as data, as pre-packaged datasets, and as archives that can be analyzed computationally. It will also offer researchers a best-of-class, end-to-end service for collecting, preserving, and analyzing web-published materials.

The Archives Unleashed team brings together a team of co-investigators.  Professor Ian Milligan, from the University of Waterloo’s Department of History, Jimmy Lin, Professor and Cheriton Chair at Waterloo’s Cheriton School of Computer Science, and Nick Ruest, Digital Assets Librarian in the Digital Scholarship Infrastructure department of York University Libraries, along with Jefferson Bailey, Director of Web Archiving & Data Services at the Internet Archive, will all serve as co-Principal Investigators on the “Integrating Archives Unleashed Cloud with Archive-It” project. This project represents a follow-on to the Archives Unleashed project that began in 2017, also funded by The Andrew W. Mellon Foundation.

“Our first stage of the Archives Unleashed Project,” explains Professor Milligan, “built a stand-alone service that turns web archive data into a format that scholars could easily use. We developed several tools, methods and cloud-based platforms that allow researchers to download a large web archive from which they can analyze all sorts of information, from text and network data to statistical information. The next logical step is to integrate our service with the Internet Archive, which will allow a scholar to run the full cycle of collecting and analyzing web archival content through one portal.”

“Researchers, from both the sciences and the humanities, are finally starting to realize the massive trove of archived web materials that can support a wide variety of computational research,” said Bailey. “We are excited to scale up our collaboration with Archives Unleashed to make the petabytes of web and data archives collected by Archive-It partners and other web archiving institutions around the world more useful for scholarly analysis.” 

The project begins in July 2020 and will begin releasing public datasets as part of the integration later in the year. Upcoming and future work includes technical integration of Archives Unleashed and Archive-It, creation and release of new open-source tools, datasets, and code notebooks, and a series of in-person “datathons” supporting a cohort of scholars using archived web data and collections in their data-driven research and analysis. We are grateful to The Andrew W. Mellon Foundation for their support of this integration and collaboration in support of critical infrastructure supporting computational scholarship and its use of the archived web.

Primary contacts:
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
AU – Ian Milligan, Professor of History, University of Waterloo, i2milligan [at] uwaterloo.ca

Internet Archive and Center for Open Science Collaborate to Preserve Open Science Data

Open Science and research reproducibility rely on ongoing access to research data. With funding from the Institute of Museum and Library Services’ National Leadership Grants for Libraries program, the Internet Archive (IA) and Center for Open Science (COS) will work together to ensure that open data related to the scientific research process is archived for perpetual access, redistribution, and reuse. The project aims to leverage the intersection between open research data, the long-term stewardship activities of libraries, and distributed data sharing and preservation networks. By focusing on these three areas of work, the project will test and implement infrastructure for improved data sharing in further support of open science and data curation. Building out interoperability between open data platforms like the Open Science Framework (OSF) of COS, large scale digital archives like IA, and collaborative preservation networks has the potential to enable more seamless distribution of open research data and enable new forms of custody and use. See also the press release from COS announcing this project.

OSF supports the research lifecycle by enabling researchers to produce and manage registrations and data artifacts for further curation to foster adoption and discovery. The Internet Archive works with 700+ institutions to collect, archive, and provide access to born-digital and web-published resources and data. Preservation at IA of open data on OSF will enable further availability of this data to other preservation networks and curatorial partners for distributed long term stewardship and local custody by research institutions using both COS and IA services. The project will also partner with a number of preservation networks and repositories to mirror portions of this data and test additional interoperability among additional stewardship organizations and digital preservation systems.

Beyond the prototyping and technical work of data archiving, the teams will also be conducting training, including the development of open education resources, webinars, and similar materials to ensure data librarians can incorporate the project deliverables into their local research data management workflows. The two-year project will first focus on OSF Registrations data and expand to include other open access materials hosted on OSF. Later stage work will test interoperable approaches to sharing subsets of this data with other preservation networks such as LOCKSS, AP Trust, and individual university libraries. Together, IA and COS aim to lay the groundwork for seamless technical integration supporting the full lifecycle of data publishing, distribution, preservation, and perpetual access.

Project contacts:
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
COS – Nici Pfeiffer, Director of Product, nici [at] cos.io

Internet Archive Partners with University of Edinburgh to Provide Historical Web Data Supporting Machine Translation

The Internet Archive will provide portions of its web archive to the University of Edinburgh to support the School of Informatics’ work building open data and tools for advancing machine translation, especially for low-resource languages. Machine translation is the process of automatically converting text in one language to another.

The ParaCrawl project is mining translated text from the web in 29 languages.  With over 1 million translated sentences available for several languages, ParaCrawl is often the largest open collection of translations for each language.   The project is a collaboration between the University of Edinburgh, University of Alicante, Prompsit, TAUS, and Omniscien with funding from the EU’s Connecting Europe Facility.  Internet Archive data is vastly expanding the data mined by ParaCrawl and therefore the amount of translated sentences collected. Lead by Kenneth Heafield of the University of Edinburgh, the overall project will yield open corpora and open-source tools for machine translation as well as the processing pipeline.  

Archived web data from IA’s general web collections will be used in the project.  Because translations are particularly scarce for Icelandic, Croatian, Norwegian, and Irish, the IA will also use customized internal language classification tools to prioritize and extract data in these languages from archived websites in its collections.

The partnership expands on IA’s ongoing effort to provide computational research services to large-scale data mining projects focusing on open-source technical developments for furthering the public good and open access to information and data. Other recent collaborations include providing web data for assessing the state of local online news nationwide, analyzing historical corporate industry classifications, and mapping online social communities. As well, IA is expanding its work in making available custom extractions and datasets from its 20+ years of historical web data. For further information on IA’s web and data services, contact webservices at archive dot org.

Hacking Web Archives

The awkward teenage years of the web archive are over. It is now 27 years since Tim Berners-Lee created the web and 20 years since we at Internet Archive set out to systematically archive web content. As the web gains evermore “historicity” (i.e., it’s old and getting older — just like you!), it is increasingly recognized as a valuable historical record of interest to researchers and others working to study it at scale.

Thus, it has been exciting to see — and for us to support and participate in — a number of recent efforts in the scholarly and library/archives communities to hold hackathons and datathons focused on getting web archives into the hands of research and users. The events have served to help build a collaborative framework to encourage more use, more exploration, more tools and services, and more hacking (and similar levels of the sometime-maligned-but-ever-valuable yacking) to support research use of web archives. Get the data to the people!

pngl3s_hackathon_postFirst, in May, in partnership with the Alexandria Project of L3S at University of Hannover in Germany, we helped sponsor “Exploring the Past of the Web: Alexandria & Archive-It Hackathonalongside the Web Science 2016 conference. Over 15 researchers came together to analyze almost two dozen subject-based web archives created by institutions using our Archive-It service. Universities, archives, museums, and others contributed web archive collections on topics ranging from the Occupy Movement to Human Rights to Contemporary Women Artists on the Web. Hackathon teams geo-located IP addresses, analyzed sentiments and entities in webpage text, and studied mime type distributions.

unleashed attendeesunleashed_vizSimilarly, in June, our friends at Library of Congress hosted the second Archives Unleashed  datathon, a follow-on to a previous event held at University of Toronto in March 2016. The fantastic team organizing these two Archives Unleashed hackathons have created an excellent model for bringing together transdisciplinary researchers and librarians/archivists to foster work with web data. In both Archives Unleashed events, attendees formed into self-selecting teams to work together on specific analytical approaches and with specific web archive collections and datasets provided by Library of Congress, Internet Archive, University of Toronto, GWU’s Social Feed Manager, and others. The #hackarchives tweet stream gives some insight into the hacktivities, and the top projects were presented at the Save The Web symposium held at LC’s Kluge Center the day after the event.

Both events show a bright future for expanding new access models, scholarship, and collaborations around building and using web archives. Plus, nobody crashed the wi-fi at any of these events! Yay!

Special thanks go to Altiscale (and Start Smart Labs) and ComputeCanada for providing cluster computing services to support these events. Thanks also go to the multiple funding agencies, including NSF and SSHRC, that provided funding, and to the many co-sponsoring and hosting institutions. Super special thanks go to key organizers, Helge Holzman and Avishek Anand at L3S and Matt Weber, Ian Milligan, and Jimmy Lin at Archives Unleashed, who made these events a rollicking success.

For those interested in participating in a web archives hackathon/datathon, more are in the works, so stay tuned to the usual social media channels. If you are interested in helping host an event, please let us know. Lastly, for those that can’t make an event, but are interested in working with web archives data, check out our Archives Research Services Workshop.

Lastly, some links to blog posts, projects, and tools from these events:

Some related blog posts:

Some hackathon projects:

Some web archive analysis tools:

Here’s to more happy web archives hacking in the future!