Archived web data and collections are increasingly important to scholarly practice, especially to those scholars interested in data mining and computational approaches to analyzing large sets of data, text, and records from the web. For over a decade Internet Archive has worked to support computational use of its web collections through a variety of services, from making raw crawl data available to researchers, performing customized extraction and analytic services supporting network or language analysis, to hosting web data hackathons and having dataset download features in our popular suite of web archiving services in Archive-It. Since 2016, we have also collaborated with the Archives Unleashed project to support their efforts to build tools, platforms, and learning materials for social science and humanities scholars to study web collections, including those curated by the 700+ institutions using Archive-It.
We are excited to announce a significant expansion of our partnership. With a generous award of $800,000 (USD) to the University of Waterloo from The Andrew W. Mellon Foundation, Archives Unleashed and Archive-It will broaden our collaboration and further integrate our services to provide easy-to-use, scalable tools to scholars, researchers, librarians, and archivists studying and stewarding web archives. Further integration of Archives Unleashed and Archive-It’s Research Services (and IA’s Web & Data Services more broadly) will simplify the ability of scholars to analyze archived web data and give digital archivists and librarians expanded tools for making their collections available as data, as pre-packaged datasets, and as archives that can be analyzed computationally. It will also offer researchers a best-of-class, end-to-end service for collecting, preserving, and analyzing web-published materials.
The Archives Unleashed team brings together a team of co-investigators. Professor Ian Milligan, from the University of Waterloo’s Department of History, Jimmy Lin, Professor and Cheriton Chair at Waterloo’s Cheriton School of Computer Science, and Nick Ruest, Digital Assets Librarian in the Digital Scholarship Infrastructure department of York University Libraries, along with Jefferson Bailey, Director of Web Archiving & Data Services at the Internet Archive, will all serve as co-Principal Investigators on the “Integrating Archives Unleashed Cloud with Archive-It” project. This project represents a follow-on to the Archives Unleashed project that began in 2017, also funded by The Andrew W. Mellon Foundation.
“Our first stage of the Archives Unleashed Project,” explains Professor Milligan, “built a stand-alone service that turns web archive data into a format that scholars could easily use. We developed several tools, methods and cloud-based platforms that allow researchers to download a large web archive from which they can analyze all sorts of information, from text and network data to statistical information. The next logical step is to integrate our service with the Internet Archive, which will allow a scholar to run the full cycle of collecting and analyzing web archival content through one portal.”
“Researchers, from both the sciences and the humanities, are finally starting to realize the massive trove of archived web materials that can support a wide variety of computational research,” said Bailey. “We are excited to scale up our collaboration with Archives Unleashed to make the petabytes of web and data archives collected by Archive-It partners and other web archiving institutions around the world more useful for scholarly analysis.”
The project begins in July 2020 and will begin releasing public datasets as part of the integration later in the year. Upcoming and future work includes technical integration of Archives Unleashed and Archive-It, creation and release of new open-source tools, datasets, and code notebooks, and a series of in-person “datathons” supporting a cohort of scholars using archived web data and collections in their data-driven research and analysis. We are grateful to The Andrew W. Mellon Foundation for their support of this integration and collaboration in support of critical infrastructure supporting computational scholarship and its use of the archived web.
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
AU – Ian Milligan, Professor of History, University of Waterloo, i2milligan [at] uwaterloo.ca