For scholars, especially those in the humanities, the library is their laboratory. Published works and manuscripts are their materials of science. Today, to do meaningful research, that also means having access to modern datasets that facilitate data mining and machine learning.
On March 2, the Internet Archive launched a new series of webinars highlighting its efforts to support data-intensive scholarship and digital humanities projects. The first session focused on the methods and techniques available for analyzing web archives at scale.
Watch the session recording now:
“If we can have collections of cultural materials that are useful in ways that are easy to use — still respectful of rights holders — then we can start to get a bigger idea of what’s going on in the media ecosystem,” said Internet Archive Founder Brewster Kahle.
Just what can be done with billions of archived web pages? The possibilities are endless.
Jefferson Bailey, Internet Archive’s Director of Web Archiving & Data Services, and Helge Holzmann, Web Data Engineer, shared some of the technical issues libraries should consider and tools available to make large amounts of digital content available to the public.
The Internet Archive gathers information from the web through different methods including global and domain crawling, data partnerships and curation services. It preserves different types of content (text, code, audio-visual) in a variety of formats.
Social scientists, data analysts, historians and literary scholars make requests for data from the web archive for computational use in their research. Institutions use its service to build small and large collections for a range of purposes. Sometimes the projects can be complex and it can be a challenge to wrangle the volume of data, said Bailey.
The Internet Archive has worked on a project reviewing changes to the content of 800,000 corporate home pages since 1996. It has also done data mining for a language analysis that did custom extractions for Icelandic, Norwegian and Irish translation.
Transforming data into useful information requires data engineering. As librarians consider how to respond to inquiries for data, they should look at their tech resources, workflow and capacity. While more complicated to produce, the potential has expanded given the size, scale and longitudinal analysis that can be done.
“We are getting more and more computational use data requests each year,” Bailey said. “If librarians, archivists, cultural heritage custodians haven’t gotten these requests yet, they will be getting them soon.”
Up next in the Library as Laboratory series:
The next webinar in the series will be held March 16, and will highlight five innovative web archiving research projects from the Archives Unleashed Cohort Program. Register now.