Author Archives: jefferson

10 Years of Archiving the Web Together

Posted on October 25, 2016 by jefferson

As the Internet Archive turns 20, the Archive-It community is proud to celebrate an anniversary of its own: 10 years of working with thousands of librarians, archivists, and others to preserve the web and build rich, expansive collections of websites for discovery and use by future generations. Eighteen partners inaugurated the Archive-It service in 2006. Since then, that list has grown to include more than 450 organizations and individuals, each with its own goals and collecting scope. In that time, they have added more than 17 billion (yes, with a “b”) URLs to their collections.

Archive-It partners over the years. Clockwise from top-left: Margaret Maes (Legal Information Preservation Alliance) and Nicholas Taylor (Stanford University); James Jacobs (Stanford University) and Kent Norsworthy (University of Texas at Austin); K12 web archivists at PS 174 in Queens; Renate Giacomuzzi, Elisabeth Sporer (University of Innsbruck), and Kristine Hanna (Internet Archive)

And to give you just a hint of how the overall collection has grown: that’s about 5 billion new URLs in just the last year! They’ve captured some momentous historical events, local community history, and social and cultural activity across more than 7,000 collections to date, everything from 700+ human rights sites to the Tea Party movement, and from tobacco industry records to Mormon missionaries’ blogs. And of course, who can forget all of the LOLcats? They’ve collaborated on capturing breaking news, opened doors to the next generation of curators in our K12 web archiving program, and explored their own collections in new forms with datasets leveraging our researcher services.

The Archive-It pilot website in 2005

Archive-It is Internet Archive’s web archiving service that helps institutions build, preserve, and provide access to collections of archived web content. It was developed in response to the needs of libraries, archives, historical societies, museums, and other organizations that sought to use the same powerful technology behind the Wayback Machine to curate their own web archives. The service was the first of its kind at the time, and it has since grown and expanded to meet the needs of an ever-widening scope of partners dedicated to archiving the web.

Adding a website to a collection in Archive-It 2.0, as released in July 2006.

Our pilot partners, who began testing a beta version of the service in late 2005, helped to develop and improve the essential tools that such a service would provide and used those tools to create collections, documenting local and global histories in a new way. Based on feedback from the pilot partners, the Archive-It web application launched publicly in 2006 with the most basic of curation tools: create a collection, capture content, and make it publicly available. The service and the community grew exponentially from there.

Archive-It 5.0 realtime crawl tracking.

The myriad partner-driven technical (to say nothing of aesthetic!) improvements of the last ten years are reflected in this year’s release of Archive-It 5.0, the first full redesign of the Archive-It web application since its launch. In the meantime, Archive-It continues to work with the community to preserve and provide access to amazing collections and to develop new tools for archiving the web, including new capture technologies, data transfer APIs, and more.

With year 11 (and Archive-It 5.1) just around the corner, we look forward to helping our partner institutions use new tools, build new collections, and expand the broader community working to archive the web.

Hacking Web Archives

Posted on August 31, 2016 by jefferson

The awkward teenage years of the web archive are over. It is now 27 years since Tim Berners-Lee created the web and 20 years since we at Internet Archive set out to systematically archive web content. As the web gains ever more “historicity” (i.e., it’s old and getting older, just like you!), it is increasingly recognized as a valuable historical record of interest to researchers and others working to study it at scale.

Thus, it has been exciting to see (and for us to support and participate in) a number of recent efforts in the scholarly and library/archives communities to hold hackathons and datathons focused on getting web archives into the hands of researchers and users. The events have served to help build a collaborative framework to encourage more use, more exploration, more tools and services, and more hacking (and similar levels of the sometimes-maligned-but-ever-valuable yacking) to support research use of web archives. Get the data to the people!

First, in May, in partnership with the Alexandria Project of L3S at University of Hannover in Germany, we helped sponsor “Exploring the Past of the Web: Alexandria & Archive-It Hackathon” alongside the Web Science 2016 conference. Over 15 researchers came together to analyze almost two dozen subject-based web archives created by institutions using our Archive-It service. Universities, archives, museums, and others contributed web archive collections on topics ranging from the Occupy Movement to Human Rights to Contemporary Women Artists on the Web. Hackathon teams geo-located IP addresses, analyzed sentiments and entities in webpage text, and studied MIME type distributions.

Similarly, in June, our friends at Library of Congress hosted the second Archives Unleashed datathon, a follow-on to a previous event held at University of Toronto in March 2016. The fantastic team organizing these two Archives Unleashed hackathons has created an excellent model for bringing together transdisciplinary researchers and librarians/archivists to foster work with web data. In both Archives Unleashed events, attendees formed self-selecting teams to work together on specific analytical approaches and with specific web archive collections and datasets provided by Library of Congress, Internet Archive, University of Toronto, GWU’s Social Feed Manager, and others. The #hackarchives tweet stream gives some insight into the hacktivities, and the top projects were presented at the Save The Web symposium held at LC’s Kluge Center the day after the event.

Both events show a bright future for expanding new access models, scholarship, and collaborations around building and using web archives. Plus, nobody crashed the wi-fi at any of these events! Yay!

Special thanks go to Altiscale (and Start Smart Labs) and ComputeCanada for providing cluster computing services to support these events. Thanks also go to the multiple funding agencies, including NSF and SSHRC, that provided funding, and to the many co-sponsoring and hosting institutions. Super special thanks go to key organizers Helge Holzmann and Avishek Anand at L3S and Matt Weber, Ian Milligan, and Jimmy Lin at Archives Unleashed, who made these events a rollicking success.

For those interested in participating in a web archives hackathon/datathon, more are in the works, so stay tuned to the usual social media channels. If you are interested in helping host an event, please let us know. And for those who can’t make an event but are interested in working with web archives data, check out our Archives Research Services Workshop.

Lastly, some links to blog posts, projects, and tools from these events:

Some related blog posts:

Some hackathon projects:

Some web archive analysis tools:

Here’s to more happy web archives hacking in the future!

IMLS National Digital Platform Grant Awarded to Advance Web Archiving

Posted on October 8, 2015 by jefferson

We are excited to announce that the Institute of Museum and Library Services (IMLS) has recently awarded a National Leadership Grant, in the National Digital Platform category, to a proposal by Internet Archive’s Archive-It, Stanford University Libraries (DLSS and LOCKSS), University of North Texas, and Rutgers University. The $353,221 grant will support the project “Systems Interoperability and Collaborative Development for Web Archiving,” a two-year research project to test economic and community models for collaborative technology development, prototype system integration through development of Export APIs, and build community participation in web archiving development and new research and access tools. In addition to the technical development included in the scope of work, the project will also host a National Symposium on Web Archiving Interoperability in early 2017.

The project supports the National Digital Platform funding priority of IMLS by increasing access to shared services and infrastructure while building capacity for broader community input in technology development. Project outcomes will promote system integration, facilitate increased distributed preservation of archived data, and help support new global and local access models possible through export APIs, with an eye towards modeling post-grant interoperable systems architectures. Archive-It’s status as widely used, shared web archiving infrastructure ensures broad community impact and makes possible the involvement of institutions of all sizes in project work. The involvement of Stanford University Libraries builds on their work in the Hydra community and with digital preservation services. UNT contributes experience in digital library and web archiving technology development, and Rutgers’ work on research uses of web archives ensures the involvement of downstream user communities. Overall, the project will lay the groundwork for future collaboration around interoperability that will enhance the integration of disparate systems, increase local preservation, and improve the discoverability and use of web archives.

The outcomes of the project will build on the past and current collaborations of project partners, as well as Archive-It’s work on API development internally and in related collaborative development work. Project partners’ roles in affiliated groups like the IIPC, LDCX, NDSA, and the Web Science community ensure the involvement of the larger digital library and internet researcher communities. The two-year project will run from January 2016 through December 2017, and Jefferson Bailey, Director, Web Archiving Programs, Internet Archive, will serve as Project Director.

We thank IMLS for their generous support of this project and their ongoing support for libraries and archives working collaboratively towards building a sustained National Digital Platform. The complete list of IMLS-funded projects this award cycle is available online and the full narratives of all projects funded as part of the National Digital Platform were published on the IMLS blog. Go IMLS!

Two Grants Announced Supporting Web Archiving

Posted on April 26, 2015 by jefferson

We are excited to announce Internet Archive’s participation in two new grant-funded collaborative projects to advance the field of web archiving! Our Archive-It service, which works with libraries, archives, museums and others to provide the tools for institutions to create their own web archives, will partner with New York University and Old Dominion University on two separate areas of work. We thank both The Andrew W. Mellon Foundation and the Institute of Museum and Library Services (IMLS) for their recognition of the value of web archiving and their support for the continued development of tools and initiatives to expand the quality, accessibility, and extensibility of these collections. We also thank our awesome collaborative partners on these projects, New York University Libraries, NYU’s Moving Image Archiving and Preservation (MIAP) program, and Old Dominion University’s Web Science and Digital Libraries Research Group and look forward to working with them as part of our broader initiative for “Building Libraries Together.”

For the project “Archiving the Websites of Contemporary Composers,” led by NYU Libraries and funded with a grant of $480,000 from The Andrew W. Mellon Foundation, we will work with the Libraries and MIAP. This project will archive web-based and born-digital audiovisual materials, and research and develop tools for their improved capture and discoverability. Contemporary musical works, as well as the rich secondary materials that accompany them, are increasingly migrating to the web. We outlined a number of current challenges to capturing and replaying online multimedia, such as dynamic and transient URL generation and adaptive bitrate streaming, as well as a need for continued research and development around the integration of web archives and non-web collections.

We have two specific pieces of work in the grant. First, we will build tools to improve the crawling and capture of web-based audiovisual materials, addressing the increasing complexity of streaming audiovisual materials, especially on third-party hosting and sharing platforms. This development work will build on our experience creating “Heritrix helper” tools like Umbra. Our second area of work will explore methods to integrate discovery of high-quality, non-web multimedia content held in external repositories into the Archive-It platform. Linking Archive-It collections with non-web institutional content has great potential to integrate web and non-web archives. This work will build on NYU’s creation of an API for their preservation repository, our increased use of API-based systems integration in Archive-It 5.0, and our continued work on improved content discovery for web collections. See NYU’s press release for more details.

The second recently announced grant project is being led by Old Dominion University’s Web Science and Digital Libraries Research Group, which received a $468,618 National Leadership Grant for Libraries from IMLS for the project “Combining Social Media Storytelling With Web Archives” (grant number LG-71-15-0077). Readers not familiar with ODU’s great history of research and development around web archives are encouraged to check out projects such as WARCreate/WAIL, their work on visualizations and Archive-It, and our recent favorite, the #whatdiditlooklike tool. In this project, ODU will be building tools and processes to assimilate user-focused, online storytelling methods, such as Storify, to 1) summarize existing collections and 2) bootstrap new or expand existing web archive collections. The project will provide new ways to create unique topical and thematic collections through URLs shared via social media and storytelling platforms.

We will be working with them to integrate these tools in Archive-It, conduct user testing and training, and explore other ways that storytelling and user-generated materials can help build narrative pathways into large, often diffuse, collections of web content. We are excited to work with ODU and continue our increased focus on new models of access for web archives, as many institutional web collections are now of a breadth, volume, and operational maturity to begin focusing on novel ways their web archives can be studied and better understood by users and researchers.

Thanks again to Mellon Foundation and IMLS for supporting these cooperative efforts to advance web archiving. We are excited to work with our great partners and the broader community to keep preserving and expanding access to the rich historical and cultural record documented on the web.

Internet Archive Blogs

A blog from the team at archive.org
