Long before the 2016 Presidential election cycle librarians have understood this often-overlooked fact: vast amounts of government data and digital information are at risk of vanishing when a presidential term ends and administrations change. For example, 83% of .gov pdf’s disappeared between 2008 and 2012.
That is why the Internet Archive, along with partners from the Library of Congress, University of North Texas, George Washington University, Stanford University, California Digital Library, and other public and private libraries, are hard at work on the End of Term Web Archive, a wide-ranging effort to preserve the entirety of the federal government web presence, especially the .gov and .mil domains, along with federal websites on other domains and official government social media accounts.
While not the only project the Internet Archive is doing to preserve government websites, ftp sites, and databases at this time, the End of Term Web Archive is a far reaching one.
The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. The effort is likely to preserve hundreds of millions of individual government webpages and data and could end up totaling well over 100 terabytes of data of archived materials. Over its full history of web archiving, the Internet Archive has preserved over 3.5 billion URLs from the .gov domain including over 45 million PDFs.
This end-of-term collection builds on similar initiatives in 2008 and 2012 by original partners Internet Archive, Library of Congress, University of North Texas, and California Digital Library to document the “gov web,” which has no mandated, domain-wide single custodian. For instance, here is the National Institute of Literacy (NIFL) website in 2008. The domain went offline in 2011. Similarly, the Sustainable Development Indicators (SDI) site was later taken down. Other websites, such as invasivespecies.gov were later folded into larger agency domains. Every web page archived is accessible through the Wayback Machine and past and current End of Term specific collections are full-text searchable through the main End of Term portal. We have also worked with additional partners to provide access to the full data for use in data-mining research and projects.
The project has received considerable press attention this year, with related stories in The New York Times, Politico, The Washington Post, Library Journal, Motherboard, and others.
“No single government entity is responsible for archiving the entire federal government’s web presence,” explained Jefferson Bailey, the Internet Archive’s Director of Web Archiving. “Web data is already highly ephemeral and websites without a mandated custodian are even more imperiled. These sites include significant amounts of publicly-funded federal research, data, projects, and reporting that may only exist or be published on the web. This is tremendously important historical information. It also creates an amazing opportunity for libraries and archives to join forces and resources and collaborate to archive and provide permanent access to this material.”
This year has also seen a significant increase in citizen and librarian driven “hackathons” and “nomination-a-thons” where subject experts and concerned information professionals crowdsource lists of high-value or endangered websites for the End of Term archiving partners to crawl. Librarian groups in New York City are holding nomination events to make sure important sites are preserved. And universities such as The University of Toronto are holding events for “guerrilla archiving” focused specifically on preserving climate related data.
We need your help too! You can use the End of Term Nomination Tool to nominate any .gov or government website or social media site and it will be archived by the project team. If you have other ideas, please comment here or send ideas to email@example.com. And you can also help by donating to the Internet Archive to help our continued mission to provide “Universal Access to All Knowledge.”