Over 200 terabytes of the government web archived!

In our December post, “Preserving U.S. Government Websites and Data as the Obama Term Ends,” we described our participation in the End of Term Web Archive project to preserve federal government websites and data at times of administration changes. We wanted to give a quick update on the project — we have archived a heck of a lot of data!

Between Fall 2016 and Spring 2017, the Internet Archive archived over 200 terabytes of government websites and data. This includes over 100TB of public websites and over 100TB of public data from federal FTP file servers, together totaling over 350 million URLs/files. Among these are over 70 million HTML pages, over 40 million PDFs and, towards the other end of the spectrum and for semantic web aficionados, 8 files of the text/turtle MIME type. Other End of Term partners have also been vigorously preserving websites and data from the .gov/.mil web domains.

Every web page we have archived is accessible through the Wayback Machine, and we are working to add the 2016 harvest to the main End of Term portal soon. While we continue to analyze this collection, we have posted some preliminary statistics using the new Wayback Machine’s summary interface for this specific collection; they can be found on the End of Term (EOT 2016) summary stats page. Those and additional stats are served via a public EOT 2016 stats API, and the full collection is also available.
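For readers who want to look up an archived page from a script, here is a minimal Python sketch that builds a Wayback Machine replay URL from an original URL and a date. The helper name `wayback_url` is our own illustration, not part of any Internet Archive library, and the example address and date are arbitrary. It relies only on the Wayback Machine's public URL pattern of a 14-digit timestamp followed by the original address.

```python
from datetime import datetime

# Public replay prefix used by the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def wayback_url(original_url: str, when: datetime) -> str:
    """Return a Wayback Machine replay URL for `original_url` near `when`.

    Hypothetical helper for illustration: it only formats a URL and does
    not check whether a capture actually exists.
    """
    # Wayback timestamps are written as YYYYMMDDhhmmss.
    timestamp = when.strftime("%Y%m%d%H%M%S")
    return f"{WAYBACK_PREFIX}/{timestamp}/{original_url}"


if __name__ == "__main__":
    # Example: a federal site around the 2017 administration change.
    print(wayback_url("https://www.whitehouse.gov/", datetime(2017, 1, 20, 12, 0, 0)))
```

If no capture exists at exactly that moment, the Wayback Machine serves the capture closest to the requested timestamp, so an approximate date is usually enough.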

Through the EOT project’s public nomination form and through our collaboration with DataRefuge, the Environmental Data and Governance Initiative (EDGI), and other efforts, over 100,000 webpages or government datasets were nominated by citizens and preservationists for archiving. The EOT and community efforts have also garnered notable press (see our End of Term 2016 Press collection). We are working with partners to provide access to the full dataset for use in data mining and computational analysis, and we hosted a hackathon earlier this year to support use of the Obama White House Social Media datasets.

While the specific End of Term collection has closed, we continue our large-scale, dedicated efforts to preserve the government web. Working with the University of North Texas, we launched the Government Web & Data Archive nomination form so the public can continue to nominate public government websites and data to be archived.

Lastly, archiving government data remains a critical activity of the preservation community. You can support our role in these efforts by continuing to nominate websites, by promoting the EOT project via press and outreach (contact the EOT project team for any inquiries), and by donating to the Internet Archive to support our ongoing mission to provide “Universal Access to All Knowledge.”