
Over 200 terabytes of the government web archived!

In our December post, “Preserving U.S. Government Websites and Data as the Obama Term Ends,” we described our participation in the End of Term Web Archive project to preserve federal government websites and data at times of administration changes. We wanted to give a quick update on the project — we have archived a heck of a lot of data!

Between Fall 2016 and Spring 2017, the Internet Archive archived over 200 terabytes of government websites and data: over 100TB of public websites and over 100TB of public data from federal FTP file servers, together totaling over 350 million URLs/files. Among them are over 70 million HTML pages, over 40 million PDFs, and, toward the other end of the spectrum and for semantic web aficionados, 8 files of the text/turtle mime type. Other End of Term partners have also been vigorously preserving websites and data from the .gov/.mil web domains.

Every web page we have archived is accessible through the Wayback Machine, and we are working to add the 2016 harvest to the main End of Term portal soon. While we continue to analyze this collection, we have posted some preliminary statistics from the new Wayback Machine's summary interface on the End of Term (EOT 2016) summary stats page; those and additional stats are served via a public EOT 2016 stats API, and the full collection is also available.
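
For readers who want to poke at the collection themselves, the Wayback Machine's public CDX API is a good starting point. Below is a minimal sketch, assuming Python 3 with the requests package installed; the example URL, date range, and field list are illustrative choices, not project settings:

    import requests

    # Query the Wayback Machine's public CDX API for captures of a URL.
    # The URL and date range below are illustrative examples.
    params = {
        "url": "whitehouse.gov",
        "from": "20160901",              # Fall 2016...
        "to": "20170531",                # ...through Spring 2017
        "output": "json",
        "fl": "timestamp,mimetype,statuscode",
        "limit": "25",
    }
    rows = requests.get("https://web.archive.org/cdx/search/cdx",
                        params=params, timeout=60).json()
    for timestamp, mimetype, statuscode in rows[1:]:   # rows[0] is the header
        print(f"https://web.archive.org/web/{timestamp}/whitehouse.gov",
              mimetype, statuscode)

Swapping in different url or fl parameters is one easy way to reproduce summary counts like those above for other sites or file types.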

Through the EOT project’s public nomination form and through our collaboration with DataRefuge, the Environmental Data and Governance Initiative (EDGI), and other efforts, over 100,000 webpages or government datasets were nominated by citizens and preservationists for archiving. The EOT and community efforts have also garnered notable press (see our End of Term 2016 Press collection). We are working with partners to provide access to the full dataset for use in data mining and computational analysis and hosted a hackathon earlier this year to support use of the Obama White House Social Media datasets.

While the specific End of Term collection has closed, we continue our large-scale, dedicated efforts to preserve the government web. Working with the University of North Texas, we launched the Government Web & Data Archive nomination form so the public can continue to nominate public government websites and data to be archived.

Lastly, archiving government data remains a critical activity of the preservation community. You can support these efforts by continuing to nominate websites, promoting the EOT project via press and outreach (contact the EOT project team for any inquiries), and by donating to the Internet Archive to support our ongoing mission to provide “Universal Access to All Knowledge.”

Join us for a White House Social Media and Gov Data Hackathon!

Join us at the Internet Archive this Saturday, January 7, for a government data hackathon! We are hosting an informal hackathon working with White House social media data, government web data, and data from election-related collections. We will provide more gov data than you can shake a script at! If you are interested in attending, please register using this form. The event will take place at our 300 Funston Avenue headquarters from 10am to 5pm.

We have been working with the White House on their admirable project to provide public access to eight years of White House social media data for research and creative reuse. Read more about their efforts in this blog post. Copies of this data will be publicly accessible at archive.org. We have also been furiously archiving the federal government web as part of our collaborative End of Term Web Archive and have collected a voluminous amount of media and web data as part of the 2016 election cycle. Data from these projects — and others — will be made publicly accessible for folks to analyze, study, and do fun, interesting things with.

At Saturday’s hackathon, we will give an overview of the datasets available, have short talks from affiliated projects and services, and point to tools and methods for analyzing the hackathon’s data. We plan for a loose, informal event. Some datasets that will be available for the event and publicly accessible online:

  • Obama Administration White House social media from 2009-current, including Twitter, Tumblr, Vine, Facebook, and (possibly) YouTube
  • Comprehensive web archive data of current White House websites: whitehouse.gov, petitions.whitehouse.gov, letsmove.gov and other .gov websites
  • The End of Term Web Archives, a large-scale collaborative effort to preserve the federal government web (.gov/.mil) at presidential transitions, including web data from 2008, 2012, and our current 2016 project
  • Special sub-collections of government data, such as every powerpoint in the Internet Archive’s web archive from the .mil web domain
  • Extensive archives of social media data related to the 2016 election, including data from candidates, pundits, and media
  • Full text transcripts of Trump candidate speeches
  • Python notebooks, cluster computing tools, and pointers to methods for playing with data at scale (see the sketch just below this list for one way to get started).
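
To give a flavor of working with the web archive data, here is a minimal sketch of reading a WARC file, assuming Python 3 with the warcio package installed; the filename is hypothetical and stands in for whichever files are distributed at the event:

    from warcio.archiveiterator import ArchiveIterator

    # Iterate the response records in a (hypothetical) WARC file and print
    # each archived URL with its Content-Type.
    with open("example-gov-crawl.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = record.rec_headers.get_header("WARC-Target-URI")
            ctype = (record.http_headers.get_header("Content-Type")
                     if record.http_headers else "unknown")
            print(url, ctype)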

Much of this data was collected in partnership with other libraries and with the support of external funders. We thank, foremost, the current White House Office of Digital Strategy staff for their advocacy for open access and for working with us and others to make their social media open to the public. We also thank our End of Term Web Archive partners and related community efforts helping preserve the .gov web, as well as the funders that have supported many of the collecting and engineering efforts that make all this data publicly accessible, including the Institute of Museum and Library Services, Altiscale, the Knight Foundation, the Democracy Fund, the Kahle-Austin Foundation, and others.

Preserving U.S. Government Websites and Data as the Obama Term Ends

Long before the 2016 Presidential election cycle, librarians understood an often-overlooked fact: vast amounts of government data and digital information are at risk of vanishing when a presidential term ends and administrations change. For example, 83% of .gov PDFs disappeared between 2008 and 2012.

That is why the Internet Archive, along with partners from the Library of Congress, University of North Texas, George Washington University, Stanford University, California Digital Library, and other public and private libraries, is hard at work on the End of Term Web Archive, a wide-ranging effort to preserve the entirety of the federal government web presence, especially the .gov and .mil domains, along with federal websites on other domains and official government social media accounts.

While it is not the only project the Internet Archive has underway to preserve government websites, FTP sites, and databases at this time, the End of Term Web Archive is a far-reaching one.

The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. The effort is likely to preserve hundreds of millions of individual government webpages and files and could end up totaling well over 100 terabytes of archived materials. Over its full history of web archiving, the Internet Archive has preserved over 3.5 billion URLs from the .gov domain, including over 45 million PDFs.

This end-of-term collection builds on similar initiatives in 2008 and 2012 by original partners Internet Archive, Library of Congress, University of North Texas, and California Digital Library to document the “gov web,” which has no mandated, domain-wide single custodian. For instance, here is the National Institute for Literacy (NIFL) website in 2008. The domain went offline in 2011. Similarly, the Sustainable Development Indicators (SDI) site was later taken down. Other websites, such as invasivespecies.gov, were later folded into larger agency domains. Every web page archived is accessible through the Wayback Machine, and past and current End of Term collections are full-text searchable through the main End of Term portal. We have also worked with additional partners to provide access to the full data for use in data-mining research and projects.
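
For the technically inclined, checking whether a vanished site has an archived copy can also be done programmatically via the Wayback Machine's public availability API. This is a minimal sketch assuming Python 3 with the requests package; the domain nifl.gov and the timestamp are illustrative stand-ins for the NIFL example above:

    import requests

    # Ask the Wayback Machine's availability API for the snapshot closest
    # to a given date; NIFL's domain went offline in 2011.
    resp = requests.get("https://archive.org/wayback/available",
                        params={"url": "nifl.gov", "timestamp": "20080601"},
                        timeout=60)
    snap = resp.json().get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        print("Archived copy:", snap["url"], "captured at", snap["timestamp"])
    else:
        print("No snapshot found.")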

The project has received considerable press attention this year, with related stories in The New York Times, Politico, The Washington Post, Library Journal, Motherboard, and others.

“No single government entity is responsible for archiving the entire federal government’s web presence,” explained Jefferson Bailey, the Internet Archive’s Director of Web Archiving.  “Web data is already highly ephemeral and websites without a mandated custodian are even more imperiled. These sites include significant amounts of publicly-funded federal research, data, projects, and reporting that may only exist or be published on the web. This is tremendously important historical information. It also creates an amazing opportunity for libraries and archives to join forces and resources and collaborate to archive and provide permanent access to this material.”

This year has also seen a significant increase in citizen- and librarian-driven “hackathons” and “nomination-a-thons,” where subject experts and concerned information professionals crowdsource lists of high-value or endangered websites for the End of Term archiving partners to crawl. Librarian groups in New York City are holding nomination events to make sure important sites are preserved. And universities such as the University of Toronto are holding events for “guerrilla archiving” focused specifically on preserving climate-related data.

We need your help too! You can use the End of Term Nomination Tool to nominate any .gov or other government website or social media account, and it will be archived by the project team. If you have other ideas, please comment here or send them to info@archive.org. And you can also help by donating to the Internet Archive to support our continued mission to provide “Universal Access to All Knowledge.”

Please: Help Build the 2016 U.S. Presidential Election Web Archive

Help us build a web archive documenting reactions to the 2016 Presidential Election. You can submit websites and other online materials, and provide relevant descriptive information, via this simple submission form. We will archive and provide ongoing access to these materials as part of the Internet Archive Global Events collection.

Since its beginning, the Internet Archive has worked with a global partner community of cultural heritage institutions, researchers and scholars, and citizens to build crowdsourced topical web archives that preserve primary sources documenting significant global events. Past collections include the Occupy Movement, the 2013 US Government Shutdown, the Jasmine Revolution in Tunisia, and the Charlie Hebdo attacks. These collections leverage the power of individual curators and motivated citizens to help expand our collective efforts to diversify and augment the historical record. Any webpages, sites, or other online resources about the 2016 Presidential Election are in scope. This web archive will build upon our affiliated efforts, such as the Political TV Ad Archive, and other collecting strategies, to provide permanent access to current political events.

As we noted in a recent blog post, the Internet Archive is “well positioned, with our mission of Universal Access to All Knowledge, to help inform the public in turbulent times, to demonstrate the power in sharing and openness.” You can help us in this mission by submitting websites that preserve the online record of this unique historical moment.

GifCities: The GeoCities Animated GIF Search Engine

 

[A sampling of GeoCities GIFs: an under-construction banner, the dancing baby, Homer, a divider line, a skeleton, a surfing CPU, and a guitar man]

Try the Internet Archive’s animated GIF search engine at GifCities.org! You can now get your early-web GIF fix and have a fun way to browse the web archive. Search for snowglobes or butterflies or balloons or (naturally) cats. Clicking on a GIF brings you to the original page in the Wayback Machine. (Then please consider donating to the Archive.)

One of the goals for our 20th anniversary event last week was to highlight the amusing and wacky corners of the web, as represented in our web archive, in order to provide a light-hearted, novel perspective on the history of this amazing publication platform that we have worked to preserve over the years.

The animated GIF is perhaps the iconic, indomitable filetype of the early web. Meme-vessel, page-spacer, action-graphic-maker — GIFs are a quintessential feature of the 1990s web aesthetic, but remain just as popular today as they were twenty years ago. GeoCities, the first major web hosting platform for individual users to create their own pages, and once the third most visited site on the web before being shut down in 2009, occupies a similarly notable place in the history of the web.

So we combined these two aspects of web history by extracting every animated GIF from GeoCities in our web archive and building a search engine on top of them. Behold, for your viewing pleasure, over 4,500,000 animated GIFs (1,600,000 unique), searchable by filename and URL path, with most GIFs linking to the archived GeoCities page where they were originally displayed.
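
For the curious, the sketch below illustrates the general shape of such a pipeline; it is an illustration rather than our production code. It assumes Python 3 with the warcio and Pillow packages and a hypothetical sample WARC file, keeps only GIFs that Pillow reports as animated, and indexes them by the tokens in their URLs:

    import io
    import re
    from collections import defaultdict

    from PIL import Image
    from warcio.archiveiterator import ArchiveIterator

    index = defaultdict(set)   # filename/path token -> archived GIF URLs

    # Walk a (hypothetical) GeoCities WARC file, keep only animated GIFs,
    # and index them by the words in their filenames and URL paths.
    with open("geocities-sample.warc.gz", "rb") as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != "response":
                continue
            url = (record.rec_headers.get_header("WARC-Target-URI") or "").lower()
            if not url.endswith(".gif"):
                continue
            payload = record.content_stream().read()
            try:
                animated = getattr(Image.open(io.BytesIO(payload)),
                                   "is_animated", False)
            except Exception:
                continue               # skip truncated or corrupt images
            if animated:
                for token in re.split(r"[^a-z0-9]+", url):
                    if token:
                        index[token].add(url)

    print(len(index), "searchable tokens indexed")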

Some random staff faves:

[A few staff-favorite GIFs: a dinosaur, skull mail, and more]

Soft-launched at our anniversary event on Wednesday, where we also projected GifCities on the side of our headquarters in San Francisco, the project has been featured in The Guardian, BoingBoing, the A.V. Club, CNET, and others. The GeoCities GIF collection was also made available for creative reuse by artists and researchers, and featured in work such as the GifCollider project currently showing at BAMPFA (see the videos online) and the Hall of GIFs data visualization at NCSU. Shout-outs also go to others working with the GeoCities web archive, including the Geocities Research Institute and historians. More details on the project can be found at the GifCities about page.

And yes, like every other upstanding web citizen, we GifCities’ed ourselves.

10 Years of Archiving the Web Together

As the Internet Archive turns 20, the Archive-It community is proud to celebrate an anniversary of its own: 10 years of working with thousands of librarians, archivists, and others to preserve the web and build rich, expansive collections of websites for discovery and use by future generations. Eighteen partners inaugurated the Archive-It service in 2006. Since then, that list has grown to include more than 450 organizations and individuals, each with unique goals and collecting scope. In that time, they have added more than 17 billion (yes, with a “b”) URLs to their collections.


Archive-It partners over the years. Clockwise from top-left: Margaret Maes (Legal Information Preservation Alliance) and Nicholas Taylor (Stanford University); James Jacobs (Stanford University) and Kent Norsworthy (University of Texas at Austin); K12 web archivists at PS 174 in Queens; Renate Giacomuzzi, Elisabeth Sporer (University of Innsbruck), and Kristine Hanna (Internet Archive)

And to give you just a hint of how the overall collection has grown: that’s about 5 billion new URLs in just the last year! They’ve captured some momentous historical events, local community history, and social and cultural activity across more than 7,000 collections to date, everything from 700+ human rights sites to the tea party movement; tobacco industry records to Mormon missionaries’ blogs. And of course who can forget all of the LOLcats? They’ve collaborated on capturing breaking news, opened doors to the next generation of curators in our K12 web archiving program, and explored their own collections in new forms with datasets leveraging our researcher services.


The Archive-It pilot website in 2005

Archive-It is Internet Archive’s web archiving service that helps institutions build, preserve, and provide access to collections of archived web content. It was developed in response to the needs of libraries, archives, historical societies, museums, and other organizations who sought to use the same powerful technology behind the Wayback Machine to curate their own web archives. The service was then the first of its kind, but has grown and expanded to meet the needs of an ever-widening scope of partners dedicated to archiving the web.


Adding a website to a collection in Archive-It 2.0, as released in July 2006.

Our pilot partners, who began testing a beta version of the service in late 2005, helped to develop and improve the essential tools that such a service would provide and used those tools to create collections, documenting local and global histories in a new way. Based on feedback from the pilot partners, the Archive-It web application launched publicly in 2006 with the most basic of curation tools: create a collection, capture content, and make it publicly available. The service and the community grew exponentially from there.


Archive-It 5.0 realtime crawl tracking.

The myriad partner-driven technical (to say nothing of aesthetic!) improvements of the last ten years are reflected in this year’s release of Archive-It 5.0, the first full redesign of the Archive-It web application since its launch. In the meantime, Archive-It continues to work with the community to preserve and provide access to amazing collections and to develop new tools for archiving the web, including new capture technologies, data transfer APIs, and more.
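
As a rough illustration of what a data transfer API enables, here is a hedged sketch of a client that lists downloadable WARC files for a collection. The endpoint, parameters, and response fields follow the WASAPI data-transfer specification being developed in this space and should be read as assumptions rather than final documentation; the collection ID and credentials are hypothetical:

    import requests

    # Assumed WASAPI-style endpoint; treat the URL and fields as assumptions.
    WASAPI_URL = "https://partner.archive-it.org/wasapi/v1/webdata"

    resp = requests.get(WASAPI_URL,
                        params={"collection": 1234},     # hypothetical collection ID
                        auth=("username", "password"),   # account credentials
                        timeout=120)
    for f in resp.json().get("files", []):
        # Each file entry is assumed to carry a name and download location(s).
        locations = f.get("locations") or ["?"]
        print(f.get("filename"), locations[0])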

With year 11 (and Archive-It 5.1) just around the corner, we look forward to helping our partner institutions use new tools, build new collections, and expand the broader community working to archive the web.

Hacking Web Archives

The awkward teenage years of the web archive are over. It is now 27 years since Tim Berners-Lee created the web and 20 years since we at the Internet Archive set out to systematically archive web content. As the web gains ever more “historicity” (i.e., it’s old and getting older — just like you!), it is increasingly recognized as a valuable historical record of interest to researchers and others working to study it at scale.

Thus, it has been exciting to see — and for us to support and participate in — a number of recent efforts in the scholarly and library/archives communities to hold hackathons and datathons focused on getting web archives into the hands of researchers and users. These events have served to help build a collaborative framework encouraging more use, more exploration, more tools and services, and more hacking (and similar levels of the sometimes-maligned-but-ever-valuable yacking) to support research use of web archives. Get the data to the people!

First, in May, in partnership with the Alexandria Project of L3S at the University of Hannover in Germany, we helped sponsor “Exploring the Past of the Web: Alexandria & Archive-It Hackathon” alongside the Web Science 2016 conference. Over 15 researchers came together to analyze almost two dozen subject-based web archives created by institutions using our Archive-It service. Universities, archives, museums, and others contributed web archive collections on topics ranging from the Occupy Movement to Human Rights to Contemporary Women Artists on the Web. Hackathon teams geolocated IP addresses, analyzed sentiments and entities in webpage text, and studied mime type distributions.
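
As one illustration of the last of those analyses, a mime type distribution can be tallied from CDX index data. The sketch below uses the Wayback Machine's public CDX API, assuming Python 3 with the requests package; the domain and sample size are illustrative choices, not what the hackathon teams used:

    from collections import Counter

    import requests

    # Tally the mime types of archived captures under an example .gov domain.
    params = {
        "url": "epa.gov/*",              # prefix match on the whole domain
        "output": "json",
        "fl": "mimetype",
        "filter": "statuscode:200",
        "limit": "5000",                 # a small sample, for sketch purposes
    }
    rows = requests.get("https://web.archive.org/cdx/search/cdx",
                        params=params, timeout=120).json()
    counts = Counter(mime for (mime,) in rows[1:])   # rows[0] is the header
    for mime, n in counts.most_common(10):
        print(f"{mime:30s} {n}")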

Similarly, in June, our friends at the Library of Congress hosted the second Archives Unleashed datathon, a follow-on to a previous event held at the University of Toronto in March 2016. The fantastic team organizing these two Archives Unleashed hackathons has created an excellent model for bringing together transdisciplinary researchers and librarians/archivists to foster work with web data. In both Archives Unleashed events, attendees formed self-selecting teams to work together on specific analytical approaches and with specific web archive collections and datasets provided by the Library of Congress, Internet Archive, University of Toronto, GWU’s Social Feed Manager, and others. The #hackarchives tweet stream gives some insight into the hacktivities, and the top projects were presented at the Save The Web symposium held at LC’s Kluge Center the day after the event.

Both events show a bright future for expanding new access models, scholarship, and collaborations around building and using web archives. Plus, nobody crashed the wi-fi at any of these events! Yay!

Special thanks go to Altiscale (and Start Smart Labs) and Compute Canada for providing cluster computing services to support these events. Thanks also go to the multiple agencies, including NSF and SSHRC, that provided funding, and to the many co-sponsoring and hosting institutions. Super special thanks go to the key organizers, Helge Holzmann and Avishek Anand at L3S and Matt Weber, Ian Milligan, and Jimmy Lin at Archives Unleashed, who made these events a rollicking success.

For those interested in participating in a web archives hackathon/datathon, more are in the works, so stay tuned to the usual social media channels. If you are interested in helping host an event, please let us know. Lastly, for those who can’t make an event but are interested in working with web archives data, check out our Archives Research Services Workshop.

Lastly, be sure to check out the many related blog posts, hackathon projects, and web archive analysis tools that came out of these events.

Here’s to more happy web archives hacking in the future!

IMLS National Digital Platform Grant Awarded to Advance Web Archiving

We are excited to announce that the Institute of Museum and Library Services (IMLS) has recently awarded a National Leadership Grant, in the National Digital Platform category, to a proposal by Internet Archive’s Archive-It, Stanford University Libraries (DLSS and LOCKSS), University of North Texas, and Rutgers University. The $353,221 grant will support the project “Systems Interoperability and Collaborative Development for Web Archiving,” a two-year research project to test economic and community models for collaborative technology development, prototype system integration through development of Export APIs, and build community participation in web archiving development and new research and access tools. In addition to the technical development included in the scope of work, the project will also host a National Symposium on Web Archiving Interoperability in early 2017.

The project supports the National Digital Platform funding priority of IMLS by increasing access to shared services and infrastructure while building capacity for broader community input in technology development. Project outcomes will promote system integration, facilitate increased distributed preservation of archived data, and help support new global and local access models possible through export APIs, with an eye towards modeling post-grant interoperable systems architectures. Archive-It’s status as widely-used, shared web archiving infrastructure ensures broad community impact and makes possible the involvement of institutions of all sizes in project work. The involvement of Stanford University Libraries builds on their work in the Hydra community and with digital preservation services. UNT contributes experience in digital library and web archiving technology development and Rutgers’ work on research uses of web archives ensures the involvement of downstream user communities. Overall, the project will lay the groundwork for future collaboration around interoperability that will enhance the integration of disparate systems, increase local preservation, and improve the discoverability and use of web archives.

The outcomes of the project will build on the past and current collaborations of project partners, as well as Archive-It’s internal API development and related collaborative development work. Project partners’ roles in affiliated groups like the IIPC, LDCX, NDSA, and the Web Science community ensure the involvement of the larger digital library and internet researcher communities. The two-year project will run from January 2016 through December 2017, and Jefferson Bailey, Director, Web Archiving Programs, Internet Archive, will serve as Project Director.

We thank IMLS for their generous support of this project and their ongoing support for libraries and archives working collaboratively toward building a sustained National Digital Platform. The complete list of IMLS-funded projects this award cycle is available online, and the full narratives of all projects funded as part of the National Digital Platform were published on the IMLS blog. Go IMLS!

Two Grants Announced Supporting Web Archiving

We are excited to announce Internet Archive’s participation in two new grant-funded collaborative projects to advance the field of web archiving! Our Archive-It service, which works with libraries, archives, museums, and others to provide the tools for institutions to create their own web archives, will partner with New York University and Old Dominion University on two separate areas of work. We thank both The Andrew W. Mellon Foundation and the Institute of Museum and Library Services (IMLS) for their recognition of the value of web archiving and their support for the continued development of tools and initiatives to expand the quality, accessibility, and extensibility of these collections. We also thank our awesome collaborative partners on these projects: New York University Libraries, NYU’s Moving Image Archiving and Preservation (MIAP) program, and Old Dominion University’s Web Science and Digital Libraries Research Group. We look forward to working with them as part of our broader initiative for “Building Libraries Together.”

For the project “Archiving the Websites of Contemporary Composers,” led by NYU Libraries and funded with a grant of $480,000 from The Andrew W. Mellon Foundation, we will work with the Libraries and MIAP. This project will archive web-based and born-digital audiovisual materials and will research and develop tools for their improved capture and discoverability. Contemporary musical works, as well as the rich secondary materials that accompany them, are increasingly migrating to the web. We outlined a number of current challenges to capturing and replaying online multimedia, such as dynamic and transient URL generation and adaptive bitrate streaming, as well as a need for continued research and development around the integration of web archives and non-web collections.

We have two specific pieces of work in the grant. First, we will build tools to improve the crawling and capture of web-based audiovisual materials, addressing the increasing complexity of streaming audiovisual materials, especially on third-party hosting and sharing platforms. This development work will build on our experience creating “Heritrix helper” tools like Umbra. Our second area of work will explore methods to integrate discovery of high-quality, non-web multimedia content held in external repositories into the Archive-It platform. Linking Archive-It collections with non-web institutional content has great potential to integrate web and non-web archives. This work will build on NYU’s creation of an API for their preservation repository, our increased use of API-based systems integration in Archive-It 5.0, and our continued work on improved content discovery for web collections. See NYU’s press release for more details.

The second recently announced grant project is being led by Old Dominion University’s Web Science and Digital Libraries Research Group, which received a $468,618 National Leadership Grant for Libraries from IMLS for the project “Combining Social Media Storytelling With Web Archives” (grant number LG-71-15-0077). Readers not familiar with ODU’s great history of research and development around web archives are encouraged to check out projects such as WARCreate/WAIL, their work on visualizations and Archive-It, and our recent favorite, the #whatdiditlooklike tool. In this project, ODU will build tools and processes that adapt user-focused, online storytelling methods, such as Storify, to 1) summarize existing collections and 2) bootstrap new or expand existing web archive collections. The project will provide new ways to create unique topical and thematic collections through URLs shared via social media and storytelling platforms.

We will be working with them to integrate these tools in Archive-It, conduct user testing and training, and explore other ways that storytelling and user-generated materials can help build narrative pathways into large, often diffuse, collections of web content. We are excited to work with ODU and continue our increased focus on new models of access for web archives, as many institutional web collections are now of a breadth, volume, and operational maturity to begin focusing on novel ways their web archives can be studied and better understood by users and researchers.

Thanks again to the Mellon Foundation and IMLS for supporting these cooperative efforts to advance web archiving. We are excited to work with our great partners and the broader community to keep preserving and expanding access to the rich historical and cultural record documented on the web.