Category Archives: Archive-It

Please: Help Build the 2016 U.S. Presidential Election Web Archive

Help us build a web archive documenting reactions to the 2016 Presidential Election. You can submit websites and other online materials, and provide relevant descriptive information, via this simple submission form. We will archive and provide ongoing access to these materials as part of the Internet Archive Global Events collection.

Since its beginning, the Internet Archive has worked with a global partner community of cultural heritage institutions, researchers and scholars, and citizens to build crowdsourced topical web archives that preserve primary sources documenting significant global events. Past collections include the Occupy Movement, the 2013 US Government Shutdown, the Jasmine Revolution in Tunisia, and the Charlie Hebdo attacks. These collections leverage the power of individual curators and motivated citizens to help expand our collective efforts to diversify and augment the historical record. Any webpages, sites, or other online resources about the 2016 Presidential Election are in scope. This web archive will build upon our affiliated efforts, such as the Political TV Ad Archive, and other collecting strategies, to provide permanent access to current political events.

As we noted in a recent blog post, the Internet Archive is “well positioned, with our mission of Universal Access to All Knowledge, to help inform the public in turbulent times, to demonstrate the power in sharing and openness.” You can help us in this mission by submitting websites that preserve the online record of this unique historical moment.

Hacking Web Archives

The awkward teenage years of the web archive are over. It is now 27 years since Tim Berners-Lee created the web and 20 years since we at Internet Archive set out to systematically archive web content. As the web gains ever more “historicity” (i.e., it’s old and getting older, just like you!), it is increasingly recognized as a valuable historical record of interest to researchers and others working to study it at scale.

Thus, it has been exciting to see, and for us to support and participate in, a number of recent efforts in the scholarly and library/archives communities to hold hackathons and datathons focused on getting web archives into the hands of researchers and users. The events have served to help build a collaborative framework to encourage more use, more exploration, more tools and services, and more hacking (and similar levels of the sometimes-maligned-but-ever-valuable yacking) to support research use of web archives. Get the data to the people!

First, in May, in partnership with the Alexandria Project of L3S at University of Hannover in Germany, we helped sponsor “Exploring the Past of the Web: Alexandria & Archive-It Hackathon” alongside the Web Science 2016 conference. Over 15 researchers came together to analyze almost two dozen subject-based web archives created by institutions using our Archive-It service. Universities, archives, museums, and others contributed web archive collections on topics ranging from the Occupy Movement to Human Rights to Contemporary Women Artists on the Web. Hackathon teams geo-located IP addresses, analyzed sentiments and entities in webpage text, and studied MIME type distributions.

Similarly, in June, our friends at Library of Congress hosted the second Archives Unleashed datathon, a follow-on to a previous event held at University of Toronto in March 2016. The fantastic team organizing these two Archives Unleashed hackathons has created an excellent model for bringing together transdisciplinary researchers and librarians/archivists to foster work with web data. In both Archives Unleashed events, attendees formed into self-selecting teams to work together on specific analytical approaches and with specific web archive collections and datasets provided by Library of Congress, Internet Archive, University of Toronto, GWU’s Social Feed Manager, and others. The #hackarchives tweet stream gives some insight into the hacktivities, and the top projects were presented at the Save The Web symposium held at LC’s Kluge Center the day after the event.

Both events show a bright future for expanding new access models, scholarship, and collaborations around building and using web archives. Plus, nobody crashed the wi-fi at any of these events! Yay!

Special thanks go to Altiscale (and Start Smart Labs) and ComputeCanada for providing cluster computing services to support these events. Thanks also go to the multiple funding agencies, including NSF and SSHRC, that provided funding, and to the many co-sponsoring and hosting institutions. Super special thanks go to key organizers, Helge Holzmann and Avishek Anand at L3S and Matt Weber, Ian Milligan, and Jimmy Lin at Archives Unleashed, who made these events a rollicking success.

For those interested in participating in a web archives hackathon/datathon, more are in the works, so stay tuned to the usual social media channels. If you are interested in helping host an event, please let us know. And for those who can’t make an event but are interested in working with web archives data, check out our Archives Research Services Workshop.

Lastly, some links to blog posts, projects, and tools from these events:

Some related blog posts:

Some hackathon projects:

Some web archive analysis tools:

Here’s to more happy web archives hacking in the future!

Two Grants Announced Supporting Web Archiving

We are excited to announce Internet Archive’s participation in two new grant-funded collaborative projects to advance the field of web archiving! Our Archive-It service, which works with libraries, archives, museums and others to provide the tools for institutions to create their own web archives, will partner with New York University and Old Dominion University on two separate areas of work. We thank both The Andrew W. Mellon Foundation and the Institute of Museum and Library Services (IMLS) for their recognition of the value of web archiving and their support for the continued development of tools and initiatives to expand the quality, accessibility, and extensibility of these collections. We also thank our awesome collaborative partners on these projects, New York University Libraries, NYU’s Moving Image Archiving and Preservation (MIAP) program, and Old Dominion University’s Web Science and Digital Libraries Research Group and look forward to working with them as part of our broader initiative for “Building Libraries Together.”

For the project “Archiving the Websites of Contemporary Composers,” led by NYU Libraries and funded with a grant of $480,000 from The Andrew W. Mellon Foundation, we will work with the Libraries and MIAP.  This project will archive web-based and born-digital audiovisual materials, and research and develop tools for their improved capture and discoverability. Contemporary musical works, as well as the rich secondary materials that accompany them, are increasingly migrating to the web. We outlined a number of current challenges to capturing and replaying online multimedia, such as dynamic and transient URL generation and adaptive bitrate streaming, as well as a need for continued research and development around the integration of web archives and non-web collections.

We have two specific pieces of work in the grant. First, we will build tools to improve the crawling and capture of web-based audiovisual materials, addressing the increasing complexity of streaming audiovisual materials, especially on third-party hosting and sharing platforms. This development work will build on our experience creating “Heritrix helper” tools like Umbra. Our second area of work will explore methods to integrate discovery of high-quality, non-web multimedia content held in external repositories into the Archive-It platform. Linking Archive-It collections with non-web institutional content has great potential to integrate web and non-web archives. This work will build on NYU’s creation of an API for their preservation repository, our increased use of API-based systems integration in Archive-It 5.0, and our continued work on improved content discovery for web collections. See NYU’s press release for more details.

The second recently announced grant project is being led by Old Dominion University’s Web Science and Digital Libraries Research Group, which received a $468,618 National Leadership Grant for Libraries from IMLS for the project, “Combining Social Media Storytelling With Web Archives” (grant number LG-71-15-0077). Readers not familiar with ODU’s great history of research and development around web archives are encouraged to check out projects such as WARCreate/WAIL, their work on visualizations and Archive-It, and our recent favorite, the #whatdiditlooklike tool. In this project ODU will be building tools and processes to assimilate user-focused, online storytelling methods, such as Storify, to 1) summarize existing collections and 2) bootstrap new or expand existing web archive collections. The project will provide new ways to create unique topical and thematic collections through URLs shared via social media and storytelling platforms.

We will be working with them to integrate these tools in Archive-It, conduct user testing and training, and explore other ways that storytelling and user-generated materials can help build narrative pathways into large, often diffuse, collections of web content. We are excited to work with ODU and continue our increased focus on new models of access for web archives, as many institutional web collections are now of a breadth, volume, and operational maturity to begin focusing on novel ways their web archives can be studied and better understood by users and researchers.

Thanks again to the Mellon Foundation and IMLS for supporting these cooperative efforts to advance web archiving. We are excited to work with our great partners and the broader community to keep preserving and expanding access to the rich historical and cultural record documented on the web.

University of California Libraries to partner with Archive-It

This week, the University of California’s California Digital Library (CDL) and the UC Libraries announced a partnership with Internet Archive’s Archive-It service.

In the coming year, CDL’s Web Archiving Service (WAS) collections and all core infrastructure activities (crawling, indexing, search, display, and storage) will be transferred to Archive-It. WAS partners have captured close to 80 terabytes of archived content, most of which will be added to the 450 terabytes Archive-It partners have collected.

We are excited to work with CDL as we transition over the UC (and other) libraries to the Archive-It service. These UC libraries have unique and compelling collections (some dating back to 2006) including their Grateful Dead Web Archive: http://webarchives.cdlib.orggdarchive/a/gratefuldead which of course fits in quite nicely with the Internet Archive’s large collection of downloadable and streamed Grateful Dead shows in our Live Music Archive.

By collaborating with CDL, Archive-It can continue to expand the core functionalities of web archiving and work with CDL and other colleagues to develop new tools to advance the use of web archives. Such collaboration is sorely needed at this juncture and we welcome the opportunity to expand the capabilities of web archiving. By working together as a community we can create useful and sustainable web archives and ensure growth in the field of web archiving.

Be sure to check out some of the CDL collections:

Archiving the LGBT Web: Eastern Europe and Eurasia - UC Berkeley: http://webarchives.cdlib.org/a/lgbtwebeasterneurope
Federal Regional Agencies in California Web Archive - UC Davis: http://webarchives.cdlib.org/a/uscalagencies
Salvadoran Presidential Election March 2009 – Web Archive - UC Irvine: http://webarchives.cdlib.org/a/salvador
2009 H1N1 Influenza A (Swine Flu) Outbreak - UC San Diego: http://webarchives.cdlib.org/a/h1n1
California Tobacco Control Web Archive - UCSF: http://webarchives.cdlib.org/a/caltobaccocontrol

Archive-It: Crawling the Web Together

A post by the Archive-It team

Today Phase 1 of the 5.0 release of the Archive-It web application was released for use by the 326 partners using the Archive-It service.

In 1996 when the Internet Archive was founded, we used automated crawlers to capture the web, snapping up millions of web pages and preserving them for history. Ironically, our digital record of humankind was being driven by computer algorithms.

As the years went by, it became clear that we needed people and communities to capture and save what is really and truly important. So in February 2006 we launched the Archive-It service, 1.0, which allowed traditional librarians and archivists to become web archivists by initiating focused, curated crawls of the live web using a simple web application with partner/tech support. Launching Archive-It meant we could help our colleagues create their own web collections for their own libraries and also foster a community around web archiving to work together to build a global digital public library at www.archive.org.

Now, as we expand to the next generation of Archive-It with our 5.0 release, we hope to provide even greater tools for collection development. Released this week, 5.0 phase 1 highlights a shiny new user interface and significantly enhanced post-crawl reports that include infographics with visual representations of the data.


Figure 1: Screenshot from the Reports section of the new Archive-It 5.0 user interface

Back in 2006 there was little understanding of web archiving and many organizations were questioning whether this was a valid activity that could or should be a part of their larger institutional collecting strategies. After all, the challenges were staggering: the quality of web content was all over the map; conflicting policies and organizational structures posed challenges; no one had yet established best practices for selecting the content, handling metadata, or integrating this new type of content into other holdings and existing catalogs at the institution. Also, back then we could not have predicted the extent to which material that once existed in physical form would now only appear on the web in digital form.

We launched the Archive-It service with a small band of believers and supporters, among them librarians and archivists from Indiana University, University of Texas at Austin, Library of Virginia, Montana State Library, and North Carolina State Archives and State Library. Partners were very patient with us and with Archive-It 1.0, which was bare bones. Collaborating and working with the library and archive community has always been a top priority for the Internet Archive, and a defining characteristic of the Archive-It service. There have been many times during the past 8+ years when we have not known the answer to a question and we say: “Let’s ask the community and see what they think!” And the community has always gotten back to us with supportive answers – both illustrative and specific.

Figure 2: Screenshot from the North Carolina State Government Web Site Archive of the North Carolina State Archives and State Library of North Carolina.

As time went on, the community of web archivists grew and we were able to produce some compelling answers to the question: why web archive? Here are just a few:

  • To create a thematic or topical web archive
  • To fulfill a mandate to preserve institutional memory and history
  • To archive state or local agency publications no longer being deposited in print form
  • To archive records to meet university or government retention policies
  • To preserve a historical record of an institution’s web and/or social media presence
  • To capture a website before it is redesigned or taken offline
  • To archive online art, exhibitions, and artists’ materials


Figure 3: Screenshot from the Latin American Government Documents Archive, LAGDA of the University of Texas at Austin.

Figure 4: Screenshot from the Catalogues Raisonnés collection of the New York Art Resources Consortium (NYARC).

To date in 2014, 326 Archive-It partners have created 2700 public collections on a diversity and range of topics, subjects, events and domains. These collections have become integral to these organizations’ collecting strategies and have helped to raise awareness and understanding about why web archiving is so important.

We like to say that the Archive-It service is both a partner and a vendor. We are a service provider and we strive to consistently deliver a high level of customer support, which we believe partners notice and appreciate. We also strive to be a partner to our community and work collaboratively on initiatives that we share, a few of which are: a) collaborative efforts around archiving spontaneous events (like the 2011 Japanese Earthquake collection), b) teaching web archiving in graduate-level MLIS programs and professional development workshops, and c) the K12 Web Archiving program (now in its 7th year), where we work with 3rd to 12th graders around the country and ask them what they would like to archive for future generations. As one of the student archivists put it, “500 years from now, kids will think we were really cool.”

Many of the features and functionality that we see in the Archive-It service today are a direct result of a partner making a suggestion or request. Through face-to-face brainstorming sessions, online surveys, webinars, and support tickets, partners have expressed their ideas as well as offered constructive criticism. And we have listened. We hope that as the service continues to grow and we launch Archive-It 5.0, many of our partners will see themselves in Archive-It. Their collections will continue to be valuable to researchers, historians, scholars and the general public for many years to come.

Here are some links to just a few of those collections on the Archive-It website:

Columbia University’s collection on Human Rights: https://archive-it.org/collections/1068

National Museum of Women in the Arts’s collection on Contemporary Women Artists on the Web: https://archive-it.org/collections/2973

University of Alberta’s Circumpolar Collection: https://archive-it.org/collections/2475

Brigham Young University’s Mormon Missionary Collection: https://archive-it.org/collections/3609

Stanford University’s collection on Freedom of Information (FOIA): https://archive-it.org/collections/924

As we continue down this road – excited for the future and what comes next – we know that it takes a community to archive the web and we look forward to working with our partners to build libraries together.

Job Posting: Web Application/Software Developer for Archive-It

The Internet Archive is looking for a smart, collaborative and resourceful engineer to lead and do the development of the next generation of the Archive-It service, a web based application used by libraries and archives around the world. The Internet Archive is a digital public library founded in 1996. Archive-It is a self-sustaining revenue generating subscription service first launched in 2006.

Primary responsibilities would be to extend the success of Archive-It, which librarians and archivists use to create collections of digital content, and then make them accessible to researchers, scholars and the general public. Widely considered to be the market leader since its inception, Archive-It’s partner base has archived over five billion web pages and over 260 terabytes of data. http://archive-it.org

Working for Archive-It program’s director, this position has technical responsibility to evolve this service while still being straightforward enough to be operated by 300+ partner organizations and their users with minimal technical skills. Our current system is primarily Java based and we are looking to help build the next-generation of Archive-It using the latest web technologies. The ideal candidate will possess a desire to work collaboratively with a small internal team and a large, vocal and active user community; demonstrating independence, creativity, initiative and technological savvy, in addition to being a great programmer/architect.

The ideal candidate will have:


  • 5+ years work experience in Java and Python web application development
  • Experience with Hadoop, specifically HBase and Pig
  • Experience developing web application database back-end (SQL or NoSQL).
  • Good understanding of latest web framework technologies, both JVM and non-JVM based, and trade-offs between them.
  • Strong familiarity with all aspects of web technology and protocols, including: HTTP, HTML, and JavaScript
  • Experience with a variety of web applications, machine clusters, distributed systems, and high-volume data services.
  • Flexibility and a sense of humor
  • BS Computer Science, or equivalent work experience

Bonus points for:

  • Experience with web crawlers and/or applications designed to display [archived] web content (especially server-side apps)
  • Open source practices experience
  • Experience and/or interest in user interface design and information architecture
  • Familiarity with Apache SOLR or similar facet-based search technologies
  • Experience with the building/architecture of social media sites
  • Experience building out a mobile platform

To apply:

Please send your resume and cover letter to kristine at archive dot org with the subject line “Web App Developer Archive-It”.

The Archive thanks all applicants for their interest, but advises that only those selected for an interview will be contacted. No phone calls please!

We are an equal opportunity employer.

Archive-It Team Encourages Your Contributions To The “Occupy Movement” Collection

Since September 17th, 2011 when protesters descended on Wall Street, set up tents, and refused to move until their voices were heard, an impassioned plea for economic and social equality has manifested itself in similar protests and demonstrations around the world. Inspired by “Occupy Wall Street (OWS)”, these global protests and demonstrations are collectively now being referred to as the “Occupy Movement”.

In an effort to document these historic, and politically and socially charged, events as they unfold, IA’s Archive-It team has recently created an “Occupy Movement” collection to begin capturing information about the movement found online. With blogs communicating movement ideals and demands, social media used to coordinate demonstrations, and news-related websites portraying the movement from a dizzying variety of angles, the presence and representation of the Occupy Movement online is both hugely valuable to our understanding of the movement as a whole and constantly in flux and at risk.

The value of the collection hinges on the diversity, depth, and breadth of our seeds and websites we crawl. We are asking and encouraging anyone with websites they feel are important to archive, sites that tell a story about the movement, to pass them along and we will add them to the Occupy Movement collection. These might include movement-wide or city-specific websites, sites with images, blogs, YouTube videos, even Twitter accounts of individuals or organizations involved with the movement. No ideas or additions are too small or too large; perhaps your ideas or suggestions will be a unique part of the movement not yet represented in our collection. IA Archive-It friends and partners are already sending in seeds, which we greatly appreciate.

The web content captured in this collection will be included in the General Archive collection at http://www.archive.org/details/occupywallstreet which has been actively collecting materials on the Occupy Movement for a few months.

Please send any seed suggestions, questions, or comments to Graham at graham@archive.org.

K12 Web Archiving Program

If you were a student which websites would you want to save for future generations? What would you want people to look at 50 or even 500 years from now?

These questions are central to the K12 Web Archiving Program, a partnership between the Internet Archive and the Library of Congress. Now in its third year, working with 5th to 12th graders in schools around the country, this innovative program has the students make the decisions about what website content will be saved, as each of them actively participates in a collaborative team environment, developing problem solving and critical thinking skills. An important piece of the program is for the students to attach descriptive metadata to the websites, providing information as to why a site should be saved. By enabling students to preserve websites, the program gives them an opportunity to not only document their culture and learn about the fragility of digital content, but their work also becomes a primary source of information for future researchers.

The Library of Congress recently shot an 8-minute video with 8th graders from Moran Middle School in Wallingford, CT. The video provides an inside glimpse into what the students think about the program.

The students’ digital collections can be found at: http://archive-it.org/k12/


Call for Applications for the K-12 Web Archiving Program

From the Archive-It team:

If you were a K12 student which websites would you want to save for future generations? What would you want people to look at 50 or even 500 years from now?

These questions are central to the K12 Web Archiving Program, a partnership between the Internet Archive and the Library of Congress. Now wrapping up its second year, with 12 schools in 11 states around the country, this innovative program provides a new perspective on saving history and culture, allowing students to actively participate and make decisions about what “at risk” website content will be saved. The decisions they make help them to develop an awareness of how the Web content they choose will become primary sources for future historians studying our lives.

The program uses Archive-It, a web archiving service from the Internet Archive, to capture born-digital content from the Web to create collection “time capsules.” Students decide the type of collections and the specific websites to be captured, attaching a brief description to every one so that people in the future will know why they chose this content. By allowing students to identify websites that will be preserved for the long term, the program gives teens and younger students a chance to identify and document their cultural history and the world that’s important to them. Unlike time capsules of tangible objects, which usually remain hidden for decades or centuries, the resulting Web collections are immediately visible and publicly accessible, with full text search for study and analysis.

Teachers who are interested in this program should visit the application website for more information and to fill out an application for the 2010/2011 school year. Applications are due by July 2.

To see collections that students have created in the first two years of the program, please visit the program website.

-Jeff Kaplan

Web-archive-on-demand service for libraries launched

The web team at Internet Archive launched the public website Archive-It, which allows users to create, manage and search their own web archives through a web interface. The service has been developed, in particular, for memory institutions and state archives. IA has been testing and developing the application through a pilot program that includes 13 other institutions, mainly libraries and archives, which are potential users of this service.

The collections developed through the pilot are all available for search and browse access through the public-facing site. Information on the program, the pilot partners, and how the application works is detailed on the site as well.