Tag Archives: web data research

IMLS National Leadership Grant Supports Expansion of the ARCH Computational Research Platform

Posted on September 19, 2023 by jefferson

In June, we announced the official launch of Archives Research Compute Hub (ARCH) our platform for supporting computational research with digital collections. The Archiving & Data Services group at IA has long provided computational research services via collaborations, dataset services, product features, and other partnerships and software development. In 2020, in partnership with our close collaborators at the Archives Unleashed project, and with funding from the Mellon Foundation, we pursued cooperative technical and community work to make text and data mining services available to any institution building, or researcher using, archival web collections. This led to the release of ARCH, with more than 35 libraries and 60 researchers and curators participating in beta testing and early product pilots. Additional work supported expanding the community of scholars doing computational research using contemporary web collections by providing technical and research support to multi-institutional research teams.

We are pleased to announce that ARCH recently received funding from the Institute of Museum and Library Services (IMLS), via their National Leadership Grants program, supporting ARCH expansion. The project, “Expanding ARCH: Equitable Access to Text and Data Mining Services,” entails two broad areas of work. First, the project will create user-informed workflows and conduct software development that enables a diverse set of partner libraries, archives, and museums to add digital collections of any format (e.g., image collections, text collections) to ARCH for users to study via computational analysis. Working with these partners will help ensure that ARCH can support the needs of organizations of any size that aim to make their digital collections available in new ways. Second, the project will work with librarians and scholars to expand the number and types of data analysis jobs and resulting datasets and data visualizations that can be created using ARCH, including allowing users to build custom research collections that are aggregated from the digital collections of multiple institutions. Expanding the ability for scholars to create aggregated collections and run new data analysis jobs, potentially including artificial intelligence tools, will enable ARCH to significantly increase the type, diversity, scope, and scale of research it supports.

Collaborators on the Expanding ARCH project include a set of institutional partners that will be closely involved in guiding functional requirements, testing designs, and using the newly-built features intended to augment researcher support. Primary institutional partners include University of Denver, University of North Carolina at Chapel Hill, Williams College Museum of Art, and Indianapolis Museum of Art, with additional institutional partners joining in the project’s second year.

Thousands of libraries, archives, museums, and memory organizations work with Internet Archive to build and make openly accessible digitized and born-digital collections. Making these collections available to as many users in as many ways as possible is critical to providing access to knowledge. We are thankful to IMLS for providing the financial support that allows us to expand the ARCH platform to empower new and emerging types of access and research.

Internet Archive Welcomes Digital Humanists and Cultural Heritage Professionals to “Humanities and the Web: Introduction to Web Archive Data Analysis”

Posted on December 13, 2022 by Lori Donovan

By The Community Programs Team

On November 14, 2022, the Internet Archive hosted Humanities and the Web: Introduction to Web Archive Data Analysis, a one-day introductory workshop for humanities scholars and cultural heritage professionals. The group included disciplinary scholars and information professionals with research interests ranging from Chinese feminist movements, to Indigenous language revitalization, to the effects of digital platforms on discourses of sexuality and more. The workshop was held at the Central Branch of the Los Angeles Public Library and coincided with the National Humanities Conference.

*Attendees and Facilitators at Humanities and the Web: Introduction to Web Archive Data Analysis, November 14, 2022, Los Angeles Public Library*

The goals of the workshop were to introduce web archives as primary sources and to provide a sampling of tools and methodologies that could support computational analysis of web archive collections. Internet Archive staff shared web archive research use cases and provided participants with hands-on experience building web archives and analyzing web archive collections as data.

*Senior Program Manager, Lori Donovan, guiding attendees in using Voyant to analyze text datasets extracted from an Archive-It collection using ARCH.*

The workshop’s central feature was an introduction to ARCH (Archives Research Compute Hub). ARCH transforms web archives into datasets tuned for computational research, allowing researchers to, for example, extract all text, spreadsheets, PDFs, images, audio, named entities and more from collections. During the workshop, participants worked directly with text, network, and image file datasets generated from web archive collections. With access to datasets derived from these collections, the group explored a range of analyses using Palladio, RAWGraphs, and Voyant.

*Visualization of the image files contained in the Chicago Architecture Biennial collection, created using Palladio based on an Image File dataset extracted from the collection using ARCH.*

The high level of interest and participation in this event is indicative of the appetite within the Humanities for workshops on computational research. Participants described how the workshop gave them concrete language to express the challenges of working with large-scale data, while also expressing how the event offered strategies they could apply to their own research or could use to support their research communities. For those who were not able to make it to Humanities and the Web, we will be hosting a series of virtual and in-person workshops in 2023. Keep your eye on this space for upcoming announcements.

Introducing the COVID-19 Web Archive

Posted on October 11, 2022 by Lori Donovan

We are pleased to announce that the COVID-19 Web Archive is now available! As the COVID-19 pandemic emerged in early 2020, librarians, archivists, and others with interest in preserving cultural heritage began documenting the personal, cultural, and societal impact of the global pandemic on their communities. These efforts included creating archival collections preserving physical, digital, and web-based records and information for use by students, scholars, and citizens. In response to this immediate need for archiving resources by both libraries and memory institutions, the Internet Archive’s Archive-It service launched a COVID-19 Web Archiving Special Campaign in April 2020 providing free and subsidized tools, training, and community support to institutions and local efforts to preserve web-published materials documenting the COVID-19 pandemic.

The COVID-19 Web Archive builds on this curatorial work to gather together more than 160 web archive collections created by more than 125 libraries, archives, and cultural heritage organizations into a shared access portal built and maintained by the Internet Archive. The COVID-19 Web Archive currently totals nearly 90 terabytes of archived data composed of over 1.5 billion webpages and allows for full text, metadata, and media search within individual collections and across the entire archive. The archive will be continuously updated over time. If you have a collection you’d like to include in the portal, please contact us at covidwebarchive@archive.org.

Collections document the pandemic from a number of different perspectives, including:

Athens Regional Library System’s Athens, Georgia Area COVID-19 Response collection, which highlights “the local response to the coronavirus (COVID-19) pandemic in Athens, Georgia. Included are communications from Athens-Clarke County government, communications from Clarke County School District, fundraisers for local businesses, ‘Band Together’ showcases, and various other items that are related to the local response.”
University of British Columbia’s COVID-19, Racism, and Asian Communities collection, which documents incidents of racism against the Asian communities in Canada, related to the COVID-19 pandemic.
New York University’s Tamiment Wagner: NYC COVID-19 Web Activism collection, which “documents activists’ use of social media and the internet to create content, online campaigns, online actions, virtual mutual aid networks and funds to highlight, resist, and call attention to ways in which COVID-19 has impacted New York City physically, emotionally, politically, and economically.”
Pennsylvania Horticultural Society’s COVID-19 Collection, “focus[ed] on the Pennsylvania Horticultural Society’s programmatic COVID-19 response via #GrowTogetherPHS, a campaign to engage our audiences in gardening at home.”

The browsing and searching capabilities available on the COVID-19 Web Archive website are augmented by the availability of public datasets, as well as a series of in-person and virtual data analysis workshops which will facilitate a myriad of potential avenues for research use of web archives. A number of research projects and use cases for COVID-19-related web archives have already emerged from the work of ARCH (Archives Research Compute Hub) cohort program members in 2021-2022.

If you are interested in learning more about the COVID-19 Web Archive and associated research opportunities, we are holding an informational webinar on Thursday, October 27 at 11am PT. A walkthrough of the COVID-19 Web Archive is available here.

The COVID-19 Web Archive was made possible with generous support from the Institute of Museum and Library Services (IMLS) as part of their American Rescue Plan grant program.

Early Web Datasets & Researcher Opportunities

Posted on March 12, 2021 by jefferson

In July, we announced our partnership with the Archives Unleashed project as part of our ongoing effort to make new services available for scholars and students to study the archived web. Joining the curatorial power of our Archive-It service, our work supporting text and data mining, and Archives Unleashed’s in-browser analysis tools will open up new opportunities for understanding the petabyte-scale volume of historical records in web archives.

As part of our partnership, we are releasing a series of publicly available datasets created from archived web collections. Alongside these efforts, the project is also launching a Cohort Program providing funding and technical support for research teams interested in studying web archive collections. These twin efforts aim to help build the infrastructure and services to allow more researchers to leverage web archives in their scholarly work. More details on the new public datasets and the cohorts program are below.

Early Web Datasets

Our first in a series of public datasets from the web collections are oriented around the theme of the early web. These are, of course, datasets intended for data mining and researchers using computational tools to study large amounts of data, so are absent the informational or nostalgia value of looking at archived webpages in the Wayback Machine. If the latter is more your interest, here is an archived Geocities page with unicorn GIFs.

GeoCities Collection (1994–2009)

As one of the first platforms for creating web pages without expertise, Geocities lowered the barrier of entry for a new generation of website creators. There were at least 38 million pages displayed by GeoCities before it was terminated by Yahoo! in 2009. This dataset collection contains a number of individual datasets that include data such as domain counts, image graph and web graph data, and binary file information for a variety of file formats like audio, video, and text and image files. A graphml file is also available for the domain graph.

GeoCities Dataset Collection: https://archive.org/details/geocitiesdatasets

Friendster (2003–2015)

Friendster was an early and widely used social media networking site where users were able to establish and maintain layers of shared connections with other users. This dataset collection contains graph files that allow data-driven research to explore how certain pages within Friendster linked to each other. It also contains a dataset that provides some basic metadata about the individual files within the archival collection.

Friendster Dataset Collection: https://archive.org/details/friendsterdatasets

Early Web Language Datasets (1996–1999)

These two related datasets were generated from the Internet Archive’s global web archive collection. The first dataset, “Parallel Language Records of the Early Web (1996–1999)” provides a dataset of multilingual records, or URLs of websites that have the same text represented in multiple languages. Such multi-language text from websites are a rich source for parallel language corpora and can be valuable in machine translation. The second dataset, “Language Annotations of the Early Web (1996–1999)” is another metadata set that annotates the language of over four million websites using Compact Language Detector (CLD3).

Early Web Language collection: https://archive.org/details/earlywebdatasets

Archives Unleashed Cohort Program

Applications are now being accepted from research teams interested in performing computational analysis of web archive data. Five cohorts teams of up to five members each will be selected to participate in the program from July 2021 to June 2022. Teams will:

Participate in cohort events, training, and support, with a closing event held at Internet Archive, in San Francisco, California, USA tentatively in May 2022. Prior events will be virtual or in-person, depending on COVID-19 restrictions
Receive bi-monthly mentorship via support meetings with the Archives Unleashed team
Work in the Archive-It Research Cloud to generate custom datasets
Receive funding of $11,500 CAD to support project work. Additional support will be provided for travel to the Internet Archive event

Applications are due March 31, 2021. Please visit the Archives Unleashed Research Cohorts webpage for more details on the program and instructions on how to apply.

Archive-It and Archives Unleashed Join Forces to Scale Research Use of Web Archives

Posted on July 28, 2020 by jefferson

Archived web data and collections are increasingly important to scholarly practice, especially to those scholars interested in data mining and computational approaches to analyzing large sets of data, text, and records from the web. For over a decade Internet Archive has worked to support computational use of its web collections through a variety of services, from making raw crawl data available to researchers, performing customized extraction and analytic services supporting network or language analysis, to hosting web data hackathons and having dataset download features in our popular suite of web archiving services in Archive-It. Since 2016, we have also collaborated with the Archives Unleashed project to support their efforts to build tools, platforms, and learning materials for social science and humanities scholars to study web collections, including those curated by the 700+ institutions using Archive-It.

We are excited to announce a significant expansion of our partnership. With a generous award of $800,000 (USD) to the University of Waterloo from The Andrew W. Mellon Foundation, Archives Unleashed and Archive-It will broaden our collaboration and further integrate our services to provide easy-to-use, scalable tools to scholars, researchers, librarians, and archivists studying and stewarding web archives. Further integration of Archives Unleashed and Archive-It’s Research Services (and IA’s Web & Data Services more broadly) will simplify the ability of scholars to analyze archived web data and give digital archivists and librarians expanded tools for making their collections available as data, as pre-packaged datasets, and as archives that can be analyzed computationally. It will also offer researchers a best-of-class, end-to-end service for collecting, preserving, and analyzing web-published materials.

The Archives Unleashed team brings together a team of co-investigators. Professor Ian Milligan, from the University of Waterloo’s Department of History, Jimmy Lin, Professor and Cheriton Chair at Waterloo’s Cheriton School of Computer Science, and Nick Ruest, Digital Assets Librarian in the Digital Scholarship Infrastructure department of York University Libraries, along with Jefferson Bailey, Director of Web Archiving & Data Services at the Internet Archive, will all serve as co-Principal Investigators on the “Integrating Archives Unleashed Cloud with Archive-It” project. This project represents a follow-on to the Archives Unleashed project that began in 2017, also funded by The Andrew W. Mellon Foundation.

“Our first stage of the Archives Unleashed Project,” explains Professor Milligan, “built a stand-alone service that turns web archive data into a format that scholars could easily use. We developed several tools, methods and cloud-based platforms that allow researchers to download a large web archive from which they can analyze all sorts of information, from text and network data to statistical information. The next logical step is to integrate our service with the Internet Archive, which will allow a scholar to run the full cycle of collecting and analyzing web archival content through one portal.”

“Researchers, from both the sciences and the humanities, are finally starting to realize the massive trove of archived web materials that can support a wide variety of computational research,” said Bailey. “We are excited to scale up our collaboration with Archives Unleashed to make the petabytes of web and data archives collected by Archive-It partners and other web archiving institutions around the world more useful for scholarly analysis.”

The project begins in July 2020 and will begin releasing public datasets as part of the integration later in the year. Upcoming and future work includes technical integration of Archives Unleashed and Archive-It, creation and release of new open-source tools, datasets, and code notebooks, and a series of in-person “datathons” supporting a cohort of scholars using archived web data and collections in their data-driven research and analysis. We are grateful to The Andrew W. Mellon Foundation for their support of this integration and collaboration in support of critical infrastructure supporting computational scholarship and its use of the archived web.

Primary contacts:
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
AU – Ian Milligan, Professor of History, University of Waterloo, i2milligan [at] uwaterloo.ca

Archiving Online Local News with the News Measures Research Project

Posted on November 13, 2019 by jefferson

Over the past two years Archive-It, Internet Archive’s web archiving service, has partnered with researchers at the Hubbard School of Journalism and Mass Communication at University of Minnesota and the Dewitt Wallace Center for Media and Democracy at Duke University in a project designed to evaluate the health of local media ecosystems as part of the News Measures Research Project, funded by the Democracy Fund. The project is led by Phil Napoli at Duke University and Matthew Weber at University of Minnesota. Project staff worked with Archive-It to crawl and archive the homepages of 663 local news websites representing 100 communities across the United States. Seven crawls were run on single days from July through September and captured over 2.2TB of unique data and 16 million URLs. Initial findings from the research detail how local communities cover core topics such as emergencies, politics and transportation. Additional findings look at the volume of local news produced by different media outlets, and show the importance of local newspapers in providing communities with relevant content.

The goal of the News Measures Research Project is to examine the health of local community news by analyzing the amount and type of local news coverage in a sample of community. In order to generate a random and unbiased sample of communities, the team used US Census data. Prior research suggested that average income in a community is correlated with the amount of local news coverage; thus the team decided to focus on three different income brackets (high, medium and low) using the Census data to break up the communities into categories. Rural areas and major cities were eliminated from the sample in order to reduce the number of outliers; this left a list of 1,559 communities ranging in population from 20,000 to 300,000 and in average household income from $21,000 to $215,000. Next, a random sample of 100 communities was selected, and a rigorous search process was applied to build a list of 663 news outlets that cover local news in those communities (based on Web searches and established directories such as Cision).

The News Measures Research Project web captures provide a unique snapshot of local news in the United States. The work is focused on analyzing the nature of local news coverage at a local level, while also examining the broader nature of local community news. At the local level, the 100 community sample provides a way to look at the nature of local news coverage. Next, a team of coders analyzed content on the archived web pages to assess what is being covered by a given news outlet. Often, the websites that serve a local community are simply aggregating content from other outlets, rather than providing unique content. The research team was most interested in understanding the degree to which local news outlets are actually reporting on topics that are pertinent to a given community (e.g. local politics). At the global level, the team looked at interaction between community news websites (e.g. sharing of content) as well as automated measures of the amount of coverage.

The primary data for the researchers was the archived local community news data, but in addition, the team worked with census data to aggregate other measures such as circulation data for newspapers. These data allowed the team to examine the amount and type of local news changes depending on the characteristics of the community. Because the team was using multiple datasets, the Web data is just one part of the puzzle. The WAT data format proved particularly useful for the team in this regard. Using the WAT file format allowed the team to avoid digging deeply into the data – rather, the WAT data allowed the team to examine high level structure without needing to examine the content of each and every WARC record. Down the road, the WARC data allows for a deeper dive, but the lighter metadata format of the WAT files has enabled early analysis.

Stay tuned for more updates as research utilizing this data continues! The websites selected will continue to be archived and much of the data are publicly available.

Internet Archive Partners with University of Edinburgh to Provide Historical Web Data Supporting Machine Translation

Posted on June 19, 2019 by jefferson

The Internet Archive will provide portions of its web archive to the University of Edinburgh to support the School of Informatics’ work building open data and tools for advancing machine translation, especially for low-resource languages. Machine translation is the process of automatically converting text in one language to another.

The ParaCrawl project is mining translated text from the web in 29 languages. With over 1 million translated sentences available for several languages, ParaCrawl is often the largest open collection of translations for each language. The project is a collaboration between the University of Edinburgh, University of Alicante, Prompsit, TAUS, and Omniscien with funding from the EU’s Connecting Europe Facility. Internet Archive data is vastly expanding the data mined by ParaCrawl and therefore the amount of translated sentences collected. Lead by Kenneth Heafield of the University of Edinburgh, the overall project will yield open corpora and open-source tools for machine translation as well as the processing pipeline.

Archived web data from IA’s general web collections will be used in the project. Because translations are particularly scarce for Icelandic, Croatian, Norwegian, and Irish, the IA will also use customized internal language classification tools to prioritize and extract data in these languages from archived websites in its collections.

The partnership expands on IA’s ongoing effort to provide computational research services to large-scale data mining projects focusing on open-source technical developments for furthering the public good and open access to information and data. Other recent collaborations include providing web data for assessing the state of local online news nationwide, analyzing historical corporate industry classifications, and mapping online social communities. As well, IA is expanding its work in making available custom extractions and datasets from its 20+ years of historical web data. For further information on IA’s web and data services, contact webservices at archive dot org.

Join us for a White House Social Media and Gov Data Hackathon!

Posted on January 2, 2017 by jefferson

Join us at the Internet Archive this Saturday January 7 for a government data hackathon! We are hosting an informal hackathon working with White House social media data, government web data, and data from election-related collections. We will provide more gov data than you can shake a script at! If you are interested in attending, please register using this form. The event will take place at our 300 Funston Avenue headquarters from 10am-5pm.

We have been working with the White House on their admirable project to provide public access to eight years of White House social media data for research and creative reuse. Read more on their efforts at this blog post. Copies of this data will be publicly accessible at archive.org. We have also been furiously archiving the federal government web as part of our collaborative End of Term Web Archive and have also collected a voluminous amount of media and web data as part of the 2016 election cycle. Data from these projects — and others — will be made publicly accessible for folks to analyze, study, and do fun, interesting things with.

At Saturday’s hackathon, we will give an overview of the datasets available, have short talks from affiliated projects and services, and point to tools and methods for analyzing the hackathon’s data. We plan for a loose, informal event. Some datasets that will be available for the event and publicly accessible online:

Obama Administration White House social media from 2009-current, including Twitter, Tumblr, Vine, Facebook, and (possibly) YouTube
Comprehensive web archive data of current White House websites: whitehouse.gov, petitions.whitehouse.gov, letsmove.gov and other .gov websites
The End of Term Web Archives, a large-scale collaborative effort to preserve the federal government web ( .gov/.mil) at presidential transitions, including web data from 2008, 2012, and our current 2016 project
Special sub-collections of government data, such as every powerpoint in the Internet Archive’s web archive from the .mil web domain
Extensive archives of of social media data related to the 2016 election including data from candidates, pundits, and media
Full text transcripts of Trump candidate speeches
Python notebooks, cluster computing tools, and pointers to methods for playing with data at scale.

Much of this data was collected in partnership with other libraries and with the support of external funders. We thank, foremost, the current White House Office of Digital Strategy staff for their advocacy for open access and working with us and others to make their social media open to the public. We also thank our End of Term Web Archive partners and related community efforts helping preserve the .gov web, as well as the funders that have supported many of the collecting and engineering efforts that makes all this data publicly accessible, including the Institute of Museum and Library Services, Altiscale, the Knight Foundation, the Democracy Fund, the Kahle-Austin Foundation, and others.

GifCities: The GeoCities Animated GIF Search Engine

Posted on November 1, 2016 by jefferson

Try the Internet Archive’s animated GIF search engine at GifCities.org! You can now get your early-web GIF fix and have a fun way to browse the web archive. Search for snowglobes or butterflies or balloons or (naturally) cats. If you click on a GIF, then it brings to you to the original page from the Wayback Machine. (Then please consider donating to the Archive)

One of the goals for our 20th anniversary event last week was to highlight the amusing and wacky corners of the web, as represented in our web archive, in order to provide a light-hearted, novel perspective on the history of this amazing publication platform that we have worked to preserve over the years.

The animated GIF is perhaps the iconic, indomitable filetype of the early web. Meme-vessel, page-spacer, action-graphic-maker — GIFS are a quintessential feature of the 1990’s web aesthetic, but remain just as popular today as they were twenty years ago. GeoCities, the first major web hosting platform for individual users to create their own pages, and once the third most visited site on the web before being shut down in 2009, occupies a similarly notable place in the history of the web.

So we combined these two aspects of web history by extracting every animated GIF from GeoCities in our web archive and built a search engine on top of them. Behold, for your viewing pleasure, over 4,500,000 animated GIFs (1,600,000 unique), searchable based on filename and URL path, with most GIFs linking to the archived GeoCities web page where it was originally displayed.

Some random staff faves:

Soft-launched at our anniversary event on Wednesday, where we also projected GifCities on the side of our headquarters in San Francisco, the project has been featured in The Guardian, BoingBoing, the A.V. Club, CNET, and others. The GeoCities GIF collection was also made available for creative reuse by artists and researchers, and featured in work such as the GifCollider project currently showing at BAMPFA (see the videos online) and the Hall of GIFs data visualization at NCSU. Shout-outs also go to others working with the GeoCities web archive, including the Geocities Research Institute and historians. More details on the project can be found at the GifCities about page.

And yes, like every other upstanding web citizen, we GifCities’ed ourselves:

Hacking Web Archives

Posted on August 31, 2016 by jefferson

The awkward teenage years of the web archive are over. It is now 27 years since Tim Berners-Lee created the web and 20 years since we at Internet Archive set out to systematically archive web content. As the web gains evermore “historicity” (i.e., it’s old and getting older — just like you!), it is increasingly recognized as a valuable historical record of interest to researchers and others working to study it at scale.

Thus, it has been exciting to see — and for us to support and participate in — a number of recent efforts in the scholarly and library/archives communities to hold hackathons and datathons focused on getting web archives into the hands of research and users. The events have served to help build a collaborative framework to encourage more use, more exploration, more tools and services, and more hacking (and similar levels of the sometime-maligned-but-ever-valuable yacking) to support research use of web archives. Get the data to the people!

First, in May, in partnership with the Alexandria Project of L3S at University of Hannover in Germany, we helped sponsor “Exploring the Past of the Web: Alexandria & Archive-It Hackathon” alongside the Web Science 2016 conference. Over 15 researchers came together to analyze almost two dozen subject-based web archives created by institutions using our Archive-It service. Universities, archives, museums, and others contributed web archive collections on topics ranging from the Occupy Movement to Human Rights to Contemporary Women Artists on the Web. Hackathon teams geo-located IP addresses, analyzed sentiments and entities in webpage text, and studied mime type distributions.

Similarly, in June, our friends at Library of Congress hosted the second Archives Unleashed datathon, a follow-on to a previous event held at University of Toronto in March 2016. The fantastic team organizing these two Archives Unleashed hackathons have created an excellent model for bringing together transdisciplinary researchers and librarians/archivists to foster work with web data. In both Archives Unleashed events, attendees formed into self-selecting teams to work together on specific analytical approaches and with specific web archive collections and datasets provided by Library of Congress, Internet Archive, University of Toronto, GWU’s Social Feed Manager, and others. The #hackarchives tweet stream gives some insight into the hacktivities, and the top projects were presented at the Save The Web symposium held at LC’s Kluge Center the day after the event.

Both events show a bright future for expanding new access models, scholarship, and collaborations around building and using web archives. Plus, nobody crashed the wi-fi at any of these events! Yay!

Special thanks go to Altiscale (and Start Smart Labs) and ComputeCanada for providing cluster computing services to support these events. Thanks also go to the multiple funding agencies, including NSF and SSHRC, that provided funding, and to the many co-sponsoring and hosting institutions. Super special thanks go to key organizers, Helge Holzman and Avishek Anand at L3S and Matt Weber, Ian Milligan, and Jimmy Lin at Archives Unleashed, who made these events a rollicking success.

For those interested in participating in a web archives hackathon/datathon, more are in the works, so stay tuned to the usual social media channels. If you are interested in helping host an event, please let us know. Lastly, for those that can’t make an event, but are interested in working with web archives data, check out our Archives Research Services Workshop.

Lastly, some links to blog posts, projects, and tools from these events:

Some related blog posts:

Some hackathon projects:

Some web archive analysis tools:

Here’s to more happy web archives hacking in the future!

Internet Archive Blogs

A blog from the team at archive.org