When MTVNews.com went offline in late June, Internet users were quick to discover that some (but sadly, not all) of the site had been archived in the Internet Archive’s Wayback Machine. While you can no longer browse MTV News directly on the web, the archived pages are available via the Wayback Machine, starting with the first crawl of the site on July 5, 1997.
Why preserve music news? Because, as Michael Alex, founding editor of MTV News Digital, wrote in an op-ed for Variety, “the archives of MTV News and countless other news and entertainment organizations have a similar value: They’re a living record of entertainment history as it happened.”
It’s important to remember that these collections were captured as a routine part of the daily work conducted by more than one thousand libraries and archives collaborating with the Internet Archive to archive the web. For centuries, libraries have been the trusted repositories of culture and knowledge. As our news and information sources become increasingly digital, the role of libraries like the Internet Archive and our partners has changed to meet these new demands. This is why libraries like ours exist, and why web archiving is critical for preserving our shared digital culture.
In my newsletter Zero Retries, I write about interesting developments in Ham Radio for folks like me whose primary interest is experimenting with its more advanced technological possibilities. Such developments include communicating with data modes locally and worldwide (Packet Radio), using Ham Radio satellites, communicating with Ham Radio astronauts on the International Space Station, and developing M17, a new two-way radio technology based entirely on open source (to mention just a few).
One of my favorite ways to use the DLARC (nearly 120,000 items now, and still growing) is to re-explore ideas that were proposed or attempted in Ham Radio but, for various reasons, didn’t quite become mainstream. Typically, the technology of earlier eras simply wasn’t up to some of those proposed ideas. But with the technology of the 2020s, such as cheap, powerful computers and software defined radio, many old ideas can be reexamined and can perhaps succeed in becoming mainstream now. The problem has been that much of the source material for such “reimagining” has been languishing in the file cabinets or bookcases of Ham Radio Operators like me, with nowhere to go. With the grant, the Internet Archive could hire a dedicated archivist and begin receiving, scanning, hosting, and aggregating electronic versions of old Ham Radio material.
One of my favorite examples of “maybe we should try this again?” is a one-page flyer for a radio unit designed for data – the NW Digital Radio UDRX-440. That radio was a leading-edge idea in 2013, but didn’t become a product. One reason for that fate was that it required a small but powerful computer that NW Digital Radio was forced to develop itself, which was expensive. More than a decade later, the computer that NW Digital Radio required, with a quad-core 1.8 GHz processor and 1 GB of RAM, is available off the shelf – for $35. Perhaps it’s time for an innovative Ham Radio manufacturer to try creating something like the UDRX-440 again. Being able to provide a link to illustrate such a concept, and prove that one manufacturer got as far as the design stage, can be inspirational.
Another “maybe we should try this again?” example is the PACSAT system, a data-communications protocol and hardware specification for Ham Radio satellites that combined multiple receivers with a single high-speed transmitter for more efficient data throughput. In the 1990s, PACSAT was proposed, and several satellites were actually built and put into orbit. But at the time, PACSAT required dedicated, expensive, specialized hardware suitable only for a satellite. In the 2020s, a PACSAT system could replace a Ham Radio repeater using a software defined receiver (which can now listen to multiple frequencies at once) and a few other off-the-shelf parts. The difference that DLARC makes is that all the original reference material for PACSAT can easily be found in DLARC. If some graduate student were to email me looking for a project, I could suggest that they create a “PACSAT 2025” – and point them to all of the PACSAT material in DLARC.
Many new Ham Radio Operators live in “restricted” living arrangements such as apartments, condominiums, or communities that don’t allow external antennas. Thus, to operate on the Ham Radio “High Frequency – HF” (shortwave) bands, some “creativity” is required – a stealthy antenna. One of my favorite collections within DLARC is 73 Magazine, which was published monthly for 43 years, with many, many antenna construction articles such as the “compressed W3EDP” HF antenna that would fit into an attic. Unlike current Ham Radio magazines, all 516 issues of 73 Magazine can be browsed and downloaded, and because the Internet Archive does optical character recognition (OCR), every word of every issue is keyword searchable. That is powerful and ample “food for the imagination” for Ham Radio Operators looking to the past for some interesting projects to tackle.
Those are just a few examples of the utility of DLARC from my perspective. Ham Radio has existed for more than a century, but prior to DLARC there was no comprehensive online archive of Ham Radio material – only scattered personal archives and the newsletters that some Ham Radio clubs and organizations had put online. DLARC is now the archive that Ham Radio has been missing. Most significantly, unlike the collections of some Ham Radio organizations, material in DLARC is free for public access (though some material is subject to Controlled Digital Lending). DLARC includes club newsletters (from all over the world), Ham Radio books and magazines (some from very early in the 20th century), audio recordings, video recordings, conference proceedings… literally a treasure trove of knowledge, ideas, and inspiration.
Thank you, Internet Archive and archivist Kay Savetz K6KJN, for all the hard work in creating and growing the Digital Library of Amateur Radio & Communications – we really appreciate it (and I use it nearly every day).
Steve Stroh, Amateur Radio Operator N8GNJ
Bellingham, Washington, USA
It’s that time again. The 2024 End of Term crawl has officially begun! The End of Term Web Archive (#EOTArchive) is an initiative to archive U.S. government websites in the .gov and .mil web domains — as well as those harder-to-find government websites hosted on .org, .edu, and other top-level domains (TLDs) — as one administrative term ends and a new term begins.
End of Term crawls have been completed for the term transitions in 2004, 2008, 2012, 2016, and 2020, and the results of these efforts are preserved in the End of Term Web Archive. In total, more than 500 terabytes of government websites and data have been archived. These archives can be searched full-text via the Internet Archive’s collections search and also downloaded as bulk data for machine-assisted analysis.
The purpose of the End of Term Web Archive is to preserve an openly accessible record of government websites for historical and research purposes. Capturing these websites is important because they provide a snapshot of government messaging before and after the transition of terms, and the archive preserves information that may no longer be available on the live web.
We are committed to preserving a record of U.S. government websites. But we need your help to complete the 2024 End of Term crawl.
How can you help?!
We have a list of top-level domains from the General Services Administration (GSA) and from previous End of Term crawls, but we need volunteers to help us out. We are currently accepting nominations for websites to be included in the 2024 End of Term Web Archive.
Submit a URL nomination by going to digital2.library.unt.edu/nomination/eth2024/. We encourage you to nominate any and all U.S. federal government websites that you want to make sure get captured. Nominating URLs deep within .gov/.mil websites helps to make our web crawls as thorough and complete as possible.
Started in 2017, our Community Webs program now includes over 175 public libraries and local cultural organizations working to build digital archives documenting the experiences of their communities, especially those patrons often underrepresented in traditional archives. Participating public libraries have created over 1,400 collections documenting local civic life, totaling nearly 100 terabytes and tens of millions of individual documents, images, audio/video files, blogs, websites, social media, and more. You can browse many of these collections at the Community Webs website. Participants have also collaborated on digitization efforts to bring minority newspapers online, held public programming and outreach events, and formed local partnerships to help preservation efforts at other mission-aligned organizations. The program has conducted numerous workshops and national symposia to help public librarians gain expertise in digital preservation, and cohort members have given dozens of presentations at professional conferences showcasing their work. In the past, Community Webs has received support from the Institute of Museum and Library Services, the Mellon Foundation, the Kahle/Austin Foundation, and the National Historical Publications and Records Commission.
We are excited to announce that Community Webs has received $750,000 in funding from The Mellon Foundation to continue expanding the program. The award will allow additional public libraries to join the program and will enable new and existing members to continue their web archiving collection building using our Archive-It service. The funding will also provide members access to the Internet Archive’s new Vault digital preservation service, enabling them to build and preserve collections of any type of digital material. Lastly, leveraging members’ prior success in local partnerships, Community Webs will now include an “Affiliates” program so member public libraries can nominate local nonprofit partners to receive access to archiving services and resources. Funding will also support the continuation of the program’s professional development training in digital preservation and community archiving, and its overall cohort and community building activities of workshops, events, and symposia.
We thank The Andrew W. Mellon Foundation for their generous support of Community Webs. We are excited to continue to expand the program and empower hundreds of public librarians to build archives that document the voices, lives, and events of their communities and to ensure this material is permanently available to patrons, students, scholars, and citizens.
This is a guest post from Teresa Soleau (Digital Preservation Manager), Anders Pollack (Software Engineer), and Neal Johnson (Senior IT Project Manager) from the J. Paul Getty Trust.
Project Background
Getty pursues its mission in Los Angeles and around the world through the work of its constituent programs—the Getty Conservation Institute, Getty Foundation, J. Paul Getty Museum, and Getty Research Institute—serving the interested general public and a wide range of professional communities to promote a vital civil society through an understanding of the visual arts.
In 2019, Getty began a website redesign project, changing the technology stack and updating the way we interact with our communities online. The legacy website contained more than 19,000 web pages and we knew many were no longer useful or relevant and should be retired, possibly after being archived. This led us to leverage the content we’d captured using the Internet Archive’s Archive-It service.
We’d been crawling our site since 2017, but had treated the results more as a record of institutional change over time than as an archival resource to be consulted after deletion of a page. We needed to direct traffic to our Wayback Machine captures, thus ensuring deleted pages remain accessible when a user requests a deprecated URL. We decided to dynamically display a link to the archived page from our site’s 404 “Page not found” error page.
The project to audit all existing pages required us to educate content owners across the institution about web archiving practices and purpose. We developed processes for completing human reviews of large amounts of captured content. This work is described in more detail in a 2021 Digital Preservation Coalition blog post that mentions the Web Archives Collecting Policy we developed.
In this blog post we’ll discuss the work required to use the Internet Archive’s data API to add the necessary link on our 404 pages pointing to the most recent Wayback Machine capture of a deleted page.
Technical Underpinnings
Implementation of our Wayback Machine integration was very straightforward from a technical point of view. The first example on the Wayback Machine APIs documentation page gave us the technical guidance needed for our use case: displaying a link to the most recent capture of any page deleted from our website. With no requirements for authentication or management of keys or platform-specific software development kit (SDK) dependencies, our development process was simplified. We chose to incorporate the Wayback API using Nuxt.js, the web framework used to build the new Getty.edu site.
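For readers who want to try the same lookup, here is a minimal sketch of that availability query. The endpoint and response shape come from the public Wayback Machine APIs documentation; the function name and TypeScript types are our own illustration.

```ts
// Availability API lookup: given a URL, return the Wayback Machine's
// closest capture URL, or null if no capture exists.
interface WaybackSnapshot {
  status: string;
  available: boolean;
  url: string;
  timestamp: string;
}

interface WaybackAvailableResponse {
  archived_snapshots: { closest?: WaybackSnapshot };
}

async function findClosestCapture(pageUrl: string): Promise<string | null> {
  const endpoint =
    "https://archive.org/wayback/available?url=" + encodeURIComponent(pageUrl);
  const res = await fetch(endpoint);
  if (!res.ok) return null;
  const data = (await res.json()) as WaybackAvailableResponse;
  // "closest" is absent when the Wayback Machine has no capture of the URL.
  const closest = data.archived_snapshots.closest;
  return closest?.available ? closest.url : null;
}
```

Calling `findClosestCapture("https://www.getty.edu/some-deleted-page")` resolves to a web.archive.org URL when a capture exists, and to `null` otherwise.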
Since the Wayback Machine API is highly performant for simple queries, with a typical response delay in milliseconds, we are able to query the API before rendering the page using a Nuxt route middleware module. API error handling and a request timeout were added to ensure that edge cases such as API failures or network timeouts do not block rendering of the 404 response page.
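A hedged sketch of the kind of route middleware described above, assuming Nuxt 3: Getty's production code surely differs, and the 500 ms timeout and the `archiveUrl` route-meta convention are illustrative assumptions.

```ts
// middleware/archive-link.ts (illustrative file name)
// Before rendering, look up the nearest Wayback Machine capture of the
// requested path; on any failure or timeout, fall back to a plain 404.
export default defineNuxtRouteMiddleware(async (to) => {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), 500); // cap the wait
  try {
    const res = await fetch(
      "https://archive.org/wayback/available?url=" +
        encodeURIComponent("https://www.getty.edu" + to.path),
      { signal: controller.signal },
    );
    const data = res.ok ? await res.json() : null;
    // The 404 page component is assumed to read this value.
    to.meta.archiveUrl = data?.archived_snapshots?.closest?.url ?? null;
  } catch {
    to.meta.archiveUrl = null; // API failure or timeout: render without the link
  } finally {
    clearTimeout(timer);
  }
});
```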
The only Internet Archive API feature missing for our initial list of requirements was access to snapshot page thumbnails in the JSON data payload received from the API. Access to these images would allow us to enhance our 404 page with a visual cue of archived page content.
Results and Next Steps
Our ability to include a link to an archived version of a deleted web page on our 404 response page helped ease the tough decisions content stakeholders were obliged to make about what content to archive and then delete from the website. We could guarantee availability of content in perpetuity without incurring the long-term cost of maintaining the information ourselves.
By default, the API brings back the most recent Wayback Machine capture, which was sometimes not created by us and has not necessarily passed through our archive quality assurance process. We intend to develop our application further so that we privilege the display of Getty’s own page captures, ensuring we deliver the highest-quality capture to users.
Google Analytics has been configured to report on traffic to our 404 pages and will track clicks on links pointing to Internet Archive pages, providing useful feedback on what portion of archived page traffic is referred from our 404 error page.
To work around the challenge of providing navigational affordances for legacy content, and to ensure the titles of old web pages remain accessible to search engines, we intend to provide an up-to-date index of all archived getty.edu pages.
As we continue to retire obsolete website pages and complete this monumental content archiving and retirement effort, we’re grateful for the Internet Archive API which supports our goal of making archived content accessible in perpetuity.
Kevin Hegg is Head of Digital Projects at James Madison University Libraries (JMU). Kevin has held many technology positions within JMU Libraries. His experience spans a wide variety of technology work, from managing computer labs and server hardware to developing a large open-source software initiative. We are thankful to Kevin for taking time to talk with us about his experience with ARCH (Archives Research Compute Hub), AI, and supporting research at JMU.
Thomas Padilla is Deputy Director, Archiving and Data Services.
Thomas: Thank you for agreeing to talk more about your experience with ARCH, AI, and supporting research. I find that folks are often curious about what set of interests and experiences prepares someone to work in these areas. Can you tell us a bit about yourself and how you began doing this kind of work?
Kevin: Over the span of 27 years, I have held several technology roles within James Madison University (JMU) Libraries. My experience ranges from managing computer labs and server hardware to developing a large open-source software initiative adopted by numerous academic universities across the world. Today I manage a small team that supports faculty and students as they design, implement, and evaluate digital projects that enhance, transform, and promote scholarship, teaching, and learning. I also co-manage Histories Along the Blue Ridge, which hosts over 50,000 digitized legal documents from courthouses along Virginia’s Blue Ridge mountains.
Thomas: I gather that your initial interest in using ARCH was to see what potential it afforded for working with James Madison University’s Mapping Black Digital and Public Humanities project. Can you introduce the project to our readers?
Kevin: The Mapping the Black Digital and Public Humanities project began at JMU in Fall 2022. The project draws inspiration from established resources such as the Colored Convention Project and the Reviews in Digital Humanities journal. It employs Airtable for data collection and Tableau for data visualization. The website features a map that not only geographically locates over 440 Black digital and public humanities projects across the United States but also offers detailed information about each initiative. The project is a collaborative endeavor involving JMU graduate students and faculty, in close alliance with JMU Libraries. Over the past year, this interdisciplinary team has dedicated hundreds of hours to data collection, data visualization, and website development.
The project has achieved significant milestones. In Fall 2022, Mollie Godfrey and Seán McCarthy, the project leaders, authored “Race, Space, and Celebrating Simms: Mapping Strategies for Black Feminist Biographical Recovery,” highlighting the value of such mapping projects. At the same time, graduate student Iliana Cosme-Brooks undertook a monumental data collection effort. During the winter months, Mollie and Seán spearheaded an effort to refine the categories and terms used in the project through comprehensive research and user testing. By Spring 2023, the project was integrated into the academic curriculum, where a class of graduate students actively contributed to its inaugural phase. Funding was obtained to maintain and update the database and map during the summer.
Looking ahead, the project team plans to present their work at academic conferences and aims to diversify the team’s expertise further. The overarching objective is to enhance the visibility and interconnectedness of Black digital and public humanities projects, while also welcoming external contributions for the initiative’s continual refinement and expansion.
Thomas: It sounds like the project adopts a holistic approach to experimenting with and integrating the functionality of a wide range of tools and methods (e.g., mapping, data visualization). How do you see tools like ARCH fitting into the project and research services more broadly? What tools and methods have you used in combination with ARCH?
Kevin: ARCH offers faculty and students an invaluable resource for digital scholarship by providing expansive, high-quality datasets. These datasets enable more sophisticated data analytics than typically encountered in undergraduate pedagogy, revealing patterns and trends that would otherwise remain obscured. Despite the increasing importance of digital humanities, a significant portion of faculty and students lack advanced coding skills. The advent of AI-assisted coding platforms like ChatGPT and GitHub Copilot has democratized access to programming languages such as Python and JavaScript, facilitating their integration into academic research.
For my work, I employed ChatGPT and Copilot to further process ARCH datasets derived from a curated sample of 20 websites focused on Black digital and public humanities. Using PyCharm (an IDE freely available for educational purposes) with the Copilot extension, I improved my coding efficiency tenfold.
Next, I leveraged ChatGPT’s Advanced Data Analysis plugin to deconstruct visualizations from Stanford’s Palladio platform, a tool commonly used for exploratory data visualization but lacking a means of sharing the visualizations. With the aid of ChatGPT, I developed JavaScript-based web applications that faithfully replicate Palladio’s graph and gallery visualizations. Specifically, I instructed ChatGPT to employ the D3 JavaScript library to ingest my modified ARCH datasets into client-side web applications. The final products, including HTML, JavaScript, and CSV files, were made publicly accessible via GitHub Pages (see my graph and gallery on GitHub Pages).
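To give a flavor of that approach, here is a minimal D3 sketch of the kind Kevin describes. It is not the project's actual code: the `source` and `target` column names in the edge-list CSV, the file name, and the page dimensions are all assumptions for illustration.

```ts
import * as d3 from "d3";

// Minimal client-side sketch: load an edge-list CSV (assumed columns:
// "source" and "target") and render a force-directed graph with D3 v7.
async function drawGraph(csvPath: string): Promise<void> {
  const rows = await d3.csv(csvPath);
  const links = rows.map((r) => ({ source: r.source!, target: r.target! }));
  // Derive the node list from the distinct endpoints of the edges.
  const ids = new Set(links.flatMap((l) => [l.source, l.target]));
  const nodes = Array.from(ids, (id) => ({ id }));

  const svg = d3
    .select("body")
    .append("svg")
    .attr("width", 800)
    .attr("height", 600);
  const link = svg
    .selectAll("line")
    .data(links)
    .join("line")
    .attr("stroke", "#999");
  const node = svg.selectAll("circle").data(nodes).join("circle").attr("r", 4);

  // The simulation mutates nodes/links in place, adding x/y coordinates.
  d3.forceSimulation(nodes as d3.SimulationNodeDatum[])
    .force("link", d3.forceLink(links).id((d: any) => d.id))
    .force("charge", d3.forceManyBody().strength(-50))
    .force("center", d3.forceCenter(400, 300))
    .on("tick", () => {
      link
        .attr("x1", (d: any) => d.source.x)
        .attr("y1", (d: any) => d.source.y)
        .attr("x2", (d: any) => d.target.x)
        .attr("y2", (d: any) => d.target.y);
      node.attr("cx", (d: any) => d.x).attr("cy", (d: any) => d.y);
    });
}

drawGraph("domain-graph.csv");
```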
In summary, the integration of Python and AI-assisted coding tools has not only enhanced my use of ARCH datasets but also enabled the creation of client-side web applications for data visualization.
Thomas: Beyond pairing ChatGPT with ARCH, what additional uses are you anticipating for AI-driven tools in your work?
Kevin: AI-driven tools have already radically transformed my daily work. I am using AI to reduce or even eliminate repetitive, mindless tasks that take tens or hundreds of hours. For example, as part of the Mapping project, ChatGPT+ helped me transform an Airtable with almost 500 rows and two dozen columns into a series of 500 blog posts on a WordPress site. ChatGPT+ understands the structure of a WordPress export file. After a couple of hours of iterating through my design requirements with ChatGPT, I was able to import 500 blog posts into a WordPress website. Without this intervention, this task would have required over a hundred hours of tedious copying and pasting. Additionally, we have been using AI-enabled platforms like Otter and Descript to transcribe oral interviews.
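As a rough illustration of that transformation (not Kevin's actual script, which was generated with ChatGPT), the sketch below turns a CSV export with hypothetical `Title` and `Description` columns into a bare-bones WordPress WXR import file. A real WXR file needs more fields (dates, authors, GUIDs) than shown here, and a real CSV export needs a proper parser.

```ts
import { readFileSync, writeFileSync } from "node:fs";

// Turn a simple CSV export into a minimal WXR (WordPress eXtended RSS)
// file, one <item> per row. Column names "Title" and "Description"
// are hypothetical; adjust to match the actual export.
function csvToWxr(csvPath: string, outPath: string): void {
  const [header, ...lines] = readFileSync(csvPath, "utf8").trim().split("\n");
  const cols = header.split(",");
  const items = lines.map((line) => {
    const values = line.split(","); // naive split: quoted fields need a real CSV parser
    const row = Object.fromEntries(cols.map((c, i) => [c, values[i] ?? ""]));
    return `  <item>
    <title>${row["Title"]}</title>
    <content:encoded><![CDATA[${row["Description"]}]]></content:encoded>
    <wp:post_type><![CDATA[post]]></wp:post_type>
    <wp:status><![CDATA[publish]]></wp:status>
  </item>`;
  });
  const wxr = `<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/"
     xmlns:wp="http://wordpress.org/export/1.2/">
 <channel>
  <wp:wxr_version>1.2</wp:wxr_version>
${items.join("\n")}
 </channel>
</rss>`;
  writeFileSync(outPath, wxr);
}

csvToWxr("mapping-projects.csv", "mapping-posts.wxr");
```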
I foresee AI-driven tools playing an increasingly pivotal role in many facets of my work. For instance, natural language processing could automate the categorization and summarization of large text-based datasets, making archival research more efficient and our analyses richer. AI can also be used to identify entities in large archival datasets. Archives hold a treasure trove of artifacts waiting to be described and discovered. AI offers tools that will supercharge our construction of finding aids and item-level metadata.
Lastly, AI could facilitate more dynamic and interactive data visualizations, like the ones I published on GitHub Pages. These will offer users a more engaging experience when interacting with our research findings. Overall, the potential of AI is vast, and I’m excited to integrate more AI-driven tools into JMU’s classrooms and research ecosystem.
Thomas: Thanks for taking the time Kevin. To close out, whose work would you like people to know more about?
Kevin: Engaging in Digital Humanities (DH) within the academic library setting is a distinct privilege, one that requires a collaborative ethos. I am fortunate to be a member of an exceptional team at JMU Libraries, a collective too expansive to fully acknowledge here. AI has introduced transformative tools that border on magic. However, loosely paraphrasing Immanuel Kant, it’s crucial to remember that technology devoid of content is empty. I will use this opportunity to spotlight the contributions of three JMU faculty whose work celebrates our local community and furthers social justice.
Mollie Godfrey (Department of English) and Seán McCarthy (Writing, Rhetoric, and Technical Communication) are the visionaries behind two inspiring initiatives: the Mapping Project and the Celebrating Simms Project. The latter serves as a digital, post-custodial archive honoring Lucy F. Simms, an educator born into enslavement in 1856 who impacted three generations of young students in our local community. Both Godfrey and McCarthy have cultivated deep, lasting connections within Harrisonburg’s Black community. Their work strikes a balance between celebration and reparation. Collaborating with them has been as rewarding as it is challenging.
Gianluca De Fazio (Justice Studies) spearheads the Racial Terror: Lynching in Virginia project, illuminating a grim chapter of Virginia’s past. His relentless dedication led to the installation of a historical marker commemorating the tragic lynching of Charlotte Harris. De Fazio, along with colleagues, has also developed nine lesson plans based on this research, which are now integrated into high school curricula. My collaboration with him was a catalyst for pursuing a master’s degree in American History.
Both the Celebrating Simms and Racial Terror projects are highlighted in the Mapping the Black Digital and Public Humanities initiative. The privilege of contributing to such impactful projects alongside such dedicated individuals has rendered my extensive tenure at JMU both meaningful and, I hope, enduring.
As a doctoral student in anthropology at Yale University, Spencer Kaplan often relies on the Internet Archive for his research. He is an anthropologist of technology who studies virtual communities. Kaplan said he uses the Wayback Machine to create a living archive of data that he can analyze.
Last summer, Kaplan studied the blockchain community, which is active on Twitter and constantly changing. As people were sharing their views of the market and helping one another, he needed a way to save the data before their accounts disappeared. A failed project might have prompted the users to take down the information, but Kaplan used the Wayback Machine to preserve the social media exchanges.
In his research, Kaplan said he discovered an environment of mistrust online in the blockchain community and an abundance of scams. He followed how people were navigating the scams, warning one another online to be careful, and actually building trust in some cases. While blockchain aims to build technologies that remove the need for trust in social interaction, Kaplan said it was interesting to observe blockchain enthusiasts engaging in trusting connections. He collects the texts of tweets to build a corpus that he can then code and analyze to track or show trends.
The Wayback Machine can be helpful, Kaplan said, in finding preserved discussions on Twitter, early versions of company websites or pages that have been taken down altogether—a start-up company that went out of business, for example. “It’s important to be able to hold on to that [information] because our research takes place at a very specific moment in time and we want to be able to capture that specific moment,” Kaplan said.
The Internet Archive’s Open Library has also been essential in Kaplan’s work. When he was recently researching the invention of the “corporate culture” concept, he had trouble finding the first editions of many business books written in the late 1980s and early 1990s. His campus library often bought updated volumes, but Kaplan needed the originals. “I needed the first edition because I needed to know exactly what they said first and I was able to find that on the Internet Archive,” Kaplan said.
Part of a series: The Internet Archive as Research Library
Written by Caralee Adams
When gathering evidence for a court case or researching human rights violations, Lili Siri Spira often found that the material she needed was preserved by the Internet Archive.
In Spira’s work, the Wayback Machine has played an integral role in providing time-stamped artifacts and metadata.
For example, when researching the 2019 Bolivian coup, she wanted to learn more about the sentiment of indigenous people toward political leadership. Spira used the Wayback Machine to examine how indigenous Bolivian websites had changed since 2009. She discovered that, after initial criticism, some websites seemed to have disappeared.
“The great thing about the Internet Archive is that it really protects the chain of custody,” Spira said. “It’s not only that you look back, but you can even find a website now and capture it in time with the metadata.”
In 2020, the Berkeley Protocol on Digital Open Source Investigations provided global guidelines for using public digital information as evidence in international criminal and human rights investigations. Spira said this allows preserved website data to be used in court proceedings to hold parties accountable.
On other occasions, Spira has investigated companies suspected of unethical practices. Sometimes executives openly admitted to certain behaviors, only to deny those actions later. Companies may attempt to erase past communications, but Spira said she can uncover previous versions of websites through the Wayback Machine.
“Our knowledge is not being held sacred by many people in this country and around the world,” Spira said. “It’s incredibly important for research work in any field to have access to preserved [digital] information—especially when that research is making certain allegations against powerful entities and corporations.”
We thank Lili and her colleagues for sharing the story of how they use the Internet Archive’s collections in their work.
This spring, the Internet Archive hosted two in-person workshops aimed at helping to advance library support for web archive research: Digital Scholarship & the Web and Art Resources on the Web. These one-day events were held at the Association of College & Research Libraries (ACRL) conference in Pittsburgh and the Art Libraries Society of North America (ARLIS) conference in Mexico City. The workshops brought together librarians, archivists, program officers, graduate students, and disciplinary researchers for full days of learning, discussion, and hands-on experience with web archive creation and computational analysis. The workshops were developed in collaboration with the New York Art Resources Consortium (NYARC) – and are part of an ongoing series of workshops hosted by the Internet Archive through Summer 2023.
Internet Archive Deputy Director of Archiving & Data Services Thomas Padilla discussing the potential of web archives as primary sources for computational research at Art Resources on the Web in Mexico City.
Designed in direct response to library community interest in supporting additional uses of web archive collections, the workshops had the following objectives: introduce participants to web archives as primary sources in the context of computational research questions; develop familiarity with research use cases that make use of web archives; and provide an opportunity to acquire hands-on experience creating web archive collections and computationally analyzing them using ARCH (Archives Research Compute Hub) – a new service set to publicly launch in June 2023.
Internet Archive Community Programs Manager Lori Donovan walking workshop participants through a demonstration of Palladio using a dataset generated with ARCH at Digital Scholarship & the Web in Pittsburgh, PA.
In support of those objectives, Internet Archive staff walked participants through web archiving workflows, introduced a diverse set of web archiving tools and technologies, and offered hands-on experience building web archives. Participants were then introduced to ARCH, which supports computational research with web archive collections at scale – e.g., text and data mining, data science, digital scholarship, machine learning, and more. ARCH does this by streamlining the generation of, and access to, more than a dozen research-ready web archive datasets, along with in-browser visualization, dataset analysis, and open dataset publication. Participants further explored data generated with ARCH in Palladio, Voyant, and RAWGraphs.
Network visualization of the Occupy Web Archive collection, created using Palladio based on a Domain Graph Dataset generated by ARCH.
Gallery visualization of the CARTA Art Galleries collection, created using Palladio based on an Image Graph Dataset generated by ARCH.
At the close of the workshops, participants were eager to discuss web archive research ethics, research use cases, and a diverse set of approaches to scaling library support for researchers interested in working with web archive collections – truly vibrant discussions – and perhaps the beginnings of a community of interest! We plan to host future workshops focused on computational research with web archives – please keep an eye on our Event Calendar.
Tell us about your research & how you use the Internet Archive to further it! We are gathering testimonials about how our library & collections are used in different research projects & settings.
From using our books to check citations to doing large-scale data analysis using our web archives, we want to hear from you!