
Internet Archive and Center for Open Science Collaborate to Preserve Open Science Data

Open Science and research reproducibility rely on ongoing access to research data. With funding from the Institute of Museum and Library Services’ National Leadership Grants for Libraries program, the Internet Archive (IA) and Center for Open Science (COS) will work together to ensure that open data related to the scientific research process is archived for perpetual access, redistribution, and reuse. The project aims to leverage the intersection between open research data, the long-term stewardship activities of libraries, and distributed data sharing and preservation networks. By focusing on these three areas of work, the project will test and implement infrastructure for improved data sharing in further support of open science and data curation. Building out interoperability between open data platforms like the Open Science Framework (OSF) of COS, large-scale digital archives like IA, and collaborative preservation networks has the potential to enable more seamless distribution of open research data and new forms of custody and use. See also the press release from COS announcing this project.

OSF supports the research lifecycle by enabling researchers to produce and manage registrations and data artifacts for further curation to foster adoption and discovery. The Internet Archive works with 700+ institutions to collect, archive, and provide access to born-digital and web-published resources and data. Preserving OSF’s open data at IA will make this data available to other preservation networks and curatorial partners, supporting distributed long-term stewardship and local custody by research institutions that use both COS and IA services. The project will also partner with a number of preservation networks and repositories to mirror portions of this data and to test interoperability among other stewardship organizations and digital preservation systems.

Beyond the prototyping and technical work of data archiving, the teams will also be conducting training, including the development of open education resources, webinars, and similar materials to ensure data librarians can incorporate the project deliverables into their local research data management workflows. The two-year project will first focus on OSF Registrations data and expand to include other open access materials hosted on OSF. Later stage work will test interoperable approaches to sharing subsets of this data with other preservation networks such as LOCKSS, AP Trust, and individual university libraries. Together, IA and COS aim to lay the groundwork for seamless technical integration supporting the full lifecycle of data publishing, distribution, preservation, and perpetual access.

Project contacts:
IA – Jefferson Bailey, Director of Web Archiving & Data Services, jefferson [at] archive.org
COS – Nici Pfeiffer, Director of Product, nici [at] cos.io

Internet Archive, Code for Science and Society, and California Digital Library to Partner on a Data Sharing and Preservation Pilot Project

Research and cultural heritage institutions are facing increasing costs to provide long-term public access to historically valuable collections of scientific data, born-digital records, and other digital artifacts. With many institutions moving data to cloud services, data sharing and access costs have become more complex. As leading institutions in decentralization and data preservation, the Internet Archive (IA), Code for Science & Society (CSS) and California Digital Library (CDL) will work together on a proof-of-concept pilot project to demonstrate how decentralized technology could bolster existing institutional infrastructure and provide new tools for efficient data management and preservation. Using the Dat Protocol (developed by CSS), this project aims to test the feasibility of a decentralized network as a new option for organizations to archive and monitor their digital assets.

Dat is already being used by diverse communities, including researchers, developers, and data managers. California Digital Library is building innovative tools for data publication and digital preservation. The Internet Archive is leading efforts to advance the decentralized web community. This joint project will explore the issues that emerge from collecting institutions adopting decentralized technology for storage and preservation activities. The pilot will feature a defined corpus of open data from CDL’s data sharing service. The project aims to demonstrate how members of a cooperative, decentralized network can leverage shared services to ensure data preservation while reducing storage costs and increasing replication counts. By working with the Dat Protocol, the pilot will maximize openness, interoperability, and community input. Linking institutions via cooperative, distributed data sharing networks has the potential to achieve efficiencies of scale not possible through centralized or commercial services. The partners intend to openly share the outcomes of this proof-of-concept work to inform further community efforts to build on this potential.

Want to learn more? Representatives of this project will be at FORCE 2018, Joint Conference on Digital Libraries, Open Repositories, DLF Forum, and the Decentralized Web Summit.

More about CSS: Code for Science & Society is a nonprofit organization committed to building public interest technology and low-cost decentralized tools with the Dat Project to help people share and preserve versioned digital information. Read more about CSS’ Dat in the Lab project, our recent Community Call, and other activities. (Contact: Danielle Robinson)

More about CDL UC3: The University of California Curation Center (UC3) at the California Digital Library (CDL) provides innovative data curation and digital preservation services to the 10-campus University of California system and the wider scholarly and cultural heritage communities. https://uc3.cdlib.org/. (Contact: John Chodacki)

More about IA: The Internet Archive is a non-profit digital library with the mission to provide “universal access to all knowledge.” It works with hundreds of national and international partners providing web, data, and preservation services and maintains an online library comprising millions of freely-accessible books, films, audio, television broadcasts, software, and hundreds of billions of archived websites. https://archive.org/. (Contact: Jefferson Bailey)

Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation

The Andrew W. Mellon Foundation has awarded a research and development grant to the Internet Archive to address the critical need to preserve the “long tail” of open access scholarly communications. The project, Ensuring the Persistent Access of Long Tail Open Access Journal Literature, builds on prototype work identifying at-risk content held in web archives by using data provided by identifier services and registries. The project also expands on work acquiring missing open access articles via customized web harvesting, improving discovery and access to these materials from within extant web archives, and developing machine learning approaches, training sets, and cost models for advancing and scaling this work.

The project will explore how additional automation, layered on the already highly automated systems for archiving the web at scale, can help address the need to preserve at-risk open access scholarly outputs. Rather than building specialized curation and ingest systems, the project will work to identify the scholarly content already collected in general web collections, both those of the Internet Archive and of collaborating partners, and implement automated systems to ensure that at-risk scholarly outputs on the web are well collected and associated with the appropriate metadata. The proposal envisages two opposite but complementary approaches:

  • A top-down approach involves taking journal metadata and open data sets from identifier and registry sources such as ISSN, DOAJ, Unpaywall, CrossRef, and others and examining the content of large-scale web archives to ask “is this journal being collected and preserved and, if not, how can collection be improved?” A rough sketch of such a check appears after this list.
  • A bottom-up approach involves examining the content of general domain-scale and global-scale web archives to ask “is this content a journal and, if so, can it be associated with external identifier and metadata sources for enhanced discovery and access?”
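
As a rough illustration of the top-down direction only (this is not the project's tooling, and the journal list below is a placeholder), one can ask the Internet Archive's public CDX index whether a journal's homepage has ever been captured:

```python
# Hedged sketch: look up journal homepage URLs in the Internet Archive's
# public CDX index to see whether any captures exist. The journal entries
# below are hypothetical placeholders; the real project draws its lists
# from sources such as ISSN, DOAJ, Unpaywall, and CrossRef.
import requests

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

journals = [
    {"issn": "0000-0000", "homepage": "http://example-journal.org/"},  # placeholder
]

for journal in journals:
    params = {
        "url": journal["homepage"],
        "output": "json",
        "limit": 5,                   # a handful of captures is enough to say "collected"
        "filter": "statuscode:200",   # only successful captures
    }
    resp = requests.get(CDX_ENDPOINT, params=params, timeout=30)
    rows = resp.json() if resp.text.strip() else []
    captures = rows[1:]               # the first row, when present, is a header
    status = "collected" if captures else "possibly at risk"
    print(journal["issn"], status, f"({len(captures)} sample captures)")
```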

The grant will fund work to use the output of these approaches to generate training sets and test them against smaller web collections in order to estimate how effective this approach would be at identifying long-tail content, how expensive a full-scale effort would be, and what level of computing infrastructure is needed to perform such work. The project will also build a model to help other web archiving institutions understand the costs of running similar analyses on their own collections using the project’s algorithms and tools. Lastly, the project team, based in the Web Archiving and Data Services group with Director Jefferson Bailey as Principal Investigator, will undertake a planning process to determine the resource requirements and work necessary to build a sustainable workflow that keeps the results up to date incrementally as publication continues.

In combination, these approaches will both improve the current state of preservation for long-tail journal materials and develop models for how this work can be automated and applied to existing corpora at scale. We thank the Mellon Foundation for supporting this work and look forward to sharing the project’s open-source tools and outcomes with a broad community of partners.

The tech powering the Political TV Ad Archive

Ever wonder how we built the Political TV Ad Archive? This post explains what happens backstage: how we use technology to generate counts of how many times a particular ad has aired on television, and where and when, in the markets we track.

There are three pieces to the Political TV Ad Archive:

  • The Internet Archive collects, prepares, and serves the TV content in markets where we have feeds. Collection of TV is part of a much larger effort to meet the organization’s mission of providing “Universal Access to All Knowledge.” The Internet Archive is the online home to millions of free books, movies, software, music, images, web pages, and more.
  • The Duplitron 5000 is our whimsical name for an open source system responsible for taking video and creating unique, compressed versions of the audio tracks. These are known as audio fingerprints. We create an audio fingerprint for each political ad that we discover, which we then match against our incoming stream of broadcast television to find each new copy, or airing, of that ad. These results are reported back to the Internet Archive.
  • The Political TV Ad Archive is a WordPress site that presents our data and our videos to the rest of the world. On this website, for the sake of posterity, we also archive copies of political ads that may be airing in markets we don’t track, or exclusively on social media. But for the ads that show up in areas where we’re collecting TV, we are able to present the added information about airings.

 

Step 1: recording television

We have a whole bunch of hardware spread around the country to record television. That content is then pieced together to form the programs that get stored on the Internet Archive’s servers. We have a few ways to collect TV content. In some cases, such as the San Francisco market, we own and manage the hardware that records local cable. In other cases, such as markets in Ohio and Iowa, the content is provided to us by third party services.

Regardless of how we get the data, the pipeline takes it to the same place. We record in minute-long chunks of video and stitch them together into programs based on what we know about the station’s schedule. This results in video segments of anywhere from 30 minutes to 12 hours. Those programs are then turned into a variety of file formats for archival purposes.
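
To make that stitching step concrete, here is a minimal sketch (not our production code, with hypothetical file paths) that uses ffmpeg's concat demuxer to join minute-long chunks into one program file:

```python
# Hedged sketch: join minute-long MPEG-TS chunks into a single program file
# with ffmpeg's concat demuxer. Directory layout and names are hypothetical;
# the real pipeline also uses the station schedule to decide where programs
# begin and end.
import subprocess
from pathlib import Path

chunk_dir = Path("chunks/KQED/2016-02-01")   # hypothetical layout: one .ts file per minute
chunks = sorted(chunk_dir.glob("*.ts"))

concat_list = Path("program_chunks.txt")
concat_list.write_text("".join(f"file '{c.resolve()}'\n" for c in chunks))

subprocess.run([
    "ffmpeg",
    "-f", "concat", "-safe", "0",   # read the list of chunk files
    "-i", str(concat_list),
    "-c", "copy",                   # remux without re-encoding
    "program.ts",
], check=True)
```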

The ad counts we publish are based on actual airings, as opposed to reported airings. This means that we are not estimating counts by analyzing Federal Election Commission (FEC) reports on spending by campaigns. Nor are we digitizing reports filed by broadcasting stations with the Federal Communications Commission (FCC) about political ads, though that is a worthy goal. Instead we generate counts by looking at what actually has been broadcast to the public.

Because we are working from the source, we know we aren’t being misled. On the flip side, this means that we can only report counts for the channels we actively track and record. In the first phase of our project, we tracked more than 20 markets in 11 key primary states (details here). We’re now in the process of planning which markets we’ll track for the general elections. Our main constraint is simple: money. Capturing TV comes at a cost.

A lot can go wrong here. Storms can affect reception, and packets can be lost or corrupted before they reach our servers. The result can be time shifts or missing content. But most of the time the data winds up sitting comfortably on our hard drives unscathed.

Step 2: searching television

Video is terrible when you’re trying to look for a specific piece of it. It’s slow, it’s heavy, and it’s far better suited to watching than to working with, but sometimes you need to find a way.

There are a few things to try. One is transcription: if you have a time-coded transcript you can do almost anything, like create a text editor for video, or search for key phrases such as “I approve this message.”
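
To show how little code that search takes, here is a minimal sketch with hypothetical cue data standing in for a parsed caption file:

```python
# Hedged sketch: searching a time-coded transcript for a key phrase.
# The cues below are hypothetical; real captions would be parsed from
# an SRT or WebVTT file into (start, end, text) entries like these.
cues = [
    (12.0, 15.5, "I'm Jane Doe and I approve this message."),  # made-up cue
    (95.0, 99.0, "Paid for by Example PAC."),                   # made-up cue
]

phrase = "i approve this message"
for start, end, text in cues:
    if phrase in text.lower():
        print(f"{start:7.1f}s  {text}")
```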

The problem is that most television is not precisely transcribed. Closed captions are required for most U.S. TV programs, but not for advertisements. Shockingly, most political ads are not captioned. There are a few open source tools out there for automated transcript generation, but the results leave much to be desired.

Introducing audio fingerprinting

We use a free and open tool called audfprint to convert our audio files into audio fingerprints.

An audio fingerprint is a summarized version of an audio file, one that discards everything except the most distinctive features of every few milliseconds of sound. The trick is that the summaries are formed in a way that makes them easy to compare, and because they are summaries, the resulting fingerprint is a lot smaller and faster to work with than the original.

The audio fingerprints we use are based on a thing called frequency. Sounds are made up of waves, and each wave repeats, or oscillates, at a different rate. Faster repetitions correspond to higher sounds, slower repetitions to lower sounds.

An audio file contains instructions that tell a computer how to generate these waves. Audfprint breaks the audio files into tiny chunks (around 20 chunks per second) and runs a mathematical function on each fragment to identify the most prominent waves and their corresponding frequencies.
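
In spirit, that step looks like the toy sketch below: split the samples into chunks and keep only each chunk's strongest frequencies. This illustrates the idea; it is not audfprint's actual code.

```python
# Hedged sketch of the idea only (not audfprint's code): split audio into
# ~20 chunks per second and keep just the most prominent frequencies of each.
import numpy as np

def dominant_frequencies(samples, sample_rate, chunks_per_second=20, keep=5):
    chunk_len = sample_rate // chunks_per_second
    peaks = []
    for start in range(0, len(samples) - chunk_len + 1, chunk_len):
        chunk = samples[start:start + chunk_len]
        spectrum = np.abs(np.fft.rfft(chunk * np.hanning(chunk_len)))
        freqs = np.fft.rfftfreq(chunk_len, d=1.0 / sample_rate)
        strongest = np.argsort(spectrum)[-keep:]        # indices of the loudest bins
        peaks.append(sorted(int(f) for f in freqs[strongest]))
    return peaks                                        # one short list per chunk

# A 440 Hz test tone should show ~440 among every chunk's peaks.
rate = 8000
t = np.arange(rate) / rate
print(dominant_frequencies(np.sin(2 * np.pi * 440 * t), rate)[:2])
```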

The rest is thrown out, the summaries are stored, and the result is an audio fingerprint.

If the same sound exists in two files, a common set of dominant frequencies will be seen in both fingerprints. Audfprint makes it possible to compare the chunks of two sound files, count how many they have in common, and check how many appear at roughly the same distance from one another.
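
Stripped to its essence, that comparison is an offset histogram: count how many fingerprint entries two files share and whether they line up at one consistent time offset. Below is a minimal sketch of the idea, with made-up hash values; it is not audfprint's implementation.

```python
# Hedged sketch of offset-consistent matching: two recordings likely share an
# ad if many fingerprint hashes line up at one consistent time offset.
from collections import Counter

def best_alignment(ad_prints, program_prints):
    """Each argument is a list of (chunk_index, hash) pairs."""
    program_index = {}
    for t, h in program_prints:
        program_index.setdefault(h, []).append(t)

    offsets = Counter()
    for t_ad, h in ad_prints:
        for t_prog in program_index.get(h, []):
            offsets[t_prog - t_ad] += 1     # candidate position of the ad in the program

    if not offsets:
        return None, 0
    return offsets.most_common(1)[0]        # (best offset, number of agreeing chunks)

# Toy example with made-up hashes: the "ad" starts 100 chunks into the "program".
ad = [(0, 111), (1, 222), (2, 333)]
program = [(99, 999), (100, 111), (101, 222), (102, 333)]
print(best_alignment(ad, program))          # -> (100, 3)
```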

This is what we use to find copies of political ads.
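
If you want to try audfprint yourself, it is driven from the command line; a typical session looks roughly like the sketch below (file names are hypothetical, and flags may vary between audfprint versions).

```python
# Hedged sketch of driving audfprint from Python. "new" builds a fingerprint
# database from reference audio; "match" queries it with other audio.
# File names are hypothetical, and flags may differ across audfprint versions.
import subprocess

# Fingerprint the audio of a known political ad into a database.
subprocess.run([
    "python", "audfprint.py", "new",
    "--dbase", "ads.pklz",
    "ad_audio/example_ad.wav",
], check=True)

# Ask whether a recorded program contains any of the fingerprinted ads.
subprocess.run([
    "python", "audfprint.py", "match",
    "--dbase", "ads.pklz",
    "programs/KQED_2016-02-01_1800.wav",
], check=True)
```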

Step 3: cataloguing political ads

When we discover a new political ad, the first thing we do is register it on the Internet Archive, kicking off the ingestion process. The person who found it types in some basic information, such as who the ad mentions, who paid for it, and what topics are discussed.

The ad is then sent to the system we built to manage our fingerprinting workflow, which we whimsically call the Duplitron 5000, or “DT5k.” It uses audfprint to generate fingerprints, organizes how the fingerprints are stored, processes the comparison results, and allows us to scale the process across millions of minutes of television.

DT5k generates a fingerprint for the ad, stores it, and then compares that fingerprint with hundreds of thousands of existing fingerprints for the shows previously ingested into the system. It takes a few hours for all of the results to come in. When they do, the Duplitron makes sense of the numbers and tells the archive which programs contain copies of the ad and what time the ad aired.
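
As a toy illustration of that last step (all values here are made up), turning a match offset inside a program into an air time is simple arithmetic:

```python
# Hedged sketch: convert a match offset within a recorded program into a
# broadcast time. The program start and offset below are hypothetical.
from datetime import datetime, timedelta

program_start = datetime(2016, 2, 1, 18, 0, 0)      # when the recorded program began airing
match_offset = timedelta(seconds=1520)               # where in the program the ad was found
print((program_start + match_offset).isoformat())    # 2016-02-01T18:25:20
```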

These results end up being fairly accurate, but not perfect. The matches are based on audio, not video, which means we run into trouble when a political ad uses the same soundtrack as, for instance, an infomercial.

We are working on improving the system to filter out these kinds of false positives, but even with no changes these fingerprints have provided solid data across the markets we track.

[Image: The Duplitron 5000, counting political ads. Credit: Lyla Duey.]

Step 4: enjoying the results

And now you understand a little bit more about our system. You can download our data and watch the ads at the Political TV Ad Archive. (For more on our metadata, what’s in it, and what you can do with it, read here.)

Over the coming months we will keep working to make the system more accurate. We are also exploring ways to identify newly released political ads without any need for manual entry.

P.S. We’re also working to make it as easy as possible for any researchers to download all of our fingerprints to use in their own local copies of the Duplitron 5000. Would you like to experiment with this capability? If so, contact me on Twitter at @slifty.