The Music Modernization Act is Bad for the Preservation of Sound Recordings

There’s a bill working its way through Congress called the Music Modernization Act (the current bill is a mix of several bills; the portion we are concerned with was formerly called the CLASSICS Act) that has us very concerned about the fate of historical sound recordings. As currently drafted, this bill would vastly expand the rights of performers of pre-1972 sound recordings, without any provision for a public domain for these works or meaningful fair use and library exceptions. After a visit to Washington, DC to meet with various Congressional staffers working on this issue, we do not believe that the CLASSICS portion of the bill will be fixed. We therefore oppose the CLASSICS portion of the Music Modernization Act.

We agree with EFF on this, and they have written on the subject as well.

By way of background, sound recordings made before 1972 are not currently protected by federal copyright law, and have state law protection until 2067. To fix some real unfairness for a small group of still-living performance artists, mostly from the 1960s, this bill would give federal “pseudo-copyright” protection for digital performances of works going back to 1923. The bill would leave the rest under state law, creating an even more complex and confusing legal landscape for libraries wishing to preserve these historical recordings for future generations.

Copyright law is meant to be a careful balance between creators and the public. This bill is a giveaway to a small group of commercial interests that leaves libraries and the public they serve behind. We hope Congress will reject this portion of the MMA.

Posted in Announcements, News | Comments Off on The Music Modernization Act is Bad for the Preservation of Sound Recordings

John Perry Barlow Symposium — Saturday, April 7

Watch The Video Here

Please join us for a celebration of the life and leadership of the recently departed founder of EFF, John Perry Barlow. His friends and compatriots in the fight for civil liberties, a fair and open internet, and voices for open culture will discuss what his ideas mean to them, and how we can follow his example as we continue our fight.

Speakers Lineup:

Edward Snowden, noted whistleblower and President of the Freedom of the Press Foundation
Cindy Cohn, Executive Director of the Electronic Frontier Foundation
Cory Doctorow, celebrated sci-fi author and Editor in Chief of Boing Boing
Joi Ito, Director of the MIT Media Lab
John Gilmore, EFF Co-founder, Board Member, entrepreneur and technologist
Trevor Timm, Executive Director of the Freedom of the Press Foundation
Shari Steele, Executive Director of the Tor Project and former EFF Executive Director
Mitch Kapor, Co-founder of EFF and Co-chair of the Kapor Center for Social Impact
Pam Samuelson, Richard M. Sherman Distinguished Professor of Law and Information at the University of California, Berkeley
Steven Levy, Wired Senior Writer, and author of Hackers, In the Plex, and other books
Amelia Barlow, daughter of John Perry Barlow

We suggest a $20 donation for admission to the Symposium, but no one will be turned away for lack of funds. All ticket proceeds will benefit the Electronic Frontier Foundation and the Freedom of the Press Foundation.

John Perry Barlow Symposium
Saturday, April 7, 2018
2 PM to 6 PM

Internet Archive
300 Funston Avenue
San Francisco, CA 94118

RSVP Here

Posted in Event, Past Event | 6 Comments

TV News Record: How cable TV news reports news, fact-checks on banking, trade, and public lands

A roundup of what’s happening at the TV News Archive by Katie Dahl and Nancy Watzman.

This week, we present a Washington Post analysis of coverage of an alleged affair by the president; a Vox piece examining coverage of Andrew McCabe, the former deputy FBI director; and The Toronto Star’s use of a salient clip to illustrate a point about a presidential appointment. We also show fact-checks from FactCheck.org, PolitiFact, and The Washington Post’s Fact-Checker on claims related to banking, public lands, and trade policy.

Chicken-egg question on cable news coverage of alleged affair

CNN and MSNBC hosts and guests are talking a lot more about the alleged past affair between President Donald Trump and Stormy Daniels than Fox News is, according to Philip Bump’s latest analysis for The Washington Post using TV News Archive data via Television Explorer. 

Bump used the analysis as context to dig into a poll released by Suffolk University earlier this month: “One-fifth of Americans said that Fox News was the news or commentary source they trusted the most, a group that was primarily made up of Republicans… There’s a chicken-egg question here. Does Fox give the Stormy Daniels story a light touch because its audience is largely supportive of Trump or is Fox’s audience largely supportive of Trump because of the coverage they see on Fox? Or is it both?”

Did Fox News reporting contribute to perception of fired FBI official?

Vox’s Alvin Chang argues there is a connection between the firing of Andrew McCabe, former FBI deputy director, and a narrative built up over the course of months by Fox News. Using TV News Archive data via Television Explorer, Chang reports that “long before he was fired, Fox News… constantly referred to McCabe as the quintessential example of the FBI’s corruption and anti-Trump bias. They hinted that he was plotting several schemes against Trump during the election, leaking information to the press, and was bought and paid for by Hillary Clinton and Democrats.” This, he writes, allowed Fox News viewers to think it made “perfect sense for Attorney General Jeff Sessions (perhaps directed by Trump) to fire McCabe.” Chang goes on to warn, “This alternate reality is being fed into the president’s mind.”


What new presidential economic pick had to say about Canadian PM

The Toronto Star embedded a TV news clip in a piece on Trump’s pick to replace his economic advisor. Larry Kudlow, who is taking over from Gary Cohn as economic advisor, had said of U.S. trade policy:  “NAFTA is the key. And unfortunately we’re going after a major NAFTA ally, and perhaps America’s greatest ally, namely Canada. Even with this left-wing crazy guy Trudeau, they’re still our pals. They’re still our pals. Why are we going after them?” The clip has been viewed more than 112,000 times and counting.


Fact-Check: Senate banking bill a big win for Wall Street (Yes and No)

In a floor speech, Sen. Elizabeth Warren, D., Mass., said of the latest proposal to make changes to Dodd-Frank, “This bill is about goosing the bottom line and executive bonuses at the banks that make up the top one half of 1 percent of banks in this country by size. The very tippy-top.”

Manuela Tobias reported for PolitiFact: “The bill raises the bar of what is considered a big bank five-fold, which effectively relaxes the standards for large regional banks. Experts warn this also could open a door for bigger Wall Street bank giveaways.

The bill also has a few provisions affecting banks above $250 billion in assets. However, the effects would largely depend on the Federal Reserve’s interpretation of the law. The biggest banks might be able to get relaxed regulations, but then again, they might not.”


Fact-Check: Public lands proposal largest in history (False)

In a Senate hearing on the budget for the Dept. of the Interior, Interior Secretary Ryan Zinke said the president’s proposal “is the largest investment in our public lands infrastructure in our nation’s history. Let me repeat that, this is the largest investment in our public lands infrastructure in the history of this country.”

PolitiFact rates the claim false. Louis Jacobson reported: “It’s far from assured that the maximum figure of $18 billion in the proposal will ever be reached if enacted. Beyond that, though, Roosevelt’s $3 billion investment in the Civilian Conservation Corps would amount to $53 billion today, and it accounted for vastly more than the Trump proposal as a percentage of federal spending at the time.”

Fact-Check: U.S. has trade deficit with Canada (Four Pinocchios)

After a private meeting with Canadian Prime Minister Justin Trudeau, Trump defended his view about U.S.-Canada trade, tweeting, “We do have a Trade Deficit with Canada, as we do with almost all countries (some of them massive). P.M. Justin Trudeau of Canada, a very good guy, doesn’t like saying that Canada has a Surplus vs. the U.S.(negotiating), but they do … they almost all do … and that’s how I know!”

Glenn Kessler reports for The Washington Post’s Fact Checker that the president is not including services in his analysis of the trade relationship with Canada. He adds: “The president frequently suggests the United States is losing money with these deficits, but countries do not ‘lose’ money on trade deficits. A trade deficit simply means that people in one country are buying more goods from another country than people in the second country are buying from the first.” Kessler gives the claim four Pinocchios.

Eugene Kiely reports for FactCheck.org that the president’s claim that figures giving the U.S. a trade surplus with Canada are not including timber and energy is “not accurate. The Census Bureau, which is within the U.S. Department of Commerce, said its trade figures do include timber and energy and referred us to two publications that show that the agency does include timber and energy for imports and exports.”

Follow us @tvnewsarchive, and subscribe to our biweekly newsletter here.

Posted in News, Television Archive | 4 Comments

Digital opportunity for the academic scholarly record

[MIT Libraries is holding a workshop on Grand Challenges for the scholarly record.  They asked participants for a problem/solution statement.  This is mine. -brewster kahle]

The problem of academic scholarly record now:

University library budgets are spent on closed rather than open: we invest dollars in closed/subscription services (Elsevier, JSTOR, HathiTrust) rather than ones open to all users (PLOS, arXiv, Internet Archive, eBird) – and for a reason. There is only so much money, our community demands access to the closed services, and the open ones are there whether we pay for them or not.

We want open access AND digital curation and preservation– but have no means to spend cooperatively.

University libraries funded the building of Elsevier / JSTOR / HathiTrust: closed, subscription services.

We need to invest most University Library acquisition dollars in open: PLOS, arXiv, Wikipedia, Internet Archive, eBird.

We have solved it when:

Anyone anywhere can get ALL information available to an MIT student, for free.

Everyone everywhere has the opportunity to contribute to the scholarly record as if they were MIT faculty, for free.

What should we do now?

Analog -> Digital conversion of all published scholarly literature must be completed soon. And it must be completely open, available in bulk.

Curation and Digital Preservation of born-digital research products: papers/websites/research data.

“Multi-homing” digital research product (papers, websites, research data) via peer-to-peer backends.

Who can best implement?

Vision and tech ability: Internet Archive, PLOS, Wikipedia, arXiv.

Funding now is coming from researchers, individuals, rich people.

Funding should come from University Library acquisition budgets.

Why might MIT lead?

OpenCourseware was bold.  MIT might invest in opening the scholarly record.

How might MIT do this?

Be bold.

Spend differently.

Lead.

Posted in Announcements, News | 1 Comment

Some Very Entertaining Plastic, Emulated at the Archive

It’s been a little over four years since the Internet Archive started providing emulation in the browser from our software collection; millions of plays of games, utilities, and everything else that shows up on a screen have happened since then. While we continue to refine the technology (including adding WebAssembly as an option for running the emulations), we have also tried to expand out to various platforms, computers, and anything else we can, based on the work of the emulation community, especially the MAME Development Team.

For a number of years, the MAME team has been moving towards emulating a class of hardware and software that, for some, stretches the bounds of what emulation can do, and we have now put up a collection of some of their efforts here at archive.org.

Introducing the Handheld History Collection.

This collection of emulated handheld games, tabletop machines, and even board games stretches from the 1970s well into the 1990s. They are attempts to make portable, digital versions of the LCD, VFD and LED-based machines that sold, often cheaply, at toy stores and booths over the decades.

We have done our best to add instructions and, in some cases, link to scanned versions of the original manuals for these games. They range from notably simplistic efforts to complicated, many-buttoned affairs that are difficult to learn, much less master.

They are, of course, entertaining in themselves – these are attempts to put together inexpensive versions of the video games of the time, or to bring new properties into existence from whole cloth. Often sold cheaply enough that they were sealed in plastic and sold in the same stores as a screwdriver set or flashlight, these little systems tried to pack as much “game” as possible into a small, custom plastic case running on batteries. (Some were, of course, better built than others.)

They also represent the difficulty ahead for many aspects of digital entertainment, and as such are worth experiencing and understanding for that reason alone.

Taking a $2600 machine and selling it for $20

The shocking difference between the original sold arcade stand-ups and their toy store equivalents can be seen, for example, in the Arcade Game Q*Bert, which you can play at the Archive.

The original Arcade machine looks like this:

And the videogame itself looks like this:

Meanwhile, some time after the release of the arcade machine, a plastic tabletop version of the game came out, and it looked like this:

Using VFD (Vacuum Fluorescent Display) technology, the pre-formed art is lit up by circuits that try to act like the arcade game as much as possible, without using an actual video screen or even the same programming. As a result, the “video” is much more abstract, fascinatingly so:

The music and speech synthesis is gone, a small plastic joystick replaces the metal and hard composite of the original, and the colors are a fraction of what they were. But somehow, if you squint, the original Q*Bert game is in there.

This sort of Herculean effort to squeeze a major arcade machine into a handful of circuits and a beeping, booping shell of what it once was is an ongoing story: just as developers once strained to make arcade machines work on home consoles like the 2600 and ColecoVision, so they did with these plastic toy games. Work of this sort continues today, as mobile games take charge and developers work to bring huge immersive experiences to a phone that hits all the same notes.

The work in this area often speaks for itself. Check out some of these “screenshots” in the VFD games and see if you recognize the originals:

Naturally, these simple screens came packed in the brightest, most colorful stickers and plastic available, to lure in customers. The original containers, while not “emulated” in this browser-based version, definitely represent an important part of the experience.

A Major Bow to the Emulation Developers

The efforts behind accurately reflecting video game and computer experiences in an emulator, which the Archive then uses to provide our in-browser Emularity, are impressive in their own right, and should be highlighted as the lion’s share of the effort. Groups like the MAME Team, as well as efforts like Dolphin, Higan, and many others, are all poking and prodding code to bring accuracy, speed, and depth to software preservation. They are an often overlooked legion of volunteers addressing technical hurdles that no one else is approaching.

While this entry could be filled with many paragraphs about these efforts, one particularly strong example sticks out: Bringing emulation of LCD-based games to MAME.

Destroying The Artifact to Save It

In the case of most emulation, the chips of a circuit board, as well as storage media connected to a machine, can be read non-destructively: the information is pulled off the original, the parts are returned to place, and the copies are used to present emulated versions. An example might be an arcade machine whose chips are pulled from a circuit board, read, and then plugged back into the board, allowing the arcade machine to keep functioning. (Occasionally, an arcade machine or computer will use approaches like glue or batteries to prevent this sort of duplication, but that is generally rare, due to maintenance concerns for operators.)

In the case of an LCD game machine, however, sometimes it is necessary to pull the item completely apart to get all the information from it. On the MAME team, there is a contributor named Sean Riddle and his collaborator “hap” who have been tireless in digging the information out of both LCD games and general computer chips.

To get the information off an LCD game, it has to be pulled apart and all its components scanned, vectorized, and traced to then make them into a software version of themselves. Among the information grabbed is the LCD display itself, which has a pre-formed set of images that do not overlap and represent every possible permutation of any visual data in the game. This will make almost no sense without illustrations, so here are some.

When playing the LCD version of the game “Nightmare Before Christmas”, the game will look like this:

That is a drawn background (also scanned in this process) with a clear liquid-crystal display over it, showing Jack Skellington, the tree, and an elf. The artistry and intense technical challenge of both the original programming/design and the recovery of this information become clear when you see the LCD layer with all the elements “on” at once:

This sort of intense work is everywhere in the background of these LCD games. Here are some more:

(There are many more examples of these at this page at Sean Riddle’s site.)

Not only must the LCD panel be disassembled, but the circuit board beneath as well, to determine the programming involved. These are scanned and then studied to work out the cross-connections that tell the game when to light up what. The work has been optimized and can often go relatively quickly, but only due to years of experience behind the effort, experience which, again, comes from a volunteer force. Unfortunately, the machine does not survive, but the argument is made, quite rightly, that otherwise these toys will fade into oblivion. Now, they can be played by thousands or millions and do so for a significant amount of time to come.
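The segment model that this process recovers can be sketched in a few lines. The key property described above is that the pre-formed LCD images never overlap, so any screen the game can show is just a subset of segments driven "on" over the scanned background. This is an illustrative toy, not MAME's actual artwork system (which composites vectorized SVG layers keyed to the chip's segment-driver outputs); all segment names and shapes here are made up:

```python
# Hypothetical segments traced from a scanned LCD layer: each is a set of
# (x, y) cells, and no two segments share a cell (the non-overlap property).
SEGMENTS = {
    "hero_arms_up":   {(3, 1), (4, 1)},
    "hero_arms_down": {(3, 2), (4, 2)},
    "tree":           {(7, 5), (7, 6), (8, 6)},
}

def render(active, width=10, height=8):
    """Composite the currently driven segments, the way an emulator
    composites vectorized segment artwork over the scanned background."""
    lit = set()
    for name in active:
        lit |= SEGMENTS[name]
    return "\n".join(
        "".join("#" if (x, y) in lit else "." for x in range(width))
        for y in range(height)
    )
```

Because the segments are disjoint, scanning and vectorizing each one exactly once is enough to reproduce every screen state the original hardware could display.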

The Fundamental Question: What Needs to be Emulated?

Floating in the back of this new collection, and in the many new LCD and electronic games being emulated by the MAME Team, is the core concern: what will bring the most of the old game to life, so it can be experienced and studied? With “standard” arcade games, it is often just a case of providing the video and speaker output and accepting the control-panel signals, either through a keyboard or through connected hardware. While you do not get the full role-play of being inside a dark arcade in the 1980s, you do get the chance to play the original program and to study its inner workings and the discoveries made in the process. Additional efforts to photograph or reference control panels, outside artwork, and so on are also being made to the best extent possible.

This question falls into sharp focus, however, with these electronic toys. The plastic is such a major component of the experience that it may not be enough for some researchers and users to be handed a version of the visual output to really know what the game was like. Compare the output of Bandai Pair Match:

…to what the original toy looked like:

The “core” is there, but a lot is left to the side out of necessity. Documentation, research and capturing all aspects of these machines will be required if they are to be ever recreated or understood in the future.

It’s the best of times that we are able to ask these questions while originals are still around, and it’s a testament to the many great teams and researchers who are bringing these old games into the realms of archives.

So please, take a walk through the Handheld History collection (as well as our other emulation efforts) and relive those plastic days of joy again.

Shout Outs and Thanks

Many different efforts and projects were brought together to make the Handheld History collection what it is. (We intend to expand it over time.) As always, a huge thanks to the MAME Developers for their tireless efforts to emulate our digital history; a special shout-out to Ryan Holtz for his announcements and highlighting of advances in this effort that inspired this collection to be assembled. Thanks to Daniel Brooks for maintenance of The Emularity as well as expanding the capabilities of the system to handle these new emulations. Sources for the photographs of the original plastic systems include The Handheld Games Museum and Electronic Plastic. (It is amazing how few photos of the original toy systems exist; in some cases Ebay sales are the only documented photographs of any resolution.) As a reference work for knowing which systems are emulated and how, we relied heavily on the work of the Arcade Italia Database site. Thanks to Azma and Zeether for providing metadata on images and control schemes for these games; and a huge thanks to all the photographers, documenters, scanners and reviewers who have been chronicling the history of these games for decades.

Posted in Announcements, News | 9 Comments

TV News Record: Glorious ContextuBot making progress

A roundup of what’s happening at the TV News Archive by Katie Dahl and Nancy Watzman.

This week, we present an update on the video context project Glorious Contextubot, two recent news reports that use TV News Archive data, and fact-checks of TV appearances by the DNC chair and the president.

Fueled by TV News Archive, the Glorious Contextubot is making progress

Let’s say a friend posts a YouTube video link to a politician’s statement on Facebook, but you have a feeling it’s taken out of context. The clip is tightly edited, and you’re curious to see the rest of the statement. Was the politician answering a question? Was the statement part of a larger discussion?

Enter the Glorious ContextuBot. For the past nine months, veteran media innovators Mark Boas and Laurian Gridinoc of Hyperaudio and Trint, led by the Internet Archive’s own Dan Schultz, senior creative technologist of the TV News Archive, have been building a prototype of the ContextuBot, fueled by the TV News Archive. The ContextuBot is one of 20 winners of the Knight Prototype Fund’s $1 million challenge, announced in June 2017.

With the ContextuBot, it’s possible to use video to search video. Just paste a link to a video snippet into an interface and pull up a transcript that puts the clip in the context of what came before and after. Built from the Duplitron 5000, an audio fingerprinting tool Schultz developed to track political ads for the Political TV Ad Archive, the ContextuBot demonstrates how open technology built by the TV team can be repurposed and improved by motivated technologists – a tool that has already captured the attention of the University of Iowa Informatics department, which is considering adopting it for researchers.

To date, the team has:

  • Made it easier to scale audio search. It’s now possible to scale up and down audio fingerprint finding within a corpus of TV news by adding or removing individual computers or compute clusters.  Our Duplitron would take eight hours to search a year of television, but the ContextuBot makes it much easier to spread that computing across multiple machines.
  • Built a demo interface. You can see a clip in context with a transcript of what comes before and after. Click on a word in the transcript, and you’ll be able to jump to that point in the video stream.
  • Begun to explore a “comic view.”  The team’s biggest goal is to explore ways to communicate the essence of a longer clip in a short amount of time.  One approach: converting video into a comic. This would set the groundwork for automatically extracting (and rendering) a storyboard from a video clip.

The team will present the prototype shortly before the International Symposium of Online Journalism conference in Austin in April 2018.


The Washington Post finds stark differences in cable TV coverage of Jared Kushner

After a heavy news week of developments related to Jared Kushner, President Trump’s son-in-law and a senior adviser, The Washington Post’s Philip Bump dug into the TV News Archive and found that while MSNBC and CNN had numerous mentions of Kushner’s name, Fox News had just ten.


The Washington Post examines coverage of Parkland shooting

Rachel Siegel used the TV News Archive to compare coverage of the Parkland shooting with several other high-profile shootings, and found that this time cable TV attention spans are a bit longer.


Fact-Check: the DNC raised record amounts in January (Two Pinocchios)

In a recent interview, Democratic National Committee Chairman Tom Perez said, “We raised more money in January… of 2018 than any January in our history. So if the question is, ‘Do we have enough money to implement our game plan?’ Absolutely.”

This claim earned “two Pinocchios” from Salvador Rizzo, reporting for The Washington Post’s Fact Checker:  the “DNC raised $6 million in January 2018… That was below what it raised in January 2014 ($6.6 million), January 2012 ($13.2 million), January 2011 ($7.1 million) and January 2010 ($9.1 million).”  A spokesman for Perez “backed off from those comments when we reached out with FEC figures that told a different story.”


Fact-Check: Congressman fears NRA downgrade for gun legislation (misleading)

In a meeting with lawmakers to talk gun legislation, President Donald Trump suggested that an increase in the age requirement for purchasing guns was not included in a 2013 reform effort by Sen. Pat Toomey, R-Pa., “because you’re afraid of the NRA, right?”

Reporting by FactCheck.org’s Eugene Kiely, Lori Robertson, and Robert Farley calls this statement misleading. “As a result of the legislation, Toomey’s rating with the NRA dropped from an “A” to a “C,” and the endorsements and contributions Toomey got from the NRA in previous House and Senate races disappeared. In 2016, the NRA stayed out of Toomey’s Senate race altogether; his Democratic opponent, Katie McGinty, had an “F” grade from the NRA. In that race, Toomey got the endorsement of a gun-control group, Everytown for Gun Safety, which ran ads supporting him.”


Follow us @tvnewsarchive, and subscribe to our biweekly newsletter here.

Posted in News, Television Archive | Comments Off on TV News Record: Glorious ContextuBot making progress

Archive video now supports WebVTT for captions

We now support .vtt (Web Video Text Tracks) files, in addition to the .srt (SubRip) files we have supported for years, for captioning your videos.

It’s as simple as uploading a “parallel filename” to your video file(s).

Examples:

  • myvid.mp4
  • myvid.srt
  • myvid.vtt

Multi-lang support:

  • myvid.webm
  • myvid.en.vtt
  • myvid.en.srt
  • myvid.es.vtt

Here’s a nice example item:
https://archive.org/details/cruz-test

VTT with caption picker (and upcoming A/V player too!)

(We will have an updated A/V player with a better “picker” for handling so many language tracks in the coming days – have no fear 😎)

Enjoy!

Posted in Technical, Television Archive, Video Archive | Comments Off on Archive video now supports WebVTT for captions

10 Ways To Explore The Internet Archive For Free

The Internet Archive is a treasure trove of fascinating media, texts, and ephemera – items that, if they didn’t exist here, would be lost forever. Yet so many of our community members have difficulty describing what exactly it is… that we do here. Most people know us for the Wayback Machine, but we are so much more. To that end, we’ve put together a fun and useful guide to exploring the Archive. So grab your flashlight and pith helmet and let your digital adventure begin…

1. Pick a place & time you want to explore. Search our eBooks and Texts collection and download or borrow one of the 3 million books for free, offered in many formats, including PDFs and EPub.

2. Enter a time machine of old time films. Explore films of historic significance in the Prelinger Archives.

3. Want to listen to a live concert? The Live Music Archive holds more than 12,000 Grateful Dead concerts.

4. Who Knows What Evil Lurks in the Hearts of Men? Only the Shadow knows. You can too. Listen to “The Shadow” as he employs his power to cloud minds to fight crime in Old Time Radio.

5. To read or not to read? Try listening to Shakespeare with the LibriVox Free Audiobook Collection.

6. Need a laugh? Search the Animation Shorts collection for an old time cartoon.

7. Before there was PlayStation 4… there was Atari. Play a classic video game on an emulated old-time console, right in the browser. Choose from hundreds of games in the Internet Arcade.

8. Are you a technophile? Take the Oregon Trail or get nostalgic with the Apple II programs. You have instant access to decades of computer history in the Software Library.

9. Find a television news story you missed. Search our Television News Archive for all the channels that presented the story. How do they differ? Quote a clip from the story and share it.

10. Has your favorite website disappeared? Go to the Wayback Machine and type in the URL to see if this website has been preserved across time. Want to save a website? Use “Save Page Now.”

What does it take to become an archivist? It’s as simple as creating your own Internet Archive account and diving in. Upload photos, audio, and video that you treasure. Store them for free. Forever.

Sign up for free at https://archive.org.

Posted in Announcements, News | Comments Off on 10 Ways To Explore The Internet Archive For Free

Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation

The Andrew W. Mellon Foundation has awarded a research and development grant to the Internet Archive to address the critical need to preserve the “long tail” of open access scholarly communications. The project, Ensuring the Persistent Access of Long Tail Open Access Journal Literature, builds on prototype work identifying at-risk content held in web archives by using data provided by identifier services and registries. Furthermore, the project expands on work acquiring missing open access articles via customized web harvesting, improving discovery of and access to these materials from within extant web archives, and developing machine learning approaches, training sets, and cost models for advancing and scaling this project’s work.

The project will explore how adding automation to the already highly automated systems for archiving the web at scale can help address the need to preserve at-risk open access scholarly outputs. Instead of specialized curation and ingest systems, the project will work to identify the scholarly content already collected in general web collections, both those of the Internet Archive and collaborating partners, and implement automated systems to ensure at-risk scholarly outputs on the web are well-collected and are associated with the appropriate metadata. The proposal envisages two opposite but complementary approaches:

  • A top-down approach involves taking journal metadata and open data sets from identifier and registry sources such as ISSN, DOAJ, Unpaywall, CrossRef, and others and examining the content of large-scale web archives to ask “is this journal being collected and preserved and, if not, how can collection be improved?”
  • A bottom-up approach involves examining the content of general domain-scale and global-scale web archives to ask “is this content a journal and, if so, can it be associated with external identifier and metadata sources for enhanced discovery and access?”
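The top-down approach, in miniature, is a coverage report: take the article URLs a registry knows about for each journal and check them against what the archive has already captured. This is a hedged sketch under made-up names; the real project works against full-scale archive indexes (and registry exports from services like DOAJ, Unpaywall, and CrossRef), not an in-memory set:

```python
from urllib.parse import urlparse

def canonical(url):
    """Normalize a URL the way an archive index might: case-fold the host
    and drop the scheme, so http/https captures of the same page match."""
    p = urlparse(url)
    return p.netloc.lower() + p.path

def journal_coverage(registry, archived_urls):
    """For each journal's known article URLs, report how many are already
    captured and which are missing.

    registry      -- dict: journal name -> list of article URLs (from a
                     registry export; names here are illustrative)
    archived_urls -- set of canonical URLs known to be in the archive
    """
    report = {}
    for journal, urls in registry.items():
        missing = [u for u in urls if canonical(u) not in archived_urls]
        captured = len(urls) - len(missing)
        report[journal] = {
            "articles": len(urls),
            "captured": captured,
            "coverage": captured / len(urls) if urls else 0.0,
            "missing": missing,
        }
    return report
```

The "missing" lists are what would feed the customized web harvesting described above, while low-coverage journals answer the question "is this journal being collected and, if not, how can collection be improved?"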

The grant will fund work to use the output of these approaches to generate training sets and test them against smaller web collections in order to estimate how effective this approach would be at identifying the long-tail content, how expensive a full-scale effort would be, and what level of computing infrastructure is needed to perform such work. The project will also build a model for better understanding the costs for other web archiving institutions to do similar analysis upon their collection using the project’s algorithms and tools. Lastly, the project team, in the Web Archiving and Data Services group with Director Jefferson Bailey as Principal Investigator,  will undertake a planning process to determine resource requirements and work necessary to build a sustainable workflow to keep the results up-to-date incrementally as publication continues.

In combination, these approaches will both improve the current state of preservation for long-tail journal materials and develop models for how this work can be automated and applied to existing corpora at scale. We thank the Mellon Foundation for their support of this work, and we look forward to sharing the project’s open-source tools and outcomes with a broad community of partners.

Posted in Announcements, News | Comments Off on Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation