Archive video now supports WebVTT for captions

We now support .vtt (Web Video Text Tracks) files for captioning your videos, in addition to the .srt (SubRip) files we have supported for years.

It’s as simple as uploading a caption file with a “parallel filename” alongside your video file(s).

Examples:

  • myvid.mp4
  • myvid.srt
  • myvid.vtt

Multi-lang support:

  • myvid.webm
  • myvid.en.vtt
  • myvid.en.srt
  • myvid.es.vtt
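
If you haven’t written captions before, a .vtt file is just plain text: a WEBVTT header followed by timed cues. Here is a minimal sketch (the cue text is ours, purely for illustration):

WEBVTT

00:00:01.000 --> 00:00:04.500
Hello, and welcome to the Internet Archive.

00:00:05.000 --> 00:00:08.000
Captions are plain text with simple time ranges.

Save it next to your video with the matching name (myvid.vtt, or myvid.en.vtt for English) and upload it to the same item.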

Here’s a nice example item:
https://archive.org/details/cruz-test

VTT with caption picker (and upcoming A/V player too!)

(We will have an updated A/V player with a better “picker” for so many language tracks in a matter of days, so have no fear. 😎)

Enjoy!

 


10 Ways To Explore The Internet Archive For Free

The Internet Archive is a treasure trove of fascinating media, texts, and ephemera: items that, if they didn’t exist here, would be lost forever. Yet many of our community members have difficulty describing exactly what it is that we do here. Most people know us for the Wayback Machine, but we are so much more. To that end, we’ve put together a fun and useful guide to exploring the Archive. So grab your flashlight and pith helmet, and let your digital adventure begin…

1. Pick a place and time you want to explore. Search our eBooks and Texts collection and download or borrow one of its 3 million books for free, offered in many formats, including PDF and EPUB.

2. Enter a time machine of old-time films. Explore films of historic significance in the Prelinger Archives.

3. Want to listen to a live concert? The Live Music Archive holds more than 12,000 Grateful Dead concerts.

4. Who Knows What Evil Lurks in the Hearts of Men? Only the Shadow knows. You can too. Listen to “The Shadow” as he employs his power to cloud minds to fight crime in Old Time Radio.

5. To read or not to read? Try listening to Shakespeare with the LibriVox Free Audiobook Collection.

6. Need a laugh? Search the Animation Shorts collection for an old-time cartoon.

7. Before there was PlayStation 4… there was Atari. Play a classic video game on an emulated old-time console, right in the browser. Choose from hundreds of games in the Internet Arcade.

8. Are you a technophile? Take the Oregon Trail or get nostalgic with the Apple II programs. You have instant access to decades of computer history in the Software Library.

9. Find a television news story you missed. Search our Television News Archive for all the channels that presented the story. How do they differ? Quote a clip from the story and share it.

10. Has your favorite website disappeared? Go to the Wayback Machine and type in the URL to see if this website has been preserved across time. Want to save a website? Use “Save Page Now.”
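
For the terminally curious: Save Page Now can also be triggered by requesting the save URL directly, no browser required. The address below uses example.com as a placeholder; swap in the page you want preserved:

$ curl -s "https://web.archive.org/save/https://example.com/" > /dev/null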

What does it take to become an archivist? It’s as simple as creating your own Internet Archive account and diving in. Upload photos, audio, and video that you treasure. Store them for free. Forever.

 

Sign up for free at https://archive.org.


Andrew W. Mellon Foundation Awards Grant to the Internet Archive for Long Tail Journal Preservation

The Andrew W. Mellon Foundation has awarded a research and development grant to the Internet Archive to address the critical need to preserve the “long tail” of open access scholarly communications. The project, Ensuring the Persistent Access of Long Tail Open Access Journal Literature, builds on prototype work identifying at-risk content held in web archives using data provided by identifier services and registries. The project also expands on work acquiring missing open access articles via customized web harvesting, improving discovery of and access to these materials within extant web archives, and developing machine learning approaches, training sets, and cost models for advancing and scaling this work.

The project will explore how adding further automation to the already highly automated systems for archiving the web at scale can help address the need to preserve at-risk open access scholarly outputs. Instead of building specialized curation and ingest systems, the project will work to identify the scholarly content already collected in general web collections, both those of the Internet Archive and of collaborating partners, and implement automated systems to ensure at-risk scholarly outputs on the web are well collected and associated with the appropriate metadata. The proposal envisages two opposite but complementary approaches:

  • A top-down approach involves taking journal metadata and open data sets from identifier and registry sources such as ISSN, DOAJ, Unpaywall, CrossRef, and others and examining the content of large-scale web archives to ask “is this journal being collected and preserved and, if not, how can collection be improved?”
  • A bottom-up approach involves examining the content of general domain-scale and global-scale web archives to ask “is this content a journal and, if so, can it be associated with external identifier and metadata sources for enhanced discovery and access?”
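
As a rough illustration of the top-down check (this is not the project’s actual tooling, just a sketch), the Wayback Machine’s public CDX API can be queried to see whether captures of a given journal’s website already exist in the Internet Archive’s web collections; the domain below is a placeholder:

$ curl -s "https://web.archive.org/cdx/search/cdx?url=example-journal.org&output=json&limit=5"

An empty result suggests the journal has not been collected and may be a candidate for targeted harvesting; the real project would drive such checks from registry data (ISSN, DOAJ, Unpaywall, CrossRef) at scale.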

The grant will fund work to use the output of these approaches to generate training sets and test them against smaller web collections, in order to estimate how effective this approach would be at identifying long-tail content, how expensive a full-scale effort would be, and what level of computing infrastructure is needed to perform such work. The project will also build a model for better understanding the costs for other web archiving institutions to do similar analysis on their own collections using the project’s algorithms and tools. Lastly, the project team, based in the Web Archiving and Data Services group with Director Jefferson Bailey as Principal Investigator, will undertake a planning process to determine the resource requirements and work necessary to build a sustainable workflow that keeps the results up to date incrementally as publication continues.

In combination, these approaches will both improve the current state of preservation for long-tail journal materials and develop models for how this work can be automated and applied to existing corpora at scale. We thank the Mellon Foundation for supporting this work, and we look forward to sharing the project’s open-source tools and outcomes with a broad community of partners.


27 Public Libraries and the Internet Archive Launch “Community Webs” for Local History Web Archiving

The lives and activities of communities are increasingly documented online; local news, events, disasters, celebrations — the experiences of citizens are now largely shared via social media and web platforms. As these primary sources about community life move to the web, the need to archive these materials becomes an increasingly important activity of the stewards of community memory. And in many communities across the nation, public libraries, as one of their many responsibilities to their patrons, serve the vital role of stewards of local history. Yet public libraries have historically been a small fraction of the growing national and international web archiving community.

With generous support from the Institute of Museum and Library Services, as well as the Kahle/Austin Foundation and the Archive-It service, the Internet Archive and 27 public library partners representing 17 different states have launched a new program: Community Webs: Empowering Public Libraries to Create Community History Web Archives. The program will provide education, applied training, cohort network development, and web archiving services for a group of public librarians to develop expertise in web archiving for the purpose of local memory collecting. Additional partners in the program include OCLC’s WebJunction training and education service, and the public libraries of Queens, Cleveland, and San Francisco, which will serve as “lead libraries” in the cohort. The program will result in dozens of terabytes of public-library-administered local history web archives; a range of open educational resources in the form of online courses, videos, and guides; and a nationwide network of public librarians with expertise in local history web archiving and the advocacy tools to build and expand the network. A full listing of the participating public libraries is below and on the program website.

In November 2017, the cohort gathered at the Internet Archive for a kickoff meeting of brainstorming, socializing, and, of course, talking all things web archiving. Partners shared details on their existing local history programs and ideas for collection development around web materials. Attendees talked about building collections documenting their demographic diversity or focusing on local issues, such as housing availability or changes in community profile. As an example, Abbie Zeltzer of the Patagonia Public Library spoke about the changes in her community of 913 residents as the town redevelops a long-dormant mining industry. Zeltzer intends to develop a web archive documenting this transition and the related community reaction and changes.

Since the kickoff meeting, the Community Webs cohort has been actively building collections, from hyper-local media sites in Kansas City, to neighborhood blogs in Washington D.C., to Mardi Gras in East Baton Rouge. In addition, program staff, cohort members, and WebJunction have been building out an extensive online course space with educational materials for training on web archiving for local history. The full course space and all open educational resources will be released in early 2019 and a second full in-person meeting of the cohort will take place in Fall 2018.

For further information on the Community Webs program, contact Maria Praetzellis, Program Manager, Web Archiving [maria at archive.org] or Jefferson Bailey, Director, Web Archiving [jefferson at archive.org].

Public Library | City | State
Athens Regional Library System | Athens | GA
Birmingham Public Library | Birmingham | AL
Brooklyn Public Library – Brooklyn Collection | New York City | NY
Buffalo & Erie County Public Library | Buffalo | NY
Cleveland Public Library | Cleveland | OH
Columbus Metropolitan Library | Columbus | OH
County of Los Angeles Public Library | Los Angeles | CA
DC Public Library | Washington | DC
Denver Public Library – Western History and Genealogy Department and Blair-Caldwell African American Research Library | Denver | CO
East Baton Rouge Parish Library | East Baton Rouge | LA
Forbes Library | Northampton | MA
Grand Rapids Public Library | Grand Rapids | MI
Henderson District Public Libraries | Henderson | NV
Kansas City Public Library | Kansas City | MO
Lawrence Public Library | Lawrence | KS
Marshall Lyon County Library | Marshall | MN
New Brunswick Free Public Library | New Brunswick | NJ
Schomburg Center for Research in Black Culture (NYPL) | New York City | NY
Patagonia Library | Patagonia | AZ
Pollard Memorial Library | Lowell | MA
Queens Library | New York City | NY
San Diego Public Library | San Diego | CA
San Francisco Public Library | San Francisco | CA
Sonoma County Public Library | Santa Rosa | CA
The Urbana Free Library | Urbana | IL
West Hartford Public Library | West Hartford | CT
Westborough Public Library | Westborough | MA

Mass downloading 78rpm record transfers

To preserve or discover interesting 78rpm records, you can download them to your own machine (rather than using our collection pages). You can download them in bulk on a Mac or Linux machine using a command-line utility.

Preparation: download the IA command-line tool, like so:

$ curl -LO https://archive.org/download/ia-pex/ia
$ chmod +x ia
$ ./ia help

Option 1: if you just want a set of MP3s to play, download them to your /tmp directory:

./ia download --search "collection:georgeblood" --no-directories --destdir /tmp -g "[!_][!7][!8]*.mp3"

or just blues (or hillbilly, or any other search):

./ia download --search "collection:georgeblood AND blues" --no-directories --destdir /tmp -g "[!_][!7][!8]*.mp3"
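
If you’d like to test the pattern on a single record before pulling thousands, you can download one item by its identifier, using the same flags as above (the identifier below is a placeholder; grab a real one from any item page in the collection):

./ia download some-78rpm-item-identifier --no-directories --destdir /tmp -g "[!_][!7][!8]*.mp3"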

Option 2: if you want to preserve the FLAC, MP3, and metadata files for the best version of each 78rpm record we have. (This uses GNU Parallel: on a Mac, install Homebrew and then type “brew install parallel”; on Linux, try “apt-get install parallel”.) The first command builds a list of item identifiers, the second downloads each item in parallel while logging results to download.log, and the third re-runs only the jobs that the log marks as failed:

./ia search 'collection:georgeblood' --sort=publicdate\ asc --itemlist > itemlist.txt
cat itemlist.txt | parallel --joblog download.log './ia download {} --destdir /tmp -g "[!_][!7][!8]*"'

parallel --retry-failed --joblog download.log './ia download {} --destdir /tmp -g "[!_][!7][!8]*"'

TV News Record: Television Explorer 2.0, shooting coverage & more

A roundup of what’s happening at the TV News Archive, by Katie Dahl and Nancy Watzman.

Explore Television Explorer 2.0

Television Explorer, a tool to search closed captions from the TV News Archive, keeps getting better. Last week GDELT’s Kalev Leetaru added new and improved features:

  • 163 channels are now available to search, from C-SPAN to Al Jazeera to Spanish-language content from Univision and Telemundo.
  • Results now come as a percentage of 15-second clips, making comparisons between channels simpler.
  • The context word function for searches is similarly redesigned, counting a matching 15-second clip as well as searching the 15-second clips immediately before and after, helping to alleviate some previous issues with overcounting.
  • You can now see normalization timelines on the site, with newly available data about the total number of 15-second clips monitored each day and hour included in your query.

Take the revamped Television Explorer for a spin.

Here’s what we found when we used the new tools to track use of the term “cryptocurrency.” The rapid ascent, and occasional fall, of the value of cryptocurrencies such as bitcoin has led to rises and dips in TV news coverage as well. In May 2017, international TV news channels began to run stories featuring the term, with use rising rapidly in November and peaking just last week with BBC News. Television Explorer shows that Deutsche Welle led the pack, ahead of BBC News and Al Jazeera, in covering cryptocurrency. Among US networks, Bloomberg uses the term more than twice as often as Deutsche Welle. A search for the term bitcoin shows a similar trajectory, with CNBC coverage spiking December 11, 2017, a few days before bitcoin hit its historic peak in value to date.

Florida high school shooting TV news coverage shows familiar pattern

Within a broader analysis of how responses to the most recent school shooting compare with others, The Washington Post’s Philip Bump used TV News Archive closed caption data, via GDELT’s Television Explorer, to examine the pattern of use of the term “gun control” on CNN, Fox, and MSNBC. “After the mass shooting in Las Vegas last October, a political discussion about banning ‘bump stocks’ — devices that allowed the shooter to increase his rate of fire — soon collapsed,” Bump wrote. “So far, the conversation after Parkland looks similar to past patterns.”

Washington Post graphic

Fact-check: Trump never said Russia didn’t meddle in election (Pants on Fire!)

“I never said Russia did not meddle in the election”

Reacting to the indictments of Russian nationals by Special Counsel Robert Mueller, President Donald Trump wrote, “I never said Russia did not meddle in the election, I said, ‘It may be Russia, or China or another country or group, or it may be a 400 pound genius sitting in bed and playing with his computer.’ The Russian “hoax” was that the Trump campaign colluded with Russia – it never did!”

Fact-checkers moved quickly to investigate this claim.

The Washington Post’s Fact Checker, Glenn Kessler: “According to The Fact Checker’s database of Trump claims, Trump in his first year as president then 44 more times denounced the Russian probe as a hoax or witch hunt perpetuated by Democrats. For instance, here’s a tweet from the president after reports emerged about the use of Facebook by Russian operatives, a key part of the indictment: ‘The Russia hoax continues, now it’s ads on Facebook. What about the totally biased and dishonest Media coverage in favor of Crooked Hillary?’”

PolitiFact’s Jon Greenberg:  “Pants on Fire!” The president “called the matter a ‘made-up story,’ and a ‘hoax.’ He has said that he believes Russian President Putin’s denial of any Russian involvement. He told Time, ‘I don’t believe they (Russia) interfered.’”

Vox on Fox (& CNN & MSNBC): Mueller indictment, Florida shooting

In an analysis of Fox News, CNN, and MSNBC during the 72 hours following the announcement of the indictment of 13 Russians, Vox’s Alvin Chang used TV News Archive closed captioning data and the GDELT Project’s Television Explorer to show “how Fox News spun the Mueller indictment and Florida shooting into a defense of the president.” Chang uses the data to show that “[I]nstead of focusing on the details of the indictment itself, pundits on Fox News spent a good chunk of their airtime pointing out that this isn’t proof of the Trump administration colluding with Russia.”


TV news coverage and analysis in one place

Scholars, pundits, and reporters have used the data we’ve created here in the TV News Archive in ways that continue to inspire us, adding much-needed context to our chaotic public discourse as seen on TV.  All that content is now in one place, showcasing the work of these researchers and reporters who turned TV news data into something meaningful.

Follow us @tvnewsarchive, and subscribe to our weekly newsletter here.


Expanding the Television Archive

When we started archiving television in 2000, people shrugged and asked, “Why? Isn’t it all junk anyway?” As the saying goes, one person’s junk is another person’s gold. From 2010 to 2018, scholars, pundits, and above all reporters have spun journalistic gold from the data captured in our 1.5 million hours of television news recordings. Our work has been fueled by visionary funders(1) who saw the potential impact of turning television – from news reports to political ads – into data that can be analyzed at scale. Now the Internet Archive is taking its Television Archive in new directions. In 2018 our goals for television will be: better curation of what we collect; broader collection across the globe; and working with computer scientists interested in exploring our huge data sets. Simply put, our mission is to build and preserve comprehensive collections of the world’s most important television programming and make them as accessible as possible to researchers and the general public. We will need your help.

“Preserving TV news is critical, and at the Internet Archive we’ve decided to rededicate ourselves to growing our collection,” explained Roger MacDonald, Director of Television at the Internet Archive. “We plan to go wide, expanding our archives of global TV news from every continent. We also plan to go deep, gathering content from local markets around the country. And we plan to do so in a sustainable way that ensures that this TV will be available to generations to come.”

Libraries, museums and memory institutions have long played a critical role in preserving the cultural output of our creators. Television falls within that mandate. Indeed some of the most comprehensive US television collections are held by the Library of Congress, Vanderbilt University and UCLA. Now we’d like to engage with a broad range of libraries and memory institutions in the television collecting and curation process. If your organization has a mandate to collect television or researcher demand for this media, we would like to understand your needs and interests. The Internet Archive will undertake collection trials with interested institutions, with the eventual goal of making this work self-sustaining.

Simultaneously, we are looking to engage researchers interested in the non-consumptive analysis of television at scale, in ways that continue to respect the interests of rights holders. The tools we’ve created may be useful: for instance, we hope the tools the Internet Archive used to detect TV campaign ads can be applied by researchers in new and different ways. If your organization has an interest in computing with television as data at scale, we are interested in working with you.

This groundbreaking interface for searching television news, based on the closed captions associated with US broadcasts, was developed between 2009 and 2012.

A brief history of the Internet Archive’s Television collection:

2000: Working with pioneering engineer Rod Hewitt, IA begins archiving 20 channels originating from many nations.

Oct. 2001: The September 11, 2001 Collection is established (and enhanced in 2011).

2009-2012: With funding from the Knight Foundation and many others, we build a service to allow public searching, citation, and borrowing of US television news programs on DVD.

2012-2014: The public TV news library launches, with tools to search, quote, and share streamed snippets from television news.

2014: A pilot launches to detect political advertisements broadcast in the Philadelphia region, leading to the development of open-sourced audio fingerprinting techniques.

2016: Political ad detection, curation, and access expand to 28 battleground regions for the 2016 elections, enabling journalists to fact-check the ads and analyze the data at scale. The same tools help reporters analyze presidential debates. This results in front-page data visualizations in The New York Times, as well as 150+ analyses by news outlets from Fox News to The Economist to FiveThirtyEight.

2017 to date: Experiments with artificial intelligence techniques employ facial identification and on-screen optical character recognition to aid searching and data mining of television, along with special curated collections of top political leaders and fact-check integrations.

In the run-up to the 2016 presidential elections, journalists at the NYT and elsewhere began analyzing television as data, in this case looking at the different sound bites each network chose to replay.

Embarking on a new direction also means shifting away from some of our current services. Our dedicated television team has been focusing on metadata enhancement and assisting journalists and scholars to use our data. We will be wrapping up some of these free services in the next three to four months.  We hope others will take up where we left off and build the tools that will make our collection even more valuable to the public.

Now more than ever in this era of disinformation, our world needs an open, reliable, canonical reference source of television news. This cannot exist without the diligent efforts of technologists, journalists, researchers, and television companies all working together to create a television archive open for all. We hope you will join us!

To learn more about the work of the TV News Archive outreach and metadata innovation team over the last few years, please see our blog posts.

(1) Funding for the Television Archive has come from diverse donors, including the John S. and James L. Knight Foundation, Democracy Fund, Rita Allen Foundation, craigslist Charitable Fund and The Buck Foundation.


Military Industrial Powerpoint Complex Karaoke! — Tuesday, March 6

The Internet Archive presents the first-ever Military Powerpoint Karaoke: a night of “Powerpoint Karaoke” using presentations from the Military Industrial Powerpoint Complex collection at archive.org, which the Internet Archive extracted from its public web archive and converted into a special collection of PDFs/EPUBs. The event will take place on Tuesday, March 6th at 7:30 pm at our headquarters in San Francisco. Doors open at 6:30 pm for a reception before the show.

Get Free Tickets Here

Also known as “Battle Decks,” Powerpoint Karaoke is an improvisational art event in which audience members give a presentation using a set of Powerpoint slides that they’ve never seen before. There are three rules: 1) the presenter cannot see the slides before presenting; 2) the presenter delivers each slide in succession, without skipping slides or going back; and 3) the presentation ends when all slides are presented, or after 5 minutes, whichever comes first. We’re thrilled to have Rick Prelinger, creator of Lost Landscapes and the Prelinger Archives, and Avery Trufelman of 99% Invisible joining us to deliver headlining Powerpoint decks. The rest of the presentations will be delivered by you — the audience members who sign up.

This event will use as its source material the Internet Archive’s Military Industrial Powerpoint Complex, a curated special project alongside GifCities that was originally created for the Internet Archive’s 20th anniversary in October 2016. For the project, IA staff extracted all the Powerpoint files from its archive of the government’s public .mil web domain. The collection was expanded in early 2017 to include materials collected during the End of Term project, which archived a snapshot of the .gov and .mil web domains during the administration change. The Military Industrial Powerpoint Complex collection contains over 57,000 Powerpoint decks, charged with material that ranges from the violent to the banal: attack modes, leadership styles, harness types, and ways of requesting vacation days from the US Military. The project was originally inspired by writer Paul Ford’s article “Amazing Military Infographics,” which can be found in the Wayback Machine. As a whole, this collection forms a unique snapshot of our government’s Military Industrial Complex.

This event is organized by artists/archivists Liat Berdugo and Charlie Macquarie in partnership with the Internet Archive.

Tuesday, March 6
6:30 pm Reception
7:30 pm Program

Internet Archive
300 Funston Avenue
San Francisco, CA 94118

Get Free Tickets Here


Emulation in the Browser adds WebAssembly

Since we introduced our approach to Emulation in the Browser (now simply called The Emularity) back in 2013, there have always been plans to continually improve the experience and advance the various web technologies that make it happen.

As of today, the Internet Archive now has a majority of emulated platforms running in WebAssembly.

What is WebAssembly?

WebAssembly (or WASM) is meant to be a replacement for the “executed programs in the browser” aspects of JavaScript. It is designed from the ground up to be open and widely supported, and to take into account the lessons learned from 20 years of JavaScript. The benefits include speed advantages, improvements in code size and transfer, and much easier debugging. It is the result of years of work, and can almost be considered a “do-over” informed by the lessons of JavaScript.

What do I need to do?

You actually don’t need to do much at all! If WebAssembly is enabled in your browser, it will simply become how the program loads when you emulate something at the Internet Archive. (The loader will mention a “WASM Binary” when it’s loading up your emulation.) If you don’t have WASM, or you’ve disabled it, the usual JavaScript loading will happen, as it always has. WASM is supported in the Firefox, Chrome, Safari, Brave, and Edge browsers.

Our DOS, Windows, and Macintosh emulations are still running the older system, with JavaScript as the language. WASM support is now in place for the Console Living Room, the Internet Arcade, and platforms like the Apple II and the ZX Spectrum.

Also, if this is the first time you’ve become aware that we’re emulating over 80,000 software titles in the browser… well, you have a lot of software history to look forward to.

Thanks to Dan Brooks and everyone on the Emularity team for helping to advance us to the next level, as well as the many people working on the WebAssembly standard, to ensure software history is one click away.

We’re always interested in bug reports, or in hearing about any strangeness you notice, so definitely mail me at jscott@archive.org if you run into issues or want more information.


Betting on Bitcoin? Better see this film — Monday, February 26

We all know this person: the friend who bought a new car with her Bitcoin earnings during the boom. The uncle who moved his retirement funds into cryptocurrencies and lost his shirt after the bust. So why is everyone suddenly buzzing about Bitcoin? What do they know that you don’t?

For an evening of Bitcoin 101, come to the Internet Archive Headquarters in San Francisco on Monday, February 26 for a screening of the documentary, “BANKING ON BITCOIN,” directed by Christopher Cannucciari.  

Doors open at 6 p.m. for drinks and snacks along with interactive workshops. Bitcoin community members will be on hand to answer your questions and help you set up your own digital currency wallets. At 7 p.m., join us for a panel discussion before the film with experts explaining the current state of the Bitcoin bubble. At 7:30 pm, we’ll screen the 90-minute independent film, “BANKING ON BITCOIN.”

Directed by Christopher Cannucciari, the independent film “features interviews with enthusiasts and experts, covering Bitcoin’s roots, its future and the technology that makes it tick.” Ticket prices are a suggested donation of $5 or more, but no one will be turned away for lack of funds. The Internet Archive warmly accepts donations in many currencies, including dollars, Bitcoin, Bitcoin Cash, Zcash, and Ethereum.

Get Tickets Here

Or Pay with Bitcoin

Monday, February 26, 2018

6:00 pm—Reception and Bitcoin 101 Workshops

7:00 pm— Introduction/Q&A with Panel of Bitcoin Experts

7:30 pm— “BANKING ON BITCOIN” screening (90 minutes)

Trailer

At the Internet Archive

300 Funston Avenue

San Francisco, CA  94118

Reserve Now!  
