Tag Archives: Wayback Machine

Digital archives: a time machine for the web

Posted on March 5, 2024 by Chris Freeland

This post was originally published in a newsletter by Project Liberty, February 20, 2024. Image by Project Liberty.

In the summer of 2023, the New York Times ran an article titled “Ways You Can Still Cancel Your Federal Student Loan Debt.”

The article outlined six ways to cancel student debt, with the final being:

“Death
This is not something that most people would choose as a solution to their debt burden.”

At least that was the sixth reason until the New York Times revised it with a stealth edit. When you read the article today, choosing death as a solution to a debt burden has been replaced, but there’s no mention that this article was revised. The timestamp is still the day it was originally published.

If not for Internet Archive’s Wayback Machine, this discrepancy wouldn’t have been caught. The Wayback Machine is a digital archive of the internet, and as such, it captured multiple previous versions.

The internet is constantly being revised in ways that allow history to be rewritten and a shared sense of truth to be questioned. With AI-generated disinformation, the potential to exert control over the future by rewriting the past has never been greater.

This week we’re exploring how digital archives are crucial in developing a record of truth in an ever-changing web.

The need for digital archives

Mark Graham, Director of the Wayback Machine, spoke with the Project Liberty Foundation and shared the key reasons why there’s an even greater need for digital archives:

The importance of the internet. So much of what humanity publishes and makes available lives only on the internet. Given how much time we spend online, the internet has become a central medium of human expression, history, and culture.

The fragile and ephemeral nature of the internet. Graham shared two stats that underscore how fragile today’s internet is:

A study found that of the two million hyperlinks in New York Times articles from 1996 to 2019, 25% of all links were broken (described as link rot).
The Wayback Machine has repaired 20 million broken links in Wikipedia articles, replacing them with working ones.

“The web itself is a living thing. Webpages change. They go away on quite a frequent basis. There’s no backup system or version control system for the web,” Graham explained. That is, except for archives like the Wayback Machine.

The Wayback Machine

The Wayback Machine is a “time machine for the web,” in Graham’s words. It allows users to trace the evolution (or disappearance) of a webpage over time, enabling them to establish a record of what happened on the internet.

For example, the Apple.com URL has been archived 539,000 times since its first archived page in October 1996.
The Wayback Machine has archived over 866 billion webpages in its 28-year history. Today, it archives hundreds of millions of webpages every day and has become one of the most important archives of online content in the world.

How it works

The Wayback Machine “crawls” the web and downloads publicly accessible information. Webpages, documents, and data are stored with a time-stamped URL.
For information that’s not publicly accessible, Internet Archive offers web archiving services through Archive-It for 1,200 organizations in 24 countries around the world (from libraries to research institutions).
The Wayback Machine relies on everyday people to help it archive the internet. Anyone can go to Save Page Now to archive a webpage or article.
The Wayback Machine partners with 1,200 fact-checking organizations globally to help it reference material on the web that was the source of disinformation. It has built a library of more than 200,000 examples where a claim has been made, and the Wayback Machine has provided additional context on whether that claim is true (known as a review of the claim).
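
To make the idea of a time-stamped URL concrete, here is a minimal Python sketch. It assumes the publicly documented address patterns: captures under web.archive.org/web/ followed by a 14-digit timestamp, the availability endpoint at archive.org/wayback/available, and Save Page Now under web.archive.org/save/. The function names and the sample response are hypothetical, written only for illustration, and nothing here contacts the network.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import urlencode

WAYBACK_PREFIX = "https://web.archive.org/web"
AVAILABILITY_ENDPOINT = "https://archive.org/wayback/available"
SAVE_PREFIX = "https://web.archive.org/save"


def wayback_timestamp(moment: datetime) -> str:
    """Format a timezone-aware datetime as a 14-digit UTC timestamp."""
    return moment.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")


def capture_url(page_url: str, moment: datetime) -> str:
    """Address that resolves to the capture closest to `moment`."""
    return f"{WAYBACK_PREFIX}/{wayback_timestamp(moment)}/{page_url}"


def availability_query(page_url: str, moment: datetime) -> str:
    """Query URL asking which capture is closest to `moment`."""
    params = urlencode({"url": page_url, "timestamp": wayback_timestamp(moment)})
    return f"{AVAILABILITY_ENDPOINT}?{params}"


def save_page_now_url(page_url: str) -> str:
    """Address a person could visit to request a fresh capture."""
    return f"{SAVE_PREFIX}/{page_url}"


def closest_capture(response: dict) -> Optional[str]:
    """Pull the closest capture's URL out of a parsed availability response."""
    closest = response.get("archived_snapshots", {}).get("closest")
    if closest and closest.get("available"):
        return closest["url"]
    return None


# Hypothetical response, shaped like the availability API's JSON output.
SAMPLE_RESPONSE = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20240305000000/http://example.com/",
            "timestamp": "20240305000000",
            "status": "200",
        }
    }
}

if __name__ == "__main__":
    when = datetime(2024, 3, 5, tzinfo=timezone.utc)
    print(capture_url("example.com", when))
    print(availability_query("example.com", when))
    print(closest_capture(SAMPLE_RESPONSE))
```

Because the timestamp is part of the address, two captures of the same page can be cited side by side, which is how a quiet revision like the one described above can be demonstrated.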

Archive of facts

Fixing links, archiving webpages, and fact-checking digital articles are part of a deeper, more important project to chronicle digital history and establish a record of facts.

Last month, the archive of press releases from a sitting member of Congress, New York’s Elise Stefanik, vanished after she came under scrutiny. The Wayback Machine documented this erasure and provided a time-stamped record of past versions of her website and press releases.
In 2018, a US Appeals court ruled that the Wayback Machine’s archive of webpages can be used as legitimate legal evidence.
The Internet Archive has countless examples of when the press have referenced the Wayback Machine to correct disinformation and dispel rumors. In one example from last year, the Associated Press relied on the Wayback Machine to set the record straight that the CDC did not say the polio vaccine gave millions of Americans a “cancer virus.”

With the rise of AI-generated disinformation, there’s reason to believe such attempts at rewriting history (even if that history is just yesterday) will become more prevalent and the social contract that has governed web crawlers is coming to an end.

A citizen-powered web

Building digital archives is a bulwark against those attempting to rewrite history and spread misinformation. An archived, time-stamped webpage is not just unimpeachable evidence, it’s a foundational building block of a shared sense of reality.

In 2014, when Malaysia Airlines Flight 17 went down over Ukraine, the Wayback Machine captured evidence that a pro-Russian group was behind the missile attack. But it wasn’t the Wayback Machine’s algorithms that captured the evidence by crawling the internet; it was an individual who found an obscure blog post from a Ukrainian separatist leader touting the shooting down of a plane. That individual identified the blog post as important enough to be archived, and it became a critical piece of evidence, even after that post disappeared from the internet.

As Graham said, “You don’t know what you got until it’s gone. If you see something, save something.”

What pages can you help archive? Archive them with the Wayback Machine on Save Page Now.

Meet the Librarians: Sawood Alam, Wayback Machine

Posted on April 8, 2022 by Caralee Adams

To celebrate National Library Week 2022, we are taking readers behind the scenes to Meet the Librarians who work at the Internet Archive and in associated programs.

Sawood Alam was born and raised on a farm in a remote village of India with no smartphones, television or electricity.

“Books were one of the only means of learning and entertainment for us,” said Alam, who checked out as many books as he could from his school library every Thursday. “I had to take my buffalo out every afternoon. It was a boring task out in the field with no one to talk to, so books were my companions.”

When he was 10 years old, Alam helped at his school library, which was all run by children. He said he learned a lot about sorting, indexing and categorizing books—the beginning of a lifelong passion.

Nearly two decades later, Alam completed his PhD in computer science with a specialty in web archiving from Old Dominion University. He was part of the Web Science and Digital Libraries Research Group at the university.

Alam joined the staff of the Internet Archive as a web and data scientist in 2020. Working with the Wayback Machine team, Alam supports researchers from all around the world conducting analyses with Internet Archive collections. When someone has a research question that involves interaction with Wayback Machine APIs or downloading a large number of archived web pages, he helps prepare the data and provides technical assistance. Alam tries to improve the discoverability of items in massive web collections. His data insights and quality assurance efforts enhance web crawling and Wayback Machine operations.

Alam also collaborates with partners from academia, industry, and organizations on various research, development and standardization efforts. His own research has focused on archive profiling, interoperability and cooperation among archives, which are all topics the data scientist writes about and shares on Twitter.

Formal academic training in the field of web archiving is uncommon, said Alam. With his background, he’s able to understand the data scientists’ research needs, he said, making his skills a perfect match for his position at the Internet Archive.

“‘Universal Access to All Knowledge’ is something that certainly resonates for me,” Alam said of the Internet Archive’s mission. “I would like to focus on making it more global.”

In recognition of his contribution to the library community with digital preservation, Alam received the NDSA 2020 Future Stewards Innovation Award.

Beyond his work at the Internet Archive, Alam serves the digital library and web archiving communities by peer-reviewing research papers for journals, chairing conference sessions in his fields of interest, and participating in International Internet Preservation Consortium (IIPC) conversations focused on interoperability, collaboration, and other related topics.

Favorite items in the Internet Archive for Alam? “I established a volunteer-driven online Unicode Urdu books library, UrduWeb Digital Library, during my graduation years. My first language is Urdu so when I see books and materials in Urdu in the Internet Archive it brings me joy. Thanks to the Wayback Machine, I was able to narrate the lost story of the evolution of Urdu blogging on the 20th anniversary of the Internet Archive.”

Reflections as the Internet Archive turns 25

Posted on July 21, 2021 by Brewster Kahle

Photo by Rory Mitchell, The Mercantile, 2020 – CC by 4.0

(L-R) Brewster Kahle, Tamiko Thiel, Carl Feynman at Thinking Machines, May 1985. Photo courtesy of Tamiko Thiel.

A Library of Everything

As a young man, I wanted to help make a new medium that would be a step forward from Gutenberg’s invention hundreds of years before.

By building a Library of Everything in the digital age, I thought the opportunity was not just to make it available to everybody in the world, but to make it better–smarter than paper. By using computers, we could make the Library not just searchable, but organizable; make it so that you could navigate your way through millions, and maybe eventually billions of web pages.

The first step was to make computers that worked for large collections of rich media. The next was to create a network that could tap into computers all over the world: the Arpanet that became the Internet. Next came augmented intelligence, which came to be called search engines. I then helped build WAIS–Wide Area Information Server–that helped publishers get online to anchor this new and open system, which came to be enveloped by the World Wide Web.

By 1996, it was time to start building the library.

This library would have all the published works of humankind. This library would be available not only to those who could pay the $1 per minute that LexisNexis charged, or only at the most elite universities. This would be a library available to anybody, anywhere in the world. Could we take the role of a library a step further, so that everyone’s writings could be included–not only those with a New York book contract? Could we build a multimedia archive that contains not only writings, but also songs, recipes, games, and videos? Could we make it possible for anyone to learn about their grandmother in a hundred years’ time?

From the San Francisco Chronicle, Business Section, May 7, 1988. Photo by Jerry Telfer.

Not about an Exit or an IPO

From the beginning, the Internet Archive had to be a nonprofit because it contains everybody else’s things. Its motives had to be transparent. It had to last a long time.

In Silicon Valley, the goal is to find a profitable exit, either through acquisition or IPO, and go off to do your next thing. That was never my goal. The goal of the Internet Archive is to create a permanent memory for the Web that can be leveraged to make a new Global Mind. To find patterns in the data over time that would provide us with new insights, well beyond what you could do with a search engine. To be not only a historical reference but a living part of the pulse of the Internet.

John Perry Barlow, lyricist for the Grateful Dead & founder of the Electronic Frontier Foundation, accepting the Internet Archive Hero Award, October 21, 2015. Photograph by Brad Shirakawa – CC by 4.0

Looking Way Back

My favorite things from the early era of the Web were the dreamers.

In the early Web, we saw people trying to make a more democratic system work. People tried to make publishing more inclusive.

We also saw the other parts of humanity: the pornographers, the scammers, the spammers, and the trolls. They, too, saw the opportunity to realize their dreams in this new world. At the end of the day, the Internet and the World Wide Web–it’s just us. It’s just a history of humankind. And it has been an experiment in sharing and openness.

The World Wide Web at its best is a mechanism for people to share what they know, almost always for free, and to find one’s community no matter where you are in the world.

Brewster Kahle speaking at the 2019 Charleston Library Conference. Photo by Corey Seeman– CC by 4.0

Looking Way Forward

Over the next 25 years, we have a very different challenge. It’s solving some of the big problems with the Internet that we’re seeing now. Will this be our medium or will it be theirs? Will it be for a small controlling set of organizations or will it be a common good, a public resource?

So many of us trust the Web to find recipes, how to repair your lawnmower, where to buy new shoes, who to date. Trust is perhaps the most valuable asset we have, and squandering that trust will be a global disaster.

We may not have achieved Universal Access to All Knowledge yet, but we still can.

In another 25 years, we can have writings from not a hundred million people, but from a billion people, preserved forever. We can have compensation systems that aren’t driven by advertising models that enrich only a few.

We can have a world with many winners, with people participating, finding communities of like-minded people they can learn from all over the world. We can create an Internet where we feel in control.

I believe we can build this future together. You have already helped the Internet Archive build this future. Over the last 25 years, we’ve amassed billions of pages, 70 petabytes of data to offer to the next generation. Let’s offer it to them in new and exciting ways. Let’s be the builders and dreamers of the next twenty-five years.

See a timeline of Key Moments in Access to Knowledge, videos & an invitation to our 25th Anniversary Virtual Celebration at anniversary.archive.org.

The 20th Century Time Machine

Posted on October 13, 2017 by Nancy Watzman

by Nancy Watzman & Katie Dahl

Jason Scott

With the turn of a dial, some flashing lights, and the requisite puff of fog, emcees Tracey Jaquith, TV Architect, and Jason Scott, Free Range Archivist, cranked up the Internet Archive 20th Century Time Machine on stage before a packed house at the Internet Archive’s annual party on October 11.

Eureka! The cardboard contraption worked! The year was 1912, and out stepped Alexis Rossi, director of Media and Access, her hat adorned with a 78rpm record.

1912

D’Anna Alexander (center) with her mother (right) and grandmother (left).

“Close your eyes and listen,” Rossi asked the audience. And then, out of the speakers floated the scratchy sounds of Billy Murray singing “Low Bridge, Everybody Down” written by Thomas S. Allen. From 1898 to the 1950s, some three million recordings of about three minutes each were made on 78rpm discs. But these discs are now brittle, the music stored on them precious. The Internet Archive is working with partners on the Great 78 Project to store these recordings digitally, so that we and future generations can enjoy them and reflect on our music history. New collections include the Tina Argumedo and Lucrecia Hug 78rpm Collection of dance music collected in Argentina in the mid-1930s.

1927

Next to emerge from the Time Machine was David Leonard, president of the Boston Public Library, which was the first free, municipal library founded in the United States. The mission was and remains bold: make knowledge available to everyone. Knowledge shouldn’t be hidden behind paywalls, restricted to the wealthy but rather should operate under the principle of open access as public good, he explained. Leonard announced that the Boston Public Library would join the Internet Archive’s Great 78 Project, by authorizing the transfer of 200,000 individual 78s and LPs to preserve and make accessible to the public, “a collection that otherwise would remain in storage unavailable to anyone.”

David Leonard and Brewster Kahle

Brewster Kahle, founder and Digital Librarian of the Internet Archive, then came through the time machine to present the Internet Archive Hero Award to Leonard. “I am inspired every time I go through the doors,” said Kahle of the library, noting that the Boston Public Library was the first to digitize not just a presidential library, of John Quincy Adams, but also modern books. Leonard was presented with a tablet imprinted with the Boston Public Library homepage by Internet Archive 2017 Artist in Residence, Jeremiah Jenkins.

1942

Kahle then set the Time Machine to 1942 to explain another new Internet Archive initiative: liberating books published between 1923 and 1941. Working with Elizabeth Townsend Gard, a copyright scholar at Tulane University, the Internet Archive is liberating these books under a little known, and perhaps never used, provision of US copyright law, Section 108h, which allows libraries to scan and make available materials published from 1923 to 1941 if they are not being actively sold. The name of the new collection: the Sonny Bono Memorial Collection, named for the now-deceased congressman who led the passage of the Copyright Term Extension Act of 1998, which included the 108h provision as a “gift” to libraries.

One of these books is “Your Life,” a tome written by Kahle’s grandfather, Douglas E. Lurton, a “guide to a desirable living.” “I have one copy of this book and two sons. According to the law, I can’t make one copy and give it to the other son. But now it’s available,” Kahle explained.

1944

Sab Masada

With the Time Machine cranked to 1944, out came Rick Prelinger, Internet Archive Board member, archivist, and filmmaker. Prelinger introduced a new addition to the Internet Archive’s film collection: long-forgotten footage of an Arkansas Japanese internment camp from 1944. As the film played on the screen, Prelinger welcomed Sab Masada, 87, who lived at this very camp as a 12-year-old.

Masada talked about his experience at the camp and why it is important for people today to remember it. “Since the election I’ve heard echoes of what I heard in 1942,” Masada said. “Using fear of terrorism to target the Muslims and people south of the border.”

1972

Next to speak was Wendy Hanamura, the director of partnerships. Hanamura explained how as a sixth grader she discovered a book at the library, Executive Order 9066, published in 1972, which chronicled photos of Japanese internment camps during World War II.

“Before I was an internet archivist, I was a daughter and granddaughter of American citizens who were locked up behind barbed wire in the same kind of camps that incarcerated Sab,” said Hanamura. That one book – now out of print – helped her understand what had happened to her family.

Inspired by making it to the semi-final round of the MacArthur 100&Change initiative with a proposal that provides libraries and learners with free digital access to four million books, the Internet Archive is forging ahead with plans, despite not winning the $100 million grant. Among the books the Internet Archive is making available: Executive Order 9066.

1985

The year display turned to 1985, and Jason Scott reappeared on stage, explaining his role as a software curator. New this year to the Internet Archive are collections of early Apple software, he explained, with browser emulation allowing the user to experience just what it was like to fire up a Macintosh computer back in its heyday. This includes a collection of the then wildly popular “HyperCards,” a programmatic tool that enabled users to create programs that linked materials in creative ways, before the rise of the world wide web.

1997

After this tour through the 20th century, the Time Machine was set to 1997. Mark Graham, Director of the Wayback Machine, and Vinay Goel, Senior Data Engineer, stepped on stage. Back in 1997, when the Wayback Machine began archiving websites on the still new World Wide Web, the entire thing amounted to 2.2 terabytes of data. Now the Wayback Machine contains 20 petabytes. Graham explained how the Wayback Machine is preserving tweets, government websites, and other materials that could otherwise vanish. One example: this report from The Rachel Maddow Show, which aired on December 16, 2016, about Michael Flynn, then slated to become National Security Advisor. Flynn deleted a tweet he had made linking to a falsified story about Hillary Clinton, but the Internet Archive saved it through the Wayback Machine.

Goel took the microphone to announce new improvements to Wayback Machine Search 2.0. Now it’s possible to search for keywords, such as “climate change,” and find not just web pages from a particular time period mentioning these words, but also different format types — such as images, pdfs, or yes, even an old Internet Archive favorite, animated gifs from the now-defunct GeoCities–including snow globes!

Thanks to all who came out to celebrate with the Internet Archive staff and volunteers, or watched online. Please join our efforts to provide Universal Access to All Knowledge, whatever century it is from.

Editor’s Note, 10/16/17: Watch the full event https://archive.org/details/youtube-j1eYfT1r0Tc

TV News Record: Wayback Machine saves deleted prez tweets

Posted on September 29, 2017 by Nancy Watzman

A weekly roundup of what’s happening and what we’re seeing at the TV News Archive by Katie Dahl and Nancy Watzman. Additional research by Robin Chin.

In this week’s TV News Archive roundup, we explain how presidential tweets are forever, show how different TV cable news networks summarized NFL protests via Third Eye chyron data, and present FiveThirtyEight’s analysis of hurricane coverage (hint: Puerto Rico got less attention.)

Wayback Machine preserved deleted prez tweets; PolitiFact fact-checks legality of prez tweet deletions (murky)

The Internet Archive’s Wayback Machine has preserved President Donald Trump’s deleted tweets praising failed GOP Alabama U.S. Senate candidate Luther Strange following his defeat by Roy Moore on September 26. So does the Pulitzer Prize-winning investigative journalism site ProPublica, through its Politwoops project.

Kudos @propublica saving @realDonaldTrump deleted tweets. Also @internetarchive on Wayback https://t.co/FMkJNZ4xNS https://t.co/xAPRTzCCb0 pic.twitter.com/zXkHzDvkLP

— TV News Archive (@TVNewsArchive) September 27, 2017

The story of Trump’s deleted tweets about Strange was reported far and wide, including this segment on MSNBC’s “Deadline: White House” that aired on September 27.

In a fact-check on the legality of a president deleting tweets, linked in the TV News Archive clip above, John Kruzel reports for PolitiFact that the law is murky and still being fleshed out:

Experts were split over how much enforcement power courts have in the arena of presidential record-keeping, though most seemed to agree the president has the upper hand.

“One of the problems with the Presidential Records Act is that it does not have a lot of teeth,” said Douglas Cox, a professor at the City University of New York School of Law. “The courts have held that the president has wide and almost unreviewable discretion to interpret the Presidential Records Act.”

That said, many of the experts we spoke to are closely monitoring how the court responds to the litigation around Trump administration record-keeping.

He also provides background on that litigation, a lawsuit brought by Citizens for Responsibility and Ethics in Washington. The case is broadly about requirements for preserving presidential records, and a previous set of deleted presidential tweets is a part of it.

Fact Check: NFL attendance and ratings are way down because people love their country (Mostly false)

Speaking of Trump’s tweets, the president ignited an explosion of coverage with an early morning tweet on Sunday, Sept. 24, ahead of a long day of football games: “NFL attendance and ratings are WAY DOWN. Boring games yes, but many stay away because they love our country.”

Manuela Tobias of PolitiFact rated this claim as “mostly false,” reporting, “Ratings were down 8 percent in 2016, but experts said the drop was modest and in line with general ratings for the sports industry. The NFL remains the most watched televised sports event in the United States.” “As for political motivation, there’s little evidence to suggest people are boycotting the NFL. Most of the professional sports franchises are dealing with declines in popularity.”

How did different cable TV news networks cover the NFL protests?

We first used the Television Explorer tool to see where there was a spike in the use of the word “NFL” near the word “Trump.” It looked like Sunday showed the most use of these words. After a closer look, we saw MSNBC, Fox News, and CNN all showed highest mentions of these terms around 2 pm Pacific.

Spike at 2 pm (PST) for CNN, MSNBC, and Fox News

Then we downloaded data from the new Third Eye project, which turns TV News chyrons into data, filtering for that date and hour. We were able to see how the three cable news networks were summarizing the news at that particular point in time.
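
As a rough illustration of that filtering step, here is a minimal Python sketch. It assumes a simplified three-column TSV layout (timestamp, channel, chyron text) that we made up for this example; the real Third Eye download may use different columns and time zones. The function name is hypothetical, and the third sample row is a placeholder added only to show a row falling outside the hour.

```python
import csv
import io
from datetime import datetime

# Hypothetical three-column layout: timestamp, channel, chyron text.
# The first two texts are the chyrons quoted in this post; the third row
# is a placeholder that falls outside the requested hour.
SAMPLE_TSV = (
    "2017-09-24 14:02:10\tCNN\tNFL teams kneel, link arms in defiance of Trump\n"
    "2017-09-24 14:02:15\tMSNBC\tTaking a knee: NFL teams send a message\n"
    "2017-09-24 15:30:00\tCNN\t(placeholder chyron from a later hour)\n"
)


def chyrons_in_hour(tsv_text, day, hour):
    """Group chyron texts by channel for one date ("YYYY-MM-DD") and hour (0-23)."""
    by_channel = {}
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for stamp, channel, text in reader:
        seen = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Keep only rows whose date and hour match the request.
        if seen.strftime("%Y-%m-%d") == day and seen.hour == hour:
            by_channel.setdefault(channel, []).append(text)
    return by_channel


if __name__ == "__main__":
    for channel, texts in chyrons_in_hour(SAMPLE_TSV, "2017-09-24", 14).items():
        print(channel, texts)
```

Grouping by channel keeps each network's wording for the same hour side by side, which is the comparison made in the paragraphs that follow.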

At about 2:02, CNN broadcast this chyron: “NFL teams kneel, link arms in defiance of Trump.”

Screen grab of chyron caught by Third Eye from 2:02 pm 9/24/17 on CNN

Fox News chose the following, also seen below tweeted from one of the Third Eye twitter bots: “Some NFL owners criticize Trump’s statements on player protests, link arms with players”

FOXNEWS 2:02pm SOME NFL OWNERS CRITICIZE TRUMP’S STATEMENTS ‘. . ON PLAYER PROTESTS, LINK ARMS WITH PLAYERS i
FOX N EWS ALERT

— The Third Eye (@tvThirdEyeF) September 24, 2017

Meanwhile, MSNBC chose a different message: “Taking a knee: NFL teams send a message.”

Screen grab of chyron caught by Third Eye from 2:02 pm 9/24/17 on MSNBC

About eight minutes later, all three cable channels were still reporting on the NFL protests:

Puerto Rico’s hurricane Maria got less media attention than hurricanes Harvey & Irma

Writing for FiveThirtyEight.com, Dhrumil Mehta demonstrated that both online news sites and TV news broadcasters paid less attention to Puerto Rico’s hurricane Maria than to hurricanes Harvey and Irma, which hit mainland U.S. primarily in Texas and Florida. Mehta used TV News Archive data via Television Explorer, as well as data from Media Cloud on online news coverage, to help make his case:

While Puerto Rico suffers after Hurricane Maria, much of the U.S. media (FiveThirtyEight not excepted) has been occupied with other things: a health care bill that failed to pass, a primary election in Alabama, and a spat between the president and sports players, just to name a few. Last Sunday alone, after President Trump’s tweets about the NFL, the phrase “national anthem” was said in more sentences on TV news than “Puerto Rico” and “Hurricane Maria” combined.

To receive the TV News Archive’s email newsletter, subscribe here.

TV news chyron data provide ways to explore breaking news reports & bias

Posted on September 21, 2017 by Nancy Watzman

Today the Internet Archive’s TV News Archive announces a new way to plumb our TV news collections to see how news stories are reported: data feeds for the news that appears as chyrons on the lower thirds of TV screens. Our Third Eye project scans the lower thirds of TV screens, using OCR, or optical character recognition, to turn these fleeting missives into downloadable data ripe for analysis. At launch, Third Eye tracks BBC News, CNN, Fox News, and MSNBC, and contains more than four million chyrons captured in just over two weeks.

Download Third Eye data. API and TSV options available.

Follow Third Eye on Twitter.

Third Eye joins a growing suite of TV News Archive tools that help researchers, journalists, and the public analyze how news is filtered through TV and presented to the public. These include Face-o-Matic, created through a partnership with Matroid, which uses facial recognition to find top political leaders on TV news shows; and Television Explorer, an interface created by data scientist Kalev Leetaru that allows easy searching and visualization of TV News Archive closed captioning. The Political TV Ad Archive used audio fingerprinting to find airings of political ads in the 2016 elections, and the Trump and U.S. Congress archives provide a quick way to see news clips featuring top political figures, alongside associated fact checks by FactCheck.org, PolitiFact, and The Washington Post‘s Fact Checker.

Breaking news often appears as chyrons on TV before newscasters begin reporting or video is available, whether the subject is a hurricane or a breaking political story. Which chyrons a TV news network chooses to display often reveals editorial decisions that can demonstrate a particular slant on the news. With Third Eye data, journalists, fact-checkers, and researchers can explore how messages are delivered to the public in near real-time.

Third Eye on Twitter tweets the clearest, most representative chyron from a one-minute period on a particular TV news channel. This can serve as an alert system, showing how TV networks are reporting news.

For example, on September 6, 2017, in the midst of a heavy news day featuring Hurricane Irma, the debate over a deal on immigration, and other stories, TV news cable networks began to show the breaking news that Facebook had turned over information about $100,000 in ads purchased by Russian sources during the 2016 elections to Robert S. Mueller III, the special counsel investigating ties between the Trump campaign and Russia. Our Third Eye CNN Twitter bot tweeted out this chyron recorded at 2:38 pm Pacific Standard Time.

CNN 2:38pm FACEBOOK: SOLD ADS TO FAKE RUSSIAN ACCOUNTS DURING ELECTION CAMPAIGN

— The Third Eye (@tvThirdEye) September 6, 2017

Here is the corresponding clip as it appears on the TV News Archive.

At 2:51 p.m., MSNBC ran this chyron: “FACEBOOK: WE SOLD POLITICAL ADS DURING ELECTION TO COMPANY LIKELY OPERATED IN RUSSIA.” The corresponding clip is below.

However, our data do not show Fox News running any chyrons on the Facebook ad news that day. To cross-check, we used Television Explorer, a tool for searching TV News Archive closed captions. (Captions differ from chyrons; captions capture what news anchors are actually saying, as opposed to chyrons, which feature text chosen by the TV channel to run at the bottom of the screen.) Television Explorer shows CNN and MSNBC covering the story on September 6, but not Fox News.

However, the Facebook ad story did make it on to the Fox News website during the 2 p.m. hour, as this search on the Wayback Machine shows.

This is just one example of the way that researchers might use Third Eye chyron data in conjunction with other tools to explore how a particular story is portrayed on TV news. We’d love for others to dig in, explore, and give us feedback on this new public data source.

More on Third Eye data

The work of the Internet Archive’s TV architect Tracey Jaquith, the Third Eye project applies OCR to the “lower thirds” of TV cable news screens to capture the text that appears there. The chyrons are not captions, which provide the text for what people are saying on screen, but rather are narrative display text that accompanies news broadcasts.

Created in real time by TV news editors, chyrons sometimes include misspellings. The OCR process adds further errors when it does not render text correctly, so some entries may be garbled. To make sense of the noise, Jaquith applies algorithms that choose the most representative chyrons from each channel over 60-second increments. This cleaned-up feed is what fuels the Twitter bots that post which chyrons are appearing on TV news screens.
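To give a feel for what this kind of filtering can look like, here is a minimal sketch in Python. It is not the Third Eye implementation, whose algorithms are not described here. It assumes each raw entry is a hypothetical `(unix_seconds, channel, ocr_text)` tuple, groups entries into 60-second buckets per channel, and keeps the entry most similar to the others in its bucket (a medoid, scored with the standard library's `difflib.SequenceMatcher`). The function names and the tuple layout are illustrative assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def most_representative(texts):
    """Return the text with the highest total similarity to the others (a medoid)."""
    if len(texts) == 1:
        return texts[0]

    def score(index):
        candidate = texts[index]
        return sum(
            SequenceMatcher(None, candidate, other).ratio()
            for j, other in enumerate(texts)
            if j != index
        )

    # On ties, max() keeps the earliest entry in the bucket.
    return texts[max(range(len(texts)), key=score)]


def filter_feed(entries, window_seconds=60):
    """Pick one chyron per (channel, time bucket).

    entries: iterable of (unix_seconds, channel, ocr_text) tuples.
    This tuple layout is an assumption, not the real feed format.
    """
    buckets = defaultdict(list)
    for unix_seconds, channel, text in entries:
        buckets[(channel, unix_seconds // window_seconds)].append(text)
    return {key: most_representative(texts) for key, texts in sorted(buckets.items())}


if __name__ == "__main__":
    sample = [
        (0, "CNN", "FACEBOOK: SOLD ADS TO FAKE RUSSIAN ACCOUNTS"),
        (1, "CNN", "FACEB0OK: S0LD AD5 T0 FAKE RUSS1AN ACC0UNT5"),
        (2, "CNN", "FACEBOOK: SOLD ADS TO FAKE RUSSIAN ACCOUNTS"),
    ]
    for key, text in filter_feed(sample).items():
        print(key, text)
```

A garbled OCR reading tends to resemble the clean readings less than they resemble each other, so the medoid usually lands on a clean one. Other choices, such as majority voting on exact strings, would be equally reasonable starting points.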

We provide options to download this filtered feed and/or the raw feed nearly as soon as it appears on the TV screen. Both may be useful depending on the type of project. In addition, the Twitter feed itself is a good source to see what the filtered feed looks like.

Some notes:

Chyrons are derived in near real time from the TV News Archive’s collection of TV news. The constantly updating public collection contains 1.4 million TV news shows, some dating back to 2009.
At launch, Third Eye captures four TV cable news channels: BBC News, CNN, Fox News, and MSNBC.
Data can be affected by temporary collection outages, which typically can last minutes or hours, but rarely more. If you are concerned about a specific time gap in a feed and would like to know if it’s the result of an outage, please inquire at tvnews@archive.org.
The “raw feed” option provides all of the OCR’ed text from chyrons at the rate of approximately one entry per second. The “filtered tweets feed” provides the data that fuels our Twitter bots; this has been filtered to find the most representative, clearest chyrons from a 60-second period, with no more than one entry/tweet per minute (though the duration may be shorter than 60 seconds). The filtered feed relies on algorithms that are a work in progress; we invite you to share your ideas on how to effectively filter the noise from the raw data.
Dates/times are in UTC (Coordinated Universal Time) in API feeds, PST (Pacific Standard Time) in tweets.
Because the size of the raw data is so large (about 20 megabytes per day), we limit results to seven days per request.
We began collecting raw data on August 25, 2017; the filtered feed begins on September 7, 2017.
The “Duration” column is in seconds: the amount of time that particular chyron appeared on the screen.
To view clips in context on the TV News Archive, paste “https://archive.org/details/” before the field that begins with a channel name. For example, “FOXNEWSW_20170919_100000_FOX__Friends/start/792” becomes “https://archive.org/details/FOXNEWSW_20170919_100000_FOX__Friends/start/792”
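Two of the notes above lend themselves to a short sketch: building a clip URL from the field that begins with a channel name, and splitting a longer date range into requests that respect the seven-day limit. The Python below is illustrative only. The function names are hypothetical, it does not call any Archive API, and it assumes the seven-day limit counts both the first and last day of a window.

```python
from datetime import date, timedelta

DETAILS_PREFIX = "https://archive.org/details/"
MAX_DAYS_PER_REQUEST = 7  # limit stated in the notes above


def clip_url(clip_field):
    """Prepend the details prefix to the field that begins with a channel name."""
    return DETAILS_PREFIX + clip_field


def request_windows(start, end, max_days=MAX_DAYS_PER_REQUEST):
    """Split the inclusive range [start, end] into windows of at most max_days days.

    Assumption: a window from day 1 through day 7 counts as seven days.
    """
    if end < start:
        raise ValueError("end must not be earlier than start")
    windows = []
    cursor = start
    while cursor <= end:
        window_end = min(cursor + timedelta(days=max_days - 1), end)
        windows.append((cursor, window_end))
        cursor = window_end + timedelta(days=1)
    return windows


if __name__ == "__main__":
    print(clip_url("FOXNEWSW_20170919_100000_FOX__Friends/start/792"))
    # prints https://archive.org/details/FOXNEWSW_20170919_100000_FOX__Friends/start/792

    # Raw data collection began on August 25, 2017.
    for first_day, last_day in request_windows(date(2017, 8, 25), date(2017, 9, 7)):
        print(first_day.isoformat(), "to", last_day.isoformat())
```

Each window can then be turned into one download request. How the dates are passed to the feed is left out here because it depends on the download options you use.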

We want to hear from you! Please contact us with questions, feedback, concerns – and also to tell us what project you’ve done with the TV News Archive’s Third Eye project: tvnews@archive.org. Follow us @tvnewsarchive, and subscribe to our weekly newsletter here.

Thanks to Robin Chin, Katie Dahl, Dan Schultz, and the TV News Archive director, Roger Macdonald, for contributing to this project.

Internet Archive to help First Draft News debunk fake news

Posted on April 6, 2017 by Nancy Watzman

We are delighted to announce a new partnership with First Draft News, a nonpartisan organization dedicated to ferreting out misinformation online.

In its short existence (it was founded in June 2015), First Draft News has already spearheaded innovative projects that bring together news organizations, social technology companies, and human rights organizations to verify the information that flows to online audiences. First Draft also helps define the problem: in February, Claire Wardle, the group’s research director, published a helpful taxonomy of the different types of fake news and misinformation that proliferate online.

Example: with French elections fast approaching on April 23, 2017, First Draft News launched CrossCheck, a project combining the efforts of more than 37 newsroom partners, as well as journalism students across France and beyond. They’ve been working together to debunk false rumors and news reports in a much-watched contest pitting the far-right National Front leader Marine Le Pen against centrist Emmanuel Macron, defender of the European Union, as well as other candidates.

This partnership has quashed reports that 30 percent of Macron’s campaign funding comes from Saudi Arabia, that France is spending 100 million euros to buy hotels to house immigrants, and that the country is planning to replace Christian public holidays with Muslim and Jewish holidays, plus many more. These false stories had been shared thousands of times on social media.

When the elections are over, First Draft News will research whether CrossCheck’s efforts were effective, or how they may be modified to become more so. “CrossCheck is a living laboratory,” says Aimee Rinehart, manager of First Draft’s Partner Network. Wardle will lead the efforts to determine whether the CrossCheck model, where several news organizations sign off on a fact-check or verification, builds public trust in the media, the lack of which is an increasing problem worldwide.

Already, First Draft News partners rely heavily on the Internet Archive’s Wayback Machine to verify information online. With our new collaboration, we hope to increase use of other Internet Archive resources, including our searchable collection of TV news and curated archives such as the Trump Archive, with its linked fact-checks by national fact-checking organizations. We also hope the collaboration provides valuable input for our plans to apply more machine learning tools to the TV News Archive that could help inform reliable news reporting in the future.

Preserving U.S. Government Websites and Data as the Obama Term Ends

Posted on December 15, 2016 by jefferson

Long before the 2016 presidential election cycle, librarians understood this often-overlooked fact: vast amounts of government data and digital information are at risk of vanishing when a presidential term ends and administrations change. For example, 83% of .gov PDFs disappeared between 2008 and 2012.

That is why the Internet Archive, along with partners from the Library of Congress, University of North Texas, George Washington University, Stanford University, California Digital Library, and other public and private libraries, is hard at work on the End of Term Web Archive, a wide-ranging effort to preserve the entirety of the federal government web presence, especially the .gov and .mil domains, along with federal websites on other domains and official government social media accounts.

The End of Term Web Archive is not the only project the Internet Archive is undertaking to preserve government websites, FTP sites, and databases at this time, but it is a far-reaching one.

The Internet Archive is collecting webpages from over 6,000 government domains, over 200,000 hosts, and feeds from around 10,000 official federal social media accounts. The effort is likely to preserve hundreds of millions of individual government webpages and data and could end up totaling well over 100 terabytes of archived materials. Over its full history of web archiving, the Internet Archive has preserved over 3.5 billion URLs from the .gov domain, including over 45 million PDFs.

This end-of-term collection builds on similar initiatives in 2008 and 2012 by original partners Internet Archive, Library of Congress, University of North Texas, and California Digital Library to document the “gov web,” which has no mandated, domain-wide single custodian. For instance, here is the National Institute for Literacy (NIFL) website in 2008. The domain went offline in 2011. Similarly, the Sustainable Development Indicators (SDI) site was later taken down. Other websites, such as invasivespecies.gov, were later folded into larger agency domains. Every web page archived is accessible through the Wayback Machine, and past and current End of Term collections are full-text searchable through the main End of Term portal. We have also worked with additional partners to provide access to the full data for use in data-mining research and projects.

The project has received considerable press attention this year, with related stories in The New York Times, Politico, The Washington Post, Library Journal, Motherboard, and others.

“No single government entity is responsible for archiving the entire federal government’s web presence,” explained Jefferson Bailey, the Internet Archive’s Director of Web Archiving. “Web data is already highly ephemeral and websites without a mandated custodian are even more imperiled. These sites include significant amounts of publicly-funded federal research, data, projects, and reporting that may only exist or be published on the web. This is tremendously important historical information. It also creates an amazing opportunity for libraries and archives to join forces and resources and collaborate to archive and provide permanent access to this material.”

This year has also seen a significant increase in citizen- and librarian-driven “hackathons” and “nomination-a-thons” where subject experts and concerned information professionals crowdsource lists of high-value or endangered websites for the End of Term archiving partners to crawl. Librarian groups in New York City are holding nomination events to make sure important sites are preserved. And universities such as the University of Toronto are holding events for “guerrilla archiving” focused specifically on preserving climate-related data.

We need your help too! You can use the End of Term Nomination Tool to nominate any .gov or government website or social media site and it will be archived by the project team. If you have other ideas, please comment here or send ideas to info@archive.org. And you can also help by donating to the Internet Archive to help our continued mission to provide “Universal Access to All Knowledge.”

Blacked Out Government Websites Available Through Wayback Machine

Posted on October 2, 2013 by Brewster Kahle

(from the Internet Archive’s Archive-It group: Announcing the first ever Archive-It US Government Shutdown Notice Awards!)

Congress has caused the U.S. federal government to shut down and important websites have gone dark. Fortunately, we have the Wayback Machine to help.

Many government sites are displaying messages saying that they are not being updated or maintained during the government shutdown, but the following sites are some that have completely shut their doors today. Clicking the logos will take you to a Wayback Machine archived capture of the site. Please donate to help us keep the government websites available. You can also suggest pages for us to archive so that we can document the shutdown.

National Oceanic and Atmospheric Administration noaa.gov	National Park Service nps.gov	Library of Congress loc.gov
National Science Foundation nsf.gov	Federal Communications Commission fcc.gov	Bureau of the Census census.gov
U.S. Department of Agriculture usda.gov	United States Geological Survey usgs.gov	U.S. International Trade Commission usitc.gov
Federal Trade Commission ftc.gov	National Aeronautics and Space Administration nasa.gov	International Trade Administration trade.gov
Corporation for National and Community Service nationalservice.gov
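If you would rather build the Wayback Machine links yourself, the sketch below shows the general URL pattern in Python: the `https://web.archive.org/web/` prefix, a timestamp, and then the original address. The Wayback Machine redirects to the capture closest to the timestamp it is given. The function name, the short list of domains, and the default date (the day before the shutdown began) are illustrative choices, not part of any official tool.

```python
WAYBACK_PREFIX = "https://web.archive.org/web/"

# A few of the sites listed above; extend as needed.
SHUTDOWN_DOMAINS = ["noaa.gov", "nps.gov", "loc.gov", "nasa.gov"]


def wayback_url(domain, timestamp="20130930"):
    """Build a Wayback Machine URL for a capture near the given timestamp.

    timestamp uses digits only (YYYYMMDD here); the default is an
    illustrative date chosen for this example.
    """
    return f"{WAYBACK_PREFIX}{timestamp}/http://{domain}/"


if __name__ == "__main__":
    for domain in SHUTDOWN_DOMAINS:
        print(wayback_url(domain))
```

Opening one of these URLs in a browser shows the archived capture nearest that date, if one exists.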

A New Kind of Datacenter

Posted on March 26, 2009 by jeff kaplan

A note from Internet Archive’s founder, Brewster Kahle:

Today (March 25, 2009) the Internet Archive and Sun Microsystems are launching a new datacenter that stores the whole web archive and serves the Wayback Machine.

And it is a modular datacenter that sits outside in a shipping container. This 3-petabyte (3 million gigabyte) datacenter will handle the 500 requests per second as it takes over the full Wayback load.
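For a sense of scale, the figures above can be checked with a few lines of Python. This is only back-of-the-envelope arithmetic, assuming decimal (SI) units and, purely for illustration, a steady 500 requests per second around the clock.

```python
BYTES_PER_GIGABYTE = 10**9   # decimal (SI) units assumed
BYTES_PER_PETABYTE = 10**15
SECONDS_PER_DAY = 24 * 60 * 60

capacity_petabytes = 3
requests_per_second = 500    # assumed constant for this estimate

capacity_gigabytes = capacity_petabytes * BYTES_PER_PETABYTE // BYTES_PER_GIGABYTE
requests_per_day = requests_per_second * SECONDS_PER_DAY

print(capacity_gigabytes)  # 3000000
print(requests_per_day)    # 43200000
```

At that rate the container would answer about 43 million requests a day.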

Thank you to the Sun and Internet Archive staff who helped conceive and build this new perspective on long-term active archiving.

In the press:
Sun Microsystems
Slashdot
Metafilter
San Francisco Chronicle
Computerworld
Good Morning Silicon Valley
