Anti-Hallucination Add-on for AI Services Possibility

Chatbots, like OpenAI’s ChatGPT, Google’s Bard, and others, have a hallucination problem (their term, not ours). They can make something up and state it authoritatively. It is a real problem. But there may be an old-fashioned answer, as a parent might say: “Look it up!”

Imagine for a moment that the Internet Archive, working with responsible AI companies and research projects, could automate “Looking it Up” in a vast library to make those services more dependable, reliable, and trustworthy. How?

The Internet Archive and AI companies could offer an anti-hallucination ‘add-on’ service for chatbots that cites supporting evidence and counterevidence for chatbot assertions by leveraging the library collections at the Internet Archive (most of which were published before generative AI).

By citing evidence for and against assertions based on papers, books, newspapers, magazines, TV, radio, and government documents, we can build a stronger, more reliable knowledge infrastructure for a generation that turns to its screens for answers. Although many generative AI companies are already linking, or intend to link, their models to the internet, what the Internet Archive can uniquely offer is our vast collection of “historical internet” content. We have been archiving the web for 27 years, which means we hold decades of human-generated knowledge. This may become invaluable in an age when we may see a drastic increase in AI-generated content. So an Internet Archive add-on is not just a matter of leveraging knowledge available on the internet, but also knowledge available from the history of the internet.

Is this possible? We think yes, because we are already doing something like this for Wikipedia, by hand and with special-purpose robots like InternetArchiveBot. Wikipedia communities and these bots have fixed over 17 million broken links and have linked one million assertions to specific pages in over 250,000 books. With the help of the AI companies, we believe we can make this an automated process that responds to the customized essays their services produce. Much of the same technology used for the chatbots can be used to mine assertions in the literature and find when, and in what context, those assertions were made.
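
To sketch how such an automated process might fit together, here is a minimal, hypothetical Python outline of a “look it up” loop. The `search_fn` and `stance_fn` components stand in for a full-text search over the library’s collections and an entailment-style classifier; neither is an existing Internet Archive service, and a production add-on would be far more involved.

```python
# A hypothetical sketch of an anti-hallucination "look it up" loop.
# search_fn and stance_fn are placeholders, not real Internet Archive APIs.

from dataclasses import dataclass

@dataclass
class Citation:
    source_id: str   # e.g. an archive.org item identifier
    passage: str     # the matched passage from the collection
    stance: str      # "supports" or "contradicts"

def lookup_assertion(assertion: str, search_fn, stance_fn, top_k: int = 5):
    """Gather supporting and contradicting citations for one assertion.

    search_fn(query) -> list of (source_id, passage) pairs (an assumed
    full-text search); stance_fn(assertion, passage) -> "supports",
    "contradicts", or "neutral" (an assumed NLI-style classifier).
    """
    citations = []
    for source_id, passage in search_fn(assertion)[:top_k]:
        stance = stance_fn(assertion, passage)
        if stance != "neutral":
            citations.append(Citation(source_id, passage, stance))
    return citations
```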

The result would be a more dependable World Wide Web, one where disinformation and propaganda are easier to challenge, and therefore weaken.

Yes, four major publishers are suing to destroy a significant part of the Internet Archive’s book corpus, but we are appealing this ruling. We believe that one role of a research library like the Internet Archive is to own collections that can be used in new ways by researchers and the general public to understand their world.

What is required? Common purpose, partners, and money. We see a role for a Public AI Research laboratory that can mine vast collections without raising rights issues. While the collections are significant already, we see collecting, digitizing, and making available the publications of democracies around the world as a way to greatly expand the corpus.

We see roles for scientists, researchers, humanists, ethicists, engineers, governments, and philanthropists, working together to build a better Internet.

If you would like to be involved, please contact Mark Graham at mark@archive.org.

Internet Archive weighs in on Artificial Intelligence at the Copyright Office

All too often, the formulation of copyright policy in the United States has been dominated by incumbent copyright industries. As Professor Jessica Litman explained in a recent Internet Archive book talk, copyright laws in the 20th century were largely “worked out by the industries that were the beneficiaries of copyright” to favor their economic interests. In these circumstances, Professor Litman has written, the Copyright Office “plays a crucial role in managing the multilateral negotiations and interpreting their results to Congress.” And at various times in history, the Office has had the opportunity to use this role to add balance to the policymaking process.

We at the Internet Archive are always pleased to see the Copyright Office invite a broad range of voices to discussions of copyright policy and to participate in such discussions ourselves. We did just that earlier this month, participating in a session at the United States Copyright Office on Copyright and Artificial Intelligence. This was the first in a series of sessions the Office will be hosting throughout the first half of 2023, as it works through its “initiative to examine the copyright law and policy issues raised by artificial intelligence (AI) technology.”

As we explained at the event, innovative machine learning and artificial intelligence technology is already helping us build our library. For example, our process for digitizing texts–including never-before-digitized government documents–has been significantly improved by the introduction of LSTM technology. And state-of-the-art AI tools have helped us improve our collection of 100-year-old 78rpm records. Policymakers dazzled by the latest developments in consumer-facing AI should not forget that there are other uses of this general-purpose technology–many of them outside the commercial context of traditional copyright industries–which nevertheless serve the purpose of copyright: “to increase and not to impede the harvest of knowledge.”
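
For readers curious what “LSTM technology” looks like in practice for text digitization, here is a small illustration using Tesseract’s neural (LSTM) OCR engine via the pytesseract wrapper. This is only an analogous example under our own assumptions, not the Archive’s actual digitization pipeline.

```python
# Illustrative only: OCR with an LSTM-based engine, not the Archive's
# actual pipeline. Requires the tesseract binary plus
# `pip install pytesseract pillow`; --oem 1 selects the LSTM engine.

from PIL import Image
import pytesseract

page = Image.open("scanned_page.png")  # any scanned document image
text = pytesseract.image_to_string(page, config="--oem 1")
print(text)
```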

Traditional copyright policymaking also frequently excludes or overlooks the world of open licensing. But in this new space, many of the tools come from the open source community, and much of the data comes from openly-licensed sources like Wikipedia or Flickr Commons. Industry groups that claim to represent the voice of authors typically do not represent such creators, and their proposed solutions–usually, demands that payment be made to corporate publishers or to collective rights management organizations–often don’t benefit, and are inconsistent with, the thinking of the open world.

Moreover, even aside from openly licensed material, there are vast troves of technically copyrighted but not actively rights-managed content on the open web; these are also used to train AI models. Millions, if not billions, of individuals have contributed to these data sources, and because none of them are required to register their work for copyright to arise, it does not seem possible or sensible to try to identify all of the relevant copyright owners–let alone negotiate with each of them–before development can continue. Recognizing these and a variety of other concerns, the European Union has already codified copyright exceptions which permit the use of copyright-protected material as training data for generative AI models, subject to an opt-out in commercial situations and potential new transparency obligations.

To be sure, there are legitimate concerns over how generative AI could impact creative workers and cause other kinds of harm. But it is important for copyright policymakers to recognize that artificial intelligence technology has the potential to promote the progress of science and the useful arts on a tremendous scale. It is both sensible and lawful as a matter of US copyright law to let the robots read. Let’s make sure that the process described by Professor Litman does not get in the way of building AI tools that work for everyone.

AI Audio Challenge: Audio Restoration of 78rpm Records based on Expert Examples

http://great78.archive.org/

Hopefully we have a dataset primed for AI researchers to do something really useful, and fun: taking the noise out of digitized 78rpm records.

The Internet Archive has 1,600 examples of quality human restorations of 78rpm records, where the best tools were used to ‘lightly restore’ the audio files. This takes away scratchy surface noise while trying not to impair the music or speech. Also included in those items are the unrestored original files that the restorations were made from.

But the Internet Archive also has over 400,000 unrestored files that are quite scratchy and difficult to listen to.
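
For anyone who wants to poke at the data, the `internetarchive` Python client (pip install internetarchive) can list and download items. The collection identifier and format name below are our best guesses for the Great 78 Project materials; verify them against the item pages before starting a large crawl.

```python
# A sketch of fetching 78rpm audio with the `internetarchive` client.
# "georgeblood" is assumed to be the Great 78 Project collection
# identifier, and "VBR MP3" an available format; confirm both first.

from internetarchive import search_items, download

results = search_items("collection:georgeblood", fields=["identifier"])
for i, result in enumerate(results):
    if i >= 5:  # just grab a handful to start
        break
    download(result["identifier"], formats=["VBR MP3"], destdir="great78/")
```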

The goal, or rather the hope, is a program that can take all or many of the 400,000 unrestored records and make them much better. How hard this will be is unknown, but hopefully it is a fun project to work on.
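
One plausible starting point, sketched below under our own assumptions, is to treat the 1,600 human restorations as paired training data and learn a spectrogram-domain denoising mask from (unrestored, restored) pairs. The tiny model, loss, and features here are placeholders; a serious entry would need time-aligned pairs and far more capacity.

```python
# A toy paired-data denoiser sketch (PyTorch + torchaudio), assuming each
# unrestored/restored pair is mono and time-aligned. Placeholder model.

import torch
import torch.nn as nn
import torchaudio

spec = torchaudio.transforms.Spectrogram(n_fft=1024, hop_length=256)

def log_spec(wav: torch.Tensor) -> torch.Tensor:
    # wav: mono waveform of shape (1, num_samples)
    return torch.log1p(spec(wav))

# Deliberately tiny convolutional mask estimator.
model = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1), nn.Sigmoid(),  # mask in [0, 1]
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

def train_step(noisy_wav: torch.Tensor, clean_wav: torch.Tensor) -> float:
    """One gradient step on a single (unrestored, restored) pair."""
    noisy = log_spec(noisy_wav).unsqueeze(0)  # (1, 1, freq, frames)
    clean = log_spec(clean_wav).unsqueeze(0)
    mask = model(noisy)                       # per-bin attenuation
    loss = nn.functional.l1_loss(mask * noisy, clean)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```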

Many of the recordings are great and worth the effort. Please comment on this post if you are interested in diving in.

AI@IA — Extracting Words Sung on 100-year-old 78rpm records

A post in the series about how the Internet Archive is using AI to help build the library.

Freely available artificial intelligence tools are now able to extract words sung on 78rpm records. The results may not be full lyrics, but we hope they can help browsing, searching, and researching.

Whisper is an open source tool from OpenAI “that approaches human level robustness and accuracy on English speech recognition.” We were surprised how far it could get with recognizing spoken words on noisy disks, and even words being sung.
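
Running Whisper on a digitized side takes only a few lines; the sketch below uses the open source package (pip install openai-whisper) with an illustrative file name, and larger checkpoints than “base” generally handle noisy audio better.

```python
# Minimal Whisper transcription of a digitized 78rpm side.
# The file name is illustrative; any audio file Whisper supports works.

import whisper

model = whisper.load_model("base")             # small, fast checkpoint
result = model.transcribe("78rpm_side_a.mp3")  # noisy historical audio
print(result["text"])                          # the recognized words
```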

For instance, in As We Parted At The Gate (1915) by Donald Chalmers, Harvey Hindermyer, and E. Austin Keith, the tool found the words:

[…] we parted at the gate,
I thought my heart would shrink.
Often now I seem to hear her last goodbye.
And the stars that tune at night will
never die as bright as they did before we
parted at the gate.
Many years have passed and gone since I
went away once more, leaving far behind
the girl I love so well.
But I wander back once more, and today
I pass the door of the cottade well, my
sweetheart, here to dwell.
All the roads they flew at fair,
but the faith is missing there.
I hear a voice repeating, you’re to live.
And I think of days gone by
with a tear so from her eyes.
On the evening as we parted at the gate,
as we parted at the gate, I thought my
heart would shrink.
Often now I seem to hear her last goodbye.
And the stars that tune at night will
never die as bright as they did before we
parted at the gate.

All of the extracted texts are now available; we hope they are useful for understanding these early recordings. Bear in mind these are historical materials, so they may be offensive and also possibly incorrectly transcribed.

We are grateful that the University of California Santa Barbara Library donated an almost complete set of transfers of 100-year-old Edison recordings to the Internet Archive’s Great 78 Project this year. The recordings and the transfers were so good that the automatic tools were able to make out many of the words.

The next step is to integrate these texts into the browsing and searching interfaces at the Internet Archive.