Monthly Archives: March 2022

Google Summer of Code is a Win-Win for Contributing Students and Mentoring Organizations: Thank you Google

Lavanya Singh was eager to write lots of code after her freshman year of college, but she knew it was hard to find a place that would give her a chance. Then she landed a spot with the Google Summer of Code (GSoC) program working at the Internet Archive.

Paired with Mark Graham, director of The Wayback Machine, Singh was asked to create a systematic way to archive news sources from all around the world. 

Lavanya Singh, Google Summer of Code contributor.

“Mark basically gave me that problem and said: ‘Go figure it out,’” she recalls, grateful for the challenge, the tight-knit community at the Internet Archive, and the mentorship provided throughout the project. “The Internet Archive really trusts their interns and gives you an opportunity to do huge scale technical projects that are going to be useful in the long run.”

The experience gave Singh skills and confidence that led to other internships and a job as a software engineer, following graduation this spring from Harvard University with a degree in computer science and philosophy.

For 17 years, GSoC has given more than 18,000 students from 112 countries the chance to learn about programming up close. Google selects students (called “contributors”) and matches them with organizations doing open-source projects. All told, the students have created 40 million lines of code since the program’s inception in 2005. It has helped launch careers, like Singh’s, and provided a pipeline of potential employees for the 746 organizations that have participated. Google recently posted its Google Summer of Code timeline for 2022 for applicants for the paid positions, which last 12 weeks.

“It is truly a benefit and service to students. For some, it can be transformational,” said Singh’s mentor, Graham, of the Internet Archive. “But it also helps us. It’s a way to learn about new talent. And it’s a way for the Internet Archive to increase our visibility and demonstrate that we are part of this community of organizations.”  

GSoC provides an infrastructure that matches promising programmers with projects that can otherwise be difficult to find, and it is especially relevant now that people are working remotely, said Brenton Cheng, a senior engineer with the Internet Archive.

“It’s been an incredible way by which people all over the world can get opportunities to work with companies, creating openings that might not be available to them otherwise,” said Cheng, who has mentored several student contributors over the years. 

Staff assign mini-projects designed to give students hands-on experience and a sense of accomplishment. Students are also included in team meetings and invited to give input and present their work, said Cheng.

Recent GSoC projects and contributors:

  • Rakesh Chinta focused on building advanced features for the existing Chrome extension for the Wayback Machine (2017);
  • Zhengyue Cheng created a “map” of the web via the Wayback Machine (2018);
  • Salman Shah worked with the Open Library team to modernize and increase the coverage of its book catalog and improve website reliability (2018);
  • Kanchan Joshi improved site navigation for Archive.org (2019);
  • Giacomo Cignoni made a significant contribution with his BookReader Selection & Dark Mode project. He worked to enable text selection over the book page images of public domain works (2020);
  • Tabish Shaikh helped improve the adoption of Open Library with his Adoption of BookLovers project – redesigning the Book Page and making it clearer what services were offered (2020);
  • Nolan Windham worked on the Open Book Genome Project. It centered on the ability for computers and machines to read a book on our behalf, and extract metadata that can then be made publicly useful to the world. Through the process, nearly 10,000 new books were added to the lending system (2021);
  • Xin Yue Chen focused on linking Wikipedia references to Internet Archive books (2021).

“We’re helping to train the next generation of developers,” Cheng said. “On the flip side, we really believe in our mission. Quite often, the people who work with the Google Summer of Code program continue to contribute with us as volunteers or sometimes even become employees.”

It’s a mutual win and an awesome program that has helped a lot of students find connections with companies, added Cheng. The program is a way for young people to show their initiative and is advertised as a way to “flip bits not burgers” in the summer. 

“It’s a chance to contribute to a larger organization and maybe set themselves on a different prospective path to their future,” Cheng said. 

Mek, who leads the Open Library team at the Internet Archive, said the four GSoC students he’s worked with have made substantial improvements through their projects.

“We were able to make progress in a variety of different areas that we may not otherwise have had the bandwidth to focus on,” said Mek. 

Being involved in GSoC has dramatically increased the number of volunteers who are interested in participating in the Open Library ecosystem. It prompted the Internet Archive to streamline the volunteer page and create intake forms. There has also been an effort to organize and label projects for new volunteers.

The GSoC experience led the Internet Archive to structure its own internship and fellowship opportunities. And it has provided the organization with a means to find qualified staff.

Anish Kumar Sarangi, Google Summer of Code contributor.

Anish Kumar Sarangi, a student GSoC contributor in 2018, joined the Internet Archive as an employee in May 2020. During his summer experience, Sarangi worked on development of the Chrome extension, “Wayback Machine.”  Today it is used by thousands of people to help them archive URLs, access archived content from broken links and perform other functions to help make the web more useful and reliable. 

“I gained a lot of knowledge and experience. Everyone was very encouraging and supportive,” said Sarangi, of the summer program. He now works from India in software development for the Internet Archive and has been a mentor with the program himself. His advice to others considering applying: “Please get involved in the community. You can get guidance and grow further in the organization.”

Library as Laboratory Recap: Supporting Computational Use of Web Collections

For scholars, especially those in the humanities, the library is their laboratory. Published works and manuscripts are their materials of science. Today, to do meaningful research, that also means having access to modern datasets that facilitate data mining and machine learning.

On March 2, the Internet Archive launched a new series of webinars highlighting its efforts to support data-intensive scholarship and digital humanities projects. The first session focused on the methods and techniques available for analyzing web archives at scale.

“If we can have collections of cultural materials that are useful in ways that are easy to use — still respectful of rights holders — then we can start to get a bigger idea of what’s going on in the media ecosystem,” said Internet Archive Founder Brewster Kahle.

Just what can be done with billions of archived web pages? The possibilities are endless. 

Jefferson Bailey, Internet Archive’s Director of Web Archiving & Data Services, and Helge Holzmann, Web Data Engineer, shared some of the technical issues libraries should consider and tools available to make large amounts of digital content available to the public.

The Internet Archive gathers information from the web through different methods including global and domain crawling, data partnerships and curation services. It preserves different types of content (text, code, audio-visual) in a variety of formats.

Learn more about the Library as Laboratory series & register for upcoming sessions.

Social scientists, data analysts, historians and literary scholars make requests for data from the web archive for computational use in their research. Institutions use its service to build small and large collections for a range of purposes. Sometimes the projects can be complex and it can be a challenge to wrangle the volume of data, said Bailey.

The Internet Archive has worked on a project reviewing changes to the content of 800,000 corporate home pages since 1996. It has also mined data for a language analysis project, producing custom extractions for Icelandic, Norwegian and Irish translation.

Transforming data into useful information requires data engineering. As librarians consider how to respond to inquiries for data, they should look at their tech resources, workflow and capacity. While such datasets are more complicated to produce, their potential has expanded given the size and scale of the collections and the longitudinal analysis that can be done.

“We are getting more and more computational use data requests each year,” Bailey said. “If librarians, archivists, cultural heritage custodians haven’t gotten these requests yet, they will be getting them soon.”

Up next in the Library as Laboratory series:

The next webinar in the series will be held March 16, and will highlight five innovative web archiving research projects from the Archives Unleashed Cohort Program. Register now.

What’s New in February 2022

Here are some of the notable new additions to the Internet Archive from February 2022. (Logging in might be required to borrow certain items.)

Notable new collections: 

We’ve been reorganizing some of the items uploaded by our users, and these collections of magazines struck us as particularly interesting:

Books 45,073

This month we’ve added books in more than 20 languages. Here are a few good ones to start with:

Audio Archive 73,305

The audio archive contains recordings ranging from alternative news programming, to Grateful Dead concerts, to Old Time Radio shows, to book and poetry readings, to original music uploaded by our users.

The LibriVox Free Audiobook Collection 118

Founded in 2005, LibriVox is a community of volunteers from all over the world who record audio versions of public domain texts: poetry, short stories, whole books, even dramatic works, in many different languages.

78 RPMs and Cylinder Recordings 8,840

Listen to this collection of 78rpm records, cylinder recordings, and other recordings from the early 20th century.

Live Music Archive 892

The Live Music Archive is a community committed to providing the highest quality live concerts in a lossless, downloadable format, along with the convenience of on-demand streaming.

Netlabels 263

The Netlabels collection hosts complete, freely downloadable/streamable, often Creative Commons-licensed catalogs of virtual record labels.

Internet Arcade 5

The Internet Arcade is a web-based library of arcade (coin-operated) video games from the 1970s through to the 1990s, emulated in JSMAME, part of the JSMESS software package. Containing hundreds of games ranging through many different genres and styles, the Arcade provides research, comparison, and entertainment in the realm of the Video Game Arcade.

Independent Publisher Drives Innovation, Sells eBooks to Internet Archive

Publisher of 11:11 Press says it sells—rather than licenses—books to libraries for online lending to reach a broad audience.

The goal of 11:11 Press is to have its books in every library in the world, according to its founder and publisher, Andrew Wilt.

Andrew Wilt, 11:11 Press

“We are big supporters of libraries because they allow equal access to knowledge and preserve culture,” said Wilt, whose independent press based in Minneapolis sells its books at a discount to nonprofits. “From a publishing standpoint, our authors care about being read so we want to get our books to as many people as possible.”

The Internet Archive recently bought the entire catalog of books from 11:11 Press and made them available online for controlled digital lending to one person at a time.  

“Honestly, I don’t know why anyone would not want to have their books in a library, especially the Internet Archive, which is more relevant now than it has been any other time,” Wilt said. “It used to be the library of the future. But in our era of remote learning and people working from home, the Internet Archive is the library of the present. You don’t have to go into an actual physical building. It’s available for anyone with an internet connection. It’s probably the most relevant lending institution at the moment.”

In business for four years, 11:11 Press publishes an eclectic mix of titles that Wilt describes as “disruptive literature.” Its authors push the boundaries. Some books have a very heavy, theoretical and academic focus while others are about everyday working people. There are books of poetry, short stories, novels, and hybrid work. The aim is to give exposure to underrepresented voices and offer an alternative to what is produced by mainstream publishers.

“We’re kind of this lighthouse trying to find those people who are actively looking for something that’s new and exciting,” said Wilt.

From the 11:11 Press Catalog

In one of the 11:11 Press “theory fiction” titles, Zer000 Excess, images are “glitching out” within the text, leading the reader to consider what meaning is being created. Jake Reber wrote the book using Microsoft PowerPoint 2007 – the only version of the software with identifiable software features known to produce these “glitches.” Authors like Reber intentionally use these embedded software tools incorrectly in order to get distortion. “Like the early punk bands who put fuzz in their music, we’re trying to add that distortion in the work,” said Wilt.

Human Tetris, by Vi Khi Nao and Ali Raz, presents digital dating in an all-too-honest, newspaper-style series of queer dating profiles. It was written as a collaboration between two different voices, building a lattice of interlocking online identities.

The publisher features “dangerous writing,” which uses fiction as the buffer to draw on personal experience. For authors in this genre, fiction is the lie that tells the truth. “We want to encourage writers to go to those uncharted territories of the self. What you find might be hard to look at, but if you pull back the layers, there’s something unique and beautiful there,” Wilt said.

Jinnwoo (Ben Webb) is a writer, musician, visual artist, and author of the book Little Hollywood, published by 11:11 Press. It consists of B-grade movie scripts with paper doll cutouts. The idea is to engage the reader by having them cut out the dolls and use the scripts. “Going to those dark places with honesty encourages the reader to be more mindful, more present, which leads to more empathy,” Wilt said.

Did you know? Thanks to the innovative partnership between the Internet Archive and Better World Books—our favorite online bookstore—patrons who browse to the 11:11 Press books at archive.org have a direct link to purchase new copies of the books in print via Better World Books.

In its next catalog, 11:11 Press will be coming out with a 520-page Illustrated Old Testament and corresponding painting. This 9-by-12-inch book, which will sell for $150, is too religious for some and too secular for others, making it a perfect product for a small press, Wilt said. Another upcoming book will be a compilation of short stories by the late Peter Christopher, who helped start the dangerous writing movement.

Wilt said that, as a small press, 11:11 focuses not on writing with marketing in mind but on authors writing the stories only they can tell. The hope is for 11:11 Press to create something greater to help benefit society and get people to think in a different way. “Reading authors who courageously face their lives, their past, their future, encourages us, the readers, to do the same,” he said.

Wilt said he anticipates other independent publishers will follow suit in selling their works to the Internet Archive. “Small presses drive innovation. This is where experimentation occurs,” he said. “Our top priority is sharing knowledge.”