80 terabytes of archived web crawl data available for research

Internet Archive crawls and saves web pages and makes them available for viewing through the Wayback Machine because we believe in the importance of archiving digital artifacts for future generations to learn from. In the process, of course, we accumulate a lot of data.

We are interested in exploring how others might be able to interact with or learn from this content if we make it available in bulk. To that end, we would like to experiment with offering access to one of our crawls from 2011 with about 80 terabytes of WARC files containing captures of about 2.7 billion URIs. The files contain text content and any media that we were able to capture, including images, Flash, videos, etc.

What’s in the data set:

  • Crawl start date: 9 March 2011
  • Crawl end date: 23 December 2011
  • Number of captures: 2,713,676,341
  • Number of unique URLs: 2,273,840,159
  • Number of hosts: 29,032,069
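
If you want a feel for how the files can be processed once you have them, here is a minimal sketch that reads a single WARC file with the open-source warcio library and tallies content types. The file name is a placeholder, and warcio is simply one common WARC reader, not necessarily the tooling we use internally.

    from collections import Counter
    from warcio.archiveiterator import ArchiveIterator

    # Tally content types across the response records of a single WARC file.
    # 'example.warc.gz' is a placeholder for one of the files in the crawl.
    content_types = Counter()
    with open('example.warc.gz', 'rb') as stream:
        for record in ArchiveIterator(stream):
            if record.rec_type != 'response':
                continue
            mime = None
            if record.http_headers is not None:
                mime = record.http_headers.get_header('Content-Type')
            content_types[(mime or 'unknown').split(';')[0].strip()] += 1

    for mime, count in content_types.most_common(10):
        print(f'{count:>10}  {mime}')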

The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used the Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites. However, this was a somewhat experimental crawl for us, as we were using newly minted software to feed URLs to the crawlers, and we know there were some operational issues with it. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added into queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so looking at results by country will be somewhat skewed. We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with. We have also done some further analysis of the content.
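
For readers unfamiliar with how robots.txt directives are honored, the sketch below shows the general idea using Python’s standard urllib.robotparser. The user-agent token and URLs are placeholders, and this is a simplified illustration rather than how Heritrix actually implements its politeness rules.

    from urllib import robotparser

    # Simplified illustration of honoring robots.txt before fetching a page.
    # 'example-bot' is a placeholder user-agent token, not the one used in this crawl.
    parser = robotparser.RobotFileParser()
    parser.set_url('https://example.com/robots.txt')
    parser.read()

    candidate = 'https://example.com/some/page.html'
    if parser.can_fetch('example-bot', candidate):
        print('allowed to fetch:', candidate)
    else:
        print('disallowed by robots.txt:', candidate)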

[Figure: Hosts Crawled pie chart]
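
A breakdown like the one charted above can be approximated by tallying the host of every captured URL. The sketch below, using a few placeholder URLs, shows one way to do that; it is an illustrative approach, not the analysis pipeline we used.

    from collections import Counter
    from urllib.parse import urlsplit

    def host_distribution(urls):
        """Count captures per host, e.g. from WARC-Target-URI values."""
        hosts = Counter()
        for url in urls:
            host = urlsplit(url).hostname
            if host:
                hosts[host] += 1
        return hosts

    # A few placeholder URLs stand in for the full set of captured URIs.
    sample = [
        'http://example.com/index.html',
        'http://example.com/about.html',
        'http://example.org/',
    ]
    for host, count in host_distribution(sample).most_common():
        print(host, count)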

If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it.  We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.

 

43 thoughts on “80 terabytes of archived web crawl data available for research”

  1. Pingback: 80 terabytes of archived web crawl data available for research | My Daily Feeds

  2. Mayank Sharma

    Hi,
    I’m a student of , Pilani campus. I’m pursuing my degree in Electrical and Electronics Engineering.
    However, I am currently an intern at MuSigma Business Solutions, Bangalore, and have familiarized myself with data analysis; I am specifically working on text mining (opinion mining and sentiment mining, along with social media analysis, to be precise).
    I am a blogger and run a personal blog which has my musings on the world. It has a small niche of followers and they were all completely unexpected. I’m also interning as a writer for an online magazine.
    I’d like to continue working in both these areas in my last semester on campus and do a study on what makes a good blog good. I’d like to study the traffic to personal blogs and see where it comes from – what are the general channels that generate traffic to personal blogs. I hope to someday develop a research paper about the social network structure of the blogosphere, which is innate in its nature and rarely mentioned explicitly.
    NOTE: The work I hope to do is purely at a personal level 🙂
    Hope you like my idea.

  3. Paul Annear

    Dear Internet Archive

    The Concise Model of the Universe is a project which aims to capture everything that interests me, and therefore many other people, who are also able to add content. Unfortunately the people hosting The Model did not keep backups as promised, and the site has crashed twice, leading to a massive loss of information.

    I would love to be able to explore your resource to recover lost information and in the process to find information that should be included on The Model.

    Thank you for your consideration of my request.

    Paul Annear

    1. brewster

      If it has a URL, you should be able to see it in the Wayback Machine.

      I bet you have tried this, but just in case.

      -brewster

  4. Pingback: 10,000,000,000,000,000 bytes archived! | Internet Archive Blogs

  5. Lynne Leflore

    I would only be interested in this material for genealogical purposes.

    Thanks for all of your hard work.

  6. Pingback: The Internet Archive has saved over 10,000,000,000,000,000 bytes of the Web | Digital Gadget dan Selular

  7. Pingback: The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data : Lenned

  8. Pingback: The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data |Trax Asia™

  9. Pingback: The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data | Webmasters' Home

  10. Pingback: The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data | 1v8 NET

  11. Pingback: The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data | genellacoleman.com

  12. Pingback: The Internet Archive Has Now Saved A Whopping 10 Petabytes Of Data | Gizmodo Australia

  13. Pingback: The Internet Archive is now home to 10 petabytes of data - TekDefenderTekDefender

  14. Pingback: Technable | Making you Technically Able

  15. Pingback: » The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data Gamez Menu

  16. Crawford Comeaux

    I’m developing a meta-research tool that extracts the references from research papers. I’m wondering if the crawl contains any document files; if so, access to them would provide me with:

    – a giant corpus to test the extractor and other functions against (especially if any are written in a language other than English)
    – a great seed for a bibliometric/bibliographic database, as well as a global research graph

    This would be for a commercial service, but I will be open-sourcing the project under the MIT license. I’m planning on charging $5/mo or less for the service and hope to sell it to one of the companies producing similar research/bibliographic tools (e.g. Zotero).

  17. Pingback: Internet Archive Celebrates 10 Petabytes « Random Walks

  18. Pingback: The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data : Gadget News

  19. Pingback: Internet Archive: 10 Petabyte im Internet-Museum | Die Hirn Offensive

  20. Pingback: Rob's Personal Aggregator » The Internet Archive Has Now Saved a Whopping 10,000,000,000,000,000 Bytes of Data

  21. Pingback: Arquivo com a história da internet ultrapassa 10 milhões de gigabytes | Micro Ploft

  22. Pingback: Internet Archiv archiviert 10 Petabyte an Daten | MediaCompany Blog - Frisches Webdesign aus Lübeck

  23. Pingback: Internet Archive Now Stores Over 10,000,000,000,000,000 Bytes of the Web | WebTool Plugin For WordPress

  24. Pingback: Morning Toolbox – October 29, 2012 – Post-CSIcon Monday Blues « Skeptical Software Tools

  25. Pingback: Arquivo com a história da internet ultrapassa 10 milhões de gigabytes |

  26. Pingback: Quora

  27. Pingback: Internet Archive, 10 Petabyte di cultura digitale | infropy - information entropy

  28. Pingback: The Internet Archive Reaches A Milestone, Capstone Creations in Louisville, KY

  29. Nikolaj Soboljev

    I would love to be able to explore your resource to find lost information about an old PLC (programmable logic controller). I have the old machine but I don’t have the program, so this would be a great opportunity to find it now. I also want to know some details about old personal computers that I can’t find on the internet now.

    Thanks for all of your hard work.

    Nikolaj

  30. Pingback: A trillion…anything…in your Hadoop cluster is cool » Aaron at the Internet Archive

  31. Pingback: Get Latest News Around The World

  32. Pingback: Lecture des sources historiennes à l’ère numérique | Frédéric Clavert

  33. Pingback: Una copia completa de la Web en 80 terabytes « Noticias Venezuela

  34. Pingback: Finding .ca domains in the 80TB Wide Crawl | Ian Milligan

  35. Pingback: Generating List of Domain-Specific WARC Files to Download | Ian Milligan

  36. Pingback: Internet Archive ahora es el hogar de 10 petabytes de datos | Online

  37. Pingback: Just how big are these WARC files, anyways? | Ian Milligan

  38. Pingback: 80 terabytes of archived web crawl data available for research | Internet Archive Blogs | Shane Cloud

  39. Pingback: Lecture des sources historiennes à l’ère numérique | Frédéric Clavert

  40. Pingback: Exploring 50,000 Images from the Wide Web Scrape, Initial Thoughts | Ian Milligan

Comments are closed.