Heritrix

Heritrix is a web crawler designed for web archiving. Written in Java by the Internet Archive, it is available under a free software license. Its main interface is accessed through a web browser, and an optional command-line tool can be used to initiate crawls.

Stable release: 3.4.0-20220727^[1] / 28 July 2022

Repository: github.com/internetarchive/heritrix3

Written in: Java

Operating system: Linux/Unix-like/Windows (unsupported)

Type: Web crawler

License: Apache License

Website: github.com/internetarchive/heritrix3/wiki

Heritrix was developed jointly by the Internet Archive and the Nordic national libraries, based on specifications written in early 2003. The first official release was in January 2004, and it has been continually improved since then by employees of the Internet Archive and other interested parties.

For many years Heritrix was not the main crawler used to crawl content for the Internet Archive's web collection.^[2] As of 2011, the largest contributor to the collection was Alexa Internet.^[2] Alexa crawled the web for its own purposes,^[2] using a crawler named ia_archiver, and then donated the material to the Internet Archive.^[2] The Internet Archive itself did some of its own crawling using Heritrix, but only on a smaller scale.^[2]

Starting in 2008, the Internet Archive began making performance improvements to support its own wide-scale crawling, and it now collects most of its content itself.^[3]

A number of organizations and national libraries are using Heritrix, among them:

Austrian National Library, Web Archiving

Bibliotheca Alexandrina's Internet Archive

Bibliothèque nationale de France

British Library

California Digital Library's Web Archiving Service

CiteSeerX

Documenting Internet2

Internet Memory Foundation

Library and Archives Canada

Library of Congress^[4]

National and University Library of Iceland

National Library of Finland

National Library of New Zealand

Royal Library of the Netherlands (Koninklijke Bibliotheek)^[5]

Netarkivet.dk

Smithsonian Institution Archives

National Library of Israel

Related tools:

Arc processing tools

WERA (Web ARchive Access)

Heritrix comes with several command-line tools:

htmlextractor – displays the links Heritrix would extract for a given URL

hoppath.pl – recreates the hop path (path of links) to the specified URL from a completed crawl

manifest_bundle.pl – bundles up all resources referenced by a crawl manifest file into an uncompressed or compressed tarball

cmdline-jmxclient – enables command-line control of Heritrix

arcreader – extracts the contents of ARC files

Further tools are available as part of the Internet Archive's warctools project.^[6]
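To illustrate what a hop path is (the path of links that hoppath.pl recreates for a URL), the following is a minimal sketch in Python. It is not the logic of hoppath.pl itself: it assumes the crawl data has already been reduced to a hypothetical mapping from each URL to the URL it was discovered from, and the function name, mapping, and example URLs are all made up for illustration.

```python
def hop_path(via, target):
    """Return the chain of URLs leading to `target`, seed first.

    `via` is a hypothetical mapping: URL -> URL it was discovered from
    (None for a seed). It stands in for data from a completed crawl.
    """
    path = [target]
    seen = {target}
    # Walk backwards from the target until a seed (no referrer) is reached.
    while via.get(path[-1]) is not None:
        parent = via[path[-1]]
        if parent in seen:  # guard against referrer loops in the data
            break
        path.append(parent)
        seen.add(parent)
    path.reverse()
    return path


# Made-up example data: one seed and two pages reached by following links.
via = {
    "http://example.com/": None,
    "http://example.com/a": "http://example.com/",
    "http://example.com/a/b": "http://example.com/a",
}

print(" -> ".join(hop_path(via, "http://example.com/a/b")))
```

Walking backwards from the target keeps the lookup simple, because each URL has at most one recorded referrer while a page may link out to many URLs.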



Internet Archive

National Digital Information Infrastructure and Preservation Program

Web crawler

Heritrix - official wiki

NutchWAX

Wayback (Open source Wayback Machine)