Apache Nutch is a highly extensible and scalable open source web crawler software project.
Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
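As an illustration of the plug-in model, the following is a minimal sketch of a URL-filter plug-in in Java. It assumes the Nutch 1.x URLFilter extension point (org.apache.nutch.net.URLFilter), whose filter(String) method returns the URL to keep it or null to discard it, and which inherits Hadoop's Configurable interface; exact interface details vary across Nutch versions, and the class name and configuration property used here are hypothetical.

  // Hypothetical URL-filter plug-in: keeps only URLs on an allowed host.
  // Sketched against the Nutch 1.x URLFilter extension point; details vary by version.
  import java.net.MalformedURLException;
  import java.net.URL;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class AllowedHostURLFilter implements URLFilter {

    private Configuration conf;
    private String allowedHost = "example.org"; // illustrative default

    // URLFilter contract: return the URL to accept it, null to reject it.
    @Override
    public String filter(String urlString) {
      try {
        URL url = new URL(urlString);
        return allowedHost.equals(url.getHost()) ? urlString : null;
      } catch (MalformedURLException e) {
        return null; // reject anything that does not parse as a URL
      }
    }

    // Configurable contract: Nutch passes the job configuration to each plug-in.
    @Override
    public void setConf(Configuration conf) {
      this.conf = conf;
      // "myplugin.allowed.host" is a made-up property name for illustration.
      this.allowedHost = conf.get("myplugin.allowed.host", allowedHost);
    }

    @Override
    public Configuration getConf() {
      return conf;
    }
  }

In a real deployment such a class would be packaged with a plugin.xml descriptor and activated through the plugin.includes property in nutch-site.xml, the standard Nutch plug-in mechanism; the same extension-point pattern is used for parsers, indexing filters and the other plug-in types mentioned above.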
The fetcher ("robot" or "
web crawler
Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (''web spider ...
") has been written from scratch specifically for this project.
History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project also implemented a MapReduce facility and a distributed file system. These two components were later spun out into their own subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been an independent, top-level project of the Apache Software Foundation.
In February 2014 the Common Crawl project adopted Nutch for its open, large-scale web crawl.
While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.
Scalability
IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. Their findings were that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer such as the POWER5.
The ClueWeb09 dataset (used, for example, in TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.
Related projects
* Hadoop – Java framework that supports distributed applications running on large clusters.
Search engines built with Nutch
* Common Crawl – publicly available internet-wide crawls, started using Nutch in 2014.
* Creative Commons Search – an implementation of Nutch, used from 2004 to 2006.
* DiscoverEd – open educational resources search prototype developed by Creative Commons.
* Krugle – uses Nutch to crawl web pages for code, archives and technically interesting content.
* mozDex (inactive)
* Wikia Search – launched 2008, closed down 2009.