Apache Nutch is a highly extensible and scalable open source web crawler software project.
Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project has also implemented a MapReduce facility and a distributed file system. The two facilities have been spun out into their own subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been considered an independent, top-level project of the Apache Software Foundation.
In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl.
While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.
Scalability
IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. It found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer, such as the POWER5.
The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.
Related projects
* Hadoop – Java framework that supports distributed applications running on large clusters.
Search engines built with Nutch
* Common Crawl – publicly available internet-wide crawls, started using Nutch in 2014.
* Creative Commons Search – an implementation of Nutch, in use from 2004 to 2006.
* DiscoverEd – open educational resources search prototype developed by Creative Commons.
* Krugle – uses Nutch to crawl web pages for code, archives and technically interesting content.
* mozDex (inactive)
* Wikia Search – launched 2008, closed down 2009.
See also
* Faceted search
* Information extraction
* Enterprise search