Apache Nutch is a highly extensible and scalable open source web crawler software project.
Features

Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architecture, allowing developers to create plug-ins for media-type parsing, data retrieval, querying and clustering.
The fetcher ("robot" or "web crawler") has been written from scratch specifically for this project.
History
Nutch originated with Doug Cutting, creator of both Lucene and Hadoop, and Mike Cafarella.
In June 2003, a successful 100-million-page demonstration system was developed. To meet the multi-machine processing needs of the crawl and index tasks, the Nutch project has also implemented a MapReduce facility and a distributed file system. The two facilities have been spun out into their own subproject, called Hadoop.
In January 2005, Nutch joined the Apache Incubator, from which it graduated to become a subproject of Lucene in June of that same year. Since April 2010, Nutch has been considered an independent, top-level project of the Apache Software Foundation.
In February 2014, the Common Crawl project adopted Nutch for its open, large-scale web crawl.
While it was once a goal for the Nutch project to release a global large-scale web search engine, that is no longer the case.
Scalability
IBM Research studied the performance of Nutch/Lucene as part of its Commercial Scale Out (CSO) project. It found that a scale-out system, such as Nutch/Lucene, could achieve a performance level on a cluster of blades that was not achievable on any scale-up computer, such as the POWER5.
The ClueWeb09 dataset (used in e.g. TREC) was gathered using Nutch, with an average speed of 755.31 documents per second.
Related projects
* Hadoop – Java framework that supports distributed applications running on large clusters.
Search engines built with Nutch
* Common Crawl – publicly available internet-wide crawls, started using Nutch in 2014.
* Creative Commons Search – an implementation of Nutch, in use from 2004 to 2006.
* DiscoverEd – open educational resources search prototype developed by Creative Commons.
* Krugle – uses Nutch to crawl web pages for code, archives and technically interesting content.
* mozDex (inactive)
* Wikia Search – launched 2008, closed down 2009.
See also
* Faceted search
* Information extraction
* Enterprise search