KinoSearch
   HOME

TheInfoList



OR:

Apache Lucene is a
free and open-source Free and open-source software (FOSS) is software available under a Software license, license that grants users the right to use, modify, and distribute the software modified or not to everyone free of charge. FOSS is an inclusive umbrella term ...
search engine A search engine is a software system that provides hyperlinks to web pages, and other relevant information on World Wide Web, the Web in response to a user's web query, query. The user enters a query in a web browser or a mobile app, and the sea ...
software library In computing, a library is a collection of resources that can be leveraged during software development to implement a computer program. Commonly, a library consists of executable code such as compiled functions and classes, or a library can ...
, originally written in
Java Java is one of the Greater Sunda Islands in Indonesia. It is bordered by the Indian Ocean to the south and the Java Sea (a part of Pacific Ocean) to the north. With a population of 156.9 million people (including Madura) in mid 2024, proje ...
by
Doug Cutting Douglass Read Cutting is a software designer, advocate for, and creator of open-source search technology. He founded two technology projects, Lucene and Nutch, with Mike Cafarella. The Apache Software Foundation now manages both projects. Cutti ...
. It is supported by the
Apache Software Foundation The Apache Software Foundation ( ; ASF) is an American nonprofit corporation (classified as a 501(c)(3) organization in the United States) to support a number of open-source software projects. The ASF was formed from a group of developers of the ...
and is released under the
Apache Software License The Apache License is a permissive free software license written by the Apache Software Foundation (ASF). It allows users to use the software for any purpose, to distribute it, to modify it, and to distribute modified versions of the software u ...
. Lucene is widely used as a standard foundation for production search applications. Lucene has been ported to other programming languages including
Object Pascal Object Pascal is an extension to the programming language Pascal (programming language), Pascal that provides object-oriented programming (OOP) features such as Class (computer programming), classes and Method (computer programming), methods. T ...
,
Perl Perl is a high-level, general-purpose, interpreted, dynamic programming language. Though Perl is not officially an acronym, there are various backronyms in use, including "Practical Extraction and Reporting Language". Perl was developed ...
, C#,
C++ C++ (, pronounced "C plus plus" and sometimes abbreviated as CPP or CXX) is a high-level, general-purpose programming language created by Danish computer scientist Bjarne Stroustrup. First released in 1985 as an extension of the C programmin ...
,
Python Python may refer to: Snakes * Pythonidae, a family of nonvenomous snakes found in Africa, Asia, and Australia ** ''Python'' (genus), a genus of Pythonidae found in Africa and Asia * Python (mythology), a mythical serpent Computing * Python (prog ...
,
Ruby Ruby is a pinkish-red-to-blood-red-colored gemstone, a variety of the mineral corundum ( aluminium oxide). Ruby is one of the most popular traditional jewelry gems and is very durable. Other varieties of gem-quality corundum are called sapph ...
and
PHP PHP is a general-purpose scripting language geared towards web development. It was originally created by Danish-Canadian programmer Rasmus Lerdorf in 1993 and released in 1995. The PHP reference implementation is now produced by the PHP Group. ...
.


History

Doug Cutting Douglass Read Cutting is a software designer, advocate for, and creator of open-source search technology. He founded two technology projects, Lucene and Nutch, with Mike Cafarella. The Apache Software Foundation now manages both projects. Cutti ...
originally wrote Lucene in 1999. Lucene was his fifth search engine. He had previously written two while at
Xerox PARC Future Concepts division (formerly Palo Alto Research Center, PARC and Xerox PARC) is a research and development company in Palo Alto, California. It was founded in 1969 by Jacob E. "Jack" Goldman, chief scientist of Xerox Corporation, as a div ...
, one at
Apple An apple is a round, edible fruit produced by an apple tree (''Malus'' spp.). Fruit trees of the orchard or domestic apple (''Malus domestica''), the most widely grown in the genus, are agriculture, cultivated worldwide. The tree originated ...
, and a fourth at
Excite Excitation, excite, exciting, or excitement may refer to: * Excitation (magnetic), provided with an electrical generator or alternator * ''Exite'', a series of racing video games published by Nintendo starting with '' Excitebike'' * Excite (web por ...
. It was initially available for download from its home at the
SourceForge SourceForge is a web service founded by Geoffrey B. Jeffery, Tim Perdue, and Drew Streib in November 1999. SourceForge provides a centralized software discovery platform, including an online platform for managing and hosting open-source soft ...
web site. It joined the Apache Software Foundation's
Jakarta Jakarta (; , Betawi language, Betawi: ''Jakartè''), officially the Special Capital Region of Jakarta (; ''DKI Jakarta'') and formerly known as Batavia, Dutch East Indies, Batavia until 1949, is the capital and largest city of Indonesia and ...
family of open-source Java products in September 2001 and became its own top-level Apache project in February 2005. The name Lucene is Doug Cutting's wife's middle name and her maternal grandmother's first name. Lucene formerly included a number of sub-projects, such as Lucene.NET,
Mahout A mahout is an elephant rider, trainer, or keeper. Mahouts were used since antiquity for both civilian and military use. Traditionally, mahouts came from ethnic groups with generations of elephant keeping experience, with a mahout retainin ...
, Tika and
Nutch Apache Nutch is a highly extensible and scalable Open-source license, open source web crawler software project. Features Nutch is coded entirely in the Java (programming language), Java programming language, but data is written in language-ind ...
. These three are now independent top-level projects. In March 2010, the
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features ...
search server joined as a Lucene sub-project, merging the developer communities. Version 4.0 was released on October 12, 2012. In March 2021, Lucene changed its logo, and
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features ...
became a top level Apache project again, independent from Lucene.


Features and common use

While suitable for any application that requires full text indexing and searching capability, Lucene is recognized for its utility in the implementation of
Internet search engine A search engine is a software system that provides hyperlinks to web pages, and other relevant information on the Web in response to a user's query. The user enters a query in a web browser or a mobile app, and the search results are typically ...
s and local, single-site searching. Lucene includes a feature to perform a fuzzy search based on
edit distance In computational linguistics and computer science, edit distance is a string metric, i.e. a way of quantifying how dissimilar two String (computing), strings (e.g., words) are to one another, that is measured by counting the minimum number of opera ...
. Lucene has also been used to implement recommendation systems. For example, Lucene's 'MoreLikeThis' Class can generate recommendations for similar documents. In a comparison of the term vector-based similarity approach of 'MoreLikeThis' with citation-based document similarity measures, such as
co-citation Co-citation is the frequency with which two documents are ''citation, cited'' together by other documents.. If at least one other document cites two documents in common, these documents are said to be ''co-cited''. The more co-citations two docu ...
and co-citation proximity analysis, Lucene's approach excelled at recommending documents with very similar structural characteristics and more narrow relatedness.M. Schwarzer, M. Schubotz, N. Meuschke, C. Breitinger, V. Markl, and B. Gipp, https://www.gipp.com/wp-content/papercite-data/pdf/schwarzer2016.pdf "Evaluating Link-based Recommendations for Wikipedia" in Proceedings of the 16th ACM/IEEE-CS Joint Conference on Digital Libraries (JCDL), New York, NY, USA, 2016, pp. 191-200. In contrast, citation-based document similarity measures tended to be more suitable for recommending more broadly related documents, meaning citation-based approaches may be more suitable for generating
serendipitous Serendipity is an unplanned fortunate discovery. The term was coined by Horace Walpole in 1754. The concept is often associated with scientific and technological breakthroughs, where accidental discoveries led to new insights or inventions. Man ...
recommendations, as long as documents to be recommended contain in-text citations.


Lucene-based projects

Lucene itself is just an indexing and search library and does not contain
crawling Crawl, The Crawl, or crawling may refer to: Biology * Crawling, any type of tetrapod quadrupedal locomotion with the torso persistently touching or very close to the ground. ** Crawling (human), any of several types of human quadrupedal gait * ...
and HTML
parsing Parsing, syntax analysis, or syntactic analysis is a process of analyzing a String (computer science), string of Symbol (formal), symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal gramm ...
functionality. However, several projects extend Lucene's capability: *
Apache Nutch Apache Nutch is a highly extensible and scalable open source web crawler software project. Features Nutch is coded entirely in the Java programming language, but data is written in language-independent formats. It has a highly modular architec ...
– provides
web crawling Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (''web spider ...
and HTML parsing *
Apache Solr Solr (pronounced "solar") is an open-source enterprise-search platform, written in Java. Its major features include full-text search, hit highlighting, faceted search, real-time indexing, dynamic clustering, database integration, NoSQL features ...
– an enterprise search server *
CrateDB CrateDB is a distributed SQL database management system that integrates a fully searchable document-oriented data store. It is open-source, written in Java, based on a shared-nothing architecture, and designed for high scalability. CrateDB inclu ...
– open source, distributed SQL database built on Lucene * DocFetcher – a
multiplatform Within computing, cross-platform software (also called multi-platform software, platform-agnostic software, or platform-independent software) is computer software that is designed to work in several computing platforms. Some cross-platform softw ...
desktop search application *
Elasticsearch Elasticsearch is a Search engine (computing), search engine based on Apache Lucene, a free and open-source search engine. It provides a distributed, Multitenancy, multitenant-capable full-text search engine with an HTTP web interface and schema ...
– an enterprise search server released in 2010 * Kinosearch – a search engine written in
Perl Perl is a high-level, general-purpose, interpreted, dynamic programming language. Though Perl is not officially an acronym, there are various backronyms in use, including "Practical Extraction and Reporting Language". Perl was developed ...
and C and a loose
port A port is a maritime facility comprising one or more wharves or loading areas, where ships load and discharge cargo and passengers. Although usually situated on a sea coast or estuary, ports can also be found far inland, such as Hamburg, Manch ...
of Lucene. The
Socialtext Socialtext Incorporated was a company based in Palo Alto, California, that produced enterprise social software for companies. It offered an integrated suite of wiki tools and social software applications, including microblogging, user profiles, ...
wiki software uses this search engine, and so does the MojoMojo wiki. It is also used by the
Human Metabolome Database The Human Metabolome Database (HMDB) is a comprehensive, high-quality, freely accessible, online database of small molecule metabolites found in the human body. It has been created by the Human Metabolome Project funded by Genome Canada and is ...
(HMDB) and the Toxin and Toxin-Target Database (T3DB). *
MongoDB MongoDB is a source-available, cross-platform, document-oriented database program. Classified as a NoSQL database product, MongoDB uses JSON-like documents with optional database schema, schemas. Released in February 2009 by 10gen (now MongoDB ...
Atlas Search – a cloud-native enterprise search application based on MongoDB and Apache Lucene * OpenSearch – an open source enterprise search server based on a fork of Elasticsearch 7 *
Swiftype Swiftype is a search and index company based in San Francisco, California, that provides search software for organizations, websites, and computer programs. Notable customers include AT&T, Dr. Pepper, Hubspot and TechCrunch. History Swiftype w ...
– an enterprise search startup based on Lucene


See also

*
Enterprise search Enterprise search is software technology for searching data sources internal to a company, typically intranet and database content. The search is generally offered only to users internal to the company. Enterprise search can be contrasted with web ...
* Information extraction *
Information retrieval Information retrieval (IR) in computing and information science is the task of identifying and retrieving information system resources that are relevant to an Information needs, information need. The information need can be specified in the form ...
*
Text mining Text mining, text data mining (TDM) or text analytics is the process of deriving high-quality information from text. It involves "the discovery by computer of new, previously unknown information, by automatically extracting information from differe ...


References


Bibliography

* *


External links

* {{Authority control
Lucene Apache Lucene is a free and open-source search engine software library, originally written in Java by Doug Cutting. It is supported by the Apache Software Foundation and is released under the Apache Software License. Lucene is widely used as a ...
Free search engine software Java (programming language) libraries C Sharp libraries Cross-platform software Software using the Apache license Search engine software Pascal (programming language) software 1999 software