A distributed search engine is a search engine where there is no central server. Unlike traditional centralized search engines, work such as crawling, data mining, indexing, and query processing is distributed among several peers in a decentralized manner where there is no single point of control.
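As an illustration of how indexing and query processing might be spread over peers, the following is a minimal sketch in which each search term is assigned to a peer by hashing. It does not describe any particular engine listed in this article. The names `peer_for_term` and `ToyNetwork`, the hash-modulo placement rule, and the in-memory "network" are all assumptions made for the example.

```python
import hashlib
from collections import defaultdict


def peer_for_term(term, peers):
    """Pick the peer responsible for a term (hypothetical hash-modulo rule)."""
    digest = hashlib.sha256(term.encode("utf-8")).digest()
    return peers[int.from_bytes(digest[:8], "big") % len(peers)]


class ToyNetwork:
    """In-memory stand-in for a peer-to-peer network; no real networking."""

    def __init__(self, peers):
        self.peers = list(peers)
        # peer name -> term -> set of document ids (one index shard per peer)
        self.shards = {p: defaultdict(set) for p in self.peers}

    def index(self, doc_id, text):
        """Send each distinct term of a document to the peer that owns it."""
        for term in set(text.lower().split()):
            owner = peer_for_term(term, self.peers)
            self.shards[owner][term].add(doc_id)

    def search(self, query):
        """Return ids of documents containing every query term."""
        result = None
        for term in query.lower().split():
            owner = peer_for_term(term, self.peers)
            postings = set(self.shards[owner].get(term, set()))
            result = postings if result is None else result & postings
        return result if result is not None else set()


net = ToyNetwork(["peer-a", "peer-b", "peer-c"])
net.index("d1", "distributed search engine")
net.index("d2", "centralized search engine")
print(sorted(net.search("search engine")))
```

No single peer holds the whole index in this sketch: a query only contacts the peers that own its terms. A real system would also need replication and a placement rule that tolerates peers joining and leaving, which the modulo rule above does not.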
History
Rorur
The short-term goal of the Rorur project is to create a distributed search engine that runs, in a decentralized fashion, on a network of ordinary users' computers. Competitive latency and delivery of the requested rank can be achieved if the number of participating nodes is large enough and the fraction of malicious nodes does not exceed a calculable threshold (https://rorur.com/Whitepaper). The architecture builds on open-source algorithms that rely on public contribution for development and maintenance. To incentivize those who join and contribute, advertising revenue is distributed among node maintainers. The long-term goal is to have built-in personal search agents that construct and maintain personal knowledge graphs to assist human-web interaction.
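The idea of a threshold on the fraction of malicious nodes can be illustrated with a generic model. This is not the model from the Rorur whitepaper. The sketch assumes that each result is computed by a fixed number of replicas, that each replica is malicious independently with the same probability, and that the answer given by a strict majority is accepted. The function name `majority_honest_probability` is hypothetical.

```python
from math import comb


def majority_honest_probability(replicas, malicious_fraction):
    """Probability that strictly more than half of the replicas are honest.

    Assumes each replica is malicious independently with probability
    `malicious_fraction` (an illustrative binomial model).
    """
    if replicas <= 0:
        raise ValueError("replicas must be positive")
    if not 0.0 <= malicious_fraction <= 1.0:
        raise ValueError("malicious_fraction must be between 0 and 1")
    honest = 1.0 - malicious_fraction
    needed = replicas // 2 + 1  # smallest strict majority
    return sum(
        comb(replicas, k) * honest**k * malicious_fraction ** (replicas - k)
        for k in range(needed, replicas + 1)
    )


for fraction in (0.2, 0.6):
    for replicas in (3, 9):
        p = majority_honest_probability(replicas, fraction)
        print(f"malicious={fraction:.1f} replicas={replicas} p={p:.3f}")
```

Under these assumptions, adding replicas helps only while fewer than half of the nodes are malicious. Above that point, more replicas make an honest majority less likely, which is one way such a threshold can arise.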
Presearch
Started in 2017, Presearch is a search engine based on an ERC20 token (PRE) and powered by a distributed network of community-operated nodes, which aggregate results from a variety of sources. This network powers the searches at presearch.com. It is planned to be a precursor to a system in which each node collaborates on a global decentralised index.
Presearch averages 5 million searches per day and has 2.2 million registered users. On September 1, 2021, Presearch was added as a default option to the search engine list on Android for the EU. On May 27, 2022, Presearch officially transitioned from its Testnet to a Mainnet, meaning that all search traffic through the service now runs over Presearch’s decentralized network of volunteer-run nodes.
YaCy
On December 15, 2003, Michael Christen announced development of a P2P-based search engine, eventually named YaCy, on the heise online forums.
Dews
Dews is a theoretical design for a distributed search engine discussed in academic literature.
Seeks
Seeks was an open-source web search proxy and collaborative distributed tool for web search. It has had no usable release since 2016.
InfraSearch
In April 2000 several programmers (including Gene Kan and Steve Waterhouse) built a prototype P2P web search engine based on Gnutella, called InfraSearch. The technology was later acquired by Sun Microsystems and incorporated into the JXTA project. It was meant to run inside the participating websites' databases, creating a P2P network that could be accessed through the InfraSearch website.
Opencola
On May 31, 2000, Steelbridge Inc. announced development of OpenCOLA, a collaborative, distributed, open-source search engine. It runs on the user's computer, crawls the web pages and links the user puts in their opencola folder, and shares the resulting index over its P2P network.
Faroo
In February 2001 Wolf Garbe published an idea of a peer-to-peer search engine, started the Faroo prototype in 2004, and released it in 2005.
Goals
The goals of building a distributed search engine include:
1. to create an independent search engine powered by the community;
2. to make the search operation open and transparent by relying on open-source software;
3. to distribute the advertising revenue to node maintainers, which may help create more robust web infrastructure;
4. to allow researchers to contribute to the development of open-source and publicly-maintainable ranking algorithms and to oversee the training of the algorithm parameters.
Challenges
1. The amount of data to be processed is enormous. The size of the visible web is estimated at 5 PB spread across roughly 10 billion pages.
2. The latency of the distributed operation must be competitive with the latency of the commercial search engines.
3. A mechanism that prevents malicious users from corrupting the distributed data structures or the rank needs to be developed.
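The scale in the first challenge can be made concrete with a back-of-the-envelope calculation. The sketch below takes the 5 PB and 10 billion page estimates from the list above. The node count, the replication factor, the decimal definition of a petabyte, and the function names are assumptions chosen only for illustration.

```python
PETABYTE = 10**15  # decimal petabyte, an assumption about the unit used

VISIBLE_WEB_BYTES = 5 * PETABYTE      # estimate quoted in the list above
VISIBLE_WEB_PAGES = 10 * 10**9        # estimate quoted in the list above


def average_page_bytes(total_bytes, pages):
    """Mean size of one page under the given estimates."""
    return total_bytes / pages


def per_node_bytes(total_bytes, nodes, replication):
    """Raw storage per node if data is spread evenly and replicated."""
    return total_bytes * replication / nodes


page = average_page_bytes(VISIBLE_WEB_BYTES, VISIBLE_WEB_PAGES)
# Hypothetical network: 100,000 nodes, each item stored on 3 nodes.
share = per_node_bytes(VISIBLE_WEB_BYTES, nodes=100_000, replication=3)

print(f"average page: {page / 1000:.0f} kB")
print(f"per-node share: {share / 10**9:.0f} GB")
```

With these assumed figures the average page is about 500 kB and each node would hold about 150 GB of raw page data. The estimate ignores the size of the index itself, compression, and uneven load, so it is only a rough order of magnitude.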
See also
* List of search engines#P2P search engines
* Distributed processing