Metasearch

A metasearch engine (or search aggregator) is an online information retrieval tool that uses the data of a web search engine to produce its own results. Metasearch engines take input from a user and immediately query search engines for results. Sufficient data is gathered, ranked, and presented to the users. Problems such as spamming reduce the accuracy and precision of results. The process of fusion aims to improve the engineering of a metasearch engine. Examples of metasearch engines include Skyscanner and Kayak.com, which aggregate search results of online travel agencies and provider websites, and Searx, a free and open-source search engine which aggregates results from other internet search engines.


History

The first person to incorporate the idea of meta searching was Daniel Dreilinger of Colorado State University. He developed SearchSavvy, which let users search up to 20 different search engines and directories at once. Although fast, the search engine was restricted to simple searches and thus wasn't reliable. University of Washington student Erik Selberg released a more "updated" version called MetaCrawler. This search engine improved on SearchSavvy's accuracy by adding its own search syntax behind the scenes and matching that syntax to the syntax of the search engines it was probing. MetaCrawler reduced the number of search engines queried to six, but although it produced more accurate results, it still wasn't considered as accurate as searching a query in an individual engine.

On May 20, 1996, Wired launched HotBot, a search engine whose results came from the Inktomi and Direct Hit databases. It was known for its fast results and for its ability to search within search results. After it was bought by Lycos in 1998, development of the search engine stagnated and its market share fell drastically. After going through a few alterations, HotBot was redesigned into a simplified search interface, with its features being incorporated into Lycos' website redesign.

A metasearch engine called Anvish was developed by Bo Shu and Subhash Kak in 1999; its search results were sorted using instantaneously trained neural networks. This was later incorporated into another metasearch engine called Solosearch. In August 2000, India got its first metasearch engine when HumHaiIndia.com was launched. It was developed by the then 16-year-old Sumeet Lamba. The website was later rebranded as Tazaa.com.

Ixquick is a search engine known for its privacy policy statement. Developed and launched in 1998 by David Bodnick, it is owned by Surfboard Holding BV. In June 2006, Ixquick began to delete the private details of its users, following the same process as Scroogle. Ixquick's privacy policy includes no recording of users' IP addresses, no identifying cookies, no collection of personal data, and no sharing of personal data with third parties. It also uses a unique ranking system in which results are ranked by stars: the more stars a result has, the more search engines agreed on it.

In April 2005, Dogpile, then owned and operated by InfoSpace, Inc., collaborated with researchers from the University of Pittsburgh and Pennsylvania State University to measure the overlap and ranking differences of leading web search engines in order to gauge the benefits of using a metasearch engine to search the web. Results found that from 10,316 random user-defined queries from Google, Yahoo!, and Ask Jeeves, only 3.2% of first-page search results were the same across those search engines for a given query. Another study later that year, using 12,570 random user-defined queries from Google, Yahoo!, MSN Search, and Ask Jeeves, found that only 1.1% of first-page search results were the same across those search engines for a given query.


Advantages

By sending multiple queries to several other search engines, a metasearch engine extends the coverage of the topic and allows more information to be found. Metasearch engines use the indexes built by other search engines, aggregating and often post-processing results in unique ways. A metasearch engine has an advantage over a single search engine because more results can be retrieved with the same amount of effort. It also spares users the work of individually typing searches into different engines to look for resources.

Metasearching is also a useful approach if the purpose of the user's search is to get an overview of the topic or to get quick answers. Instead of having to go through multiple search engines like Yahoo! or Google and comparing results, metasearch engines are able to quickly compile and combine results. They can do so either by listing results from each engine queried with no additional post-processing (Dogpile) or by analyzing the results and ranking them by their own rules (Ixquick, MetaCrawler, and Vivísimo). A metasearch engine can also hide the searcher's IP address from the search engines queried, thus providing privacy to the search.


Disadvantages

Metasearch engines are not capable of parsing query forms or fully translating query syntax. The number of hyperlinks generated by metasearch engines is limited, so they do not provide the user with the complete results of a query. The majority of metasearch engines do not provide more than ten linked files from a single search engine and generally do not interact with larger search engines for results. Pay-per-click links are prioritised and are normally displayed first.

Metasearching also gives the illusion that there is more coverage of the topic queried, particularly if the user is searching for popular or commonplace information: it is common to end up with multiple identical results from the queried engines. It is also harder for users to include advanced search syntax in the query, so results may not be as precise as when a user uses an advanced search interface at a specific engine. This results in many metasearch engines using simple searching.


Operation

A metasearch engine accepts a single search request from the user. This search request is then passed on to the databases of other search engines. A metasearch engine does not create its own database of web pages but generates a federated database system integrating data from multiple sources.

Since every search engine is unique and has different algorithms for generating ranked data, duplicates will also be generated. To remove duplicates, a metasearch engine processes this data and applies its own algorithm. A revised list is produced as an output for the user; a minimal sketch of this dispatch, merge, and deduplication loop follows the list below. When a metasearch engine contacts other search engines, these search engines can respond in three ways:

* They will cooperate and provide complete access to their interface for the metasearch engine, including private access to the index database, and will inform the metasearch engine of any changes made to the index database;
* They can behave in a non-cooperative manner, whereby they will neither deny nor provide any access to their interface;
* They can be completely hostile and refuse the metasearch engine any access to their database, in serious circumstances pursuing legal action.
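
The querying, merging, and deduplication described above can be sketched in a few lines of Python. This is a hypothetical illustration, not any particular engine's implementation: the engine names, the canned result data, and the keep-the-best-rank deduplication rule are all assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Simulated connectors: a real metasearch engine would call each
# engine's search API or parse its result pages here (data is invented).
FAKE_INDEX = {
    "EngineA": [("https://example.com/a", "Page A"), ("https://example.com/b", "Page B")],
    "EngineB": [("https://example.com/b", "Page B"), ("https://example.com/c", "Page C")],
}

def fetch_results(engine, query):
    # Return (url, title, rank) tuples; rank is the 1-based result position.
    return [(url, title, pos + 1)
            for pos, (url, title) in enumerate(FAKE_INDEX.get(engine, []))]

def metasearch(query, engines):
    # Pass the single user query on to every engine concurrently.
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        result_lists = pool.map(lambda e: fetch_results(e, query), engines)

    # Merge and deduplicate: different engines return overlapping results,
    # so keep only the best (lowest) rank seen for each URL.
    best = {}
    for results in result_lists:
        for url, title, rank in results:
            if url not in best or rank < best[url][1]:
                best[url] = (title, rank)

    # Produce the revised list presented to the user, ordered by rank.
    return sorted(((u, t, r) for u, (t, r) in best.items()), key=lambda x: x[2])

print(metasearch("example query", ["EngineA", "EngineB"]))
```

Running the sketch yields three unique results from four raw hits, with the duplicate of https://example.com/b collapsed to its better rank.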


Architecture of ranking

Web pages that are highly ranked on many search engines are likely to be more relevant in providing useful information. However, all search engines have different ranking scores for each website, and most of the time these scores are not the same. This is because search engines prioritise different criteria and methods for scoring; hence, a website might appear highly ranked on one search engine and lowly ranked on another. This is a problem because metasearch engines rely heavily on the consistency of this data to generate reliable combined rankings.
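
Because raw scores are incomparable across engines, one common workaround, used here purely as an illustration and not described in the source, is to fall back on rank positions, for example with a Borda-style count. The engine names and rankings below are invented.

```python
# Borda-style rank aggregation: reconcile engines whose raw scores are
# incomparable by converting result positions into points.
rankings = {
    "engine1": ["d1", "d2", "d3"],  # invented ranked result lists
    "engine2": ["d2", "d1", "d4"],
}

def borda_fuse(rankings):
    scores = {}
    for ranked in rankings.values():
        n = len(ranked)
        for pos, doc in enumerate(ranked):
            # A document at position pos earns n - pos points from this engine.
            scores[doc] = scores.get(doc, 0) + (n - pos)
    # Higher totals mean more engines ranked the document highly.
    return sorted(scores, key=scores.get, reverse=True)

print(borda_fuse(rankings))  # d1 and d2 tie at 5 points, ahead of d3 and d4
```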


Fusion

A metasearch engine uses the process of fusion to filter data for more efficient results. The two main fusion methods used are collection fusion and data fusion.

* Collection fusion: also known as distributed retrieval, this deals specifically with search engines that index unrelated data. To determine how valuable these sources are, collection fusion looks at the content and then ranks the data on how likely it is to provide relevant information in relation to the query. From what is generated, collection fusion picks out the best resources from the rank. These chosen resources are then merged into a list.
* Data fusion: this deals with information retrieved from search engines that index common data sets. The process is very similar: the initial rank scores of the data are merged into a single list, after which the original ranks of each of these documents are analysed. Data with high scores indicate a high level of relevancy to a particular query and are therefore selected. To produce a list, the scores must be normalized using algorithms such as CombSum (see the sketch after this list), because search engines adopt different scoring policies, which makes their raw scores incomparable.
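
CombSum, named above, sums each document's scores across engines after normalization. The sketch below pairs it with min-max normalization per engine; that normalization choice and the scores themselves are assumptions for illustration, as the text does not specify them.

```python
# CombSum sketch: min-max normalize each engine's scores, then sum the
# normalized scores per document. Raw scales differ between engines,
# so summing unnormalized scores would be meaningless.
results = {
    "engine1": {"d1": 42.0, "d2": 17.0, "d3": 8.0},   # invented scores
    "engine2": {"d1": 0.91, "d3": 0.87, "d4": 0.12},
}

def normalize(scores):
    lo, hi = min(scores.values()), max(scores.values())
    span = (hi - lo) or 1.0  # guard against all-equal scores
    return {doc: (s - lo) / span for doc, s in scores.items()}

def combsum(results):
    fused = {}
    for engine_scores in results.values():
        for doc, s in normalize(engine_scores).items():
            fused[doc] = fused.get(doc, 0.0) + s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

print(combsum(results))  # d1 first: both engines scored it highest
```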


Spamdexing

Spamdexing is the deliberate manipulation of search engine indexes. It uses a number of methods to manipulate the relevance or prominence of indexed resources in a manner unaligned with the intention of the indexing system. Spamdexing can be very distressing for users and problematic for search engines because the returned contents of searches have poor precision, which eventually makes the search engine unreliable for the user. To tackle spamdexing, search robot algorithms are made more complex and are changed almost every day to eliminate the problem.

Spamdexing is a major problem for metasearch engines because it tampers with the web crawler's indexing criteria, which are heavily relied upon to format ranking lists. It manipulates the natural ranking system of a search engine and places websites higher on the ranking list than they would naturally be placed. There are three primary methods used to achieve this:


Content spam

Content spam comprises techniques that alter the logical view that a search engine has of a page's contents. Techniques include:

* Keyword stuffing - calculated placement of keywords within a page to raise the keyword count, variety, and density of the page (a naive density check is sketched after this list)
* Hidden/invisible text - unrelated text disguised by making it the same color as the background, using a tiny font size, or hiding it within the HTML code
* Meta-tag stuffing - repeating keywords in meta tags and/or using keywords unrelated to the site's content
* Doorway pages - low-quality web pages with little content but relatable keywords or phrases
* Scraper sites - programs that copy content from other websites to populate a website
* Article spinning - rewriting existing articles, as opposed to copying content from other sites
* Machine translation - using machine translation to rewrite content in several different languages, resulting in illegible text
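
To illustrate why keyword stuffing is detectable, the sketch below computes a term's density in a page's text. The 5% cutoff is an invented, illustrative threshold; real engines rely on far richer signals than any single ratio.

```python
import re

def keyword_density(text, keyword):
    # Fraction of all words in the text that match the given keyword.
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return words.count(keyword.lower()) / len(words)

def looks_stuffed(text, keyword, threshold=0.05):
    # Purely illustrative cutoff, not a value used by any actual engine.
    return keyword_density(text, keyword) > threshold

page = "cheap flights cheap flights book cheap flights today cheap flights"
print(looks_stuffed(page, "cheap"))  # True: "cheap" is 4 of 10 words
```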


Link spam

Link spam consists of links between pages that are present for reasons other than merit. Techniques include:

* Link-building software - automating the search engine optimization (SEO) process
* Link farms - pages that reference each other (also known as mutual admiration societies); a minimal reciprocal-link check is sketched after this list
* Hidden links - placing hyperlinks where visitors won't or can't see them
* Sybil attack - forging multiple identities for malicious intent
* Spam blogs - blogs created solely for commercial promotion and the passage of link authority to target sites
* Page hijacking - creating a copy of a popular website with similar content that redirects web surfers to unrelated or even malicious websites
* Buying expired domains - buying expiring domains and replacing their pages with links to unrelated websites
* Cookie stuffing - placing an affiliate tracking cookie on a website visitor's computer without their knowledge
* Forum spam - inserting links to spam sites on websites that can be edited by users
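
Link farms hinge on pages linking back to one another. Given a hypothetical link graph mapping each page to its outbound links, reciprocal pairs can be found with a simple check; this illustrates the pattern only and is not a production spam detector.

```python
# Find reciprocal link pairs in a toy link graph (page -> outbound links).
# Dense clusters of such pairs are the signature of a link farm.
graph = {
    "a.example": {"b.example", "c.example"},  # invented graph
    "b.example": {"a.example"},
    "c.example": {"a.example", "b.example"},
}

def reciprocal_pairs(graph):
    pairs = set()
    for page, outlinks in graph.items():
        for target in outlinks:
            if page in graph.get(target, set()):
                pairs.add(frozenset((page, target)))
    return sorted(tuple(sorted(p)) for p in pairs)

print(reciprocal_pairs(graph))
# [('a.example', 'b.example'), ('a.example', 'c.example')]
```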


Cloaking

Cloaking is an SEO technique in which different material and information are sent to the web crawler and to the web browser. It is commonly used as a spamdexing technique because it can trick search engines into either visiting a site that is substantially different from the search engine's description of it or giving a certain site a higher ranking.
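
The mechanism is simple to sketch: the server inspects who is requesting the page and branches on it. The user-agent markers below are hypothetical and simplified; real cloaking also keys on crawler IP ranges and other signals.

```python
CRAWLER_MARKERS = ("googlebot", "bingbot", "slurp")  # illustrative list only

def select_content(user_agent):
    # A cloaking server branches on the requester: crawlers receive
    # keyword-rich text while human visitors see a different page.
    if any(marker in user_agent.lower() for marker in CRAWLER_MARKERS):
        return "<html>keyword-rich page tailored for indexing</html>"
    return "<html>unrelated page shown to human visitors</html>"

print(select_content("Mozilla/5.0 (compatible; Googlebot/2.1)"))
```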


See also

* Federated search
* List of metasearch engines
* Metabrowsing
* Multisearch
* Search aggregator
* Search engine optimization
* Hybrid search engine

