DeepPeep was a search engine that aimed to crawl and index every database on the public Web. Unlike traditional search engines, which crawl existing webpages and their hyperlinks, DeepPeep aimed to provide access to the so-called deep web: World Wide Web content that is only available through, for instance, typed queries into databases. The project started at the University of Utah and was overseen by Juliana Freire, an associate professor in the WebDB group of the university's School of Computing. According to Freire, the goal was to make 90% of all WWW content accessible. The project ran a beta search engine and was sponsored by the University of Utah and a $243,000 grant from the National Science Foundation. It generated worldwide interest.
How it works
Similar to Google, Yahoo, and other search engines, DeepPeep allows users to type in a keyword and returns a list of links and databases with information regarding the keyword.
However, what sets DeepPeep apart from other search engines is that it uses the ACHE Crawler, Hierarchical Form Identification, Context-Aware Form Clustering, and LabelEx to locate, analyze, and organize web forms for easy access by users.
ACHE Crawler
The ACHE Crawler is used to gather links and utilizes a learning strategy that increases the collection rate of links as the crawler continues to search. Other focused crawlers gather Web pages that have specific properties or keywords. The ACHE Crawler instead includes a page classifier, which allows it to sort out irrelevant pages of a domain, as well as a link classifier, which ranks each link by its relevance to a topic. As a result, the ACHE Crawler downloads the links with the highest relevance first and saves resources by not downloading irrelevant data.
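The interplay of the two classifiers can be illustrated with a minimal sketch. It is not the ACHE implementation: the in-memory `TOY_WEB`, the keyword-based `link_score` and `page_is_relevant` functions, and the topic terms are all hypothetical stand-ins for trained classifiers, and the online learning step is omitted. It only shows how a priority queue lets the best-scored links be fetched first while irrelevant pages are never expanded.

```python
import heapq

# Hypothetical in-memory "web": URL -> (page text, [(target URL, anchor text)]).
TOY_WEB = {
    "seed": ("car search portal",
             [("cars/search", "used car search"), ("blog", "our company blog")]),
    "cars/search": ("search used car listings by make and model",
                    [("cars/form", "advanced car search form")]),
    "cars/form": ("car search form make model price", []),
    "blog": ("company news and holiday photos", [("blog/2", "more photos")]),
    "blog/2": ("more holiday photos", []),
}

TOPIC_TERMS = {"car", "search", "form"}  # assumed topic vocabulary


def link_score(anchor_text):
    """Stand-in link classifier: fraction of anchor words that are topic terms."""
    words = anchor_text.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def page_is_relevant(text):
    """Stand-in page classifier: the page mentions at least two topic terms."""
    return len(TOPIC_TERMS & set(text.lower().split())) >= 2


def focused_crawl(seed, budget):
    """Fetch at most `budget` pages, following the best-scored links first."""
    frontier = [(-1.0, seed)]  # heapq is a min-heap, so scores are negated
    seen = {seed}
    relevant = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        budget -= 1
        text, links = TOY_WEB[url]
        if not page_is_relevant(text):
            continue  # links on irrelevant pages are not followed
        relevant.append(url)
        for target, anchor in links:
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-link_score(anchor), target))
    return relevant


print(focused_crawl("seed", budget=3))
```

With a budget of three downloads, the sketch reaches the search form without ever fetching the blog pages, which is the resource saving described above.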
Hierarchical Form Identification
In order to further eliminate irrelevant links and search results, DeepPeep uses the Hierarchical Form Identification (HIFI) framework, which classifies links and search results based on the website's structure and content.
Unlike other forms of classification, which rely solely on web form labels for organization, HIFI utilizes both the structure and the content of the web form for classification. Using these two classifiers, HIFI organizes the web forms in a hierarchical fashion that ranks a web form's relevance to the target keyword.
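A two-stage classification of this kind might look roughly like the following sketch, which first inspects a form's structure and only then its content. The rules and the `DOMAIN_TERMS` vocabulary are assumptions made for illustration; the actual HIFI classifiers are learned models, not hand-written rules.

```python
from html.parser import HTMLParser

# Hypothetical per-domain vocabularies used by the content stage.
DOMAIN_TERMS = {
    "auto": {"make", "model", "mileage"},
    "airfare": {"departure", "arrival", "passengers"},
}


class FormFeatures(HTMLParser):
    """Collects structural counts and visible words from a form's HTML."""

    def __init__(self):
        super().__init__()
        self.counts = {"text": 0, "password": 0, "select": 0}
        self.words = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "select":
            self.counts["select"] += 1
        elif tag == "input":
            kind = attrs.get("type") or "text"  # HTML defaults to type=text
            if kind in ("text", "password"):
                self.counts[kind] += 1

    def handle_data(self, data):
        self.words.extend(data.lower().split())


def classify_form(html):
    """Return (structural class, domain or None) for one form."""
    parser = FormFeatures()
    parser.feed(html)
    parser.close()
    counts = parser.counts
    # Stage 1, structure: password fields or no query fields -> not searchable.
    if counts["password"] > 0 or counts["text"] + counts["select"] == 0:
        return ("non-searchable", None)
    # Stage 2, content: pick the domain whose vocabulary overlaps most.
    words = set(parser.words)
    best = max(DOMAIN_TERMS, key=lambda d: len(DOMAIN_TERMS[d] & words))
    if not DOMAIN_TERMS[best] & words:
        return ("searchable", None)
    return ("searchable", best)


car_form = ('<form>Make <select name="make"></select> '
            'Model <input type="text" name="model"></form>')
login_form = ('<form>User <input type="text" name="u"> '
              'Password <input type="password" name="p"></form>')
print(classify_form(car_form))
print(classify_form(login_form))
```

Splitting the decision this way means the content stage only ever sees forms that already look like query interfaces, so login or signup forms are filtered out before any domain is assigned.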
Context-Aware Clustering
When there is no domain of interest, or the specified domain has multiple definitions, DeepPeep must separate the web forms and cluster them into similar domains. The search engine uses context-aware clustering to group similar links in the same domain by modeling each web form as a set of hyperlinks and using its context for comparison. Unlike other techniques that require complicated label extraction and manual pre-processing of web forms, context-aware clustering is done automatically and uses meta-data to handle web forms that are content rich and contain multiple attributes.
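The idea of comparing forms by their context can be sketched as follows. This is a simplified illustration, not DeepPeep's algorithm: it assumes each form is represented by a bag of words drawn from the form's own text plus the text surrounding it, compares forms by cosine similarity, and uses a greedy single-pass grouping with an arbitrary threshold of 0.3.

```python
import math
from collections import Counter


def vectorize(form_text, context_text):
    """Bag of words over the form's own text plus its surrounding context."""
    return Counter((form_text + " " + context_text).lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster(vectors, threshold=0.3):
    """Greedy clustering: join the first cluster whose first member is similar."""
    clusters = []  # lists of indices; the first index represents the cluster
    for i, vec in enumerate(vectors):
        for members in clusters:
            if cosine(vec, vectors[members[0]]) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])  # no similar cluster found, start a new one
    return clusters


# Hypothetical (form text, context text) pairs.
forms = [
    ("make model price", "used cars dealer"),
    ("departure arrival date", "cheap flights airline"),
    ("make model year", "cars for sale"),
    ("departure return passengers", "book flights"),
]
vectors = [vectorize(form, context) for form, context in forms]
print(cluster(vectors))
```

No labels are extracted and no manual rules are written here; the grouping falls out of the shared vocabulary alone, which mirrors the automatic nature of the approach described above.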
LabelEx
DeepPeep further extracts information called meta-data from these pages, which allows for better ranking of links and databases. It does so with LabelEx, an approach for automatic decomposition and extraction of meta-data. Meta-data is data from web links that gives information about other domains. LabelEx identifies the element-label mapping and uses that mapping to extract meta-data accurately, unlike conventional approaches that rely on manually specified extraction rules.
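To make the notion of an element-label mapping concrete, the sketch below pairs each form element with a label using two simple heuristics: an explicit `<label for=...>` match, and otherwise the nearest preceding text. These hand-written heuristics are an assumption for illustration only; LabelEx itself derives the mapping automatically rather than from fixed rules, and the class and function names here are hypothetical.

```python
from html.parser import HTMLParser


class LabelMapper(HTMLParser):
    """Maps form element names to label text using two simple heuristics."""

    def __init__(self):
        super().__init__()
        self.explicit = {}     # element id -> text of <label for=id>
        self.elements = []     # (name, id, nearest preceding text)
        self.last_text = ""
        self.label_for = None  # id targeted by the currently open <label>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "label":
            self.label_for = attrs.get("for")
        elif tag in ("input", "select", "textarea"):
            self.elements.append(
                (attrs.get("name"), attrs.get("id"), self.last_text))

    def handle_endtag(self, tag):
        if tag == "label":
            self.label_for = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        self.last_text = text
        if self.label_for is not None:
            self.explicit[self.label_for] = text

    def mapping(self):
        """Prefer an explicit <label for=...>, else the nearest preceding text."""
        return {name: self.explicit.get(el_id, near)
                for name, el_id, near in self.elements if name}


def extract_labels(html):
    parser = LabelMapper()
    parser.feed(html)
    parser.close()
    return parser.mapping()


sample = ('<form><label for="mk">Make</label><input id="mk" name="make"> '
          'Max price <input name="price"></form>')
print(extract_labels(sample))
```

The resulting dictionary is the kind of meta-data the paragraph refers to: it says that the field named `make` asks for a make and the field named `price` asks for a maximum price.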
Ranking
When the search results appear after the user has entered a keyword, DeepPeep ranks the links based on three features: term content, number of backlinks, and PageRank. First, the term content is determined by the content of the web link and its relevance to the keyword. Backlinks are hyperlinks on other websites that point to the page. PageRank is an algorithm for ranking websites in search engine results; it works by counting the number and quality of links to a website to determine its importance. PageRank and backlink information are obtained from outside sources such as Google, Yahoo, and Bing.
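How the three features are combined is not specified here, so the following is only one plausible sketch: a weighted sum of a term-match score, a log-scaled backlink count, and a PageRank value assumed to be already normalized to the range 0 to 1. The weights, field names, and sample data are all hypothetical.

```python
import math


def term_score(query, text):
    """Fraction of query terms that occur in the result's text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def rank(query, results, weights=(0.5, 0.25, 0.25)):
    """Sort results by a weighted sum of the three features (weights assumed)."""
    if not results:
        return []
    w_term, w_back, w_pr = weights
    # Log-scale backlinks so a few very popular sites do not dominate.
    max_log = math.log1p(max(r["backlinks"] for r in results)) or 1.0

    def score(r):
        backlink_part = math.log1p(r["backlinks"]) / max_log
        return (w_term * term_score(query, r["text"])
                + w_back * backlink_part
                + w_pr * r["pagerank"])  # assumed to lie in [0, 1]

    return sorted(results, key=score, reverse=True)


# Hypothetical results; backlink and PageRank values would come from
# outside sources in the system described above.
results = [
    {"url": "a", "text": "used car search", "backlinks": 10, "pagerank": 0.2},
    {"url": "b", "text": "hotel booking", "backlinks": 1000, "pagerank": 0.9},
    {"url": "c", "text": "car search by make", "backlinks": 100, "pagerank": 0.5},
]
print([r["url"] for r in rank("car search", results)])
```

With these weights, a page that matches the query outranks a far more popular page that does not, which reflects the term content feature being listed first.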
Beta Launch
The DeepPeep beta launched with coverage of only seven domains: auto, airfare, biology, book, hotel, job, and rental. Under these seven domains, DeepPeep offered access to 13,000 Web forms.
The website was accessible at DeepPeep.org, but it has been inactive since the beta version was taken down.
External links
* DeepPeep: Discover the hidden web (http://www.deeppeep.org/), accessed 2009-02-23, archived 2012-05-09 at https://web.archive.org/web/20120509073423/http://www.deeppeep.org/. The live site was found dead in November 2016, with the domain appearing in relation to Register.com.