The normalized Google distance (NGD) is a
semantic similarity
Semantic similarity is a metric defined over a set of documents or terms, where the idea of distance between items is based on the likeness of their meaning or semantic content as opposed to lexicographical similarity. These are mathematical tools ...
measure derived from the number of hits returned by the
Google search engine
Google Search (also known simply as Google or Google.com) is a search engine operated by Google. It allows users to search for information on the Web by entering keywords or phrases. Google Search uses algorithms to analyze and rank websites ...
for a given
set of
keyword
Keyword may refer to:
Computing
* Keyword (Internet search), a word or phrase typically used by bloggers or online content creator to rank a web page on a particular topic
* Index term, a term used as a keyword to documents in an information syste ...
s.
Keywords with the same or similar meanings in a natural language sense tend to be "close" in units of normalized Google distance, while words with dissimilar meanings tend to be farther apart.
Specifically, the NGD between two search terms ''x'' and ''y'' is
:
where ''N'' is the total number of web pages searched by Google multiplied by the average number of singleton search terms occurring on pages; ''f''(''x'') and ''f''(''y'') are the number of hits for search terms ''x'' and ''y'', respectively; and ''f''(''x'', ''y'') is the number of web pages on which both ''x'' and ''y'' occur.
If the
then x and y are viewed as alike as possible, but if
then x and y are very different.
If the two search terms ''x'' and ''y'' never occur together on the same web page, but do occur separately, the NGD between them is infinite. If both terms always occur together, their NGD is zero.
Example: On 9 April 2013, googling for "Shakespeare" gave 130,000,000 hits;
googling for "Macbeth" gave 26,000,000 hits; and googling
for "Shakespeare Macbeth" gave 20,800,000 hits.
The number of pages indexed by Google was estimated by the number
of hits of the search term "the" which was 25,270,000,000 hits. Assuming
there are about 1,000 search terms on the average page this gives
.
Hence
:
.
"Shakespeare" and "Macbeth" are
very much alike according to the relative semantics supplied by
Google
Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
.
Introduction
The normalized Google distance is derived from the earlier
normalized compression distance.
Namely, objects can be given literally, like the literal four-letter genome of a mouse,
or the literal text of ''
Macbeth
''Macbeth'' (, full title ''The Tragedie of Macbeth'') is a tragedy by William Shakespeare. It is thought to have been first performed in 1606. It dramatises the damaging physical and psychological effects of political ambition on those w ...
'' by
Shakespeare
William Shakespeare ( 26 April 1564 – 23 April 1616) was an English playwright, poet and actor. He is widely regarded as the greatest writer in the English language and the world's pre-eminent dramatist. He is often called England's natio ...
. The similarity of these objects is given by the NCD. For
simplicity we take it that all meaning of the object
is represented by the literal object itself. Objects can also be
given by name, like 'the four-letter genome of a mouse,'
or 'the text of ''
Macbeth
''Macbeth'' (, full title ''The Tragedie of Macbeth'') is a tragedy by William Shakespeare. It is thought to have been first performed in 1606. It dramatises the damaging physical and psychological effects of political ambition on those w ...
'' by
Shakespeare
William Shakespeare ( 26 April 1564 – 23 April 1616) was an English playwright, poet and actor. He is widely regarded as the greatest writer in the English language and the world's pre-eminent dramatist. He is often called England's natio ...
.' There are
also objects that cannot be given literally, but only by name,
and that acquire their meaning from their contexts in background common
knowledge in humankind, like "home" or "red". The similarity between names for objects is
given by the NGD.
Google distribution and Google code
The probabilities of Google search terms, conceived as
the frequencies of page counts returned by Google divided by
the number of pages indexed by Google (multiplied by the average number of search terms in those pages), approximate the actual relative frequencies of those search terms as actually used in society. Based on this premise, the relations represented by the normalized Google distance approximately capture
the assumed true semantic relations governing the search terms. In the NGD, the World Wide Web and Google are used. Other text corpora include
Wikipedia
Wikipedia is a multilingual free online encyclopedia written and maintained by a community of volunteers, known as Wikipedians, through open collaboration and using a wiki-based editing system. Wikipedia is the largest and most-read ref ...
, the King James version of the
Bible
The Bible (from Koine Greek , , 'the books') is a collection of religious texts or scriptures that are held to be sacred in Christianity, Judaism, Samaritanism, and many other religions. The Bible is an anthologya compilation of texts o ...
or the ''
Oxford English Dictionary
The ''Oxford English Dictionary'' (''OED'') is the first and foundational historical dictionary of the English language, published by Oxford University Press (OUP). It traces the historical development of the English language, providing a com ...
'' together with appropriate search engines.
Properties
The following properties are proved in:
*The NGD is roughly in between 0 and
. It can be slightly negative. For example, "red red" gives about 20% more hits of Google on the
World Wide Web
The World Wide Web (WWW), commonly known as the Web, is an information system enabling documents and other web resources to be accessed over the Internet.
Documents and downloadable media are made available to the network through web se ...
than "red". (Mid 2013 there were 4.260.000.000 hits for "red" and 5.500.000.000 hits for "red red". Presently, "red red" now returns far fewer results than "red".) If the
then we view x and y as very dissimilar.
*The NGD is not a
metric
Metric or metrical may refer to:
* Metric system, an internationally adopted decimal system of measurement
* An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement
Mathematics
In mathem ...
. The NGD is zero for x and y that are not equal provided x and y do always occur together on the same web page. From the NGD formula we see that it is
symmetric
Symmetry (from grc, συμμετρία "agreement in dimensions, due proportion, arrangement") in everyday language refers to a sense of harmonious and beautiful proportion and balance. In mathematics, "symmetry" has a more precise definit ...
. The triangle property is not satisfied by the NGD. However, these results are theoretic. It is hard to come up with practical examples of the
World Wide Web
The World Wide Web (WWW), commonly known as the Web, is an information system enabling documents and other web resources to be accessed over the Internet.
Documents and downloadable media are made available to the network through web se ...
using Google that violate the triangle property.
Applications
Applications to colors versus numbers,
primes
A prime number (or a prime) is a natural number greater than 1 that is not a product of two smaller natural numbers. A natural number greater than 1 that is not prime is called a composite number. For example, 5 is prime because the only ways ...
versus non-primes and so are given in,
as well as a randomized massive experiment using
WordNet
WordNet is a lexical database of semantic relations between words in more than 200 languages. WordNet links words into semantic relations including synonyms, hyponyms, and meronyms. The synonyms are grouped into ''synsets'' with short definit ...
categories. In the primes versus non-primes case and the WordNet experiment the NGD method is augmented with a
support vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratories ...
classifier. The experiments consist of 25 positive examples and 25 negative ones. The WordNet experiment consisted of 100 random WordNet categories. The NGD method had a success rate of 87.25%. The mean is 0.8725 while the standard deviation was 0.1169. These rates are about agreement with the WordNet categories which represent the knowledge of researchers with PhDs which entered them. It is rare to see agreement less than 75%.
References
Further reading
*
*
*
*
* (Includes comparison of NGD to other algorithms.)
* {{cite journal, author1=Wong, W., author2=Liu, W., author3= Bennamoun, M., date=2007, name-list-style=amp, title=Tree-Traversing Ant Algorithm for Term Clustering based on Featureless Similarities, journal=Data Mining and Knowledge Discovery, volume=15, issue=3, pages=349–381, doi=10.1007/s10618-007-0073-y, s2cid=14924678 (the use of NGD for term clustering)
Computational linguistics
Statistical distance