In
computer science
Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, SimHash is a technique for quickly estimating how
similar two sets are. The
algorithm
In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
is used by the
Google
Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
Crawler to find near duplicate pages. It was created by
Moses Charikar
Moses Samson Charikar is an Indian computer scientist who works as a professor at Stanford University. He was previously a professor at Princeton University. The topics of his research include approximation algorithms, streaming algorithms, and ...
. In 2021 Google announced its intent to also use the algorithm in their newly created
FLoC (Federated Learning of Cohorts) system.
Evaluation and benchmarks
A large scale evaluation has been conducted by
Google
Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
in 2006 to compare the performance of
Minhash
In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by , and initially used in the AltaV ...
and Simhash algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling and using Minhash and
LSH for
Google News
Google News is a news aggregator service developed by Google. It presents a continuous flow of links to articles organized from thousands of publishers and magazines. Google News is available as an app on Android, iOS, and the Web.
Google r ...
personalization.
[
.]
See also
*
MinHash
In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by , and initially used in the AltaV ...
*
w-shingling
In natural language processing a ''w-shingling'' is a set of ''unique'' ''shingles'' (therefore ''n-grams'') each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity bet ...
*
Count–min sketch
In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear s ...
References
External links
Simhash Princeton PaperSimhash explainedComparison of MinHash vs. Simhash
{{Machine learning evaluation metrics
Hash functions
Clustering criteria
Hashing
Probabilistic data structures