HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, automation, and information. Computer science spans theoretical disciplines (such as algorithms, theory of computation, information theory, and automation) to practical disciplines (includin ...
, SimHash is a technique for quickly estimating how similar two sets are. The
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
is used by the
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
Crawler to find near duplicate pages. It was created by
Moses Charikar Moses Samson Charikar is an Indian computer scientist who works as a professor at Stanford University. He was previously a professor at Princeton University. The topics of his research include approximation algorithms, streaming algorithms, and ...
. In 2021 Google announced its intent to also use the algorithm in their newly created FLoC (Federated Learning of Cohorts) system.


Evaluation and benchmarks

A large scale evaluation has been conducted by
Google Google LLC () is an American Multinational corporation, multinational technology company focusing on Search Engine, search engine technology, online advertising, cloud computing, software, computer software, quantum computing, e-commerce, ar ...
in 2006 to compare the performance of
Minhash In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by , and initially used in the AltaV ...
and Simhash algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling and using Minhash and LSH for
Google News Google News is a news aggregator service developed by Google. It presents a continuous flow of links to articles organized from thousands of publishers and magazines. Google News is available as an app on Android, iOS, and the Web. Google r ...
personalization. .


See also

*
MinHash In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was invented by , and initially used in the AltaV ...
*
w-shingling In natural language processing a ''w-shingling'' is a set of ''unique'' ''shingles'' (therefore ''n-grams'') each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity bet ...
*
Count–min sketch In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear s ...


References


External links


Simhash Princeton PaperSimhash explainedComparison of MinHash vs. Simhash
{{Machine learning evaluation metrics Hash functions Clustering criteria Hashing Probabilistic data structures