SimHash
   HOME

TheInfoList



OR:

In
computer science Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, SimHash is a technique for quickly estimating how similar two sets are. The
algorithm In mathematics and computer science, an algorithm () is a finite sequence of Rigour#Mathematics, mathematically rigorous instructions, typically used to solve a class of specific Computational problem, problems or to perform a computation. Algo ...
is used by the
Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...
Crawler to find near duplicate pages. It was created by
Moses Charikar Moses Samson Charikar is an Indian-American computer scientist who works as a professor at Stanford University. He was previously a professor at Princeton University. The topics of his research include approximation algorithms, streaming algorithm ...
. In 2021 Google announced its intent to also use the algorithm in their newly created FLoC (Federated Learning of Cohorts) system.


Evaluation and benchmarks

A large scale evaluation has been conducted by
Google Google LLC (, ) is an American multinational corporation and technology company focusing on online advertising, search engine technology, cloud computing, computer software, quantum computing, e-commerce, consumer electronics, and artificial ...
in 2006 to compare the performance of
Minhash In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 confer ...
and Simhash algorithms. In 2007 Google reported using Simhash for duplicate detection for web crawling and using Minhash and LSH for
Google News Google News is a news aggregator service developed by Google. It presents a continuous flow of links to articles organized from thousands of publishers and magazines. Google News is available as an app on Android, iOS, and the Web. Google ...
personalization. .


See also

*
MinHash In computer science and data mining, MinHash (or the min-wise independent permutations locality sensitive hashing scheme) is a technique for quickly estimating how similar two sets are. The scheme was published by Andrei Broder in a 1997 confer ...
*
w-shingling In natural language processing a ''w''-shingling is a set of ''unique'' ''shingles'' (therefore ''n-grams'') each of which is composed of contiguous subsequences of tokens within a document, which can then be used to ascertain the similarity b ...
*
Count–min sketch In computing, the count–min sketch (CM sketch) is a probabilistic data structure that serves as a frequency table of events in a stream of data. It uses hash functions to map events to frequencies, but unlike a hash table uses only sub-linear ...
*
Locality-sensitive hashing In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. (The number of buckets is much smaller than the universe of possible input items.) Si ...


References


External links


Simhash Princeton PaperSimhash explainedComparison of MinHash vs. Simhash
{{Machine learning evaluation metrics Hash functions Clustering criteria Hashing Probabilistic data structures