In computer science, locality-sensitive hashing (LSH) is an algorithmic technique that hashes similar input items into the same "buckets" with high probability.
(The number of buckets is much smaller than the universe of possible input items.)
Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
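As an illustration of the bucketing idea, the following minimal sketch uses random hyperplanes, one common choice of locality-sensitive hash for vectors compared by angle. The choice of scheme and the names `make_hyperplane_hash`, `build_index`, `dim`, `n_bits`, and `seed` are assumptions made for this example and are not part of the text above. Each vector is reduced to a short tuple of bits, and vectors that point in similar directions tend to receive the same tuple.

```python
import random
from collections import defaultdict


def make_hyperplane_hash(dim, n_bits, seed=0):
    """Return a function mapping a dim-vector to an n_bits-tuple bucket key."""
    rng = random.Random(seed)
    # Each row is the normal vector of one random hyperplane through the origin.
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]

    def bucket_of(vec):
        # One bit per hyperplane: which side of the plane the vector lies on.
        return tuple(
            1 if sum(w * x for w, x in zip(plane, vec)) >= 0 else 0
            for plane in planes
        )

    return bucket_of


def build_index(items, bucket_of):
    """Group item names by bucket key; colliding items share a list."""
    index = defaultdict(list)
    for name, vec in items.items():
        index[bucket_of(vec)].append(name)
    return index


bucket_of = make_hyperplane_hash(dim=3, n_bits=8, seed=1)
items = {
    "a": [1.0, 2.0, 3.0],
    "b": [2.0, 4.0, 6.0],     # same direction as "a"
    "c": [-1.0, -2.0, -3.0],  # opposite direction to "a"
}
index = build_index(items, bucket_of)
# Candidate neighbours of "a" are the other items stored in its bucket.
candidates = [name for name in index[bucket_of(items["a"])] if name != "a"]
```

In this sketch "a" and "b" always share a bucket because they differ only by a positive scale factor, while "c" lies on the opposite side of every hyperplane. A nearest neighbor query then only needs to compare against the candidates in the query's bucket rather than against every stored item.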
Hashing-based approximate nearest neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH).
Definitions
An ''LSH family'' $\mathcal{F}$ is defined for
* a metric space $\mathcal{M} = (M, d)$,
* a threshold $R > 0$,
* an approximation factor $c > 1$,
* and probabilities $P_1$ and $P_2$.
This family $\mathcal{F}$ is a set of functions $h \colon M \to S$ that map elements of the metric space to buckets $s \in S$. An LSH family must satisfy the following conditions for any two points $p, q \in M$ and any hash function $h$ chosen uniformly at random from $\mathcal{F}$:
* if $d(p, q) \le R$, then $h(p) = h(q)$ (i.e., $p$ and $q$ collide) with probability at least $P_1$,
* if $d(p, q) \ge cR$, then $h(p) = h(q)$ with probability at most $P_2$.
A family is interesting when $P_1 > P_2$. Such a family $\mathcal{F}$ is called ''$(R, cR, P_1, P_2)$-sensitive''.
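The two collision conditions can be checked directly on a small concrete family. The sketch below assumes, purely for illustration, the bit-sampling family on binary strings of length `d` under Hamming distance, where each hash function returns one fixed coordinate; this family is not introduced in the text above, and all function names are hypothetical. Two strings at Hamming distance `r` agree on `d - r` coordinates, so they collide with probability exactly `1 - r/d`. For a threshold `R` and approximation factor `c`, the family therefore meets the conditions with probabilities `1 - R/d` and `1 - c*R/d`.

```python
from fractions import Fraction
from itertools import product


def hamming(p, q):
    """Number of coordinates in which p and q differ."""
    return sum(a != b for a, b in zip(p, q))


def bit_sampling_family(d):
    """The d functions h_i(x) = x[i], for i = 0, ..., d - 1."""
    return [lambda x, i=i: x[i] for i in range(d)]


def collision_probability(family, p, q):
    """Exact Pr[h(p) == h(q)] for h drawn uniformly from a finite family."""
    hits = sum(1 for h in family if h(p) == h(q))
    return Fraction(hits, len(family))


def is_sensitive_on(family, dist, points, R, c, P1, P2):
    """Check both LSH conditions on every pair of the given points."""
    for i, p in enumerate(points):
        for q in points[i + 1:]:
            pr = collision_probability(family, p, q)
            if dist(p, q) <= R and pr < P1:
                return False  # a close pair collides too rarely
            if dist(p, q) >= c * R and pr > P2:
                return False  # a far pair collides too often
    return True


d, R, c = 8, 2, 2
family = bit_sampling_family(d)
points = list(product((0, 1), repeat=d))  # all 256 binary strings of length 8
P1 = 1 - Fraction(R, d)      # 3/4
P2 = 1 - Fraction(c * R, d)  # 1/2
ok = is_sensitive_on(family, hamming, points, R, c, P1, P2)
```

Here `ok` is `True`, and since `P1 > P2` the family is an interesting one in the sense above. Raising `P1` above `3/4` makes the check fail, because pairs at distance exactly `R = 2` collide with probability exactly `3/4`.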
Alternatively, it is defined with respect to a universe of items that have a similarity function