In
computer science
Computer science is the study of computation, information, and automation. Computer science spans Theoretical computer science, theoretical disciplines (such as algorithms, theory of computation, and information theory) to Applied science, ...
, locality-sensitive hashing (LSH) is a
fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability.
(The number of buckets is much smaller than the universe of possible input items.)
Since similar items end up in the same buckets, this technique can be used for
data clustering
Cluster analysis or clustering is the data analyzing technique in which task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some specific sense defined by the analyst) to each o ...
and
nearest neighbor search
Nearest neighbor search (NNS), as a form of proximity search, is the optimization problem of finding the point in a given set that is closest (or most similar) to a given point. Closeness is typically expressed in terms of a dissimilarity function: ...
. It differs from
conventional hashing techniques in that
hash collision
In computer science, a hash collision or hash clash is when two distinct pieces of data in a hash table share the same hash value. The hash value in this case is derived from a hash function which takes a data input and returns a fixed length of ...
s are maximized, not minimized. Alternatively, the technique can be seen as a way to
reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.
Hashing-based approximate
nearest-neighbor search algorithms generally use one of two main categories of hashing methods: either data-independent methods, such as locality-sensitive hashing (LSH); or data-dependent methods, such as locality-preserving hashing (LPH).
Locality-preserving hashing was initially devised as a way to facilitate
data pipelining in implementations of
massively parallel
Massively parallel is the term for using a large number of computer processors (or separate computers) to simultaneously perform a set of coordinated computations in parallel. GPUs are massively parallel architecture with tens of thousands of ...
algorithms that use
randomized routing and
universal hashing
In mathematics and computing, universal hashing (in a randomized algorithm or data structure) refers to selecting a hash function at random from a family of hash functions with a certain mathematical property (see definition below). This guarantees ...
to reduce memory
contention and
network congestion
Network congestion in data networking and queueing theory is the reduced quality of service that occurs when a network node or link is carrying more data than it can handle. Typical effects include queueing delay, packet loss or the blocking of ...
.
Definitions
A finite family
of functions
is defined to be an ''LSH family''
for
* a
metric space
In mathematics, a metric space is a Set (mathematics), set together with a notion of ''distance'' between its Element (mathematics), elements, usually called point (geometry), points. The distance is measured by a function (mathematics), functi ...
,
* a threshold
,
* an approximation factor
,
* and probabilities
if it satisfies the following condition. For any two points
and a hash function
chosen uniformly at random from
:
* If
, then
(i.e., and collide) with probability at least
,
* If
, then
with probability at most
.
Such a family
is called
-sensitive.
LSH with respect to a similarity measure
Alternatively
it is possible to define an LSH family on a universe of items endowed with a similarity function