The Hopkins statistic (introduced by Brian Hopkins and John Gordon Skellam) is a way of measuring the cluster tendency of a data set. It belongs to the family of sparse sampling tests. It acts as a statistical hypothesis test where the null hypothesis is that the data are generated by a Poisson point process and are thus uniformly randomly distributed.
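When the statistic is computed with the distances from the uniformly generated points in the numerator, its value approaches 1 if individuals are aggregated and tends to 0.5 if they are randomly distributed; values near 0 indicate regularly spaced data. Some sources use the complementary form, under which aggregated data give values approaching 0.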
Preliminaries
A typical formulation of the Hopkins statistic follows.
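:Let <math>X</math> be the set of <math>n</math> data points.
:Generate a random sample <math>\tilde{X}</math> of <math>m \ll n</math> data points sampled without replacement from <math>X</math>.
:Generate a set <math>Y</math> of <math>m</math> uniformly randomly distributed data points.
:Define two distance measures,
::<math>u_i</math>, the minimum distance (given some suitable metric) of <math>y_i \in Y</math> to its nearest neighbour in <math>X</math>, and
::<math>w_i</math>, the minimum distance of <math>\tilde{x}_i \in \tilde{X} \subseteq X</math> to its nearest neighbour <math>x_j \in X</math>, <math>\tilde{x}_i \ne x_j</math>.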
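The sampling and distance steps above can be sketched in Python. This is a minimal sketch rather than a reference implementation: it assumes NumPy and SciPy, the Euclidean metric, and that the uniform points are drawn from the axis-aligned bounding box of the data (the text does not fix a sampling window). The function name `hopkins_distances`, its parameters, and the array names `u` (distances from the uniform points) and `w` (distances from the sampled data points) are illustrative choices.

```python
import numpy as np
from scipy.spatial import KDTree


def hopkins_distances(X, m, rng=None):
    """Return (u, w): nearest-neighbour distances for m uniform points
    and for m data points sampled without replacement from X."""
    rng = np.random.default_rng(rng)
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    if not 0 < m < n:
        raise ValueError("m must satisfy 0 < m < n")

    tree = KDTree(X)

    # Random sample of m data points, drawn without replacement.
    sample_idx = rng.choice(n, size=m, replace=False)

    # m uniformly distributed points.
    # ASSUMPTION: the sampling window is the bounding box of X.
    lo, hi = X.min(axis=0), X.max(axis=0)
    Y = rng.uniform(lo, hi, size=(m, d))

    # Distance from each uniform point to its nearest neighbour in X.
    u, _ = tree.query(Y, k=1)

    # Distance from each sampled data point to its nearest *other* point
    # in X; k=2 because the closest hit is the point itself (distance 0).
    dist, _ = tree.query(X[sample_idx], k=2)
    w = dist[:, 1]
    return u, w
```

A k-d tree is used here only to keep the nearest-neighbour queries fast; a brute-force distance matrix would give the same distances for small data sets.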
Definition
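With the above notation, if the data is <math>d</math>-dimensional, then the Hopkins statistic is defined as:
:<math>H = \frac{\sum_{i=1}^m u_i^d}{\sum_{i=1}^m u_i^d + \sum_{i=1}^m w_i^d}</math>
where <math>u_i</math> and <math>w_i</math> are the nearest-neighbour distances of the <math>m</math> uniformly generated points and the <math>m</math> sampled data points, respectively.
Under the null hypothesis, this statistic has a <math>\mathrm{Beta}(m, m)</math> distribution.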
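As an illustration, the statistic and a tail probability under the null distribution can be computed from the two sets of nearest-neighbour distances. The sketch below assumes the statistic is the sum of the d-th powers of the uniform-point distances divided by the sum of the d-th powers of both sets of distances, and uses `scipy.stats.beta` for the Beta(m, m) reference distribution. The function name `hopkins_from_distances` is hypothetical, and reading the upper tail as evidence of clustering is an assumption tied to that numerator convention.

```python
import numpy as np
from scipy.stats import beta


def hopkins_from_distances(u, w, d):
    """Return (H, p_upper) given nearest-neighbour distances.

    u: distances from the m uniformly generated points to the data
    w: distances from the m sampled data points to their nearest other point
    d: dimension of the data
    """
    u = np.asarray(u, dtype=float)
    w = np.asarray(w, dtype=float)
    if u.ndim != 1 or u.shape != w.shape:
        raise ValueError("u and w must be 1-D arrays of equal length")
    m = u.size

    sum_u = np.sum(u ** d)
    sum_w = np.sum(w ** d)
    h = sum_u / (sum_u + sum_w)

    # Under the null hypothesis H follows Beta(m, m); the survival
    # function gives P(H >= h), which is small for clustered data
    # under this convention.
    p_upper = beta.sf(h, m, m)
    return h, p_upper
```

For example, `hopkins_from_distances([2, 2, 2], [1, 1, 1], d=2)` gives a statistic of 0.8, since the sums of squared distances are 12 and 3.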
External links
* http://www.sthda.com/english/wiki/assessing-clustering-tendency-a-vital-issue-unsupervised-machine-learning