Reservoir sampling is a family of
randomized algorithm
A randomized algorithm is an algorithm that employs a degree of randomness as part of its logic or procedure. The algorithm typically uses uniformly random bits as an auxiliary input to guide its behavior, in the hope of achieving good performa ...
s for choosing a
simple random sample
In statistics, a simple random sample (or SRS) is a subset of individuals (a sample) chosen from a larger set (a population) in which a subset of individuals are chosen randomly, all with the same probability. It is a process of selecting a sam ...
, without replacement, of items from a population of unknown size in a single pass over the items. The size of the population is not known to the algorithm and is typically too large for all items to fit into
main memory
Computer data storage is a technology consisting of computer components and recording media that are used to retain digital data. It is a core function and fundamental component of computers.
The central processing unit (CPU) of a comput ...
. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items. At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size over the part of the population seen so far.
Motivation
Suppose we see a sequence of items, one at a time. We want to keep ten items in memory, and we want them to be selected at random from the sequence. If we know the total number of items and can access the items arbitrarily, then the solution is easy: select 10 distinct indices between 1 and with equal probability, and keep the -th elements. The problem is that we do not always know the exact in advance.
Simple: Algorithm R
A simple and popular but slow algorithm, ''Algorithm R'', was created by Alan Waterman.
Initialize an array indexed from to , containing the first items of the input . This is the ''reservoir''.
For each new input , generate a random number uniformly in . If , then set , otherwise, discard .
Return after all inputs are processed.
This algorithm works by induction on
.
While conceptually simple and easy to understand, this algorithm needs to generate a random number for each item of the input, including the items that are discarded. The algorithm's
asymptotic running time is thus
. Generating this amount of randomness and the linear run time causes the algorithm to be unnecessarily slow if the input population is large.
This is ''Algorithm R,'' implemented as follows:
(* S has items to sample, R will contain the result *)
ReservoirSample(S ..n R ..k
// fill the reservoir array
for i := 1 to k
R := S
// replace elements with gradually decreasing probability
for i := k+1 to n
(* randomInteger(a, b) generates a uniform integer from the inclusive range *)
j := randomInteger(1, i)
if j <= k
R := S
Optimal: Algorithm L
If we generate
random numbers