Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n in a single pass over the items. The size of the population n is not known to the algorithm and is typically too large for all n items to fit into main memory. The population is revealed to the algorithm over time, and the algorithm cannot look back at previous items. At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size k over the part of the population seen so far.


Motivation

Suppose we see a sequence of items, one at a time. We want to keep ten items in memory, and we want them to be selected at random from the sequence. If we know the total number of items n and can access the items arbitrarily, then the solution is easy: select 10 distinct indices i between 1 and n with equal probability, and keep the i-th elements. The problem is that we do not always know the exact n in advance.


Simple: Algorithm R

A simple and popular but slow algorithm, ''Algorithm R'', was created by Alan Waterman.
Initialize an array R indexed from 1 to k, containing the first k items of the input x_1, ..., x_k. This is the ''reservoir''. For each new input x_i, generate a random number j uniformly in \{1, ..., i\}. If j \in \{1, ..., k\}, then set R[j] := x_i; otherwise, discard x_i. Return R after all inputs are processed.

This algorithm works by induction on i \geq k: after processing the first i inputs, each of them is in the reservoir with probability k/i. While conceptually simple and easy to understand, this algorithm needs to generate a random number for each item of the input, including the items that are discarded. The algorithm's asymptotic running time is thus O(n). Generating this amount of randomness and the linear run time cause the algorithm to be unnecessarily slow if the input population is large.

This is ''Algorithm R'', implemented as follows:

    (* S has items to sample, R will contain the result *)
    ReservoirSample(S[1..n], R[1..k])
      // fill the reservoir array
      for i := 1 to k
          R[i] := S[i]

      // replace elements with gradually decreasing probability
      for i := k+1 to n
          (* randomInteger(a, b) generates a uniform integer from the inclusive range [a, b] *)
          j := randomInteger(1, i)
          if j <= k
              R[j] := S[i]
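For concreteness, here is a minimal Python sketch of Algorithm R; the function name reservoir_sample_r and the use of Python's random module are our own choices, not part of the original presentation.

    import random

    def reservoir_sample_r(stream, k):
        """Algorithm R: uniform sample of k items from an iterable of unknown length."""
        reservoir = []
        for i, item in enumerate(stream, start=1):
            if i <= k:
                reservoir.append(item)        # fill the reservoir with the first k items
            else:
                j = random.randint(1, i)      # uniform integer in [1, i]
                if j <= k:
                    reservoir[j - 1] = item   # item i replaces a uniformly chosen slot
        return reservoir

For example, reservoir_sample_r(range(1000000), 10) returns 10 items chosen uniformly without replacement in a single pass.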


Optimal: Algorithm L

If we generate n random numbers u_1, ..., u_n \sim U[0, 1] independently, then the indices of the smallest k of them form a uniform sample of the k-subsets of \{1, ..., n\}. The process can be done without knowing n:

Keep the smallest k of u_1, ..., u_i that have been seen so far, as well as w_i, the index of the largest among them. For each new u_{i+1}, compare it with u_{w_i}. If u_{i+1} < u_{w_i}, then discard u_{w_i}, store u_{i+1}, and set w_{i+1} to be the index of the largest among the stored values. Otherwise, discard u_{i+1}, and set w_{i+1} = w_i.

Now couple this with the stream of inputs x_1, ..., x_n. Every time some u_i is accepted, store the corresponding x_i. Every time some u_i is discarded, discard the corresponding x_i. This algorithm still needs O(n) random numbers, thus taking O(n) time. But it can be simplified.

First simplification: it is unnecessary to test the new u_{i+1}, u_{i+2}, ... one by one, since the probability that the next acceptance happens at u_{i+l} is (1-u_{w_i})^{l-1} - (1-u_{w_i})^{l}, that is, the gap l until the next acceptance follows a geometric distribution.

Second simplification: it is unnecessary to remember the entire array of the smallest k of u_1, ..., u_i seen so far, but merely w, the largest among them. This is based on three observations:
* Every time some new x_{i+l} is selected to be entered into storage, a uniformly random entry in storage is discarded.
* u_{i+l} \sim U[0, w]
* w_{i+l} has the same distribution as \max\{u_1, ..., u_k\}, where all u_1, ..., u_k \sim U[0, w] independently. This can be sampled by first sampling u \sim U[0, 1], then taking w \cdot u^{1/k}.

This is ''Algorithm L'', which is implemented as follows:

    (* S has items to sample, R will contain the result *)
    ReservoirSample(S[1..n], R[1..k])
      // fill the reservoir array
      for i = 1 to k
          R[i] := S[i]

      (* random() generates a uniform (0,1) random number *)
      W := exp(log(random())/k)

      while i <= n
          i := i + floor(log(random())/log(1-W)) + 1
          if i <= n
              (* replace a random item of the reservoir with item i *)
              R[randomInteger(1,k)] := S[i]   // random index between 1 and k, inclusive
              W := W * exp(log(random())/k)

This algorithm computes three random numbers for each item that becomes part of the reservoir, and does not spend any time on items that do not. Its expected running time is thus O(k(1 + \log(n/k))), which is optimal. At the same time, it is simple to implement efficiently and does not depend on random deviates from exotic or hard-to-compute distributions.
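The following Python sketch mirrors this pseudocode; treating the input as an iterator and skipping items with itertools.islice is our own adaptation.

    import math
    import random
    from itertools import islice

    def reservoir_sample_l(stream, k):
        """Algorithm L: uniform k-sample from an iterator, skipping ahead geometrically."""
        it = iter(stream)
        reservoir = list(islice(it, k))   # fill with the first k items
        if len(reservoir) < k:
            return reservoir              # stream had fewer than k items
        w = math.exp(math.log(random.random()) / k)   # current threshold value
        while True:
            # geometric gap: number of items to skip before the next acceptance
            gap = int(math.log(random.random()) / math.log(1.0 - w))
            accepted = next(islice(it, gap, gap + 1), None)
            if accepted is None:
                return reservoir          # stream exhausted during the jump
            reservoir[random.randrange(k)] = accepted   # replace a uniform slot
            w *= math.exp(math.log(random.random()) / k)

Note that only the items that enter the reservoir consume random numbers; the skipped items are discarded without any per-item work beyond advancing the iterator.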


With random sort

If we associate with each item of the input a uniformly generated random number, the items with the largest (or, equivalently, smallest) associated values form a simple random sample. A simple reservoir sampling thus maintains the k items with the currently largest associated values in a priority queue.

    (* S is a stream of items to sample
       S.Current returns current item in stream
       S.Next advances stream to next position
       min-priority-queue supports:
         Count -> number of items in priority queue
         Minimum -> returns minimum key value of all items
         Extract-Min() -> Remove the item with minimum key
         Insert(key, Item) -> Adds item with specified key *)
    ReservoirSample(S[1..?])
      H := new min-priority-queue
      while S has data
          r := random()   // uniformly random between 0 and 1, exclusive
          if H.Count < k
              H.Insert(r, S.Current)
          else
              // keep k items with largest associated keys
              if r > H.Minimum
                  H.Extract-Min()
                  H.Insert(r, S.Current)
          S.Next
      return items in H

The expected running time of this algorithm is O(n + k \log k \log(n/k)), and it is relevant mainly because it can easily be extended to items with weights.
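A minimal Python version using the standard heapq module as the min-priority queue might look as follows; the tiebreak counter is our own addition, to avoid comparing items directly in the (rare) event of equal keys.

    import heapq
    import random
    from itertools import count

    def reservoir_sample_by_random_sort(stream, k):
        """Keep the k items with the largest uniformly random keys."""
        heap = []               # min-heap of (key, tiebreak, item)
        tiebreak = count()
        for item in stream:
            r = random.random()
            if len(heap) < k:
                heapq.heappush(heap, (r, next(tiebreak), item))
            elif r > heap[0][0]:
                heapq.heapreplace(heap, (r, next(tiebreak), item))  # evict the minimum key
        return [item for _, _, item in heap]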


Weighted random sampling

This method, also called sequential sampling, is incorrect in the sense that it does not allow one to obtain a priori fixed inclusion probabilities. Some applications require items' sampling probabilities to be according to weights associated with each item. For example, it might be required to sample queries in a search engine with weight equal to the number of times they were performed, so that the sample can be analyzed for overall impact on user experience. Let the weight of item i be w_i, and the sum of all weights be W. There are two ways to interpret weights assigned to each item in the set:
# In each round, the probability of every ''unselected'' item to be selected in that round is proportional to its weight relative to the weights of all unselected items. If X is the current sample, then the probability of an item i \notin X to be selected in the current round is \textstyle w_i/(W - \sum_{j \in X} w_j).
# The probability of each item to be included in the random sample is proportional to its relative weight, i.e., w_i / W. This interpretation is not always achievable; a simple counter-example is the case k = n, where every item must be included with probability 1 regardless of its weight.


Algorithm A-Res

The following algorithm, given by Efraimidis and Spirakis, uses interpretation 1:

    (* S is a stream of items to sample
       S.Current returns current item in stream
       S.Weight returns weight of current item in stream
       S.Next advances stream to next position
       The power operator is represented by ^
       min-priority-queue supports:
         Count -> number of items in priority queue
         Minimum() -> returns minimum key value of all items
         Extract-Min() -> Remove the item with minimum key
         Insert(key, Item) -> Adds item with specified key *)
    ReservoirSample(S[1..?])
      H := new min-priority-queue
      while S has data
          r := random() ^ (1/S.Weight)   // random() produces a uniformly random number in (0,1)
          if H.Count < k
              H.Insert(r, S.Current)
          else
              // keep k items with largest associated keys
              if r > H.Minimum
                  H.Extract-Min()
                  H.Insert(r, S.Current)
          S.Next
      return items in H

This algorithm is identical to the algorithm given in ''With random sort'' above, except for the generation of the items' keys. The algorithm is equivalent to assigning each item a key r^{1/w_i}, where r is the random number, and then selecting the k items with the largest keys. Equivalently, a more numerically stable formulation of this algorithm computes the keys as -\ln(r)/w_i and selects the k items with the ''smallest'' keys.
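A Python sketch using the numerically stable keys -ln(r)/w_i might look like this; since heapq is a min-heap, the keys are negated so that the largest (worst) key can be evicted. Names and the (item, weight) pair convention are our own.

    import heapq
    import math
    import random
    from itertools import count

    def a_res(weighted_stream, k):
        """A-Res: weighted reservoir sample of k items from (item, weight) pairs, weight > 0."""
        heap = []               # max-heap via negated keys: largest (worst) key on top
        tiebreak = count()
        for item, weight in weighted_stream:
            key = -math.log(random.random()) / weight   # smaller key is better
            if len(heap) < k:
                heapq.heappush(heap, (-key, next(tiebreak), item))
            elif key < -heap[0][0]:
                heapq.heapreplace(heap, (-key, next(tiebreak), item))  # evict worst key
        return [item for _, _, item in heap]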


Algorithm A-ExpJ

The following algorithm is a more efficient version of A-Res, also given by Efraimidis and Spirakis:

    (* S is a stream of items to sample
       S.Current returns current item in stream
       S.Weight returns weight of current item in stream
       S.Next advances stream to next position
       The power operator is represented by ^
       min-priority-queue supports:
         Count -> number of items in the priority queue
         Minimum -> minimum key of any item in the priority queue
         Extract-Min() -> Remove the item with minimum key
         Insert(Key, Item) -> Adds item with specified key *)
    ReservoirSampleWithJumps(S[1..?])
      H := new min-priority-queue
      while S has data and H.Count < k
          r := random() ^ (1/S.Weight)   // random() produces a uniformly random number in (0,1)
          H.Insert(r, S.Current)
          S.Next
      X := log(random()) / log(H.Minimum)   // this is the amount of weight that needs to be jumped over
      while S has data
          X := X - S.Weight
          if X <= 0
              t := H.Minimum ^ S.Weight
              r := random(t, 1) ^ (1/S.Weight)   // random(x, y) produces a uniformly random number in (x, y)
              H.Extract-Min()
              H.Insert(r, S.Current)
              X := log(random()) / log(H.Minimum)
          S.Next
      return items in H

This algorithm follows the same mathematical properties that are used in A-Res, but instead of calculating the key for each item and checking whether that item should be inserted or not, it calculates an exponential jump to the next item that will be inserted. This avoids having to create random variates for each item, which may be expensive. The number of random variates required is reduced from O(n) to O(k \log(n/k)) in expectation, where k is the reservoir size and n is the number of items in the stream.
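Here is a corresponding Python sketch, again with our own naming and the (item, weight) convention; it uses the r^(1/w) keys directly, as the pseudocode does.

    import heapq
    import math
    import random
    from itertools import count

    def a_expj(weighted_stream, k):
        """A-ExpJ: weighted reservoir sampling with exponential jumps."""
        heap = []               # min-heap of (key, tiebreak, item)
        tiebreak = count()
        it = iter(weighted_stream)
        for item, weight in it:
            key = random.random() ** (1.0 / weight)
            heapq.heappush(heap, (key, next(tiebreak), item))
            if len(heap) == k:
                break
        if len(heap) < k:
            return [item for _, _, item in heap]   # stream had fewer than k items
        x = math.log(random.random()) / math.log(heap[0][0])  # weight to jump over
        for item, weight in it:
            x -= weight
            if x <= 0:
                t = heap[0][0] ** weight
                r = random.uniform(t, 1.0) ** (1.0 / weight)  # new key exceeds current minimum
                heapq.heapreplace(heap, (r, next(tiebreak), item))
                x = math.log(random.random()) / math.log(heap[0][0])
        return [item for _, _, item in heap]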


Algorithm A-Chao

The following algorithm, given by M. T. Chao, uses interpretation 2:

    (* S has items to sample, R will contain the result
       S[i].Weight contains the weight for each item *)
    WeightedReservoir-Chao(S[1..n], R[1..k])
      WSum := 0
      // fill the reservoir array
      for i := 1 to k
          R[i] := S[i]
          WSum := WSum + S[i].Weight
      for i := k+1 to n
          WSum := WSum + S[i].Weight
          p := S[i].Weight / WSum   // probability for this item
          j := random()             // uniformly random between 0 and 1
          if j <= p                           // select item according to probability
              R[randomInteger(1,k)] := S[i]   // uniform selection in reservoir for replacement

For each item, its relative weight is calculated and used to randomly decide whether the item will be added into the reservoir. If the item is selected, then one of the existing items of the reservoir is uniformly selected and replaced with the new item. The trick here is that, if the probabilities of all items in the reservoir are already proportional to their weights, then by selecting uniformly which item to replace, the probabilities of all items remain proportional to their weights after the replacement.

Note that Chao doesn't specify how to sample the first ''k'' elements. He simply assumes we have some other way of picking them in proportion to their weight. Chao: "Assume that we have a sampling plan of fixed size with respect to ''S_k'' at time ''A''; such that its first-order inclusion probability of ''X_t'' is ''π(k; i)''".
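A direct Python transcription of this pseudocode might read as follows; here the first k items simply seed the reservoir, which (per the note above) is only one possible initialization.

    import random

    def a_chao(weighted_stream, k):
        """A-Chao: inclusion probability proportional to weight (interpretation 2)."""
        reservoir = []
        w_sum = 0.0
        for i, (item, weight) in enumerate(weighted_stream):
            w_sum += weight
            if i < k:
                reservoir.append(item)                     # seed with the first k items
            else:
                p = weight / w_sum                         # probability for this item
                if random.random() <= p:
                    reservoir[random.randrange(k)] = item  # uniform replacement
        return reservoir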


Algorithm A-Chao with Jumps

Similar to the other algorithms, it is possible to compute a random weight j and subtract the items' probability mass values, skipping them while j > 0, which reduces the number of random numbers that have to be generated; see the sketch after the pseudocode below.

    (* S has items to sample, R will contain the result
       S[i].Weight contains the weight for each item *)
    WeightedReservoir-Chao(S[1..n], R[1..k])
      WSum := 0
      // fill the reservoir array
      for i := 1 to k
          R[i] := S[i]
          WSum := WSum + S[i].Weight
      j := random()   // uniformly random between 0 and 1
      pNone := 1      // probability that no item has been selected so far
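As a hedged reconstruction of this jump variant in Python (the continuation after pNone is our own inference from the description above, not Chao's verbatim algorithm): item i is the next selected one with probability p_i * pNone, where pNone is the probability that nothing has been selected since the last jump, so draining a single uniform draw j against these masses samples the index of the next replacement.

    import random

    def a_chao_with_jumps(weighted_stream, k):
        """Jump version of A-Chao: one uniform draw per selection instead of per item."""
        reservoir = []
        w_sum = 0.0
        j = random.random()
        p_none = 1.0   # probability that no item was selected since the last jump
        for i, (item, weight) in enumerate(weighted_stream):
            w_sum += weight
            if i < k:
                reservoir.append(item)        # seed with the first k items
                continue
            p = weight / w_sum                # per-item selection probability, as in A-Chao
            j -= p * p_none
            p_none *= 1.0 - p
            if j <= 0:
                reservoir[random.randrange(k)] = item   # uniform replacement
                j = random.random()                     # start the next jump
                p_none = 1.0
        return reservoir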