Reservoir sampling

Reservoir sampling is a family of randomized algorithms for choosing a simple random sample, without replacement, of k items from a population of unknown size n. The size n of the population is not known to the algorithm, and the population is typically too large for all n items to fit into main memory. The items are revealed to the algorithm one at a time, and the algorithm cannot look back at previous items. At any point, the current state of the algorithm must permit extraction of a simple random sample without replacement of size k over the part of the population seen so far.

Motivation

Suppose we see the items of a collection one at a time. We want to keep 10 of them, with every item equally likely to be chosen. If we know the total number of items n and can access the items arbitrarily, then the solution is easy: select 10 distinct indices i between 1 and n with equal probability, and keep the i-th elements. The problem is that we do not always know the exact n in advance.

Simple algorithm

A simple and popular but slow algorithm, commonly known as Algorithm R, is due to Alan Waterman.[1]

The algorithm works by maintaining a reservoir of size k, which initially contains the first k items of the input. It then iterates over the remaining items until the input is exhausted. Using one-based array indexing, let $i > k$ be the index of the item currently under consideration. The algorithm then generates a random number j between (and including) 1 and i. If j is at most k, then the item is selected and replaces whichever item currently occupies the j-th position in the reservoir. Otherwise, the item is discarded. In effect, for all i, the i-th element of the input is chosen to be included in the reservoir with probability $k/i$. Similarly, at each iteration the j-th element of the reservoir array is chosen to be replaced with probability $1/k \times k/i = 1/i$. It can be shown that when the algorithm has finished executing, each item in the input population has equal probability (i.e., $k/n$) of being chosen for the reservoir: $k/i \times (1 - 1/(i+1)) \times (1 - 1/(i+2)) \times (1 - 1/(i+3)) \times \cdots \times (1 - 1/n) = k/n$.
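The product above telescopes, which may be easier to see when each factor is written as a fraction. The following is a sketch of that step for an item at position i > k. An item among the first k starts in the reservoir with probability 1, and the same telescoping over steps k+1 through n gives k/n for it as well.

```latex
\begin{aligned}
\Pr[\text{item } i \text{ is in the final reservoir}]
  &= \frac{k}{i} \prod_{j=i+1}^{n} \left(1 - \frac{1}{j}\right)
   = \frac{k}{i} \prod_{j=i+1}^{n} \frac{j-1}{j} \\
  &= \frac{k}{i} \cdot \frac{i}{i+1} \cdot \frac{i+1}{i+2} \cdots \frac{n-1}{n}
   = \frac{k}{n}.
\end{aligned}
```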

(* S has items to sample, R will contain the result *)
ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i := 1 to k
      R[i] := S[i]
  // replace elements with gradually decreasing probability
  for i := k+1 to n
      (* randomInteger(a, b) generates a uniform integer from the inclusive range {a, ..., b} *)
      j := randomInteger(1, i)
      if j <= k
          R[j] := S[i]

While conceptually simple and easy to understand, this algorithm needs to generate a random number for each item of the input, including the items that are discarded. Its asymptotic running time is thus O(n). This causes the algorithm to be unnecessarily slow if the input population is large.
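As an illustration, Algorithm R can be sketched in Python roughly as follows. The function name `reservoir_sample_r` and the optional `rng` parameter are assumptions made for this example and do not come from the original description. The input is any iterable, so its length need not be known in advance.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample_r(stream: Iterable[T], k: int,
                       rng: Optional[random.Random] = None) -> List[T]:
    """Sketch of Algorithm R: uniform sample of k items without replacement.

    The name and the rng parameter are hypothetical, chosen for this example.
    """
    if rng is None:
        rng = random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream, start=1):  # i is the one-based index
        if i <= k:
            reservoir.append(item)              # fill with the first k items
        else:
            j = rng.randint(1, i)               # uniform integer in {1, ..., i}
            if j <= k:
                reservoir[j - 1] = item         # replace the j-th slot
    return reservoir


if __name__ == "__main__":
    # Draw 5 values from a generator whose length is not known up front.
    print(reservoir_sample_r((x * x for x in range(10_000)), 5))
```

If the stream holds fewer than k items, this sketch simply returns all of them.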

An optimal algorithm

Algorithm L[2] improves upon this algorithm by computing how many items are discarded before the next item enters the reservoir. The key observation is that this number follows a geometric distribution and can therefore be computed in constant time.

(* S has items to sample, R will contain the result *)
ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]
  (* at this point i = k, the index of the last item placed in the reservoir *)
  (* random() generates a uniform (0,1) random number *)
  W := exp(log(random())/k)
  while i <= n
      i := i + floor(log(random())/log(1-W)) + 1
      if i <= n
          (* replace a random item of the reservoir with item i *)
          R[randomInteger(1,k)] := S[i]  // random index between 1 and k, inclusive
          W := W * exp(log(random())/k)

This algorithm computes three random numbers for each item that becomes part of the reservoir, and does not spend any time on items that do not. Its expected running time is thus O(k(1+log(n/k))),[2] which is optimal.[1] At the same time, it is simple to implement efficiently and does not depend on random deviates from exotic or hard-to-compute distributions.
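A minimal Python sketch of Algorithm L is given below. It assumes the input is an indexable sequence, so that skipped items cost nothing. The names `reservoir_sample_l` and `_uniform_open` are hypothetical, and the small guard for W rounding to exactly 1.0 is an addition for floating-point safety rather than part of the algorithm as described.

```python
import math
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def _uniform_open(rng: random.Random) -> float:
    """Uniform variate in the open interval (0, 1) (hypothetical helper)."""
    u = rng.random()      # random() returns values in [0, 1)
    while u == 0.0:       # reject an exact zero so that log(u) is defined
        u = rng.random()
    return u


def reservoir_sample_l(items: Sequence[T], k: int,
                       rng: Optional[random.Random] = None) -> List[T]:
    """Sketch of Algorithm L on an indexable sequence."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    n = len(items)
    reservoir = list(items[:k])          # fill with the first k items
    if n <= k:
        return reservoir
    w = math.exp(math.log(_uniform_open(rng)) / k)
    i = k - 1                            # zero-based index of the last item examined
    while True:
        # log(1 - W); treated as -infinity (a skip of zero) if W rounded to 1.0
        log_one_minus_w = math.log1p(-w) if w < 1.0 else -math.inf
        # the geometric skip: number of items discarded before the next accepted one
        i += math.floor(math.log(_uniform_open(rng)) / log_one_minus_w) + 1
        if i >= n:
            break
        reservoir[rng.randrange(k)] = items[i]   # replace a uniformly chosen slot
        w *= math.exp(math.log(_uniform_open(rng)) / k)
    return reservoir


if __name__ == "__main__":
    print(reservoir_sample_l(range(1_000_000), 5))
```

Because the sketch indexes directly into `items`, the loop body runs only for accepted items. For a true one-pass stream, the skipped items would still have to be read and discarded, although no random numbers would be drawn for them.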

With random sort

If we associate with each item of the input a uniformly generated random number, the k items with the largest (or, equivalently, smallest) associated values form a simple random sample.[3] A simple reservoir-sampling algorithm thus maintains the k items with the currently largest associated values in a priority queue.

(*
  S is a stream of items to sample
  S.Current returns current item in stream
  S.Next advances stream to next position
  min-priority-queue supports:
    Count -> number of items in priority queue
    Minimum -> returns minimum key value of all items
    Extract-Min() -> Remove the item with minimum key
    Insert(key, Item) -> Adds item with specified key
 *)
ReservoirSample(S[1..?])
  H := new min-priority-queue
  while S has data
    r := random()   // uniformly random between 0 and 1, exclusive
    if H.Count < k
      H.Insert(r, S.Current)
    else
      // keep k items with largest associated keys
      if r > H.Minimum
        H.Extract-Min()
        H.Insert(r, S.Current)
    S.Next
  return items in H

The expected running time of this algorithm is $O(n + k \log k \log(n/k))$ and it is relevant mainly because it can easily be extended to items with weights.
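The random-sort approach maps naturally onto Python's `heapq` module, which provides a binary min-heap over a list. The sketch below is one possible rendering. The function name `reservoir_sample_random_sort` is an assumption, and the arrival index stored in each heap entry is only a tie-breaker so that the items themselves are never compared.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def reservoir_sample_random_sort(stream: Iterable[T], k: int,
                                 rng: Optional[random.Random] = None) -> List[T]:
    """Keep the k items with the largest uniform random keys (sketch)."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    heap: List[Tuple[float, int, T]] = []   # min-heap of (key, arrival index, item)
    for index, item in enumerate(stream):
        key = rng.random()                  # uniform key for this item
        if len(heap) < k:
            heapq.heappush(heap, (key, index, item))
        elif key > heap[0][0]:              # beats the smallest key currently kept
            heapq.heapreplace(heap, (key, index, item))  # pop the minimum, push the new entry
    return [item for _, _, item in heap]


if __name__ == "__main__":
    print(reservoir_sample_random_sort("abcdefghijklmnopqrstuvwxyz", 4))
```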

Weighted random sampling

Some applications require each item's sampling probability to depend on a weight associated with that item. For example, queries to a search engine might be sampled with weight equal to the number of times each query was performed, so that the sample can be analyzed for overall impact on user experience. Let the weight of item i be $w_i$, and the sum of all weights be $W$. There are two ways to interpret weights assigned to each item in the set:[4]

1. In each round, the probability of every unselected item to be selected in that round is proportional to its weight relative to the weights of all unselected items. If X is the current sample, then the probability of an item $i \notin X$ to be selected in the current round is $w_i / (W - \sum_{j \in X} w_j)$.
2. The probability of each item to be included in the random sample is proportional to its relative weight, i.e., $w_i / W$. Note that this interpretation might not be achievable in some cases, e.g., $k = n$.

Algorithm A-Res

The following algorithm, given by Efraimidis and Spirakis, uses interpretation 1:[5]

(*
  S is a stream of items to sample
  S.Current returns current item in stream
  S.Weight  returns weight of current item in stream
  S.Next advances stream to next position
  The power operator is represented by ^
  min-priority-queue supports:
    Count -> number of items in priority queue
    Minimum() -> returns minimum key value of all items
    Extract-Min() -> Remove the item with minimum key
    Insert(key, Item) -> Adds item with specified key
 *)
ReservoirSample(S[1..?])
  H := new min-priority-queue
  while S has data
    r := random() ^ (1/S.Weight)   // random() produces a uniformly random number in (0,1)
    if H.Count < k
      H.Insert(r, S.Current)
    else
      // keep k items with largest associated keys
      if r > H.Minimum
        H.Extract-Min()
        H.Insert(r, S.Current)
    S.Next
  return items in H

This algorithm is identical to the algorithm given in the section "With random sort" except for the generation of the items' keys. The algorithm is equivalent to assigning each item a key $r^{1/w_i}$, where r is the random number, and then selecting the k items with the largest keys. Equivalently, a more numerically stable formulation of this algorithm computes the keys as $-\ln(r)/w_i$ and selects the k items with the smallest keys.[6]
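A-Res differs from the random-sort sampler only in how the key is computed, which the following Python sketch tries to make visible. It assumes the stream yields `(item, weight)` pairs with strictly positive weights. The name `weighted_sample_a_res` and that pair representation are choices made for this example.

```python
import heapq
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def weighted_sample_a_res(stream: Iterable[Tuple[T, float]], k: int,
                          rng: Optional[random.Random] = None) -> List[T]:
    """Sketch of A-Res: keep the k items with the largest keys u ** (1 / weight)."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    heap: List[Tuple[float, int, T]] = []   # min-heap of (key, arrival index, item)
    for index, (item, weight) in enumerate(stream):
        # Assumption: weight > 0. Heavier items tend to receive keys closer to 1.
        key = rng.random() ** (1.0 / weight)
        if len(heap) < k:
            heapq.heappush(heap, (key, index, item))
        elif key > heap[0][0]:
            heapq.heapreplace(heap, (key, index, item))
    return [item for _, _, item in heap]


if __name__ == "__main__":
    queries = [("cat videos", 50.0), ("python heapq", 5.0), ("rare query", 1.0)]
    print(weighted_sample_a_res(queries, 2))
```

With k = 1 the sampler returns item i with probability $w_i/W$, which gives a quick sanity check for an implementation.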

Algorithm A-ExpJ

The following algorithm is a more efficient version of A-Res, also given by Efraimidis and Spirakis:[5]

(*
  S is a stream of items to sample
  S.Current returns current item in stream
  S.Weight  returns weight of current item in stream
  S.Next advances stream to next position
  The power operator is represented by ^
  min-priority-queue supports:
    Count -> number of items in the priority queue
    Minimum -> minimum key of any item in the priority queue
    Extract-Min() -> Remove the item with minimum key
    Insert(Key, Item) -> Adds item with specified key
 *)
ReservoirSampleWithJumps(S[1..?])
  H := new min-priority-queue
  while S has data and H.Count < k
    r := random() ^ (1/S.Weight)   // random() produces a uniformly random number in (0,1)
    H.Insert(r, S.Current)
    S.Next
  X := log(random()) / log(H.Minimum) // this is the amount of weight that needs to be jumped over
  while S has data
    X := X - S.Weight
    if X <= 0
      t := H.Minimum ^ S.Weight
      r := random(t, 1) ^ (1/S.Weight) // random(x, y) produces a uniformly random number in (x, y)
      H.Extract-Min()
      H.Insert(r, S.Current)
      X := log(random()) / log(H.Minimum)
    S.Next
  return items in H

This algorithm follows the same mathematical properties that are used in A-Res, but instead of calculating the key for each item and checking whether that item should be inserted or not, it calculates an exponential jump to the next item which will be inserted. This avoids having to create random variates for each item, which may be expensive. The number of random variates required is reduced from $O(n)$ to $O(k \log(n/k))$ in expectation, where k is the reservoir size, and n is the number of items in the stream.[5]
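The jump logic can be sketched in Python as below, again over a stream of `(item, weight)` pairs. The names `weighted_sample_a_expj` and `_uniform_open` are hypothetical. The sketch assumes positive weights of moderate size, so that every key stays strictly inside (0, 1), and it makes no attempt to guard against floating-point extremes.

```python
import heapq
import math
import random
from itertools import islice
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def _uniform_open(rng: random.Random) -> float:
    """Uniform variate in the open interval (0, 1) (hypothetical helper)."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def weighted_sample_a_expj(stream: Iterable[Tuple[T, float]], k: int,
                           rng: Optional[random.Random] = None) -> List[T]:
    """Sketch of A-ExpJ: A-Res with exponential jumps over skipped weight."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    it = iter(stream)
    heap: List[Tuple[float, int, T]] = []   # min-heap of (key, arrival index, item)
    for index, (item, weight) in enumerate(islice(it, k)):
        heapq.heappush(heap, (_uniform_open(rng) ** (1.0 / weight), index, item))
    if len(heap) < k:
        return [item for _, _, item in heap]   # stream had fewer than k items
    # Amount of weight to jump over before the next replacement.
    x = math.log(_uniform_open(rng)) / math.log(heap[0][0])
    for index, (item, weight) in enumerate(it, start=k):
        x -= weight
        if x <= 0:
            t = heap[0][0] ** weight
            key = rng.uniform(t, 1.0) ** (1.0 / weight)   # key conditioned to beat the minimum
            heapq.heapreplace(heap, (key, index, item))
            x = math.log(_uniform_open(rng)) / math.log(heap[0][0])
    return [item for _, _, item in heap]


if __name__ == "__main__":
    stream = ((f"item{i}", 1.0 + (i % 3)) for i in range(100_000))
    print(weighted_sample_a_expj(stream, 3))
```

Between replacements the loop only subtracts weights, so random numbers are drawn only when an item actually enters the reservoir.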

Algorithm A-Chao

The following algorithm, given by M. T. Chao, uses interpretation 2:[7]

(*
  S has items to sample, R will contain the result
  S[i].Weight contains weight for each item
 *)
WeightedReservoir-Chao(S[1..n], R[1..k])
  WSum := 0
  // fill the reservoir array
  for i := 1 to k
      R[i] := S[i]
      WSum := WSum + S[i].Weight
  for i := k+1 to n
      WSum := WSum + S[i].Weight
      p := k * S[i].Weight / WSum // probability for this item
      j := random();          // uniformly random between 0 and 1
      if j <= p               // select item according to probability
          R[randomInteger(1,k)] := S[i]  //uniform selection in reservoir for replacement

For each item, its relative weight is calculated and used to randomly decide if the item will be added into the reservoir. If the item is selected, then one of the existing items of the reservoir is uniformly selected and replaced with the new item. The trick here is that, if the probabilities of all items in the reservoir are already proportional to their weights, then by selecting uniformly which item to replace, the probabilities of all items remain proportional to their weight after the replacement.
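The procedure just described might look as follows in Python. This is a direct transcription of the pseudocode rather than a complete treatment: it assumes positive weights and that no single item is heavy enough to push `p` above 1, a case the pseudocode does not handle specially. The name `weighted_reservoir_chao` and the `(item, weight)` pair representation are assumptions of this sketch.

```python
import random
from typing import Iterable, List, Optional, Tuple, TypeVar

T = TypeVar("T")


def weighted_reservoir_chao(stream: Iterable[Tuple[T, float]], k: int,
                            rng: Optional[random.Random] = None) -> List[T]:
    """Sketch of A-Chao over a stream of (item, weight) pairs."""
    if rng is None:
        rng = random.Random()
    if k <= 0:
        return []
    reservoir: List[T] = []
    weight_sum = 0.0
    for item, weight in stream:
        weight_sum += weight
        if len(reservoir) < k:
            reservoir.append(item)            # fill the reservoir with the first k items
            continue
        p = k * weight / weight_sum           # inclusion probability for this item
        if rng.random() <= p:
            reservoir[rng.randrange(k)] = item  # replace a uniformly chosen slot
    return reservoir


if __name__ == "__main__":
    data = [("a", 1.0), ("b", 2.0), ("c", 3.0), ("d", 4.0)]
    print(weighted_reservoir_chao(data, 2))
```

For k = 1 the inclusion probability of item i works out to exactly $w_i/W$ regardless of arrival order, which is easy to confirm by hand on a three-item stream.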

Relation to Fisher–Yates shuffle

Suppose one wanted to draw k random cards from a deck of cards. A natural approach would be to shuffle the deck and then take the top k cards. In the general case, the shuffle also needs to work even if the number of cards in the deck is not known in advance, a condition which is satisfied by the inside-out version of the Fisher–Yates shuffle:[8]

(* S has the input, R will contain the output permutation *)
Shuffle(S[1..n], R[1..n])
  R[1] := S[1]
  for i from 2 to n do
      j := randomInteger(1, i)  // inclusive range
      R[i] := R[j]
      R[j] := S[i]

Note that although the rest of the cards are shuffled, only the first k are important in the present context. Therefore, the array R need only track the cards in the first k positions while performing the shuffle, reducing the amount of memory needed. Truncating R to length k, the algorithm is modified accordingly:

(* S has items to sample, R will contain the result *)
ReservoirSample(S[1..n], R[1..k])
  R[1] := S[1]
  for i from 2 to k do
      j := randomInteger(1, i)  // inclusive range
      R[i] := R[j]
      R[j] := S[i]
  for i from k + 1 to n do
      j := randomInteger(1, i)  // inclusive range
      if (j <= k)
          R[j] := S[i]

Since the order of the first k cards is immaterial, the first loop can be removed and R can be initialized to be the first k items of the input. This yields Algorithm R.
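The relationship can be checked concretely. In the Python sketch below, `inside_out_shuffle` and `truncated_shuffle_sample` are hypothetical names for the full shuffle and its truncated variant. Both draw the same sequence of random integers, so with identical seeds the truncated version should reproduce the first k positions of the full shuffle.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def inside_out_shuffle(items: Iterable[T], rng: random.Random) -> List[T]:
    """Inside-out Fisher-Yates shuffle; the input length need not be known."""
    result: List[T] = []
    for i, item in enumerate(items):      # zero-based index
        j = rng.randint(0, i)             # uniform in {0, ..., i}
        if j == i:
            result.append(item)
        else:
            result.append(result[j])      # R[i] := R[j]
            result[j] = item              # R[j] := S[i]
    return result


def truncated_shuffle_sample(items: Iterable[T], k: int,
                             rng: random.Random) -> List[T]:
    """Same shuffle, tracking only the first k positions."""
    reservoir: List[T] = []
    for i, item in enumerate(items):
        j = rng.randint(0, i)
        if i < k:                         # still inside the tracked prefix
            if j == i:
                reservoir.append(item)
            else:
                reservoir.append(reservoir[j])
                reservoir[j] = item
        elif j < k:                       # only writes into the prefix matter
            reservoir[j] = item
    return reservoir


if __name__ == "__main__":
    deck = list(range(52))
    full = inside_out_shuffle(deck, random.Random(123))
    top = truncated_shuffle_sample(deck, 5, random.Random(123))
    print(top == full[:5])
```

The truncated function needs memory for only k items, while the full shuffle stores all n.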

Statistical properties

Probabilities of selection of the reservoir methods are discussed in Chao (1982)[7] and Tillé (2006).[9] While the first-order selection probabilities are equal to k/n (or, in case of Chao's procedure, to an arbitrary set of unequal probabilities), the second order selection probabilities depend on the order in which the records are sorted in the original reservoir. The problem is overcome by the cube sampling method of Deville and Tillé (2004).[10]

Limitations

Reservoir sampling makes the assumption that the desired sample fits into main memory, often implying that k is a constant independent of n. In applications where we would like to select a large subset of the input list (say a third, i.e. k=n/3), other methods need to be adopted. Distributed implementations for this problem have been proposed.[11]

References

[1] Vitter, Jeffrey S. (1 March 1985). "Random sampling with a reservoir". ACM Transactions on Mathematical Software. 11 (1): 37–57. CiteSeerX 10.1.1.138.784. doi:10.1145/3147.3165. S2CID 17881708.
[2] Li, Kim-Hung (4 December 1994). "Reservoir-Sampling Algorithms of Time Complexity O(n(1+log(N/n)))". ACM Transactions on Mathematical Software. 20 (4): 481–493. doi:10.1145/198429.198435. S2CID 15721242.
[3] Fan, C.; Muller, M.E.; Rezucha, I. (1962). "Development of sampling plans by using sequential (item by item) selection techniques and digital computers". Journal of the American Statistical Association. 57 (298): 387–402. doi:10.1080/01621459.1962.10480667. JSTOR 2281647.
[4] Efraimidis, Pavlos S. (2015). "Weighted Random Sampling over Data Streams". Algorithms, Probability, Networks, and Games. Lecture Notes in Computer Science. 9295: 183–195. arXiv:1012.0256. doi:10.1007/978-3-319-24024-4_12. ISBN 978-3-319-24023-7. S2CID 2008731.
[5] Efraimidis, Pavlos S.; Spirakis, Paul G. (2006-03-16). "Weighted random sampling with a reservoir". Information Processing Letters. 97 (5): 181–185. doi:10.1016/j.ipl.2005.11.003.
[6] Arratia, Richard (2002). Bela Bollobas (ed.). "On the amount of dependence in the prime factorization of a uniform random integer". Contemporary Combinatorics. 10: 29–91. CiteSeerX 10.1.1.745.3975. ISBN 978-3-642-07660-2.
[7] Chao, M. T. (1982). "A general purpose unequal probability sampling plan". Biometrika. 69 (3): 653–656. doi:10.1093/biomet/69.3.653.
[8] National Research Council (2013). Frontiers in Massive Data Analysis. The National Academies Press. p. 121. ISBN 978-0-309-28781-4.
[9] Tillé, Yves (2006). Sampling Algorithms. Springer. ISBN 978-0-387-30814-2.
[10] Deville, Jean-Claude; Tillé, Yves (2004). "Efficient balanced sampling: The cube method". Biometrika. 91 (4): 893–912. doi:10.1093/biomet/91.4.893.
[11] Reservoir Sampling in MapReduce
