This question led to a new R package: wrswoR
R's default sampling without replacement using sample.int seems to require quadratic run time, which is slow for large sample sizes.
Let me throw in my own implementation of a faster approach based on rejection sampling with replacement. The idea is this (a minimal code sketch follows the list):

1. Generate a sample with replacement that is "somewhat" larger than the requested size
2. Throw away the duplicate values
3. If not enough values have been drawn, call the same procedure recursively with adjusted n, size and prob parameters
4. Remap the returned indexes to the original indexes
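Here is a self-contained sketch of these four steps in plain R. It is not the package code, and sample_int_rej_sketch is an illustrative name:

    ## Minimal sketch of the rejection approach; not the package implementation.
    sample_int_rej_sketch <- function(n, size, prob) {
      stopifnot(size <= n, length(prob) == n)
      ## Ballpark number of draws: n * (H_n - H_{n - size}) via natural logs;
      ## the +1 avoids log(0) when size == n. Capped at twice the population.
      draws <- min(max(ceiling(n * (log(n) - log(n - size + 1))), size), 2 * n)
      ## Oversample with replacement, keep the first occurrence of each index.
      candidates <- unique(sample.int(n, draws, replace = TRUE, prob = prob))
      if (length(candidates) >= size) {
        return(candidates[seq_len(size)])
      }
      ## Too few distinct values: recurse on the not-yet-drawn indexes with
      ## their weights, then remap the result back to the original indexes.
      remaining <- setdiff(seq_len(n), candidates)
      extra <- sample_int_rej_sketch(length(remaining),
                                     size - length(candidates),
                                     prob[remaining])
      c(candidates, remaining[extra])
    }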
How big a sample do we need to draw? Assuming a uniform distribution, the expected number of trials needed to see size unique values out of n total values is n * (H_n - H_{n - size}), a difference of two harmonic numbers scaled by the population size. The first few harmonic numbers are tabulated; otherwise, an approximation using the natural logarithm is used. (This is only a ballpark figure, no need to be too precise here.) For a non-uniform distribution, the expected number of items to be drawn can only be larger, so we won't be drawing too many samples. In addition, the number of samples drawn is limited by twice the population size: I assume that a few recursive calls are faster than sampling up to O(n ln n) items.
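As an illustration, the helper below (expected_draws is not part of the package) computes this estimate; for example, seeing 100,000 unique values out of 1,000,000 is expected to take only about 5% more draws than the sample size:

    ## Estimated number of draws with replacement to see `size` unique values
    ## out of `n`, assuming uniform weights: n * (H_n - H_{n - size}), with
    ## H_k approximated by log(k) plus the Euler-Mascheroni constant.
    expected_draws <- function(n, size) {
      H <- function(k) ifelse(k == 0, 0, log(k) + 0.57721566)
      n * (H(n) - H(n - size))
    }
    expected_draws(1e6, 1e5)  # ~105361 draws for a sample of size 100000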
The code is available in the R package wrswoR, in the sample.int.rej routine defined in sample_int_rej.R. Install it with:
    library(devtools)
    install_github('muelleki/wrswoR')
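A quick usage sketch, assuming the routine is exported as sample_int_rej (as the file name suggests) with arguments mirroring sample.int:

    library(wrswoR)
    set.seed(42)
    prob <- runif(10000)                     # non-uniform weights
    idx <- sample_int_rej(10000, 100, prob)  # 100 distinct indexes out of 10000
    stopifnot(anyDuplicated(idx) == 0)       # sampling is without replacement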
It seems to work "fast enough"; however, no formal runtime tests have been carried out yet. Also, the package has been tested on Ubuntu only. I appreciate your feedback.