This question led to a new R package: wrswoR
R's default sampling without replacement using sample.int seems to require quadratic run time, which is slow for large sample sizes.
Let me throw in my own implementation of a faster approach based on rejection sampling with replacement. The idea is this (a minimal code sketch follows the list):

1. Generate a sample with replacement that is "somewhat" larger than the requested size
2. Throw away the duplicate values
3. If not enough values have been drawn, call the same procedure recursively with adjusted n, size and prob parameters
4. Remap the returned indexes to the original indexes
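Here is a self-contained sketch of these four steps in plain R. It is not the package code, and sample_int_rej_sketch is an illustrative name:

    ## Minimal sketch of the rejection approach; not the package implementation.
    sample_int_rej_sketch <- function(n, size, prob) {
      stopifnot(size <= n, length(prob) == n)
      ## Ballpark number of draws: n * (H_n - H_{n - size}) via natural logs;
      ## the +1 avoids log(0) when size == n. Capped at twice the population.
      draws <- min(max(ceiling(n * (log(n) - log(n - size + 1))), size), 2 * n)
      ## Oversample with replacement, keep the first occurrence of each index.
      candidates <- unique(sample.int(n, draws, replace = TRUE, prob = prob))
      if (length(candidates) >= size) {
        return(candidates[seq_len(size)])
      }
      ## Too few distinct values: recurse on the not-yet-drawn indexes with
      ## their weights, then remap the result back to the original indexes.
      remaining <- setdiff(seq_len(n), candidates)
      extra <- sample_int_rej_sketch(length(remaining),
                                     size - length(candidates),
                                     prob[remaining])
      c(candidates, remaining[extra])
    }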
How big a sample do we need to draw? Assuming a uniform distribution, the expected number of trials needed to see size unique values out of n total values is n * (H_n - H_{n - size}), a difference of two harmonic numbers scaled by the population size. The first few harmonic numbers are tabulated; otherwise, an approximation using the natural logarithm is used. (This is only a ballpark figure, no need to be too precise here.) For a non-uniform distribution, the expected number of items to be drawn can only be larger, so we won't be drawing too many samples. In addition, the number of samples drawn is limited by twice the population size: I assume that a few recursive calls are faster than sampling up to O(n ln n) items.
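As an illustration, the helper below (expected_draws is not part of the package) computes this estimate; for example, seeing 100,000 unique values out of 1,000,000 is expected to take only about 5% more draws than the sample size:

    ## Estimated number of draws with replacement to see `size` unique values
    ## out of `n`, assuming uniform weights: n * (H_n - H_{n - size}), with
    ## H_k approximated by log(k) plus the Euler-Mascheroni constant.
    expected_draws <- function(n, size) {
      H <- function(k) ifelse(k == 0, 0, log(k) + 0.57721566)
      n * (H(n) - H(n - size))
    }
    expected_draws(1e6, 1e5)  # ~105361 draws for a sample of size 100000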
The code is available in the R package wrswoR, in the sample.int.rej routine defined in sample_int_rej.R. Install it with:
    library(devtools)
    install_github('muelleki/wrswoR')
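A quick usage sketch, assuming the routine is exported as sample_int_rej (as the file name suggests) with arguments mirroring sample.int:

    library(wrswoR)
    set.seed(42)
    prob <- runif(10000)                     # non-uniform weights
    idx <- sample_int_rej(10000, 100, prob)  # 100 distinct indexes out of 10000
    stopifnot(anyDuplicated(idx) == 0)       # sampling is without replacement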
It seems to work "fast enough"; however, no formal runtime tests have been carried out yet. Also, the package has been tested on Ubuntu only. I appreciate your feedback.