Question
I feel like this one should be easy, but after numerous searches and attempts I can't figure out an answer. Basically I have a very large number of items that I want to sample in a random order without replacement; in this case they are cells in a 2D array. The solution I would use for a smaller array doesn't translate, because it requires shuffling an in-memory array. If the number I had to sample were small, I could also just randomly sample items and keep a list of the values I'd tried. Unfortunately I will often have to sample a very large proportion of all the cells, sometimes all of them.
What I'd like to create is an iterator, using some combination of itertools, numpy and/or random, that yields the next random cell (x and y indices). Another possible solution would be an iterator that yields the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to a cell location. Neither seems easily accomplished.
Thanks for any suggestions!
Here's my current solution.
import numpy as np
import itertools as itr
import random as rdm

# works great
x_count = 10
y_count = 5

# good luck!
#x_count = 10000
#y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)
cell_indices = itr.product(x_indices, y_indices)

# materializing and shuffling every (x, y) pair is what blows up
# the memory for large grids
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)

for i in range(25):
    print list_cell_indices[i]
So based on the current responses, and my attempt to translate the Perl (a language I know nothing about), my understanding is that the best I can do is the following:
import random as rdm

x_count = 10000
y_count = 5000
sample_count = 10000
keep_probability = 0.01

tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    # randint is inclusive on both ends, so subtract 1 to stay in range
    x = rdm.randint(0, x_count - 1)
    y = rdm.randint(0, y_count - 1)
    if (x, y) in tried_cells:
        pass
    else:
        tried_cells.add((x, y))
        keep = rdm.random() < keep_probability
        if keep:
            kept_cells.add((x, y))
            print "worked"
In most cases the processing time and memory used aren't that bad. Maybe I could check keep_probability against sample_count up front and raise an error for the difficult cases; a sketch of such a guard follows.
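A guard along those lines might look like this (a minimal sketch; the function name and the factor-of-two margin are my own arbitrary choices):

def check_feasibility(x_count, y_count, sample_count, keep_probability):
    # rejection sampling bogs down once the requested sample gets close
    # to the expected number of cells that will pass the keep test
    expected_keepers = x_count * y_count * keep_probability
    if sample_count * 2 > expected_keepers:
        raise ValueError("sample_count is too close to the expected "
                         "number of keepable cells; enumerate the grid "
                         "and sample the passers instead")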
Answer 1:
How about this approach: first I create the x*y array and reshape it to 2-D. Then, since each cell can be uniquely identified by a single integer, I draw a sample from the range 0 to (x*y).
import numpy

x_count = 10000
y_count = 20000

# one big table where each cell's value is its unique flat index
large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
print large_table

def get_random_item(sample_size):
    from random import sample
    # sample() draws unique flat indices; divmod turns each one into
    # a unique (x, y) pair, so no cell can come up twice
    for i in sample(xrange(y_count * x_count), sample_size):
        y, x = divmod(i, y_count)
        yield (x, y)

for x, y in get_random_item(10):
    print '%12i x: %5i y: %5i' % (large_table[x][y], x, y)
Which returns (first, to simulate the existing 2-D array you created via product):
[[ 0 1 2 ..., 9997 9998 9999]
[ 10000 10001 10002 ..., 19997 19998 19999]
[ 20000 20001 20002 ..., 29997 29998 29999]
...,
[199970000 199970001 199970002 ..., 199979997 199979998 199979999]
[199980000 199980001 199980002 ..., 199989997 199989998 199989999]
[199990000 199990001 199990002 ..., 199999997 199999998 199999999]]
Then it yields the 2-D coordinates, which can be translated into your cell contents simply via array[x][y]:
154080675 x: 15408 y: 675
186978188 x: 18697 y: 8188
157506087 x: 15750 y: 6087
168859259 x: 16885 y: 9259
29775768 x: 2977 y: 5768
94167866 x: 9416 y: 7866
15978144 x: 1597 y: 8144
91964007 x: 9196 y: 4007
163462830 x: 16346 y: 2830
62613129 x: 6261 y: 3129
The documentation for sample() states it is 'used for random sampling without replacement', and this approach follows the advice on the Python random module page: 'This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).'
I note that while I use get_random_item() as a generator, the underlying sample() still builds the full list of selected indices (and, for sample sizes that are large relative to the population, CPython's implementation falls back to copying the whole population into a pool), so the worst-case memory use is still on the order of y*x; but it runs quite swiftly all the same.
Answer 2:
I believe that there is simply no way to sample a sequence without replacement without using a significant amount of auxiliary storage for sample sizes close to R * C. And while there are clever ways to reduce the amount of storage for small sample sizes, if you expect to be sampling more than about a third of your dataset, you're better off just creating a separate list. random.sample is a natural choice for this purpose; and frankly I would just pass a flattened version of your 2-D numpy array straight to it. (Unless you need the indices too, in which case randomly sampling ints and translating them into coordinates, a la hexparrot's solution, is a reasonable way to go.)
>>> import numpy, random
>>> a = numpy.arange(25).reshape((5, 5))
>>> random.sample(a.ravel(), 5)
[0, 13, 8, 18, 4]
If you look at the implementation of random.sample, you'll see that for smaller sample sizes, it already does roughly what the Perl code in another answer does: it tracks previously selected items in a set and discards selections that are already in the set. For larger sample sizes, it creates a copy of the input -- which is more memory efficient than a set for large samples, because sets take up more space than lists per stored item -- and does a slightly modified Fisher-Yates shuffle, stopping once it has sample_size items (i.e. it doesn't shuffle the whole list, so it's more efficient than shuffling the whole thing yourself); a sketch of that partial shuffle follows.
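To illustrate the partial-shuffle idea (my own sketch of the technique, not CPython's actual code):

import random

def partial_fisher_yates_sample(pool, sample_size):
    # swap a random remaining element into each of the first
    # sample_size slots, then stop -- the rest stays unshuffled
    n = len(pool)
    for i in range(sample_size):
        j = random.randrange(i, n)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:sample_size]

For example, partial_fisher_yates_sample(range(20), 5) returns five distinct values from 0-19 while doing only five swaps (in Python 2, where range returns a list).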
Basically, my bet is that you won't do better than random.sample by rolling your own, unless you code something up in C.
However -- I did find this, which you might find interesting: numpy.random.choice. This appears to do random sampling with or without replacement at C speed. The catch? It's new in NumPy 1.7!
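For example, a sketch of how that might look for the grid case (my own illustration, assuming NumPy >= 1.7):

import numpy as np

x_count, y_count = 10000, 20000
sample_size = 10

# draw unique flat indices without replacement, then map back to (x, y)
flat_indices = np.random.choice(x_count * y_count, size=sample_size,
                                replace=False)
for i in flat_indices:
    x, y = divmod(int(i), y_count)
    print x, y

One caveat: at least in its early versions, choice without replacement appears to permute the whole index range internally, so it trades memory for speed rather than avoiding the O(x*y) cost.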
Answer 3:
You're already using O(N = R*C) memory, so you might as well use O(N) memory for your iterator: copy all the elements and shuffle them, just as you would in the one-dimensional case. This is only a reasonable thing to do if you are going to visit every element, which you say is your case; a sketch follows below.
(For the record: otherwise, memory isn't such a bad issue, because you only need to "remember" where you've been previously, so you can keep a list of the indices you've already visited. That approach is bad if you do plan to visit every element, because rejection sampling against a growing blacklist can take a very long time to finish. You can also implement it with a shrinking whitelist, which is equivalent to the first solution.)
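A sketch of that approach using flat indices rather than copying the cells themselves (numpy.random.permutation stands in for "copy and randomly sort"; the names are mine):

import numpy as np

def shuffled_cells(x_count, y_count):
    # one O(N) permutation of all flat indices, yielded lazily as (x, y)
    for i in np.random.permutation(x_count * y_count):
        yield divmod(int(i), y_count)

# visits every cell exactly once, in random order
for x, y in shuffled_cells(10000, 20000):
    pass  # run the per-cell test here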
Answer 4:
You've said that your test is likely to fail for most of the cells in your grid. If that's the case, randomly sampling the cells may not be a good idea, since you'll have a hard time keeping track of the cells you've already checked without using a lot of memory.
Instead, you might be better off applying your test to the whole grid and selecting a random element from those that pass it.
This function returns a random element that passes the test (or None if they all fail). It uses very little memory: it keeps a single candidate and replaces it so that the n-th passing element survives with probability 1/n, i.e. a size-1 reservoir sample.
import random

def findRandomElementThatPasses(iterable, testFunc):
    value = None
    passed = 0
    for element in iterable:
        if testFunc(element):
            passed += 1
            # keep the n-th passing element with probability 1/n,
            # which makes the final pick uniform over all passers
            if random.random() < 1.0 / passed:
                value = element
    return value
You'd call it with something like:
e = findRandomElementThatPasses(((x, y) for x in xrange(X_SIZE)
                                        for y in xrange(Y_SIZE)),
                                someFunctionTakingAnXYTuple)
If you're using Python 3, use range instead of xrange.
Answer 5:
In Perl you could do something like this:
# how many you got and need
$xmax = 10000000;
$ymax = 10000000;
$sampleSize = 10000;

# build a hash of hashes (a dict of dicts in Python)
$cells = {};
$count = 0;
while ($count < $sampleSize) {
    # rand() returns a float, so truncate to get integer indices
    $x = int(rand($xmax));
    $y = int(rand($ymax));
    if (!$cells->{$x}->{$y}) {
        $cells->{$x}->{$y}++;
        $count++;
    }
}

# now grab the stuff
foreach my $x (keys %{$cells}) {
    foreach my $y (keys %{$cells->{$x}}) {
        getSample($x, $y);
    }
}
No duplicates, pretty random, not too hard on memory.
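For readers who, like the asker, don't know Perl, a rough Python equivalent of the same idea (a sketch; getSample is the same placeholder callback as in the Perl above):

import random

xmax = 10000000
ymax = 10000000
sample_size = 10000

# a set of (x, y) tuples plays the role of the Perl hash of hashes
cells = set()
while len(cells) < sample_size:
    cells.add((random.randrange(xmax), random.randrange(ymax)))

# now grab the stuff
for x, y in cells:
    getSample(x, y)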
Source: https://stackoverflow.com/questions/10726540/memory-efficient-random-number-iterator-without-replacement