Question
I feel like this one should be easy, but after numerous searches and attempts I can't figure out an answer. Basically I have a very large number of items that I want to sample in a random order without replacement; in this case they are cells in a 2D array. The solution I would use for a smaller array doesn't translate, because it requires shuffling an in-memory array. If the number I had to sample were small, I could also just randomly sample items and keep a list of the values I'd tried. Unfortunately I will often have to sample a very large proportion of all the cells, sometimes all of them.
What I'd like to create is an iterator, using some combination of itertools, numpy and/or random, that yields the next random cell (x and y indices). Another possible solution would be an iterator that yields the next random number (without replacement) between 0 and (x_count * y_count), which I could map back to a cell location. Neither seems easily accomplished.
Thanks for any suggestions!
Here's my current solution.
import numpy as np
import itertools as itr
import random as rdm

# works great
x_count = 10
y_count = 5

# good luck!
#x_count = 10000
#y_count = 20000

x_indices = np.arange(x_count)
y_indices = np.arange(y_count)
cell_indices = itr.product(x_indices, y_indices)

# materializing and shuffling every (x, y) pair is what blows up
# the memory for large grids
list_cell_indices = list(cell_indices)
rdm.shuffle(list_cell_indices)

for i in range(25):
    print list_cell_indices[i]
So based on the current responses, and my attempt to translate the Perl (a language I know nothing about), my understanding is that the best I can do is the following:
import random as rdm

x_count = 10000
y_count = 5000
sample_count = 10000
keep_probability = 0.01

tried_cells = set()
kept_cells = set()

while len(kept_cells) < sample_count:
    # randint is inclusive on both ends, so subtract 1 to stay in range
    x = rdm.randint(0, x_count - 1)
    y = rdm.randint(0, y_count - 1)
    if (x, y) in tried_cells:
        pass
    else:
        tried_cells.add((x, y))
        keep = rdm.random() < keep_probability
        if keep:
            kept_cells.add((x, y))
            print "worked"
In most cases the processing time and memory used aren't that bad. Maybe I could check keep_probability against sample_count up front and raise an error for the difficult cases; a sketch of such a guard follows.
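A guard along those lines might look like this (a minimal sketch; the function name and the factor-of-two margin are my own arbitrary choices):

def check_feasibility(x_count, y_count, sample_count, keep_probability):
    # rejection sampling bogs down once the requested sample gets close
    # to the expected number of cells that will pass the keep test
    expected_keepers = x_count * y_count * keep_probability
    if sample_count * 2 > expected_keepers:
        raise ValueError("sample_count is too close to the expected "
                         "number of keepable cells; enumerate the grid "
                         "and sample the passers instead")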
Answer 1:
How about this approach: first I create the x*y array and reshape it to 2-D. Then, since each cell can be uniquely identified by a single integer, I draw a sample from the range 0 to (x*y).
import numpy

x_count = 10000
y_count = 20000

# one big table where each cell's value is its unique flat index
large_table = numpy.arange(y_count * x_count).reshape(y_count, x_count)
print large_table

def get_random_item(sample_size):
    from random import sample
    # sample() draws unique flat indices; divmod turns each one into
    # a unique (x, y) pair, so no cell can come up twice
    for i in sample(xrange(y_count * x_count), sample_size):
        y, x = divmod(i, y_count)
        yield (x, y)

for x, y in get_random_item(10):
    print '%12i x: %5i y: %5i' % (large_table[x][y], x, y)
Which returns (first, to simulate the existing 2-D array you created via product):
[[ 0 1 2 ..., 9997 9998 9999]
[ 10000 10001 10002 ..., 19997 19998 19999]
[ 20000 20001 20002 ..., 29997 29998 29999]
...,
[199970000 199970001 199970002 ..., 199979997 199979998 199979999]
[199980000 199980001 199980002 ..., 199989997 199989998 199989999]
[199990000 199990001 199990002 ..., 199999997 199999998 199999999]]
Then it yields the 2-D coordinates, which can be translated into your cell contents simply via array[x][y]:
154080675 x: 15408 y: 675
186978188 x: 18697 y: 8188
157506087 x: 15750 y: 6087
168859259 x: 16885 y: 9259
29775768 x: 2977 y: 5768
94167866 x: 9416 y: 7866
15978144 x: 1597 y: 8144
91964007 x: 9196 y: 4007
163462830 x: 16346 y: 2830
62613129 x: 6261 y: 3129
The documentation for sample() states it is 'used for random sampling without replacement', and this approach follows the advice on the Python random module page: 'This is especially fast and space efficient for sampling from a large population: sample(xrange(10000000), 60).'
I note that while I use get_random_item() as a generator, the underlying sample() still builds the full list of selected indices (and, for sample sizes that are large relative to the population, CPython's implementation falls back to copying the whole population into a pool), so the worst-case memory use is still on the order of y*x; but it runs quite swiftly all the same.
Answer 2:
I believe that there is simply no way to sample a sequence without replacement without using a significant amount of auxiliary storage for sample sizes close to R * C. And while there are clever ways to reduce the amount of storage for small sample sizes, if you expect to be sampling more than about a third of your dataset, you're better off just creating a separate list. random.sample is a natural choice for this purpose; and frankly I would just pass a flattened version of your 2-D numpy array straight to it. (Unless you need the indices too, in which case randomly sampling ints and translating them into coordinates, a la hexparrot's solution, is a reasonable way to go.)
>>> import numpy, random
>>> a = numpy.arange(25).reshape((5, 5))
>>> random.sample(a.ravel(), 5)
[0, 13, 8, 18, 4]
If you look at the implementation of random.sample, you'll see that for smaller sample sizes, it already does roughly what the Perl code in another answer does: it tracks previously selected items in a set and discards selections that are already in the set. For larger sample sizes, it creates a copy of the input -- which is more memory efficient than a set for large samples, because sets take up more space than lists per stored item -- and does a slightly modified Fisher-Yates shuffle, stopping once it has sample_size items (i.e. it doesn't shuffle the whole list, so it's more efficient than shuffling the whole thing yourself); a sketch of that partial shuffle follows.
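To illustrate the partial-shuffle idea (my own sketch of the technique, not CPython's actual code):

import random

def partial_fisher_yates_sample(pool, sample_size):
    # swap a random remaining element into each of the first
    # sample_size slots, then stop -- the rest stays unshuffled
    n = len(pool)
    for i in range(sample_size):
        j = random.randrange(i, n)
        pool[i], pool[j] = pool[j], pool[i]
    return pool[:sample_size]

For example, partial_fisher_yates_sample(range(20), 5) returns five distinct values from 0-19 while doing only five swaps (in Python 2, where range returns a list).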
Basically, my bet is that you won't do better than random.sample by rolling your own, unless you code something up in C.
However -- I did find this, which you might find interesting: numpy.random.choice. This appears to do random sampling with or without replacement at C speed. The catch? It's new in NumPy 1.7!
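For example, a sketch of how that might look for the grid case (my own illustration, assuming NumPy >= 1.7):

import numpy as np

x_count, y_count = 10000, 20000
sample_size = 10

# draw unique flat indices without replacement, then map back to (x, y)
flat_indices = np.random.choice(x_count * y_count, size=sample_size,
                                replace=False)
for i in flat_indices:
    x, y = divmod(int(i), y_count)
    print x, y

One caveat: at least in its early versions, choice without replacement appears to permute the whole index range internally, so it trades memory for speed rather than avoiding the O(x*y) cost.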
Answer 3:
You're already using O(N = R*C) memory, so you might as well use O(N) memory for your iterator: copy all the elements and shuffle them, just as you would in the one-dimensional case. This is only a reasonable thing to do if you are going to visit every element, which you say is your case; a sketch follows below.
(For the record: otherwise, memory isn't such a bad issue, because you only need to "remember" where you've been previously, so you can keep a list of the indices you've already visited. That approach is bad if you do plan to visit every element, because rejection sampling against a growing blacklist can take a very long time to finish. You can also implement it with a shrinking whitelist, which is equivalent to the first solution.)
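A sketch of that approach using flat indices rather than copying the cells themselves (numpy.random.permutation stands in for "copy and randomly sort"; the names are mine):

import numpy as np

def shuffled_cells(x_count, y_count):
    # one O(N) permutation of all flat indices, yielded lazily as (x, y)
    for i in np.random.permutation(x_count * y_count):
        yield divmod(int(i), y_count)

# visits every cell exactly once, in random order
for x, y in shuffled_cells(10000, 20000):
    pass  # run the per-cell test here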
Answer 4:
You've said that your test is likely to fail for most of the cells in your grid. If that's the case, randomly sampling the cells may not be a good idea, since you'll have a hard time keeping track of the cells you've already checked without using a lot of memory.
Instead, you might be better off applying your test to the whole grid and selecting a random element from those that pass it.
This function returns a random element that passes the test (or None if they all fail). It uses very little memory: it keeps a single candidate and replaces it so that the n-th passing element survives with probability 1/n, i.e. a size-1 reservoir sample.
import random

def findRandomElementThatPasses(iterable, testFunc):
    value = None
    passed = 0
    for element in iterable:
        if testFunc(element):
            passed += 1
            # keep the n-th passing element with probability 1/n,
            # which makes the final pick uniform over all passers
            if random.random() < 1.0 / passed:
                value = element
    return value
You'd call it with something like:
e = findRandomElementThatPasses(((x, y) for x in xrange(X_SIZE)
                                        for y in xrange(Y_SIZE)),
                                someFunctionTakingAnXYTuple)
If you're using Python 3, use range instead of xrange.
Answer 5:
In Perl you could do something like this:
# how many you got and need
$xmax = 10000000;
$ymax = 10000000;
$sampleSize = 10000;

# build a hash of hashes (a dict of dicts in Python)
$cells = {};
$count = 0;
while ($count < $sampleSize) {
    # rand() returns a float, so truncate to get integer indices
    $x = int(rand($xmax));
    $y = int(rand($ymax));
    if (!$cells->{$x}->{$y}) {
        $cells->{$x}->{$y}++;
        $count++;
    }
}

# now grab the stuff
foreach my $x (keys %{$cells}) {
    foreach my $y (keys %{$cells->{$x}}) {
        getSample($x, $y);
    }
}
No duplicates, pretty random, not too hard on memory.
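For readers who, like the asker, don't know Perl, a rough Python equivalent of the same idea (a sketch; getSample is the same placeholder callback as in the Perl above):

import random

xmax = 10000000
ymax = 10000000
sample_size = 10000

# a set of (x, y) tuples plays the role of the Perl hash of hashes
cells = set()
while len(cells) < sample_size:
    cells.add((random.randrange(xmax), random.randrange(ymax)))

# now grab the stuff
for x, y in cells:
    getSample(x, y)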
Source: https://stackoverflow.com/questions/10726540/memory-efficient-random-number-iterator-without-replacement