How can I shuffle a very large list stored in a file in Python?

问题

I need to deterministically generate a randomized list containing the numbers from 0 to 2^32-1.

This would be the naive (and totally nonfunctional) way of doing it, just so it's clear what I'm wanting.

import random
numbers = range(2**32)
random.seed(0)
random.shuffle(numbers)

I've tried making the list with numpy.arange() and using pycrypto's random.shuffle() to shuffle it. Making the list ate up about 8gb of ram, then shuffling raised that to around 25gb. I only have 32gb to give. But that doesn't matter because...

I've tried cutting the list into 1024 slices and trying the above, but even one of these slices takes way too long. I cut one of these slices into 128 yet smaller slices, and that took about 620ms each. If it grew linearly, then that means the whole thing would take about 22 and a half hours to complete. That sounds alright, but it doesn't grow linearly.

Another thing I've tried is generating random numbers for every entry and using those as indices for their new location. I then go down the list and attempt to place the number at the new index. If that index is already in use, the index is incremented until it finds a free one. This works in theory, and it can do about half of it, but near the end it keeps having to search for new spots, wrapping around the list several times.

Is there any way to pull this off? Is this a feasible goal at all?

回答1:

Computing all the values seems impossible, since Crypto compute a random integer in about a milisecond, so the whole job take days.

Here is a Knuth algorithm implementation as a generator:

from Crypto.Random.random import randint  
import numpy as np

def onthefly(n):
    numbers=np.arange(n,dtype=np.uint32)
    for i in range(n):
        j=randint(i,n-1)
        numbers[i],numbers[j]=numbers[j],numbers[i]
        yield numbers[i]

For n=10 :

gen=onthefly(10)
print([next(gen) for i in range(9)])
print(next(gen))
#[9, 0, 2, 6, 4, 8, 7, 3, 1]
#5

For n=2**32, the generator take a minute to initialize, but calls are O(1).

回答2:

If you have a continuous range of numbers, you don't need to store them at all. It is easy to devise a bidirectional mapping between the value in a shuffled list and its position in that list. The idea is to use a pseudo-random permutation and this is exactly what block ciphers provide.

The trick is to find a block cipher that matches exactly your requirement of 32-bit integers. There are very few such block ciphers, but the Simon and Speck ciphers (released by the NSA) are parameterisable and support a block size of 32-bit (usually block sizes are much larger).

This library seems to provide an implementation of that. We can devise the following functions:

def get_value_from_index(key, i):
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.encrypt(i)

def get_index_from_value(key, val):
    cipher = SpeckCipher(key, mode='ECB', key_size=64, block_size=32)
    return cipher.decrypt(val)

The library works with Python's big integers, so you might not even need to encode them.

A 64-bit key (for example 0x123456789ABCDEF0) is not much. You could use a similar construction that increased the key size in DES to Triple DES. Keep in mind that keys should be chosen randomly and they have to be constant if you want determinism.

If you don't want to use an algorithm by the NSA for that, I would understand. There are others, but I can't find them now. The Hasty Pudding cipher is even more flexible, but I don't know if there is an implementation of that for Python.

回答3:

The class I created uses a bitarray of keep track which numbers have already been used. With the comments, I think the code is pretty self explanatory.

import bitarray
import random


class UniqueRandom:
    def __init__(self):
        """ Init boolean array of used numbers and set all to False
        """
        self.used = bitarray.bitarray(2**32)
        self.used.setall(False)

    def draw(self):
        """ Draw a previously unused number
         Return False if no free numbers are left
        """

        # Check if there are numbers left to use; return False if none are left
        if self._free() == 0:
            return False

        # Draw a random index
        i = random.randint(0, 2**32-1)

        # Skip ahead from the random index to a undrawn number
        while self.used[i]:
            i = (i+1) % 2**32

        # Update used array
        self.used[i] = True

        # return the selected number
        return i

    def _free(self):
        """ Check how many places are unused
        """
        return self.used.count(False)


def main():
    r = UniqueRandom()
    for _ in range(20):
        print r.draw()


if __name__ == '__main__':
    main()

Design considerations
While Garrigan Stafford's answer is perfectly fine, the memory footprint of this solution is much smaller (a bit more than 4 GB). Another difference between our answers is that Garrigan's algorithm takes more time to generate a random number when the amount of generated numbers increases (because he keeps iterating until an unused number is found). This algorithm just looks up the next unused number if a certain number is already used. This makes the time it takes to draw a number every time practically the same, regardless of how far the pool of free numbers is exhausted.

回答4:

So one way is to keep track of which numbers you have already given out and keep handing out new random numbers one at a time, consider

import random
random.seed(0)

class RandomDeck:
      def __init__(self):
           self.usedNumbers = set()

      def draw(self):
          number = random.randint(0,2**32)
          while number in self.usedNumbers:
                 number = random.randint(0,2**32)
          self.usedNumbers.append(number)
          return number

      def shuffle(self):
          self.usedNumbers = set()

As you can see we essentially have a deck of random numbers between 0 and 2^32 but we only store the numbers we have given out to ensure we don't have repeats. Then you can re-shuffle the deck by forgetting all the numbers you have already given out.

This should be efficient in most use cases as long as you don't draw ~1 million numbers without a reshuffle.

回答5:

Here is a permutation RNG which I wrote which uses the fact that a squaring a number mod a prime (plus some intricacies) gives a pseudorandom permutation.

https://github.com/pytorch/pytorch/blob/09b4f4f2ff88088306ecedf1bbe23d8aac2d3f75/torch/utils/data/_utils/index_utils.py

Short version:

def _is_prime(n):
    if n == 2:
        return True
    if n == 1 or n % 2 == 0:
        return False

    for d in range(3, floor(sqrt(n)) + 1, 2):  # can use isqrt in Python 3.8
        if n % d == 0:
            return False

    return True


class Permutation(Range):
    """
    Generates a random permutation of integers from 0 up to size.
    Inspired by https://preshing.com/20121224/how-to-generate-a-sequence-of-unique-random-integers/
    """

    size: int
    prime: int
    seed: int

    def __init__(self, size: int, seed: int):
        self.size = size
        self.prime = self._get_prime(size)
        self.seed = seed % self.prime

    def __getitem__(self, index):
        x = self._map(index)

        while x >= self.size:
            # If we map to a number greater than size, then the cycle of successive mappings must eventually result
            # in a number less than size. Proof: The cycle of successive mappings traces a path
            # that either always stays in the set n>=size or it enters and leaves it,
            # else the 1:1 mapping would be violated (two numbers would map to the same number).
            # Moreover, `set(range(size)) - set(map(n) for n in range(size) if map(n) < size)`
            # equals the `set(map(n) for n in range(size, prime) if map(n) < size)`
            # because the total mapping is exhaustive.
            # Which means we'll arrive at a number that wasn't mapped to by any other valid index.
            # This will take at most `prime-size` steps, and `prime-size` is on the order of log(size), so fast.
            # But usually we just need to remap once.
            x = self._map(x)

        return x

    @staticmethod
    def _get_prime(size):
        """
        Returns the prime number >= size which has the form (4n-1)
        """
        n = size + (3 - size % 4)
        while not _is_prime(n):
            # We expect to find a prime after O(log(size)) iterations
            # Using a brute-force primehood test, total complexity is O(log(size)*sqrt(size)), which is pretty good.
            n = n + 4
        return n

    def _map(self, index):
        a = self._permute_qpr(index)
        b = (a + self.seed) % self.prime
        c = self._permute_qpr(b)
        return c

    def _permute_qpr(self, x):
        residue = pow(x, 2, self.prime)

        if x * 2 < self.prime:
            return residue
        else:
            return self.prime - residue

来源：https://stackoverflow.com/questions/44857817/how-can-i-shuffle-a-very-large-list-stored-in-a-file-in-python

标签

python

list

numpy

random

pycrypto