How to generate random permutations with CUDA


What parallel algorithms could I use to generate random permutations from a given set? Especially proposals or links to papers suitable for CUDA would be helpful.

3 Answers
  • 2020-12-16 07:26

    Fisher-Yates shuffle could be parallelized. For example, 4 concurrent workers need only 3 iterations to shuffle a vector of 8 elements. On the first iteration they swap 0<->1, 2<->3, 4<->5, 6<->7; on the second iteration 0<->2, 1<->3, 4<->6, 5<->7; and on the last iteration 0<->4, 1<->5, 2<->6, 3<->7. Each swap is performed or skipped at random.

    [Figure: ParallelFisherYates]
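
    To make the swap schedule concrete, here is a minimal sketch that lists the candidate pairs for each round. The helper names round_position and print_schedule are made up for this illustration; the bit manipulation is the same one used in the kernel code in this answer.

```cuda
#include <cstdio>

// Lower index of the pair handled by worker `id` in round `r` (0-based).
// Start at 2*id and, for every new shift, move the bit at `shift`
// down one place, exactly as the kernel loop does.
__host__ __device__ inline unsigned int round_position(unsigned int id, unsigned int r)
{
    unsigned int shift = 1;
    unsigned int pos = 2 * id;
    for (unsigned int k = 0; k < r; ++k) {
        shift <<= 1;
        pos = (pos & ~shift) | ((pos & shift) >> 1);
    }
    return pos;
}

// Print the candidate swaps of `workers` workers (a power of two)
// for a vector of 2*workers elements.
inline void print_schedule(unsigned int workers)
{
    unsigned int r = 0;
    for (unsigned int shift = 1; shift <= workers; shift <<= 1, ++r) {
        std::printf("round %u:", r + 1);
        for (unsigned int id = 0; id < workers; ++id) {
            unsigned int pos = round_position(id, r);
            std::printf(" %u<->%u", pos, pos + shift);
        }
        std::printf("\n");
    }
}
```

    Calling print_schedule(4) prints "round 1: 0<->1 2<->3 4<->5 6<->7", then "round 2: 0<->2 1<->3 4<->6 5<->7", then "round 3: 0<->4 1<->5 2<->6 3<->7".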

    This could be easily implemented as CUDA __device__ code (inspired by standard min/max reduction):

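    const int id  = threadIdx.x;
    // Identity permutation in shared memory; each thread fills two slots.
    __shared__ int perm_shared[2 * BLOCK_SIZE];
    perm_shared[2 * id]     = 2 * id;
    perm_shared[2 * id + 1] = 2 * id + 1;
    __syncthreads();
    
    unsigned int shift = 1;     // distance between the two swap candidates
    unsigned int pos = id * 2;  // lower index of this thread's pair
    while(shift <= BLOCK_SIZE)
    {
        // Swap the pair with probability 1/2.
        if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
        shift = shift << 1;
        // Move the bit at the new shift down one place to get the next position.
        pos = (pos & ~shift) | ((pos & shift) >> 1);
        __syncthreads();
    }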
    

    Here the curand initialization code is omitted, and the function swap(int *p, int i, int j) exchanges the values p[i] and p[j].

    Note that the code above makes the following assumptions:

    1. The length of the permutation is 2 * BLOCK_SIZE, where BLOCK_SIZE is a power of 2.
    2. 2 * BLOCK_SIZE integers fit into the __shared__ memory of the CUDA device.
    3. BLOCK_SIZE is a valid size of a CUDA block (usually something between 32 and 512).
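
    One way the omitted pieces might be filled in is sketched below. The kernel name shuffle_kernel, the output layout, BLOCK_SIZE = 4 and the seeding scheme (one seed, one cuRAND subsequence per thread) are assumptions made for this illustration, not part of the original snippet.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

constexpr int BLOCK_SIZE = 4;  // assumed: threads per block, a power of two

// Exchange p[i] and p[j].
__device__ inline void swap(int *p, int i, int j)
{
    int t = p[i];
    p[i] = p[j];
    p[j] = t;
}

// Each block writes one permutation of 0 .. 2*BLOCK_SIZE-1 into `out`.
// Launch with BLOCK_SIZE threads per block.
__global__ void shuffle_kernel(int *out, unsigned long long seed)
{
    const int id = threadIdx.x;

    // One generator per thread: shared seed, distinct subsequence.
    curandState curand_state;
    curand_init(seed, blockIdx.x * blockDim.x + id, 0, &curand_state);

    __shared__ int perm_shared[2 * BLOCK_SIZE];
    perm_shared[2 * id]     = 2 * id;
    perm_shared[2 * id + 1] = 2 * id + 1;
    __syncthreads();

    unsigned int shift = 1;
    unsigned int pos = id * 2;
    while (shift <= BLOCK_SIZE)
    {
        if (curand(&curand_state) & 1) swap(perm_shared, pos, pos + shift);
        shift = shift << 1;
        pos = (pos & ~shift) | ((pos & shift) >> 1);
        __syncthreads();
    }

    // Copy this block's permutation to global memory.
    int *dst = out + blockIdx.x * 2 * BLOCK_SIZE;
    dst[2 * id]     = perm_shared[2 * id];
    dst[2 * id + 1] = perm_shared[2 * id + 1];
}
```

    A launch such as shuffle_kernel<<<num_blocks, BLOCK_SIZE>>>(d_out, seed) fills d_out with num_blocks permutations, where d_out holds num_blocks * 2 * BLOCK_SIZE integers. One thing worth keeping in mind: the loop draws only BLOCK_SIZE * log2(2 * BLOCK_SIZE) random bits, for example 12 bits for 8 elements. That allows at most 4096 outcomes against 8! = 40320 permutations, so this scheme cannot reach every permutation.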

    To generate more than one permutation, I would suggest using a separate CUDA block for each. If the goal is to make a permutation of 7 elements (as was mentioned in the original question), then I believe it will be faster to do it in a single thread.
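
    For the small case, a sketch of the single-thread approach could look like the following: every thread runs the classic sequential Fisher-Yates shuffle on its own private array. The kernel name small_shuffle_kernel, the constant N_ITEMS and the output layout are assumptions made for this illustration.

```cuda
#include <cuda_runtime.h>
#include <curand_kernel.h>

constexpr int N_ITEMS = 7;  // assumed set size

// Thread `tid` writes its own permutation of 0 .. N_ITEMS-1
// to out[tid * N_ITEMS .. tid * N_ITEMS + N_ITEMS - 1].
__global__ void small_shuffle_kernel(int *out, int num_perms, unsigned long long seed)
{
    const int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_perms) return;

    // Shared seed, one cuRAND subsequence per thread.
    curandState state;
    curand_init(seed, tid, 0, &state);

    int perm[N_ITEMS];
    for (int i = 0; i < N_ITEMS; ++i) perm[i] = i;

    // Sequential Fisher-Yates: swap element i with a random element j <= i.
    for (int i = N_ITEMS - 1; i > 0; --i) {
        // The modulo introduces a very slight bias, since 2^32 is not
        // a multiple of every (i + 1).
        int j = curand(&state) % (i + 1);
        int t = perm[i];
        perm[i] = perm[j];
        perm[j] = t;
    }

    for (int i = 0; i < N_ITEMS; ++i) out[tid * N_ITEMS + i] = perm[i];
}
```

    A launch such as small_shuffle_kernel<<<(num_perms + 255) / 256, 256>>>(d_out, num_perms, seed) produces num_perms independent permutations, one per thread.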

  • 2020-12-16 07:31

    For large sets, using a sort primitive on a vector of randomized keys might be efficient enough for your needs. First, set up some vectors:

    const int N = 65535;
    thrust::device_vector<uint16_t> d_cards(N);
    // 32-bit keys: PRNFunc returns uint32_t, and truncating to 16 bits would create duplicate keys
    thrust::device_vector<uint32_t> d_keys(N);
    thrust::sequence(d_cards.begin(), d_cards.end());
    

    Then, each time you want to shuffle d_cards, call this pair of functions:

    thrust::tabulate(d_keys.begin(), d_keys.end(), PRNFunc(rand()*rand()));
    thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_cards.begin());
    // d_cards now freshly shuffled
    

    The random keys are generated from a functor that uses a seed (evaluated in host-code and copied to the kernel at launch-time) and a key number (which tabulate passes in at thread-creation time):

    struct PRNFunc
    {
      uint32_t seed;
      PRNFunc(uint32_t s) { seed = s; }
      __device__ __host__ uint32_t operator()(uint32_t kn) const
      {
        thrust::minstd_rand randEng(seed);
        randEng.discard(kn);   // skip ahead so key kn gets its own value
        return randEng();
      }
    };
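
    Put together as one compilable unit, the pieces above might look like the sketch below. The wrapper shuffle_cards and its parameter list are assumptions made for this illustration; the key vector uses uint32_t so that the functor's return value is not truncated.

```cuda
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/random.h>
#include <thrust/sequence.h>
#include <thrust/sort.h>
#include <thrust/tabulate.h>

// Key generator: key number kn is the (kn+1)-th output of a
// minstd_rand engine started from `seed`.
struct PRNFunc
{
    uint32_t seed;
    explicit PRNFunc(uint32_t s) : seed(s) {}
    __host__ __device__ uint32_t operator()(uint32_t kn) const
    {
        thrust::minstd_rand randEng(seed);
        randEng.discard(kn);
        return randEng();
    }
};

// Shuffle d_cards in place. d_keys is scratch space of the same length
// and is overwritten on every call.
inline void shuffle_cards(thrust::device_vector<uint16_t> &d_cards,
                          thrust::device_vector<uint32_t> &d_keys,
                          uint32_t seed)
{
    // Fill the keys with pseudo-random numbers, then sort the cards by key.
    thrust::tabulate(d_keys.begin(), d_keys.end(), PRNFunc(seed));
    thrust::sort_by_key(d_keys.begin(), d_keys.end(), d_cards.begin());
}
```

    After filling d_cards once with thrust::sequence, each call to shuffle_cards(d_cards, d_keys, seed) with a fresh seed reshuffles the cards. Note that the result depends only on the 32-bit seed, so at most 2^32 distinct orderings can ever be produced this way.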
    

    I have found that performance could be improved (by probably 30%) if I could figure out how to cache the allocations that thrust::sort_by_key does internally.
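
    One possible way to cache those allocations is to pass Thrust a custom temporary allocator through the thrust::cuda::par execution policy, in the spirit of the custom temporary allocation example that ships with Thrust. The sketch below is a simplified version under my own assumptions: the names cached_allocator and sort_by_key_cached are hypothetical, buffers are reused only when the requested size matches exactly, and it is untuned.

```cuda
#include <cstddef>
#include <map>
#include <new>
#include <utility>
#include <cuda_runtime.h>
#include <thrust/sort.h>
#include <thrust/system/cuda/execution_policy.h>

// Hypothetical helper: hands out device buffers and keeps released
// ones around so that later requests of the same size skip cudaMalloc.
struct cached_allocator
{
    typedef char value_type;

    std::multimap<std::ptrdiff_t, char *> free_blocks;  // size -> idle buffer
    std::map<char *, std::ptrdiff_t> live_blocks;       // buffer in use -> size

    char *allocate(std::ptrdiff_t num_bytes)
    {
        char *ptr = nullptr;
        auto it = free_blocks.find(num_bytes);
        if (it != free_blocks.end()) {
            ptr = it->second;                // reuse an idle buffer
            free_blocks.erase(it);
        } else if (cudaMalloc(&ptr, num_bytes) != cudaSuccess) {
            throw std::bad_alloc();
        }
        live_blocks[ptr] = num_bytes;
        return ptr;
    }

    void deallocate(char *ptr, size_t)
    {
        auto it = live_blocks.find(ptr);
        if (it == live_blocks.end()) return; // not one of ours
        free_blocks.insert(std::make_pair(it->second, ptr));  // keep for later
        live_blocks.erase(it);
    }

    ~cached_allocator()
    {
        for (auto &kv : free_blocks) cudaFree(kv.second);
        for (auto &kv : live_blocks) cudaFree(kv.first);
    }
};

// Same sort as before, but temporary storage comes from `alloc`.
template <typename KeyVec, typename ValVec>
void sort_by_key_cached(cached_allocator &alloc, KeyVec &keys, ValVec &vals)
{
    thrust::sort_by_key(thrust::cuda::par(alloc),
                        keys.begin(), keys.end(), vals.begin());
}
```

    The allocator object has to outlive all sorts that use it, for example as a member of whatever object owns d_keys and d_cards. Whether this recovers the estimated 30% would need to be measured.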

    Any corrections or suggestions welcome.

  • 2020-12-16 07:38

    If the set s has length s_L and you want n permutations of it, a very crude way of doing this could be implemented in Thrust:

    http://thrust.github.com.

    First, create a vector val of length s_L x n that repeats s n times.

    Then create a vector val_keys that associates one of n unique keys with each element of val, with each key repeated s_L times, e.g.,

      val = {1,2,...,7,1,2,...,7,....,1,2,...,7}
      val_keys = {0,0,0,0,0,0,0,1,1,1,1,1,1,1,2,2,2,....,n-1,n-1,n-1}
    

    Now the fun part. Create a vector U of s_L x n uniformly distributed random values:

      U  = {0.24, 0.1, .... , 0.83} 
    

    Then you can use a zip iterator over val and val_keys and sort them according to U:

    http://codeyarns.com/2011/04/04/thrust-zip_iterator/

    After this sort, both val and val_keys will be all over the place, so you have to put them back together using thrust::stable_sort_by_key() with val_keys as the key. The stable sort guarantees that if val[i] and val[j] belong to the same key and val[i] precedes val[j] after the random sort, then val[i] still precedes val[j] in the final version. If all goes according to plan, val_keys should look just as before, but val should reflect the shuffling.
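
    The two sorting steps described above can be sketched as follows. The function batched_shuffle and the functors UniformFunc, ValueOf and GroupOf are hypothetical names introduced for this illustration, and the set is assumed to be {1, ..., s_L}.

```cuda
#include <thrust/device_vector.h>
#include <thrust/iterator/counting_iterator.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/random.h>
#include <thrust/sort.h>
#include <thrust/transform.h>
#include <thrust/tuple.h>

// U[i]: a float in [0, 1) taken from the i-th position of one engine stream.
struct UniformFunc
{
    unsigned int seed;
    explicit UniformFunc(unsigned int s) : seed(s) {}
    __host__ __device__ float operator()(unsigned int i) const
    {
        thrust::default_random_engine eng(seed);
        thrust::uniform_real_distribution<float> dist(0.0f, 1.0f);
        eng.discard(i);
        return dist(eng);
    }
};

// val[i] = 1 .. s_L repeated; val_keys[i] = index of the copy.
struct ValueOf { int s_L; __host__ __device__ int operator()(int i) const { return i % s_L + 1; } };
struct GroupOf { int s_L; __host__ __device__ int operator()(int i) const { return i / s_L; } };

// Returns n shuffled copies of {1, ..., s_L}, stored back to back.
inline thrust::device_vector<int> batched_shuffle(int s_L, int n, unsigned int seed)
{
    const int total = s_L * n;
    thrust::device_vector<int> val(total), val_keys(total);
    thrust::device_vector<float> U(total);
    thrust::counting_iterator<int> idx(0);

    thrust::transform(idx, idx + total, val.begin(), ValueOf{s_L});
    thrust::transform(idx, idx + total, val_keys.begin(), GroupOf{s_L});
    thrust::transform(idx, idx + total, U.begin(), UniformFunc(seed));

    // Step 1: sort by U, carrying (val, val_keys) pairs along.
    // Equal floats in U are possible and would add a small bias.
    thrust::sort_by_key(U.begin(), U.end(),
        thrust::make_zip_iterator(thrust::make_tuple(val.begin(), val_keys.begin())));

    // Step 2: regroup by key; stability keeps the random order inside each group.
    thrust::stable_sort_by_key(val_keys.begin(), val_keys.end(), val.begin());
    return val;
}
```

    For example, batched_shuffle(7, 1000, seed) returns 7000 values in which every consecutive run of 7 is a permutation of 1 to 7.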
