Shuffling lines of a file with a fixed seed?

粉色の甜心 2020-12-16 14:43

I want to shuffle the lines of a file with a fixed seed so that I always get the same random order. The command I am using is as follows:

sort -R file.txt |          


        
3 Answers
  • 2020-12-16 15:13

    If you're randomly shuffling lines, you're not sorting. I haven't seen a sort with a --random-source option before. It'd be interesting if it does exist. However, that's not sorting the lines in a fixed order.

    I believe you'll have to write a program to do that, and I don't think Bash can quite do it.

    Actually, it might. The $RANDOM shell variable yields a pseudo-random number from 0 to 32767. If you assign a seed to RANDOM, the same sequence of numbers is produced every time. You can then use a card-dealing algorithm: read each line into a Bash array, then use the algorithm to pick the lines one by one.

    I'm not going to write a test program -- especially in Bash, but you should get the idea.
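The card-dealing idea is essentially a Fisher-Yates shuffle. Here is a minimal sketch of it, assuming bash 4 or later (for mapfile). The function name seeded_shuffle is made up for illustration, and the order produced for a given seed may differ between bash versions, since the $RANDOM generator has changed over time.

```shell
# seeded_shuffle SEED
# Reads lines from stdin and prints them in a pseudo-random order that
# depends only on SEED (hypothetical helper, not part of bash).
seeded_shuffle() {
  local seed=$1 i j tmp
  local -a lines
  mapfile -t lines                      # needs bash 4+
  (( ${#lines[@]} )) || return 0        # nothing to do for empty input
  RANDOM=$seed                          # seeding makes the sequence repeatable
  for (( i = ${#lines[@]} - 1; i > 0; i-- )); do
    # Two 15-bit draws are combined so inputs longer than 32768 lines work.
    # The modulo adds a slight bias, which is acceptable for a sketch.
    j=$(( ((RANDOM << 15) | RANDOM) % (i + 1) ))
    tmp=${lines[i]}
    lines[i]=${lines[j]}
    lines[j]=$tmp
  done
  printf '%s\n' "${lines[@]}"
}
```

It could be used as `seeded_shuffle 42 < file.txt`; running it twice with the same seed in the same bash should give the same order.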

  • 2020-12-16 15:28

    You may not need to use external tools like sort, whose options and usage may vary depending on your operating system. Bash has an internal random number generator accessible through the $RANDOM variable. It's common practice to seed the generator by setting the variable, like so:

    RANDOM=$$
    

    or

    RANDOM=$(date '+%s')
    

    etc. But of course, you can also use a predictable seed in order to get predictable not-so-random results:

    $ RANDOM=12345; echo $RANDOM
    28207
    $ RANDOM=12345; echo $RANDOM
    28207
    

    To reorder the lines of a file randomly, first read the file into an array using mapfile:

    $ mapfile -t a < source.txt
    

    Then simply rewrite the array indices:

    $ for i in "${!a[@]}"; do a[$((RANDOM+${#a[@]}))]="${a[$i]}"; unset "a[$i]"; done
    

    When an indexed (non-associative) array is expanded, bash returns its elements in ascending order of index value, so the randomly assigned indices determine the output order.
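A quick way to see that ordering, using a small throwaway array (the values and indices here are arbitrary):

```shell
# Assign elements out of order at sparse indices.
a=()
a[30]="third"
a[5]="first"
a[12]="second"

# Expansion follows ascending index order, not assignment order.
printf '%s\n' "${a[@]}"    # prints first, second, third (one per line)
printf '%s\n' "${!a[@]}"   # prints 5, 12, 30 (one per line)
```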

    Note that the new index for each line has the number of array elements added to it, so that it cannot collide with the index of a line that has not been moved yet. This solution is still fallible: there's no guarantee that $RANDOM will produce unique numbers, and if two lines draw the same index, the later one overwrites the earlier one and a line is lost. You can mitigate that risk with extra code that checks for prior use of each index, or reduce the risk with bit-shifting:

    ... a[$(( (RANDOM<<15)+RANDOM+${#a[@]} ))]= ...
    

    This turns the random part of each index into a 30-bit unsigned integer instead of a 15-bit one.
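The extra check for prior use of an index could look something like this. It is only a sketch under a few assumptions: bash 4 or later, a made-up function name (seeded_reindex_shuffle), and a seed passed as the first argument so the result is repeatable.

```shell
# seeded_reindex_shuffle SEED
# Reads lines from stdin, moves each one to a random sparse index, and
# redraws whenever the chosen index is already taken (hypothetical helper).
seeded_reindex_shuffle() {
  local seed=$1 i idx n
  local -a a
  mapfile -t a
  n=${#a[@]}
  RANDOM=$seed
  for (( i = 0; i < n; i++ )); do
    # 30-bit draw, offset by n so it can never hit a line not yet moved.
    idx=$(( (RANDOM << 15) + RANDOM + n ))
    # ${a[idx]+x} expands to "x" only if that slot is already set.
    while [[ -n ${a[idx]+x} ]]; do
      idx=$(( (RANDOM << 15) + RANDOM + n ))
    done
    a[idx]=${a[i]}
    unset 'a[i]'
  done
  if (( n )); then
    printf '%s\n' "${a[@]}"   # expands in ascending index order
  fi
}
```

With the redraw loop no line can be overwritten, so the output always has as many lines as the input.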

  • 2020-12-16 15:31

    The GNU implementation of sort has a --random-source option. Passing it the name of a file with known contents makes the output reproducible.

    See the Random sources documentation in the GNU coreutils manual, which contains the following sample implementation and example:

    get_seeded_random()
    {
      seed="$1"
      openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
        </dev/zero 2>/dev/null
    }
    
    shuf -i1-100 --random-source=<(get_seeded_random 42)
    

    Since GNU sort is also part of coreutils, the relevant documentation applies there as well:

    sort --random-source=<(get_seeded_random 42) -R file.txt | head -200 > file.sff
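One caveat worth knowing: sort -R orders lines by a hash of their contents, so identical lines end up next to each other. If the file may contain duplicates and a true permutation is wanted, shuf takes the same --random-source option. A minimal sketch, assuming GNU shuf and openssl are installed; shuffle_lines is a hypothetical wrapper name, and get_seeded_random is the helper from the manual quoted above:

```shell
# Deterministic byte stream derived from a seed (from the coreutils manual).
get_seeded_random() {
  local seed=$1
  openssl enc -aes-256-ctr -pass pass:"$seed" -nosalt \
    </dev/zero 2>/dev/null
}

# shuffle_lines SEED FILE
# Prints the lines of FILE in an order determined by SEED.
# Needs bash for the <( ) process substitution.
shuffle_lines() {
  local seed=$1 file=$2
  shuf --random-source=<(get_seeded_random "$seed") "$file"
}
```

For the command in the question this would be something like `shuffle_lines 42 file.txt | head -200 > file.sff`.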
    