bash - shuffle a file that is too large to fit in memory

梦毁少年i 2021-01-01 20:29

I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other).

6 Answers
  •  灰色年华
    2021-01-01 21:02

    Using a form of the decorate-sort-undecorate pattern with awk, you can do something like:

    $ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
    8
    5
    1
    9
    6
    3
    7
    2
    10
    4
    

    For a file, you would do:

    $ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT
    

    or cat the file at the start of the pipeline.
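
    As a quick sanity check on the result: the shuffled file should contain exactly the same lines as the input, only reordered. Assuming SORTED.TXT really is sorted (and under the same locale that sort will use now), re-sorting the output should reproduce it byte for byte:

    $ wc -l SORTED.TXT SHUFFLED.TXT
    $ sort SHUFFLED.TXT | cmp - SORTED.TXT && echo "same lines, different order"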

    The pipeline works by generating a column of random numbers between 000000 and 999999 inclusive (decorate), sorting on that column (sort), and then deleting the column (undecorate). Because the column is generated with leading zeros, it also sorts correctly lexicographically, so this works even on platforms where sort does not understand numeric keys.
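
    The memory-heavy step here is sort, and GNU coreutils sort (like most implementations) performs an external merge sort: when its buffer fills, it spills sorted runs to temporary files and merges them, so the whole pipeline runs in bounded RAM even for huge inputs. A sketch with the GNU-specific buffer and scratch-directory options spelled out; BIG.TXT and the sizes are just placeholders:

    # Same decorate-sort-undecorate pipeline with sort's memory use pinned down:
    # -S 512M caps the in-memory buffer, -T points the spill files at a
    # filesystem with enough free space for roughly the size of the input.
    awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' BIG.TXT \
      | sort -n -S 512M -T /var/tmp \
      | cut -c8- > SHUFFLED.TXT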

    You can increase that randomization, if desired, in several ways:

    1. If your platform's sort understands numeric keys (POSIX, GNU and BSD implementations do), you can use awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' FILE.TXT | sort -n | cut -f 2- to decorate each line with a near-double-precision random key (wrapped into a reusable function in the sketch after this list).

    2. If you are limited to a lexicographic sort, combine two calls to rand into one column: awk 'BEGIN{srand();} {printf "%06d%06d\t%s\n", rand()*1000000, rand()*1000000, $0;}' FILE.TXT | sort | cut -f 2-, which gives a composite 12 digits of randomization.
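
    For repeated use, the variant from item 1 can be wrapped in a small shell function; shuffle_big is an illustrative name, not something from the answer above:

    # Shuffle a file of arbitrary size: decorate each line with a random key,
    # let sort's external merge sort order the keys on disk, then strip them.
    shuffle_big() {
        awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' "$1" \
            | sort -n \
            | cut -f 2-
    }

    # Usage:
    shuffle_big SORTED.TXT > SHUFFLED.TXT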
