bash - shuffle a file that is too large to fit in memory

梦毁少年i 2021-01-01 20:29

I've got a file that's too large to fit in memory. shuf seems to run in RAM, and sort -R doesn't shuffle (identical lines end up next to each other).

6 Answers
  •  灰色年华
    2021-01-01 21:02

    Using a form of the decorate-sort-undecorate pattern with awk, you can do something like:

    $ seq 10 | awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' | sort -n | cut -c8-
    8
    5
    1
    9
    6
    3
    7
    2
    10
    4
    

    For a file, you would do:

    $ awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' SORTED.TXT | sort -n | cut -c8- > SHUFFLED.TXT
    

    or cat the file at the start of the pipeline.
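
    As a quick sanity check on the result: the shuffled file should contain exactly the same lines as the input, only reordered. Assuming SORTED.TXT really is sorted (and under the same locale that sort will use now), re-sorting the output should reproduce it byte for byte:

    $ wc -l SORTED.TXT SHUFFLED.TXT
    $ sort SHUFFLED.TXT | cmp - SORTED.TXT && echo "same lines, different order"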

    The pipeline works by generating a column of random numbers between 000000 and 999999 inclusive (decorate), sorting on that column (sort), and then deleting the column (undecorate). Because the column is generated with leading zeros, it also sorts correctly lexicographically, so this works even on platforms where sort does not understand numeric keys.
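
    The memory-heavy step here is sort, and GNU coreutils sort (like most implementations) performs an external merge sort: when its buffer fills, it spills sorted runs to temporary files and merges them, so the whole pipeline runs in bounded RAM even for huge inputs. A sketch with the GNU-specific buffer and scratch-directory options spelled out; BIG.TXT and the sizes are just placeholders:

    # Same decorate-sort-undecorate pipeline with sort's memory use pinned down:
    # -S 512M caps the in-memory buffer, -T points the spill files at a
    # filesystem with enough free space for roughly the size of the input.
    awk 'BEGIN{srand();} {printf "%06d %s\n", rand()*1000000, $0;}' BIG.TXT \
      | sort -n -S 512M -T /var/tmp \
      | cut -c8- > SHUFFLED.TXT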

    You can increase that randomization, if desired, in several ways:

    1. If your platform's sort understands numeric keys (POSIX, GNU and BSD implementations do), you can use awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' FILE.TXT | sort -n | cut -f 2- to decorate each line with a near-double-precision random key (wrapped into a reusable function in the sketch after this list).

    2. If you are limited to a lexicographic sort, combine two calls to rand into one column: awk 'BEGIN{srand();} {printf "%06d%06d\t%s\n", rand()*1000000, rand()*1000000, $0;}' FILE.TXT | sort | cut -f 2-, which gives a composite 12 digits of randomization.
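
    For repeated use, the variant from item 1 can be wrapped in a small shell function; shuffle_big is an illustrative name, not something from the answer above:

    # Shuffle a file of arbitrary size: decorate each line with a random key,
    # let sort's external merge sort order the keys on disk, then strip them.
    shuffle_big() {
        awk 'BEGIN{srand();} {printf "%0.15f\t%s\n", rand(), $0;}' "$1" \
            | sort -n \
            | cut -f 2-
    }

    # Usage:
    shuffle_big SORTED.TXT > SHUFFLED.TXT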
