Shuffle a large list of items without loading them into memory

礼貌的吻别 2021-01-07 20:48

I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can't hold all the data in memory.

6 answers
  •  Happy的楠姐
    2021-01-07 21:06

    I think the simplest approach in your case is a recursive split-shuffle-merge. You define two numbers: N, the number of files to split each file into (typically between 32 and 256), and M, the size below which you can shuffle a file directly in memory (typically about 128 MB). In pseudo code:

    def big_shuffle(file):
        if size_of(file) < M:
            memory_shuffle(file)
        else:
            create N temporary files
            for line in file:
                write line to one of the N files, chosen uniformly at random
            for sub_file in the N files:
                big_shuffle(sub_file)
            concatenate the N shuffled files
    

    As each line lands in a uniformly random sub-file and each sub-file is itself shuffled, the result should have no bias.

    It will be much slower than Alex Reynolds' solution (because of all the disk I/O), but your only limit will be disk space.
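
    The pseudo code above can be turned into a runnable Python sketch. The function names, default bucket count, and memory threshold below are illustrative choices, not part of the original answer:

    ```python
    import os
    import random
    import shutil
    import tempfile

    def big_shuffle(path, out_path, mem_limit=128 * 1024 * 1024, n_buckets=64):
        """Shuffle the lines of `path` into `out_path` using bounded memory."""
        # Base case: the file fits under the memory threshold, shuffle it in RAM.
        if os.path.getsize(path) < mem_limit:
            with open(path) as f:
                lines = f.readlines()
            random.shuffle(lines)
            with open(out_path, "w") as out:
                out.writelines(lines)
            return
        # Recursive case: scatter lines uniformly at random into N bucket files.
        with tempfile.TemporaryDirectory() as tmp:
            names = [os.path.join(tmp, "bucket%d" % i) for i in range(n_buckets)]
            buckets = [open(name, "w") for name in names]
            with open(path) as f:
                for line in f:
                    random.choice(buckets).write(line)
            for b in buckets:
                b.close()
            # Shuffle each bucket (recursing if still too big), then concatenate.
            with open(out_path, "w") as out:
                for name in names:
                    shuffled = name + ".shuf"
                    big_shuffle(name, shuffled, mem_limit, n_buckets)
                    with open(shuffled) as f:
                        shutil.copyfileobj(f, out)
    ```

    Lowering `mem_limit` (e.g. to a few KB) on a small test file is an easy way to exercise the recursive path before running it on the real 200 GB input.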
