Shuffle a large list of items without loading them into memory

礼貌的吻别 2021-01-07 20:48

I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can't hold all the data in memory.

6 answers
  •  Happy的楠姐
    2021-01-07 21:06

    I think the simplest approach in your case is a recursive split-shuffle-merge. You define two numbers: N, the number of files to split each file into (typically between 32 and 256), and M, the size below which you can shuffle a file directly in memory (typically about 128 MB). In pseudo code:

    def big_shuffle(file):
        if size_of(file) < M:
            memory_shuffle(file)
        else:
            create N temporary files
            for line in file:
                write line to one of the N files, chosen uniformly at random
            for sub_file in the N files:
                big_shuffle(sub_file)
            concatenate the N shuffled files
    

    As each line lands in a uniformly random sub-file and each sub-file is itself shuffled, the result should have no bias.

    It will be much slower than Alex Reynolds' solution (because of all the disk I/O), but your only limit will be disk space.
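
    The pseudo code above can be turned into a runnable Python sketch. The function names, default bucket count, and memory threshold below are illustrative choices, not part of the original answer:

    ```python
    import os
    import random
    import shutil
    import tempfile

    def big_shuffle(path, out_path, mem_limit=128 * 1024 * 1024, n_buckets=64):
        """Shuffle the lines of `path` into `out_path` using bounded memory."""
        # Base case: the file fits under the memory threshold, shuffle it in RAM.
        if os.path.getsize(path) < mem_limit:
            with open(path) as f:
                lines = f.readlines()
            random.shuffle(lines)
            with open(out_path, "w") as out:
                out.writelines(lines)
            return
        # Recursive case: scatter lines uniformly at random into N bucket files.
        with tempfile.TemporaryDirectory() as tmp:
            names = [os.path.join(tmp, "bucket%d" % i) for i in range(n_buckets)]
            buckets = [open(name, "w") for name in names]
            with open(path) as f:
                for line in f:
                    random.choice(buckets).write(line)
            for b in buckets:
                b.close()
            # Shuffle each bucket (recursing if still too big), then concatenate.
            with open(out_path, "w") as out:
                for name in names:
                    shuffled = name + ".shuf"
                    big_shuffle(name, shuffled, mem_limit, n_buckets)
                    with open(shuffled) as f:
                        shutil.copyfileobj(f, out)
    ```

    Lowering `mem_limit` (e.g. to a few KB) on a small test file is an easy way to exercise the recursive path before running it on the real 200 GB input.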
