I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can't hold all the data in memory.
I think the simplest approach in your case is a recursive split / shuffle / merge.
You define two numbers: N, the number of files you split each file into
(typically between 32 and 256), and M, the size at which you can shuffle
directly in memory (typically about 128 MB). Then, in pseudo code:
def big_shuffle(file):
    if size_of(file) < M:
        memory_shuffle(file)
    else:
        create N sub-files
        for line in file:
            write line to one of the N sub-files, chosen at random
        for sub_file in the N sub-files:
            big_shuffle(sub_file)   # recurse on the sub-file, not the original
        merge the N sub-files, taking one line from each in turn
Since each of the sub-files ends up shuffled, you should have no bias.
It will be far slower than Alex Reynolds' solution (because of a lot of disk I/O), but your only limit will be disk space.
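For reference, here is a minimal runnable sketch of that pseudo code in Python. The constants N and M, the ".partX" naming scheme, and the file name "huge_file.txt" are just placeholders for illustration; it also assumes every line is newline-terminated.

import os
import random

N = 64                    # number of sub-files per split level
M = 128 * 1024 * 1024     # ~128 MB: size that fits comfortably in memory

def big_shuffle(path):
    """Shuffle the lines of `path` in place, using temporary sub-files."""
    if os.path.getsize(path) < M:
        # Small enough: shuffle directly in memory.
        with open(path) as f:
            lines = f.readlines()
        random.shuffle(lines)
        with open(path, "w") as f:
            f.writelines(lines)
        return

    # Split: send each line to one of N temporary sub-files at random.
    sub_paths = ["{}.part{}".format(path, i) for i in range(N)]
    subs = [open(p, "w") for p in sub_paths]
    with open(path) as f:
        for line in f:
            random.choice(subs).write(line)
    for s in subs:
        s.close()

    # Recurse: shuffle each sub-file (in memory once it is small enough).
    for p in sub_paths:
        big_shuffle(p)

    # Merge: take one line from each sub-file in turn until all are exhausted.
    readers = [open(p) for p in sub_paths]
    with open(path, "w") as out:
        while readers:
            remaining = []
            for r in readers:
                line = r.readline()
                if line:
                    out.write(line)
                    remaining.append(r)
                else:
                    r.close()
            readers = remaining
    for p in sub_paths:
        os.remove(p)

big_shuffle("huge_file.txt")

Note that because every line lands in a uniformly random sub-file and each sub-file is uniformly shuffled, the final order is unbiased whether you merge round-robin as above or simply concatenate the shuffled sub-files.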