I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same lines of text, but shuffled randomly by line. I can't hold all the data in memory.
How about:
import mmap
from random import shuffle

def find_lines(data):
    # Yield the offset of every newline byte in the mapped file.
    # mmap.find is much faster than iterating byte by byte, and in
    # Python 3 indexing an mmap yields ints, not one-character strings.
    end = data.find(b'\n')
    while end != -1:
        yield end
        end = data.find(b'\n', end + 1)

def shuffle_file(in_file, out_file):
    # mmap needs the file opened in binary mode.
    with open(in_file, 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        start = 0
        lines = []
        for end in find_lines(data):
            lines.append((start, end))
            start = end + 1
        if start < len(data):
            # Keep a final line that has no trailing newline.
            lines.append((start, len(data) - 1))
        shuffle(lines)
        with open(out_file, 'wb') as out:
            for start, end in lines:
                out.write(data[start:end + 1])
                if data[end] != ord('\n'):
                    # Re-terminate the (formerly last) unterminated line.
                    out.write(b'\n')

if __name__ == "__main__":
    shuffle_file('data', 'result')
This solution only ever stores the line offsets, never the lines themselves; conceptually that's two words per line. Be aware, though, that in CPython a list of 2 billion (start, end) tuples costs far more than two machine words per entry: each tuple plus its two int objects is on the order of 100+ bytes, so the index alone will need a lot of RAM.
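If that tuple overhead is a problem, here's a flatter variant, just a sketch I haven't tested at this scale: keep only the end offsets in a single array.array of 64-bit ints (~8 bytes per line) and shuffle an equally flat array of line indices. shuffle_file_compact is a hypothetical name, not from the code above:

import mmap
from array import array
from random import shuffle

def shuffle_file_compact(in_file, out_file):
    with open(in_file, 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        # One signed 64-bit entry per line (~8 bytes) instead of a
        # tuple object plus two int objects (100+ bytes) per line.
        ends = array('q')
        pos = data.find(b'\n')
        while pos != -1:
            ends.append(pos)
            pos = data.find(b'\n', pos + 1)
        # Shuffle line indices; random.shuffle accepts any mutable sequence.
        order = array('q', range(len(ends)))
        shuffle(order)
        with open(out_file, 'wb') as out:
            for i in order:
                # ends stays sorted, so line i starts right after end i-1.
                start = ends[i - 1] + 1 if i else 0
                out.write(data[start:ends[i] + 1])

Because ends is left in sorted order and only the index array is shuffled, the start offsets never need to be stored at all. (This sketch assumes the input ends with a trailing newline.)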