shuffle a large list of items without loading in memory

后端未结

关注

 6  1896

礼貌的吻别 2021-01-07 20:48

I have a file with ~2 billion lines of text (~200gigs). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can\'t hold all the dat

6条回答

清歌不尽 (楼主)

2021-01-07 21:11

How about:

import mmap
from random import shuffle

def find_lines(data):
    for i, char in enumerate(data):
        if char == '\n':
            yield i 

def shuffle_file(in_file, out_file):
    with open(in_file) as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        start = 0
        lines = []
        for end in find_lines(data):
            lines.append((start, end))
            start = end + 1
        shuffle(lines)

        with open(out_file, 'w') as out:
            for start, end in lines:
                out.write(data[start:end+1])

if __name__ == "__main__":
    shuffle_file('data', 'result')

This solution should only ever store all the file offsets of the lines in the file, that's 2 words per line, plus container overhead.

0 讨论(0)

查看其它6个回答