Shuffle a large list of items without loading it into memory

礼貌的吻别 2021-01-07 20:48

I have a file with ~2 billion lines of text (~200 GB). I want to produce a new file containing the same text lines, but shuffled randomly by line. I can't hold all the dat

6 answers
  •  清歌不尽
    2021-01-07 21:11

    How about:

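    import mmap
    from random import shuffle

    def find_lines(data):
        # Yield the offset of each newline. mmap.find() scans in C, which is
        # much faster than iterating over the map one byte at a time.
        pos = data.find(b'\n')
        while pos != -1:
            yield pos
            pos = data.find(b'\n', pos + 1)

    def shuffle_file(in_file, out_file):
        # Binary mode on both sides: mmap slices are bytes, not str.
        with open(in_file, 'rb') as f:
            data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            start = 0
            lines = []
            for end in find_lines(data):
                lines.append((start, end))
                start = end + 1
            if start < len(data):
                # Final line without a trailing newline.
                lines.append((start, len(data)))
            shuffle(lines)

            with open(out_file, 'wb') as out:
                for start, end in lines:
                    # end is the newline offset (or EOF), so re-add the newline.
                    out.write(data[start:end] + b'\n')

    if __name__ == "__main__":
        shuffle_file('data', 'result')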
    

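    This solution only ever stores the file offsets of the lines, never the lines themselves. In principle that's 2 words per line, plus container overhead. In CPython the overhead dominates, though: each (start, end) tuple of ints takes on the order of 100 bytes, so for ~2 billion lines the offsets are better kept in a compact array than in a list of tuples.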
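
    To keep the offset bookkeeping small, here is a minimal sketch (not part of the original answer) that stores only the start offset of each line in an `array('Q')`, at 8 bytes per line, which is roughly 16 GB for 2 billion lines. The end of each line is found again with `mmap.find()` at write time. The function name `shuffle_file_compact` and the `seed` parameter are made up for illustration, and the sketch assumes a non-empty input file, since `mmap` refuses to map an empty one.

```python
import mmap
import random
from array import array

def shuffle_file_compact(in_file, out_file, seed=None):
    """Shuffle lines of in_file into out_file, keeping only start offsets in memory."""
    rng = random.Random(seed)  # seed is optional; pass one for a repeatable order
    with open(in_file, 'rb') as f:
        # Assumes a non-empty file: mmap raises ValueError on an empty one.
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        size = len(data)

        # One unsigned 64-bit start offset per line (8 bytes each).
        starts = array('Q')
        pos = 0
        while pos < size:
            starts.append(pos)
            nl = data.find(b'\n', pos)
            if nl == -1:
                break  # last line has no trailing newline
            pos = nl + 1

        # random.shuffle works in place on any mutable sequence, including array.
        rng.shuffle(starts)

        with open(out_file, 'wb') as out:
            for start in starts:
                # Recompute the line end instead of storing it.
                nl = data.find(b'\n', start)
                end = size if nl == -1 else nl
                out.write(data[start:end])
                out.write(b'\n')
        data.close()
```

    The trade-off is a second newline search per line while writing, in exchange for halving the number of stored offsets and avoiding per-object overhead.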
