Python random N lines from large file (no duplicate lines)

[愿得一人] 2020-12-11 23:42

I need to use Python to take N lines from a large txt file. These files are basically tab-delimited tables. My task has the following constraints:

  • The
5 answers
  •  死守一世寂寞
    2020-12-12 00:24

    There is only one way to avoid reading the whole file sequentially up to the last line you are sampling - I am surprised that none of the answers so far mentioned it:

    You have to seek to an arbitrary location inside the file and read some bytes; if you have a typical line length, as you said, 3 or 4 times that value should do it. Then split the chunk you read on the newline character ("\n") and pick the second field. The first field is usually the tail of a line that started before the seek position, so the second field is the first complete line, and it sits at a random position in the file.
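
    To see why the second field is the one to keep, here is a small illustration on an in-memory chunk. The sample bytes are made up for the example; they stand in for whatever `file.read()` might return after a random seek.

```python
# A hypothetical chunk, as it might come back from file.read() after a random seek.
chunk = b"lf of row 1\nrow 2\tx\ty\nrow 3\tz\tw\nrow 4\tpar"

fields = chunk.split(b"\n")

partial_head = fields[0]    # tail of the line the seek landed in
first_complete = fields[1]  # first line that starts inside the chunk
partial_tail = fields[-1]   # possibly cut off by the end of the chunk

# fields[1] is only guaranteed to be complete when the chunk holds
# at least two newlines (one before it and one after it).
has_complete_line = chunk.count(b"\n") >= 2

print(first_complete, has_complete_line)
```

    If the chunk holds fewer than two newlines, the second field is either missing or truncated, which is the case the chunk size of "3 or 4 times the line length" is meant to make rare.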

    Also, to be able to seek consistently to any byte offset, the file should be opened in binary read mode ("rb"). In that mode Python does no end-of-line translation, so the conversion of end-of-line markers has to be taken care of manually.
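
    A minimal sketch of what handling the line endings by hand can look like, assuming the file may use either "\n" or "\r\n". The helper name `clean_line` and the temporary file are only there for illustration.

```python
import os
import tempfile


def clean_line(raw):
    """Drop a trailing carriage return left over from a CRLF line ending."""
    return raw.rstrip(b"\r")


# Write a small file with Windows-style line endings to experiment on.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"a\t1\r\nb\t2\r\nc\t3\r\n")
    path = tmp.name

with open(path, "rb") as f:  # binary mode: bytes in, no newline translation
    f.seek(2)                # any byte offset is a valid seek target here
    chunk = f.read(100)

os.remove(path)

# Split on the bare newline byte, then strip the "\r" ourselves.
lines = [clean_line(part) for part in chunk.split(b"\n")]
print(lines)
```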

    This technique can't tell you the line number of the line that was read, so you keep the byte offset of each selected line instead and use it to avoid repetition:

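    #!/usr/bin/env python3
    
    import os
    import random
    
    
    CHUNK_SIZE = 1000  # should be a few times the typical line length
    PATH = "/var/log/cron"
    
    def pick_next_random_line(file, offset):
        """Return (line_offset, line) for the first line starting after offset,
        or None if the chunk read there holds no complete line."""
        file.seek(offset)
        chunk = file.read(CHUNK_SIZE)
        lines = chunk.split(b"\n")
        # lines[0] is usually a partial line; lines[1] is complete only if the
        # chunk contains at least two newlines, i.e. at least three fields.
        if len(lines) < 3:
            return None
        line_offset = offset + len(lines[0]) + 1
        return line_offset, lines[1].rstrip(b"\r")
    
    def get_n_random_lines(path, n=5):
        length = os.stat(path).st_size
        results = []
        result_offsets = set()
        with open(path, "rb") as input_file:
            # Note: this loops forever if fewer than n distinct lines can be reached.
            while len(results) < n:
                offset = random.randint(0, max(0, length - CHUNK_SIZE))
                picked = pick_next_random_line(input_file, offset)
                if picked is None:
                    continue
                line_offset, line = picked
                if line_offset not in result_offsets:
                    result_offsets.add(line_offset)
                    # Assumes UTF-8 text; undecodable bytes are replaced.
                    results.append(line.decode("utf-8", errors="replace"))
        return results
    
    if __name__ == "__main__":
        print(get_n_random_lines(PATH))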
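
    As a quick check of the offset bookkeeping, the sketch below runs the same seek-and-split idea against an in-memory file (`io.BytesIO`) instead of a real path. The sample data, the tiny chunk size and the helper name `line_after` are assumptions made for the demonstration. Different seek positions inside the same line resolve to the same line offset, which is why a set of offsets is enough to reject repeats.

```python
import io

CHUNK_SIZE = 32  # deliberately small, because the sample lines are short

data = io.BytesIO(b"alpha\t1\nbravo\t2\ncharlie\t3\ndelta\t4\necho\t5\n")


def line_after(f, offset):
    """Return (line_offset, line) for the first line starting after `offset`,
    or None when the chunk does not contain a complete line."""
    f.seek(offset)
    chunk = f.read(CHUNK_SIZE)
    first_nl = chunk.find(b"\n")
    second_nl = chunk.find(b"\n", first_nl + 1)
    if first_nl == -1 or second_nl == -1:
        return None
    return offset + first_nl + 1, chunk[first_nl + 1:second_nl]


# Offsets 0 and 3 both fall inside the "alpha" line, so both resolve
# to the line that follows it; offset 10 falls inside the "bravo" line.
a = line_after(data, 0)
b = line_after(data, 3)
c = line_after(data, 10)

seen = {a[0], b[0], c[0]}  # only two distinct line offsets
print(a, b, c, len(seen))
```

    One consequence worth keeping in mind: a line is picked whenever the seek lands in the line before it, so lines that follow long lines are more likely to be chosen, and the very first line of the file can never be chosen this way. The sample is random but not perfectly uniform.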
    
