Process very large (>20GB) text file line by line

慢半拍i 2020-11-29 17:54

I have a number of very large text files which I need to process, the largest being about 60GB.

Each line has 54 characters in seven fields and I want to remove the

11 Answers
  •  渐次进展
    2020-11-29 18:16

    Here's code for loading text files of any size without running into memory issues; it supports files that are gigabytes in size. It will run smoothly on any kind of machine; you just need to set CHUNK_SIZE based on your system's RAM. The larger CHUNK_SIZE is, the more data is read at a time.

    https://gist.github.com/iyvinjose/e6c1cb2821abd5f01fd1b9065cbc759d

    Download the file data_loading_utils.py and import it into your code.

    Usage:

    import data_loading_utils

    file_name = 'file_name.ext'
    CHUNK_SIZE = 1000000  # characters read per chunk; tune to your RAM


    def process_lines(line, eof, file_name):
        if not eof:
            # process data; `line` is one single line of the file
            pass
        else:
            # end of file reached
            pass


    data_loading_utils.read_lines_from_file_as_data_chunks(file_name, chunk_size=CHUNK_SIZE, callback=process_lines)
    

    The process_lines method is the callback function. It is called for every line, with the parameter line holding one single line of the file at a time.

    You can tune the CHUNK_SIZE variable to match your machine's hardware.
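
    For reference, here is a minimal sketch of what a chunked line reader like the one in the gist might look like. This is not the gist's actual code (that lives at the URL above); it is a hypothetical version that only illustrates the idea: read fixed-size chunks, carry the trailing partial line over to the next chunk, and hand complete lines to the callback.

    def read_lines_from_file_as_data_chunks(file_name, chunk_size, callback):
        # Read file_name in chunks of roughly chunk_size characters and
        # invoke callback(line, eof, file_name) once per complete line.
        leftover = ''
        with open(file_name, 'r') as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    # flush any trailing partial line, then signal EOF
                    if leftover:
                        callback(leftover, False, file_name)
                    callback('', True, file_name)
                    break
                lines = (leftover + chunk).split('\n')
                # the last element may be an incomplete line; keep it
                # until the next chunk arrives
                leftover = lines.pop()
                for line in lines:
                    callback(line, False, file_name)

    Because only one chunk (plus at most one partial line) is held in memory at a time, memory use stays bounded no matter how large the file is.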
