Splitting large text file into smaller text files by line numbers using Python

野的像风 2020-11-28 08:50

I have a text file, say really_big_file.txt, that contains:

line 1
line 2
line 3
line 4
...
line 99999
line 100000

I would like to write a Py

7 answers
  •  小蘑菇 (original poster)
    2020-11-28 09:16

    Using itertools grouper recipe:

    from itertools import zip_longest
    
    def grouper(n, iterable, fillvalue=None):
        """Collect data into fixed-length chunks or blocks."""
        # grouper(3, 'ABCDEFG', 'x') --> ABC DEF Gxx
        # n references to the same iterator, so each tuple takes n consecutive items
        args = [iter(iterable)] * n
        return zip_longest(*args, fillvalue=fillvalue)
    
    n = 300  # lines per output file
    
    with open('really_big_file.txt') as f:
        # fillvalue='' pads the last group with empty strings, which write nothing
        for i, g in enumerate(grouper(n, f, fillvalue=''), 1):
            with open('small_file_{0}'.format(i * n), 'w') as fout:
                fout.writelines(g)
    

    The advantage of this method, as opposed to reading every line into a list first, is that it works on the file as an iterator: the whole input is never loaded into memory, and only one group of n lines is held at a time.
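    To make the grouping concrete, here is a minimal sketch that runs the same recipe on a small in-memory stand-in for the file. The 7-line list and the group size of 3 are assumptions chosen for illustration; the names are built with the same i * n scheme as the code above.

```python
from itertools import zip_longest


def grouper(n, iterable, fillvalue=None):
    """Collect items into fixed-length groups, padding the last one."""
    # n references to ONE iterator: each tuple pulls n consecutive items.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


# Hypothetical stand-in for the lines of a file: 7 lines, groups of 3.
lines = ["line {0}\n".format(k) for k in range(1, 8)]

# list() is used only so the groups can be inspected here;
# the real loop consumes one group at a time.
groups = list(grouper(3, lines, fillvalue=''))

# Same naming scheme as the answer: group index times group size.
names = ['small_file_{0}'.format(i * 3) for i, _ in enumerate(groups, 1)]

for name, group in zip(names, groups):
    print(name, repr(''.join(group)))
```

    The last group is padded with two empty strings, so its file is named small_file_9 even though it only contains line 7.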

    Note that with the 100000-line file and n = 300, the last file will be named small_file_100200 even though it only goes up to line 100000. This happens because the group size does not divide the line count evenly: the final group is padded with fillvalue='', so nothing is written for the missing lines, but the name is still computed from the full group size. You can fix this by writing to a temp file and renaming it afterwards, instead of choosing the name first as I did above. Here's how that can be done.

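    import os, tempfile
    
    with open('really_big_file.txt') as f:
        for i, g in enumerate(grouper(n, f, fillvalue=None)):
            # dir='.' keeps the temp file on the same filesystem as the target,
            # so os.rename cannot fail with a cross-device error
            with tempfile.NamedTemporaryFile('w', dir='.', delete=False) as fout:
                for j, line in enumerate(g, 1): # count number of lines in group
                    if line is None:
                        j -= 1 # don't count this line
                        break
                    fout.write(line)
            # i * n lines came before this group; j lines were written in it
            os.rename(fout.name, 'small_file_{0}.txt'.format(i * n + j))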
    

    This time fillvalue=None, and I go through each line checking for None. When it occurs, I know the input has run out, so I subtract 1 from j to avoid counting the filler, stop writing, and rename the temp file using the number of the last line actually written.
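    As a point of comparison, and not part of the approach above, the same naming can be sketched with itertools.islice, which returns a short final chunk and so needs no fill value or temp file. The helper name split_by_lines, its out_dir parameter, and the 7-line demo file are assumptions for illustration.

```python
import os
import tempfile
from itertools import islice


def split_by_lines(src_path, n, out_dir='.'):
    """Write consecutive groups of n lines to separate files.

    Hypothetical helper: each file is named after the number of the
    last line it contains. Returns the list of paths written.
    """
    written = []
    last_line = 0
    with open(src_path) as f:
        while True:
            chunk = list(islice(f, n))  # at most n lines; shorter at end of file
            if not chunk:
                break
            last_line += len(chunk)
            out_path = os.path.join(
                out_dir, 'small_file_{0}.txt'.format(last_line))
            with open(out_path, 'w') as fout:
                fout.writelines(chunk)
            written.append(out_path)
    return written


# Demo on a throwaway 7-line file, split into groups of 3.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'really_big_file.txt')
    with open(src, 'w') as f:
        f.writelines('line {0}\n'.format(k) for k in range(1, 8))
    names = [os.path.basename(p) for p in split_by_lines(src, 3, out_dir=tmp)]

print(names)
```

    The trade-off is that each chunk is materialized as a list of n lines, which is the same memory footprint as one tuple from grouper.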
