How can I split a file in Python?

失恋的感觉 2020-12-03 07:29

Is it possible to split a file? For example, I have a huge wordlist and want to split it into several smaller files. How can I do this?

9 answers
  • 2020-12-03 08:17

    All the provided answers are good and (probably) work. However, they need to load the file into memory, in whole or in part. Python is not very efficient at this kind of task, or at least not as efficient as OS-level commands.

    I found the following to be the most efficient way to do it:

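    import glob
    import subprocess

    MAX_NUM_LINES = 1000
    FILE_NAME = "input_file.txt"
    SPLIT_PARAM = "-d"  # numeric suffixes (00, 01, ...); a GNU split option
    PREFIX = "__"

    # Passing the arguments as a list avoids shell quoting problems.
    result = subprocess.run(
        ["split", "-l", str(MAX_NUM_LINES), SPLIT_PARAM, FILE_NAME, PREFIX]
    )
    if result.returncode == 0:
        print("Done:")
        # List the generated parts (prefix plus a two-character suffix).
        print("\n".join(sorted(glob.glob(f"{PREFIX}??"))))
    else:
        print("Failed!")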
    

    Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
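
For comparison, here is a minimal pure-Python sketch that also splits by line count without holding the whole file in memory. The function name `split_by_lines`, its `lines_per_part` parameter, and the `<name>.000` naming scheme are assumptions made for illustration, not part of any answer above.

```python
from itertools import islice
from pathlib import Path


def split_by_lines(path, lines_per_part=1000):
    """Split `path` into numbered parts of at most `lines_per_part` lines.

    Parts are written next to the input as <name>.000, <name>.001, ...
    and returned as a list of Path objects.
    """
    if lines_per_part < 1:
        raise ValueError("lines_per_part must be at least 1")
    src = Path(path)
    parts = []
    # Binary mode copies each line byte for byte, whatever the encoding.
    with src.open("rb") as f:
        # The file object is its own iterator, so the outer loop and islice
        # both advance the same position in the file.
        for first_line in f:
            part = src.with_name(f"{src.name}.{len(parts):03d}")
            with part.open("wb") as out:
                out.write(first_line)
                # Stream the rest of this part one line at a time.
                out.writelines(islice(f, lines_per_part - 1))
            parts.append(part)
    return parts
```

Calling `split_by_lines("input_file.txt", 1000)` would write `input_file.txt.000`, `input_file.txt.001`, and so on. Only one line is held at a time, so memory use should not grow with the size of the input.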

  • 2020-12-03 08:20

    Solution to split binary files into chapters .000, .001, etc.:

    FILE = 'scons-conversion.7z'

    MAX = 500 * 1024 * 1024  # 500 MiB: maximum chapter size
    BUF = 50 * 1024 * 1024   # 50 MiB: read buffer size

    chapters = 0
    with open(FILE, 'rb') as src:
        while True:
            chunk = src.read(min(BUF, MAX))
            if not chunk:
                break  # end of input: no further chapter is needed
            with open(FILE + '.%03d' % chapters, 'wb') as tgt:
                written = 0
                while chunk:
                    tgt.write(chunk)
                    written += len(chunk)
                    # Once the chapter is full this requests 0 bytes, which
                    # returns b'' and ends the inner loop.
                    chunk = src.read(min(BUF, MAX - written))
            chapters += 1
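
To check a split like this one, the chapters can be concatenated back into a single file. The sketch below assumes the chapters sit next to each other and use exactly three-digit suffixes (`.000` to `.999`); `join_chapters` and its parameters are hypothetical names, not part of the answer above.

```python
import glob
import shutil
from pathlib import Path


def join_chapters(base, target):
    """Concatenate <base>.000, <base>.001, ... into `target`.

    Returns the number of chapters that were joined.
    """
    base = Path(base)
    # Escape the file name so characters such as "[" are matched literally.
    pattern = glob.escape(base.name) + ".[0-9][0-9][0-9]"
    # Zero-padded suffixes sort correctly in plain lexicographic order.
    chapters = sorted(base.parent.glob(pattern))
    with open(target, "wb") as out:
        for chapter in chapters:
            with chapter.open("rb") as src:
                # copyfileobj copies in fixed-size blocks, not all at once.
                shutil.copyfileobj(src, out)
    return len(chapters)
```

For example, `join_chapters("scons-conversion.7z", "restored.7z")` would rebuild the archive from its chapters, and the result can then be compared with the original.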
    
  • 2020-12-03 08:25
    import re

    PATENTS = 'patent.data'

    def split_file(filename):
        """Start a new output file at every line that begins with '<?xml'."""
        n = 0       # index of the current output file
        out = None  # currently open output file, if any
        try:
            with open(filename, "r") as src:
                for i, line in enumerate(src):
                    # A match on the first line keeps index 0, so numbering
                    # starts at 0 when the file begins with the template.
                    # Text before the first match, if any, goes to file 0.
                    if i > 0 and re.match(r'<\?xml', line):
                        n += 1
                        out.close()
                        out = None
                    if out is None:
                        # "w" overwrites parts left over from an earlier run
                        out = open("{}-{}".format(filename, n), "w")
                    out.write(line)
        finally:
            if out is not None:
                out.close()

    split_file(PATENTS)
    

    As a result you will get:

    patent.data-0

    patent.data-1

    patent.data-N
