How can I split a file in Python?

前端未结


Is it possible to split a file? For example, I have a huge wordlist and want to split it into several smaller files. How can this be done?

9 Answers

隐瞒了意图╮

2020-12-03 08:17
All the provided answers are good and (probably) work. However, they need to load the file into memory, either as a whole or in part. Python is not very efficient at this kind of task, or at least not as efficient as OS-level commands.

I found the following to be the most efficient way to do it, assuming the split command is available (Linux/Unix):
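```
import glob
import subprocess

MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"  # numeric suffixes (00, 01, ...) instead of aa, ab, ...
PREFIX = "__"

# Passing the arguments as a list avoids shell quoting problems with file names.
result = subprocess.run(["split", "-l", str(MAX_NUM_LINES), SPLIT_PARAM, FILE_NAME, PREFIX])
if result.returncode == 0:
    print("Done:")
    # List the pieces that split created (two-character suffix by default).
    print("\n".join(sorted(glob.glob(f"{PREFIX}??"))))
else:
    print("Failed!")
```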
Read more about split here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
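
For comparison, here is a minimal pure-Python sketch of the same line-based split for systems where split is not installed. It streams the input and holds at most max_lines lines in memory at a time. The function name split_by_lines, its parameters, and the default prefix are hypothetical and only serve the illustration.

```python
from itertools import islice


def split_by_lines(path, max_lines=1000, prefix="part_"):
    """Write consecutive chunks of at most max_lines lines; return the new file names."""
    names = []
    # Binary mode copies each line byte-for-byte, whatever the text encoding is.
    with open(path, "rb") as src:
        while True:
            # islice pulls at most max_lines lines from the open file.
            chunk = list(islice(src, max_lines))
            if not chunk:
                break
            name = f"{prefix}{len(names):02d}"
            with open(name, "wb") as dst:
                dst.writelines(chunk)
            names.append(name)
    return names
```

Calling split_by_lines("input_file.txt", 1000, "__") would produce files named like __00, __01, and so on, mirroring the split call above.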

逝去的感伤

2020-12-03 08:20

Solution to split binary files into chapters .000, .001, etc.:

FILE = 'scons-conversion.7z'

MAX = 500 * 1024 * 1024  # 500 MiB: maximum chapter size
BUF = 50 * 1024 * 1024   # 50 MiB: read buffer size

chapters = 0  # number of chapter files written so far
with open(FILE, 'rb') as src:
    # Read ahead before opening a target so that no empty chapter is created.
    data = src.read(min(BUF, MAX))
    while data:
        with open(FILE + '.%03d' % chapters, 'wb') as tgt:
            written = 0
            while data and written < MAX:
                tgt.write(data)
                written += len(data)
                # Fill the rest of this chapter; once it is full,
                # read ahead for the next chapter instead.
                data = src.read(min(BUF, (MAX - written) or MAX))
        chapters += 1
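
To check the result, the chapters can be joined back into a single file and compared with the original. The following is a rough sketch, assuming the chapters follow the .000, .001, ... naming used above; join_chapters is a hypothetical helper name, not part of any library.

```python
import glob
import shutil


def join_chapters(base, target):
    """Concatenate base.000, base.001, ... into target; return the chapter count."""
    # Three-digit suffixes sort correctly as plain strings.
    parts = sorted(glob.glob(glob.escape(base) + ".[0-9][0-9][0-9]"))
    with open(target, "wb") as dst:
        for part in parts:
            with open(part, "rb") as src:
                # copyfileobj copies in blocks, so a chapter is never fully in memory.
                shutil.copyfileobj(src, dst)
    return len(parts)
```

For example, join_chapters('scons-conversion.7z', 'restored.7z') would rebuild the archive from its chapters under the hypothetical name restored.7z.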


栀梦

2020-12-03 08:25

import re
PATENTS = 'patent.data'

def split_file(filename):
    n = 0       # index of the current output file
    out = None  # currently open output file
    try:
        # Open the input file and read it line by line
        with open(filename, "r") as r:
            for i, line in enumerate(r):
                # A line starting with "<?xml" begins a new part. A match on
                # the very first line keeps index 0, so numbering starts at 0.
                if i > 0 and re.match(r'<\?xml', line):
                    n += 1
                    out.close()
                    out = None
                # Open each part once instead of reopening it for every line
                if out is None:
                    out = open("{}-{}".format(filename, n), "w")
                out.write(line)
    finally:
        if out is not None:
            out.close()

split_file(PATENTS)

As a result you will get:

patent.data-0

patent.data-1

patent.data-N

