Is it possible to split a file? For example you have huge wordlist, I want to split it so that it becomes more than one file. How is this possible?
All the provided answers are good and (probably) work However, they need to load the file into memory (as a whole or partially). We know Python is not very efficient in this kind of tasks (or at least is not as efficient as the OS level commands).
I found the following is the most efficient way to do it:
import os
MAX_NUM_LINES = 1000
FILE_NAME = "input_file.txt"
SPLIT_PARAM = "-d"
PREFIX = "__"
if os.system(f"split -l {MAX_NUM_LINES} {SPLIT_PARAM} {FILE_NAME} {PREFIX}") == 0:
print("Done:")
print(os.system(f"ls {PREFIX}??"))
else:
print("Failed!")
Read more about split
here: https://linoxide.com/linux-how-to/split-large-text-file-smaller-files-linux/
Solution to split binary files into chapters .000, .001, etc.:
FILE = 'scons-conversion.7z'
MAX = 500*1024*1024 # 500Mb - max chapter size
BUF = 50*1024*1024*1024 # 50GB - memory buffer size
chapters = 0
uglybuf = ''
with open(FILE, 'rb') as src:
while True:
tgt = open(FILE + '.%03d' % chapters, 'wb')
written = 0
while written < MAX:
if len(uglybuf) > 0:
tgt.write(uglybuf)
tgt.write(src.read(min(BUF, MAX - written)))
written += min(BUF, MAX - written)
uglybuf = src.read(1)
if len(uglybuf) == 0:
break
tgt.close()
if len(uglybuf) == 0:
break
chapters += 1
import re
PATENTS = 'patent.data'
def split_file(filename):
# Open file to read
with open(filename, "r") as r:
# Counter
n=0
# Start reading file line by line
for i, line in enumerate(r):
# If line match with teplate -- <?xml --increase counter n
if re.match(r'\<\?xml', line):
n+=1
# This "if" can be deleted, without it will start naming from 1
# or you can keep it. It depends where is "re" will find at
# first time the template. In my case it was first line
if i == 0:
n = 0
# Write lines to file
with open("{}-{}".format(PATENTS, n), "a") as f:
f.write(line)
split_file(PATENTS)
As a result you will get:
patent.data-0
patent.data-1
patent.data-N