I have several large text files that all have the same structure and I want to delete the first 3 lines and then remove illegal characters from the 4th line. I don't w
As wim said in the comments, sed is the right tool for this. The following command should do what you want:
sed -i -e '4 s/(dB)//' -e '4 s/Best Unit/Best_Unit/' -e '1,3 d' yourfile.whatever
To explain the command a little:
-i: edits the file in place, that is, it writes the output back into the input file
-e: adds a command to execute
'4 s/(dB)//': on line 4, substitute '' for '(dB)'
'4 s/Best Unit/Best_Unit/': same as above, with different find and replace strings
'1,3 d': delete lines 1 through 3 (inclusive)
sed is a really powerful tool that can do much more than this, and it is well worth learning.
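If it helps to see what the three expressions do, here is a rough Python sketch of the same steps applied to a list of lines. The function name sed_like and the sample header are made up for illustration; the real files may look different.

```python
def sed_like(lines):
    """Mimic: sed -e '4 s/(dB)//' -e '4 s/Best Unit/Best_Unit/' -e '1,3 d'."""
    for number, line in enumerate(lines, start=1):
        if number == 4:
            # Without the g flag, sed's s command replaces only the first match.
            line = line.replace('(dB)', '', 1)
            line = line.replace('Best Unit', 'Best_Unit', 1)
        if 1 <= number <= 3:
            # '1,3 d' drops these lines entirely.
            continue
        yield line


# Hypothetical sample data, not taken from the real files.
sample = [
    "comment 1\n",
    "comment 2\n",
    "comment 3\n",
    "Time Rx(dB) Best Unit\n",
    "0 -70 A\n",
]
cleaned = list(sed_like(sample))
print(''.join(cleaned), end='')
```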
You can use file.readlines() with an additional argument to read just the first few lines of the file. From the docs:
f.readlines() returns a list containing all the lines of data in the file. If given an optional parameter sizehint, it reads that many bytes from the file and enough more to complete a line, and returns the lines from that. This is often used to allow efficient reading of a large file by lines, but without having to load the entire file in memory. Only complete lines will be returned.
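A small sketch of how the size hint behaves, using an in-memory io.StringIO as a stand-in for a real file (the ten two-character lines are made up for the demonstration):

```python
import io

# Hypothetical in-memory "file": ten lines of two characters each.
f = io.StringIO("a\n" * 10)

# Lines are collected until their total length exceeds the hint (5),
# so the line that crosses the limit is still returned in full.
first = f.readlines(5)

# A later call continues from where the first one stopped.
rest = f.readlines()

print(len(first), len(rest))
```

Note that the hint counts characters, not lines, so how many lines come back depends on how long they are.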
The most robust way to manipulate generic strings is with regular expressions. In Python, this means the re module with, for example, the re.sub() function.
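For instance, on a made-up header line (the column names here are only a guess at what the real 4th line looks like), re.sub() could be used along these lines:

```python
import re

# Hypothetical header line; the real column names may differ.
header = "Time Rx(dB) Best Unit\n"

# Drop any parenthesised group, e.g. "(dB)".
header = re.sub(r"\([^)]*\)", "", header)

# Replace the whitespace after "Best" with an underscore.
header = re.sub(r"Best\s+", "Best_", header)

print(header, end="")
```

Raw strings (the r prefix) keep the backslashes in the pattern from being interpreted by Python before the re module sees them.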
My suggestion, which should be adapted to suit your needs:
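import re

with open('somefile.txt') as f:
    # Read complete lines until about 100 characters are consumed;
    # raise the hint if the first four lines are longer than that.
    head = f.readlines(100)
    rest = f.read()  # everything after the lines already read

line4 = head[3]
line4 = re.sub(r'\([^)]+\)', '', line4)  # drop parenthesised text such as "(dB)"
line4 = re.sub(r'Best\s', 'Best_', line4)  # "Best Unit" -> "Best_Unit"

with open('someotherfile.txt', 'w') as newfile:
    newfile.write(line4 + ''.join(head[4:]) + rest)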
Just try it for each file. 100 MB per file is not that big, and as you can see, the code to just make an attempt is not time-consuming to write.
with open('file.txt') as f:
    lines = f.readlines()
lines = lines[3:]  # drop the first three lines
lines[0] = lines[0].replace('Rx(db)', 'Rx')
lines[0] = lines[0].replace('Best Unit', 'Best_Unit')
with open('output.txt', 'w') as f:
    # Each line already ends with its own newline, so write them back unchanged.
    f.writelines(lines)
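If memory ever becomes a concern, a streaming variant is possible that never holds the whole file in a list. This is only a sketch: clean_file is a hypothetical helper name, and the '(dB)' and 'Best Unit' strings are assumptions about the header that should be adjusted to match the real files.

```python
import shutil
from itertools import islice


def clean_file(src_path, dst_path):
    """Drop the first three lines, fix the header, copy the rest unchanged."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        # Consume and discard the first three lines.
        for _ in islice(src, 3):
            pass
        # The next line is the header (originally line 4).
        header = src.readline()
        header = header.replace("(dB)", "").replace("Best Unit", "Best_Unit")
        dst.write(header)
        # Copy the remainder in chunks instead of loading it all at once.
        shutil.copyfileobj(src, dst)
```

To handle several files, call clean_file once per input path, for example in a loop over glob.glob('*.txt'), writing each result to a separate output name.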