There are multiple SO questions addressing some form of this topic, but they all seem terribly inefficient for removing only a single row from a CSV file (usually they involve copying the entire file).
Editing files in-place is a task riddled with gotchas (much like modifying an iterable while iterating over it) and usually not worth the trouble. In most cases, writing to a temporary file (or to working memory, depending on whether you have more storage space or RAM) and then replacing the source file with the temporary file will perform just as well as attempting to do the same thing in-place.
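For illustration, here's a minimal sketch of that temporary-file approach (the names and structure are my own, not one of the benchmarked functions further down):

import os
import tempfile

def remove_line_tempfile(path, comp):
    # create the temporary file in the same directory so the final
    # replace happens on the same filesystem (and is atomic on POSIX)
    folder = os.path.dirname(os.path.abspath(path))
    with open(path, "rb") as f_in, \
            tempfile.NamedTemporaryFile("wb", dir=folder, delete=False) as f_out:
        removed = False
        for line in f_in:
            if not removed and comp(line):
                removed = True  # skip only the first matching line
                continue
            f_out.write(line)
        temp_name = f_out.name
    os.replace(temp_name, path)  # swap the new file in place of the old one

Note that os.replace() requires Python 3.3+; on older versions you'd fall back to shutil.move() the way the benchmarked functions below do.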
But, if you insist, here's a generalized solution:
import os

def remove_line(path, comp):
    with open(path, "r+b") as f:  # open the file in rw mode
        mod_lines = 0  # hold the overwrite offset
        while True:
            last_pos = f.tell()  # keep the last line position
            line = f.readline()  # read the next line
            if not line:  # EOF
                break
            if mod_lines:  # we've already encountered what we search for
                f.seek(last_pos - mod_lines)  # move back to the beginning of the gap
                f.write(line)  # fill the gap with the current line
                f.seek(mod_lines, os.SEEK_CUR)  # move forward til the next line start
            elif comp(line):  # search for our data
                mod_lines = len(line)  # store the offset when found to create a gap
        f.seek(last_pos - mod_lines)  # seek back the extra removed characters
        f.truncate()  # truncate the rest
This will remove only the line matching the provided comparison function and then iterate over the rest of the file, shifting the data over the 'removed' line. You won't need to load the rest of the file into your working memory, either. To test it, with test.csv containing:

fname,lname,age,sex
John,Doe,28,m
Sarah,Smith,27,f
Xavier,Moore,19,m
You can run it as:
remove_line("test.csv", lambda x: x.startswith(b"Sarah"))
And you'll get test.csv with the Sarah line removed in-place:

fname,lname,age,sex
John,Doe,28,m
Xavier,Moore,19,m
Keep in mind that we're passing a bytes comparison function, as the file is opened in binary mode to keep line breaks consistent while truncating/overwriting.
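If a prefix match is too loose for your data (it would also match a hypothetical "Sarahann" row, for example), you can pass a comparison that checks an exact CSV column instead - a small sketch, with field_matches() being my own illustrative helper:

import csv
import io

def field_matches(column, value):
    def comp(line):  # decode the raw bytes line and parse it as one CSV row
        row = next(csv.reader(io.StringIO(line.decode("utf-8"))), [])
        return len(row) > column and row[column] == value
    return comp

remove_line("test.csv", field_matches(0, "Sarah"))  # exact match on the first column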
UPDATE: I was interested in the actual performance of the various techniques presented here, but I didn't have the time to test them yesterday, so with a bit of a delay I've created a benchmark that should shed some light on it. If you're interested only in the results, scroll all the way down. First I'll explain what I was benchmarking and how I set up the test. I'll also provide all the scripts, so you can run the same benchmark on your own system.
As for the what: I've tested all of the techniques mentioned in this and other answers, namely line removal using a temporary file (the temp_file_* functions) and using in-place editing (the in_place_* functions). I have both of those set up in streaming (reading line by line, the *_stream functions) and memory (reading the rest of the file into working memory, the *_wm functions) modes. I've also added an in-place line deletion technique using the mmap module (the in_place_mmap function). The benchmarked script, containing all the functions as well as a small bit of logic to be controlled through the CLI, is as follows:
#!/usr/bin/env python

import mmap
import os
import shutil
import sys
import time


def get_temporary_path(path):  # use tempfile facilities in production
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)


def temp_file_wm(path, comp):
    path_out = get_temporary_path(path)
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:  # EOF
                break
            if comp(line):  # on match, copy the rest of the file in one read
                f_out.write(f_in.read())
                break
            else:
                f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)


def temp_file_stream(path, comp):
    path_out = get_temporary_path(path)
    not_found = True  # a flag to stop comparison after the first match, for fairness
    with open(path, "rb") as f_in, open(path_out, "wb") as f_out:
        while True:
            line = f_in.readline()
            if not line:  # EOF
                break
            if not_found and comp(line):
                not_found = False  # stop comparing and skip the matched line
                continue
            f_out.write(line)
        f_out.flush()
        os.fsync(f_out.fileno())
    shutil.move(path_out, path)


def in_place_wm(path, comp):
    with open(path, "r+b") as f:
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:  # EOF
                break
            if comp(line):
                rest = f.read()  # read the rest of the file into memory
                f.seek(last_pos)
                f.write(rest)  # overwrite the matched line with the rest
                break
        f.truncate()  # cut off the now-duplicated tail
        f.flush()
        os.fsync(f.fileno())


def in_place_stream(path, comp):
    with open(path, "r+b") as f:
        mod_lines = 0  # the size of the gap to shift over
        while True:
            last_pos = f.tell()
            line = f.readline()
            if not line:  # EOF
                break
            if mod_lines:  # shift the current line back over the gap
                f.seek(last_pos - mod_lines)
                f.write(line)
                f.seek(mod_lines, os.SEEK_CUR)
            elif comp(line):
                mod_lines = len(line)  # open a gap the size of the matched line
        f.seek(last_pos - mod_lines)
        f.truncate()  # cut off the leftover tail
        f.flush()
        os.fsync(f.fileno())


def in_place_mmap(path, comp):
    with open(path, "r+b") as f:
        stream = mmap.mmap(f.fileno(), 0)
        total_size = len(stream)
        while True:
            last_pos = stream.tell()
            line = stream.readline()
            if not line:  # EOF
                break
            if comp(line):
                current_pos = stream.tell()
                # shift everything after the matched line over it
                stream.move(last_pos, current_pos, total_size - current_pos)
                total_size -= len(line)
                break
        stream.flush()
        stream.close()
        f.truncate(total_size)
        f.flush()
        os.fsync(f.fileno())


if __name__ == "__main__":
    if len(sys.argv) < 3:
        print("Usage: {} target_file.ext search_string [function_name]".format(__file__))
        sys.exit(1)
    target_file = sys.argv[1]
    search_func = globals().get(sys.argv[3] if len(sys.argv) > 3 else None, in_place_wm)
    start_time = time.time()
    search_func(target_file, lambda x: x.startswith(sys.argv[2].encode("utf-8")))
    # some info for the test runner...
    print("python_version: " + sys.version.split()[0])
    print("python_time: {:.2f}".format(time.time() - start_time))
The next step is to build a tester that will run these functions in as isolated an environment as possible, trying to obtain a fair benchmark for each of them. My test is structured as follows:

- generate the sample data - a sizable CSV matrix (1M rows x 10 columns by default) with the line to be removed placed at the beginning, in the middle, and at the end of the file
- copy the sample before each run so every function operates on pristine data
- sync and drop the system caches before each run so file-system caching doesn't skew the I/O figures
- run each function with the highest scheduling priority (chrt -f 99) through /usr/bin/time for benchmarking, since Python cannot really be trusted to accurately measure its performance in scenarios like these

Unfortunately, I didn't have a system at hand where I could run the test fully isolated, so my numbers are obtained from running it in a hypervisor. This means that the I/O performance is probably very skewed, but it should affect all the tests similarly, still providing comparable data. Either way, you're welcome to run this test on your own system to get results you can relate to.

I've set up a test script performing the aforementioned scenario as follows:
#!/usr/bin/env python

import collections
import os
import random
import shutil
import subprocess
import sys
import time

try:
    range = xrange  # cover Python 2.x
except NameError:
    pass

try:
    DEV_NULL = subprocess.DEVNULL
except AttributeError:
    DEV_NULL = open(os.devnull, "wb")  # cover Python 2.x

SAMPLE_ROWS = 10**6  # 1M lines
TEST_LOOPS = 3
CALL_SCRIPT = os.path.join(os.getcwd(), "remove_line.py")  # the above script


def get_temporary_path(path):
    folder, filename = os.path.split(path)
    return os.path.join(folder, "~$" + filename)


def generate_samples(path, data="LINE", rows=10**6, columns=10):  # 1Mx10 default matrix
    sample_beginning = os.path.join(path, "sample_beg.csv")
    sample_middle = os.path.join(path, "sample_mid.csv")
    sample_end = os.path.join(path, "sample_end.csv")
    separator = os.linesep
    middle_row = rows // 2
    with open(sample_beginning, "w") as f_b, \
            open(sample_middle, "w") as f_m, \
            open(sample_end, "w") as f_e:
        f_b.write(data)
        f_b.write(separator)
        for i in range(rows):
            if not i % middle_row:
                f_m.write(data)
                f_m.write(separator)
            for t in (f_b, f_m, f_e):
                t.write(",".join((str(random.random()) for _ in range(columns))))
                t.write(separator)
        f_e.write(data)
        f_e.write(separator)
    return ("beginning", sample_beginning), ("middle", sample_middle), ("end", sample_end)


def normalize_field(field):  # reduce `/usr/bin/time --verbose` field names to identifiers
    field = field.lower()
    while True:  # strip any parenthesized parts
        s_index = field.find('(')
        e_index = field.find(')')
        if s_index == -1 or e_index == -1:
            break
        field = field[:s_index] + field[e_index + 1:]
    return "_".join(field.split())


def encode_csv_field(field):  # minimal CSV escaping for the results file
    if isinstance(field, (int, float)):
        field = str(field)
    escape = False
    if '"' in field:
        escape = True
        field = field.replace('"', '""')
    elif "," in field or "\n" in field:
        escape = True
    if escape:
        return ('"' + field + '"').encode("utf-8")
    return field.encode("utf-8")


if __name__ == "__main__":
    print("Generating sample data...")
    start_time = time.time()
    samples = generate_samples(os.getcwd(), "REMOVE THIS LINE", SAMPLE_ROWS)
    print("Done, generation took: {:.2f} seconds.".format(time.time() - start_time))
    print("Beginning tests...")
    search_string = "REMOVE"
    header = None
    results = []
    for f in ("temp_file_stream", "temp_file_wm",
              "in_place_stream", "in_place_wm", "in_place_mmap"):
        for s, path in samples:
            for test in range(TEST_LOOPS):
                result = collections.OrderedDict((("function", f), ("sample", s),
                                                  ("test", test)))
                print("Running {function} test, {sample} #{test}...".format(**result))
                temp_sample = get_temporary_path(path)
                shutil.copy(path, temp_sample)
                print("  Clearing caches...")
                subprocess.call(["sudo", "/usr/bin/sync"], stdout=DEV_NULL)
                with open("/proc/sys/vm/drop_caches", "w") as dc:
                    dc.write("3\n")  # free pagecache, inodes, dentries...
                # you can add more cache clearing/invalidating calls here...
                print("  Removing a line starting with `{}`...".format(search_string))
                out = subprocess.check_output(["sudo", "chrt", "-f", "99",
                                               "/usr/bin/time", "--verbose",
                                               sys.executable, CALL_SCRIPT, temp_sample,
                                               search_string, f], stderr=subprocess.STDOUT)
                print("  Cleaning up...")
                os.remove(temp_sample)
                for line in out.decode("utf-8").split("\n"):
                    pair = line.strip().rsplit(": ", 1)
                    if len(pair) >= 2:
                        result[normalize_field(pair[0].strip())] = pair[1].strip()
                results.append(result)
                if not header:  # store the header for later reference
                    header = result.keys()
    print("Cleaning up sample data...")
    for s, path in samples:
        os.remove(path)
    output_file = sys.argv[1] if len(sys.argv) > 1 else "results.csv"
    output_results = os.path.join(os.getcwd(), output_file)
    print("All tests completed, writing results to: " + output_results)
    with open(output_results, "wb") as f:
        f.write(b",".join(encode_csv_field(k) for k in header) + b"\n")
        for result in results:
            f.write(b",".join(encode_csv_field(v) for v in result.values()) + b"\n")
    print("All done.")
Finally (and TL;DR): here are my results - I'm extracting only the best time and memory figures from the result set, but you can get the full result sets here: Python 2.7 Raw Test Data and Python 3.6 Raw Test Data.
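If you want to reduce the raw result CSVs yourself, a small sketch could look like the following (assuming the normalized column names the runner produces, i.e. python_time printed by the benchmarked script and maximum_resident_set_size from /usr/bin/time --verbose):

import collections
import csv

best = collections.defaultdict(lambda: [float("inf"), float("inf")])
with open("results.csv") as f:
    for row in csv.DictReader(f):  # keep the best of the TEST_LOOPS runs
        entry = best[(row["function"], row["sample"])]
        entry[0] = min(entry[0], float(row["python_time"]))
        entry[1] = min(entry[1], int(row["maximum_resident_set_size"]))
for (func, sample), (best_time, best_rss) in sorted(best.items()):
    print("{} @ {}: {:.2f} s, {} kB".format(func, sample, best_time, best_rss))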
Based on the data I gathered, a couple of final notes:
- If working memory is scarce (when dealing with extra large files, etc.), only the *_stream functions provide a small footprint. On Python 3.x, a mid-way option would be the mmap technique.
- If storage space is scarce, only the in_place_* functions are viable.
- If both are scarce, the only consistent option is in_place_stream, but at the expense of processing time and increased I/O calls (compared to the *_wm functions).
- The in_place_* functions are dangerous as they may lead to data corruption if they are stopped mid-way; the temp_file_* functions (without integrity checks) are only dangerous on non-transactional file systems. One possible safety net is sketched below.
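If you do need the in-place variants, one mitigation (my own suggestion, not something benchmarked above) is to keep a snapshot of the file for the duration of the edit:

import os
import shutil

def remove_line_safely(path, comp, edit=in_place_stream):
    backup = path + ".bak"  # hypothetical backup naming scheme
    shutil.copy2(path, backup)  # snapshot the file before touching it
    try:
        edit(path, comp)
    except BaseException:
        shutil.move(backup, path)  # restore the snapshot on any failure
        raise
    else:
        os.remove(backup)

This doesn't protect against a power loss mid-edit (the backup survives, but nothing restores it automatically), and it costs a full copy, which defeats the storage advantage - it's a safety net, not a fix.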