Fastest way to concatenate multiple files column wise - Python

Submitted by 天涯浪子 on 2020-12-29 12:34:53

Question


What is the fastest method to concatenate multiple files column wise (within Python)?

Assume that I have two files with 1,000,000,000 lines and ~200 UTF8 characters per line.

Method 1: Cheating with paste

Under a Linux system I could concatenate the two files with paste in the shell, and I could cheat by calling it through os.system, i.e.:

import os

def concat_files_cheat(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        os.system('paste ' + file1 + ' ' + file2 + ' > ' + output)
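
If shelling out is acceptable, subprocess avoids assembling the shell command by string concatenation (which breaks on paths containing spaces or shell metacharacters). A minimal sketch of an equivalent call (the name concat_files_paste is just illustrative):

import os
import subprocess

def concat_files_paste(file_path, file1, file2, output_path, output):
    file1 = os.path.join(file_path, file1)
    file2 = os.path.join(file_path, file2)
    output = os.path.join(output_path, output)
    if not os.path.exists(output):
        # pass the arguments as a list and redirect stdout from Python,
        # so no shell command string has to be built
        with open(output, 'wb') as fout:
            subprocess.check_call(['paste', file1, file2], stdout=fout)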

Method 2: Using nested context managers with zip:

def concat_files_zip(file_path, file1, file2, output_path, output):
    with open(output, 'wb') as fout:
        with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
            for line1, line2 in zip(fin1, fin2):
                # files are opened in binary mode, so use a bytes delimiter
                fout.write(line1 + b'\t' + line2)

Method 3: Using fileinput

Does fileinput iterate through the files in parallel, or will it iterate through each file sequentially, one after the other?

If it is the former, I would assume it would look like this:

def concat_files_fileinput(file_path, file1, file2, output_path, output):
    with fileinput.input(files=(file1, file2)) as f:
        for line in f:
            line1, line2 = process(line)
            fout.write(line1 + '\t' + line2)

Method 4: Treat them like csv

import csv

with open(output, 'wb') as fout:
    with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2:
        writer = csv.writer(fout)
        reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
        for line1, line2 in zip(reader1, reader2):
            writer.writerow(line1 + '\t' + line2)

Given the data size, which would be the fastest?

Why would one choose one over the other? Would I lose or add information?

For each method, how would I choose a delimiter other than , or \t?

Are there other ways of achieving the same concatenation column wise? Are they as fast?


Answer 1:


Of the four methods I would take the second, but you have to take care of a few small details in the implementation. (With a few improvements it takes 0.002 seconds, whereas the original implementation takes about 6 seconds; the file I was working with had 1M rows, but there should not be much difference if the file is 1,000 times bigger, since we are hardly using any memory.)

Changes from the original implementation:

  • Use iterators if possible; otherwise memory consumption is penalized because you have to handle the whole file at once. (In particular, if you are using Python 2, use itertools.izip instead of zip.)
  • When concatenating strings, use "{}{}".format() (or "%s%s" % (...)) rather than +; otherwise you generate a new string instance for every concatenation.
  • There is no need to write line by line inside the for loop; you can pass an iterator to a single write call.
  • Small buffers are interesting, but with iterators the difference is very small; if instead we try to fetch a lot of data at once (for example f1.readlines(1024*1000)), it is much slower.

Example:

from itertools import izip  # Python 2; the built-in zip is lazy in Python 3

def concat_iter(file1, file2, output):
    # 1024-byte write buffer for the output file
    with open(output, 'w', 1024) as fo, \
        open(file1, 'r') as f1, \
        open(file2, 'r') as f2:
        # readlines(1024) returns roughly 1024 bytes' worth of lines
        fo.write("".join("{}\t{}".format(l1, l2) 
           for l1, l2 in izip(f1.readlines(1024), 
                              f2.readlines(1024))))

Profiler output for the original solution:

We can see that the biggest problems are in write and zip (mainly because iterators are not used, so the whole file has to be handled and processed in memory).

~/personal/python-algorithms/files$ python -m cProfile sol_original.py 
10000006 function calls in 5.208 seconds

Ordered by: standard name

ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    5.208    5.208 sol_original.py:1(<module>)
    1    2.422    2.422    5.208    5.208 sol_original.py:1(concat_files_zip)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
    **9999999    1.713    0.000    1.713    0.000 {method 'write' of 'file' objects}**
    3    0.000    0.000    0.000    0.000 {open}
    1    1.072    1.072    1.072    1.072 {zip}

Profiler output for the improved solution:

~/personal/python-algorithms/files$ python -m cProfile sol1.py 
     3731 function calls in 0.002 seconds

Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
    1    0.000    0.000    0.002    0.002 sol1.py:1(<module>)
    1    0.000    0.000    0.002    0.002 sol1.py:3(concat_iter6)
 1861    0.001    0.000    0.001    0.000 sol1.py:5(<genexpr>)
    1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
 1860    0.001    0.000    0.001    0.000 {method 'format' of 'str' objects}
    1    0.000    0.000    0.002    0.002 {method 'join' of 'str' objects}
    2    0.000    0.000    0.000    0.000 {method 'readlines' of 'file' objects}
    **1    0.000    0.000    0.000    0.000 {method 'write' of 'file' objects}**
    3    0.000    0.000    0.000    0.000 {open}

And in Python 3 it is even faster, because zip is already a lazy iterator and we don't need to import anything extra.
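
The sol2.py script profiled below is not shown in the answer; a minimal Python 3 sketch of the same approach, relying on zip being lazy, might look like this (the rstrip call is an addition, so that each output row stays on a single line):

def concat_iter_py3(file1, file2, output):
    with open(output, 'w') as fo, \
         open(file1, 'r') as f1, \
         open(file2, 'r') as f2:
        # zip streams both files line by line; writelines consumes the
        # generator expression directly, without building one big string
        fo.writelines("{}\t{}".format(l1.rstrip('\n'), l2)
                      for l1, l2 in zip(f1, f2))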

~/personal/python-algorithms/files$ python3.5 -m cProfile sol2.py 
843 function calls (842 primitive calls) in 0.001 seconds
[...]

It is also nice to look at memory consumption and file-system accesses, which confirm what we said before:

$ /usr/bin/time -v python sol1.py
Command being timed: "python sol1.py"
User time (seconds): 0.01
[...]
Maximum resident set size (kbytes): 7120
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 914
[...]
File system outputs: 40
Socket messages sent: 0
Socket messages received: 0


$ /usr/bin/time -v python sol_original.py 
Command being timed: "python sol_original.py"
User time (seconds): 5.64
[...]
Maximum resident set size (kbytes): 1752852
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 427697
[...]
File system inputs: 0
File system outputs: 327696



Answer 2:


In method 2 you can replace the for loop with writelines by passing a generator expression to it, and replace zip with izip from itertools. This may come close to paste or even surpass it.

from itertools import izip, repeat  # Python 2 (repeat is used below)

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(*line) for line in izip(fin1, fin2))

If you don't want to embed \t in the format string, you can use repeat from itertools:

    fout.writelines(b"{}{}{}".format(*line) for line in izip(fin1, repeat(b'\t'), fin2))

If the files are of the same length, you can do away with izip.

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    fout.writelines(b"{}\t{}".format(line, next(fin2)) for line in fin1)
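
One portability note, not part of the original answer: b"{}\t{}".format(...) only works under Python 2, where a bytes literal is an ordinary str; in Python 3 bytes objects have no format method, though they do support %-formatting since 3.5. A binary-mode Python 3 sketch could be:

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    # bytes %-formatting (Python 3.5+); strip line1's newline so both
    # columns of each row end up on the same line
    fout.writelines(b"%s\t%s" % (l1.rstrip(b'\n'), l2)
                    for l1, l2 in zip(fin1, fin2))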



Answer 3:


You can test your functions with timeit; the timeit documentation could be helpful.

Or use the %timeit magic function in a Jupyter notebook: you just need to write %timeit func(data) and you will get a report of how long your function takes.
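
For example, timing one of the question's functions with timeit.repeat might look like this (the file names are placeholders; number=1 because a single run already processes the whole file):

import timeit

best = min(timeit.repeat(
    "concat_files_zip('.', 'a.txt', 'b.txt', '.', 'out.txt')",
    setup="from __main__ import concat_files_zip",
    repeat=3, number=1))
print("best of 3 runs: {:.3f} s".format(best))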




Answer 4:


Method #1 is the fastest because it uses native (instead of Python) code to concatenate the files. However, it is definitely cheating.

If you want to cheat, you may also consider writing your own C extension for Python; it may be even faster, depending on your coding skills.

I am afraid Method #4 would not work, since you are concatenating lists with a string. I would go with writer.writerow(line1 + line2) instead. You may use the delimiter parameter of both csv.reader and csv.writer to customise the delimiter (see https://docs.python.org/2/library/csv.html).
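
For instance, a corrected Method 4 that reads comma-separated input and writes pipe-separated output might look like this (the | delimiter is just an example):

import csv

with open(file1, 'rb') as fin1, open(file2, 'rb') as fin2, open(output, 'wb') as fout:
    # Python 2's csv module expects binary-mode files; line1 + line2
    # concatenates the two field lists, and the writer inserts the delimiter
    writer = csv.writer(fout, delimiter='|')
    reader1, reader2 = csv.reader(fin1), csv.reader(fin2)
    for line1, line2 in zip(reader1, reader2):
        writer.writerow(line1 + line2)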



Source: https://stackoverflow.com/questions/39888949/fastest-way-to-concatenate-multiple-files-column-wise-python
