Question
I am using Python multiprocessing to generate one temporary output file per process. They can be several GB in size and I make several tens of them. These temporary files need to be concatenated to form the desired output, and this step is proving to be a bottleneck (and a parallelism killer). Is there a Linux tool that will create the concatenated file by modifying the file-system metadata rather than actually copying the content? As long as it works on any Linux system it would be acceptable to me, but a file-system-specific solution won't be of much help.
I am not OS or CS trained, but in theory it seems it should be possible to create a new inode, copy over the inode pointer structure from the inodes of the files I want to concatenate, and then unlink those inodes. Is there any utility that will do this? Given the surfeit of well-thought-out Unix utilities I fully expected there to be one, but could not find anything. Hence my question on SO. The file system is on a block device, a hard disk actually, in case this information matters. I don't have the confidence to write this on my own, as I have never done any systems-level programming before, so any pointers (to C/Python code snippets) will be very helpful.
Answer 1:
Even if there were such a tool, it could only work if all files except the last were guaranteed to have a size that is a multiple of the filesystem's block size.
If you control how the data is written into the temporary files, and you know how large each one will be, you can instead do the following:
Before starting the multiprocessing, create the final output file and grow it to the final size by fseek()ing to the end; this will create a sparse file.
Start multiprocessing, handing each process the FD and the offset into its particular slice of the file.
This way, the processes collaboratively fill the single output file, removing the need to cat them together later. A minimal Python sketch of this approach follows.
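A minimal sketch of that idea in Python, assuming the size of every slice is known up front; the file name, the slice sizes and the dummy payload below are placeholders rather than anything from the question:

import multiprocessing
import os

OUTPUT = "final.out"                       # hypothetical output path
SLICE_SIZES = [1 << 20, 2 << 20, 1 << 20]  # assumed: known size of each slice

def worker(path, offset, size):
    # Each worker opens the file itself and writes only inside its own slice.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(b"\0" * size)              # stand-in for the real output data

if __name__ == "__main__":
    # Create the output file and grow it to its final size; on most Linux
    # filesystems this produces a sparse file with no blocks allocated yet.
    with open(OUTPUT, "wb") as f:
        f.truncate(sum(SLICE_SIZES))

    offsets = [sum(SLICE_SIZES[:i]) for i in range(len(SLICE_SIZES))]
    workers = [multiprocessing.Process(target=worker, args=(OUTPUT, off, size))
               for off, size in zip(offsets, SLICE_SIZES)]
    for p in workers:
        p.start()
    for p in workers:
        p.join()

Here each process reopens the file by path instead of inheriting the FD; either works, as long as every process stays within its own byte range.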
EDIT
If you can't predict the size of the individual files, but the consumer of the final file can work with sequential (as opposed to random-access) input, you can feed cat tmpfile1 ... tmpfileN to the consumer, either on stdin
cat tmpfile1 ... tmpfileN | consumer
or via named pipes (using bash's Process Substitution):
consumer <(cat tmpfile1 ... tmpfileN)
Answer 2:
You indicate that you don't know in advance the size of each temporary file. With this in mind, I think your best bet is to write a FUSE filesystem that would present the chunks as a single large file, while keeping them as individual files on the underlying filesystem.
In this solution, your producing and consuming apps remain unchanged. The producers write out a bunch of files that the FUSE layer makes appear as a single file. This virtual file is then presented to the consumer.
FUSE has bindings for a bunch of languages, including Python. If you look at the examples that ship with the various bindings, you'll see this requires surprisingly little code.
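A minimal read-only sketch along those lines, using the fusepy bindings; the chunk paths, the mount point and the virtual file name "joined" are assumptions, and a production version would need more of the Operations methods filled in:

import errno
import os
import stat
import sys
from fuse import FUSE, FuseOSError, Operations

class ConcatFS(Operations):
    # Expose several real chunk files as one virtual read-only file, /joined.
    def __init__(self, chunks):
        self.chunks = chunks
        self.sizes = [os.path.getsize(c) for c in chunks]
        self.total = sum(self.sizes)

    def getattr(self, path, fh=None):
        if path == "/":
            return dict(st_mode=stat.S_IFDIR | 0o755, st_nlink=2)
        if path == "/joined":
            return dict(st_mode=stat.S_IFREG | 0o444, st_nlink=1,
                        st_size=self.total)
        raise FuseOSError(errno.ENOENT)

    def readdir(self, path, fh):
        return [".", "..", "joined"]

    def read(self, path, size, offset, fh):
        # Map the requested byte range onto the underlying chunk files.
        out = bytearray()
        pos = 0
        for chunk, csize in zip(self.chunks, self.sizes):
            if size > 0 and offset < pos + csize:
                start = max(0, offset - pos)
                with open(chunk, "rb") as f:
                    f.seek(start)
                    data = f.read(min(size, csize - start))
                out += data
                offset += len(data)
                size -= len(data)
            pos += csize
        return bytes(out)

if __name__ == "__main__":
    # usage: python concatfs.py <mountpoint> <chunk1> <chunk2> ...
    FUSE(ConcatFS(sys.argv[2:]), sys.argv[1], foreground=True, ro=True)

The consumer then simply reads <mountpoint>/joined as if it were one big file.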
Answer 3:
I don't think so. File data is aligned to filesystem blocks, so this would only be possible if you were OK with some zeros (or unknown bytes) being left between one file's tail and the next file's head.
Instead of concatenating these files, I'd suggest re-designing the analysis tool to support reading from multiple files. Take log files as an example: many log analyzers can read a separate log file for each day.
EDIT
@san: Since, as you say, you can't control the consuming code, you can concatenate the separate files on the fly using a named pipe:
$ mkfifo /tmp/cat
$ cat file1 file2 ... >/tmp/cat &
$ user_program /tmp/cat
...
$ rm /tmp/cat
Answer 4:
For 4 files (xaa, xab, xac, xad), a fast concatenation in bash (as root):
losetup -v -f xaa; losetup -v -f xab; losetup -v -f xac; losetup -v -f xad
(Let's suppose that loop0, loop1, loop2, loop3 are the names of the new device files.)
Put http://pastebin.com/PtEDQH7G into a "join_us" script file. Then you can use it like this:
./join_us /dev/loop{0..3}
Then (if this big file is a film) you can give its ownership to a normal user (chown itsme /dev/mapper/joined) and then he/she can play it via: mplayer /dev/mapper/joined
The cleanup after these (as root):
dmsetup remove joined; losetup -d /dev/loop[0123]
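The pastebin script itself is not reproduced here, but the core of any such join script is a device-mapper "linear" table that maps each loop device, one after another, into a single virtual device. A rough Python sketch of generating that table (the device names and the "joined" target name are assumptions):

import subprocess
import sys

def linear_table(devices):
    # Each line is: <start sector> <length in 512-byte sectors> linear <device> 0
    lines = []
    start = 0
    for dev in devices:
        sectors = int(subprocess.check_output(["blockdev", "--getsz", dev]))
        lines.append(f"{start} {sectors} linear {dev} 0")
        start += sectors
    return "\n".join(lines) + "\n"

if __name__ == "__main__":
    # usage (as root): python make_table.py /dev/loop{0..3} | dmsetup create joined
    sys.stdout.write(linear_table(sys.argv[1:]))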
Answer 5:
No, there is no such tool or syscall.
You might investigate if it's possible for each process to write directly into the final file. Say process 1 writes bytes 0-X, process 2 writes X-2X and so on.
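A small sketch of that, assuming a fixed slice size X per process; os.pwrite gives each process a positional write into a shared, pre-sized file (all names and sizes below are made up):

import multiprocessing
import os

PATH = "final.out"   # hypothetical output file
X = 1 << 20          # assumed fixed number of bytes each process produces
N = 4                # number of worker processes

def worker(i):
    # Positional write: no seeking and no shared file offset to race on.
    fd = os.open(PATH, os.O_WRONLY)
    try:
        os.pwrite(fd, bytes([i]) * X, i * X)   # stand-in for the real data
    finally:
        os.close(fd)

if __name__ == "__main__":
    with open(PATH, "wb") as f:
        f.truncate(N * X)                      # pre-size the final file
    with multiprocessing.Pool(N) as pool:
        pool.map(worker, range(N))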
Answer 6:
A potential alternative is to cat all your temp files into a named pipe and then use that named pipe as input to your single-input program, as long as that program reads its input sequentially and doesn't seek.
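A sketch of wiring that up from Python, assuming the consumer is an external command that takes a file name as its argument (the chunk names and consumer_program are placeholders):

import os
import subprocess
import tempfile

chunks = ["tmpfile1", "tmpfile2", "tmpfile3"]           # hypothetical temp files
fifo = os.path.join(tempfile.mkdtemp(), "joined.fifo")
os.mkfifo(fifo)

# The shell child blocks on opening the FIFO until the consumer opens it for
# reading, so it is safe to start it first and let it wait in the background.
writer = subprocess.Popen("cat " + " ".join(chunks) + " > " + fifo, shell=True)
consumer = subprocess.Popen(["consumer_program", fifo])  # assumed consumer CLI

consumer.wait()
writer.wait()
os.remove(fifo)

Nothing is copied to disk a second time; the data streams through the pipe straight into the consumer.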
Source: https://stackoverflow.com/questions/5893531/fast-concatenate-multiple-files-on-linux