Sorting a file to optimize for compression efficiency


Question


We have some large data files that are being concatenated, compressed, and then sent to another server. Compression reduces the transmission time to the destination server, so the smaller we can make the file in a short amount of time, the better. This is a highly time-sensitive process.

The data files contain many rows of tab-delimited text, and the order of the rows does not matter.

We noticed that when we sorted the file by the first field, the compressed file size was much smaller, presumably because duplicates in that column end up next to each other. However, sorting a large file is slow, and there's no real reason it needs to be sorted other than that sorting happens to improve compression. There's also no relationship between what's in the first column and what's in subsequent columns. There could be some ordering of rows that compresses even smaller, or alternatively there could be an algorithm that improves compression similarly but takes less time to run.

What approach could I use to reorder rows to optimize the similarity between neighboring rows and improve compression performance?
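For context, here is a minimal sketch of the sort-by-first-field preprocessing described above, in Python; the file names and the choice of LZMA are illustrative assumptions, not part of the actual pipeline:

    import lzma

    # Hypothetical input/output names, for illustration only.
    with open("data.tsv", "rb") as f:
        lines = f.read().splitlines(keepends=True)

    # Sort by the first tab-delimited field so duplicate values end up adjacent,
    # which lets a dictionary-based compressor find longer matches.
    lines.sort(key=lambda line: line.split(b"\t", 1)[0])

    with open("data.tsv.xz", "wb") as f:
        f.write(lzma.compress(b"".join(lines)))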


Answer 1:


Here are a few suggestions:

  1. Split the file into smaller batches and sort those. Sorting multiple small sets of data is faster than sorting a single big chunk, and you can easily parallelize the work this way (see the sketch after this list).
  2. Experiment with different compression algorithms. Different algorithms have different throughput and ratio; you are interested in algorithms that sit on the Pareto frontier of those two dimensions.
  3. Use bigger dictionary sizes. This allows the compressor to reference data that is further in the past.
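A rough sketch combining suggestions 1 and 3, assuming the rows already sit in memory as a list of bytes; the batch size and the 64 MiB LZMA2 dictionary are illustrative values, not tuned recommendations:

    import lzma
    from multiprocessing import Pool

    # A bigger dictionary lets the compressor reference data further in the past.
    FILTERS = [{"id": lzma.FILTER_LZMA2, "dict_size": 64 * 1024 * 1024}]

    def sort_and_compress(batch):
        # Sort each small batch by its first tab-delimited field, then compress it.
        batch.sort(key=lambda line: line.split(b"\t", 1)[0])
        return lzma.compress(b"".join(batch), format=lzma.FORMAT_XZ, filters=FILTERS)

    def compress_in_batches(lines, batch_size=100_000):
        # Split into smaller batches and sort/compress them in parallel.
        batches = [lines[i:i + batch_size] for i in range(0, len(lines), batch_size)]
        with Pool() as pool:
            return pool.map(sort_and_compress, batches)

Compressing each batch independently gives up some ratio compared to one long stream, so it is worth measuring both against your time budget.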

Note that sorting matters no matter which algorithm and dictionary size you choose, because references to older data tend to use more bits. Also, sorting by a time dimension tends to group together rows that come from a similar data distribution. For example, Stack Overflow has more bot traffic at night than during the day, so the UserAgent distribution in their HTTP logs probably varies greatly with the time of day.




Answer 2:


If the columns contain different types of data, e.g.

Name, Favourite drink, Favourite language, Favourite algorithm

then you may find that transposing the data (i.e. turning rows into columns) improves compression, because for each new entry the compressor only needs to encode which item is the favourite, rather than both the item and the category it belongs to.

On the other hand, if a word is equally likely to appear in any column, then this approach is unlikely to be of any use.
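A tiny sketch of what transposing looks like, with made-up values matching the column layout above; whether it actually helps depends entirely on the data:

    def transpose(rows):
        # Turn rows into columns so that values from the same field are adjacent.
        return [list(column) for column in zip(*rows)]

    rows = [
        ["Alice", "tea",    "Python", "quicksort"],
        ["Bob",   "coffee", "Python", "mergesort"],
        ["Carol", "tea",    "Go",     "quicksort"],
    ]
    columns = transpose(rows)
    # columns[1] == ['tea', 'coffee', 'tea'] -- all "favourite drink" values together,
    # which the compressor can exploit when values repeat within a column.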




Answer 3:


Just chiming in: simply try using a different compression format. We found for our application (a compressed SQLite db) that LZMA / 7z compresses about 4 times better than zip. Just saying, before you implement anything.
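Before reordering anything, a quick comparison along these lines (the file name is hypothetical) shows how much the container format alone changes the size:

    import lzma
    import zlib

    with open("data.tsv", "rb") as f:  # hypothetical sample file
        data = f.read()

    print("zlib/deflate:", len(zlib.compress(data, level=9)))
    print("lzma/xz     :", len(lzma.compress(data, preset=9)))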



Source: https://stackoverflow.com/questions/24149980/sorting-a-file-to-optimize-for-compression-efficiency
