Sorting a file to optimize for compression efficiency
Question: We have some large data files that are concatenated, compressed, and then sent to another server. Compression reduces the transmission time to the destination server, so the smaller we can make the file in a short period of time, the better. This is a highly time-sensitive process. The data files contain many rows of tab-delimited text, and the order of the rows does not matter. We noticed that when we sorted the file by the first field, the compressed file size was much smaller,
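
A rough way to check the effect described above is to compress the same rows twice, once in arbitrary order and once sorted by the first tab-delimited field, and compare the sizes. The sketch below is only an illustration: the synthetic data, the helper names (`make_rows`, `sort_by_first_field`, `compressed_size`), and the choice of `zlib` at level 6 are all assumptions, not the setup from the question.

```python
import random
import zlib


def make_rows(n_keys=2000, rows_per_key=10, seed=0):
    """Build synthetic tab-delimited rows (hypothetical data, not the real files).

    Each key appears on several rows, and the rows are shuffled so that
    rows sharing a key end up far apart.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_keys):
        key = "%016x" % rng.getrandbits(64)    # first field
        label = "%012x" % rng.getrandbits(48)  # second field, tied to the key
        for _ in range(rows_per_key):
            rows.append(f"{key}\t{label}\t{rng.randint(0, 999)}")
    rng.shuffle(rows)
    return rows


def sort_by_first_field(rows):
    """Return the rows ordered by the text before the first tab."""
    return sorted(rows, key=lambda row: row.split("\t", 1)[0])


def compressed_size(rows, level=6):
    """Size in bytes of the newline-joined rows after zlib compression."""
    data = "\n".join(rows).encode("utf-8")
    return len(zlib.compress(data, level))


if __name__ == "__main__":
    rows = make_rows()
    print("unsorted:", compressed_size(rows), "bytes")
    print("sorted:  ", compressed_size(sort_by_first_field(rows)), "bytes")
```

Sorting places rows with the same first field next to each other, so the repeated text falls inside the compressor's match window (32 KB for DEFLATE). How much this helps on real files depends on how often values repeat and how far apart the repeats are before sorting.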