I have to store two files A and B which are both very large (around 100 GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two in
One question is what the record size in your files is, i.e. can the offsets change byte by byte, or do the files consist of, say, 1024-byte blocks? Assuming the data is byte-oriented, you could do the following:
Create a suffix array for file A. This array is a permutation of all byte offsets into A, sorted by the suffix that starts at each offset. If A has 2^37 bytes, the index values are easiest represented as 64-bit integers, so every byte of A (i.e. every offset into the file) corresponds to 8 bytes in the index array, and the index array will be 2^40 bytes long, roughly 1 TB. You can also index only every 1024th location, say, to reduce the size of the index array. This degrades the quality of packing, depending on how long the average runs of copyable fragments are.
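As a minimal illustration, here is a Python sketch of the sampled construction. It assumes A fits in memory, which a 100 GB file will not; a real implementation would build the array in external memory with a linear-time algorithm such as SA-IS or DC3. The names `build_sampled_suffix_array` and `SAMPLE_STEP` are mine, not standard.

```python
SAMPLE_STEP = 1024  # index only every 1024th offset, as suggested above

def build_sampled_suffix_array(a: bytes, step: int = SAMPLE_STEP) -> list[int]:
    """Offsets into `a` (every `step` bytes), sorted by the suffix starting there."""
    # Sorting whole suffixes this way is simple but memory-hungry and
    # O(n^2 log n) in the worst case; real implementations use SA-IS or DC3.
    return sorted(range(0, len(a), step), key=lambda i: a[i:])
```

With step=1 you get the full suffix array and the best packing; larger steps shrink the index at the cost of missed matches, exactly the trade-off described above.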
Then, to greedily pack file B, you start from its beginning at offset o = 0 and use the index array to find the longest match in A for the data starting at o. You output the (offset, length) pair to the packed file. Without any encoding this pair takes 16 bytes in your case (two 64-bit integers), so if the run is shorter than 16 bytes you actually lose space. This is easily remedied with bit-level encoding: use a marker bit to distinguish an isolated byte (marker + 8 bits = 9 bits) from an offset/length pair (marker + 40 bits + 40 bits = 81 bits), say. After packing the longest fragment at o, advance o to the first byte after the fragment and repeat until the end of the file.
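A sketch of the greedy loop, continuing the assumptions above. `longest_match` binary-searches the suffix array; the suffix with the longest common prefix with the query is always adjacent to the query's insertion point in the sorted suffix order. Note that with a sampled array it only finds matches starting at sampled offsets of A, and the output here is a simple token list rather than the bit-level encoding described above.

```python
def common_prefix_len(a: bytes, i: int, b: bytes, o: int) -> int:
    """Length of the common prefix of a[i:] and b[o:]."""
    n = 0
    while i + n < len(a) and o + n < len(b) and a[i + n] == b[o + n]:
        n += 1
    return n

def longest_match(a: bytes, sa: list[int], b: bytes, o: int) -> tuple[int, int]:
    """Return (offset_in_a, length) of the longest match in A for b[o:]."""
    # Binary search for where b[o:] would sort among A's (sampled) suffixes.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if a[sa[mid]:] < b[o:]:   # slicing copies; acceptable in a sketch only
            lo = mid + 1
        else:
            hi = mid
    # The longest match must neighbor the insertion point.
    best = (0, 0)
    for k in (lo - 1, lo):
        if 0 <= k < len(sa):
            n = common_prefix_len(a, sa[k], b, o)
            if n > best[1]:
                best = (sa[k], n)
    return best

def pack(a: bytes, sa: list[int], b: bytes) -> list[tuple]:
    """Greedily encode B as literal bytes and (offset, length) copies from A."""
    out, o = [], 0
    while o < len(b):
        off, n = longest_match(a, sa, b, o)
        # A raw 16-byte pair only pays off for runs of >= 16 bytes; with the
        # 9-bit/81-bit encoding above the break-even drops to 10 bytes,
        # since a run of n bytes costs 9n bits as literals vs. 81 as a pair.
        if n >= 16:
            out.append(("copy", off, n))
            o += n
        else:
            out.append(("lit", b[o]))
            o += 1
    return out
```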
The construction and use of suffix arrays are easy and you should find references easily. In high-speed applications, people use suffix trees or suffix tries instead, which are more complex to manipulate but provide faster lookups. In your case the array will live on secondary storage, and if the run time of the packing phase is not an issue, a suffix array should be enough.