Algorithm for efficient diffing of huge files

深忆病人 2021-01-31 05:21

I have to store two files A and B which are both very large (like 100GB). However, B is likely to be similar in large parts to A, so I could store A and diff(A, B). There are two in

5 Answers
  •  情深已故
    2021-01-31 06:05

    Depending on your performance requirements, you could get away with fingerprinting only a sample of the chunks, and growing the matched regions when a fingerprint hits. That way you don't have to run a checksum over your entire large file.

    If you need arbitrary byte alignments and you really care about performance, look at the simhash algorithm, and use it to find similar but unaligned blocks.
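The sample-then-grow idea can be sketched roughly as follows. This is a toy illustration, not from the answer itself: `CHUNK`, `SAMPLE_STRIDE`, and all function names are my own, and the values are deliberately tiny so the behavior is easy to follow; a real implementation would use kilobyte-scale chunks, a rolling hash, and streaming I/O rather than in-memory bytes.

```python
import hashlib

CHUNK = 8          # toy chunk size; real files would use e.g. 4 KiB or more
SAMPLE_STRIDE = 2  # fingerprint only every 2nd chunk of A (the sampling idea)

def fingerprint(data):
    """Short, stable fingerprint of a chunk."""
    return hashlib.blake2b(data, digest_size=8).digest()

def index_chunks(a):
    """Fingerprint a sample of A's aligned chunks -> {fingerprint: offset}."""
    table = {}
    for off in range(0, len(a) - CHUNK + 1, CHUNK * SAMPLE_STRIDE):
        table[fingerprint(a[off:off + CHUNK])] = off
    return table

def grow(a, b, ai, bi):
    """Extend a seed chunk match backward and forward, byte by byte."""
    start_a, start_b = ai, bi
    while start_a > 0 and start_b > 0 and a[start_a - 1] == b[start_b - 1]:
        start_a -= 1
        start_b -= 1
    end_a, end_b = ai + CHUNK, bi + CHUNK
    while end_a < len(a) and end_b < len(b) and a[end_a] == b[end_b]:
        end_a += 1
        end_b += 1
    return start_a, start_b, end_a - start_a

def find_matches(a, b):
    """Scan B chunk by chunk; on a fingerprint hit, grow the match.

    Returns a list of (offset_in_a, offset_in_b, length) triples.
    """
    table = index_chunks(a)
    matches, bi = [], 0
    while bi + CHUNK <= len(b):
        off = table.get(fingerprint(b[bi:bi + CHUNK]))
        if off is not None:
            sa, sb, length = grow(a, b, off, bi)
            matches.append((sa, sb, length))
            bi = sb + length  # resume scanning past the grown match
        else:
            bi += CHUNK
    return matches
```

Note that the seed matches here only fire when B's scan position lines up with one of A's sampled chunk boundaries; content shifted by an offset that is not a multiple of the chunk size is missed, which is exactly the gap that techniques for unaligned blocks (like the simhash suggestion above) are meant to cover.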
