How to search pattern in big binary files efficiently

Submitted by 六月ゝ 毕业季﹏ on 2020-01-22 02:04:40

Question


I have several binary files, most of them larger than 10 GB. In these files I want to find patterns with Python, i.e. the data between the byte sequence 0x01 0x02 0x03 and the byte sequence 0xF1 0xF2 0xF3.

My problem: I know how to handle binary data and how to use search algorithms, but because of the size of the files it is very inefficient to read a whole file into memory first. That's why I thought it would be smart to read the file block by block and search for the pattern inside each block.

My goal: I would like to have Python determine the positions (start and stop) of a found pattern. Is there a special algorithm or maybe even a Python library that I could use to solve the problem?


Answer 1:


The common way to search for a pattern in a large file is to read the file in chunks into a buffer whose size is the chunk size plus the size of the pattern minus 1.

On the first read, you search for the pattern only in the chunk just read. After that, you repeatedly copy the last size_of_pattern - 1 bytes from the end of the buffer to its beginning, read a new chunk in after them, and search the whole buffer. That way you are sure to find every occurrence of the pattern, even one that starts in one chunk and ends in the next.
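
The approach described above could be sketched in Python roughly as follows. This is a minimal illustration, not a tuned implementation: the function name find_all, the 1 MiB default chunk size, and the sample data are assumptions made for the example. It yields the absolute offset of every match, so running it once for the start marker and once for the end marker gives the start and stop positions asked for in the question.

```python
import io


def find_all(stream, pattern, chunk_size=1 << 20):
    """Yield the absolute offset of every occurrence of `pattern` in `stream`.

    `stream` is any binary file object, e.g. open(path, "rb").
    The name and the default chunk size are illustrative assumptions.
    """
    if not pattern:
        raise ValueError("pattern must not be empty")
    overlap = len(pattern) - 1
    tail = b""   # last `overlap` bytes carried over from the previous buffer
    base = 0     # absolute offset in the stream of buffer[0]
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buffer = tail + chunk
        pos = buffer.find(pattern)
        while pos != -1:
            yield base + pos
            pos = buffer.find(pattern, pos + 1)
        # Keep only the last len(pattern) - 1 bytes: too short to hold a
        # complete match, so no occurrence is reported twice.
        keep = min(overlap, len(buffer))
        tail = buffer[len(buffer) - keep:]
        base += len(buffer) - keep


# Tiny in-memory demo; chunk_size=4 forces matches to straddle chunk borders.
data = b"\x00\x00\x01\x02\x03AB\xf1\xf2\xf3\x00\x01\x02\x03"
print(list(find_all(io.BytesIO(data), b"\x01\x02\x03", chunk_size=4)))  # [2, 11]
print(list(find_all(io.BytesIO(data), b"\xf1\xf2\xf3", chunk_size=4)))  # [7]
```

For a real file, the stream would be something like open(path, "rb"), with a chunk size of a few megabytes. Memory use stays at about one chunk plus the carried-over tail, regardless of the file size.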



Source: https://stackoverflow.com/questions/59307194/how-to-search-pattern-in-big-binary-files-efficiently
