What is the best way to do a search in a large file?

戏子无情 提交于 2020-01-03 22:33:13

问题


I'm looking to apply a KMP (or similar) search to a large file (> 4GB).

I'm expecting this to give me problems though.I can't copy it all to memory because there isn't enough space there.

My question is, what is the best way to go about doing this search? Should I simply create a FILE* and do the search directly in the file, should I copy blocks (say 4k) to memory and search those, or something else completely?


回答1:


If you are using a platform that supports it, you can use mmap(). Pagination of the file is also a possibility, but remember to keep the buffer as large as possible to reduce the IO overhead, and to be careful between boundaries of two pages (suppose a string is matching, but is splitted by the page boundary)

Alternatively, I suggest you to build an index of some sort, and use the index to restrict the search. KMP search is not particularly efficient. This of course depends on the nature of your file, how it gets created, etc.




回答2:


For the file access I would recommend to use memory mapped file to avoid the data copy. It is trivial on unix machines. You may have to split the file mapping into smaller blocks if it can't be allocated in one block. I can provide some code if you are interested.

For the search I would recommend using the Boyer More search algorithm.




回答3:


Searching directly in the file would be very slow, using buffering will give much better performance. But note that your buffer has to be bigger than what you search (SearchLength), of course, and you have to refresh the buffer when being SearchLength bytes before its end.




回答4:


Best approach is to read it in blocks and search that. You should make the block size a parameter, so you can experiment with what gives the best performance.

However, it is usually more efficient to try and index the file in some way so that you don't have to linearly search through the whole file. For example, KMP is a string searching algorithm - are you just looking for occuences of a word? Then you can just create a hash table (on disk) of the words and their location in the file and have very efficient search.



来源:https://stackoverflow.com/questions/1212255/what-is-the-best-way-to-do-a-search-in-a-large-file

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!