what's the fastest way to scan a very large file in java?

北战南征 提交于 2019-12-21 04:43:13

问题


Imagine I have a very large text file. Performance really matters.

All I want to do is to scan it to look for a certain string. Maybe I want to count how many I have of those, but it really is not the point.

The point is: what's the fastest way ?

I don't care about maintainance it needs to be fast.

Fast is key.


回答1:


For a one off search use a Scanner, as suggested here

A simple technique that could well be considerably faster than indexOf() is to use a Scanner, with the method findWithinHorizon(). If you use a constructor that takes a File object, Scanner will internally make a FileChannel to read the file. And for pattern matching it will end up using a Boyer-Moore algorithm for efficient string searching.




回答2:


First of all, use nio (FileChannel) rather than the java.io classes. Second, use an efficient string search algorithm like Boyer-Moore.

If you need to search through the same file multiple times for different strings, you'll want to construct some kind of index, so take a look at Lucene.




回答3:


Load the whole file into memory and then look at using a string searching algorithm such as Knuth Morris Pratt.

Edit:
A quick google shows this string searching library that seems to have implemented a few different string search algorithms. Note I've never used it so can't vouch for it.




回答4:


Whatever may be the specifics, memory mapped IO is usually the answer.

Edit: depending on your requirements, you could try importing the file into an SQL database and then leveraging the performance improvements through JDBC.

Edit2: this thread at JavaRanch has some other ideas, involving FileChannel. I think it might be exactly what you are searching.




回答5:


I'd say the fastest you can get will be to use BufferedInputStreams on top of FileInputStreams... or use custom buffers if you want to avoid the BufferedInputStream instantiation.

This will explain it better than me : http://java.sun.com/developer/technicalArticles/Programming/PerfTuning/




回答6:


Use the right tool: full text-search library

My suggestion is to do a in-memory index (or file based index with caching enabled) and then perform the search on it. As @Michael Borgwardt suggested, Lucene is the best library out there.




回答7:


I don't know if this is a stupid suggestion, but isn't grep a pretty efficient file searching tool? Maybe you can call it using Runtime.getRuntime().exec(..)




回答8:


It depends on whether you need to do more than one search per file. If you need to do just one search, read the file in from disk and parse it using the tools suggested by Michael Bogwart. If you need to do more than one search, you should probably build an index of the file with a tool like Lucene: read the file in, tokenise it, stick tokens in index. If the index is small enough, have it in RAM (Lucene gives option of RAM or disk-backed index). If not keep it on disk. And if it is too large for RAM and you are very, very, very concerned about speed, store your index on a solid state/flash drive.



来源:https://stackoverflow.com/questions/4886154/whats-the-fastest-way-to-scan-a-very-large-file-in-java

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!