Efficient substring search in a large text file containing 100 million strings (no duplicate strings)

Submitted by 泄露秘密 on 2020-01-22 12:53:06

Question


I have a large text file (1.5 GB) containing 100 million strings (no duplicates), one string per line. I want to build a web application in Java so that when a user enters a keyword (substring), they get a count of all the strings in the file that contain that keyword. I already know about Lucene; is there any other way to do this? I need the result within 3-4 seconds. My system has 4 GB of RAM and a dual-core CPU, and the solution needs to be in Java only.


Answer 1:


Try using hash tables. Another option is something along the lines of MapReduce; what I mean is that you can try an inverted index, which is the same technique Google uses. You can also create a stopword file listing words that can be ignored, e.g. I, am, the, a, an, in, on, etc.

This is the only approach I think is feasible. I have also read somewhere that you can use arrays for searching.
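
A minimal sketch of the inverted-index idea in Java, assuming an input file named strings.txt with one string per line and a small illustrative stopword list (both are placeholders, not from the question):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.*;

public class InvertedIndexSketch {
    // Illustrative stopword list; in practice this would come from a stopword file.
    private static final Set<String> STOPWORDS =
            new HashSet<>(Arrays.asList("i", "am", "the", "a", "an", "in", "on"));

    // token -> line numbers of the lines containing that token
    private final Map<String, Set<Integer>> index = new HashMap<>();

    public void build(String path) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(path))) {
            String line;
            int lineNo = 0;
            while ((line = reader.readLine()) != null) {
                for (String token : line.toLowerCase().split("\\W+")) {
                    if (token.isEmpty() || STOPWORDS.contains(token)) continue;
                    index.computeIfAbsent(token, k -> new HashSet<>()).add(lineNo);
                }
                lineNo++;
            }
        }
    }

    // Number of lines containing the keyword as a whole token.
    public int count(String keyword) {
        return index.getOrDefault(keyword.toLowerCase(), Collections.emptySet()).size();
    }

    public static void main(String[] args) throws IOException {
        InvertedIndexSketch idx = new InvertedIndexSketch();
        idx.build("strings.txt"); // hypothetical file, one string per line
        System.out.println(idx.count("keyword"));
    }
}
```

Note that this indexes whole tokens, so it answers whole-word queries quickly; arbitrary substring queries would still need to scan the candidate lines.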




Answer 2:


Do you expect a lot of overlap in your keywords? If so, you might be able to store a hash map from keyword (String) to file locations (ArrayList). You cannot store all the lines themselves in memory, though, given the per-object overhead.

Once you have a file location, you can seek to it in the text file and then scan nearby for the enclosing newline characters, returning the line. That will definitely take less than 4 seconds. If this is just a small exercise, that approach will work fine.
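
For the seek-and-read step, here is a rough sketch using RandomAccessFile, assuming the index maps each keyword to the byte offsets at which matching lines start (building that map during a one-time scan of the file is left out here):

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.Arrays;
import java.util.List;

public class OffsetLookup {
    // Reads the full line starting at the given byte offset.
    // Assumes offsets stored in the index point at the first byte of a line,
    // and that the file is ASCII (RandomAccessFile.readLine does not decode UTF-8).
    static String lineAt(RandomAccessFile file, long offset) throws IOException {
        file.seek(offset);
        return file.readLine();
    }

    // Given the offsets recorded for a keyword, count how many of those
    // lines actually contain the keyword as a substring.
    static int countMatches(RandomAccessFile file, List<Long> offsets, String keyword)
            throws IOException {
        int count = 0;
        for (long offset : offsets) {
            String line = lineAt(file, offset);
            if (line != null && line.contains(keyword)) count++;
        }
        return count;
    }

    public static void main(String[] args) throws IOException {
        // "strings.txt" and the offset list are placeholders; the offsets would
        // normally come from the keyword -> locations hash map described above.
        try (RandomAccessFile file = new RandomAccessFile("strings.txt", "r")) {
            List<Long> offsets = Arrays.asList(0L);
            System.out.println(countMatches(file, offsets, "keyword"));
        }
    }
}
```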

A better solution, though, would be a two-tiered index: one mapping keywords to line numbers, and another mapping line numbers to line text. That will not fit in memory on your machine, but there are good disk-based key-value stores that would work well. If this is anything beyond a toy problem, go the Redis route.
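
A rough outline of that two-tiered layout, sketched with the Jedis client for Redis (an assumed dependency; the kw:* and line:* key names are purely illustrative):

```java
import java.util.Set;
import redis.clients.jedis.Jedis;

public class TwoTierIndex {
    public static void main(String[] args) {
        try (Jedis jedis = new Jedis("localhost", 6379)) {
            // Tier 1: keyword -> set of line numbers (key names are illustrative)
            jedis.sadd("kw:apple", "42", "1007");
            // Tier 2: line number -> line text
            jedis.set("line:42", "apple pie recipe");
            jedis.set("line:1007", "green apple orchard");

            // Lookup: count the lines recorded for a keyword, then fetch them
            Set<String> lineIds = jedis.smembers("kw:apple");
            System.out.println("matches: " + lineIds.size());
            for (String id : lineIds) {
                System.out.println(jedis.get("line:" + id));
            }
        }
    }
}
```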




Answer 3:


You could build a directory structure based on the first few letters of each word. For example:

/A
/A/AA
/A/AB
/A/AC
...
/Z/ZU

Under that structure, you keep in each folder a file containing all the strings whose first characters match the folder name. The first characters of the search term then narrow the selection down to a folder holding a small fraction of your overall list, and from there you can do a full search of just that file. If it's too slow, increase the depth of the directory tree to cover more letters.
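
A sketch of the prefix-to-path mapping this answer describes, assuming two-letter shards, uppercase ASCII strings, and a search term whose first characters match the stored strings (directory layout and file names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public class PrefixShards {
    // Maps a string to its shard file, e.g. "APPLE" -> baseDir/A/AP.txt
    static Path shardFor(Path baseDir, String s) {
        String upper = s.toUpperCase();
        String first = upper.substring(0, 1);
        String firstTwo = upper.substring(0, Math.min(2, upper.length()));
        return baseDir.resolve(first).resolve(firstTwo + ".txt");
    }

    // Scans only the shard whose name matches the first characters of the
    // search term and counts the lines containing it.
    static long countInShard(Path baseDir, String term) throws IOException {
        Path shard = shardFor(baseDir, term);
        if (!Files.exists(shard)) return 0;
        try (Stream<String> lines = Files.lines(shard)) {
            return lines.filter(line -> line.contains(term)).count();
        }
    }

    public static void main(String[] args) throws IOException {
        // "shards" is a placeholder for the directory tree built beforehand.
        System.out.println(countInShard(Paths.get("shards"), "APPLE"));
    }
}
```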




Answer 4:


Since you have more RAM than the size of the file, you might be able to hold the entire data set in memory as a structure and search it very quickly. A trie might be a good data structure to use: it offers fast prefix lookups, though it is not clear how well it handles substring queries.
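
For reference, a minimal prefix-counting trie in Java; as noted above, this answers prefix queries quickly, while true substring queries would need a suffix-oriented structure instead:

```java
import java.util.HashMap;
import java.util.Map;

public class TrieSketch {
    static class Node {
        Map<Character, Node> children = new HashMap<>();
        int wordsBelow; // number of inserted strings passing through this node
    }

    private final Node root = new Node();

    public void insert(String word) {
        Node node = root;
        node.wordsBelow++;
        for (char c : word.toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
            node.wordsBelow++;
        }
    }

    // Count of inserted strings that start with the given prefix.
    public int countWithPrefix(String prefix) {
        Node node = root;
        for (char c : prefix.toCharArray()) {
            node = node.children.get(c);
            if (node == null) return 0;
        }
        return node.wordsBelow;
    }

    public static void main(String[] args) {
        TrieSketch trie = new TrieSketch();
        trie.insert("apple");
        trie.insert("applet");
        System.out.println(trie.countWithPrefix("app")); // prints 2
    }
}
```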



Source: https://stackoverflow.com/questions/14633286/efficient-substring-search-in-a-large-text-file-containing-100-millions-strings
