问题
Say I have a Set of Strings:
Set<String> things = new HashSet<String>();
things.add("coffee cup");
things.add("smartphone");
things.add("inkjet printer");
// :
// list could be quite large (100K or so, perhaps loaded from a database)
// :
Now I want to check if another string completely contains any of the Strings in the above set. So:
"a coffee cup" - matches
"android smartphone" - matches
"inkjet printer for sale" - matches
"laser printer" - does not match
"printer" - does not match
The only way I can think of is iterating through the set (and break-ing if found). Is there a more efficient and elegant way to do this?
回答1:
You need Aho-Corasick algorithm. http://en.wikipedia.org/wiki/Aho%E2%80%93Corasick_string_matching_algorithm
https://github.com/raymanrt/aho-corasick
Time complexity is O(m) for preprocessing (where m is total length of strings in the set) and O(n) for matching (where n is length of matched string). So it's asymptotically optimal.
回答2:
Iterate over all substrings of the candidate, and check whether the set contains them?
boolean containsSubstring(Set<String> set, String str) {
for (int i = 0; i < str.length; i++) {
for (int j = i + 1; j < str.length; j++) {
if (set.contains(str.substring(i,j))) {
return true;
}
}
}
return false;
}
Yes, a string of length k has k^2 substrings, but that may still be far less than the number of strings in the set ...
回答3:
I built off of @meriton's suggestion. Instead of every possible substring combination, I'm going to do every possible word combination.
Set<String> permutations = new HashSet<String>();
String [] arr = token.split(" ");
int size = arr.length;
for (int i = size ; i > 0; i--) {
for (int j = 0 ; j < i; j++) {
StringBuilder permutation = new StringBuilder();
permutation.append(arr[j]);
for (int k = j+1 ; k < i; k++) {
permutation.append(" ");
permutation.append(arr[k]);
}
permutations.add(permutation.toString());
}
}
If I run "inkjet printer for sale" through the code snippet above, I get:
- inkjet printer for sale
- printer for sale
- for sale
- sale
- inkjet printer for
- printer for
- for
- inkjet printer
- printer
- inkjet
Then I can do a simple contains() on the original set of words.
来源:https://stackoverflow.com/questions/18645261/how-to-do-a-efficiently-check-if-a-string-partially-exists-in-a-much-larger-set