Best way to test for existing string against a large list of comparables

試著忘記壹切 submitted on 2019-12-07 00:54:36

Personally I don't think 30 is particularly large for a regex so I wouldn't be too quick to rule it out. You can create the regex with a single line of code:

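using System;
using System.Text.RegularExpressions;

var acronyms = new[] { "AB", "BC", "CD", "ZZAB" };
var regex = new Regex(string.Join("|", acronyms), RegexOptions.Compiled);
// Walk through the successive, non-overlapping matches in the input.
for (var match = regex.Match("ZZZABCDZZZ"); match.Success; match = match.NextMatch())
    Console.WriteLine(match.Value);
// prints ZZAB and CD (AB and BC overlap those matches, so they are not reported)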

So the code is relatively elegant and maintainable. If you know the upper bound for the number of acronyms, I would do some testing; who knows what kinds of optimizations are already built into the regex engine. You'll also benefit for free from future regex engine optimizations. Unless you have reason to believe performance will be an issue, keep it simple.

On the other hand, regex has other limitations. For example, matches never overlap by default, so if you have acronyms AB, BC and CD, only two of them (AB and CD) are returned as matches in "ABCD". So it's good at telling you that an acronym is present, but you need to be careful if you have to catch every match.
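One possible way around the overlap limitation is to wrap the alternation in a lookahead with a capturing group, so that each match consumes no characters and the engine retries at every position. The following is only a sketch; the class and method names (`OverlappingAcronyms`, `FindAll`) are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

public static class OverlappingAcronyms
{
    // Hypothetical helper: returns acronym occurrences, including overlapping ones.
    public static List<string> FindAll(string input, IEnumerable<string> acronyms)
    {
        // Longest first, so that the longest alternative wins at a given position.
        var alternatives = acronyms
            .OrderByDescending(a => a.Length)
            .Select(a => Regex.Escape(a));

        // The lookahead matches an empty string; the acronym lands in group 1.
        var regex = new Regex("(?=(" + string.Join("|", alternatives) + "))");

        var found = new List<string>();
        foreach (Match match in regex.Matches(input))
            found.Add(match.Groups[1].Value);
        return found;
    }

    public static void Main()
    {
        var hits = FindAll("ABCD", new[] { "AB", "BC", "CD" });
        Console.WriteLine(string.Join(", ", hits)); // AB, BC, CD
    }
}
```

Note that this still reports at most one acronym per starting position, so if one acronym is a prefix of another (say AB and ABC), only the longer one is returned where both start at the same character.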

When performance became an issue for me (> 10,000 items) I put the 'acronyms' in a HashSet and then searched each substring of the text (from min acronym length to max acronym length). This was ok for me because the source text was very short. I'd not heard of it before, but at first look the Aho-Corasick algorithm, referred to in the question you reference, seems like a better general solution to this problem.
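A minimal sketch of that HashSet approach might look like the following. The names (`SubstringScan`, `FindAll`) are hypothetical, and it assumes ordinal, case-sensitive comparison and a short input text, since the work grows with the text length times the spread between the shortest and longest acronym.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class SubstringScan
{
    // Hypothetical helper: returns (index, acronym) for every occurrence, overlaps included.
    public static List<(int Index, string Acronym)> FindAll(string text, HashSet<string> acronyms)
    {
        var found = new List<(int Index, string Acronym)>();
        if (acronyms.Count == 0) return found;

        int min = acronyms.Min(a => a.Length);
        int max = acronyms.Max(a => a.Length);

        for (int start = 0; start < text.Length; start++)
        {
            // Try every substring length between the shortest and longest acronym.
            for (int len = min; len <= max && start + len <= text.Length; len++)
            {
                string candidate = text.Substring(start, len);
                if (acronyms.Contains(candidate))
                    found.Add((start, candidate));
            }
        }
        return found;
    }

    public static void Main()
    {
        var acronyms = new HashSet<string> { "AB", "BC", "CD", "ZZAB" };
        foreach (var (index, acronym) in FindAll("ZZZABCDZZZ", acronyms))
            Console.WriteLine($"{index}: {acronym}");
        // prints 1: ZZAB, 3: AB, 4: BC, 5: CD (one per line)
    }
}
```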

If the acronyms have a fixed size (as in the example above), you could calculate a hash for each of them (this can be done once per application lifetime), then split the string into overlapping pieces of that size and calculate hashes for them too. Then all you'd have to do is look up the values from one array in the other.

You could probably create a suffix/prefix tree or something similar from the acronyms and search using that; there are plenty of algorithms described on Wikipedia to do just that.

You could also create a deterministic automaton for each of the acronyms, but that is very similar to the previous approach.

Why not simply split the string and compare the returned list? It seems like needless overhead to use a REGEX in this case. I know your format may differ, but it would seem that you could:

  • Split the string based on the 'title separator', in your case a colon :
  • Take the 2nd half of the result, the acronym string, and split it based on the acronym separator, in this case a pipe |
  • Finally, iterate over the newly split list of acronyms and compare each to your list of candidates with a nested for loop
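The steps above could be sketched roughly as follows. The input shape ("Some title: AB|CD|EF") and the names `SplitAndCompare` and `FindKnown` are assumptions made for illustration, not the asker's actual format.

```csharp
using System;
using System.Collections.Generic;

public static class SplitAndCompare
{
    // Hypothetical helper; assumes input shaped like "Some title: AB|CD|EF".
    public static List<string> FindKnown(string input, string[] candidates)
    {
        var found = new List<string>();

        // Step 1: split on the title separator (first colon only).
        string[] halves = input.Split(new[] { ':' }, 2);
        if (halves.Length < 2) return found; // no separator, so no acronym list

        // Step 2: split the second half on the acronym separator.
        string[] listed = halves[1].Split('|');

        // Step 3: compare each listed acronym to the candidates with a nested loop.
        foreach (string item in listed)
        {
            string acronym = item.Trim();
            foreach (string candidate in candidates)
            {
                if (string.Equals(acronym, candidate, StringComparison.Ordinal))
                {
                    found.Add(acronym);
                    break;
                }
            }
        }
        return found;
    }

    public static void Main()
    {
        var hits = FindKnown("Quarterly report: AB|XY|CD", new[] { "AB", "BC", "CD" });
        Console.WriteLine(string.Join(", ", hits)); // AB, CD
    }
}
```

For a large candidate list, the inner loop could be replaced with a `HashSet<string>` lookup, which avoids scanning every candidate for each listed acronym.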

EDIT: If you only need to know whether a particular acronym or set of acronyms exists inside a string, use the .IsMatch() method instead of .Match().

The regex approach seems efficient and elegant enough. Of course, you'll have to watch out for unescaped metacharacters when building the expression (passing each acronym through Regex.Escape takes care of that), or a failure to compile it because of complexity or size limitations.

Another way to do this would be to construct a trie data structure to represent all the acronyms (this may somewhat duplicate what the regex matcher is doing). As you step through each character in the string, you would create a new pointer to the root of the trie, advance each existing pointer to the appropriate child, and drop any pointer that has no matching child. You get a match when any pointer reaches a node that marks the end of an acronym (usually a leaf, but not always, since one acronym may be a prefix of another).
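A rough sketch of the trie idea, with one "pointer" per possible starting position, might look like this. The class names (`AcronymTrie`, `TrieDemo`) are hypothetical, and this is the naive multi-pointer version described here rather than a full Aho-Corasick automaton.

```csharp
using System;
using System.Collections.Generic;

public sealed class AcronymTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public string Acronym; // non-null when an acronym ends at this node
    }

    private readonly Node root = new Node();

    public void Add(string acronym)
    {
        Node node = root;
        foreach (char c in acronym)
        {
            if (!node.Children.TryGetValue(c, out Node child))
            {
                child = new Node();
                node.Children[c] = child;
            }
            node = child;
        }
        node.Acronym = acronym;
    }

    // Returns every acronym occurrence in the text, overlaps included.
    public List<string> FindAll(string text)
    {
        var found = new List<string>();
        var active = new List<Node>();
        foreach (char c in text)
        {
            active.Add(root); // a new pointer starts at the root for every character
            var next = new List<Node>();
            foreach (Node node in active)
            {
                // Pointers with no matching child are simply dropped.
                if (node.Children.TryGetValue(c, out Node child))
                {
                    if (child.Acronym != null) found.Add(child.Acronym);
                    next.Add(child);
                }
            }
            active = next;
        }
        return found;
    }
}

public static class TrieDemo
{
    public static void Main()
    {
        var trie = new AcronymTrie();
        foreach (var a in new[] { "AB", "BC", "CD", "ZZAB" }) trie.Add(a);
        Console.WriteLine(string.Join(", ", trie.FindAll("ZZZABCDZZZ"))); // ZZAB, AB, BC, CD
    }
}
```

The number of live pointers is bounded by the length of the longest acronym, so each input character costs at most that many dictionary lookups.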

Here is what I came up with. I would appreciate any constructive criticism that you could offer...

First, I create an enum that holds each of my acronyms:

enum acronym
{ AB1,DE2,CC3 }

Next I create a string array of the enum:

string[] acronyms = Enum.GetNames(typeof(acronym));

Finally, I loop through the string array and call the Regex.Match method:

foreach (string a in acronyms)
{
    // Enum names are valid identifiers, so they need no regex escaping.
    Match aMatch = Regex.Match(input, a);
    if (aMatch.Success)
    {
        // ...<do something>...
        break; // stops at the first acronym found
    }
}

See anything wrong with that?
