I want to remove an array of stop words from an input string, and I have the following procedure:
string[] arrToCheck = new string[] { "try ", "yourself", ...
Here you go:
var words_to_remove = new HashSet<string> { "try", "yourself", "before" };
string input = "Did you try this yourself before asking";
string output = string.Join(
" ",
input
.Split(new[] { ' ', '\t', '\n', '\r' /* etc... */ })
.Where(word => !words_to_remove.Contains(word))
);
Console.WriteLine(output);
This prints:
Did you this asking
The HashSet provides constant-time lookups on average, so 450 elements in words_to_remove should be no problem at all. Also, we are traversing the input string only once (instead of once per word to remove, as in your example).
However, if the input string is very long, there are ways to make this more memory efficient (if not quicker), by not holding the split result in memory all at once.
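One way this could look, as a rough sketch rather than a tuned implementation: yield the words lazily from an iterator and append the survivors to a StringBuilder, so the full array of split results never exists. The names StopWordFilter, EnumerateWords and RemoveWords are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper, not part of the original answer.
static class StopWordFilter
{
    // Yields whitespace-delimited words one at a time instead of building an array.
    public static IEnumerable<string> EnumerateWords(string input)
    {
        int start = -1; // index where the current word began, or -1 between words
        for (int i = 0; i < input.Length; i++)
        {
            if (char.IsWhiteSpace(input[i]))
            {
                if (start >= 0)
                {
                    yield return input.Substring(start, i - start);
                    start = -1;
                }
            }
            else if (start < 0)
            {
                start = i;
            }
        }
        if (start >= 0)
            yield return input.Substring(start); // last word, if the input does not end in whitespace
    }

    // Joins the words that are not in wordsToRemove with single spaces.
    public static string RemoveWords(string input, ISet<string> wordsToRemove)
    {
        var sb = new StringBuilder();
        foreach (string word in EnumerateWords(input))
        {
            if (wordsToRemove.Contains(word)) continue;
            if (sb.Length > 0) sb.Append(' ');
            sb.Append(word);
        }
        return sb.ToString();
    }
}
```

Note that, unlike the Split version above, this sketch collapses runs of whitespace into a single space. For the sample input it gives the same "Did you this asking".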
To remove not just "do" but also "doing", "does" etc., you'll have to include all these variants in words_to_remove. If you wanted to remove prefixes in a general way, this would be possible to do (relatively) efficiently using a trie of the words to remove (or alternatively a suffix tree of the input string), but what to do when "do" is not a prefix of something that should be removed, such as "did"? Or when it is a prefix of something that shouldn't be removed, such as "dog"?
BTW, to remove words no matter their case, simply pass the appropriate case-insensitive comparer to the HashSet constructor, for example StringComparer.CurrentCultureIgnoreCase.
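To make the prefix idea (and its problem) concrete, here is a minimal trie sketch. PrefixTrie is a hypothetical class written for this illustration, not a framework type, and it only answers "does any stored prefix start this word?".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal trie for illustration only.
sealed class PrefixTrie
{
    private sealed class Node
    {
        public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
        public bool IsEnd; // true if a stored prefix ends at this node
    }

    private readonly Node root = new Node();

    // Stores one prefix, creating nodes along its path as needed.
    public void Add(string prefix)
    {
        Node node = root;
        foreach (char c in prefix)
        {
            if (!node.Children.TryGetValue(c, out Node next))
            {
                next = new Node();
                node.Children[c] = next;
            }
            node = next;
        }
        node.IsEnd = true;
    }

    // Returns true if some stored prefix is a prefix of word (or equal to it).
    public bool MatchesPrefixOf(string word)
    {
        Node node = root;
        foreach (char c in word)
        {
            if (node.IsEnd) return true;
            if (!node.Children.TryGetValue(c, out node)) return false;
        }
        return node.IsEnd;
    }
}
```

After trie.Add("do"), MatchesPrefixOf returns true for "doing" and "does", but also true for "dog" and false for "did", which is exactly the problem described above.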
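For example, something along these lines (the class and method names are made up for the illustration):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical wrapper so the snippet is self-contained.
static class CaseInsensitiveExample
{
    public static string RemoveWordsIgnoringCase(string input)
    {
        // The comparer passed to the constructor decides how Contains compares strings.
        var wordsToRemove = new HashSet<string>(StringComparer.CurrentCultureIgnoreCase)
        {
            "try", "yourself", "before"
        };

        return string.Join(
            " ",
            input.Split(' ').Where(word => !wordsToRemove.Contains(word)));
    }
}
```

With this, "Did you TRY this Yourself BEFORE asking" should come out as "Did you this asking". If you don't want the result to depend on the current culture, StringComparer.OrdinalIgnoreCase is the culture-independent option.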
Here is another alternative:
var words_to_remove = new[] { " ", "try", "yourself", "before" }; // Note the space!
string input = "Did you try this yourself before asking";
string output = string.Join(
" ",
input.Split(words_to_remove, StringSplitOptions.RemoveEmptyEntries)
);
I'm guessing it should be slower (unless string.Split uses a hashtable internally), but it is nice and tidy ;)
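One caveat with this Split-based variant: the separators are matched as substrings, not as whole words, so a stop word is also cut out of the middle of a longer word. A small sketch to show it (SplitCaveatDemo and RemoveBySplit are names invented for this example):

```csharp
using System;

// Hypothetical wrapper around the Split-based approach above.
static class SplitCaveatDemo
{
    public static string RemoveBySplit(string input)
    {
        // Each entry is a separator; the space is what splits the input into words.
        var separators = new[] { " ", "try", "yourself", "before" };
        return string.Join(
            " ",
            input.Split(separators, StringSplitOptions.RemoveEmptyEntries));
    }
}
```

RemoveBySplit("Did you try this yourself before asking") gives "Did you this asking" as expected, but RemoveBySplit("a country road") gives "a coun road". If your input can contain words like that, the HashSet approach is the safer choice.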