Remove stop words from text in C#

忘掉有多难 2020-12-21 15:41

I want to remove an array of stop words from an input string, and I have the following procedure:

string[] arrToCheck = new string[] { "try ", "yourself", "


        
6 answers
  •  太阳男子
    2020-12-21 15:57

    Here you go:

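    // Requires: using System; using System.Collections.Generic; using System.Linq;
    var words_to_remove = new HashSet<string> { "try", "yourself", "before" };
    string input = "Did you try this yourself before asking";
    
    // Split on whitespace, drop the stop words, and re-join with single spaces.
    string output = string.Join(
        " ",
        input
            .Split(new[] { ' ', '\t', '\n', '\r' /* etc... */ })
            .Where(word => !words_to_remove.Contains(word))
    );
    
    Console.WriteLine(output);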
    

    This prints:

    Did you this asking
    

    A HashSet<string> provides average constant-time lookups, so 450 elements in words_to_remove should be no problem at all. Also, the input string is traversed only once (instead of once per word to remove, as in your example).

    However, if the input string is very long, there are ways to make this more memory efficient (if not quicker), by not holding the split result in memory all at once.
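
    As a rough sketch of that idea (not a tuned implementation), the words can be read lazily from a TextReader and written straight to a TextWriter, so only the current word is buffered. The names StreamingStopWords, ReadWords and Filter are made up for this illustration, and the sketch assumes words are separated by whitespace only:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Text;

    public static class StreamingStopWords
    {
        // Hypothetical helper: lazily yields whitespace-separated words,
        // keeping only the word currently being read in memory.
        public static IEnumerable<string> ReadWords(TextReader reader)
        {
            var word = new StringBuilder();
            int c;
            while ((c = reader.Read()) != -1)
            {
                if (char.IsWhiteSpace((char)c))
                {
                    if (word.Length > 0)
                    {
                        yield return word.ToString();
                        word.Clear();
                    }
                }
                else
                {
                    word.Append((char)c);
                }
            }
            // Emit the last word if the input does not end with whitespace.
            if (word.Length > 0)
            {
                yield return word.ToString();
            }
        }

        // Hypothetical helper: copies every non-stop word to the writer,
        // separated by single spaces.
        public static void Filter(TextReader reader, TextWriter writer, ISet<string> stopWords)
        {
            bool first = true;
            foreach (string word in ReadWords(reader))
            {
                if (stopWords.Contains(word)) continue;
                if (!first) writer.Write(' ');
                writer.Write(word);
                first = false;
            }
        }

        public static void Main()
        {
            var stopWords = new HashSet<string> { "try", "yourself", "before" };
            using (var reader = new StringReader("Did you try this yourself before asking"))
            {
                Filter(reader, Console.Out, stopWords);
                Console.WriteLine();
                // prints: Did you this asking
            }
        }
    }
    ```

    For a real file, a StreamReader and StreamWriter could be passed in place of the StringReader and Console.Out.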

    To remove not just "do" but also "doing", "does", etc., you'll have to include all of these variants in words_to_remove. If you wanted to remove prefixes in a general way, this could be done (relatively) efficiently using a trie of the words to remove (or, alternatively, a suffix tree of the input string). But what should happen when "do" is not a prefix of something that should be removed, such as "did"? Or when it is a prefix of something that shouldn't be removed, such as "dog"?
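
    To make the trade-off concrete, here is a minimal trie sketch that removes every word starting with one of the stored prefixes. PrefixTrie, HasPrefixOf and RemovePrefixed are names invented for this example, and it assumes words are separated by single spaces. Note how it shows both problems at once: "dog" is removed, while "did" is kept.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public sealed class PrefixTrie
    {
        private sealed class Node
        {
            public readonly Dictionary<char, Node> Children = new Dictionary<char, Node>();
            public bool IsEnd; // true if a stored prefix ends at this node
        }

        private readonly Node root = new Node();

        public void Add(string prefix)
        {
            Node node = root;
            foreach (char c in prefix)
            {
                Node next;
                if (!node.Children.TryGetValue(c, out next))
                {
                    next = new Node();
                    node.Children[c] = next;
                }
                node = next;
            }
            node.IsEnd = true;
        }

        // True if any stored entry is a prefix of the word (or equal to it).
        public bool HasPrefixOf(string word)
        {
            Node node = root;
            foreach (char c in word)
            {
                if (node.IsEnd) return true;
                Node next;
                if (!node.Children.TryGetValue(c, out next)) return false;
                node = next;
            }
            return node.IsEnd;
        }
    }

    public static class PrefixDemo
    {
        // Hypothetical helper: drops every word that starts with a stored prefix.
        public static string RemovePrefixed(string input, PrefixTrie trie)
        {
            return string.Join(" ", input.Split(' ').Where(w => !trie.HasPrefixOf(w)));
        }

        public static void Main()
        {
            var trie = new PrefixTrie();
            trie.Add("do");

            Console.WriteLine(RemovePrefixed("I do like doing this but the dog did not", trie));
            // prints: I like this but the did not
        }
    }
    ```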

    BTW, to remove words regardless of case, simply pass an appropriate case-insensitive comparer to the HashSet constructor, for example StringComparer.CurrentCultureIgnoreCase.
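
    A short sketch of the case-insensitive variant follows. The class and method names are invented for this example, and it uses StringComparer.OrdinalIgnoreCase as an assumption (a culture-independent choice); StringComparer.CurrentCultureIgnoreCase can be swapped in the same way.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Linq;

    public static class CaseInsensitiveDemo
    {
        // Hypothetical helper: same split/filter/join as above, wrapped in a method.
        public static string RemoveStopWords(string input, ISet<string> stopWords)
        {
            return string.Join(
                " ",
                input
                    .Split(new[] { ' ', '\t', '\n', '\r' }, StringSplitOptions.RemoveEmptyEntries)
                    .Where(word => !stopWords.Contains(word)));
        }

        public static void Main()
        {
            // The comparer passed to the constructor decides how Contains() matches.
            var stopWords = new HashSet<string>(StringComparer.OrdinalIgnoreCase)
            {
                "try", "yourself", "before"
            };

            Console.WriteLine(RemoveStopWords("Did you TRY this Yourself BEFORE asking", stopWords));
            // prints: Did you this asking
        }
    }
    ```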

    --- EDIT ---

    Here is another alternative:

    var words_to_remove = new[] { " ", "try", "yourself", "before" }; // Note the space!
    string input = "Did you try this yourself before asking";
    
    string output = string.Join(
        " ",
        input.Split(words_to_remove, StringSplitOptions.RemoveEmptyEntries)
    );
    

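    I'm guessing this is slower (unless string.Split uses a hash table internally), but it is nice and tidy ;) Be aware that string.Split matches the separators as substrings rather than whole words, so "try" would also be cut out of a word such as "country".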
