Remove stop words from text in C#

忘掉有多难 2020-12-21 15:41

I want to remove an array of stop words from an input string, and I have the following procedure:

string[] arrToCheck = new string[] { "try ", "yourself", ...

6 answers
  •  旧巷少年郎
    2020-12-21 15:58

    There are a few aspects to this.

    Premature optimization
    The method given works and is easy to understand/maintain. Is it causing a performance problem? If not, then don't worry about it. If it ever causes a problem, then look at it.

    Expected Results
    In the example, what do you want the output to be?

    "Did you this asking"
    

    or

    "Did you  this   asking"
    

    You have added spaces to the end of "try" and "before" but not "yourself". Why? Is it a typo?

    string.Replace() is case-sensitive. If you care about casing, you need to modify the code.
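
    As a minimal sketch of one way to ignore casing, assuming the stop words are plain literals: Regex.Replace with RegexOptions.IgnoreCase works on both .NET Framework and newer runtimes. The class and method names here are hypothetical, and the stop words are the ones used in the tests further down.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class CaseInsensitiveRedactor
{
    // Removes every occurrence of each stop word, ignoring case.
    public static string Remove(string input, string[] stopWords)
    {
        foreach (var word in stopWords)
        {
            // Regex.Escape treats the stop word as literal text, not as a pattern.
            input = Regex.Replace(input, Regex.Escape(word), "", RegexOptions.IgnoreCase);
        }
        return input;
    }

    public static void Main()
    {
        var stopWords = new[] { "try", "yourself", "before" };
        // The surrounding spaces are left in place, as in the second expected result above.
        Console.WriteLine(Remove("Did you TRY this Yourself Before asking", stopWords));
    }
}
```

    On newer runtimes there is also a string.Replace overload that takes a StringComparison, which may be simpler if it is available to you.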

    Working with partials is messy.
    Words change form in different tenses. The example has 'do' being removed from 'doing' words, but what about 'take' and 'taking'? The order of the stop words matters because you are changing the input as you go. It is possible (I have no idea how likely, but it is possible) that a word which was not in the input before a change 'appears' in the input after the change. Do you want to go back and recheck each time?

    Do you really need to remove the partials?
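
    To make the ordering point concrete, here is a small sketch with a deliberately contrived input (an assumption for illustration, not taken from the question). Removing "before" from "tbeforery" leaves "try", so whether "try" survives depends on where it sits in the stop-word array. The class name is hypothetical.

```csharp
using System;

// Hypothetical demo class; names are illustrative only.
public static class DiscoveredWordDemo
{
    // Sequential string.Replace(): one pass per stop word, in array order.
    public static string Remove(string input, string[] stopWords)
    {
        foreach (var word in stopWords)
        {
            input = input.Replace(word, "");
        }
        return input;
    }

    public static void Main()
    {
        // Contrived input: deleting "before" from "tbeforery" leaves "try".
        var input = "tbeforery again";

        // "try" is checked first, while it is not yet present, so it survives.
        Console.WriteLine(Remove(input, new[] { "try", "before" }));  // prints "try again"

        // Reversed order: "try" is checked after it has appeared, so it is removed.
        Console.WriteLine(Remove(input, new[] { "before", "try" }));  // prints " again"
    }
}
```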

    Optimizations
    The current method is going to work its way through the input string n times, where n is the number of words to be redacted, creating a new string each time a replacement occurs. This is slow.

    Using StringBuilder (see akatakritos above) will speed that up somewhat, so I would try this first. Retest to see if this makes it fast enough.
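
    A minimal sketch of the StringBuilder variant, assuming the same stop words as in the tests below (the class and method names are hypothetical). StringBuilder.Replace edits the buffer in place, so no new string is allocated per stop word. It is still case-sensitive and still removes partial matches.

```csharp
using System;
using System.Text;

// Hypothetical helper; names are illustrative only.
public static class StringBuilderRedactor
{
    public static string Remove(string input, string[] stopWords)
    {
        var sb = new StringBuilder(input);
        foreach (var word in stopWords)
        {
            // Ordinal, case-sensitive replacement performed on the builder's own buffer.
            sb.Replace(word, "");
        }
        return sb.ToString();
    }

    public static void Main()
    {
        var stopWords = new[] { "try", "yourself", "before" };
        // The spaces around the removed words stay, as in the second expected result.
        Console.WriteLine(Remove("Did you try this yourself before asking", stopWords));
    }
}
```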

    LINQ can also be used.

    EDIT
    Just splitting by ' ' to demonstrate. You would need to allow for punctuation marks as well and decide what should happen with them.
    END EDIT

    [TestMethod]
    public void RedactTextLinqNoPartials() {
    
        var arrToCheck = new string[] { "try", "yourself", "before" };
        var input = "Did you try this yourself before asking";
    
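        // Needs "using System.Linq;". Keeps only the words that are not stop words,
        // then rejoins them with single spaces.
        var output = string.Join(" ", input.Split(' ').Where(wrd => !arrToCheck.Contains(wrd)));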
    
        Assert.AreEqual("Did you this asking", output);
    
    }
    

    This removes all the whole words (and their spaces, so it is not possible to see where the words were removed), but without some benchmarking I would not say that it is faster.
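
    The EDIT note above leaves punctuation open. One possible way to handle it, sketched under the assumption that a "word" is a run of \w characters and that matching should ignore case: let a regex find the words and leave everything else untouched. The class name is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; names are illustrative only.
public static class WholeWordRedactor
{
    public static string Remove(string input, IEnumerable<string> stopWords)
    {
        // A HashSet gives constant-time lookups; the comparer makes them case-insensitive.
        var stopSet = new HashSet<string>(stopWords, StringComparer.OrdinalIgnoreCase);

        // \w+ matches runs of letters, digits and underscores.
        // Spaces and punctuation are not matched, so they pass through unchanged.
        return Regex.Replace(input, @"\w+", m => stopSet.Contains(m.Value) ? "" : m.Value);
    }

    public static void Main()
    {
        var stopWords = new[] { "try", "yourself", "before" };
        Console.WriteLine(Remove("Did you Try this yourself, before asking?", stopWords));
    }
}
```

    Only whole words are removed, so "asking" would survive an "ask" stop word. The leftover spaces and commas are kept deliberately. What to do with them is still your decision. Note that an apostrophe splits a word, so "don't" is seen as "don" and "t".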

    Handling partials with LINQ becomes messy, but it can work if we only want one pass (no checking for 'discovered' words).

    [TestMethod]
    public void RedactTextLinqPartials() {
    
        var arrToCheck = new string[] { "try", "yourself", "before", "ask" };
        var input = "Did you try this yourself before asking";
    
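        // For each word, strip the first stop word (in array order) found inside it.
        // Words that end up empty are dropped.
        var output = string.Join(" ", input.Split(' ').Select(wrd => {
            var found = arrToCheck.FirstOrDefault(chk => wrd.IndexOf(chk) != -1);
            return found != null
                   ? wrd.Replace(found, "")
                   : wrd;
        }).Where(wrd => wrd != ""));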
    
    
        Assert.AreEqual("Did you this ing", output);
    
    }
    

    Just from looking at this, I would say that it is slower than string.Replace(), but without some numbers there is no way to tell. It is definitely more complicated.

    Bottom Line
    The String.Replace() approach (modified to use StringBuilder and to be case-insensitive) looks like a good first-cut solution. Before trying anything more complicated, I would benchmark it under likely performance conditions.
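
    A rough Stopwatch-based sketch of such a benchmark follows. The workload (the sample sentence repeated 1,000 times, 500 iterations) is an assumption and should be replaced with your real input sizes. All names are hypothetical, and only the case-sensitive variants are compared.

```csharp
using System;
using System.Diagnostics;
using System.Linq;
using System.Text;

// Hypothetical harness; names are illustrative only.
public static class RedactBenchmark
{
    public static string WithReplace(string input, string[] stopWords)
    {
        foreach (var word in stopWords)
        {
            input = input.Replace(word, "");
        }
        return input;
    }

    public static string WithStringBuilder(string input, string[] stopWords)
    {
        var sb = new StringBuilder(input);
        foreach (var word in stopWords)
        {
            sb.Replace(word, "");
        }
        return sb.ToString();
    }

    // Runs the action once as a warm-up (so JIT compilation is not timed),
    // then measures the given number of iterations.
    public static TimeSpan Time(Func<string> action, int iterations)
    {
        action();
        var watch = Stopwatch.StartNew();
        for (var i = 0; i < iterations; i++)
        {
            action();
        }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        var stopWords = new[] { "try", "yourself", "before" };
        // Assumed workload: the sample sentence repeated 1,000 times.
        var input = string.Join(" ", Enumerable.Repeat("Did you try this yourself before asking", 1000));

        Console.WriteLine("string.Replace: " + Time(() => WithReplace(input, stopWords), 500));
        Console.WriteLine("StringBuilder:  " + Time(() => WithStringBuilder(input, stopWords), 500));
    }
}
```

    Timings from a loop like this are only indicative. A dedicated benchmarking library would give more trustworthy numbers if the difference turns out to matter.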

    hth,
    Alan.
