Replace Bad words using Regex

小蘑菇 2021-01-11 14:32

I am trying to create a bad-word filter method that I can call before every insert and update to check a string for bad words and replace them with "[Censored]".

4 Answers
  •  难免孤独
    2021-01-11 15:11

    Please see this "clbuttic" (or for your case cl[Censored]ic) article before doing a string replace without considering word boundaries:

    http://www.codinghorror.com/blog/2008/10/obscenity-filters-bad-idea-or-incredibly-intercoursing-bad-idea.html

    Update

    The following is obviously not foolproof (see the article above: this approach is easy to get around and prone to false positives) or optimized (the regular expressions should be cached and compiled), but it filters out whole words (no "clbuttics") and their simple plurals:

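    using System;
    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;

    const string CensoredText = "[Censored]";
    const string PatternTemplate = @"\b({0})(s?)\b";
    const RegexOptions Options = RegexOptions.IgnoreCase;

    string[] badWords = new[] { "cranberrying", "chuffing", "ass" };

    // One case-insensitive, whole-word matcher per bad word (optional plural "s").
    // Regex.Escape keeps any metacharacters in a word from being read as pattern syntax.
    IEnumerable<Regex> badWordMatchers = badWords.
        Select(x => new Regex(string.Format(PatternTemplate, Regex.Escape(x)), Options));

    string input = "I've had no cranberrying sleep for chuffing chuffings days - " +
        "the next door neighbour is playing classical music at full tilt!";

    // Run each matcher in turn over the result of the previous one.
    string output = badWordMatchers.
       Aggregate(input, (current, matcher) => matcher.Replace(current, CensoredText));

    Console.WriteLine(output);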
    

    Gives the output:

    I've had no [Censored] sleep for [Censored] [Censored] days - the next door neighbour is playing classical music at full tilt!

    Note that "classical" does not become "cl[Censored]ical", as whole words are matched with the regular expression.
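
    As noted earlier, the regular expressions should really be cached and compiled. A minimal sketch of one way to do that is shown here, assuming the word list is fixed at startup; the class name BadWordFilter and the method name Censor are hypothetical, chosen only for illustration. It joins all the words into a single alternation so the input is scanned once, and builds the Regex a single time with RegexOptions.Compiled:

    ```csharp
    using System;
    using System.Linq;
    using System.Text.RegularExpressions;

    // Hypothetical helper; the class and member names are illustrative only.
    public static class BadWordFilter
    {
        private const string CensoredText = "[Censored]";

        // Same sample word list as in the example above.
        private static readonly string[] BadWords = { "cranberrying", "chuffing", "ass" };

        // Built once and reused: one alternation, whole words only, optional plural "s".
        private static readonly Regex Matcher = new Regex(
            @"\b(?:" + string.Join("|", BadWords.Select(w => Regex.Escape(w))) + @")s?\b",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // Returns the input with every whole-word match replaced; null or empty input is returned as-is.
        public static string Censor(string input)
        {
            if (string.IsNullOrEmpty(input))
            {
                return input;
            }
            return Matcher.Replace(input, CensoredText);
        }
    }
    ```

    Something like BadWordFilter.Censor(text) could then be called before each insert or update. The same caveats apply: it only catches the exact words in the list.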

    Update 2

    To give a flavour of how easily this (and basic string/pattern matching in general) can be subverted, consider the following string:

    "I've had no cranberryıng sleep for chuffıng chuffıngs days - the next door neighbour is playing classical music at full tilt!"

    I have replaced each "i" with the Turkish lower-case dotless "ı". It still looks pretty offensive!
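
    One partial countermeasure is to fold known look-alike characters to plain letters before matching. The sketch below is only an assumption about how that could look: the class HomoglyphAwareFilter, its methods, and the four-entry look-alike table are all hypothetical, and a real table would need to be far larger. It closes this one gap and nothing more, so the general point about subversion still stands. Because each mapping is one character to one character, match positions in the folded copy line up with the original text:

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text;
    using System.Text.RegularExpressions;

    // Hypothetical helper; the names and the look-alike table are illustrative assumptions.
    public static class HomoglyphAwareFilter
    {
        private const string CensoredText = "[Censored]";

        // Tiny hand-picked sample of look-alikes, not a complete list.
        private static readonly Dictionary<char, char> LookAlikes = new Dictionary<char, char>
        {
            { '\u0131', 'i' }, // Turkish dotless i
            { '\u0430', 'a' }, // Cyrillic a
            { '\u0435', 'e' }, // Cyrillic ie
            { '\u043E', 'o' }, // Cyrillic o
        };

        private static readonly Regex Matcher = new Regex(
            @"\b(?:cranberrying|chuffing|ass)s?\b",
            RegexOptions.IgnoreCase | RegexOptions.Compiled);

        // Swaps each mapped character one-for-one, so indexes stay aligned with the input.
        public static string Fold(string input)
        {
            var folded = new StringBuilder(input.Length);
            foreach (char c in input)
            {
                folded.Append(LookAlikes.TryGetValue(c, out char plain) ? plain : c);
            }
            return folded.ToString();
        }

        // Matches against the folded copy but rebuilds the result from the original text,
        // so characters outside the matched words are left untouched.
        public static string Censor(string input)
        {
            string folded = Fold(input);
            var result = new StringBuilder();
            int last = 0;
            foreach (Match m in Matcher.Matches(folded))
            {
                result.Append(input, last, m.Index - last);
                result.Append(CensoredText);
                last = m.Index + m.Length;
            }
            result.Append(input, last, input.Length - last);
            return result.ToString();
        }
    }
    ```

    Matching on the folded copy while cutting from the original keeps legitimate Turkish or Cyrillic text intact wherever no bad word is found.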
