Profanity Filter using a Regular Expression (list of 100 words)

孤街浪徒 2020-12-31 17:53

What is the correct way to strip profane words from a string given:
1) I have a list of 100 words to look for in an array of strings. 2) What is the correct way to handl

2 answers
  •  无人及你
    2020-12-31 18:38

    This is quite a difficult problem to solve. You need to determine whether regular expressions will work for you and how you will handle embedding (a dictionary word joined to a profanity, like frackface but with the real F-word).

    Regular expressions generally have a limit on how long they can be, and this usually prevents you from using a single regex for all your words. Executing multiple regular expressions against each string can be really slow, depending on the performance you need and how big your blacklist gets. We initially implemented CleanSpeak as a regular-expression system, but it didn't scale, so we rewrote it using a different mechanism.
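
    For reference, the single-regex baseline this paragraph describes might look like the sketch below. It is only an illustration: the `BLOCKLIST` contents, `build_pattern` and `censor` are hypothetical names, "hello" stands in for a real profanity, and nothing here reflects how CleanSpeak works.

```python
import re

# Hypothetical stand-in entries; a real list would hold the ~100 words.
BLOCKLIST = ["hello", "hello there"]


def build_pattern(words):
    """Combine every entry into one case-insensitive alternation."""
    # Escape each entry and try longer entries first so that a phrase
    # wins over a shorter word that is its prefix.
    alternatives = sorted(map(re.escape, words), key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(alternatives) + r")\b", re.IGNORECASE)


PATTERN = build_pattern(BLOCKLIST)


def censor(text):
    """Replace each match with asterisks of the same length."""
    return PATTERN.sub(lambda m: "*" * len(m.group(0)), text)


print(censor("Hello world"))  # whole-word match is masked
print(censor("othello"))      # \b keeps embedded matches untouched
print(censor("h3llo"))        # obfuscated spelling slips through
```

    The word boundaries avoid flagging a word that merely contains an entry, but the pattern has no answer to obfuscated spellings, and it grows with every entry added to the list.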

    You also need to consider phrases, punctuation, spaces, leet-speak and other languages. All of these make regular expressions less appealing as a solution. Here are some examples using the word hello (assume it is profanity for this exercise):

    • h e l l o
    • h.e.l.l.o
    • h_e_l_l_o
    • |-|ello
    • h3llo
    • "hello there" (this phrase might not contain any profane words but combined they are profane)
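
    One common way to approach the spaced-out and leet-speak variants listed above is to normalize the text before matching. The sketch below is a minimal illustration under stated assumptions: `LEET_MAP` is a tiny made-up substitution table, and `normalize` is a hypothetical helper, not a description of any particular product.

```python
import re

# Hypothetical substitution table; a realistic one would be far larger.
LEET_MAP = {"|-|": "h", "3": "e", "1": "l", "0": "o"}

# Three or more single letters separated by one space, dot, underscore or hyphen.
SPACED_OUT = re.compile(r"\b(?:[a-z][ ._-]){2,}[a-z]\b")


def normalize(text):
    """Lower-case, undo leet substitutions, then rejoin spaced-out letters."""
    text = text.lower()
    # Apply longer substitutions first so multi-character ones are not broken up.
    for pattern in sorted(LEET_MAP, key=len, reverse=True):
        text = text.replace(pattern, LEET_MAP[pattern])
    # Strip the separators inside each run of single letters.
    return SPACED_OUT.sub(lambda m: re.sub(r"[ ._-]", "", m.group(0)), text)


for sample in ["h e l l o", "h.e.l.l.o", "h_e_l_l_o", "|-|ello", "h3llo"]:
    print(sample, "->", normalize(sample))
```

    Blind substitution has side effects (every "1" becomes "l", even in a time or a price), and normalization does nothing for the phrase case, which needs matching across several words.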

    You also need to handle edge cases where two or more dictionary (whitelist) words contain a profanity when next to each other. Some examples that contain the s-word:

    • bash it
    • ssh it's quiet time

    These are obviously not profanity, but most homegrown and many commercial solutions have problems with these cases.
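
    The same failure can be reproduced with the hello stand-in: "shell out" spells h-e-l-l-o across the word break. The sketch below contrasts two hypothetical checks, `naive_flag` (strip everything but letters, then search for substrings) and `token_flag` (compare whole words only); both names and the one-word `BLOCKED` set are assumptions made for illustration.

```python
import re

BLOCKED = {"hello"}  # stand-in for a real profanity, as in the examples above


def naive_flag(text):
    """Drop everything except letters, then look for blocked substrings."""
    squashed = re.sub(r"[^a-z]", "", text.lower())
    return any(word in squashed for word in BLOCKED)


def token_flag(text):
    """Flag only when a whole word equals a blocked entry."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return any(token in BLOCKED for token in tokens)


for sample in ["shell out a fortune", "othello", "hello there", "h e l l o"]:
    print(sample, "| naive:", naive_flag(sample), "| token:", token_flag(sample))
```

    Neither check is right on its own: the substring version flags the innocent "shell out", while the whole-word version misses the spaced-out "h e l l o". That tension is what makes these edge cases hard.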

    We have spent the last 3 years perfecting the filter used by CleanSpeak to ensure it handles all of these cases, and we continue to tweak it and make it better. We also spent 8 months tuning our system for performance, and it can handle about 5,000 messages per second. That's not to say you can't build something usable, but be prepared to handle a lot of issues that might come up, and to create a system that doesn't use regular expressions.
