Efficient way to search for invalid characters in python

萌比男神i 2021-01-07 03:10

I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan their whol

9 Answers
  • 2021-01-07 04:08

    For a regex solution, there are two ways to go here:

    1. Find one invalid char anywhere in the string.
    2. Validate every char in the string.

    Here is a script that implements both:

    import re
    topic_message = 'This topic is a-ok'
    
    # Option 1: Invalidate one char in string.
    re1 = re.compile(r"[<>/{}\[\]~`]")
    if re1.search(topic_message):
        print("RE1: Invalid char detected.")
    else:
        print("RE1: No invalid char detected.")
    
    # Option 2: Validate all chars in string.
    re2 = re.compile(r"^[^<>/{}\[\]~`]*$")
    if re2.match(topic_message):
        print("RE2: All chars are valid.")
    else:
        print("RE2: Not all chars are valid.")
    

    Take your pick.

    Note: the original regex erroneously has an unescaped right square bracket inside the character class. It needs to be escaped (\]), otherwise it closes the class early.
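
    The effect of an unescaped bracket can be checked directly. This is a minimal sketch that assumes an illustrative pattern with the same kind of mistake; the names broken and fixed are hypothetical and not taken from the question.

```python
import re

# Illustrative pattern with an unescaped "]": the class ends right after "[",
# so the regex means "one of < > / { } [" followed by the literal text "~`]".
broken = re.compile(r"[<>/{}[]~`]")

# Escaping both brackets keeps every character inside a single class.
fixed = re.compile(r"[<>/{}\[\]~`]")

message = "This topic is <not>-ok."
broken_hit = broken.search(message) is not None  # False: "<" is never followed by "~`]"
fixed_hit = fixed.search(message) is not None    # True: "<" is in the class

if __name__ == "__main__":
    print("broken flags message:", broken_hit)
    print("fixed flags message: ", fixed_hit)
```

    The broken pattern compiles without error, which is why this kind of mistake is easy to miss.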

    Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious which of these methods would actually be fastest, so I measured them with timeit. Below are the test data, the statements measured, and the resulting timings:

    Test data:

    r"""
    TEST topic_message STRINGS:
    ok:  'This topic is A-ok.     This topic is     A-ok.'
    bad: 'This topic is <not>-ok. This topic is {not}-ok.'
    
    MEASURED PYTHON STATEMENTS:
    Method 1: 're1.search(topic_message)'
    Method 2: 're2.match(topic_message)'
    Method 3: 'set(invalid_chars).intersection(topic_message)'
    """
    

    Results:

    r"""
    Seconds to perform 1000000 Ok-match/Bad-no-match loops:
    Method  Ok-time  Bad-time
    1        1.054    1.190
    2        1.830    1.636
    3        4.364    4.577
    """
    

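    The benchmarks show that Option 1 (re1.search) is the fastest. Option 2 (re2.match) takes roughly 1.4 to 1.7 times as long, and the set().intersection() method takes about four times as long as Option 1. This holds both for the string that contains invalid characters and for the one that does not.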
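
    The harness used for these numbers is not shown. A rough sketch of how the three statements could be timed with timeit follows. The value of invalid_chars and the run_benchmark helper are assumptions made for illustration, and the timings will differ by machine and Python version.

```python
import re
import timeit

# Assumption: the exact invalid_chars value used in the benchmark is not shown;
# this string mirrors the character class of the two regexes.
invalid_chars = "<>/{}[]~`"
re1 = re.compile(r"[<>/{}\[\]~`]")
re2 = re.compile(r"^[^<>/{}\[\]~`]*$")

test_strings = {
    "ok": "This topic is A-ok.     This topic is     A-ok.",
    "bad": "This topic is <not>-ok. This topic is {not}-ok.",
}

statements = {
    1: "re1.search(topic_message)",
    2: "re2.match(topic_message)",
    3: "set(invalid_chars).intersection(topic_message)",
}


def run_benchmark(number=100_000):
    """Return {(method, label): seconds} for each statement and test string."""
    results = {}
    for label, topic_message in test_strings.items():
        # Names the timed statements are allowed to see.
        namespace = {
            "re1": re1,
            "re2": re2,
            "invalid_chars": invalid_chars,
            "topic_message": topic_message,
        }
        for method, stmt in statements.items():
            results[(method, label)] = timeit.timeit(
                stmt, globals=namespace, number=number
            )
    return results


if __name__ == "__main__":
    timings = run_benchmark()
    print("Method  Ok-time  Bad-time")
    for method in statements:
        print(f"{method:<7} {timings[(method, 'ok')]:7.3f}  {timings[(method, 'bad')]:8.3f}")
```

    Passing number=1000000 to run_benchmark matches the loop count reported in the table, at the cost of a longer run.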

  • 2021-01-07 04:09

    I can't say which would be more efficient, but you should certainly get rid of the $ (unless it is itself one of the invalid characters). As written, the regex only matches when the invalid characters are at the end of topic_message, because $ anchors the match to the end of the string.
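
    A small sketch of the anchoring effect follows. The pattern here is hypothetical, since the regex from the question is not shown in full, but it illustrates how a trailing $ changes what search() finds.

```python
import re

# Hypothetical pattern: the "$" forces the invalid character to be the last one.
anchored = re.compile(r"[<>]$")
unanchored = re.compile(r"[<>]")

middle = "a <b> tag in the middle"
at_end = "ends with >"

# Each tuple is (found in `middle`, found in `at_end`).
anchored_results = (bool(anchored.search(middle)), bool(anchored.search(at_end)))
unanchored_results = (bool(unanchored.search(middle)), bool(unanchored.search(at_end)))

if __name__ == "__main__":
    print("anchored:  ", anchored_results)
    print("unanchored:", unanchored_results)
```

    With the anchor, the invalid characters in the middle of the first string go undetected.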

  • 2021-01-07 04:10

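    re.match and re.search behave differently: match only tries to match at the start of the string, while search scans the whole string. There is no need to split the message into words in order to search it with a regular expression.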

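    import re
    from django import forms
    
    # Match any single invalid character; both brackets are escaped.
    symbols_re = re.compile(r"[<>/{}\[\]~`]")
    
    # Assumed to run inside the form's clean_topic_message() method:
    if symbols_re.search(self.cleaned_data['topic_message']):
        raise forms.ValidationError("Invalid character in message.")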
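
    The difference between match() and search() can be seen with a short standalone check. This is a sketch only; the names invalid_re and message are made up for the example.

```python
import re

invalid_re = re.compile(r"[<>/{}\[\]~`]")
message = "Hello <world>"

# match() only tries the pattern at position 0, so it misses the "<".
at_start = invalid_re.match(message)

# search() scans the whole string and stops at the first invalid character.
anywhere = invalid_re.search(message)

if __name__ == "__main__":
    print("match: ", at_start)
    print("search:", anywhere)
```

    For detecting an invalid character anywhere in a post, search() is the call to use with this kind of pattern.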
    