Efficient way to search for invalid characters in Python

萌比男神i 2021-01-07 03:10

I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan their whol

9 answers
  •  被撕碎了的回忆
    2021-01-07 04:08

    For a regex solution, there are two ways to go here:

    1. Find one invalid char anywhere in the string.
    2. Validate every char in the string.

    Here is a script that implements both:

    import re
    topic_message = 'This topic is a-ok'
    
    # Option 1: Search for a single invalid char anywhere in the string.
    re1 = re.compile(r"[<>/{}[\]~`]")
    if re1.search(topic_message):
        print("RE1: Invalid char detected.")
    else:
        print("RE1: No invalid char detected.")
    
    # Option 2: Validate that every char in the string is allowed.
    re2 = re.compile(r"^[^<>/{}[\]~`]*$")
    if re2.match(topic_message):
        print("RE2: All chars are valid.")
    else:
        print("RE2: Not all chars are valid.")
    

    Take your pick.

    Note: the regex in the original question has an unescaped right square bracket inside the character class. It needs to be escaped (\]), as in the patterns above.
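
    If the post should be rejected with a message naming the offending characters, Option 1 can be wrapped in small helpers. The following is only a sketch: the names INVALID_CHARS, is_valid and find_invalid_chars are hypothetical and not part of the original answer, and the character class is the same one used in re1.

```python
import re

# Same character class as re1 in the script above: < > / { } [ ] ~ `
INVALID_CHARS = re.compile(r"[<>/{}[\]~`]")


def is_valid(text):
    """Return True when text contains none of the invalid characters."""
    return INVALID_CHARS.search(text) is None


def find_invalid_chars(text):
    """Return the distinct invalid characters in text, in order of first appearance."""
    found = []
    for ch in INVALID_CHARS.findall(text):
        if ch not in found:
            found.append(ch)
    return found


print(is_valid("This topic is a-ok"))                # True
print(find_invalid_chars("This topic is {not}-ok"))  # ['{', '}']
```

    is_valid() stops at the first hit, so it is the cheap check. find_invalid_chars() scans the whole string and is only worth calling once a post has already failed validation.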

    Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious which of these methods would actually be fastest, so I measured them. Here are the test data, the statements measured, and the timeit results:

    Test data:

    r"""
    TEST topic_message STRINGS:
    ok:  'This topic is A-ok.     This topic is     A-ok.'
    bad: 'This topic is -ok. This topic is {not}-ok.'
    
    MEASURED PYTHON STATEMENTS:
    Method 1: 're1.search(topic_message)'
    Method 2: 're2.match(topic_message)'
    Method 3: 'set(invalid_chars).intersection(topic_message)'
    """
    

    Results:

    r"""
    Seconds to perform 1000000 Ok-match/Bad-no-match loops:
    Method  Ok-time  Bad-time
    1        1.054    1.190
    2        1.830    1.636
    3        4.364    4.577
    """
    

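    The benchmarks show that Option 1 is moderately faster than Option 2 (about 1.7x on the ok string and 1.4x on the bad string), and both are considerably faster than the set().intersection() method, which takes roughly four times as long as Option 1. This holds both for strings that contain an invalid character and for strings that do not.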
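
    To repeat the measurement on your own machine, something along these lines should work. It is a sketch, not the script used for the numbers above: the answer does not show how invalid_chars was defined, so it is assumed here to be a string holding the same characters as the regex, and time_methods is a hypothetical helper name. The test strings are copied as they appear in the benchmark description. Absolute timings will differ by machine and Python version.

```python
import re
import timeit

# Assumption: invalid_chars is not shown in the answer; here it is taken to be
# the same characters as the regex character class.
invalid_chars = "<>/{}[]~`"
re1 = re.compile(r"[<>/{}[\]~`]")
re2 = re.compile(r"^[^<>/{}[\]~`]*$")

# Test strings as listed in the benchmark description.
ok = "This topic is A-ok.     This topic is     A-ok."
bad = "This topic is -ok. This topic is {not}-ok."

# The three measured statements, keyed by method number.
STATEMENTS = {
    1: "re1.search(topic_message)",
    2: "re2.match(topic_message)",
    3: "set(invalid_chars).intersection(topic_message)",
}


def time_methods(number=100_000):
    """Return {method: (ok_seconds, bad_seconds)} for `number` loops each."""
    results = {}
    for method, stmt in STATEMENTS.items():
        times = []
        for topic_message in (ok, bad):
            namespace = {
                "re1": re1,
                "re2": re2,
                "invalid_chars": invalid_chars,
                "topic_message": topic_message,
            }
            times.append(timeit.timeit(stmt, globals=namespace, number=number))
        results[method] = tuple(times)
    return results


print("Method  Ok-time  Bad-time")
for method, (ok_time, bad_time) in time_methods().items():
    print(f"{method:<7} {ok_time:7.3f}  {bad_time:8.3f}")
```

    The default of 100,000 loops keeps the run short. Pass number=1_000_000 to match the loop count used in the table above.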
