Efficient way to search for invalid characters in Python

I am building a forum application in Django and I want to make sure that users don't enter certain characters in their forum posts. I need an efficient way to scan their whol

9 answers

被撕碎了的回忆

2021-01-07 04:08

For a regex solution, there are two ways to go here:

Find one invalid char anywhere in the string.
Validate every char in the string.

Here is a script that implements both:

import re
topic_message = 'This topic is a-ok'

# Option 1: Invalidate one char in string.
re1 = re.compile(r"[<>/{}[\]~`]")
if re1.search(topic_message):
    print("RE1: Invalid char detected.")
else:
    print("RE1: No invalid char detected.")

# Option 2: Validate all chars in string.
re2 = re.compile(r"^[^<>/{}[\]~`]*$")
if re2.match(topic_message):
    print("RE2: All chars are valid.")
else:
    print("RE2: Not all chars are valid.")

Take your pick.

Note: the original regex erroneously has an unescaped right square bracket inside the character class. It needs to be escaped (\]), otherwise it closes the class early.
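To illustrate the note above, here is a minimal sketch comparing an unescaped and an escaped right square bracket. The names `broken`, `fixed` and `sample` are made up for this example, and the unescaped pattern is only an assumption about what such a mistake could look like, since the question's regex is not reproduced here.

```python
import re

# Unescaped "]": the class ends at the first "]", so this pattern means
# "one of < > / { } [" followed by the literal text "~`]".
broken = re.compile(r"[<>/{}[]~`]")

# Escaped "\]": the whole bracketed list is a single character class.
fixed = re.compile(r"[<>/{}[\]~`]")

sample = "a <b> tag"
print(broken.search(sample))         # None: "<" is not followed by "~`]"
print(fixed.search(sample).group())  # <
```

The broken pattern still compiles without an error, which is why this kind of mistake is easy to miss.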

Benchmarks: After seeing gnibbler's interesting solution using set(), I was curious to find out which of these methods would actually be fastest, so I decided to measure them. Here are the test data, the statements measured, and the timeit results:

Test data:

r"""
TEST topic_message STRINGS:
ok:  'This topic is A-ok.     This topic is     A-ok.'
bad: 'This topic is <not>-ok. This topic is {not}-ok.'

MEASURED PYTHON STATEMENTS:
Method 1: 're1.search(topic_message)'
Method 2: 're2.match(topic_message)'
Method 3: 'set(invalid_chars).intersection(topic_message)'
"""

Results:

r"""
Seconds to perform 1000000 Ok-match/Bad-no-match loops:
Method  Ok-time  Bad-time
1        1.054    1.190
2        1.830    1.636
3        4.364    4.577
"""

The benchmarks show that Option 1 is slightly faster than Option 2, and both are much faster than the set().intersection() method. This holds both for strings that match and for strings that don't.
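For anyone who wants to repeat the measurement, here is a rough sketch of how it could be set up with timeit. It is not the script that produced the numbers above: `INVALID_CHARS`, `SAMPLES`, `METHODS` and `run_benchmark` are names invented for this example, the default loop count is lower, and the lambda wrappers add some call overhead, so absolute timings will differ.

```python
import re
import timeit

# Assumed set of invalid characters, taken from the regexes in this answer.
INVALID_CHARS = "<>/{}[]~`"
re1 = re.compile(r"[<>/{}[\]~`]")
re2 = re.compile(r"^[^<>/{}[\]~`]*$")

SAMPLES = {
    "ok": "This topic is A-ok.     This topic is     A-ok.",
    "bad": "This topic is <not>-ok. This topic is {not}-ok.",
}

METHODS = {
    "1: re1.search": lambda s: re1.search(s),
    "2: re2.match": lambda s: re2.match(s),
    "3: set().intersection": lambda s: set(INVALID_CHARS).intersection(s),
}


def run_benchmark(number=100_000):
    """Return {(method, sample): seconds} for `number` calls of each pair."""
    results = {}
    for method_name, check in METHODS.items():
        for sample_name, text in SAMPLES.items():
            # timeit calls the lambda `number` times and returns total seconds.
            results[(method_name, sample_name)] = timeit.timeit(
                lambda: check(text), number=number
            )
    return results


if __name__ == "__main__":
    for (method_name, sample_name), seconds in run_benchmark().items():
        print(f"{method_name:<24} {sample_name:<4} {seconds:.3f}s")
```

Results depend on the Python version and the machine, so treat the output as a relative comparison only.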

既然无缘

2021-01-07 04:09

I can't say what would be more efficient, but you should certainly get rid of the $ (unless it's meant to be an invalid character for the message)... right now the regex only matches if the invalid character is at the end of topic_message, because $ anchors the match to the end of the line.
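A small sketch of the effect of the trailing $. The patterns here are hypothetical and use a shortened character class, because the question's own regex is not shown; `anchored` and `unanchored` are names chosen for this example.

```python
import re

# Hypothetical patterns for illustration only.
anchored = re.compile(r"[<>]$")   # invalid char must be the last character
unanchored = re.compile(r"[<>]")  # invalid char may appear anywhere

print(bool(anchored.search("a<b")))    # False
print(bool(anchored.search("ab<")))    # True
print(bool(unanchored.search("a<b")))  # True
```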

傲寒

2021-01-07 04:10
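re.match and re.search behave differently: re.match only matches at the beginning of the string, while re.search scans the whole string for a match. Splitting the message into words is not required to search using regular expressions.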
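```
import re
from django import forms

# Matches any single invalid character. "[" and "]" are escaped inside the
# class, and there is no leading "^", which would negate the class.
symbols_re = re.compile(r"[<>/\{}\[\]~`]")

# Inside the form's clean method:
if symbols_re.search(self.cleaned_data['topic_message']):
    raise forms.ValidationError("Topic message contains invalid characters.")
```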
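The difference between re.match and re.search can be seen in a short standalone sketch. The names `invalid_re` and `message` are made up for this example, and the character class is the one used elsewhere in this thread.

```python
import re

invalid_re = re.compile(r"[<>/{}[\]~`]")
message = "hello <world>"

# match() only tries at position 0, so the "<" further in is not found.
print(invalid_re.match(message))           # None

# search() scans the whole string and reports the first hit.
print(invalid_re.search(message).group())  # <

# match() succeeds only when the invalid character comes first.
print(invalid_re.match("<hello").group())  # <
```

For validation of a whole post, search() is the call that fits, since an invalid character can appear at any position.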