Regular Expression Wildcard Matching

甜味超标 2020-12-15 04:13

I have a list of about 120 thousand English words (basically every word in the language).

I need a regular expression that would allow searching through these words

9 answers

孤街浪徒 (OP)

2020-12-15 04:45
Here is a way to transform a wildcard pattern into a regex:
1. Escape every regex special character ([{\^-=$!|]}).+ with a preceding \ so that it is matched literally and the user does not get surprising results. Alternatively, enclose the literal parts of the pattern in \Q (which starts the quote) and \E (which ends it), if the regex flavor supports them. See also the paragraph about security.
2. Replace each * wildcard with \S*
3. Replace each ? wildcard with \S? (or with plain \S if ? must match exactly one character, as in shell globs)
4. Optionally: prepend ^ to the pattern. This anchors the match at the beginning of the string.
5. Optionally: append $ to the pattern. This anchors the match at the end of the string.
  
\S stands for a single non-whitespace character. The * after it means "zero or more times", and the ? after it means "zero or one time".
Consider using reluctant (non-greedy) quantifiers if there are characters to match after * or +. This is done by adding ? after the quantifier, like this: \S*? and \S+?
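
The steps above can be sketched in Python roughly as follows. This is only an illustration under a few assumptions: the function name `wildcard_to_regex` and its `anchor` parameter are made up for this example, and `re.escape` is used for step 1 because Python's `re` module does not support \Q...\E.

```python
import re

def wildcard_to_regex(wildcard: str, anchor: bool = True) -> str:
    """Translate a wildcard pattern into a regex string (hypothetical helper)."""
    parts = []
    for ch in wildcard:
        if ch == "*":
            parts.append(r"\S*")         # step 2: * becomes \S*
        elif ch == "?":
            parts.append(r"\S?")         # step 3: ? becomes \S?
        else:
            parts.append(re.escape(ch))  # step 1: everything else is literal
    body = "".join(parts)
    # steps 4 and 5: optionally anchor at both ends
    return f"^{body}$" if anchor else body

words = ["cat", "cart", "coat", "count", "dog"]
rx = re.compile(wildcard_to_regex("c*t"))
matches = [w for w in words if rx.search(w)]
print(matches)  # ['cat', 'cart', 'coat', 'count']
```

Compiling the pattern once and reusing it matters when scanning a list of 120 thousand words.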

Consider security: the user is effectively sending you code to run, because a regex is a kind of code and the user's string is used as the regex. Avoid passing the unescaped regex to any other part of the application, and use it only to filter data that was retrieved by other means. Otherwise the user can affect the speed of your code by embedding a different regex within the wildcard string, which could be used in DoS attacks.
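
To illustrate the escaping point, here is a small Python sketch. The input string is an invented example of a hostile pattern with nested quantifiers. After `re.escape`, it can only match itself literally.

```python
import re

# Invented hostile input: used as-is, this would be a regex with nested quantifiers.
user_input = "(a+)+$"

# After escaping, every metacharacter is just a literal character.
safe = re.escape(user_input)
print(safe)

literal_hit = re.search(safe, "x(a+)+$y")  # matches the literal text "(a+)+$"
no_hit = re.search(safe, "aaaa")           # no longer behaves like a regex
print(bool(literal_hit), no_hit is None)   # True True
```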

Example to show execution speeds of similar patterns:
```
seq 1 50000000 > ~/1
du -sh ~/1                                                   # 563M
time grep -P '.*' ~/1 &>/dev/null                            # 6.65s
time grep -P '.*.*.*.*.*.*.*.*' ~/1 &>/dev/null              # 12.55s
time grep -P '.*..*..*..*..*.*' ~/1 &>/dev/null              # 31.14s
time grep -P '\S*.\S*.\S*.\S*.\S*\S*' ~/1 &>/dev/null        # 31.27s
```
I'd suggest against using .* simply because it can match anything, and usually things are separated with spaces.
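
A quick way to see the difference, sketched in Python with an invented two-word string:

```python
import re

text = "cat sat"

# '.' also matches the space, so the pattern spans both words.
greedy_any = re.fullmatch(r"c.*t", text)

# '\S' cannot cross whitespace, so the wildcard stays within one word.
non_space = re.fullmatch(r"c\S*t", text)

print(bool(greedy_any), bool(non_space))  # True False
```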