Regular expression generator/reducer?


A colleague posed an interesting question about an operational pain point we currently have, and I am curious whether there's anything out there (utility/library/algorith


8 Answers

走了就别回头了

2020-12-04 20:43

Taking the cue from the other two answers: if all you need is to match only the strings supplied, you are probably better off doing a straight string match (slow) or constructing a simple FSM that matches those strings (fast).

A regex engine actually builds an FSM and then matches your input against it, so if the inputs come from a previously known set, it is possible and often easier to build the FSM yourself instead of trying to auto-generate a regex.

Aho-Corasick has already been suggested. It is fast, but can be tricky to implement. How about putting all the strings in a Trie and then querying on that instead (since you are matching entire strings, not searching for substrings)?
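The trie idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming whole-string membership tests on a fixed set; the `Trie` class and the sample URLs are made up for the example and do not come from any library.

```python
class Trie:
    """Minimal trie for exact membership tests on a fixed set of strings."""

    _END = object()  # sentinel key marking "a stored string ends here"

    def __init__(self, words=()):
        self.root = {}
        for word in words:
            self.add(word)

    def add(self, word):
        # Walk down one character at a time, creating nodes as needed.
        node = self.root
        for ch in word:
            node = node.setdefault(ch, {})
        node[Trie._END] = True

    def __contains__(self, word):
        # Follow the characters; fail as soon as a branch is missing.
        node = self.root
        for ch in word:
            node = node.get(ch)
            if node is None:
                return False
        # Only a full stored string counts, not a mere prefix.
        return Trie._END in node


known_urls = Trie([
    "http://www.example.com",
    "http://www.example.com/subdir",
    "http://foo.example.com",
])

print("http://www.example.com/subdir" in known_urls)
print("http://www.example.com/sub" in known_urls)
```

A lookup costs one dictionary step per character of the query, regardless of how many strings are stored. For plain exact matching, a built-in `set` of strings would likely do just as well; the trie mainly pays off when you also want prefix queries or shared-prefix compression.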

清酒与你

2020-12-04 20:49
An easy way to do this is to use Python's hachoir_regex module:
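```
from functools import reduce  # reduce is not a builtin on Python 3
import hachoir_regex
urls = ['http://www.example.com','http://www.example.com/subdir','http://foo.example.com']
# Parse each URL into a regex object, then OR the objects together.
as_regex = [hachoir_regex.parse(url) for url in urls]
reduce(lambda x, y: x | y, as_regex)
```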
creates the simplified regular expression
```
http://(www.example.com(|/subdir)|foo.example.com)
```
The code first parses each URL into a regex object, then combines these with | (alternation) in the reduce step.
醉话见心

2020-12-04 20:50

An automatic generator for regular expressions is available here. The tool has a web interface and uses Genetic Programming to generate regexes from a small set of examples; you can choose between a syntax ready for Java or JavaScript regex engines. It has been developed by our research group and was presented at the GECCO 2012 conference.

深忆病人

2020-12-04 20:50

I think it would make sense to take a step back and think about what you're doing, and why.

To match all those URLs, only those URLs and no others, you don't need a regexp; you can probably get acceptable performance from doing exact string comparisons over each item in your list of URLs.

If you do need regexps, then what are the variable differences you're trying to accommodate? I.e. which part of the input must match verbatim, and where is there wiggle room?

If you really do want to use a regexp to match a fixed list of strings, perhaps for performance reasons, then it should be simple enough to write a method that glues all your input strings together as alternatives, as in your example. The state machine doing regexp matching behind the scenes is quite clever and will not run more slowly if your match alternatives have common (and thus possibly redundant) substrings.
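A method that glues the input strings together as alternatives might look like the following. It is a rough sketch using Python's standard `re` module; the helper name `build_alternation` and the sample URLs are assumptions for illustration. Each string is escaped first so that characters such as `.` are matched literally.

```python
import re


def build_alternation(strings):
    """Compile one regex that matches any of the given literal strings."""
    strings = list(strings)
    if not strings:
        raise ValueError("need at least one string to match")
    # re.escape keeps regex metacharacters such as '.' literal.
    escaped = [re.escape(s) for s in strings]
    return re.compile("(?:" + "|".join(escaped) + ")")


urls = [
    "http://www.example.com",
    "http://www.example.com/subdir",
    "http://foo.example.com",
]
matcher = build_alternation(urls)

# fullmatch requires the whole input to be one of the alternatives.
print(matcher.fullmatch("http://www.example.com/subdir") is not None)
print(matcher.fullmatch("http://wwwXexample.com") is not None)
```

How well a plain alternation performs depends on the regex engine, so it is worth measuring with your real list before relying on it.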

[愿得一人]

2020-12-04 20:52

If you want to compare against all the strings in a set and only against those, use a trie, or compressed trie, or even better a directed acyclic word graph. The latter should be particularly efficient for URLs IMO.

You would have to abandon regexps though.

长发绾君心

2020-12-04 20:55

The Emacs utility function regexp-opt (source code) does not do exactly what you want (it only works on fixed strings), but it might be a useful starting point.
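For fixed strings, one common way to get a reduced regex is to put the strings in a trie and emit the trie as nested groups, so shared prefixes are written only once. The sketch below shows that idea in Python; it is not a port of regexp-opt, and the function names `build_trie` and `trie_to_regex` are made up for the example.

```python
import re


def build_trie(strings):
    """Nested dicts keyed by character; the empty key marks end of a string."""
    root = {}
    for s in strings:
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[""] = {}
    return root


def trie_to_regex(node):
    """Emit a regex source string that matches exactly the strings in the trie."""
    ends_here = "" in node
    branches = [
        re.escape(ch) + trie_to_regex(child)
        for ch, child in sorted(node.items())
        if ch != ""
    ]
    if not branches:
        return ""
    if len(branches) == 1 and not ends_here:
        return branches[0]  # single path, no grouping needed
    group = "(?:" + "|".join(branches) + ")"
    # If a stored string ends at this node, everything after it is optional.
    return group + "?" if ends_here else group


urls = [
    "http://www.example.com",
    "http://www.example.com/subdir",
    "http://foo.example.com",
]
pattern_source = trie_to_regex(build_trie(urls))
pattern = re.compile(pattern_source)
print(pattern_source)
```

The common `http://` prefix appears once in the result and `/subdir` becomes an optional group. Only prefixes are factored here; shared suffixes such as `.example.com` are not merged. The recursion depth equals the longest string's length, so very long inputs would need an iterative version.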
