Efficient algorithm for converting a character set into an NFA/DFA

Submitted by ☆樱花仙子☆ on 2019-12-04 02:40:27

There are a number of ways to handle it. They all boil down to treating whole sets of characters at a time in the data structures, instead of ever enumerating the entire alphabet. It's also how you make scanners for Unicode in a reasonable amount of memory.

You have many choices for how to represent and process sets of characters. I'm presently working with a solution that keeps an ordered list of boundary conditions and corresponding target states. Operations on these lists run much faster than they would if you had to scan the entire alphabet at each juncture. In fact, it's fast enough that it runs in Python with acceptable speed.
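
As a rough illustration of the boundary-list idea (not the actual code from that solution), here is a minimal Python sketch. The class name RangeTransitions and its methods are made up for this example, and it assumes characters are handled as integer code points with None meaning "no transition".

```python
from bisect import bisect_right


class RangeTransitions:
    """Transitions of one state, stored as sorted boundaries (hypothetical).

    boundaries[i] is the first code point of segment i, and targets[i] is
    the state reached for any code point from boundaries[i] up to, but not
    including, boundaries[i + 1]. The last segment is open-ended.
    """

    def __init__(self):
        self.boundaries = [0]
        self.targets = [None]

    def _split(self, point):
        # Make sure a segment starts exactly at `point`.
        i = bisect_right(self.boundaries, point) - 1
        if self.boundaries[i] != point:
            self.boundaries.insert(i + 1, point)
            self.targets.insert(i + 1, self.targets[i])

    def add_range(self, lo, hi, target):
        # Map every code point in [lo, hi] (inclusive) to `target`.
        self._split(lo)
        self._split(hi + 1)
        for i, start in enumerate(self.boundaries):
            if lo <= start <= hi:
                self.targets[i] = target

    def lookup(self, ch):
        # Binary search instead of a table indexed by every character.
        i = bisect_right(self.boundaries, ord(ch)) - 1
        return self.targets[i]


state0 = RangeTransitions()
state0.add_range(ord("a"), ord("z"), 1)
state0.add_range(ord("0"), ord("9"), 2)
print(state0.boundaries)   # [0, 48, 58, 97, 123]
print(state0.lookup("q"))  # 1
```

The table grows with the number of distinct ranges, not with the size of the alphabet, so the same structure works unchanged for the full Unicode range.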

Look at what regular expression libraries like Google RE2 and TRE are doing.

I had the same problem with my scanner generator, so I came up with the idea of replacing intervals with ids, which are determined using an interval tree. For instance, the a..z range in a DFA can be represented as 97, 98, 99, ..., 122. Instead, I represent ranges as [97, 122], then build an interval tree out of them, so that in the end they are represented as ids that refer to the interval tree. Given the following RE: a..z+, we end up with this DFA:

0 -> a -> 1
0 -> b -> 1
0 -> c -> 1
0 -> ... -> 1
0 -> z -> 1

1 -> a -> 1
1 -> b -> 1
1 -> c -> 1
1 -> ... -> 1
1 -> z -> 1
1 -> E -> ACCEPT

Now compress intervals:

0 -> a..z -> 1

1 -> a..z -> 1
1 -> E -> ACCEPT
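
The compression step can be sketched roughly as follows. This is only an illustration in Python: the function name compress and the dict-of-characters input format are assumptions made for the example, not part of any scanner generator mentioned here.

```python
def compress(char_to_target):
    """Merge runs of consecutive characters that share a target state.

    char_to_target: dict mapping single characters to a target state.
    Returns a sorted list of (lo, hi, target) with inclusive code points.
    """
    ranges = []
    for ch in sorted(char_to_target):
        code, target = ord(ch), char_to_target[ch]
        last = ranges[-1] if ranges else None
        if last and last[1] == code - 1 and last[2] == target:
            # Extend the current run by one character.
            ranges[-1] = (last[0], code, target)
        else:
            # Start a new run.
            ranges.append((code, code, target))
    return ranges


# State 0 of the example: every letter a..z goes to state 1.
state0 = {chr(c): 1 for c in range(ord("a"), ord("z") + 1)}
print(compress(state0))  # [(97, 122, 1)]
```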

Extract all intervals from your DFA and build an interval tree out of them:

{
    "left": null,
    "middle": {
        "id": 0,
        "interval": ["a", "z"]
    },
    "right": null
}

Replace the actual intervals with their ids:

0 -> 0 -> 1
1 -> 0 -> 1
1 -> E -> ACCEPT
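
To show how the id-based table might be used at scan time, here is a small Python sketch. It swaps the interval tree for a sorted list with binary search, which is enough when the intervals are disjoint. It also assumes that "1 -> E -> ACCEPT" means state 1 is accepting at end of input. All names (INTERVALS, interval_id, matches) are hypothetical.

```python
from bisect import bisect_right

# Disjoint, sorted intervals extracted from the DFA; list index = interval id.
INTERVALS = [(ord("a"), ord("z"))]
STARTS = [lo for lo, _ in INTERVALS]

# The id-based DFA from above: (state, interval id) -> next state.
TRANSITIONS = {(0, 0): 1, (1, 0): 1}
ACCEPTING = {1}  # assumption: "1 -> E -> ACCEPT" marks state 1 as accepting


def interval_id(ch):
    """Return the id of the interval containing ch, or None if there is none."""
    code = ord(ch)
    i = bisect_right(STARTS, code) - 1
    if i >= 0 and code <= INTERVALS[i][1]:
        return i
    return None


def matches(text):
    """Run the DFA over text, following one interval id per character."""
    state = 0
    for ch in text:
        state = TRANSITIONS.get((state, interval_id(ch)))
        if state is None:
            return False
    return state in ACCEPTING


print(matches("hello"))  # True
print(matches("hello1"))  # False
```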

In this library (http://mtimmerm.github.io/dfalex/) I do it by putting a range of consecutive characters on each transition, instead of single characters. This is carried through all the steps of NFA construction, NFA->DFA conversion, DFA minimization, and optimization.

It's quite compact, but it adds code complexity to every step.
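
One place where ranges add complexity is NFA->DFA conversion, because ranges coming from different NFA states can overlap and have to be cut into disjoint pieces first. The Python sketch below shows that one step in a generic way. It is not taken from dfalex, and the function name split_ranges and the tuple layout are assumptions made for illustration.

```python
def split_ranges(ranges):
    """Cut possibly overlapping ranges into disjoint pieces.

    ranges: list of (lo, hi, target) with inclusive code points.
    Returns a sorted list of (lo, hi, targets), where targets is the
    frozenset of every target whose range covers that piece.
    """
    # Every range start, and every position just past a range end,
    # is a point where the set of reachable targets may change.
    points = sorted({lo for lo, _, _ in ranges} | {hi + 1 for _, hi, _ in ranges})
    pieces = []
    for start, end in zip(points, points[1:]):
        targets = frozenset(
            t for lo, hi, t in ranges if lo <= start and end - 1 <= hi
        )
        if targets:  # skip gaps that no range covers
            pieces.append((start, end - 1, targets))
    return pieces


# a..z goes to NFA state 1, e..h also goes to NFA state 2.
for lo, hi, targets in split_ranges([(97, 122, 1), (101, 104, 2)]):
    print(chr(lo), chr(hi), sorted(targets))
# a d [1]
# e h [1, 2]
# i z [1]
```

Each resulting piece has a single well-defined set of target states, which is what the subset construction needs in order to create one DFA transition per piece.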
