Short example of regular expression converted to a state machine?

问题

In the Stack Overflow podcast #36 (http://blog.stackoverflow.com/2009/01/podcast-36/), this opinion was expressed: Once you understand how easy it is to set up a state machine, you’ll never try to use a regular expression inappropriately ever again.

I've done a bunch of searching. I've found some academic papers and other complicated examples, but I'd like to find a simple example that would help me understand this process. I use a lot of regular expressions, and I'd like to make sure I never use one "inappropriately" ever again.

回答1:

Sure, although you'll need more complicated examples to truly understand how REs work. Consider the following RE:

^[A-Za-z][A-Za-z0-9_]*$

which is a typical identifier (must start with alpha and can have any number of alphanumeric and undescore characters following, including none). The following pseudo-code shows how this can be done with a finite state machine:

state = FIRSTCHAR
for char in all_chars_in(string):
    if state == FIRSTCHAR:
            if char is not in the set "A-Z" or "a-z":
                error "Invalid first character"
            state = SUBSEQUENTCHARS
            next char
    if state == SUBSEQUENTCHARS:
            if char is not in the set "A-Z" or "a-z" or "0-9" or "_":
                error "Invalid subsequent character"
            state = SUBSEQUENTCHARS
            next char

Now, as I said, this is a very simple example. It doesn't show how to do greedy/nongreedy matches, backtracking, matching within a line (instead of the whole line) and other more esoteric features of state machines that are easily handled by the RE syntax.

That's why REs are so powerful. The actual finite state machine code required to do what a one-liner RE can do is usually very long and complex.

The best thing you could do is grab a copy of some lex/yacc (or equivalent) code for a specific simple language and see the code it generates. It's not pretty (it doesn't have to be since it's not supposed to be read by humans, they're supposed to be looking at the lex/yacc code) but it may give you a better idea as to how they work.

回答2:

A rather convenient way to help look at this to use python's little-known re.DEBUG flag on any pattern:

>>> re.compile(r'<([A-Z][A-Z0-9]*)\b[^>]*>(.*?)</\1>', re.DEBUG)
literal 60
subpattern 1
  in
    range (65, 90)
  max_repeat 0 65535
    in
      range (65, 90)
      range (48, 57)
at at_boundary
max_repeat 0 65535
  not_literal 62
literal 62
subpattern 2
  min_repeat 0 65535
    any None
literal 60
literal 47
groupref 1
literal 62

The numbers after 'literal' and 'range' refer to the integer values of the ascii characters they're supposed to match.

回答3:

Make your own on the fly!

http://osteele.com/tools/reanimator/???

This is a really nicely put together tool which visualises regular expressions as FSMs. It doesn't have support for some of the syntax you'll find in real-world regular expression engines, but certainly enough to understand exactly what's going on.

回答4:

Is the question "How do I choose the states and the transition conditions?", or "How do I implement my abstract state machine in Foo?"

How do I choose the states and the transition conditions?

I usually use FSMs for fairly simple problems and choose them intuitively. In my answer to another question about regular expressions, I just looked at the parsing problem as one of being either Inside or outside a tag pair, and wrote out the transitions from there (with a beginning and ending state to keep the implementation clean).

How do I implement my abstract state machine in Foo?

If your implementation language supports a structure like c's switch statement, then you switch on the current state and process the input to see which action and/or transition too perform next.

Without switch-like structures, or if they are deficient in some way, you if style branching. Ugh.

Written all in one place in c the example I linked would look something like this:

token_t token;
state_t state=BEGIN_STATE;
do {
   switch ( state.getValue() ) {
   case BEGIN_STATE;
      state=OUT_STATE;
      break;
   case OUT_STATE:
      switch ( token.getValue() ) {
         case CODE_TOKEN:
            state = IN_STATE;
            output(token.string());
            break;
         case NEWLINE_TOKEN;
            output("<break>");
            output(token.string());
            break;
         ...
      }
      break;
   ...
   }
} while (state != END_STATE);

which is pretty messy, so I usually rip the state cases out to separate functions.

回答5:

I'm sure someone has better examples, but you could check this post by Phil Haack, which has an example of a regular expression and a state machine doing the same thing (there's a previous post with a few more regex examples in there as well I think..)

Check the "HenriFormatter" on that page.

回答6:

I don't know what academic papers you've already read but it really isn't that difficult to understand how to implement a finite state machine. There are some interesting mathematics but to idea is actually very trivial to understand. The easiest way to understand an FSM is through input and output (actually, this comprises most of the formal definition, that I won't describe here). A "state" is essentially just describing a set of input and outputs that have occurred and can occur from a certain point.

Finite state machines are easiest to understand via diagrams. For example:

alt text http://img6.imageshack.us/img6/7571/mathfinitestatemachinedco3.gif

All this is saying is that if you begin in some state q0 (the one with the Start symbol next to it) you can go to other states. Each state is a circle. Each arrow represents an input or output (depending on how you look at it). Another way to think of an finite state machine is in terms of "valid" or "acceptable" input. There are certain output strings that are NOT possible certain finite state machines; this would allow you to "match" expressions.

Now suppose you start at q0. Now, if you input a 0 you will go to state q1. However, if you input a 1 you will go to state q2. You can see this by the symbols above the input/output arrows.

Let's say you start at q0 and get this input

0, 1, 0, 1, 1, 1

This means you have gone through states (no input for q0, you just start there):

q0 -> q1 -> q0 -> q1 -> q0 -> q2 -> q3 -> q3

Trace the picture with your finger if it doesn't make sense. Notice that q3 goes back to itself for both inputs 0 and 1.

Another way to say all this is "If you are in state q0 and you see a 0 go to q1 but if you see a 1 go to q2." If you make these conditions for each state you are nearly done defining your state machine. All you have to do is have a state variable and then a way to pump input in and that is basically what is there.

Ok, so why is this important regarding Joel's statement? Well, building the "ONE TRUE REGULAR EXPRESSION TO RULE THEM ALL" can be very difficult and also difficult to maintain modify or even for others to come back and understand. Also, in some cases it is more efficient.

Of course, state machines have many other uses. Hope this helps in some small way. Note, I didn't bother going into the theory but there are some interesting proofs regarding FSMs.

来源：https://stackoverflow.com/questions/525004/short-example-of-regular-expression-converted-to-a-state-machine

标签

regex

state-machine

fsm