What is the best algorithm for arbitrary delimiter/escape character processing?

心在旅途 2021-01-02 07:57

I'm a little surprised that there isn't some information on this on the web, and I keep finding that the problem is a little stickier than I thought.

Here's the r

7 answers
  •  春和景丽
    2021-01-02 08:31

    Implementing this kind of tokenizer as a finite-state machine (FSM) is fairly straightforward.

    You do have a few decisions to make, such as what to do with leading delimiters: strip them or emit NULL tokens.


    Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:

    state(input)     action
    ========================
    BEGIN(*):         token.clear(); state=START;
    END(*):           return;
    *(\n\0):          token.emit(); state=END;
    START(DELIMITER): ; // NB: the input is *not* added to the token!
    START(ESCAPE):    state=ESC; // NB: the input is *not* added to the token!
    START(*):         token.append(input); state=NORM;
    NORM(DELIMITER):  token.emit(); token.clear(); state=START;
    NORM(ESCAPE):     state=ESC; // NB: the input is *not* added to the token!
    NORM(*):          token.append(input);
    ESC(*):           token.append(input); state=NORM;
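
A minimal Python sketch of the table above, for illustration only. The names `State` and `tokenize`, the default delimiter `,` and the default escape `\` are assumptions, not part of the original answer. Two further assumptions: end of input is treated like a line terminator, and an empty pending token is not emitted, so a trailing delimiter does not produce an empty token (the table as written would emit one).

```python
from enum import Enum, auto


class State(Enum):
    START = auto()  # between tokens, nothing collected yet
    NORM = auto()   # inside a token
    ESC = auto()    # the previous character was the escape character
    END = auto()    # a line terminator was seen


def tokenize(line, delimiter=",", escape="\\"):
    """Split one line into tokens, following the state table."""
    tokens = []
    token = []                     # BEGIN(*): token.clear(); state=START
    state = State.START

    for ch in line:
        if ch in "\n\0":           # *(\n\0): checked first, so it cannot be escaped
            state = State.END
            break
        if state is State.ESC:     # ESC(*): take the character literally
            token.append(ch)
            state = State.NORM
        elif ch == escape:         # START(ESCAPE) / NORM(ESCAPE)
            state = State.ESC      # the escape itself is not added to the token
        elif ch == delimiter:
            if state is State.NORM:    # NORM(DELIMITER): finish the token
                tokens.append("".join(token))
                token.clear()
                state = State.START
            # START(DELIMITER): ignored, so leading and repeated delimiters vanish
        else:                      # START(*) / NORM(*)
            token.append(ch)
            state = State.NORM

    # Terminator or end of input: emit the pending token, if any (assumption).
    if token:
        tokens.append("".join(token))
    return tokens
```

With these defaults, `tokenize(",,a,,b\\,c\n")` returns `["a", "b,c"]`: the leading and doubled delimiters are skipped, and the escaped comma stays inside the second token.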
    

    This kind of implementation has the advantage of dealing with consecutive escapes naturally, and it can easily be extended to give special meaning to more escape sequences (e.g. add a rule like ESC(t): token.append(TAB); state=NORM;).
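
One way the `ESC(t)` extension might look in Python, as a self-contained sketch. The function name `tokenize_with_escapes`, the `ESCAPES` mapping and the default delimiter and escape characters are hypothetical; the two booleans stand in for the START/NORM/ESC states. As before, an empty pending token is not emitted at the end of the line, which is an assumption rather than something the answer specifies.

```python
# Hypothetical mapping: character after the escape -> character to append.
ESCAPES = {"t": "\t"}


def tokenize_with_escapes(line, delimiter=",", escape="\\", escapes=ESCAPES):
    tokens, token = [], []
    in_token = False   # False corresponds to START, True to NORM
    escaped = False    # True corresponds to ESC

    for ch in line:
        if ch in "\n\0":       # terminator ends the line, even after an escape
            break
        if escaped:
            # ESC(t) and friends: translate known sequences,
            # otherwise fall back to ESC(*) and keep the character literally.
            token.append(escapes.get(ch, ch))
            escaped = False
            in_token = True
        elif ch == escape:
            escaped = True     # the escape character itself is dropped
        elif ch == delimiter:
            if in_token:       # NORM(DELIMITER): emit and start over
                tokens.append("".join(token))
                token.clear()
                in_token = False
        else:
            token.append(ch)
            in_token = True

    if token:                  # emit the pending token, if any (assumption)
        tokens.append("".join(token))
    return tokens
```

Because unknown sequences fall through to the literal case, an escaped delimiter or an escaped escape character still behaves as in the base table; only the characters listed in `ESCAPES` get a special meaning.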
