What is the best algorithm for arbitrary delimiter/escape character processing?


心在旅途 2021-01-02 07:57

I'm a little surprised that there isn't more information on this on the web, and I keep finding that the problem is a little stickier than I thought.

Here's the r

7 answers

春和景丽 (OP)

2021-01-02 08:31
Implementing this kind of tokenizer as a finite-state machine (FSM) is fairly straightforward.

You do have a few decisions to make, such as what to do with leading delimiters: strip them, or emit NULL tokens.
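To make that decision concrete, here is a small Python sketch of the two policies. The function names `split_keep_empty` and `split_drop_empty` are hypothetical, and the sketch ignores escaping entirely; it only shows what changes when empty tokens are emitted versus stripped.

```python
from typing import List


def split_keep_empty(line: str, delimiter: str = ",") -> List[str]:
    """Emit an empty token for every leading or repeated delimiter."""
    return line.split(delimiter)


def split_drop_empty(line: str, delimiter: str = ",") -> List[str]:
    """Strip leading and repeated delimiters by discarding empty tokens."""
    return [token for token in line.split(delimiter) if token]


print(split_keep_empty(",a,,b"))  # ['', 'a', '', 'b']
print(split_drop_empty(",a,,b"))  # ['a', 'b']
```

Which one is right depends on the data: in CSV-like input an empty field usually carries meaning, while in whitespace-separated input it usually does not.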

Here is an abstract version which ignores leading and multiple delimiters, and doesn't allow escaping the newline:
```
state(input)     action
========================
BEGIN(*):         token.clear(); state=START;
END(*):           return;
*(\n\0):          token.emit(); state=END;
START(DELIMITER): ; // NB: the input is *not* added to the token!
START(ESCAPE):    state=ESC; // NB: the input is *not* added to the token!
START(*):         token.append(input); state=NORM;
NORM(DELIMITER):  token.emit(); token.clear(); state=START;
NORM(ESCAPE):     state=ESC; // NB: the input is *not* added to the token!
NORM(*):          token.append(input);
ESC(*):           token.append(input); state=NORM;
```
This kind of implementation has the advantage of handling consecutive escapes naturally, and it can easily be extended to give special meaning to more escape sequences (e.g. add a rule like ESC(t): token.append(TAB)).
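The transition table can be sketched in Python roughly as follows. This is one possible reading of the table, not a definitive implementation: the function name `tokenize`, its default delimiter and escape characters, and the optional `escapes` mapping (which plays the role of extra rules such as ESC(t)) are all assumptions. One deliberate deviation: at end of input the sketch emits a token only if one is pending, so a trailing delimiter does not produce an empty token.

```python
from typing import Dict, List, Optional


def tokenize(line: str, delimiter: str = ",", escape: str = "\\",
             escapes: Optional[Dict[str, str]] = None) -> List[str]:
    """Split `line` using the START/NORM/ESC states from the table."""
    escapes = escapes or {}        # e.g. {"t": "\t"} for an ESC(t) rule
    tokens: List[str] = []
    token: List[str] = []          # characters of the token being built
    state = "START"                # BEGIN: token is empty, go to START

    for ch in line:
        if ch in "\n\0":           # *(\n\0): input ends, even after an escape
            break
        if state == "ESC":         # ESC(*): append the escaped character
            token.append(escapes.get(ch, ch))
            state = "NORM"
        elif ch == escape:         # START/NORM(ESCAPE): not added to the token
            state = "ESC"
        elif ch == delimiter:
            if state == "NORM":    # NORM(DELIMITER): emit and start over
                tokens.append("".join(token))
                token.clear()
                state = "START"
            # START(DELIMITER): ignore leading and repeated delimiters
        else:                      # START/NORM(*): ordinary character
            token.append(ch)
            state = "NORM"

    if token:                      # emit the pending token, if any
        tokens.append("".join(token))
    return tokens


print(tokenize(r"a\,b,,c"))                      # ['a,b', 'c']
print(tokenize(r"x\ty", escapes={"t": "\t"}))    # ['x\ty']
```

Because the escape character only switches state, two escapes in a row come out as one literal escape character without any special casing.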