ParseKit: What built-in Productions should I use in my Grammars?

时光总嘲笑我的痴心妄想 提交于 2019-12-23 01:54:23

问题


I just started using ParseKit to explore language creation and perhaps build a small toy DSL. However, the current SVN trunk from Google is throwing a -[PKToken intValue]: unrecognized selector sent to instance ... when parsing this grammar:

@start = identifier ;
identifier = (Letter | '_') | (letterOrDigit | '_') ;
letterOrDigit = Letter | Digit ;

Against this input:

foo

Clearly, I am missing something or have incorrectly configured my project. What can I do to fix this issue?


回答1:


Developer of ParseKit here.

First, see the ParseKit Tokenization docs.

Basically, ParseKit can work in one of two modes: Let's call them Tokens Mode and Chars Mode. (There are no formal names for these two modes, but perhaps there should be.)

Tokens Mode is more popular by far. Virtually every example you will find of using ParseKit will show how to use Tokens Mode. I believe all of the documentation on http://parsekit.com is using Tokens Mode. ParseKit's grammar feature (that you are using in your example only works in Tokens Mode).

Chars Mode is a very little-known feature of ParseKit. I've never had anyone ask about it before.

So the differences in the modes are:

  • In Tokens Mode, the ParseKit Tokenizer emits multi-char tokens (like Words, Symbols, Numbers, QuotedStrings etc) which are then parsed by the ParseKit parsers you create (programmatically or via grammars).
  • In Chars Mode, the ParseKit Tokenizer always emits single-char tokens which are then parsed by the ParseKit parsers you create programmatically. (grammars don't currently work with this mode as this mode is not popular).

You could use Chars Mode to implement Regular Expresions which parse on a char-by-char basis.


For your example, you should be ignoring Chars Mode and just use Tokens Mode. The following Built-in Productions are for Chars Mode only. Do not use them in your grammars:
(PK)Letter
(PK)Digit
(PK)Char
(PK)SpecificChar 

Notice how all of those Productions sound like they match individual chars. That's because they do.

Your example above should probably look like:

@start = identifier;
identifier = Word; // by default Words start with a-zA-Z_ and contain -0-9a-zAZ_'

Keep in mind the Productions in your grammars (parsers like identifier) will be working on Tokens already emitted from ParseKit's Tokenizer. Not individual chars.

IOW: by the time your grammar goes to work parsing input, the input has already been tokenized into Tokens of type Word, Number, Symbol, QuotedString, etc.

Here are all of the Built-in Productions available for use in your Grammar:

Word
Number 
Symbol
QuotedString
Comment
Any
S // Whitespace. only available when @preservesWhitespaceTokens=YES. NO by default.

Also:

DelimitedString('start', 'end', 'allowedCharset')
/xxx/i // RegEx match

There are also operators for composite parsers:

  // Sequence
| // Alternation
? // Optional
+ // Multiple
* // Repetition
~ // Negation
& // Intersection
- // Difference


来源:https://stackoverflow.com/questions/13792980/parsekit-what-built-in-productions-should-i-use-in-my-grammars

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!