grammar | 易学教程

Is parsing in multiple passes common for PEG grammars?

Question: I'm designing a music programming language and implementing its syntax as a PEG grammar. The parsing process has ended up being fairly complicated, so what seemed like the simplest approach was to define several separate grammars and apply them in sequence. So far I have three grammars:

1. Take the entire contents of the source file and strip out the comments.
2. Take the source file (comments removed) and separate it by instrument. This results in pairs of instrument name/definition and the "music
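
The first two passes described in this excerpt can be sketched as a small pipeline. This is only an illustration under stated assumptions: the excerpt does not show the language's real syntax, so the `#` line comments, the `name:` section headers, and the function names `strip_comments`, `split_by_instrument`, and `parse` are all hypothetical, and plain regular expressions stand in for the PEG grammars.

```python
import re


def strip_comments(source: str) -> str:
    """Pass 1. Assumption: a comment runs from '#' to the end of the line."""
    return re.sub(r"#[^\n]*", "", source)


def split_by_instrument(source: str) -> list[tuple[str, str]]:
    """Pass 2. Assumption: a section starts with 'name:' on a line of its own."""
    pairs = []
    name, body = None, []
    for line in source.splitlines():
        header = re.fullmatch(r"\s*(\w+):\s*", line)
        if header:
            # A new header closes the previous instrument's section.
            if name is not None:
                pairs.append((name, " ".join(body)))
            name, body = header.group(1), []
        elif name is not None and line.strip():
            body.append(line.strip())
    if name is not None:
        pairs.append((name, " ".join(body)))
    return pairs


def parse(source: str) -> list[tuple[str, str]]:
    """Apply the passes in sequence; each pass consumes the previous output."""
    return split_by_instrument(strip_comments(source))
```

Each pass only sees the output of the one before it, which is what makes the multi-pass design easy to reason about: the second pass never has to know that comments existed.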

Top-down parser classification

Question: I've watched this course by Alex Aiken and read through many other resources, but I'm struggling to find a clear classification of top-down parsers. This document doesn't provide a clear classification either, but it at least gives some definitions I'll use in the post. So here is the classification I've come up with so far:

Backtracking vs. Predictive

Backtracking

One solution to parsing would be to implement backtracking. Based on the information the parser currently has about the input, a decision is

Generating PCFG from Universal tagset [duplicate]

Question: This question already has an answer here: nltk cant interpret grammar category PRP$ output by stanford parser (1 answer). Closed 2 years ago.

I am trying to build a PCFG using the POS tags obtained from the code below:

    from nltk.corpus import treebank

    corpus = treebank.tagged_sents(tagset='universal')
    tags = set()
    for sent in corpus:
        for (word, tag) in sent:
            tags.add(tag)
    tags = list(tags)
    print(tags)

This gives:

    ['ADV', 'NOUN', 'ADP', 'PRON', 'DET', '.', 'PRT', 'NUM', 'X', 'CONJ', 'ADJ', 'VERB']

How to encode FIRST & FOLLOW sets inside a compiler

Question: I am writing a compiler for a compiler design course that I am taking, and I am currently at the syntax analysis stage, where I need to write a parser. I need the FIRST and FOLLOW sets to handle any errors that may appear in the source text. I have precalculated the FIRST and FOLLOW sets for all of the non-terminals in my grammar, but I am having trouble deciding where I should actually encode them inside my program. Should I place them in a map where the key is the name of the non-terminal
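
One way the map idea raised in this excerpt could look is sketched below. Everything here is an assumption for illustration: the asker's grammar is not shown, so the toy grammar in the comment, the token names, and the `synchronize` helper are hypothetical. The sets are stored as dictionaries keyed by non-terminal name, and FOLLOW is used for simple panic-mode error recovery.

```python
# Hypothetical toy grammar (not the asker's):
#   expr      -> term expr_tail
#   expr_tail -> '+' term expr_tail | ε
#   term      -> NUM | '(' expr ')'
# '$' marks end of input; 'ε' marks that the non-terminal can derive empty.
FIRST = {
    "expr": frozenset({"NUM", "("}),
    "expr_tail": frozenset({"+", "ε"}),
    "term": frozenset({"NUM", "("}),
}

FOLLOW = {
    "expr": frozenset({")", "$"}),
    "expr_tail": frozenset({")", "$"}),
    "term": frozenset({"+", ")", "$"}),
}


def synchronize(tokens: list[str], pos: int, nonterminal: str) -> int:
    """Panic-mode recovery: skip tokens until one in FOLLOW(nonterminal).

    Returns the index where parsing can resume, or len(tokens) if no
    synchronizing token is found.
    """
    while pos < len(tokens) and tokens[pos] not in FOLLOW[nonterminal]:
        pos += 1
    return pos
```

Keeping the sets in plain data like this, separate from the parsing functions, means the recovery routine stays generic: a parse function for `term` that hits an unexpected token can call `synchronize(tokens, pos, "term")` and continue from the returned index.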

Parse Parenthesis as atoms ANTLR

Question: I'm trying to match balanced parentheses such that a PARAMS tree is created if a match is made; otherwise the LPARAM and RPARAM tokens are simply added as atoms to the tree...

    tokens { LIST; PARAMS; }

    start : list -> ^(LIST list);
    list  : (expr|atom)+;
    expr  : LPARAM list? RPARAM -> ^(PARAMS list?);
    atom  : INT | LPARAM | RPARAM;

    INT    : '0'..'9'+;
    LPARAM : '(';
    RPARAM : ')';

At the moment, it will never create a PARAMS tree, because in the rule expr it will always see the end RPARAM as an atom,

ANTLR lexer rule consumes too much

Question: ANTLR Lexer Rule Design

I have a requirement for the following token:

- Allowable characters include uppercase, lowercase, numeric, space, and hyphen characters
- Unfixed length (must be at least two characters in length)
- Token must contain at least one space or hyphen
- Token must start and end in an uppercase, lowercase, numeric, space, or hyphen character (cannot begin or end with a space)

The ANTLR lexer rule "AlphaNumericSpaceHyphen" in the grammar below almost works except for one case. Using

What's wrong with my grammar

Question: I try to input the following into my yacc parser:

    int main(void) { return; }

It looks valid to me according to what's defined in the yacc file, but I get a "syntax error" message after the return. Why is that? The yacc file:

    /* C-Minus BNF Grammar */
    %{
    #include "parser.h"
    #include <string.h>
    %}

    %union {
        int intval;
        struct symtab *symp;
    }

    %token ELSE
    %token IF
    %token INT
    %token RETURN
    %token VOID
    %token WHILE
    %token <symp> ID
    %token <intval> NUM
    %token LTE
    %token GTE
    %token EQUAL
    %token

Why s--> ^ and A --> a ? in Context Free Grammars

Question: I've been reading the "Tips for creating Context free grammar" post for learning purposes, and I nearly understand the concept, but I don't quite understand the following. If we have L = {a^m b^n | m >= n}, I understand this:

    S --> B
    B --> aBb
    A --> aA

But what I don't understand is the concept of adding to the end of these particular values, such as:

    S --> B | ^
    B --> aBb | A
    A --> aA | a

Why do we add ^ (null), A, and a to the end of these lines? What do they do and why do we need them? All

Why my simple Ragel grammar use all memory and crash

Question: I am trying to convert a set of regular expressions from Adblock Plus rules into an optimized function I could call from C++. I was expecting to be able to use a lexer generator such as Ragel to do this, but when I try with a very small set of regexes, the memory usage goes very high (> 30 GB) and Ragel quits without an error message and without producing the output file. I included the toy grammar below; I am trying to understand whether I am doing anything stupid that could be optimized to solve the issue.

How can I parse a special character differently in two terminal rules using antlr?

Question: I have a grammar that uses the $ character at the start of many terminal rules, such as $video{ , $audio{ , $image{ , $link{ and others like these. However, I'd also like to match all the $ and { and } characters that don't match these rules. For example, my grammar does not properly match $100 in the CHUNK rule, but adding the $ to the long list of acceptable characters in CHUNK causes the other production rules to break. How can I change my grammar so that it's smart enough to