Writing grammar rules for context-sensitive elements using Pyparsing

I am trying to write a grammar for a set of sentences and use Pyparsing to parse them. The sentences describe what to search for in a text file and how, and I need to convert each one into the corresponding regex search code. However, some elements are not really context-free, so I am finding it difficult to write production rules for them.

Some examples of context-sensitive elements found in these sentences -

LINE_CONTAINS phrase1 BEFORE {phrase2 AND phrase3} means that, within the line, phrase1 can appear anywhere before phrase2 and phrase3. AFTER works the same way in the opposite direction
LINE_CONTAINS abc JOIN xyz means search for abc xyz and abc-xyz and abcxyz
LINE_CONTAINS abcd AND xyzw means the line should contain both abcd and xyzw

Example - LINE_CONTAINS we transfected BEFORE {sirna} AND gene AND LINE_STARTSWITH Therefore

This should be converted to re.search(r'^Therefore.*?we transfected.*?sirna', line) and re.search(r'gene', line), where line is the line being searched. (I believe a better regex could be written.)

I had begun writing grammar for the sentences as -

beginner = LINE_CONTAINS | LINE_STARTSWITH | other line beginners...
phrase = word+
sentence = beginner + phrase + AND + beginner + phrase

Any of these elements can occur in any sentence, alone or in combination. For example:

LINE_CONTAINS {x AND y} BEFORE {a letter AND b letter} AND zoo people AND LINE_STARTSWITH dfg

So my question is -

How do I write grammar rules that can handle such context-sensitive elements, given that any sentence can contain them (most sentences won't contain more than one, but some will)? Should I write separate rules for many kinds of sentences, each containing one of these elements? Or should I write a single rule that contains all of these elements and make each of them optional?

I do understand that these elements may not exactly be context-sensitive, but my problem lies in not being able to write independent production rules for elements like BEFORE, JOIN etc. How do I best define their function in the production rules?

Edit - The phrases can be multi-word

Making some guesses about your grammar, here is a rough stab. Notice how I separately define the line expressions from the phrase expressions:

from pyparsing import (CaselessKeyword, Word, alphas, MatchFirst, quotedString, 
    infixNotation, opAssoc, Suppress, Group)


LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(CaselessKeyword,
    """LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH""".split())
NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())

keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH, NOT, AND, OR, 
                      BEFORE, AFTER, JOIN])
phrase_word = ~keyword + Word(alphas + '_')

phrase_term = phrase_word | quotedString

phrase_expr = infixNotation(phrase_term,
                            [
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT,),
                             (NOT, 1, opAssoc.RIGHT,),
                             (AND, 2, opAssoc.LEFT,),
                             (OR, 2, opAssoc.LEFT),
                            ],
                            lpar=Suppress('{'), rpar=Suppress('}')
                            )

line_term = Group((LINE_CONTAINS | LINE_STARTSWITH | LINE_ENDSWITH)("line_directive") + 
                  Group(phrase_expr)("phrase"))
line_contents_expr = infixNotation(line_term,
                                   [(NOT, 1, opAssoc.RIGHT,),
                                    (AND, 2, opAssoc.LEFT,),
                                    (OR, 2, opAssoc.LEFT),
                                    ]
                                   )

sample = """
LINE_CONTAINS transfected BEFORE {sirna} AND gene AND LINE_STARTSWITH Therefore
"""

line_contents_expr.runTests(sample)

parses your sample as:

LINE_CONTAINS transfected BEFORE {sirna} AND gene AND LINE_STARTSWITH Therefore
[[['LINE_CONTAINS', [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]], 'AND', ['LINE_STARTSWITH', ['Therefore']]]]
[0]:
  [['LINE_CONTAINS', [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]], 'AND', ['LINE_STARTSWITH', ['Therefore']]]
  [0]:
    ['LINE_CONTAINS', [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]]
    - line_directive: 'LINE_CONTAINS'
    - phrase: [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]
      [0]:
        [['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']
        [0]:
          ['transfected', 'BEFORE', 'sirna']
        [1]:
          AND
        [2]:
          gene
  [1]:
    AND
  [2]:
    ['LINE_STARTSWITH', ['Therefore']]
    - line_directive: 'LINE_STARTSWITH'
    - phrase: ['Therefore']

The phrase_word starts with a negative lookahead, to avoid accidentally treating strings like 'LINE_STARTSWITH' as phrase words. I also added quoted strings as valid phrase words, since you never know when your search will have to actually include the string "LINE_STARTSWITH".
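Since the question's edit says phrases can be multi-word, here is a minimal sketch of one way to allow that, assuming it is acceptable to merge consecutive non-keyword words into a single string token. It repeats only the phrase-level part of the grammar above; the OneOrMore wrapper and the joining parse action are my additions, not part of the grammar as posted, and quoted strings are left out to keep the sketch short.

```python
from pyparsing import (CaselessKeyword, MatchFirst, OneOrMore, Suppress, Word,
                       alphas, infixNotation, opAssoc)

LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH = map(
    CaselessKeyword, "LINE_CONTAINS LINE_STARTSWITH LINE_ENDSWITH".split())
NOT, AND, OR = map(CaselessKeyword, "NOT AND OR".split())
BEFORE, AFTER, JOIN = map(CaselessKeyword, "BEFORE AFTER JOIN".split())

keyword = MatchFirst([LINE_CONTAINS, LINE_STARTSWITH, LINE_ENDSWITH,
                      NOT, AND, OR, BEFORE, AFTER, JOIN])
phrase_word = ~keyword + Word(alphas + '_')

# Assumption: a phrase is a run of one or more non-keyword words,
# merged into one space-separated string so it acts as a single operand.
phrase_term = OneOrMore(phrase_word).setParseAction(lambda toks: ' '.join(toks))

phrase_expr = infixNotation(phrase_term,
                            [
                             ((BEFORE | AFTER | JOIN), 2, opAssoc.LEFT),
                             (NOT, 1, opAssoc.RIGHT),
                             (AND, 2, opAssoc.LEFT),
                             (OR, 2, opAssoc.LEFT),
                            ],
                            lpar=Suppress('{'), rpar=Suppress('}'))

result = phrase_expr.parseString("we transfected BEFORE {sirna} AND gene",
                                 parseAll=True)
print(result.asList())
```

The negative lookahead in phrase_word is what ends each phrase: the repetition stops as soon as the next word is a keyword such as BEFORE or AND.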

Since you use {}s for grouping in your phrase expressions, note that infixNotation has optional lpar and rpar arguments to override the defaults of ( and ).

From here, you can look at other infixNotation examples (such as SimpleBool.py on the pyparsing wiki examples page) to convert this into your respective regex-generating code.
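As a rough illustration of that conversion step, the sketch below walks nested lists shaped like the parse output shown earlier and produces regex strings. The function names are hypothetical, and the operator semantics are assumed from the question's descriptions. It only covers BEFORE, AFTER, JOIN, OR, and AND at the top of a phrase or between line terms; NOT, line-level OR, AND nested inside another operator, and quoted strings are deliberately left out.

```python
import re

# Regex templates for the binary phrase operators. Semantics are assumed from
# the question: BEFORE/AFTER order two phrases; JOIN allows "a b", "a-b", "ab".
TEMPLATES = {
    'BEFORE': '(?:{lhs}.*?{rhs})',
    'AFTER': '(?:{rhs}.*?{lhs})',
    'JOIN': '(?:{lhs}[ -]?{rhs})',
    'OR': '(?:{lhs}|{rhs})',
}

def phrase_regex(node):
    """Build one regex fragment from a phrase tree (a str or nested lists)."""
    if isinstance(node, str):
        return re.escape(node)
    if len(node) == 1:
        return phrase_regex(node[0])
    if len(node) % 2 == 0:
        raise NotImplementedError('unary operators such as NOT are not handled')
    pattern = phrase_regex(node[0])
    # infixNotation returns chains like [a, op, b, op, c]; fold left to right.
    for op, operand in zip(node[1::2], node[2::2]):
        if op not in TEMPLATES:
            raise NotImplementedError('no single-regex template for ' + op)
        pattern = TEMPLATES[op].format(lhs=pattern, rhs=phrase_regex(operand))
    return pattern

def phrase_patterns(node):
    """Split a top-level AND chain into patterns that must all match."""
    if not isinstance(node, str) and len(node) == 1:
        return phrase_patterns(node[0])
    if not isinstance(node, str) and 'AND' in node[1::2]:
        return [p for operand in node[0::2] for p in phrase_patterns(operand)]
    return [phrase_regex(node)]

def line_patterns(tree):
    """Collect the patterns for line terms that are combined with AND."""
    if len(tree) == 1 and not isinstance(tree[0], str):
        return line_patterns(tree[0])
    if isinstance(tree[0], str) and tree[0].startswith('LINE_'):
        directive, phrase = tree
        patterns = phrase_patterns(phrase)
        if directive == 'LINE_STARTSWITH':
            return ['^' + p for p in patterns]
        if directive == 'LINE_ENDSWITH':
            return [p + '$' for p in patterns]
        return patterns
    if any(op != 'AND' for op in tree[1::2]):
        raise NotImplementedError('only AND between line terms is handled')
    return [p for term in tree[0::2] for p in line_patterns(term)]

def line_matches(tree, line):
    """True if every generated pattern is found in the line."""
    return all(re.search(p, line) for p in line_patterns(tree))

# The parse result from the sample above, written out as plain lists.
tree = [[['LINE_CONTAINS', [[['transfected', 'BEFORE', 'sirna'], 'AND', 'gene']]],
         'AND', ['LINE_STARTSWITH', ['Therefore']]]]
print(line_patterns(tree))
print(line_matches(tree, 'Therefore we transfected sirna for the gene'))
```

Returning a list of patterns rather than one combined regex mirrors the question's own target of several re.search calls, and it avoids lookahead tricks for AND. With a real parse result, calling asList() on it should give lists of this shape.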

This seems to me to be a very simplistic grammar. I think you are "overthinking" the problem.

Looking at your examples, I see this:

a JOIN b
a BEFORE b

a AND b
a OR b

STARTSWITH a

Those are simply operators. Unary operators (STARTSWITH) are like ~x or -x in Python. Binary operators (JOIN, BEFORE, AND, OR) are like x + y or x in y in Python.

I don't think CONTAINS is an operator so much as a placeholder. Pretty much everything except STARTSWITH is implicitly a "contains". So it is rather like the unary plus operator: defined, understood, allowed, but useless.

Anyway, figure out what the operators are (make a list). Figure out whether they are unary (startswith) or binary (and). Then figure out what their precedence and associativity are.

Once you know that information, you can build your parser: you will know the key words, and know how to arrange the key words in a grammar.
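To make that concrete, here is a small hand-rolled sketch that uses precedence climbing instead of pyparsing, purely to show how an operator table drives the parse. The precedence numbers, the choice of NOT and STARTSWITH as the unary operators, and the function names are all assumptions for illustration. There is no error handling for malformed input.

```python
import re

# Operator table (assumed values): higher numbers bind tighter.
BINARY = {'JOIN': 4, 'BEFORE': 4, 'AFTER': 4, 'AND': 2, 'OR': 1}
UNARY = {'NOT': 3, 'STARTSWITH': 3}

def tokenize(text):
    """Split on whitespace, keeping each brace as its own token."""
    return re.findall(r'[{}]|[^\s{}]+', text)

def is_word(token):
    """True for anything that is neither an operator nor a brace."""
    return token not in BINARY and token not in UNARY and token not in ('{', '}')

def parse(tokens, min_prec=0):
    """Precedence climbing over a token list; consumes the tokens in place."""
    token = tokens.pop(0)
    if token == '{':
        left = parse(tokens, 0)
        tokens.pop(0)  # discard the closing brace
    elif token in UNARY:
        left = [token, parse(tokens, UNARY[token])]
    else:
        # Consecutive plain words form one multi-word phrase.
        words = [token]
        while tokens and is_word(tokens[0]):
            words.append(tokens.pop(0))
        left = ' '.join(words)
    # Keep absorbing binary operators that bind at least as tightly as min_prec.
    while tokens and tokens[0] in BINARY and BINARY[tokens[0]] >= min_prec:
        op = tokens.pop(0)
        # Parsing the right side at a higher level makes op left-associative.
        right = parse(tokens, BINARY[op] + 1)
        left = [left, op, right]
    return left

print(parse(tokenize('a JOIN b AND NOT c OR d')))
print(parse(tokenize('we transfected BEFORE {sirna AND gene}')))
```

Changing a number in the table changes how the same sentence groups, which is why it pays to settle precedence and associativity before writing any grammar rules.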

Source: https://stackoverflow.com/questions/42415837/writing-grammar-rules-for-context-sensitive-elements-using-pyparsing
