Writing a lexer for a new programming language in python


Question


I have no idea how or where to start. I'm supposed to be using Python and, more specifically, the ply library. So far, all I've done is create a list of tokens that will be part of the language. That list is given below:


tokens = (
    # OPERATORS #
    'PLUS',         # +
    'MINUS',        # -
    'MULTIPLY',     # *
    'DIVIDE',       # /
    'MODULO',       # %

    'NOT',          # ~
    'EQUALS',       # =

    # COMPARATORS #
    'LT',           # <
    'GT',           # >
    'LTE',          # <=
    'GTE',          # >=
    'DOUBLEEQUAL',  # ==
    'NE',           # #

    'AND',          # &
    'OR',           # |

    # CONDITIONS AND LOOPS #
    'IF',           # if
    'ELSE',         # else
    'ELSEIF',       # elseif
    'WHILE',        # while
    'FOR',          # for
    # 'DOWHILE',    # haven't thought about this yet

    # BRACKETS #
    'LPAREN',       # (
    'RPAREN',       # )
    'LBRACE',       # [
    'RBRACE',       # ]
    'BLOCKSTART',   # {
    'BLOCKEND',     # }

    # IDENTIFIERS #
    'INTEGER',      # int
    'DOUBLE',       # dbl
    'STRING',       # str
    'CHAR',         # char

    'SEMICOLON',    # ;
    'DOT',          # .
    'COMMA',        # ,
    'QUOTES',       # '
    'DOUBLEQUOTES', # "
    'COMMENTLINE',  # --

    'RETURN',       # return
)

I've obviously got a long way to go, seeing as I also need to write a parser and an interpreter.

I've got a few questions:

  1. How do I use the ply library?
  2. Is this a good start, and if so, where do I go from this?
  3. Are there any resources I can use to help me with this?

I've tried googling material on writing new programming languages, but I haven't yet found anything satisfactory.


Answer 1:


How do I use the ply library?

Assuming that you already have Ply installed, you should start by exploring the tutorials on the official Ply website. They are well written and easy to follow.

Is this a good start, and if so, where do I go from this?

Ply requires token definitions to begin with, and you have already done that. However, complexity increases when your lexer has to differentiate between, say, an identifier like forget and a reserved keyword like for. The library also provides good support for precedence rules to resolve grammar ambiguity, which can be as easy as defining a tuple like this:

precedence = (
    ('left', 'STRING', 'KEYWORD'),
    ('left', 'MULTIPLY', 'DIVIDE'),
)
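
One note for later: a precedence table like this is consumed by the parser half of Ply (ply.yacc), not by the lexer itself. As a rough sketch of where it plugs in, assuming the lexer below is saved as the hypothetical module new_lexer.py:

    import ply.yacc as yacc
    from new_lexer import tokens, lexer  # hypothetical lexer module (see below)

    # Resolve shift/reduce ambiguity: * and / bind tighter than + and -
    precedence = (
        ('left', 'PLUS', 'MINUS'),
        ('left', 'MULTIPLY', 'DIVIDE'),
    )

    def p_expression_binop(p):
        '''expression : expression PLUS expression
                      | expression MINUS expression
                      | expression MULTIPLY expression
                      | expression DIVIDE expression'''
        # p[1] and p[3] are operand values, p[2] is the operator text
        ops = {'+': lambda a, b: a + b, '-': lambda a, b: a - b,
               '*': lambda a, b: a * b, '/': lambda a, b: a / b}
        p[0] = ops[p[2]](p[1], p[3])

    def p_expression_number(p):
        '''expression : INTEGER
                      | FLOAT'''
        p[0] = p[1]

    def p_error(p):
        print("Syntax error at", p)

    parser = yacc.yacc()
    print(parser.parse('3 + 4 * 2', lexer=lexer))  # prints 11

Ply will warn that the grammar does not yet use most of the tokens; that is expected for a fragment like this.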

However, I recommend reading more about lexers and yacc before diving into more advanced Ply features like expressions and precedence. For a start, you should build a simple numerical lexer that successfully tokenizes integers, operators and bracket symbols. I've reduced the token definition to suit this purpose. The following example has been adapted from the official tutorials.

  • Library import & Token definition:

    import ply.lex as lex  # library import

    # List of token names. This is always required.
    tokens = [
        # OPERATORS #
        'PLUS',         # +
        'MINUS',        # -
        'MULTIPLY',     # *
        'DIVIDE',       # /
        'MODULO',       # %

        'NOT',          # ~
        'EQUALS',       # =

        # COMPARATORS #
        'LT',           # <
        'GT',           # >
        'LTE',          # <=
        'GTE',          # >=
        'DOUBLEEQUAL',  # ==
        'NE',           # !=
        'AND',          # &
        'OR',           # |

        # BRACKETS #
        'LPAREN',       # (
        'RPAREN',       # )
        'LBRACE',       # [
        'RBRACE',       # ]
        'BLOCKSTART',   # {
        'BLOCKEND',     # }

        # DATA TYPES #
        'INTEGER',      # int
        'FLOAT',        # dbl

        'COMMENT',      # #
    ]
    
  • Define regular expression rules for simple tokens: Ply uses Python's re library to find regex matches during tokenization. Each token requires a regex definition, and each rule declaration begins with the special prefix t_ to indicate that it defines a token.

    # Regular expression rules for simple tokens
    
    # Ply sorts these string rules by decreasing regex length, so
    # two-character operators like <= are tried before < automatically.
    t_PLUS        = r'\+'
    t_MINUS       = r'-'
    t_MULTIPLY    = r'\*'
    t_DIVIDE      = r'/'
    t_MODULO      = r'%'
    t_LPAREN      = r'\('
    t_RPAREN      = r'\)'
    t_LBRACE      = r'\['
    t_RBRACE      = r'\]'
    t_BLOCKSTART  = r'\{'
    t_BLOCKEND    = r'\}'
    t_NOT         = r'~'
    t_EQUALS      = r'='
    t_GT          = r'>'
    t_LT          = r'<'
    t_LTE         = r'<='
    t_GTE         = r'>='
    t_DOUBLEEQUAL = r'=='
    t_NE          = r'!='
    t_AND         = r'&'
    t_OR          = r'\|'
    t_COMMENT     = r'\#.*'
    t_ignore      = ' \t'  # ignore spaces and tabs
    
  • Define rules for more complex tokens as functions, such as the int and float data types, plus a newline rule to track line numbers. These definitions look quite similar to the ones above. One caveat: Ply matches function rules in the order they are defined, so t_FLOAT must come before t_INTEGER; otherwise an input like 16.5 would be split into INTEGER 16 followed by FLOAT 0.5.

    # Rules for FLOAT and INTEGER tokens.
    # t_FLOAT is defined first because Ply tries function rules in
    # definition order, and float literals must not be split apart.
    def t_FLOAT(t):
        r'(\d*\.\d+)|(\d+\.\d*)'
        t.value = float(t.value)
        return t

    def t_INTEGER(t):
        r'\d+'
        t.value = int(t.value)
        return t
    
    # Define a rule so we can track line numbers
    def t_newline(t):
        r'\n+'
        t.lexer.lineno += len(t.value)
    
  • Add some error handling for invalid characters:

    # Error handling rule
    def t_error(t):
        print("Illegal character '%s'" % t.value[0])
        t.lexer.skip(1)
    
  • Build the lexer:

    lexer = lex.lex()
    
  • Test the lexer with some input data, tokenize and print tokens:

    data = '''
    [25/(3*40) + {300-20} -16.5]
    {(300-250)<(400-500)}
    20 & 30 | 50
    # This is a comment
    '''
    
    # Give the lexer some input
    lexer.input(data)
    
    # Tokenize
    for tok in lexer:
        print(tok)
    

You can add this example code to a Python script file such as new_lexer.py and run it with python new_lexer.py. You should get the following output. Note that the newline ('\n') characters in the input produce no tokens: they are consumed by the t_newline rule, which only updates the line counter.

    #Output
    LexToken(LBRACE,'[',2,1)
    LexToken(INTEGER,25,2,2)
    LexToken(DIVIDE,'/',2,4)
    LexToken(LPAREN,'(',2,5)
    LexToken(INTEGER,3,2,6)
    LexToken(MULTIPLY,'*',2,7)
    LexToken(INTEGER,40,2,8)
    LexToken(RPAREN,')',2,10)
    LexToken(PLUS,'+',2,12)
    LexToken(BLOCKSTART,'{',2,14)
    LexToken(INTEGER,300,2,15)
    LexToken(MINUS,'-',2,18)
    LexToken(INTEGER,20,2,19)
    LexToken(BLOCKEND,'}',2,21)
    LexToken(MINUS,'-',2,23)
    LexToken(FLOAT,16.5,2,24)
    LexToken(RBRACE,']',2,28)
    LexToken(BLOCKSTART,'{',3,30)
    LexToken(LPAREN,'(',3,31)
    LexToken(INTEGER,300,3,32)
    LexToken(MINUS,'-',3,35)
    LexToken(INTEGER,250,3,36)
    LexToken(RPAREN,')',3,39)
    LexToken(LT,'<',3,40)
    LexToken(LPAREN,'(',3,41)
    LexToken(INTEGER,400,3,42)
    LexToken(MINUS,'-',3,45)
    LexToken(INTEGER,500,3,46)
    LexToken(RPAREN,')',3,49)
    LexToken(BLOCKEND,'}',3,50)
    LexToken(INTEGER,20,4,52)
    LexToken(AND,'&',4,55)
    LexToken(INTEGER,30,4,57)
    LexToken(OR,'|',4,60)
    LexToken(INTEGER,50,4,62)
    LexToken(COMMENT,'# This is a comment',5,65)

There are many other features you can make use of. For instance, debugging can be enabled with lex.lex(debug=True). The official tutorials provide more detailed information about these features.
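
As a small illustration (a sketch, assuming the lexer module built above), you can also pull tokens one at a time with lexer.token(), which returns None once the input is exhausted:

    # Rebuild the lexer with verbose debug logging
    lexer = lex.lex(debug=True)
    lexer.input('3 + 4.5')

    # Pull tokens manually instead of iterating over the lexer
    while True:
        tok = lexer.token()   # None at end of input
        if not tok:
            break
        print(tok.type, tok.value, tok.lineno, tok.lexpos)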

I hope this helps to get you started. You can extend the code further to include reserved keywords like if and while, string literals with STRING, and character literals with CHAR. The tutorials cover the implementation of reserved words by defining a key-value dictionary mapping like this:

    reserved = {
        'if'    : 'IF',
        'then'  : 'THEN',
        'else'  : 'ELSE',
        'while' : 'WHILE',
        ...
    }

You then extend the tokens list by defining the reserved token type 'ID' and including the reserved dict values: tokens.append('ID') followed by tokens = tokens + list(reserved.values()). Finally, add a function rule for t_ID that looks each identifier up in the dictionary, as sketched below.
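
A minimal t_ID rule in the style of the official Ply documentation (assuming the reserved dictionary above) looks like this:

    def t_ID(t):
        r'[a-zA-Z_][a-zA-Z_0-9]*'
        t.type = reserved.get(t.value, 'ID')  # reserved words take priority over plain identifiers
        return t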

Are there any resources I can use to help me with this?

There are many resources available to learn about lexers, parsers and compilers. You should start with a good book that covers the theory and implementation. There are many books available that cover these topics. I liked this one. Here's another resource that may help. If you'd like to explore similar Python libraries or resources, this SO answer may help.



Source: https://stackoverflow.com/questions/55571086/writing-a-lexer-for-a-new-programming-language-in-python
