Pythonic way to implement a tokenizer

青春惊慌失措 2020-12-30 07:11

I'm going to implement a tokenizer in Python and I was wondering if you could offer some style advice.

I've implemented a tokenizer before in C and in Java so I'm

12 answers
  •  [愿得一人]
    2020-12-30 07:47

    I recently built a tokenizer too, and ran into some of the same issues.

    Token types are declared as "constants", i.e. variables with ALL_CAPS names, at the module level. For example,

    _INTEGER = 0x0007
    _FLOAT = 0x0008
    _VARIABLE = 0x0009
    

    and so on. I put an underscore in front of each name to signal that these names are "private" to the module, but I really don't know whether this is typical or advisable, or even how Pythonic it is. (Also, I'll probably ditch the numbers in favour of strings, because they are much more readable during debugging.)

    Tokens are returned as named tuples.

    from collections import namedtuple
    Token = namedtuple('Token', ['value', 'type'])
    # so that e.g. somewhere in a function/method I can write...
    t = Token(n, _INTEGER)
    # ...and return it properly
    

    I used named tuples because the tokenizer's client code (e.g. the parser) reads a little more clearly with names (e.g. token.value) than with indexes (e.g. token[0]).

    Finally, I've noticed that sometimes, especially when writing tests, I prefer to pass a string to the tokenizer instead of a file object. I call the input source a "reader", and have a dedicated method that opens it so the tokenizer can access either kind of source through the same interface.

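    from io import StringIO

    def open_reader(self, source):
        """
        Produce a file object from source.
        The source can be either a file object already, or a string.
        """
        # Anything with a read() method is treated as an already-open file object.
        if hasattr(source, 'read'):
            return source
        # Otherwise wrap the string so it can be read like a file.
        return StringIO(source)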
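
    A short sketch of how this plays out in a test. It assumes a standalone function version of open_reader (no self), and read_chars is a hypothetical consumer that stands in for the tokenizer's reading loop.

```python
from io import StringIO


def open_reader(source):
    """Return source itself if it is file-like, else wrap the string in StringIO."""
    if hasattr(source, 'read'):
        return source
    return StringIO(source)


def read_chars(source):
    """Hypothetical consumer: read the source one character at a time."""
    reader = open_reader(source)
    chars = []
    while True:
        ch = reader.read(1)
        if not ch:  # an empty string signals end of input
            break
        chars.append(ch)
    return chars


# A plain string and a file-like object go through the same code path.
from_string = read_chars("a+b")
from_file_like = read_chars(StringIO("a+b"))
```

    In a test, the string form avoids creating temporary files, while production code can keep passing real file objects.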
    
