I've been trying to parse some given text with PLY for a while and I haven't been able to figure it out. I have these tokens defined:
tokens = ['ID', 'INT'
You say: "It should take 9var as an ID". But then you point out that 9var doesn't match the ID regex pattern. So why should 9var be scanned as an ID?
If you want 9var to be an ID, it would be easy enough to change the regex, from [a-zA-Z_][a-zA-Z_0-9]* to [a-zA-Z_0-9]+. (That will also match pure integers, so you'd need to ensure that the INT pattern is applied first. Alternatively, you could use [a-zA-Z_0-9]*[a-zA-Z_][a-zA-Z_0-9]*.)
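To make the difference between the three ID patterns concrete, here is a small sketch using only Python's `re` module (not PLY itself). The constant names and the `accepts` helper are made up for illustration; the patterns are the ones quoted above.

```python
import re

# The three candidate ID patterns discussed above (names are illustrative).
ID_ORIGINAL = re.compile(r'[a-zA-Z_][a-zA-Z_0-9]*')
ID_LOOSE = re.compile(r'[a-zA-Z_0-9]+')
ID_NEEDS_NONDIGIT = re.compile(r'[a-zA-Z_0-9]*[a-zA-Z_][a-zA-Z_0-9]*')


def accepts(pattern, text):
    """Return True if the pattern matches the whole of text."""
    return pattern.fullmatch(text) is not None


for word in ('var9', '9var', '123'):
    print(word,
          accepts(ID_ORIGINAL, word),
          accepts(ID_LOOSE, word),
          accepts(ID_NEEDS_NONDIGIT, word))
```

The loose pattern accepts 123 as well, which is why rule order matters for it; the third pattern requires at least one letter or underscore, so a pure integer never matches it.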
I suspect that what you really want is for 9var to be recognized as a lexical error rather than a parsing error. But if it is going to be recognized as an error in any case, does it really matter whether it is a lexical error or a syntax error?
It's worth mentioning that the Python lexer works exactly the way your lexer does: it will scan 9var as two tokens, and that will later create a syntax error.
Of course, it is possible that in your language, there is some syntactically correct construction in which an ID can directly follow an INT. Or, if not, where a keyword can directly follow an INT, such as the Python expression 3 if x else 2. (Again, Python doesn't complain if you write that as 3if x else 2.)
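So if you really insist on flagging a scanner error for tokens which start with a digit and continue with non-digits, you can insert another pattern, such as [0-9]+[a-zA-Z_][a-zA-Z_0-9]*, and have it raise an error in its action. (The pattern has to be tried before INT, or INT will match the leading digits first; in PLY, function rules are tried in the order in which they are defined.)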
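As a rough sketch of that idea, here is a tiny hand-rolled scanner built on Python's `re` module. It is not PLY code: `TOKEN_RE`, `LexicalError`, `scan` and the `BAD_NUMBER` group name are all hypothetical, and whitespace skipping is an assumption about the language. It only imitates the "first matching rule wins" behaviour, with the error pattern listed ahead of INT.

```python
import re

# Alternatives are tried left to right, so BAD_NUMBER must precede INT;
# otherwise INT would match the leading digits of "9var" first.
TOKEN_RE = re.compile(r'''
    (?P<BAD_NUMBER>[0-9]+[a-zA-Z_][a-zA-Z_0-9]*)
  | (?P<INT>[0-9]+)
  | (?P<ID>[a-zA-Z_][a-zA-Z_0-9]*)
  | (?P<SKIP>\s+)
''', re.VERBOSE)


class LexicalError(Exception):
    """Raised when the scanner itself rejects the input."""


def scan(text):
    """Return a list of (kind, lexeme) pairs, or raise LexicalError."""
    pos = 0
    result = []
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise LexicalError(f"illegal character {text[pos]!r} at {pos}")
        kind = m.lastgroup
        if kind == 'BAD_NUMBER':
            # The "action" of the extra pattern: report a lexical error.
            raise LexicalError(f"token starts with a digit: {m.group()!r}")
        if kind != 'SKIP':
            result.append((kind, m.group()))
        pos = m.end()
    return result
```

With this ordering, `scan("9 var")` yields an INT followed by an ID, while `scan("9var")` raises `LexicalError` before any parser sees the input. In PLY the analogous step would presumably be a function rule for the extra pattern, defined before the INT rule, whose body reports the error.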