Is it advisable to use tokens for the purpose of syntax highlighting?

Question

I'm trying to implement syntax highlighting in C# on Android, using Xamarin. I'm using the ANTLR v4 library for C# to achieve this. My code, which is currently syntax highlighting Java with this grammar, does not attempt to build a parse tree and use the visitor pattern. Instead, I simply convert the input into a list of tokens:

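// Requires: using Antlr4.Runtime; using System.Collections.Generic;
private static IList<IToken> Tokenize(string text)
{
    var inputStream = new AntlrInputStream(text);
    var lexer = new JavaLexer(inputStream);
    var tokenStream = new CommonTokenStream(lexer);
    tokenStream.Fill(); // Lex the whole input up front, through the trailing EOF token.
    return tokenStream.GetTokens();
}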

Then I loop through all of the tokens in the highlighter and assign a color to them based on their kind.

public void HighlightAll(IList<IToken> tokens)
{
    int tokenCount = tokens.Count;

    for (int i = 0; i < tokenCount; i++)
    {
        var token = tokens[i];
        var kind = GetSyntaxKind(token);
        HighlightNext(token, kind);

        // Special case: the identifier right after "@" is part of the annotation,
        // so consume it here and color it as an annotation too.
        if (kind == SyntaxKind.Annotation)
        {
            var nextToken = tokens[++i];
            Debug.Assert(token.Text == "@" && nextToken.Type == Identifier);
            HighlightNext(nextToken, SyntaxKind.Annotation);
        }
    }
}

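public void HighlightNext(IToken token, SyntaxKind tokenKind)
{
    int count = token.Text.Length;

    // -1 is the EOF token type; EOF covers no text, so it gets no span.
    if (token.Type != -1)
    {
        _text.SetSpan(_styler.GetSpan(tokenKind), _index, _index + count, SpanTypes.InclusiveExclusive);
        // Advancing by the token length assumes the token list covers every
        // character of the input (whitespace and comments included).
        _index += count;
    }
}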

Initially, I figured this was wise because syntax highlighting is largely context-independent. However, I have already found myself needing to special-case identifiers that follow @, since I want those to get highlighted as annotations just as on GitHub (example). GitHub has further examples of coloring identifiers in certain contexts: here, List and ArrayList are colored, while mItems is not. I will likely have to add further code to highlight identifiers in those scenarios.

My question is, is it a good idea to examine tokens rather than a parse tree here? On one hand, I'm worried that I might have to end up doing a lot of special-casing for when a token's neighbors alter how it should be highlighted. On the other, parsing will add additional overhead for memory-constrained mobile devices, and make it more complicated to implement efficient syntax highlighting (e.g. not re-tokenizing/parsing everything) when the user edits text in the code editor. I also found it significantly less complicated to handle all of the token types rather than the parser rule types, because you just switch on token.Type rather than overriding a bunch of Visit* methods.

For reference, the full code of the syntax highlighter is available here.

Answer 1:

It depends on what you are syntax highlighting.

If you use a naive parser, then any syntax error in the text will cause highlighting to fail. That makes it quite a fragile solution since a lot of the texts you might want to syntax highlight are not guaranteed to be correct (particularly user input, which at best will not be correct until it is fully typed). Since syntax highlighting can help make syntax errors visible and is often used for that purpose, failing completely on syntax errors is counter-productive.

Text with errors does not readily fit into a syntax tree. But it does have more structure than a stream of tokens. Probably the most accurate representation would be a forest of subtree fragments, but that is an even more awkward data structure to work with than a tree.

Whatever the solution you choose, you will end up negotiating between conflicting goals: complexity vs. accuracy vs. speed vs. usability. A parser may be part of the solution, but so may ad hoc pattern matching.
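To illustrate the ad hoc pattern matching end of that trade-off, here is a minimal sketch of a regex-based highlighter that keeps producing spans even when the input is syntactically broken. It is not taken from the question's code: the names PatternHighlighter and StyledRange, the category names, and the tiny keyword list are all hypothetical, and a real Java highlighter would need many more patterns.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical result type: a range of the input plus the category to style it with.
public sealed class StyledRange
{
    public StyledRange(int start, int length, string category)
    {
        Start = start;
        Length = length;
        Category = category;
    }

    public int Start { get; }
    public int Length { get; }
    public string Category { get; }
}

public static class PatternHighlighter
{
    // Alternatives are tried left to right at each position, so order matters:
    // keywords must come before the general identifier pattern.
    private static readonly string[] Categories =
        { "comment", "string", "keyword", "identifier", "number" };

    private static readonly Regex Pattern = new Regex(
        @"(?<comment>//[^\n]*)" +
        // The closing quote is optional so an unterminated string is still colored.
        @"|(?<string>""(?:\\.|[^""\\\n])*""?)" +
        @"|(?<keyword>\b(?:class|public|return|int|void)\b)" +
        @"|(?<identifier>[A-Za-z_]\w*)" +
        @"|(?<number>\d+)");

    // Never throws on malformed input: text that matches no pattern is simply
    // left unstyled, and everything recognizable around it is still colored.
    public static List<StyledRange> Highlight(string text)
    {
        var ranges = new List<StyledRange>();
        foreach (Match match in Pattern.Matches(text))
        {
            foreach (var category in Categories)
            {
                if (match.Groups[category].Success)
                {
                    ranges.Add(new StyledRange(match.Index, match.Length, category));
                    break;
                }
            }
        }
        return ranges;
    }
}
```

Calling PatternHighlighter.Highlight on half-typed input such as class C { int x = "abc still colors the keywords, the identifiers, and the unterminated string. That is the robustness a strict parser would not give you on its own, at the cost of knowing nothing about context.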

Answer 2:

Your approach is totally fine and pretty much what everybody's using. It's also completely normal to fine-tune type matching by looking around, and it's cheap since the token types are cached. So you can always look back or ahead in the token stream if you need to adjust the SyntaxKind that is actually used. Don't start parsing your input. It won't help you.
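As a rough sketch of what looking back or ahead can mean in practice, the following refines each token's kind from its immediate neighbors. To keep it runnable without a generated lexer it uses hypothetical stand-in types (Tok, Kind) rather than ANTLR's IToken, and the two rules shown (annotations, and type names after class or interface) are just examples of the kind of adjustment described above.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for lexer output and highlight categories (not ANTLR types).
public enum Kind { Identifier, Keyword, Punctuation, Annotation, TypeName }

public sealed class Tok
{
    public Tok(string text, Kind kind)
    {
        Text = text;
        Kind = kind;
    }

    public string Text { get; }
    public Kind Kind { get; }
}

public static class ContextualHighlighter
{
    // Returns one highlight kind per token, starting from the lexer's kind and
    // adjusting it by inspecting the previous and next tokens.
    public static Kind[] Refine(IReadOnlyList<Tok> tokens)
    {
        var kinds = new Kind[tokens.Count];
        for (int i = 0; i < tokens.Count; i++)
        {
            var current = tokens[i];
            kinds[i] = current.Kind;

            bool hasPrev = i > 0;
            bool hasNext = i + 1 < tokens.Count;

            // Look ahead: "@" directly followed by an identifier starts an annotation.
            if (current.Text == "@" && hasNext && tokens[i + 1].Kind == Kind.Identifier)
            {
                kinds[i] = Kind.Annotation;
            }

            if (current.Kind == Kind.Identifier && hasPrev)
            {
                var prev = tokens[i - 1];
                if (prev.Text == "@")
                {
                    // Look back: the identifier after "@" is the annotation name.
                    kinds[i] = Kind.Annotation;
                }
                else if (prev.Kind == Kind.Keyword &&
                         (prev.Text == "class" || prev.Text == "interface"))
                {
                    // Look back: the identifier after "class"/"interface" names a type.
                    kinds[i] = Kind.TypeName;
                }
            }
        }
        return kinds;
    }
}
```

Because the bounds are checked before each peek, a dangling @ at the end of half-typed input is left as plain punctuation instead of tripping an assertion. With a real ANTLR token list the same checks would presumably compare token.Type against the generated lexer constants instead of these stand-in kinds.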

Answer 3:

I ended up choosing to use a parser because there were too many ad hoc rules. For example, although I wanted to color regular identifiers white, I wanted types in type declarations (e.g. C in class C) to be green. There ended up being about 20 of these special rules in total. Also, the added overhead of parsing turned out to be minuscule compared to other bottlenecks in my app.

For those interested, you can view my code here: https://github.com/jamesqo/Repository/blob/e5d5653093861bc35f4c0ac71ad6e27265e656f3/Repository.EditorServices/Internal/Java/Highlighting/JavaSyntaxHighlighter.VisitMethods.cs#L19-L76. I've highlighted all of the ~20 special rules I've had to make.

Source: https://stackoverflow.com/questions/44484852/is-it-advisable-to-use-tokens-for-the-purpose-of-syntax-highlighting

Tags

.net

parsing

antlr

antlr4