How do I parse PDF strings with nested string delimiters in antlr?

社会主义新天地 提交于 2020-01-05 04:45:28

问题


I'm working on parsing PDF content streams. Strings are delimited by parentheses but can contain nested unescaped parentheses. From the PDF Reference:

A literal string shall be written as an arbitrary number of characters enclosed in parentheses. Any characters may appear in a string except unbalanced parentheses (LEFT PARENHESIS (28h) and RIGHT PARENTHESIS (29h)) and the backslash (REVERSE SOLIDUS (5Ch)), which shall be treated specially as described in this sub-clause. Balanced pairs of parentheses within a string require no special treatment.

EXAMPLE 1:

The following are valid literal strings: 
(This is a string)
(Strings may contain newlines
and such.)
(Strings may contain balanced parentheses ( ) and special characters (*!&}^% and so on).)

It seems like pushing lexer modes onto a stack would be the thing to handle this. Here's a stripped-down version of my lexer and parser.

lexer grammar PdfStringLexer;

Tj: 'Tj' ;
TJ: 'TJ' ;

NULL: 'null' ;

BOOLEAN: ('true'|'false') ;

LBRACKET: '[' ;
RBRACKET: ']' ;
LDOUBLEANGLE: '<<' ;
RDOUBLEANGLE: '>>' ;

NUMBER: ('+' | '-')? (INT | FLOAT) ;

NAME: '/' ID ;

// A sequence of literal characters enclosed in parentheses.
OPEN_PAREN: '(' -> more, pushMode(STR) ; 

// Hexadecimal data enclosed in angle brackets
HEX_STRING: '<' [0-9A-Za-z]+ '>' ; 

fragment INT: DIGIT+ ; // match 1 or more digits

fragment FLOAT:  DIGIT+ '.' DIGIT*  // match 1. 39. 3.14159 etc...
     |         '.' DIGIT+  // match .1 .14159
     ;

fragment DIGIT:   [0-9] ;        // match single digit

// Accept all characters except whitespace and defined delimiters ()<>[]{}/%
ID: ~[ \t\r\n\u000C\u0000()<>[\]{}/%]+ ;

WS: [ \t\r\n\u000C\u0000]+ -> skip ; // PDF defines six whitespace characters

mode STR;

LITERAL_STRING : ')' -> popMode ;
STRING_OPEN_PAREN: '(' -> more, pushMode(STR) ; 
TEXT : . -> more ;

parser grammar PdfStringParser;

options { tokenVocab=PdfStringLexer; } 

array: LBRACKET object* RBRACKET ;
dictionary: LDOUBLEANGLE (NAME object)* RDOUBLEANGLE ;
string: (LITERAL_STRING | HEX_STRING) ;
object
    : NULL
    | array
    | dictionary
    | BOOLEAN
    | NUMBER
    | string
    | NAME
    ;

content : stat* ;

stat
    : tj
    ;

tj: ((string Tj) | (array TJ)) ; // Show text

When I process this file:

(Oliver’s Army) Tj
((What’s So Funny ’Bout) Peace, Love, and Understanding) Tj

I get this error and parse tree:

line 2:24 extraneous input ' Peace, Love, and Understanding)' expecting 'Tj'

So maybe pushMode doesn't push duplicate modes onto the stack. If not, what would be the way to handle nested parentheses?


Edit

I left out the instructions regarding escape sequences within the string:

Within a literal string, the REVERSE SOLIDUS is used as an escape character. The character immediately following the REVERSE SOLIDUS determines its precise interpretation as shown in Table 3. If the character following the REVERSE SOLIDUS is not one of those shown in Table 3, the REVERSE SOLIDUS shall be ignored.

Table 3 lists \n, \r, \t, \b backspace (08h), \f formfeed (FF), \(, \), \\, and \ddd character code ddd (octal)

An end-of-line marker appearing within a literal string without a preceding REVERSE SOLIDUS shall be treated as a byte value of (0Ah), irrespective of whether the end-of-line marker was a CARRIAGE RETURN (0Dh), a LINE FEED (0Ah), or both.

EXAMPLE 2:

(These \
two strings \
are the same.)
(These two strings are the same.)

EXAMPLE 3:

(This string has an end-of-line at the end of it. 
)
(So does this one.\n)

Should I use this STRING definition:

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;

without modes and deal with escape sequences in my code or create a lexer mode for strings and deal with escape sequences in the grammar?


回答1:


You could do this with lexical modes, but in this case it's not really needed. You could simply define a lexer rule like this:

STRING
 : '(' ( ~[()]+ | STRING )* ')'
 ;

And with escape sequences, you could try:

STRING
 : '(' ( ~[()\\]+ |  ESCAPE_SEQUENCE | STRING )* ')'
 ;

fragment ESCAPE_SEQUENCE
 : '\\' ( [nrtbf()\\] | [0-7] [0-7] [0-7] )
 ;


来源:https://stackoverflow.com/questions/57012793/how-do-i-parse-pdf-strings-with-nested-string-delimiters-in-antlr

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!