How can lexing efficiency be improved?

Question


When parsing a large 3-gigabyte file with a DCG, efficiency is important.

The current version of my lexer mostly uses the disjunction operator ;/2, but I have read that indexing can help.

Indexing is a technique used to quickly select candidate clauses of a predicate for a specific goal. In most Prolog systems, indexing is done (only) on the first argument of the head. If this argument is instantiated to an atom, integer, float or compound term with functor, hashing is used to quickly select all clauses where the first argument may unify with the first argument of the goal. SWI-Prolog supports just-in-time and multi-argument indexing. See section 2.18.
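As a minimal illustration of first-argument indexing (the predicate below is made up for this example): given a table of facts like the following, a call such as token_name(0'*, T) hashes on the bound first argument and jumps straight to the matching clause, instead of trying each clause in turn.

token_name(0'+, plus).
token_name(0'-, minus).
token_name(0'*, times).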

Can someone give an example of using indexing for lexing and possibly explain how it improves efficiency?


Details

Note: I changed some of the names before copying the source code into this question. If you find a mistake, feel free to edit it here or leave me a comment and I will gladly fix it.

Currently my lexer/tokenizer (based on mzapotoczny/prolog-interpreter parser.pl) is this

% N.B.
% Since the lexer uses "" for values, the double_quotes flag has to be set to `chars`.
% If the double_quotes flag is set to `codes`, values written with "" will not be matched.
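% For example, with the flag set to chars:
%   ?- X = "ab".
%   X = [a, b].
% With it set to codes, "ab" would instead be a list of character codes,
% e.g. [97, 98].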

:- use_module(library(pio)). 
:- use_module(library(dcg/basics)).
:- set_prolog_flag(double_quotes,chars).

lexer(Tokens) -->
   white_space,
   (
      (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta }
      ;  "}",       !, { Token = tokRMusta }
      ;  "\\",      !, { Token = tokSlash }
      ;  "->",      !, { Token = tokImpl }
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

white_space -->
   [Char], { code_type(Char, space) }, !, white_space.
white_space -->
    "--", whole_line, !, white_space.
white_space -->
   [].

whole_line --> "\n", !.
whole_line --> [_], whole_line.

digit(D) -->
   [D],
      { code_type(D, digit) }.

digits([D|T]) -->
   digit(D),
   !,
   digits(T).
digits([]) -->
   [].

number(D, N) -->
   digits(Ds),
      { number_chars(N, [D|Ds]) }.

letter(L) -->
   [L], { code_type(L, alpha) }.

alphanum([A|T]) -->
   [A], { code_type(A, alnum) }, !, alphanum(T).
alphanum([]) -->
   [].

identifier(L, Id) -->
   alphanum(As),
      { atom_codes(Id, [L|As]) }.
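For illustration, with the double_quotes flag set to chars as above, a query against lexer//1:

?- phrase(lexer(Ts), "where x -> y").
Ts = [tokWhere, tokVar(x), tokImpl, tokVar(y)].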

Here are some helper predicates used for development and testing.

read_file_for_lexing_and_user_review(Path) :-
    open(Path,read,Input),
    read_input_for_user_review(Input), !,
    close(Input).

read_file_for_lexing_and_performance(Path,Limit) :-
    open(Path,read,Input),
    read_input_for_performance(Input,0,Limit), !,
    close(Input).

read_input(Input) :-
    at_end_of_stream(Input).

read_input(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    read_input(Input).

read_input_for_user_review(Input) :-
    at_end_of_stream(Input).

read_input_for_user_review(Input) :-
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line_for_user_review(Line),
    nl,
    print('Press spacebar to continue or any other key to exit: '),
    get_single_char(Key),
    process_user_continue_or_exit_key(Key,Input).

read_input_for_performance(_Input,Count,Limit) :-
    Count >= Limit.

read_input_for_performance(Input,_,_) :-
    at_end_of_stream(Input).

read_input_for_performance(Input,Count0,Limit) :-
    % print(Count0),
    \+ at_end_of_stream(Input),
    read_string(Input, "\n", "\r\t ", _, Line),
    lex_line(Line),
    Count is Count0 + 1,
    read_input_for_performance(Input,Count,Limit).

process_user_continue_or_exit_key(32,Input) :-  % space bar
    nl, nl,
    read_input_for_user_review(Input).

process_user_continue_or_exit_key(Key,_Input) :-
    Key \= 32.

lex_line_for_user_review(Line) :-
    lex_line(Line,TokList),
    print(Line),
    nl,
    print(TokList),
    nl.

lex_line(Line,TokList) :-
    string_chars(Line,Code_line),
    phrase(lexer(TokList),Code_line).

lex_line(Line) :-
    string_chars(Line,Code_line),
    phrase(lexer(_TokList),Code_line).

read_user_input_for_lexing_and_user_review :-
    print('Enter a line to parse or just Enter to exit: '),
    nl,
    read_string(user, "\n", "\r", _, String),
    nl,
    lex_line_for_user_review(String),
    nl,
    continue_user_input_for_lexing_and_user_review(String).

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,N),
    N > 0,
    read_user_input_for_lexing_and_user_review.

continue_user_input_for_lexing_and_user_review(String) :-
    string_length(String,0).
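For example, lexing a single line (given a string such as those produced by read_string/5):

?- lex_line("1 + x", TokList).
TokList = [tokNumber(1), tokPlus, tokVar(x)].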

read_user_input_for_lexing_and_user_review/0 allows a user to enter a string at the terminal for lexing and to review the tokens.

read_file_for_lexing_and_user_review/1 reads a file for lexing and reviews the tokens one line at a time.

read_file_for_lexing_and_performance/2 reads a file for lexing, with a limit on the number of lines to lex. It is for gathering basic performance statistics to measure efficiency and is meant to be used with time/1.
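For example, to time the lexing of the first 10,000 lines of a file (the path below is a placeholder):

?- time(read_file_for_lexing_and_performance('path/to/data.txt', 10000)).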


Answer 1:


Solution:

You should replace the following:

lexer(Tokens) -->
   white_space,
   (
      (  ":",       !, { Token = tokColon }
      ;  "(",       !, { Token = tokLParen }
      ;  ")",       !, { Token = tokRParen }
      ;  "{",       !, { Token = tokLMusta }
      ;  "}",       !, { Token = tokRMusta }
      ;  "\\",      !, { Token = tokSlash }
      ;  "->",      !, { Token = tokImpl }
      ;  "+",       !, { Token = tokPlus }
      ;  "-",       !, { Token = tokMinus }
      ;  "*",       !, { Token = tokTimes }
      ;  "=",       !, { Token = tokEqual }
      ;  "<",       !, { Token = tokLt }
      ;  ">",       !, { Token = tokGt }
      ;  "_",       !, { Token = tokUnderscore }
      ;  ".",       !, { Token = tokPeriod }
      ;  "/",       !, { Token = tokForwardSlash }
      ;  ",",       !, { Token = tokComma }
      ;  ";",       !, { Token = tokSemicolon }
      ;  digit(D),  !,
            number(D, N),
            { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
            {  member((Id, Token), [ (div, tokDiv),
                                     (mod, tokMod),
                                     (where, tokWhere)]),
               !
            ;  Token = tokVar(Id)
            }
      ;  [_],
            { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;  [],
         { Tokens = [] }
   ).

with

lexer(Tokens) -->
   white_space,
   (
      (
         op_token(Token), ! % replace the long ;/2 chain, searched blindly, with a call to the new predicate op_token//1, whose clauses have indexed access on the first argument in the standard Prolog way
      ;
         digit(D),  !, number(D, N),
         { Token = tokNumber(N) }
      ;  letter(L), !, identifier(L, Id),
         {  member((Id, Token), [ (div, tokDiv),
                                 (mod, tokMod),
                                 (where, tokWhere)]),
            !
      ;  Token = tokVar(Id)
         }
      ;  [_],
         { Token = tokUnknown }
      ),
      !,
      { Tokens = [Token | TokList] },
      lexer(TokList)
   ;
      [],
      { Tokens = [] }
   ).

%%%
op_token(tokColon)      --> ":".
op_token(tokLParen)     --> "(".
op_token(tokRParen)     --> ")".
op_token(tokLMusta)     --> "{".
op_token(tokRMusta)     --> "}".
op_token(tokSlash)      --> "\\".
op_token(tokImpl)       --> "->".
op_token(tokPlus)       --> "+".
op_token(tokMinus)      --> "-".
op_token(tokTimes)      --> "*".
op_token(tokEqual)      --> "=".
op_token(tokLt)         --> "<".
op_token(tokGt)         --> ">".
op_token(tokUnderscore) --> "_".
op_token(tokPeriod)     --> ".".
op_token(tokForwardSlash) --> "/".
op_token(tokComma)      --> ",".
op_token(tokSemicolon)  --> ";".

Edit by Guy Coder

I ran a test using the example data posted in the question. The data was loaded into a list, where each item in the list was one line of the data converted to character codes. Then, with time/1, lexer was called on each item in the list, and the test over the whole list was repeated 10,000 times. The data was loaded into a list and converted to character codes before time/1 so that those preparatory steps did not skew the results. Each of these runs was repeated 5 times to check the consistency of the data.
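A minimal sketch of such a harness (the predicate names here are assumptions, not the code actually used; Lines is the pre-converted list of lines):

% Lex every line in Lines once, N times over, using lexer//1 as defined above.
repeat_lex(0, _) :- !.
repeat_lex(N, Lines) :-
    forall(member(Line, Lines), phrase(lexer(_Tokens), Line)),
    N1 is N - 1,
    repeat_lex(N1, Lines).

run_benchmark(Lines) :-
    time(repeat_lex(10000, Lines)).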

In the runs below, for all of the versions, the lexer was extended to cover all of the 7-bit ASCII characters, which significantly increased the number of cases for special characters.

The version of Prolog used for the following was SWI-Prolog 8.0.

For the version in the question.

Version: 1

:- set_prolog_flag(double_quotes,chars).

% 694,080,002 inferences, 151.141 CPU in 151.394 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 150.813 CPU in 151.059 seconds (100% CPU, 4602271 Lips)
% 694,080,001 inferences, 152.063 CPU in 152.326 seconds (100% CPU, 4564439 Lips)
% 694,080,001 inferences, 151.141 CPU in 151.334 seconds (100% CPU, 4592280 Lips)
% 694,080,001 inferences, 151.875 CPU in 152.139 seconds (100% CPU, 4570074 Lips)

For the version as posted above in this answer

Version: 2

:- set_prolog_flag(double_quotes,chars).

% 773,260,002 inferences, 77.469 CPU in 77.543 seconds (100% CPU, 9981573 Lips)
% 773,260,001 inferences, 77.344 CPU in 77.560 seconds (100% CPU, 9997705 Lips)
% 773,260,001 inferences, 77.406 CPU in 77.629 seconds (100% CPU, 9989633 Lips)
% 773,260,001 inferences, 77.891 CPU in 77.967 seconds (100% CPU, 9927511 Lips)
% 773,260,001 inferences, 78.422 CPU in 78.644 seconds (100% CPU, 9860259 Lips)

Version 2 gives a dramatic improvement over Version 1 by using indexing.

Researching the code further: op_token//1 is a DCG rule, so it has two hidden arguments for implicitly passing around a state representation. Using listing/1 on it showed:

op_token(tokUnderscore,['_'|A], A).

Notice that the first argument is not the character being matched, and that in another answer the indexing code is written as

c_digit(0'0,0).

where the first argument is the character being matched and the second argument is the result.

So change this

op_token(Token), !

to this

[S], { special_character_indexed(S,Token) }

with indexed clauses such as

special_character_indexed( ';' ,tokSemicolon).
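A fuller sketch of that indexed table, derived from the op_token//1 clauses above (note that the two-character token for "->" cannot be matched by a single [S] and still needs its own rule):

special_character_indexed(':', tokColon).
special_character_indexed('(', tokLParen).
special_character_indexed(')', tokRParen).
special_character_indexed('{', tokLMusta).
special_character_indexed('}', tokRMusta).
special_character_indexed('\\', tokSlash).
special_character_indexed('+', tokPlus).
special_character_indexed('-', tokMinus).
special_character_indexed('*', tokTimes).
special_character_indexed('=', tokEqual).
special_character_indexed('<', tokLt).
special_character_indexed('>', tokGt).
special_character_indexed('_', tokUnderscore).
special_character_indexed('.', tokPeriod).
special_character_indexed('/', tokForwardSlash).
special_character_indexed(',', tokComma).
special_character_indexed(';', tokSemicolon).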


Version: 3

:- set_prolog_flag(double_quotes,chars).

% 765,800,002 inferences, 74.125 CPU in 74.348 seconds (100% CPU, 10331197 Lips)
% 765,800,001 inferences, 74.766 CPU in 74.958 seconds (100% CPU, 10242675 Lips)
% 765,800,001 inferences, 74.734 CPU in 74.943 seconds (100% CPU, 10246958 Lips)
% 765,800,001 inferences, 74.828 CPU in 75.036 seconds (100% CPU, 10234120 Lips)
% 765,800,001 inferences, 74.547 CPU in 74.625 seconds (100% CPU, 10272731 Lips)

Version 3 gives a slight but consistent improvement over Version 2.

Lastly, just changing the double_quotes flag to atom, as noted in a comment by AntonDanilov:

Version: 4

:- set_prolog_flag(double_quotes,atom).

% 765,800,003 inferences, 84.234 CPU in 84.539 seconds (100% CPU, 9091300 Lips)
% 765,800,001 inferences, 74.797 CPU in 74.930 seconds (100% CPU, 10238396 Lips)
% 765,800,001 inferences, 75.125 CPU in 75.303 seconds (100% CPU, 10193677 Lips)
% 765,800,001 inferences, 75.078 CPU in 75.218 seconds (100% CPU, 10200042 Lips)
% 765,800,001 inferences, 75.031 CPU in 75.281 seconds (100% CPU, 10206414 Lips)

Version 4 is almost the same as Version 3.

Looking just at the CPU numbers, using indexing is faster, e.g. (Version 1) 151.875 vs. (Version 3) 74.547.




Answer 2:


One thing it means is that this is silly code:

token(T) -->
    ( "1", !, { T = one }
    ; "2", !, { T = two }
    ; "3", !, { T = three }
    ).

This is less silly code:

token(T) --> one_two_three(T).

one_two_three(one) --> "1".
one_two_three(two) --> "2".
one_two_three(three) --> "3".

But still not so good. Maybe better:

token(T) --> [X], { one_two_three(X, T) }.

one_two_three(0'1, one).
one_two_three(0'2, two).
one_two_three(0'3, three).

The last example also starts to look silly, but remember that now you have indexing on the first argument. You read once, there is no choice point, no backtracking.
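For example (note that 0'1 is a character code, so this assumes double_quotes is set to codes, or that the input is otherwise a code list):

?- set_prolog_flag(double_quotes, codes).
true.

?- phrase(token(T), "2").
T = two.

The answer comes back deterministically, with no choice point left behind.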

But if you really want to know how to write efficiently, you need to measure where the time and space go. Have you measured?

But if you really want to know how to fix it, maybe read "The Craft of Prolog"; I do not understand all of that book, but I remember it has a big section on DCGs.

But if you really want to parse large files in such formats, maybe find an existing library in another language; it might be much faster than the fastest Prolog.



Source: https://stackoverflow.com/questions/54259696/how-can-lexing-efficiency-be-improved
