Is there an easy way to chunk a text file into brace-balanced sections?

十年热恋 提交于 2019-12-20 03:06:51

问题


I'm trying to parse some data out of a file using Perl & Parse::RecDescent. I can't throw the full data file at the perl script because RecDescent will take days poring over it. So I split up the huge datafile into RD-sized chunks to reduce the runtime.

However, I need to extract sections within balanced brackets and the routine I have now is not robust (it depends too much on the position of the final close-bracket from a newline). Example:

cell ( identifier ) {
  keyword2 { };
  ...
  keyword3 { keyword4 {  } };
}

...more sections...

I need to grab everything from cell ... { to the matching closing } which can have various amounts of spacing and sub-sections.

There must be some linux command line thing to do this easily? Any ideas?

Edit: Input files are around 8M, grammar ~60 rules.


回答1:


Why does RecDescent take so long? Is it because your grammar is complex? If that's the case, you could two a bi-level pass using Parse::RecDescent. The idea is that you would define a simple grammar that parses cell ... { ... } and then passes parsed output from the first parser into a call to Parse::RecDescent with your more complex grammar. This is guessing about the reason for RecDescent being slow on your data.

Another option is to write your own simple parser that matches on the cell entries, counts the number of braces it's seen so far, and then finds the matching brace when the closing brace count is equal to the opening brace count. That should be fast, but the suggestion above might be faster to implement and easier to maintain.

Edit: You should definitely try Parse::RecDescent with a simplified grammar. The algorithmic complexity of recursive descent parsing is proportional to the number of possible parse trees, which should be something like is B ^ N, where B is the number of branching points in your grammar, and N is the number of nodes.

If you'd like to try rolling your own simple parser for a first pass over your input, the following code can get you started.

#!/usr/bin/perl -w

use strict;

my $input_file = "input";
open FILE, "<$input_file" or die $!;

my $in_block = 0;
my $current_block = '';
my $open_bracket_count = 0;
while( my $line = <FILE> ) {
    if ( $line =~ /cell/ ) {
        $in_block = 1;
    }

    if ( $in_block ) {
        while ( $line =~ /([\{\}]{1})/g ) {
            my $token = $1;
            if ( $token eq '{' ) {
                $open_bracket_count++;
            } elsif ( $token eq '}' ) {
                $open_bracket_count--;
            }
        }

        $current_block .= $line;
    }

    if ( $open_bracket_count == 0 && $current_block ne '' ) {
        print '-' x 80, "\n";
        print $current_block, "\n";
        $in_block = 0;
        $current_block = '';
    }
}
close FILE or die $!;

Edit: changed code to avoid slurping the entire file into memory. While this is trivial for an 8MB file, it's cleaner to just read the file in line-by-line.




回答2:


Show what you are feeding Parse::RecDescent; it may be possible to make it much better.

Or you could try using Text::Balanced to parse the { ... }.




回答3:


Use yapp LALR(1) parser which works in linear time and constant space.



来源:https://stackoverflow.com/questions/998918/is-there-an-easy-way-to-chunk-a-text-file-into-brace-balanced-sections

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!