“Out of memory” while parsing a large (100 MB) XML file using Perl

左心房为你撑大大i · submitted on 2019-12-08 17:00:59

Question


I get an "Out of memory" error while parsing a large (100 MB) XML file:

use strict;
use warnings;
use XML::Twig;

my $data = XML::Twig->new
             ->parsefile("divisionhouserooms-v3.xml")
               ->simplify( keyattr => []);

my @good_division_numbers = qw( 30 31 32 35 38 );

foreach my $property ( @{ $data->{DivisionHouseRoom}}) {

    my $house_code = $property->{HouseCode};
    print $house_code, "\n";

    my $amount_of_bedrooms = 0;

    foreach my $division ( @{ $property->{Divisions}->{Division} } ) {

        next unless grep { $_ eq $division->{DivisionNumber} } @good_division_numbers;
        $amount_of_bedrooms += $division->{DivisionQuantity};
    }

    open my $fh, ">>", "Result.csv" or die $!;
    print $fh join("\t", $house_code, $amount_of_bedrooms), "\n";
    close $fh;
}

What can I do to fix this error?


Answer 1:


Handling large XML files that don't fit in memory is something that XML::Twig advertises:

One of the strengths of XML::Twig is that it lets you work with files that do not fit in memory (BTW storing an XML document in memory as a tree is quite memory-expensive, the expansion factor being often around 10).

To do this you can define handlers, that will be called once a specific element has been completely parsed. In these handlers you can access the element and process it as you see fit (...)
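The handler approach the documentation describes can be sketched as follows. This is a minimal, hypothetical example: the inline XML snippet and its element names are made up to mirror the question's data, and the real file would be read with parsefile() instead of parse().

```perl
use strict;
use warnings;
use XML::Twig;

# A handler fires each time a <Division> element has been completely
# parsed; we read what we need from it, then purge to free memory.
my $total = 0;
my $twig = XML::Twig->new(
    twig_handlers => {
        Division => sub {
            my ( $t, $div ) = @_;
            $total += $div->first_child_text('DivisionQuantity') || 0;
            $t->purge;    # discard the already-processed part of the tree
        },
    },
);

# parse() accepts a string; parsefile('divisionhouserooms-v3.xml')
# would take the real file instead
$twig->parse(<<'XML');
<DivisionHouseRooms>
  <Division><DivisionNumber>30</DivisionNumber><DivisionQuantity>2</DivisionQuantity></Division>
  <Division><DivisionNumber>31</DivisionNumber><DivisionQuantity>3</DivisionQuantity></Division>
</DivisionHouseRooms>
XML

print "$total\n";    # 5
```

Because each handler purges what it has consumed, memory use stays roughly constant no matter how large the input file is.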


The code posted in the question isn't making use of this strength of XML::Twig at all (using the simplify method makes it little better than XML::Simple).

What's missing from the code are the twig_handlers or twig_roots options, which focus the parser on the relevant portions of the XML document and keep memory use low.

It's difficult to say without seeing the XML whether processing the document chunk-by-chunk or just selected parts is the way to go, but either one should solve this issue.

So the code should look something like the following (chunk-by-chunk demo):

use strict;
use warnings;
use XML::Twig;
use List::Util 'sum';   # To make life easier
use Data::Dump 'dump';  # To see what's going on

my %bedrooms;           # Data structure to store the wanted info

my $xml = XML::Twig->new (
                          twig_roots => {
                                          DivisionHouseRoom => \&count_bedrooms,
                                        }
                         );

$xml->parsefile( 'divisionhouserooms-v3.xml');

sub count_bedrooms {

    my ( $twig, $element ) = @_;

    my @divParents = $element->children( 'Divisions' );
    my $id = $element->first_child_text( 'HouseCode' );

    for my $divParent ( @divParents ) {
        my @divisions = $divParent->children( 'Division' );
        # Sum the DivisionQuantity values (not the raw element text),
        # accumulating in case there are several Divisions blocks
        $bedrooms{$id} += sum 0,
            map { $_->first_child_text( 'DivisionQuantity' ) } @divisions;
    }

    $element->purge;   # Free up memory
}

dump \%bedrooms;



Answer 2:


See the Processing an XML document chunk by chunk section of the XML::Twig documentation; it specifically discusses how to process a document part by part, which allows large XML files to be handled.
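Applied to the question's task, the chunk-by-chunk pattern might look like the sketch below. It assumes the question's element names (DivisionHouseRoom, HouseCode, Divisions, Division, DivisionNumber, DivisionQuantity); the inline XML is a made-up stand-in for the real file, which would be read with parsefile() instead.

```perl
use strict;
use warnings;
use XML::Twig;
use List::Util 'sum0';    # sum0 returns 0 for an empty list

my %good = map { $_ => 1 } qw( 30 31 32 35 38 );

open my $fh, '>', 'Result.csv' or die $!;    # open once, not per row

my $twig = XML::Twig->new(
    twig_roots => {
        DivisionHouseRoom => sub {
            my ( $t, $room ) = @_;
            my $code = $room->first_child_text('HouseCode');
            # Sum quantities only for the wanted division numbers
            my $beds = sum0 map  { $_->first_child_text('DivisionQuantity') }
                            grep { $good{ $_->first_child_text('DivisionNumber') } }
                            $room->descendants('Division');
            print {$fh} join( "\t", $code, $beds ), "\n";
            $t->purge;    # keep memory use flat
        },
    },
);

# parse() on an inline snippet for illustration; in real use:
# $twig->parsefile('divisionhouserooms-v3.xml');
$twig->parse(<<'XML');
<root>
  <DivisionHouseRoom>
    <HouseCode>H1</HouseCode>
    <Divisions>
      <Division><DivisionNumber>30</DivisionNumber><DivisionQuantity>2</DivisionQuantity></Division>
      <Division><DivisionNumber>99</DivisionNumber><DivisionQuantity>7</DivisionQuantity></Division>
    </Divisions>
  </DivisionHouseRoom>
</root>
XML

close $fh;
```

Each house is written to the CSV as soon as it has been parsed and is then purged, so only one DivisionHouseRoom subtree is ever held in memory at a time.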



Source: https://stackoverflow.com/questions/7293687/out-of-memory-while-parsing-large-100-mb-xml-file-using-perl
