One option is to catch the exceptions, figure out where in the input they occurred, fix the input there, and retry.
The following is a quick, inefficient proof-of-concept script. It uses XML::Twig
because I still haven't figured out how to build and install libxml2 from scratch on Windows:
#!/usr/bin/env perl
use strict;
use warnings;
use XML::Twig;
my $xml = q{ };
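# Keep re-parsing until the document parses cleanly.
while ( 1 ) {
    eval {
        my $twig = XML::Twig->new(
            twig_handlers => { tag => \&tag_handler },
        );
        $twig->parse( $xml );
        1;
    } and last;

    my $err = $@;

    # The parser's error message includes the byte offset of the failure.
    my ($i) = ($err =~ /byte ([0-9]+)/)
        or die $err;

    # Only a stray '<' can be repaired here; anything else is fatal.
    substr($xml, $i, 1) eq '<'
        or die $err;

    # Replace the stray '<' with its entity, then retry the parse.
    $xml = substr($xml, 0, $i) . '&lt;' . substr($xml, $i + 1);
}

sub tag_handler {
    my (undef, $elt) = @_;
    print $elt->att('v'), "\n";
}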
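To see the repair step in isolation, here is a minimal sketch that needs no XML module. The sample document and the hard-coded error message are assumptions made up for illustration; the message only imitates the "line N, column N, byte N" wording that the regex in the script relies on.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical input: a bare '<' inside an attribute value is not well-formed.
my $xml = q{<tag v="1 < 2"/>};

# Assumed error text, imitating the format the script's regex expects.
my $err = 'not well-formed (invalid token) at line 1, column 10, byte 10';

# Pull the byte offset out of the message.
my ($i) = $err =~ /byte ([0-9]+)/
    or die $err;

# Only repair if the offending character really is '<'.
substr($xml, $i, 1) eq '<'
    or die $err;

# 4-argument substr swaps that one character for the entity, in place.
substr($xml, $i, 1, '&lt;');

print "$xml\n";    # prints: <tag v="1 &lt; 2"/>
```

Note that the offset is treated as a zero-based index into the string, which only lines up with byte offsets as long as everything before the error is single-byte.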
I wrote more about this on my blog.