How can I remove unused, nested HTML span tags with a Perl regex?

南旧 2021-01-06 10:23

I'm trying to remove unused spans (i.e. those with no attributes) from HTML files, having already cleaned up all the attributes I didn't want with other regular expressions.

4 answers
  •  不知归路
    2021-01-06 11:01

    Try HTML::Parser:

    #!/usr/bin/perl
    
    use strict;
    use warnings;
    
    use HTML::Parser;
    
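    my @print_span;   # stack: one true/false flag per currently open <span>
    my $p = HTML::Parser->new(
      start_h   => [ sub {
        my ($text, $name, $attr) = @_;
        if ( $name eq 'span' ) {
          # Keep the tag only if it has at least one attribute.
          my $print_tag = scalar keys %$attr;
          push @print_span, $print_tag;
          return if !$print_tag;
        }
        print $text;
      }, 'text,tagname,attr'],
      end_h => [ sub {
        my ($text, $name) = @_;
        if ( $name eq 'span' ) {
          # Drop the closing tag if its matching opening tag was dropped.
          return if !pop @print_span;
        }
        print $text;
      }, 'text,tagname'],
      # Everything else (text, comments, declarations) passes through unchanged.
      default_h => [ sub { print shift }, 'text'],
    );
    $p->parse_file(\*DATA) or die "Err: $!";
    $p->eof;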
    
    __END__
    
    
    This is a title
    
    
    

    This is a header

    a b c de
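
To make the stack logic easier to try on nested spans, here is a minimal sketch of the same idea that works on an in-memory string and returns the result instead of printing it. The helper name `strip_bare_spans` and the sample input are made up for illustration and are not part of the answer above; the handlers follow the same push/pop pattern.

```perl
#!/usr/bin/perl
use strict;
use warnings;

use HTML::Parser;

# Hypothetical helper (name is an assumption): returns the input HTML with
# every attribute-less <span> and its matching </span> removed.
sub strip_bare_spans {
    my ($html) = @_;
    my $out = '';
    my @keep_span;    # one flag per currently open <span>

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($text, $name, $attr) = @_;
            if ( $name eq 'span' ) {
                my $keep = keys(%$attr) ? 1 : 0;
                push @keep_span, $keep;    # remember the decision for </span>
                return unless $keep;       # drop a span with no attributes
            }
            $out .= $text;
        }, 'text,tagname,attr' ],
        end_h => [ sub {
            my ($text, $name) = @_;
            if ( $name eq 'span' ) {
                # Emit </span> only if the matching <span> was kept.
                return unless pop @keep_span;
            }
            $out .= $text;
        }, 'text,tagname' ],
        # Text, comments and declarations are copied through unchanged.
        default_h => [ sub { $out .= shift }, 'text' ],
    );
    $p->parse($html);
    $p->eof;
    return $out;
}

# Made-up sample: a bare span, then a bare span nested inside a kept one.
my $in = '<p>a <span>b</span> c <span class="x">d<span>e</span></span></p>';
print strip_bare_spans($in), "\n";
# prints: <p>a b c <span class="x">de</span></p>
```

Because each `</span>` pops the flag pushed by its own `<span>`, nesting is handled correctly, which is the part a single regular expression tends to get wrong.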
