RegEx to remove carriage returns between <p> tags

臣服心动 2021-01-07 07:57

I've stumped myself trying to figure out how to remove carriage returns that occur between <p> tags. (Technically I need to replace them with spaces, not

7 answers
  •  余生分开走
    2021-01-07 08:36

    This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes a half-hearted attempt at coping with attributes, but not seriously. It also doesn't attempt to cope with unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)

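    #!/usr/bin/perl
    use strict;
    use warnings;

    # Assumption: the snippet as posted does not show how these variables are
    # set up; slurping all input (stdin or named files) is one way to do it.
    my $html   = do { local $/; <> } // '';
    my $output = '';
    my $in_p   = 0;    # nesting depth of currently open <p> elements

    # \z (not \Z) so that a final newline is still lexed and copied to $output.
    while ($html !~ /\G\z/cg) {
      if ($html =~ /\G(<p\b[^>]*>)/cg) {
        # Opening <p>, with or without attributes; \b keeps <pre> etc. out.
        $output .= $1;
        $in_p ++;
      } elsif ($html =~ m[\G(</p>)]cg) {
        $output .= $1;
        $in_p --;
        # Woe unto anyone who doesn't provide a closing tag.
        # Tag soup parsers are good for this because they can generate an
        # "artificial" end to the P when they find an element that can't contain
        # a P, or the end of the enclosing element. We're not smart enough for that.
      } elsif ($html =~ /\G([^<]+)/cg) {
        # Plain text: collapse line breaks (and surrounding whitespace) to a
        # single space, but only while inside a <p>.
        my $text = $1;
        $text =~ s/\s*\n\s*/ /g if $in_p;
        $output .= $text;
      } elsif ($html =~ /\G(<)/cg) {
        # Start of any other tag: copy the "<"; the rest is lexed as text.
        $output .= $1;
      } else {
        die "Can't happen, but not having an else is scary!";
      }
    }

    print $output;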
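
    To make the behaviour described above concrete, here is a small sketch that wraps the same loop in a sub and runs it on three inputs: a plain paragraph, nested P tags, and an unclosed P. The sub name collapse_p_newlines and the sample strings are made up for illustration and are not part of the original answer.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical wrapper around the lexing loop so it can be called on a string.
sub collapse_p_newlines {
    my ($html) = @_;
    my $output = '';
    my $in_p   = 0;    # nesting depth of open <p> elements

    while ($html !~ /\G\z/cg) {
        if ($html =~ /\G(<p\b[^>]*>)/cg) {          # opening <p ...>
            $output .= $1;
            $in_p++;
        } elsif ($html =~ m[\G(</p>)]cg) {          # closing </p>
            $output .= $1;
            $in_p--;
        } elsif ($html =~ /\G([^<]+)/cg) {          # text up to the next "<"
            my $text = $1;
            $text =~ s/\s*\n\s*/ /g if $in_p;       # join lines only inside <p>
            $output .= $text;
        } elsif ($html =~ /\G(<)/cg) {              # "<" of any other tag
            $output .= $1;
        } else {
            die "Lexer made no progress";
        }
    }
    return $output;
}

for my $sample (
    "<p>line one\r\nline two</p>\n",          # CRLF inside a paragraph
    "<p>a\n<p>b\nc</p>\nd</p>\ne\nf",         # nested P tags
    "<p>one\ntwo\n<div>\nthree\n</div>\n",    # P that is never closed
) {
    # Show remaining newlines as a literal \n so the result fits on one line.
    (my $shown = collapse_p_newlines($sample)) =~ s/\n/\\n/g;
    print "[$shown]\n";
}
# Prints:
#   [<p>line one line two</p>\n]
#   [<p>a <p>b c</p> d</p>\ne\nf]
#   [<p>one two <div> three </div> ]
```

    The third line shows the failure mode: because the P is never closed, $in_p never drops back to zero, so the line breaks in and around the following div are collapsed as well.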
