tags
I've stumped myself trying to figure out how to remove carriage returns that occur between <p> tags. (Technically I need to replace them with spaces, not
This is the "almost good enough" lexing solution promised in my other answer, to sketch how it can be done. It makes only a half-hearted attempt at coping with attributes, and it doesn't attempt to cope with an unencoded "<" in attributes. These are relatively minor failings, and it does handle nested P tags, but as described in the comments it's totally unable to handle the case where someone doesn't close a P, because we can't do that without a thorough understanding of HTML. Considering how prevalent that practice still is, it's safe to declare this code "nearly useless". :)
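#!/usr/bin/perl
use strict;
use warnings;

# Slurp the HTML to process; reading STDIN here is just one way to supply it.
my $html = do { local $/; <STDIN> };
$html = '' unless defined $html;
my $output = '';
my $in_p   = 0;    # how many <p> elements are currently open

# \G anchors each match where the previous one ended; /c keeps that
# position when a branch fails, so the next branch retries the same spot.
while ($html !~ /\G\z/cg) {
    if ($html =~ /\G(<p\b[^>]*>)/cgi) {    # \b keeps <pre> etc. from counting as <p>
        $output .= $1;
        $in_p++;
    } elsif ($html =~ m[\G(</p>)]cgi) {
        $output .= $1;
        $in_p-- if $in_p > 0; # Woe unto anyone who doesn't provide a closing tag.
        # Tag soup parsers are good for this because they can generate an
        # "artificial" end to the P when they find an element that can't contain
        # a P, or the end of the enclosing element. We're not smart enough for that.
    } elsif ($html =~ /\G([^<]+)/cg) {
        my $text = $1;
        $text =~ s/\s*\n\s*/ /g if $in_p;    # join lines only inside a P
        $output .= $text;
    } elsif ($html =~ /\G(<)/cg) {
        # Any other tag: pass the "<" through; the rest is treated as text.
        $output .= $1;
    } else {
        die "Can't happen, but not having an else is scary!";
    }
}

print $output;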
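To make the caveats above concrete, here is a rough sketch of the same loop wrapped in a sub and run on three small inputs. The sub name `join_p_lines` and the sample strings are made up for illustration, and the `\b`, `/i` and "don't go below zero" guard are my own small assumptions, not anything a real HTML parser would consider sufficient.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: the \G-anchored lexer loop, wrapped in a sub.
sub join_p_lines {
    my ($html) = @_;
    my $output = '';
    my $in_p   = 0;    # nesting depth of currently open <p> elements

    while ($html !~ /\G\z/cg) {
        if ($html =~ /\G(<p\b[^>]*>)/cgi) {      # opening <p ...>
            $output .= $1;
            $in_p++;
        } elsif ($html =~ m[\G(</p>)]cgi) {      # closing </p>
            $output .= $1;
            $in_p-- if $in_p > 0;                # ignore stray closers
        } elsif ($html =~ /\G([^<]+)/cg) {       # run of text up to the next "<"
            my $text = $1;
            $text =~ s/\s*\n\s*/ /g if $in_p;    # join lines only inside a P
            $output .= $text;
        } elsif ($html =~ /\G(<)/cg) {           # any other tag: pass "<" through
            $output .= $1;
        } else {
            die "Lexer stuck at offset " . pos($html);
        }
    }
    return $output;
}

# Newlines outside a P survive; the one inside is replaced by a space.
print join_p_lines("<div>\nkeep\n</div>\n<p>one\ntwo</p>\n");

# Nested P tags are tracked by depth.
print join_p_lines("<p>a\n<p>b\nc</p>\nd</p>"), "\n";

# Unclosed P: the depth never drops, so the DIV's newline is eaten too.
print join_p_lines("<p>a\nb\n<div>c\nd</div>"), "\n";
```

The last call is the failure mode described above: because nothing ever closes the P, every later newline in the document gets replaced as well.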