and
I need to match and remove all tags using a regular expression in Perl. I have the following:
<\\\\??(?!p).+?>
But this still matche
In my opinion, trying to parse HTML with anything other than an HTML parser is just asking for a world of pain. HTML is a really complex language (which is one of the major reasons that XHTML was created, which is much simpler than HTML).
For example, this:
/
is a complete, 100% well-formed, 100% valid HTML document. (Well, it's missing the DOCTYPE declaration, but other than that ...)
It is semantically equivalent to
>
>
But it's nevertheless valid HTML that you're going to have to deal with. You could, of course, devise a regex to parse it, but, as others already suggested, using an actual HTML parser is just sooo much easier.