Regex match text between paragraph tags

Question

I'm attempting to match only the content between opening/closing paragraph tags. Playing around with it on RegExr, I can get <p.*?> to match an opening paragraph tag that may or may not have any additional attributes such as class and/or ID.

However, when I add that pattern to a positive lookbehind, it breaks and I'm not sure why. I've tried escaping the < and > symbols, but that doesn't seem to help. The lookahead, however, works perfectly.

Here's an example of the entire pattern:

(?<=\<p.*?\>).*?(?=</p>)

I'd like to match only the content within the paragraph tags, and not include the tags themselves, which is why I was attempting to use lookaheads and lookbehinds.

Answer 1:

Problem

The problem with using lookbehinds is that most regex engines do not allow variable-length repetition inside them.

(?<=.*)

This is invalid because the * quantifier makes the lookbehind variable-width. If it were {8}, it would be okay, since that is fixed-width.
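A quick way to see the difference in PHP is sketched below. It assumes the PCRE engine behind preg_match(), and the sample string and variable names are made up for illustration. The exact warning text varies between PCRE versions, so the sketch only checks the return value.

```php
<?php
// Hypothetical sample input.
$html = '<p>hello</p>';

// Unbounded repetition inside the lookbehind: PCRE refuses to compile this,
// so preg_match() returns false (the @ hides the compile warning).
$variable = @preg_match('/(?<=<p.*?>).*?(?=<\/p>)/', $html, $m1);
var_dump($variable);

// Fixed-width lookbehind: compiles, but only handles a bare <p> tag
// with no attributes.
$fixed = preg_match('/(?<=<p>).*?(?=<\/p>)/', $html, $m2);
var_dump($fixed, $m2[0]);
```

The second pattern works only because <p> has a known length, which is exactly what the pattern in the question cannot guarantee once attributes are allowed.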

Solution

My advice is to match everything, and use capture groups and backreferences to process your data.

Example

<p.*?>(.*?)<\/p>

So, $1 or \1 would contain the data you want.
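In PHP, the capture group approach might look like the sketch below. The input string is a made-up example, and the s flag is my addition so that . also matches newlines inside a paragraph.

```php
<?php
// Hypothetical sample input with attributes on the opening tags.
$html = '<p class="intro">first</p> outside <p id="x">second</p>';

// Group 1 captures the text between the tags; the "s" flag lets "." span newlines.
preg_match_all('/<p.*?>(.*?)<\/p>/s', $html, $matches);

// $matches[0] holds the full matches, $matches[1] holds the captured content.
print_r($matches[1]);
```

Here $matches[1] plays the role of $1. Note that <p.*?> also matches other tags whose names start with p, such as <pre>, so a stricter opening pattern may be needed on real markup.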

Answer 2:

You should not use regex for this kind of task; there are many issues you can run into. See this post: Should I use regex or just DOM/string manipulation?

Use DOMDocument; it is very simple.

Example:

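$str = "<p>tetsd</p> doutside <p> 232323234</p>";
$doc = new DOMDocument();
// Parse the HTML fragment; libxml wraps it in <html><body> automatically.
$doc->loadHTML($str);
// Print the text content of every <p> element, one per line.
foreach ($doc->getElementsByTagName('p') as $para) {
    echo $para->textContent, PHP_EOL;
}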
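To make one of those regex issues concrete, here is a small sketch that runs the capture group pattern from the first answer and DOMDocument on the same input. The input containing a <pre> element is a hypothetical example, and the variable names are mine.

```php
<?php
// Hypothetical input: a <pre> element followed by a paragraph.
$html = '<pre>code</pre><p>text</p>';

// Regex approach: "<p.*?>" also matches "<pre>", so the capture starts too early.
preg_match('/<p.*?>(.*?)<\/p>/s', $html, $m);
$regexResult = $m[1];

// DOM approach: only real <p> elements are selected.
$doc = new DOMDocument();
$doc->loadHTML($html);
$domResult = [];
foreach ($doc->getElementsByTagName('p') as $para) {
    $domResult[] = $para->textContent;
}

var_dump($regexResult);
print_r($domResult);
```

The regex captures everything from inside the <pre> element up to the closing </p>, while the DOM approach returns only the paragraph text.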


Source: https://stackoverflow.com/questions/22133501/regex-match-text-between-paragraph-tags

Tags

php

regex

pattern-matching