PHP RegExp for nested Div tags

心在旅途 2021-01-01 07:52

I need a regexp I can use with PHP's preg_match_all() to match the content inside div tags. The divs look like this:

<div id="t1">Content</div>
4 answers
  •  天涯浪人
    2021-01-01 08:00

    Try a parser instead:

    require_once "simple_html_dom.php";

    $text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz <div id="t2">yes</div>';

    $html = str_get_html($text);

    // Keep only the divs whose id starts with 't' followed by digits.
    foreach ($html->find('div') as $e) {
        if (isset($e->attr['id']) && preg_match('/^t\d++/', $e->attr['id'])) {
            echo $e->outertext . "\n";
        }
    }

    Output:

    <div id="t1">Content <div>more stuff</div></div>
    <div id="t2">yes</div>

    Download the parser here: http://simplehtmldom.sourceforge.net/

    Edit: mostly for my own amusement, I tried to do it with a regex. Here's what I came up with:

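    $text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div> baz <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div></div>';

    if (preg_match_all('#<div\s+id="t\d+">[^<>]*(<div[^>]*>(?:[^<>]*|(?1))*</div>)[^<>]*</div>#si', $text, $matches)) {
        print_r($matches[0]);
    }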

Output:

Array
(
    [0] => <div id="t1">Content <div>more stuff</div></div>
    [1] => <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div></div>
)

And a small explanation:

<div\s+id="t\d+">  # match an opening 'div' with an id that starts with 't' and some digits
[^<>]*             # match zero or more chars other than '<' and '>'
(                  # open group 1
  <div[^>]*>       #   match an opening 'div'
  (?:              #   open a non-capturing group
    [^<>]*         #     match zero or more chars other than '<' and '>'
    |              #     OR
    (?1)           #     recursively match what is defined by group 1
  )*               #   close the non-capturing group and repeat it zero or more times
  </div>           #   match a closing 'div'
)                  # close group 1
[^<>]*             # match zero or more chars other than '<' and '>'
</div>             # match a closing 'div'
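
The recursive call (?1) is probably easier to see on a smaller problem. The snippet below is only an illustrative sketch: it swaps the div tags for plain parentheses, and the sample string is made up. Group 1 matches an opening parenthesis, then any mix of non-parenthesis characters and further instances of group 1, then a closing parenthesis.

```php
<?php
// Group 1: "(" + any mix of non-paren runs or nested group-1 matches + ")".
// (?1) re-enters group 1, which is what lets the pattern follow nesting of any depth.
$pattern = '/(\((?:[^()]++|(?1))*\))/';

// Made-up sample input with one nested pair and one flat pair.
preg_match_all($pattern, 'a (b (c) d) e (f)', $m);

// $m[0] holds the outermost balanced groups: "(b (c) d)" and "(f)".
print_r($m[0]);
```

The div version above has the same shape, with the opening and closing tags standing in for "(" and ")".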

Now perhaps you understand why people try to dissuade you from using regex for this. As already noted, it will not help if the HTML is improperly formed: the regex will make a bigger mess of the output than an HTML parser, I assure you. Also, the regex will probably make your eyes bleed, and your colleagues (or whoever maintains your software) may come looking for you after seeing what you did. :)

Your best bet is to first clean up your input (using TIDY or similar), and then use a parser to get the info you want.
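
As a rough sketch of the parser route without third-party code, PHP's bundled DOM extension can be used in place of simple_html_dom. Its underlying libxml HTML parser tolerates unclosed tags, which covers part of what a Tidy pass would do. The function name findTDivs and the sample markup are made up for illustration, and the "t plus digits" id rule is assumed from the question.

```php
<?php
// Hypothetical helper: returns [id, text] pairs for every div whose id is 't' + digits.
function findTDivs(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml parse warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<html><body>' . $html . '</body></html>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $result = [];
    foreach ($xpath->query('//div[@id]') as $div) {
        $id = $div->getAttribute('id');
        if (preg_match('/^t\d+$/', $id)) {
            // textContent includes the text of nested divs as well.
            $result[] = [$id, $div->textContent];
        }
    }
    return $result;
}

// Illustrative input; the last div is deliberately left unclosed.
$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar '
      . '<div>even more</div> baz <div id="t2">yes';

foreach (findTDivs($text) as [$id, $content]) {
    echo $id, ': ', $content, "\n";
}
```

Because the parser builds a tree, nesting depth stops being a concern: each matching div is a node, and its nested divs are simply its children.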
