PHP RegExp for nested Div tags

心在旅途 2021-01-01 07:52

I need a regexp I can use with PHP's preg_match_all() to extract the content inside div tags. The divs look like this:

<div id="t1">Content</div>

4 answers

天涯浪人 (original poster) 2021-01-01 08:00

Try a parser instead:

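require_once "simple_html_dom.php";

// Sample input: two divs with ids t1 and t2, plus divs without an id.
$text = 'foo <div id="t1">Content <div>more stuff</div></div>'
      . ' bar <div>even more</div>'
      . ' baz  <div id="t2">yes</div>';
$html = str_get_html($text);

// Print every div whose id starts with "t" followed by digits.
foreach($html->find('div') as $e) {
    if(isset($e->attr['id']) && preg_match('/^t\d++/', $e->attr['id'])) {
        echo $e->outertext . "\n";
    }
}


Output:

<div id="t1">Content <div>more stuff</div></div>
<div id="t2">yes</div>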


Download the parser here: http://simplehtmldom.sourceforge.net/
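
If pulling in a third-party library is not an option, the same parser-based idea can be sketched with PHP's bundled DOM extension (DOMDocument). This is only an illustration and not part of the original answer: the function name findIdDivs is made up here, the input is assumed to be an HTML fragment (hence the <body> wrapper), and the serialized markup may differ slightly in whitespace from the input.

```php
<?php
// Hypothetical helper: returns the outer HTML of every div whose id is
// "t" followed by digits, using the bundled DOM extension.
function findIdDivs(string $html): array
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML('<body>' . $html . '</body>');
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $result = [];
    foreach ($doc->getElementsByTagName('div') as $div) {
        // getAttribute() returns an empty string when the attribute is missing.
        if (preg_match('/^t\d+$/', $div->getAttribute('id'))) {
            $result[] = $doc->saveHTML($div);
        }
    }
    return $result;
}

$sample = 'foo <div id="t1">Content <div>more stuff</div></div>'
        . ' bar <div>even more</div>'
        . ' baz <div id="t2">yes</div>';

foreach (findIdDivs($sample) as $outerHtml) {
    echo $outerHtml, "\n";
}
```

Because the parser builds a tree, nesting depth does not matter here, and divs without a matching id are simply skipped.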

Edit: more for my own amusement, I also tried to do it with a regex. Here's what I came up with:

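$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div>
      baz <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>';
if(preg_match_all('#<div\s+id="t\d+">[^<>]*(<div[^>]*>(?:[^<>]*|(?1))*</div>)[^<>]*</div>#si', $text, $matches)) {
    print_r($matches[0]);
}


Output:

Array
(
    [0] => <div id="t1">Content <div>more stuff</div></div>
    [1] => <div id="t2">yes <div>aaa<div>bbb<div>ccc</div>bbb</div>aaa</div> </div>
)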


And a small explanation:

<div\s+id="t\d+">  # match an opening 'div' with an id made of 't' followed by some digits
[^<>]*             # match zero or more chars other than '<' and '>'
(                  # open group 1
  <div[^>]*>       #   match an opening 'div'
  (?:              #   open a non-capturing group
    [^<>]*         #     match zero or more chars other than '<' and '>'
    |              #     OR
    (?1)           #     recursively match what is defined by group 1
  )*               #   close the non-capturing group and repeat it zero or more times
  </div>           #   match a closing 'div'
)                  # close group 1
[^<>]*             # match zero or more chars other than '<' and '>'
</div>             # match a closing 'div'
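
The breakdown above can also be kept inside the pattern itself by using PCRE's x (extended) flag, which ignores whitespace and allows # comments. The following is a minimal sketch under a few assumptions: the function name matchIdDivs is hypothetical, and ~ is used as the delimiter because # would clash with the inline comments.

```php
<?php
// Hypothetical helper: returns every top-level match of the recursive
// pattern explained above, written in extended (x) mode.
function matchIdDivs(string $text): array
{
    $pattern = <<<'RE'
~
<div\s+id="t\d+">        # opening div whose id is "t" plus digits
[^<>]*                   # text before the nested div
(                        # group 1: one balanced div
  <div[^>]*>             #   opening div
  (?: [^<>]* | (?1) )*   #   plain text, or another balanced div (recursion)
  </div>                 #   closing div
)
[^<>]*                   # text after the nested div
</div>                   # closing tag of the outer div
~six
RE;

    preg_match_all($pattern, $text, $matches);
    return $matches[0];
}

$text = 'foo <div id="t1">Content <div>more stuff</div></div> bar <div>even more</div>';
print_r(matchIdDivs($text));
```

Note that group 1 is not optional in this pattern, so a div with a matching id but no nested div, such as <div id="t2">yes</div>, is not matched. That is another reason to prefer the parser approach for real input.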