How can I extract certain HTML tags e.g. <ul> using Regex with preg_match_all in PHP?

耗尽温柔 posted on 2019-12-06 09:25:34
Jonny 5

As the comments already stated, parsing HTML with regex is generally not recommended. In my opinion, it depends on what exactly you're trying to do.


If you want to use regex and know that there are no nested tags of the same kind, the simplest pattern for getting everything between <ul> and the closest </ul> would be:

$pattern = '~<ul>(.*?)</ul>~s';

It matches <ul> followed by as few characters of any kind as possible until </ul> is reached. The dot is a metacharacter that matches any single character except newlines (\n). To make it match newlines too, I put the s modifier after the ending delimiter ~. The quantifier * means zero or more times.

By default, quantifiers are greedy, which means they consume as much as possible while still allowing the pattern to match. A question mark ? after the * makes it non-greedy (or lazy), so it matches as few characters as possible before </ul>. As the pattern delimiter I chose the tilde ~.

preg_match_all($pattern, $html, $out);

Matches can be found in the output variable that you pass to preg_match or preg_match_all, where [0] contains everything that matches the whole pattern, [1] the first captured parenthesized subpattern, and so on. With preg_match_all and its default flags, each of these entries is itself an array with one item per match.
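A minimal sketch of the simple pattern in action, next to its greedy variant for comparison. The sample $html string and the variable names $lazy and $greedy are made up for illustration:

```php
<?php
// Hypothetical sample input: two separate, non-nested lists.
$html = "<ul><li>a</li></ul>\n<p>text</p>\n<ul>\n<li>b</li>\n</ul>";

// Lazy quantifier: each match stops at the closest </ul>.
preg_match_all('~<ul>(.*?)</ul>~s', $html, $lazy);

// Greedy quantifier: the match runs on to the last </ul> in the input.
preg_match_all('~<ul>(.*)</ul>~s', $html, $greedy);

print_r($lazy[1]);             // inner HTML of each list, one entry per match
echo count($greedy[0]), "\n";  // the greedy variant finds a single, long match
```

With this input the lazy pattern should report two lists, while the greedy one swallows everything from the first <ul> to the last </ul> as one match.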


If the tag you are searching for can contain attributes (e.g. <ul class="my_list"...), this extended pattern adds [^>]* after <ul, which matches any number of characters that are not >, before the closing >:

$pattern = '~<ul[^>]*>\K.*(?=</ul>)~Uis';

Instead of the question mark, here I use the U modifier to make all quantifiers lazy. To have only the desired part reported, that is, the content between <ul> and </ul>, \K is used to reset the beginning of the reported match. Instead of consuming the closing </ul>, a lookahead (?=...) is used, as we don't want that part in the output either.

This is basically the same as '~<ul[^>]*>(.*)</ul>~Uis', which would put whole-pattern matches in [0] and the first parenthesized group in [1].
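To make that equivalence concrete, here is a small sketch that runs both variants on the same input. The sample string and the names $a and $b are assumptions for illustration only:

```php
<?php
// Hypothetical sample input: one list with an attribute, one with uppercase tags.
$html = '<ul class="my_list"><li>a</li></ul><UL><li>b</li></UL>';

// Variant 1: \K plus lookahead, so the inner HTML is the whole match ($a[0]).
preg_match_all('~<ul[^>]*>\K.*(?=</ul>)~Uis', $html, $a);

// Variant 2: capturing group, so the inner HTML lands in group 1 ($b[1]).
preg_match_all('~<ul[^>]*>(.*)</ul>~Uis', $html, $b);

print_r($a[0]);
var_dump($a[0] === $b[1]); // both arrays hold the same inner HTML
```

The only practical difference is where the result ends up: the \K variant saves you from indexing into a capture group.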


But if your HTML contains nested tags of the same kind, the idea of the following pattern is to catch the innermost ones. At each character inside <ul>...</ul>, it checks that no opening <ul starts there:

$pattern = '~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis';

Get matches using preg_match_all

$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div>
         <ul><li>.2.</li></ul>';

if (preg_match_all($pattern, $html, $out)) {
  echo "<pre>"; print_r(array_map('htmlspecialchars', $out[0])); echo "</pre>";
} else {
  echo "FAIL";
}

The part of each match between \K and (?= ends up in $out[0].

  • \K resets the beginning of the reported match (supported in PHP since 5.2.4)
  • the nested-tag pattern, once <ul> has matched, uses a negative lookahead (?!<ul) at each character to check that no opening <ul starts there before </ul> is reached; if one does, the attempt fails and matching starts over at the next <ul, until </ul> is ahead (?=</ul>).
  • [^>]* matches any number of characters that are not > (negated character class)
  • (?: starts a non-capturing group.

Modifiers used: Uis (the part after the ending delimiter ~)

U (PCRE_UNGREEDY), i (PCRE_CASELESS), s (PCRE_DOTALL)
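As a sketch of what the U modifier does, the nested-tag pattern can also be written without it by marking each quantifier as lazy explicitly. The variable names $withU and $withoutU are made up here, and the sample input follows the example above:

```php
<?php
// Sample input modelled on the nested example above.
$html = '<div><ul><li><ul><li>.1.</li></ul>...</li></ul></div> <ul><li>.2.</li></ul>';

// With U: every quantifier is lazy by default.
preg_match_all('~<ul[^>]*>\K(?:(?!<ul).)*(?=</ul>)~Uis', $html, $withU);

// Without U: the same pattern with explicit lazy quantifiers (*?).
preg_match_all('~<ul[^>]*?>\K(?:(?!<ul).)*?(?=</ul>)~is', $html, $withoutU);

print_r($withU[0]);                     // innermost lists only
var_dump($withU[0] === $withoutU[0]);   // both spellings give the same matches
```

Note that the outer list is skipped entirely: only the innermost <ul> and the standalone second list are reported.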

Consider using strpos instead:

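$html = "the page's html source";
$first = strpos($html, '<ul>');
$last = ($first === false) ? false : strpos($html, '</ul>', $first);

if ($first !== false && $last !== false) {
    $start = $first + strlen('<ul>');            // skip past the opening tag itself
    $ul = substr($html, $start, $last - $start); // the HTML between <ul> and </ul>
}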

If there is more than one pair of <ul> tags, consider using the offset argument of strpos to grab the relevant parts.
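A rough sketch of that offset approach, assuming plain <ul> tags without attributes and without nesting. The sample string and the names $open, $close, $lists and $offset are illustrative only:

```php
<?php
// Hypothetical sample input: two plain, non-nested lists.
$html = '<ul><li>a</li></ul><p>x</p><ul><li>b</li></ul>';

$open = '<ul>';
$close = '</ul>';
$lists = [];
$offset = 0;

// Find each opening tag, resuming the search where the previous pair ended.
while (($first = strpos($html, $open, $offset)) !== false) {
    $start = $first + strlen($open);        // first character after <ul>
    $last = strpos($html, $close, $start);  // matching close, searched from there
    if ($last === false) {
        break; // unclosed list: stop rather than guess
    }
    $lists[] = substr($html, $start, $last - $start);
    $offset = $last + strlen($close);       // continue after this </ul>
}

print_r($lists);
```

Because strpos compares literal strings, this would miss tags with attributes or different letter case; the regex patterns above handle those.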
