Splitting up html code tags and content

眉间皱痕 submitted on 2019-12-08 03:58:16

Question


Does anyone with more knowledge of regular expressions than me know how to split up HTML code so that all tags and all words are separated? For example:

<p>Some content <a href="www.test.com">A link</a></p>

is separated like this:

array = { [0]=>"<p>",
          [1]=>"Some",
          [2]=>"content",
          [3]=>"<a href='www.test.com'>",
          [4]=>"A",
          [5]=>"link",
          [6]=>"</a>",
          [7]=>"</p>" }

I've been using preg_split so far and have managed either to split the string by whitespace or to split it by tags, but then all the content ends up in one array element when I need that to be split too.

Can anyone help me out?


Answer 1:


preg_split shouldn't be used in that case. Try preg_match_all:

$text = '<p>Some content <a href="www.test.com">A link</a></p>';
preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);
print_r($tokens);

output:

Array
(
    [0] => Array
        (
            [0] => <p>
            [1] => Some
            [2] => content
            [3] => <a href="www.test.com">
            [4] => A
            [5] => link
            [6] => </a>
            [7] => </p>
        )

)

I assume you forgot to include the 'A' in 'A link' in your example.

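Be aware that if your HTML contains < or > characters that are not meant as the start or end of a tag (for example inside text or an attribute value), this regex will tokenize it badly, hence the usual warnings against parsing HTML with regex.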
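As a small illustration of that caveat, here is the same pattern applied to a made-up input that contains a literal < inside the text. The input string is an assumption chosen for the demonstration, not something from the question.

```php
<?php
// Hypothetical input: the "<" in "1 < 2" is text, not the start of a tag.
$text = '<p>1 < 2</p>';

preg_match_all('/<[^>]++>|[^<>\s]++/', $text, $tokens);

// The stray "<" is treated as a tag opener, so everything up to the
// next ">" is swallowed into one bogus "tag" token.
var_export($tokens[0]);
// array (
//   0 => '<p>',
//   1 => '1',
//   2 => '< 2</p>',
// )
```

The word "2" and the closing </p> are lost as separate tokens, which is the kind of breakage a real HTML parser avoids.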




Answer 2:


You could check out Simple HTML DOM Parser

Or look at the DOM parser in PHP
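A minimal sketch of the DOM route, assuming PHP's bundled DOM extension (DOMDocument) is available. The helper name tokenizeNode is hypothetical, and the tags are rebuilt from the parsed tree rather than copied from the source, so quoting and attribute formatting may differ from the original markup.

```php
<?php
// Hypothetical helper: walks the tree and collects tags and words in order.
function tokenizeNode(DOMNode $node, array &$tokens): void
{
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMText) {
            // Split text nodes on whitespace, dropping empty pieces.
            foreach (preg_split('/\s+/', $child->nodeValue, -1, PREG_SPLIT_NO_EMPTY) as $word) {
                $tokens[] = $word;
            }
        } elseif ($child instanceof DOMElement) {
            // Rebuild the opening tag from the element name and its attributes.
            $attrs = '';
            foreach ($child->attributes as $attr) {
                $attrs .= sprintf(' %s="%s"', $attr->name, htmlspecialchars($attr->value));
            }
            $tokens[] = "<{$child->nodeName}{$attrs}>";
            tokenizeNode($child, $tokens);
            // Simplification: void elements such as <br> would also get a closing tag here.
            $tokens[] = "</{$child->nodeName}>";
        }
    }
}

$doc = new DOMDocument();
// Without extra flags, loadHTML() wraps the fragment in <html><body>.
$doc->loadHTML('<p>Some content <a href="www.test.com">A link</a></p>');

$tokens = [];
tokenizeNode($doc->getElementsByTagName('body')->item(0), $tokens);
var_export($tokens);
```

For the sample input this should produce the same eight tokens the question asks for, and it keeps working when text or attribute values contain < or > characters.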




Answer 3:


Give Simple HTML Dom Parser a try. HTML is too irregular for regular expressions.




Answer 4:


I disagree with Bart about the recommendation of preg_match_all() over preg_split().

The task is literally to "split" the whole string on a variety of delimiters. My first recommendation is a DOM parser, which is more stable than regex, but if you don't need that level of stability because your input HTML is relatively predictable and simple, then regex can be used as a cheaper, more concise alternative.

Code:

$html = <<<HTML
<p>Some content <a href="www.test.com">A link</a></p>
HTML;

var_export(preg_split('~\s+|(<[^>]+>)~', $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE));

Output:

array (
  0 => '<p>',
  1 => 'Some',
  2 => 'content',
  3 => '<a href="www.test.com">',
  4 => 'A',
  5 => 'link',
  6 => '</a>',
  7 => '</p>',
)

My pattern splits on one or more whitespace characters or on a (weak interpretation of a) html tag. The whitespaces are merely discarded. The tags are retained in the output.
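To make the role of the capture group concrete, the sketch below runs the same pattern with and without PREG_SPLIT_DELIM_CAPTURE. The variable names are only for illustration.

```php
<?php
$html = '<p>Some content <a href="www.test.com">A link</a></p>';
$pattern = '~\s+|(<[^>]+>)~';

// Without DELIM_CAPTURE every delimiter is discarded, tags included.
$wordsOnly = preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY);

// With DELIM_CAPTURE the captured group (the tag) is kept in the output;
// whitespace sits outside the group, so it is still dropped.
$withTags = preg_split($pattern, $html, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

var_export($wordsOnly);
// array (
//   0 => 'Some',
//   1 => 'content',
//   2 => 'A',
//   3 => 'link',
// )
```

PREG_SPLIT_NO_EMPTY matters in both calls: adjacent delimiters, such as a space followed by a tag, would otherwise leave empty strings in the result.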

Beyond logical semantics, preg_split() has the additional benefit of producing a less bloated and therefore more direct output. preg_split() provides a one dimensional array and preg_match_all() provides a multidimensional array.

Finally, preg_split() cannot "fail" the way preg_match_all() might. Imagine the unlikely fringe case where the input string doesn't contain any spaces or tags. preg_split() will return the whole input string as a single-element array (useful and consistent with more common input strings). preg_match_all(), given the same delimiter pattern, finds nothing and generates only empty arrays (not very useful).
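The fringe case can be checked with a short sketch. It assumes preg_match_all() is given the same delimiter pattern as preg_split(); a pattern that matches the tokens themselves, such as the one in answer 1, behaves differently and is included for comparison.

```php
<?php
$input = 'hello'; // no whitespace and no tags

// preg_split(): nothing to split on, so the input comes back as one element.
$split = preg_split('~\s+|(<[^>]+>)~', $input, -1, PREG_SPLIT_NO_EMPTY | PREG_SPLIT_DELIM_CAPTURE);

// preg_match_all() with the same delimiter pattern: zero matches,
// so $delims holds only empty arrays.
$count = preg_match_all('~\s+|(<[^>]+>)~', $input, $delims);

// preg_match_all() with the token-matching pattern from answer 1
// still finds the word, nested one level down in $tokens[0].
preg_match_all('/<[^>]++>|[^<>\s]++/', $input, $tokens);

var_export($split);     // array ( 0 => 'hello', )
var_export($count);     // 0
var_export($tokens[0]); // array ( 0 => 'hello', )
```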




Answer 5:


I currently use Simple HTML DOM Parser in several applications and find it to be an excellent tool, even when compared against other HTML parsers written in other languages.

Why exactly are you splitting up HTML into the string of tokens you described? Wouldn't a tree-like structure of DOM elements be a better approach for your specific application?



Source: https://stackoverflow.com/questions/1693396/splitting-up-html-code-tags-and-content
