How, using regex, can I capture the outer HTML element, when the same element type is nested within it?

Posted by 微笑、不失礼 on 2019-12-11 23:52:57

Question


I'm trying to capture certain parts of HTML using regular expressions, and I've come across a situation which I don't know how to resolve.

I've got an HTML fragment like this:

<span ...> .... <span ...> ... </span> ... </span>

so, a <span> element into which another <span> element is nested.

I've been successfully using the following regex (in PHP's preg_match() / preg_match_all()) to capture entire HTML elements:

@<sometag[^>]+>.*?</sometag>@

This would capture a given starting tag and everything up to the closing tag of the same type.

However, in the situation above, this would capture the starting <span> and everything up to the next closing </span> encountered, so what I get is this:

<span ...> .... <span ...> ... </span>

that is, the outer starting tag, then everything until the starting tag of the inner span, then everything up to the closing tag of the inner span, which, of course, is not what I want.
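The behaviour described above can be reproduced with a short script. This is only a sketch: the sample markup and the variable names are made up for illustration, and "span" has been substituted for "sometag" in the pattern.

```php
<?php
// Hypothetical sample markup: an outer <span> containing a nested <span>.
$html = '<span class="outer">a <span class="inner">b</span> c</span>';

// The lazy pattern from the question, applied to <span>.
preg_match('@<span[^>]+>.*?</span>@', $html, $m);

// The lazy .*? stops at the first </span>, which closes the inner element.
echo $m[0], "\n";
// prints: <span class="outer">a <span class="inner">b</span>
```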

What I really wanted is the outer <span> element complete with everything that is inside it, including the inner nested <span>.

Is there any practical way to achieve this?

Note: parsing the HTML using an XML parser is probably not an option, as the HTML I'm working on is old and very broken HTML 4 coming out of MS FrontPage that any parser would choke on.

Thanks for any help!


Answer 1:


Obviously, the "right" answer is to use a DOM parser instead of regex, but you say your markup is too broken for a parser.

Before resorting to a regex, though, check whether simpleHTMLDOM can make sense of it. It is a bit more lenient towards broken markup than PHP's DOM-based parsers.
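If a regex turns out to be unavoidable, PCRE's recursion syntax, (?R), is one possible route, since it lets a pattern match a nested copy of itself. The following is a minimal sketch rather than a robust solution. It assumes the <span> tags are balanced and that no attribute value contains a literal ">"; the sample markup and variable names are invented for illustration.

```php
<?php
// Hypothetical sample markup; the second top-level <span> shows that
// sibling elements are matched separately.
$html = '<span class="outer">a <span class="inner">b</span> c</span>'
      . ' tail <span>x</span>';

// <span\b[^>]*>          opening tag
// (?:(?!</?span\b).)++   any run of text that does not start a span tag
// (?R)                   or a complete nested span element (recursion)
// </span>                the closing tag that belongs to this opening tag
// Flags: i = case-insensitive tag names, s = dot also matches newlines.
$pattern = '@<span\b[^>]*>(?:(?:(?!</?span\b).)++|(?R))*</span>@is';

preg_match_all($pattern, $html, $matches);

foreach ($matches[0] as $element) {
    echo $element, "\n";
}
// prints:
// <span class="outer">a <span class="inner">b</span> c</span>
// <span>x</span>
```

On markup where a closing </span> is missing, the outer element cannot be matched and the engine will fall back to matching inner elements instead, so the results on badly broken HTML should be checked by hand. On very large inputs it is also worth checking preg_last_error() in case PCRE's backtracking or recursion limits were hit.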



Source: https://stackoverflow.com/questions/3457072/how-using-regex-can-i-capture-the-outer-html-element-when-the-same-element-ty
