Correctly matching an ending tag with its starting tag in HTML with a RegEx

Submitted by 梦想的初衷 on 2019-12-11 14:07:36

Question


I'm using VB.Net in an ASP.Net 2.0 app to run some regular expressions that remove some unnecessary markup. One of the things that I'd like to do is remove span elements that don't have any attributes in them:

output = Regex.Replace(output, "<span\s*>(?<Text>.*?)</span>" & styleRegex, "${Text}", RegexOptions.Compiled Or RegexOptions.CultureInvariant Or RegexOptions.IgnoreCase Or RegexOptions.Singleline)

So for this content:

<span>Lorem <span class="special">ipsum</span> dolor sit amet.</span>

I'd like to remove the outer span elements. Unfortunately, my regex above gives me this as a result, since the closing span matches the first one it comes across:

Lorem <span class="special">ipsum dolor sit amet.</span>

Is this possible with a RegEx or will I have to implement something a bit more advanced?
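
The behaviour described above can be reproduced with a short sketch. It is written in Python rather than VB.Net purely for illustration, so the named group is spelled `(?P<Text>...)`, and the `styleRegex` suffix from the original call is left out because its contents are not shown in the question.

```python
import re

html = '<span>Lorem <span class="special">ipsum</span> dolor sit amet.</span>'

# Python spelling of the question's pattern, without the unspecified styleRegex suffix.
# re.DOTALL corresponds to RegexOptions.Singleline.
naive = re.compile(r'<span\s*>(?P<Text>.*?)</span>', re.IGNORECASE | re.DOTALL)

# The lazy .*? stops at the first </span> it can reach, which belongs to the inner span.
result = naive.sub(lambda m: m.group('Text'), html)
print(result)  # Lorem <span class="special">ipsum dolor sit amet.</span>
```

The pattern has no way to know that an inner span was opened in between, so the outer opening tag is paired with the inner closing tag.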


Answer 1:


Unfortunately, regular expressions in the formal sense do not have this power. Matching arbitrarily nested tags requires at least a context-free language. (Sorry for the theoretical stuff.)

I'd also suggest using XSLT instead.
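
To make the point about expressive power concrete: pairing each closing tag with its opening tag means remembering which spans are currently open, and a stack does exactly that. The following is a minimal sketch in Python (chosen for illustration; the question's code is VB.Net). The function name `strip_bare_spans` is hypothetical, and the sketch assumes that attribute values contain no `>` character and that spans are never self-closing.

```python
import re

# Matches opening and closing span tags; group 1 is "/" for a closing tag,
# group 2 holds whatever follows the tag name (the attributes, if any).
TAG = re.compile(r'<(/?)span\b([^>]*)>', re.IGNORECASE)

def strip_bare_spans(html):
    out = []
    stack = []  # one flag per open span: True if that span's tags are being dropped
    pos = 0
    for m in TAG.finditer(html):
        out.append(html[pos:m.start()])  # copy text between tags unchanged
        pos = m.end()
        closing, attrs = m.group(1), m.group(2)
        if not closing:
            drop = attrs.strip() == ''   # attribute-less span: drop its tags
            stack.append(drop)
            if not drop:
                out.append(m.group(0))
        elif stack:
            if not stack.pop():          # closing tag of a span that was kept
                out.append(m.group(0))
        else:
            out.append(m.group(0))       # stray closing tag: leave it alone
    out.append(html[pos:])
    return ''.join(out)

print(strip_bare_spans('<span>Lorem <span class="special">ipsum</span> dolor sit amet.</span>'))
# Lorem <span class="special">ipsum</span> dolor sit amet.
```

The regex here only finds individual tags; the pairing is done by the stack, so any nesting depth is handled the same way.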




Answer 2:


I would use XSLT rather than regex.

It seems .NET has good support for XSLT (google: xslt vb.net) but I don't know whether it will parse non-XHTML. The standard xsltproc command will, with the --html flag.




Answer 3:


The HTML Agility Pack should help with this.

HTML Agility Pack on Codeplex




Answer 4:


XSLT isn't an option, since the input may not always be valid XML. The HTML Agility Pack on Codeplex looks pretty sweet, but it's really overkill in this case. Here's the final RegEx I ended up using:

<span\s*>(?<Text>.*?(?:<span[^>]*>.*?</span>.*?)*)</span>

Replacing that with ${Text} effectively stripped the useless outer span tags in all cases I've tested.
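
As a hedged check of this pattern, the sketch below transliterates it to Python (named group spelled `(?P<Text>...)`, `re.DOTALL` standing in for `RegexOptions.Singleline`) and runs it on the question's example and on a deeper, made-up input. The helper name `unwrap` and the `two_levels` sample are assumptions for illustration, not part of the original answer.

```python
import re

# Python spelling of the final pattern from this answer.
PATTERN = re.compile(
    r'<span\s*>(?P<Text>.*?(?:<span[^>]*>.*?</span>.*?)*)</span>',
    re.IGNORECASE | re.DOTALL,
)

def unwrap(html):
    # Equivalent of replacing each match with ${Text}.
    return PATTERN.sub(lambda m: m.group('Text'), html)

one_level = '<span>Lorem <span class="special">ipsum</span> dolor sit amet.</span>'
two_levels = '<span>a <span class="x">b <span class="y">c</span> d</span> e</span>'

print(unwrap(one_level))   # Lorem <span class="special">ipsum</span> dolor sit amet.
print(unwrap(two_levels))  # the outer span is paired with the wrong closing tag here
```

The pattern handles one level of nested spans, which covers the example in the question. With spans nested two levels deep inside the outer one, the inner alternative has the same first-closing-tag problem as the original pattern, so inputs of that shape need either a counting approach or, in .NET specifically, the regex engine's balancing group definitions.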



Source: https://stackoverflow.com/questions/926617/correctly-matching-an-ending-tag-with-its-starting-tag-in-html-with-a-regex
