Regex - Matching exactly one single tag

怎甘沉沦 submitted on 2021-02-16 18:24:08

Question


I have a regex to extract the text from an HTML font tag:

<FONT FACE=\"Excelsior LT Std Bold\"(.*)>(.*)</FONT>

That works fine until the string contains more than one font tag. Instead of matching

<FONT FACE="Excelsior LT Std Bold">Fett</FONT>

the result for string

<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal

is

<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic"

How do I get only the first tag?


Answer 1:


You need to disable greedy matching by using .*? instead of .*.

<FONT FACE=\"Excelsior LT Std Bold\"([^>]*)>(.*?)</FONT>

Note that this will fail if there is an attribute like BadAttribute="<FooBar>" somewhere after the FACE attribute of the <FONT> tag. That will mix up the two matching groups, and the result could get completely messed up if an attribute contained </FONT>. There is no way around this because regular expressions cannot count matching tags or quotes. So I absolutely agree with Tomalak: try to avoid using regular expressions for processing XML, HTML, and other markup languages like these.
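The difference between the greedy and the non-greedy pattern, and the failure case described above, can be reproduced with a short sketch. Python is used here only as an assumption, since the question does not name a language; the variable names are illustrative.

```python
import re

text = (
    '<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + '
    '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal'
)

# Greedy pattern from the question: both groups grab as much as they can.
greedy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"(.*)>(.*)</FONT>')
# Pattern from this answer: [^>]* stays inside the opening tag,
# and .*? stops at the first </FONT>.
lazy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"([^>]*)>(.*?)</FONT>')

greedy_match = greedy.search(text)
lazy_match = lazy.search(text)

print(greedy_match.group(2))  # Kursiv (text of the LAST font tag)
print(lazy_match.group(0))    # <FONT FACE="Excelsior LT Std Bold">Fett</FONT>
print(lazy_match.group(2))    # Fett

# Failure case: an attribute value containing ">" splits the groups wrongly.
bad = '<FONT FACE="Excelsior LT Std Bold" BadAttribute="<FooBar>">Fett</FONT>'
bad_match = lazy.search(bad)
print(bad_match.group(2))     # ">Fett
```

In the last case the attribute group ends at the ">" inside the attribute value, so the text group picks up the leftover `">` in front of the actual text.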




Answer 2:


You must use the non-greedy star:

<FONT FACE=\"Excelsior LT Std Bold\"[^>]*>(.*?)</FONT>
                                    ^^^^^  ^^^
                                      |     |
     match any character except ">" --+     +--------+
                                                     |
   match anything, but only up to the next </FONT> --+

The usual warnings about using regex to process HTML apply: You shouldn't.
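As one possible alternative to a regex, the same extraction can be sketched with an HTML parser. The example below assumes Python and its standard-library html.parser module; the class name FontTextExtractor and its attributes are hypothetical and not part of any answer in this thread.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collects the text inside <FONT> elements whose FACE equals a given value."""

    def __init__(self, face):
        super().__init__()
        self.face = face
        self.depth = 0      # > 0 while inside a matching <FONT> element
        self.results = []   # one string per matching element

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag != "font":
            return
        if self.depth > 0:
            self.depth += 1  # a <font> nested inside a matching one
        elif dict(attrs).get("face") == self.face:
            self.depth = 1
            self.results.append("")

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.results[-1] += data


text = (
    '<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + '
    '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal'
)

parser = FontTextExtractor("Excelsior LT Std Bold")
parser.feed(text)
parser.close()
print(parser.results)  # ['Fett']
```

The depth counter is there so that a font tag nested inside a matching one does not end the capture early, which is the kind of counting a regular expression cannot do.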




Answer 3:


You need to use a non-greedy capture, denoted by '?':

 <FONT FACE=\"Excelsior LT Std Bold\"(.*?)>(.*?)</FONT>



Answer 4:


<FONT[^>]*Excelsior LT Std Bold[^>]*>(.*?)</FONT>

See Phil Haack's post here.

Here is my C# usage of this expression. This was used to remove specific CSS and JS files from an HTTP response.

using System.Text.RegularExpressions;

// {0} is replaced with the part of the file name to look for.
const string CSSFormat = "<link[^>]*{0}[^>]*css[^>]*>";
const string JSFormat = "<script[^>]*{0}[^>]*js[^>]*></script>";

static readonly Regex OverrideCss = new Regex(string.Format(CSSFormat, "override-"), RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
static readonly Regex OverrideIconsJs = new Regex(string.Format(JSFormat, "overrideicons"), RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
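
The answer shows only the pattern definitions, not the removal step. A rough Python equivalent, including that step, might look like the sketch below. The function name strip_overrides and the sample page are assumptions made for illustration; re.escape is an addition that the C# code does not have.

```python
import re

# Same pattern templates as the C# constants, with {0} filled in by str.format.
CSS_FORMAT = r"<link[^>]*{0}[^>]*css[^>]*>"
JS_FORMAT = r"<script[^>]*{0}[^>]*js[^>]*></script>"

# re.escape guards against regex metacharacters in the inserted name.
override_css = re.compile(CSS_FORMAT.format(re.escape("override-")), re.IGNORECASE)
override_icons_js = re.compile(JS_FORMAT.format(re.escape("overrideicons")), re.IGNORECASE)


def strip_overrides(html):
    """Remove the matching <link> and <script> tags from an HTML string."""
    html = override_css.sub("", html)
    return override_icons_js.sub("", html)


# Hypothetical response body used only to exercise the patterns.
page = (
    '<head>'
    '<link rel="stylesheet" href="/css/override-theme.css">'
    '<link rel="stylesheet" href="/css/site.css">'
    '<script src="/js/overrideicons.js"></script>'
    '<script src="/js/app.js"></script>'
    '</head>'
)

cleaned = strip_overrides(page)
print(cleaned)
```

Because every wildcard is [^>]*, each pattern stays within a single opening tag, so the neighbouring site.css and app.js tags are left alone.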


Source: https://stackoverflow.com/questions/781897/regex-matching-exactly-one-single-tag
