Question
I have a regex to extract the text from an HTML font tag:
<FONT FACE=\"Excelsior LT Std Bold\"(.*)>(.*)</FONT>
That works fine until the string contains more than one font tag. Instead of matching
<FONT FACE="Excelsior LT Std Bold">Fett</FONT>
the result for string
<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal
is
<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic"
How do I get only the first tag?
Answer 1:
You need to disable greedy matching by using .*? instead of .*.
<FONT FACE=\"Excelsior LT Std Bold\"([^>]*)>(.*?)</FONT>
Note that this will fail if there is an attribute like BadAttribute="<FooBar>" somewhere after the FACE attribute of the <FONT> tag. That will mix up both matching groups, and it could get completely messed up if an attribute contained </FONT>. There is no way around this because regular expressions cannot count matching tags or quotes. So I absolutely agree with Tomalak: try to avoid using regular expressions for processing XML, HTML, and other markup languages like these.
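To make the difference concrete, here is a minimal Python sketch (Python is my choice; the question does not name a language) that runs the greedy pattern from the question and the lazy pattern from this answer against the sample string, and then reproduces the BadAttribute failure. The variable names are only illustrative.

```python
import re

text = (
    '<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + '
    '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal'
)

# Greedy pattern from the question: (.*) runs as far right as it can.
greedy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"(.*)>(.*)</FONT>')
# Lazy pattern from this answer: [^>]* stays inside the tag, (.*?) stops early.
lazy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"([^>]*)>(.*?)</FONT>')

greedy_match = greedy.search(text)
lazy_match = lazy.search(text)

print(greedy_match.group(2))  # Kursiv (the match ran on to the last </FONT>)
print(lazy_match.group(0))    # <FONT FACE="Excelsior LT Std Bold">Fett</FONT>
print(lazy_match.group(2))    # Fett

# Failure mode: a ">" inside a later attribute value ends [^>]* too early,
# so the tail of the attribute leaks into the second group.
tricky = '<FONT FACE="Excelsior LT Std Bold" BadAttribute="<FooBar>">Fett</FONT>'
tricky_match = lazy.search(tricky)
print(repr(tricky_match.group(1)))  # ' BadAttribute="<FooBar'
print(repr(tricky_match.group(2)))  # '">Fett'
```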
Answer 2:
You must use the non-greedy star:
<FONT FACE=\"Excelsior LT Std Bold\"[^>]*>(.*?)</FONT>
                                    ^^^^^  ^^^
                                      |     |
     match any character except ">" --+     +--------+
                                                     |
   match anything, but only up to the next </FONT> --+
The usual warnings about using regex to process HTML apply: You shouldn't.
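If a parser is an option, the same extraction can be done without a regex. The following is only a sketch using Python's standard html.parser module (the language is my assumption, and FontTextExtractor and font_texts are hypothetical names, not part of any library). It collects the text of every FONT element whose FACE attribute equals a given value.

```python
from html.parser import HTMLParser


class FontTextExtractor(HTMLParser):
    """Collect the text of every <FONT> element with a given FACE value."""

    def __init__(self, face):
        super().__init__()
        self.face = face
        self.depth = 0      # > 0 while inside a matching FONT element
        self.parts = []     # text pieces of the element being read
        self.results = []   # finished texts, one per matching element

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lower case.
        if tag != "font":
            return
        if self.depth > 0:
            self.depth += 1  # a FONT nested inside a matching one
        elif dict(attrs).get("face") == self.face:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts))

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def font_texts(html, face):
    parser = FontTextExtractor(face)
    parser.feed(html)
    parser.close()
    return parser.results


sample = (
    '<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + '
    '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal'
)
print(font_texts(sample, "Excelsior LT Std Bold"))  # ['Fett']
```

Because the parser tokenizes quoted attribute values itself, a ">" inside an attribute does not end the tag early the way it does with [^>]*.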
Answer 3:
You need to use a non-greedy capture, denoted by '?':
<FONT FACE=\"Excelsior LT Std Bold\"(.*?)>(.*?)</FONT>
Answer 4:
<FONT[^>]*Excelsior LT Std Bold[^>]*>(.*?)</FONT>
See Phil Haack's post here.
Here is my C# usage of this expression. This was used to remove specific CSS and JS files from an HTTP response.
const string CSSFormat = "<link[^>]*{0}[^>]*css[^>]*>";
const string JSFormat = "<script[^>]*{0}[^>]*js[^>]*></script>";
static readonly Regex OverrideCss = new Regex(string.Format(CSSFormat, "override-"), RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
static readonly Regex OverrideIconsJs = new Regex(string.Format(JSFormat, "overrideicons"), RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
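For readers not on .NET, here is a rough Python analogue of the same template idea. It is a sketch under assumptions: the answer does not show how the matches are removed, so the sub("") calls, the strip_overrides helper, and the sample file paths are all hypothetical, and the re.escape call is an addition that keeps the inserted name literal.

```python
import re

# Same templates as the C# snippet, with {0} filled in via str.format.
CSS_FORMAT = r"<link[^>]*{0}[^>]*css[^>]*>"
JS_FORMAT = r"<script[^>]*{0}[^>]*js[^>]*></script>"

flags = re.IGNORECASE | re.DOTALL  # counterparts of IgnoreCase | Singleline
override_css = re.compile(CSS_FORMAT.format(re.escape("override-")), flags)
override_icons_js = re.compile(JS_FORMAT.format(re.escape("overrideicons")), flags)


def strip_overrides(html):
    """Remove the matching <link> and <script> tags (hypothetical helper)."""
    html = override_css.sub("", html)
    return override_icons_js.sub("", html)


# Made-up page fragment for illustration only.
page = (
    '<link rel="stylesheet" href="/css/override-theme.css">'
    '<link rel="stylesheet" href="/css/site.css">'
    '<script src="/js/overrideicons.js"></script>'
    '<script src="/js/app.js"></script>'
)
print(strip_overrides(page))
# <link rel="stylesheet" href="/css/site.css"><script src="/js/app.js"></script>
```

Since [^>]* cannot cross a ">", each pattern stays inside a single tag, which is why the unrelated link and script tags survive.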
Source: https://stackoverflow.com/questions/781897/regex-matching-exactly-one-single-tag