Question
I'm using the following regex pattern to check whether a string contains HTML.
string input = "<a href=\"www.google.com\">test</a>";
const string pattern = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
Regex reg = new Regex(pattern);
var matches = reg.Matches(input);
It works for real markup, but it also matches when the text merely contains < or > characters, which it shouldn't. For example, the following is not considered an HTML tag in our system:
string input = "<test>";
How can I change this pattern so that it also requires a closing tag (</...>) or a self-closing tag (/>)?
Thanks
Answer 1:
I would not use a regex to parse or validate HTML. You could use HtmlAgilityPack instead:
string input = "<a href=\"www.google.com\">test</a>";
var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(input);
bool isValidHtml = !doc.ParseErrors.Any(); // true (needs using System.Linq;)
If you want to allow only specific tags, you could create a whitelist of allowed tags:
var whiteList = new List<string> { "a", "b", "img", "#text" }; // add more allowed tags as needed
bool isValidHtmlAndTags = doc.ParseErrors.Count() == 0
    && doc.DocumentNode.Descendants().All(node => whiteList.Contains(node.Name));
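If adding a parser dependency is not an option, the constraint from the question (a tag only counts when it is self-closing or has a matching closing tag) can be roughly approximated with a backreference. The following is only a sketch under that assumption: the class name HtmlTagCheck, the method ContainsHtml, and the pattern itself are illustrative and not part of the original code. It does not validate HTML, it does not handle attribute values that contain < or >, and it ignores void tags written without a slash (such as <br>).

```csharp
using System;
using System.Text.RegularExpressions;

public static class HtmlTagCheck
{
    // Hypothetical pattern: matches <name .../> or <name ...>...</name>.
    // \1 refers back to the tag name captured by (\w+), so an opening
    // tag only counts when a closing tag with the same name follows.
    private static readonly Regex PairedOrSelfClosed = new Regex(
        @"<(\w+)(\s+[^<>]*)?(/>|>.*?</\1\s*>)",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    public static bool ContainsHtml(string input)
    {
        return !string.IsNullOrEmpty(input) && PairedOrSelfClosed.IsMatch(input);
    }
}

public static class Program
{
    public static void Main()
    {
        // Opening tag with a matching closing tag.
        Console.WriteLine(HtmlTagCheck.ContainsHtml("<a href=\"www.google.com\">test</a>")); // True
        // Lone opening tag, as in the question: not treated as HTML.
        Console.WriteLine(HtmlTagCheck.ContainsHtml("<test>")); // False
        // Self-closing tag.
        Console.WriteLine(HtmlTagCheck.ContainsHtml("<br />")); // True
    }
}
```

A parser such as HtmlAgilityPack remains the more robust choice; the regex only covers the narrow "paired or self-closed" rule described in the question.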
Source: https://stackoverflow.com/questions/26531086/c-sharp-regex-check-string-contains-html