C# regex check string contains html

Posted by 喜你入骨 on 2021-02-08 10:10:53

Question


I'm using the following regex pattern to check whether a string contains HTML.

string input = "<a href=\"www.google.com\">test</a>";
const string pattern = "</?\\w+((\\s+\\w+(\\s*=\\s*(?:\".*?\"|'.*?'|[^'\">\\s]+))?)+\\s*|\\s*)/?>";
Regex reg = new Regex(pattern);
var matches = reg.Matches(input);

It works fine, but it also matches when the text merely contains < and > characters, which it shouldn't. For example, the following is not considered an HTML tag in our system:

string input = "<test>";

How can I extend this pattern so that it additionally requires a closing tag (</...>) or a self-closing tag (/>)?

Thanks


Answer 1:


I would not use regex to parse or validate HTML. You could use HtmlAgilityPack:

using System.Linq; // provides Any() for the ParseErrors sequence

string input = "<a href=\"www.google.com\">test</a>";

var doc = new HtmlAgilityPack.HtmlDocument();
doc.LoadHtml(input);
bool isValidHtml = !doc.ParseErrors.Any();  // true when the parser reports no errors

If you want to allow only specific tags you could create a white-list of allowed tags:

var whiteList = new List<string> { "a", "b", "img", "#text" }; // add more allowed tags as needed
bool isValidHtmlAndTags = !doc.ParseErrors.Any() && doc.DocumentNode.Descendants()
    .All(node => whiteList.Contains(node.Name));
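If a regular expression is still required, the "AND" asked about in the question can be approximated by matching only a tag that has a closing tag of the same name, or a self-closing tag. The following is a minimal sketch, not a validator: the class name HtmlTagCheck and the method LooksLikeHtml are hypothetical, and the pattern assumes attribute values never contain < or >.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class HtmlTagCheck
{
    // First alternative:  <tag ...> ... </tag>, where the closing name is
    //                     tied to the opening name by a named backreference.
    // Second alternative: a self-closing tag such as <br/> or <img ... />.
    // Assumption: attribute values do not contain '<' or '>'.
    private static readonly Regex PairedOrSelfClosing = new Regex(
        @"<(?<tag>\w+)(?:\s[^<>]*)?>.*?</\k<tag>\s*>|<\w+(?:\s[^<>]*)?/>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Returns true only if the input contains a paired or self-closing tag.
    public static bool LooksLikeHtml(string input)
    {
        return !string.IsNullOrEmpty(input) && PairedOrSelfClosing.IsMatch(input);
    }
}
```

With this sketch, HtmlTagCheck.LooksLikeHtml("<a href=\"www.google.com\">test</a>") matches through the first alternative, while "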


Source: https://stackoverflow.com/questions/26531086/c-sharp-regex-check-string-contains-html
