regex c# extracting url from <a> tag

☆樱花仙子☆ posted on 2019-12-02 05:05:48

Regular expressions work with HTML only in very specific, simple cases. For example, if the text contains only a single tag, you can use "href\\s*=\\s*\"(?<url>.*?)\"" to extract the URL, e.g.:

var url = Regex.Match(text, "href\\s*=\\s*\"(?<url>.*?)\"").Groups["url"].Value;

For the sample input, this returns:

https://website.com/-id1

This regex doesn't do anything fancy. It looks for href= with optional whitespace around the equals sign, then lazily (.*?) captures everything between the opening double quote and the next double quote. The result is stored in the named group url.

Anything fancier gets complex quickly. For example, supporting both single and double quotes requires special handling so that a match cannot start on a single quote and end on a double quote. The string could also contain multiple <a> tags that use both types of quotes.
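
One way to handle both quote styles is a backreference: capture the opening quote and require the same character to close the value. The sketch below is only an illustration and not part of the original answer. The names HrefRegex and ExtractAll are hypothetical, and the pattern still ignores unquoted attribute values, HTML comments, and script content.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefRegex
{
    // (?<q>["']) captures the opening quote; \k<q> requires the same
    // character to close the value, so ' cannot be paired with ".
    private static readonly Regex Href = new Regex(
        "href\\s*=\\s*(?<q>[\"'])(?<url>.*?)\\k<q>",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns every quoted href value found in html.
    public static List<string> ExtractAll(string html)
    {
        var urls = new List<string>();
        foreach (Match m in Href.Matches(html))
        {
            urls.Add(m.Groups["url"].Value);
        }
        return urls;
    }

    public static void Main()
    {
        // One tag with single quotes, one with double quotes.
        var html = "<a href='https://website.com/-id1'>One</a>" +
                   "<a href=\"https://website.com/-id2\">Two</a>";
        foreach (var url in ExtractAll(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Even with the backreference, this remains a text search rather than an HTML parse, so it will also pick up href="..." inside comments or script blocks.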

For complex parsing, it is better to use a library such as AngleSharp or HtmlAgilityPack.
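
As a rough sketch of the library route, the following uses HtmlAgilityPack, assuming the HtmlAgilityPack NuGet package is referenced by the project. The class name HrefParser and the method ExtractAll are hypothetical names chosen for illustration. The answer does not show library code, so treat this as one possible shape rather than the recommended one.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is installed

public static class HrefParser
{
    // Hypothetical helper: parses html and returns the href of every <a> that has one.
    public static List<string> ExtractAll(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var urls = new List<string>();
        // SelectNodes returns null, not an empty collection, when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return urls;
        }
        foreach (var a in anchors)
        {
            urls.Add(a.GetAttributeValue("href", string.Empty));
        }
        return urls;
    }

    public static void Main()
    {
        var html = "<a href='https://website.com/-id1'>One</a>" +
                   "<a href=\"https://website.com/-id2\">Two</a>";
        foreach (var url in ExtractAll(html))
        {
            Console.WriteLine(url);
        }
    }
}
```

Because the parser handles the quoting, the quote style and the order of attributes no longer matter to the calling code.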

Try this:

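using System.Collections.Generic;
using System.Text.RegularExpressions;

var input = "<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a><a style=\"font - weight: bold; \" href=\"https://website.com/-id2\">MyLink2</a>";
// Lazily match each opening <a ...> tag and capture the double-quoted href value.
var r = new Regex("<a.*?href=\"(.*?)\".*?>");
var output = r.Matches(input);
var urls = new List<string>();
// Typing the loop variable as Match avoids the "item as Match" cast.
foreach (Match item in output) {
    urls.Add(item.Groups[1].Value);
}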

This finds every <a> tag, extracts its href value, and stores the values in the urls list.

Explanation

<a matches the beginning of the <a> tag
.*?href= matches anything, lazily, up to href=
"(.*?)" matches and captures everything between the double quotes
.*?> matches the rest of the tag up to the closing >
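
The breakdown above can also be kept next to the pattern itself. The sketch below writes the same regex with RegexOptions.IgnorePatternWhitespace, which ignores unescaped whitespace in the pattern and treats # as the start of a comment. The class name AnnotatedPattern and the method ExtractAll are hypothetical, and the pattern is unchanged, so it keeps the same limits (double quotes only, no whitespace around the equals sign).

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class AnnotatedPattern
{
    // In a verbatim string, "" stands for a single double quote.
    private static readonly Regex Anchor = new Regex(@"
        <a          # beginning of the <a> tag
        .*?href=    # anything, lazily, up to href=
        ""(.*?)""   # capture whatever sits between the double quotes
        .*?>        # anything, lazily, up to the closing >
        ", RegexOptions.IgnorePatternWhitespace);

    // Hypothetical helper: returns capture group 1 of every match.
    public static string[] ExtractAll(string html)
    {
        return Anchor.Matches(html)
                     .Cast<Match>()
                     .Select(m => m.Groups[1].Value)
                     .ToArray();
    }

    public static void Main()
    {
        var input = "<a style=\"font - weight: bold; \" href=\"https://website.com/-id1\">MyLink</a>" +
                    "<a style=\"font - weight: bold; \" href=\"https://website.com/-id2\">MyLink2</a>";
        foreach (var url in ExtractAll(input))
        {
            Console.WriteLine(url);
        }
    }
}
```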
