Regex to extract Favicon url from a webpage

假装没事ソ submitted on 2019-12-05 16:59:46

This should match the whole link tag that contains href="http://3dbin.com/favicon.ico":

 <link .*? href="http://3dbin\.com/favicon\.ico" [^>]* />
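
As a quick check, here is a minimal Python sketch that applies the pattern above to a shortened copy of the page's head markup. The sample string (including the shortened stylesheet URL) and the variable names are assumptions made for illustration.

```python
import re

# Shortened sample markup; only the first <link> points at the favicon.
html = (
    '<meta name="robots" content="index, follow" />'
    '<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />'
    '<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/style.min.css" />'
)

# Same pattern as above: the dots in the URL are escaped so they match literally.
pattern = r'<link .*? href="http://3dbin\.com/favicon\.ico" [^>]* />'

match = re.search(pattern, html)
# group(0) is the whole matched tag, not just the URL.
whole_tag = match.group(0) if match else None
print(whole_tag)
```

Note that the pattern expects a space before href, a space after the closing quote, and a " />" ending, so a tag written differently would not match.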

Correction based on your comment:

I see you have a C# solution. Excellent! But in case you are still wondering whether it can be done with regular expressions, the following expression does what you want. Group 1 of each match holds only the URL.

 <link .*? href="(.*?\.ico)"

A simple C# snippet that makes use of it:

using System;
using System.Text.RegularExpressions;

// This is the snippet from your example, with an extra link item of the form <link ... href="...ico" > ... </link>
// added to make sure it is picked up properly.
String htmlText = "<meta name=\"description\" content=\"Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service.\" /><meta name=\"robots\" content=\"index, follow\" /><meta name=\"verify-v1\" content=\"x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=\" /><link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" type=\"image/x-icon\" /><link rel=\"shortcut icon\" href=\"http://anotherURL/someicofile.ico\" type=\"image/x-icon\">just to make sure it works with different link ending</link><link rel=\"stylesheet\" type=\"text/css\" href=\"http://3dbin.com/css/1261391049/style.min.css\" /><!--[if lt IE 8]>    <script src=\"http://3dbin.com/js/1261039165/IE8.js\" type=\"text/javascript\"></script><![endif]-->";

foreach (Match match in Regex.Matches(htmlText, "<link .*? href=\"(.*?\\.ico)\""))
{
    String url = match.Groups[1].Value;

    Console.WriteLine(url);
}

which prints the following to the console:

http://3dbin.com/favicon.ico
http://anotherURL/someicofile.ico
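
One caveat worth checking: ".*?" is free to run across tag boundaries, so the result depends on the order of the link tags. The Python sketch below is only an illustration under assumed sample markup (a stylesheet link placed before the favicon link). The `strict` pattern is a hypothetical tightened variant, not the pattern from the answer.

```python
import re

# Sample markup where a stylesheet <link> comes BEFORE the favicon <link>.
html = (
    '<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/style.min.css" />'
    '<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />'
)

# Pattern in the style of the answer (dot escaped): ".*?" may cross tag boundaries.
loose = re.compile(r'<link .*? href="(.*?\.ico)"')

# Hypothetical stricter variant: [^>] keeps the match inside one tag,
# [^"] keeps the captured group inside one attribute value.
strict = re.compile(r'<link [^>]*? href="([^"]*?\.ico)"')

loose_urls = loose.findall(html)
strict_urls = strict.findall(html)

print(loose_urls)   # one long string that starts at the stylesheet URL
print(strict_urls)  # ['http://3dbin.com/favicon.ico']
```

With the markup from the C# snippet both patterns give the same two URLs, because the favicon links come first there.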

Another option, written as a Perl-style regex. It is not robust, but it could work:

<link\s+[^>]*(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"(?:\s+href\s*=\s*"([^"]+)")?
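
To see where the "not robust" part bites, here is a small Python sketch that runs the pattern above on two hand-written tags. The `lazy` variant, the helper `favicon_href`, and the sample tags are assumptions added for illustration. As far as I can tell, the greedy `[^>]*` swallows an href that comes before rel, so group 1 stays empty in that case.

```python
import re

# The pattern from above, unchanged.
greedy = re.compile(
    r'<link\s+[^>]*(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"'
    r'(?:\s+href\s*=\s*"([^"]+)")?'
)

# Hypothetical variant: the only change is the lazy [^>]*? after <link\s+.
lazy = re.compile(
    r'<link\s+[^>]*?(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"'
    r'(?:\s+href\s*=\s*"([^"]+)")?'
)

def favicon_href(pattern, tag):
    """Return the href captured by either group, or None if nothing was captured."""
    match = pattern.search(tag)
    if match is None:
        return None
    return match.group(1) or match.group(2)

href_first = '<link href="/a.ico" rel="shortcut icon">'
rel_first = '<link rel="shortcut icon" href="/b.ico">'

print(favicon_href(greedy, rel_first))   # /b.ico
print(favicon_href(greedy, href_first))  # None: the tag matches, but no URL is captured
print(favicon_href(lazy, href_first))    # /a.ico
print(favicon_href(lazy, rel_first))     # /b.ico
```

Neither version copes with another attribute sitting between rel and a following href, which is one more reason to prefer a parser.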

Johnsyweb

This is not a job for a regular expression, as you'll see if you spend 2 minutes on StackOverflow looking for how to parse HTML.

Use an HTML parser instead!

Here's a trivial example in Python (I'm sure this is equally do-able in C#):

% python
Python 2.7.1 (r271:86832, May 16 2011, 19:49:41) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen('https://stackoverflow.com/')
>>> soup = BeautifulSoup(page)
>>> link = soup.html.head.find(lambda x: x.name == 'link' and x['rel'] == 'shortcut icon')
>>> link['href']
u'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
>>> link['href'].endswith('.ico')
True
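
The session above uses Python 2 and BeautifulSoup 3. For anyone without those installed, here is a rough sketch of the same idea using only the Python 3 standard library (`html.parser`), which is a swapped-in technique rather than the BeautifulSoup approach shown. The class name `FaviconFinder` and the sample markup are made up for illustration.

```python
from html.parser import HTMLParser

class FaviconFinder(HTMLParser):
    """Collect href values of <link> tags whose rel attribute includes 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before calling this method.
        if tag != "link":
            return
        attributes = dict(attrs)
        # rel is a space-separated token list, e.g. "shortcut icon".
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if "icon" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])

# Assumed sample page: mixed-case tag and attribute names, no self-closing slash.
html = (
    '<html><head>'
    '<link rel="stylesheet" type="text/css" href="/css/style.min.css" />'
    '<LINK REL="Shortcut Icon" HREF="http://3dbin.com/favicon.ico" type="image/x-icon">'
    '</head><body></body></html>'
)

finder = FaviconFinder()
finder.feed(html)
favicons = finder.hrefs
print(favicons)  # ['http://3dbin.com/favicon.ico']
```

Because the parser handles attribute order, letter case, and tag endings, none of the special cases that trip up the regular expressions need any extra code here.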

I had a go at this a while back, so here is something pretty simple. First it tries to find the /favicon.ico file. If that fails, it loads the page using Html Agility Pack and uses XPath to find any link tags. It then loops through the link tags to see whether they have a rel='icon' attribute. If one does, it grabs the href attribute, if it exists, and expands it into an absolute URL for that site.

Please feel free to play around with this and offer any improvements.

// Requires: using System; using System.Linq; using System.Net; using HtmlAgilityPack;
private static Uri GetFaviconUrl(string siteUrl)
{
    // try looking for a /favicon.ico first
    var url = new Uri(siteUrl);
    var faviconUrl = new Uri(string.Format("{0}://{1}/favicon.ico", url.Scheme, url.Host));
    try
    {
        using (var httpWebResponse = WebRequest.Create(faviconUrl).GetResponse() as HttpWebResponse)
        {
            if (httpWebResponse != null && httpWebResponse.StatusCode == HttpStatusCode.OK)
            {
                // Log("Found a /favicon.ico file for {0}", url);
                return faviconUrl;
            }
        }
    }
    catch (WebException)
    {
        // no /favicon.ico at the site root (or the request failed); fall through to parsing the page
    }

    // otherwise parse the html and look for <link rel='icon' href='' /> using html agility pack
    var htmlDocument = new HtmlWeb().Load(url.ToString());
    var links = htmlDocument.DocumentNode.SelectNodes("//link");
    if (links != null)
    {
        foreach (var linkTag in links)
        {
            var rel = GetAttr(linkTag, "rel");
            if (rel == null)
                continue;

            if (rel.Value.IndexOf("icon", StringComparison.InvariantCultureIgnoreCase) >= 0)
            {
                var href = GetAttr(linkTag, "href");
                if (href == null)
                    continue;

                Uri absoluteUrl;
                if (Uri.TryCreate(href.Value, UriKind.Absolute, out absoluteUrl))
                {
                    // Log("Found an absolute favicon url {0}", absoluteUrl);
                    return absoluteUrl;
                }

                // resolve the relative href against the page URL (keeps the port, handles paths without a leading slash)
                var expandedUrl = new Uri(url, href.Value);
                //Log("Found a relative favicon url for {0} and expanded it to {1}", url, expandedUrl);
                return expandedUrl;
            }
        }
    }

    // Log("Could not find a favicon for {0}", url);
    return null;
}

public static HtmlAttribute GetAttr(HtmlNode linkTag, string attr)
{
    return linkTag.Attributes.FirstOrDefault(x => x.Name.Equals(attr, StringComparison.InvariantCultureIgnoreCase));
}
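
For comparison with the Python answer earlier in the thread, here is a rough Python sketch of the same two-step strategy: try /favicon.ico first, then parse the page for an icon link. It is not a port of the C# code. The names `find_favicon` and `_IconLinks`, and the `fetch` callable (assumed to return a status code and the body text), are all hypothetical. Passing `fetch` in keeps the sketch free of real network calls.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class _IconLinks(HTMLParser):
    """Collect href values of <link> tags whose rel tokens include 'icon'."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        rel_tokens = (attributes.get("rel") or "").lower().split()
        if tag == "link" and "icon" in rel_tokens and attributes.get("href"):
            self.hrefs.append(attributes["href"])

def find_favicon(site_url, fetch):
    """Return an absolute favicon URL or None.

    fetch(url) is a hypothetical callable returning (status_code, body_text).
    """
    # Step 1: try the conventional /favicon.ico at the site root.
    default_url = urljoin(site_url, "/favicon.ico")
    status, _ = fetch(default_url)
    if status == 200:
        return default_url

    # Step 2: load the page and look for <link rel="... icon ..." href="...">.
    status, body = fetch(site_url)
    if status != 200:
        return None
    parser = _IconLinks()
    parser.feed(body)
    if not parser.hrefs:
        return None
    # urljoin leaves absolute hrefs alone and resolves relative ones
    # against the page URL.
    return urljoin(site_url, parser.hrefs[0])

# Usage with a fake fetch so the example runs offline.
pages = {
    "http://example.com/shop/": (200, '<head><link rel="icon" href="img/fav.png"></head>'),
}

def fake_fetch(url):
    return pages.get(url, (404, ""))

result = find_favicon("http://example.com/shop/", fake_fetch)
print(result)  # http://example.com/shop/img/fav.png
```

In real use, `fetch` would wrap whatever HTTP client you prefer and translate its errors into a non-200 status.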