Regex to extract Favicon url from a webpage

Please help me find the favicon URL in the sample HTML below using a regular expression. It should also check for the file extension ".ico". I am developing a personal bookmarking site and I want to save the favicons of the links I bookmark. I have already written the C# code to convert the icon to a GIF and save it, but I have very limited knowledge of regex, so I am unable to select this tag because the closing syntax differs between sites. Examples of endings are "/>" and "</link>".

My programming language is C#

<meta name="description" content="Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service." />
<meta name="robots" content="index, follow" />
<meta name="verify-v1" content="x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=" />
<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />
<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/1261391049/style.min.css" />
<!--[if lt IE 8]>
    <script src="http://3dbin.com/js/1261039165/IE8.js" type="text/javascript"></script>
<![endif]-->

Solution: one more way to do this is to download the Html Agility Pack and add a reference to its DLL. Thanks for helping me. I really love this site :)

HtmlDocument doc = new HtmlDocument();
doc.LoadHtml(readcontent);

// SelectNodes returns null (not an empty collection) when nothing matches,
// so check the result before iterating.
HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//link[@href]");
if (links != null)
{
    foreach (HtmlNode link in links)
    {
        HtmlAttribute att = link.Attributes["href"];
        if (att.Value.EndsWith(".ico"))
        {
            faviconurl = att.Value;
        }
    }
}

This should match the whole link tag that contains href="http://3dbin.com/favicon.ico":

 <link .*? href="http://3dbin\.com/favicon\.ico" [^>]* />

Correction based on your comment:

I see you have a C# solution. Excellent! But in case you are still wondering whether it can be done with regular expressions, the following expression does what you want. Group 1 of the match will contain only the URL.

 <link .*? href="(.*?\.ico)"

A simple C# snippet that makes use of it:

// This is the snippet from your example with an extra link item of the form <link ... href="...ico" > ... </link>,
// just to make sure it is picked up properly.
String htmlText = "<meta name=\"description\" content=\"Create 360 degree rotation product presentation online with 3Dbin. 360 product pics, object rotationg presentation can be created for your website at 3DBin.com web service.\" /><meta name=\"robots\" content=\"index, follow\" /><meta name=\"verify-v1\" content=\"x42ckCSDiernwyVbSdBDlxN0x9AgHmZz312zpWWtMf4=\" /><link rel=\"shortcut icon\" href=\"http://3dbin.com/favicon.ico\" type=\"image/x-icon\" /><link rel=\"shortcut icon\" href=\"http://anotherURL/someicofile.ico\" type=\"image/x-icon\">just to make sure it works with different link ending</link><link rel=\"stylesheet\" type=\"text/css\" href=\"http://3dbin.com/css/1261391049/style.min.css\" /><!--[if lt IE 8]>    <script src=\"http://3dbin.com/js/1261039165/IE8.js\" type=\"text/javascript\"></script><![endif]-->";

// The dot before "ico" is escaped so that it matches only a literal '.'.
foreach (Match match in Regex.Matches(htmlText, "<link .*? href=\"(.*?\\.ico)\""))
{
    String url = match.Groups[1].Value;

    Console.WriteLine(url);
}

which prints the following to the console:

http://3dbin.com/favicon.ico
http://anotherURL/someicofile.ico

<link\s+[^>]*(?:href\s*=\s*"([^"]+)"\s+)?rel\s*=\s*"shortcut icon"(?:\s+href\s*=\s*"([^"]+)")?

Maybe. It is not robust, but it could work. (I used Perl regex syntax.)
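The attribute-order problem that this pattern works around (href may come before or after rel) can also be handled in two steps: first match each link tag as a whole, then pull its attributes out of that one tag. Below is a minimal sketch of the idea in Python, whose regex syntax is close to Perl's and .NET's. The function name `find_icon_hrefs` and the second link in the sample are made up for illustration, and the sketch assumes attribute values are quoted. It is still a regex over HTML, so it shares the fragility mentioned above.

```python
import re

# Step 1: grab each <link ...> opening tag, whatever its closing style.
LINK_TAG = re.compile(r"<link\b[^>]*>", re.IGNORECASE)
# Step 2: pull name="value" (or name='value') pairs out of a single tag.
ATTR = re.compile(r"""([\w-]+)\s*=\s*(["'])(.*?)\2""", re.DOTALL)


def find_icon_hrefs(html):
    """Return href values of <link> tags whose rel contains the token 'icon'."""
    hrefs = []
    for tag in LINK_TAG.findall(html):
        attrs = {name.lower(): value for name, _quote, value in ATTR.findall(tag)}
        rel_tokens = attrs.get("rel", "").lower().split()
        if "icon" in rel_tokens and "href" in attrs:
            hrefs.append(attrs["href"])
    return hrefs


sample = (
    '<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />'
    '<link href="/alt.ico" rel="icon">'  # hypothetical tag with href before rel
    '<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/1261391049/style.min.css" />'
)
print(find_icon_hrefs(sample))
# ['http://3dbin.com/favicon.ico', '/alt.ico']
```

Because the attributes are read into a dictionary per tag, their order inside the tag no longer matters, and the stylesheet link is skipped because its rel has no "icon" token.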

Johnsyweb

This is not a job for a regular expression, as you'll see if you spend 2 minutes on StackOverflow looking for how to parse HTML.

Use an HTML parser instead!

Here's a trivial example in Python (I'm sure this is equally do-able in C#):

% python
Python 2.7.1 (r271:86832, May 16 2011, 19:49:41) 
[GCC 4.2.1 (Apple Inc. build 5646) (dot 1)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from BeautifulSoup import BeautifulSoup
>>> import urllib2
>>> page = urllib2.urlopen('https://stackoverflow.com/')
>>> soup = BeautifulSoup(page)
>>> link = soup.html.head.find(lambda x: x.name == 'link' and x['rel'] == 'shortcut icon')
>>> link['href']
u'http://cdn.sstatic.net/stackoverflow/img/favicon.ico'
>>> link['href'].endswith('.ico')
True
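The session above uses Python 2 and the third-party BeautifulSoup package. For comparison, here is a rough Python 3 sketch of the same idea that uses only the standard library's `html.parser` module. The class name `IconLinkParser` and the function name `icon_hrefs_from_html` are invented for this example, and the input is a trimmed copy of the sample HTML from the question rather than a live page.

```python
from html.parser import HTMLParser


class IconLinkParser(HTMLParser):
    """Collects href values from <link> tags whose rel includes 'icon'."""

    def __init__(self):
        super().__init__()
        self.icon_hrefs = []

    def handle_starttag(self, tag, attrs):
        # Also called for self-closing tags such as <link ... />.
        if tag != "link":
            return
        attr_map = dict(attrs)  # attribute names arrive lower-cased
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        href = attr_map.get("href")
        if "icon" in rel_tokens and href:
            self.icon_hrefs.append(href)


def icon_hrefs_from_html(html):
    parser = IconLinkParser()
    parser.feed(html)
    parser.close()
    return parser.icon_hrefs


page = """
<meta name="robots" content="index, follow" />
<link rel="shortcut icon" href="http://3dbin.com/favicon.ico" type="image/x-icon" />
<link rel="stylesheet" type="text/css" href="http://3dbin.com/css/1261391049/style.min.css" />
<!--[if lt IE 8]>
    <script src="http://3dbin.com/js/1261039165/IE8.js" type="text/javascript"></script>
<![endif]-->
"""
print(icon_hrefs_from_html(page))
# ['http://3dbin.com/favicon.ico']
```

Checking the rel tokens rather than the ".ico" extension is a design choice of this sketch. It also picks up icons served as PNG or other formats, which may or may not be what the bookmarking site wants.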

I had a go at this a wee while back, so here is something pretty simple. First it attempts to find the /favicon.ico file. If that fails, I load the page using Html Agility Pack and then use XPath to find any link tags. I loop through the link tags to see if they have a rel='icon' attribute. If they do, I grab the href attribute, if it exists, and expand it into an absolute URL for that site.

Please feel free to play around with this and offer any improvements.

private static Uri GetFaviconUrl(string siteUrl)
{
    // try looking for a /favicon.ico first
    var url = new Uri(siteUrl);
    var faviconUrl = new Uri(string.Format("{0}://{1}/favicon.ico", url.Scheme, url.Host));
    try
    {
        using (var httpWebResponse = WebRequest.Create(faviconUrl).GetResponse() as HttpWebResponse)
        {
            if (httpWebResponse != null && httpWebResponse.StatusCode == HttpStatusCode.OK)
            {
                // Log("Found a /favicon.ico file for {0}", url);
                return faviconUrl;
            }
        }
    }
    catch (WebException)
    {
        // No /favicon.ico at the site root (or the request failed); fall through to parsing the page.
    }

    // otherwise parse the html and look for <link rel='icon' href='' /> using html agility pack
    var htmlDocument = new HtmlWeb().Load(url.ToString());
    var links = htmlDocument.DocumentNode.SelectNodes("//link");
    if (links != null)
    {
        foreach (var linkTag in links)
        {
            var rel = GetAttr(linkTag, "rel");
            if (rel == null)
                continue;

            // Use >= 0 so that rel="icon" (a match at index 0) is accepted too.
            if (rel.Value.IndexOf("icon", StringComparison.InvariantCultureIgnoreCase) >= 0)
            {
                var href = GetAttr(linkTag, "href");
                if (href == null)
                    continue;

                Uri absoluteUrl;
                if (Uri.TryCreate(href.Value, UriKind.Absolute, out absoluteUrl))
                {
                    // Log("Found an absolute favicon url {0}", absoluteUrl);
                    return absoluteUrl;
                }

                // Resolve against the page URL so hrefs with or without a leading slash both work.
                var expandedUrl = new Uri(url, href.Value);
                //Log("Found a relative favicon url for {0} and expanded it to {1}", url, expandedUrl);
                return expandedUrl;
            }
        }
    }

    // Log("Could not find a favicon for {0}", url);
    return null;
}

public static HtmlAttribute GetAttr(HtmlNode linkTag, string attr)
{
    return linkTag.Attributes.FirstOrDefault(x => x.Name.Equals(attr, StringComparison.InvariantCultureIgnoreCase));
}
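Expanding a relative href into an absolute URL is the step most likely to go wrong, in particular for hrefs without a leading slash and for protocol-relative ones. As a small illustration of that step only, here is a Python sketch using `urllib.parse.urljoin`. The function names and the example.com URLs are made up for the example. Unlike the C# method above, it makes no HTTP request: it only computes the conventional /favicon.ico location as a fallback value.

```python
from urllib.parse import urljoin, urlsplit


def default_favicon_url(page_url):
    """Conventional location: /favicon.ico at the root of the site."""
    parts = urlsplit(page_url)
    return f"{parts.scheme}://{parts.netloc}/favicon.ico"


def resolve_icon_url(page_url, href):
    """Turn an href from a <link rel="icon"> tag into an absolute URL.

    Falls back to the conventional /favicon.ico when href is empty or None.
    """
    if not href:
        return default_favicon_url(page_url)
    return urljoin(page_url, href)


page_url = "http://example.com/blog/post.html"  # hypothetical page
print(resolve_icon_url(page_url, "/img/fav.ico"))   # http://example.com/img/fav.ico
print(resolve_icon_url(page_url, "fav.ico"))        # http://example.com/blog/fav.ico
print(resolve_icon_url(page_url, "//cdn.example.net/f.ico"))  # http://cdn.example.net/f.ico
print(resolve_icon_url(page_url, None))             # http://example.com/favicon.ico
```

Resolving against the full page URL rather than only the scheme and host keeps the port number and handles document-relative paths, at the cost of trusting whatever href the page supplies.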

Source: https://stackoverflow.com/questions/6556141/regex-to-extract-favicon-url-from-a-webpage

Tags

html

regex

favicon