Regular Expression to get the SRC of images in C#

情深已故 2020-11-29 09:34

I'm looking for a regular expression to isolate the src value of an img tag. (I know that this is not the best way to do this, but it is what I have to do in this case.)

8 Answers
  • 2020-11-29 09:37

    This is what I use to get the tags out of strings:

    </? *img[^>]*>
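
    Note that this pattern returns whole tags, so the src value still has to be pulled out in a second step. A minimal sketch of how it might be applied in C# follows; the class name `ImgTagFinder`, the method name, and the use of `RegexOptions.IgnoreCase` are assumptions added for illustration, not part of the answer.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class ImgTagFinder
    {
        // Collects every complete <img ...> tag found in the input string.
        public static List<string> GetImgTags(string html)
        {
            var tags = new List<string>();
            // IgnoreCase is an assumption so that <IMG ...> is found as well.
            foreach (Match m in Regex.Matches(html, @"</? *img[^>]*>", RegexOptions.IgnoreCase))
            {
                tags.Add(m.Value);
            }
            return tags;
        }

        public static void Main()
        {
            // Hypothetical sample input.
            string html = "<p>x</p><img src=\"a.png\" alt=\"A\"><IMG src='b.jpg'/>";
            foreach (string tag in GetImgTags(html))
            {
                Console.WriteLine(tag);
            }
        }
    }
    ```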
    
  • 2020-11-29 09:38
    string matchString = Regex.Match(original_text, "<img.+?src=[\"'](.+?)[\"'].*?>", RegexOptions.IgnoreCase).Groups[1].Value;
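
    Regex.Match returns only the first image. If the text may contain several, something along these lines should collect every src value with Regex.Matches. This is a sketch under assumptions: the class name `ImgSrcExtractor`, the method name, and the sample HTML are made up for illustration.

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class ImgSrcExtractor
    {
        // Same pattern as in the one-liner above; group 1 holds the quoted src value.
        private static readonly Regex ImgSrc = new Regex(
            "<img.+?src=[\"'](.+?)[\"'].*?>",
            RegexOptions.IgnoreCase);

        // Returns the src value of every img tag the pattern finds, in document order.
        public static List<string> GetAllSrc(string html)
        {
            var result = new List<string>();
            foreach (Match m in ImgSrc.Matches(html))
            {
                result.Add(m.Groups[1].Value);
            }
            return result;
        }

        public static void Main()
        {
            // Hypothetical sample input.
            string html = "<p><img class=\"a\" src=\"one.png\"><IMG src='two.jpg' alt=\"\"></p>";
            foreach (string src in GetAllSrc(html))
            {
                Console.WriteLine(src);
            }
        }
    }
    ```

    Because `.` does not match a newline by default, a tag that is split across several lines would be missed unless RegexOptions.Singleline is added.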
    
  • 2020-11-29 09:42

    This should capture every img tag and just the src value, no matter where the attribute is located (before or after class, etc.), as long as the value is double-quoted. It supports both HTML and XHTML :D

    <img.+?src="(.+?)".*?/?>
    
  • 2020-11-29 09:43

    Here is the one I use:

    <img.*?src\s*?=\s*?(?:(['"])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>
    

    The good part is that it matches any of the below:

    <img src='test.jpg'>
    <img src=test.jpg>
    <img src="test.jpg">
    

    It also handles less tidy input, such as whitespace around the equals sign and extra attributes, e.g.:

    <img src = "test.jpg" width="300">
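
    To read the value in C#, the named group can be accessed through Groups["src"]. The sketch below is only an illustration: the class name `NamedGroupDemo` and the method `GetSrc` are hypothetical, and RegexOptions.IgnoreCase is an added assumption. It relies on .NET allowing the same group name in both branches of the alternation and on the unnamed quote group being numbered 1.

    ```csharp
    using System;
    using System.Text.RegularExpressions;

    public static class NamedGroupDemo
    {
        // Verbatim string: "" stands for one double quote inside @"...".
        private static readonly Regex ImgSrc = new Regex(
            @"<img.*?src\s*?=\s*?(?:(['""])(?<src>(?:(?!\1).)*)\1|(?<src>[^\s>]+))[^>]*?>",
            RegexOptions.IgnoreCase);

        // Returns the src value, or null when no img tag with a src is found.
        public static string GetSrc(string html)
        {
            Match m = ImgSrc.Match(html);
            return m.Success ? m.Groups["src"].Value : null;
        }

        public static void Main()
        {
            string[] samples =
            {
                "<img src='test.jpg'>",
                "<img src=test.jpg>",
                "<img src=\"test.jpg\">",
                "<img src = \"test.jpg\" width=\"300\">"
            };
            foreach (string sample in samples)
            {
                Console.WriteLine(GetSrc(sample));
            }
        }
    }
    ```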
    
  • 2020-11-29 09:48

    I tried what Francisco Noriega suggested, but it looks like the HtmlAgilityPack API has changed. Here is how I solved it:

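            // Requires: using System.Collections.Generic; using System.Net; using HtmlAgilityPack;
            List<string> images = new List<string>();
            WebClient client = new WebClient();
            string site = "http://www.mysite.com";
            var htmlText = client.DownloadString(site);
    
            var htmlDoc = new HtmlDocument()
                        {
                            OptionFixNestedTags = true,
                            OptionAutoCloseOnEnd = true
                        };
    
            htmlDoc.LoadHtml(htmlText);
    
            // SelectNodes returns null when nothing matches, so guard before iterating.
            // The [@src] predicate skips img elements that have no src attribute.
            var imgNodes = htmlDoc.DocumentNode.SelectNodes("//img[@src]");
            if (imgNodes != null)
            {
                foreach (HtmlNode img in imgNodes)
                {
                    HtmlAttribute att = img.Attributes["src"];
                    images.Add(att.Value);
                }
            }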
    
  • 2020-11-29 09:50

    I know you say you have to use regex, but if possible I would really give this open source project a chance: HtmlAgilityPack

    It is really easy to use. I just discovered it and it helped me out a lot, since I was doing some heavier HTML parsing. It basically lets you use XPath expressions to select your elements.

    Their example page is a little outdated, but the API is really easy to understand, and if you are a little bit familiar with XPath you will get your head around it in no time.

    The code for your query would look something like this: (uncompiled code)

     List<string> imgSrcs = new List<string>();
     HtmlDocument doc = new HtmlDocument();
     doc.LoadHtml(htmlText); // or doc.Load(htmlFileStream)
     var nodes = doc.DocumentNode.SelectNodes(@"//img[@src]");
     foreach (var img in nodes)
     {
        HtmlAttribute att = img["src"];
        imgSrcs.Add(att.Value);
     }
    