Get and Download pictures with AngleSharp

Submitted by 被刻印的时光 ゝ on 2019-12-24 12:34:52

Question


I started using AngleSharp for a project, and I need to get and download not only the HTML but also the images of the document. I know that the Document object has a property called Images, but apparently it doesn't return all of them: I ran a test on a YouTube page and got only one image (repeated several times). For example, I'd like to get the thumbnail of the current video, which seems to be inside a <meta> tag. To be more precise, the images are stored in tags like this:

<meta content="https://i.ytimg.com/vi/hW-kDv1WcQM/hqdefault.jpg" property="og:image">

So I wonder if there is a way to select all the nodes/URLs of any image in a page, no matter which tag is used. I don't think QuerySelectorAll works in this case, as it selects only one type of node. You can try the sample code from GitHub to verify that (I just changed the URL to the YouTube one, and the selector too :D):

// Set up the configuration to support document loading
var config = Configuration.Default.WithDefaultLoader();
// The page to load (a YouTube video instead of the sample's Wikipedia page)
var address = "https://www.youtube.com/watch?v=hW-kDv1WcQM&feature=youtu.be";
// Asynchronously get the document in a new context using the configuration
var document = await BrowsingContext.New(config).OpenAsync(address);
// This CSS selector matches every <img> element
var cellSelector = "img";
// Perform the query to get all matching elements
var cells = document.QuerySelectorAll(cellSelector);
// Left over from the sample: project each element to its text content
// (always empty for <img>, which has no child nodes)
var titles = cells.Select(m => m.TextContent);

Oh, sure, you can also add this to check that the Images property doesn't return the video thumbnails:

var Images = document.Images.Select(sl=> sl.Source).Distinct().ToList();

Is there any other method to select nodes based on the URL content (for example, all URLs ending with ".jpg", ".png", etc.)?


Answer 1:


You can use the LINQ API to get all attributes that contain an image URL in a page, like so:

.....
var document = await BrowsingContext.New(config).OpenAsync(address);

// List all image file extensions here:
var fileExtensions = new string[] { ".jpg", ".png" };

// Find all attributes of any element
// whose value ends with one of the listed file extensions
var result = from element in document.All
             from attribute in element.Attributes
             where fileExtensions.Any(e => attribute.Value.EndsWith(e))
             select attribute;

foreach (var item in result)
{
    Console.WriteLine(item.Value);
}
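
The values collected this way are raw attribute strings, so some may be relative URLs, carry a query string after the extension, or use an upper-case extension. Below is a minimal sketch of one way to normalize them before downloading, using only the .NET base class library. The class name ImageUrlFilter, the method name Resolve, and the extension list are hypothetical and not part of AngleSharp or of the original answer; the base address is assumed to be the same page URL that was passed to OpenAsync.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of AngleSharp): turns raw attribute values
// into absolute http/https image URLs.
public static class ImageUrlFilter
{
    // Extensions treated as images; an assumption, extend as needed.
    private static readonly string[] ImageExtensions = { ".jpg", ".jpeg", ".png", ".gif" };

    public static List<Uri> Resolve(IEnumerable<string> attributeValues, Uri baseUri)
    {
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var found = new List<Uri>();

        foreach (var value in attributeValues)
        {
            if (string.IsNullOrWhiteSpace(value)) continue;

            // Resolve relative values against the page address.
            if (!Uri.TryCreate(baseUri, value.Trim(), out Uri absolute)) continue;

            // Skip data:, javascript: and other non-web schemes.
            if (absolute.Scheme != Uri.UriSchemeHttp && absolute.Scheme != Uri.UriSchemeHttps) continue;

            // Compare the path only, so "logo.PNG?v=2" still matches ".png".
            var path = absolute.AbsolutePath;
            var isImage = ImageExtensions.Any(
                e => path.EndsWith(e, StringComparison.OrdinalIgnoreCase));

            // Keep the first occurrence of each absolute URL.
            if (isImage && seen.Add(absolute.AbsoluteUri)) found.Add(absolute);
        }

        return found;
    }
}

public static class Program
{
    public static void Main()
    {
        var baseUri = new Uri("https://www.youtube.com/watch?v=hW-kDv1WcQM");
        var values = new[]
        {
            "https://i.ytimg.com/vi/hW-kDv1WcQM/hqdefault.jpg",
            "/img/logo.PNG?v=2",     // relative, upper-case extension, query string
            "width=device-width",    // not an image
        };

        foreach (var url in ImageUrlFilter.Resolve(values, baseUri))
        {
            Console.WriteLine(url.AbsoluteUri);
        }
    }
}
```

With the query above, the input could be result.Select(a => a.Value) and the base could be new Uri(address). Each returned Uri can then be handed to whatever download mechanism the project already uses.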


Source: https://stackoverflow.com/questions/36023919/get-and-download-pictures-with-anglesharp
