Getting text between all tags in a given html and recursively going through links

本小妞迷上赌 提交于 2019-11-29 17:33:38

Regex is not a good choice for parsing HTML files..

HTML is not strict nor is it regular with its format..

Use htmlagilitypack


This extracts all the links from the web page

public List<string> getAllLinks(string webAddress)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlDocument newdoc=web.Load(webAddress);

    return doc.DocumentNode.SelectNodes("//a[@href]")
              .Where(y=>y.Attributes["href"].Value.StartsWith("http"))
              .Select(x=>x.Attributes["href"].Value)
              .ToList<string>();
}

this gets all the content excluding tags in the html

public string getContent(string webAddress)
{
    HtmlAgilityPack.HtmlWeb web = new HtmlAgilityPack.HtmlWeb();
    HtmlDocument doc=web.Load(webAddress);

    return string.Join(" ",doc.DocumentNode.Descendants().Select(x=>x.InnerText));
}

this crawls through all the links

public void crawl(string seedSite)
{
        getContent(seedSite);//gets all the content
        getAllLinks(seedSite);//get's all the links
}
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!