Simple web crawler in C#

囚心锁ツ 2020-12-04 18:55

I have created a simple web crawler, but I want to add recursion so that for every page that is opened I can also get the URLs on that page. I have no idea how I can do that.

4 Answers
  • 2020-12-04 19:04

    I have created something similar using Reactive Extensions.

    https://github.com/Misterhex/WebCrawler

    I hope it can help you.

    Crawler crawler = new Crawler();
    
    // Crawl returns an observable sequence of crawl results;
    // var avoids spelling out the generic type argument.
    var observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));
    
    observable.Subscribe(onNext: Console.WriteLine,
        onCompleted: () => Console.WriteLine("Crawling completed"));
    
  • 2020-12-04 19:13

    From a design standpoint, I've written a few web crawlers. Basically you want to implement a depth-first search using an explicit Stack data structure rather than recursion, so deep sites can't overflow the call stack. You can also use a breadth-first search with a queue, but keep an eye on memory: the frontier queue can grow very large. Good luck.
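
    The stack-based traversal can be sketched like this. This is a minimal, self-contained sketch: the Pages dictionary stands in for fetching a page and extracting its links, which a real crawler would do over HTTP.

    ```csharp
    using System;
    using System.Collections.Generic;

    public class DfsCrawler
    {
        // Stand-in "web": page URL -> links found on that page.
        // In a real crawler this lookup would be an HTTP fetch
        // followed by link extraction.
        static readonly Dictionary<string, string[]> Pages =
            new Dictionary<string, string[]>
        {
            ["http://a"] = new[] { "http://b", "http://c" },
            ["http://b"] = new[] { "http://c", "http://a" },
            ["http://c"] = new string[0],
        };

        public static IEnumerable<string> Crawl(string start)
        {
            var visited = new HashSet<string>();
            var pending = new Stack<string>();  // explicit stack instead of recursion
            pending.Push(start);

            while (pending.Count > 0)
            {
                string url = pending.Pop();
                if (!visited.Add(url))          // Add returns false if already seen
                    continue;

                yield return url;

                if (Pages.TryGetValue(url, out var links))
                    foreach (var link in links)
                        if (!visited.Contains(link))
                            pending.Push(link);
            }
        }

        static void Main()
        {
            foreach (var url in Crawl("http://a"))
                Console.WriteLine(url);
        }
    }
    ```

    Because visited URLs are recorded in a HashSet, the cycle a → b → a terminates cleanly.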

  • 2020-12-04 19:20

    I fixed your GetContent method as follows to get new links from a crawled page (it needs using System.Collections.Generic; and using System.Text.RegularExpressions;):

    public ISet<string> GetNewLinks(string content)
    {
        // Captures the href value of <a href='...'> or <a href="...">.
        Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");
    
        ISet<string> newLinks = new HashSet<string>();
        foreach (Match match in regexLink.Matches(content))
        {
            // HashSet.Add already ignores duplicates, so no Contains check is needed.
            newLinks.Add(match.Value);
        }
    
        return newLinks;
    }
    

    Update

    Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my typo).
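
    To add the recursion the question asks for, the same extraction regex can drive a depth-limited recursive crawl. A sketch, with the fetch step injected as a delegate so the traversal logic can be tried without a network; in the real program you could pass something like url => new System.Net.WebClient().DownloadString(url).

    ```csharp
    using System;
    using System.Collections.Generic;
    using System.Text.RegularExpressions;

    public static class RecursiveCrawler
    {
        // Same href-extraction regex as in the method above.
        static readonly Regex LinkRegex =
            new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

        public static void Crawl(string url, int depth, ISet<string> visited,
                                 Func<string, string> fetch)
        {
            if (depth <= 0 || !visited.Add(url))
                return;                       // stop at the depth limit or on repeats

            string content;
            try { content = fetch(url); }
            catch { return; }                 // skip pages that fail to load

            foreach (Match match in LinkRegex.Matches(content))
                Crawl(match.Value, depth - 1, visited, fetch);
        }
    }
    ```

    The visited set both prevents infinite loops on cyclic links and doubles as the result: after the call it contains every URL reached within the depth limit.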

  • 2020-12-04 19:25

    First, a recommendation: I believe you should use a DataGridView instead of a TextBox, as it makes the found links (URLs) much easier to read in the GUI.

    You could change:

    textBox3.Text = Links;
    

    to

     dataGridView.DataSource = Links;  
    

    Now for the question itself: you haven't included which using System.* directives your code relies on. It would be appreciated if you could add them, as I can't figure them out from the snippet.
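
    One caveat on the binding (a sketch, assuming Links is a collection of URL strings): DataGridView generates columns from the public properties of the bound items, and string's only public property is Length, so binding a List<string> directly shows a Length column rather than the URLs. Wrapping each URL in a small class avoids that; LinkRow here is a hypothetical wrapper, not from the original code.

    ```csharp
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical wrapper type; DataGridView will generate a "Url"
    // column from this public property.
    public class LinkRow
    {
        public string Url { get; set; }
    }

    public static class LinkBinding
    {
        // Wrap each URL so the grid shows the URL text itself.
        public static List<LinkRow> ToRows(IEnumerable<string> links) =>
            links.Select(l => new LinkRow { Url = l }).ToList();

        // In the form you would then write:
        // dataGridView.DataSource = LinkBinding.ToRows(Links);
    }
    ```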
