Simple web crawler in C#

I fixed your GetContent method as follows to get new links from a crawled page:

// Requires: using System.Collections.Generic; and using System.Text.RegularExpressions;
public ISet<string> GetNewLinks(string content)
{
    // Capture the href value of every <a> tag, whether it is quoted with ' or "
    Regex regexLink = new Regex("(?<=<a\\s*?href=(?:'|\"))[^'\"]*?(?=(?:'|\"))");

    ISet<string> newLinks = new HashSet<string>();
    foreach (var match in regexLink.Matches(content))
    {
        // HashSet.Add ignores duplicates, so each link is stored only once
        newLinks.Add(match.ToString());
    }

    return newLinks;
}

Updated

Fixed: regex should be regexLink. Thanks @shashlearner for pointing this out (my typo).
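
For context, here is a minimal usage sketch: it downloads a single page with WebClient and prints the links the method finds. The URL and the WebClient call are assumptions for illustration, not part of the original answer.

// Illustrative only: download one page and print the links GetNewLinks extracts.
using (var client = new System.Net.WebClient())
{
    string content = client.DownloadString("http://www.codinghorror.com/");
    foreach (string link in GetNewLinks(content))
        Console.WriteLine(link);
}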

I have created something similar using Reactive Extensions.

https://github.com/Misterhex/WebCrawler

I hope it can help you.

Crawler crawler = new Crawler();

var observable = crawler.Crawl(new Uri("http://www.codinghorror.com/"));

observable.Subscribe(onNext: Console.WriteLine,
    onCompleted: () => Console.WriteLine("Crawling completed"));

The following includes an answer and a recommendation.

I believe you should use a DataGridView instead of a TextBox, because when you look at the results in the GUI it is much easier to see the links (URLs) that were found.

You could change:

textBox3.Text = Links;

to

 dataGridView.DataSource = Links;  
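
As a rough sketch of that binding (the Links collection and the LinkRow wrapper are assumptions here): a DataGridView bound to a plain List<string> only auto-generates a Length column, so wrapping each URL in an object with a Url property makes the links show up as a readable column.

// Illustrative only: requires using System.Linq; LinkRow is a hypothetical wrapper type.
public class LinkRow
{
    public string Url { get; set; }
}

// Assuming Links is an IEnumerable<string> of crawled URLs:
dataGridView.DataSource = Links.Select(l => new LinkRow { Url = l }).ToList();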

Now for the question: you haven't included which

using System.*

directives were used. It would be appreciated if you could list them, as I can't figure them out.

From a design standpoint, I've written a few web crawlers. Basically, you want to implement a depth-first search using a stack data structure. You can also use a breadth-first search, but then the frontier of unvisited links can grow very large and you will likely run into memory issues. Good luck.
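
A minimal sketch of that depth-first approach, assuming the GetNewLinks method from the first answer and a WebClient for the downloads (both assumptions; politeness delays, robots.txt, and relative-URL resolution are omitted):

// Requires: using System; using System.Collections.Generic;
// Depth-first crawl: the most recently discovered link is visited next (Stack = LIFO).
var stack = new Stack<string>();
var visited = new HashSet<string>();
stack.Push("http://www.codinghorror.com/");

using (var client = new System.Net.WebClient())
{
    while (stack.Count > 0)
    {
        string url = stack.Pop();
        if (!visited.Add(url))
            continue;                                   // already crawled

        string content;
        try { content = client.DownloadString(url); }
        catch (System.Net.WebException) { continue; }   // skip pages that fail to load

        foreach (string link in GetNewLinks(content))
        {
            if (!visited.Contains(link))
                stack.Push(link);                       // push instead of enqueue = depth-first
        }
    }
}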
