Make a web crawler/spider

Chris Diver

In VB.NET you will need to get the HTML first, so use the WebClient class or HttpWebRequest and HttpWebResponse classes. There is plenty of info on how to use these on the interweb.

Then you will need to parse the HTML. I recommend using regular expressions for this.
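
To make the regex approach concrete, here is a rough Java sketch (Java being the language used in the answer below) of a hypothetical LinkExtractor helper; the class name and the pattern are illustrative assumptions, and the pattern is a deliberate simplification that only matches absolute, double-quoted href attributes:

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractor {   // hypothetical name, for illustration only

    // Pulls href="..." values out of an HTML string. The pattern is a
    // simplification: it only matches absolute, double-quoted links.
    static List<String> extractLinks(String html) {
        Pattern linkPattern = Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);
        Matcher matcher = linkPattern.matcher(html);
        List<String> links = new ArrayList<>();
        while (matcher.find()) {
            links.add(matcher.group(1));   // group 1 is the URL inside the quotes
        }
        return links;
    }

    public static void main(String[] args) {
        String html = "<a href=\"http://www.w3.org/TR/\">Specs</a>";
        System.out.println(extractLinks(html));   // prints [http://www.w3.org/TR/]
    }
}

Relative URLs, single-quoted attributes and malformed markup will slip through a pattern like this, but for a simple crawler it is enough to get started.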

Your idea of using Google for a filetype search is a good one. I did a similar thing a few years ago to gather PDFs to test PDF indexing in SharePoint, which worked really well.

Here is a link to a tutorial on how to write a web crawler in Java: http://java.sun.com/developer/technicalArticles/ThirdParty/WebCrawler/ I'm sure if you Google it you can find ones for other languages.

Memin

The pseudocode should look something like this:

Method spider(URL startURL) {
    Collection URLStore;                 // can be an ArrayList used as a queue
    push(startURL, URLStore);            // start with a known URL
    while URLStore is not empty do
        currURL = pop(URLStore);         // take a URL from the store
        download the page at currURL;
        for every link URLx in the page that has not already been followed:
            push(URLx, URLStore);        // queue it for a later visit
    end while
}

To read some data from a web page in Java you can do:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

// inside a method that is declared to throw IOException
URL myURL = new URL("http://www.w3.org");
BufferedReader in = new BufferedReader(new InputStreamReader(myURL.openStream()));
String inputLine;
while ((inputLine = in.readLine()) != null) {  // you will get all content of the page
    System.out.println(inputLine);             // here you need to extract the hyperlinks
}
in.close();
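
Putting the pseudocode and the reading snippet together, a self-contained sketch of the whole loop could look like the following; the SimpleSpider class name, the 50-page limit, the seed URL and the simplified href regex are illustrative assumptions, not anything prescribed above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example class; the names and limits are made up for illustration.
public class SimpleSpider {

    private static final Pattern LINK = Pattern.compile("href=\"(http[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    public static void main(String[] args) throws Exception {
        Deque<String> urlStore = new ArrayDeque<>();   // the URLStore from the pseudocode
        Set<String> visited = new HashSet<>();         // so the same page is never followed twice
        urlStore.push("http://www.w3.org");            // start with a known URL

        while (!urlStore.isEmpty() && visited.size() < 50) {   // arbitrary page limit
            String current = urlStore.pop();
            if (!visited.add(current)) {
                continue;                              // already downloaded this page
            }

            // Download the page (same idea as the BufferedReader snippet above).
            StringBuilder page = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(new URL(current).openStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    page.append(line).append('\n');
                }
            } catch (Exception e) {
                continue;                              // skip pages that fail to download
            }

            // Extract the hyperlinks and queue the ones not seen yet.
            Matcher m = LINK.matcher(page);
            while (m.find()) {
                String link = m.group(1);
                if (!visited.contains(link)) {
                    urlStore.push(link);
                }
            }
            System.out.println("Crawled: " + current + " (queue size " + urlStore.size() + ")");
        }
    }
}

The HashSet of visited URLs is what keeps the crawler from looping forever on pages that link back to each other.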