How to write a crawler?

感情败类 2020-12-02 03:47

I have had thoughts of trying to write a simple crawler that might crawl and produce a list of its findings for our NPO's websites and content.

Does anybody have any thoughts on how to do this?

10 answers
  •  一生所求
    2020-12-02 04:31

    Multithreaded Web Crawler

    If you want to crawl a large website, you should write a multi-threaded crawler. Crawling has three steps — connecting, fetching, and writing the crawled information to files or a database — and with a single thread your CPU and network utilization will be poor, because the crawler spends most of its time waiting on I/O.

    A multi-threaded web crawler needs two data structures: linksVisited (which should be implemented as a hash set or trie) and linksToBeVisited (a queue).
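As a minimal sketch (the class name `Frontier` and the concrete types are my own choices, not from the answer), the two structures can be declared in Java like this — a hash set gives O(1) duplicate checks, and a FIFO queue yields breadth-first order:

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Illustrative frontier for a crawler; names are hypothetical.
class Frontier {
    // URLs already seen; a HashSet gives O(1) membership checks
    // (a trie also works and can save memory on long shared prefixes).
    final Set<String> linksVisited = new HashSet<>();

    // URLs waiting to be crawled; FIFO order gives BFS traversal.
    final Queue<String> linksToBeVisited = new ArrayDeque<>();
}
```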

    A web crawler uses BFS (breadth-first search) to traverse the web.

    Algorithm of a basic web crawler:

    1. Add one or more seed URLs to linksToBeVisited. The method that adds a URL to linksToBeVisited must be synchronized.
    2. Pop an element from linksToBeVisited and add it to linksVisited. This pop method must also be synchronized.
    3. Fetch the page from the internet.
    4. Parse the file and add any not-yet-visited link found on the page to linksToBeVisited. URLs can be filtered if needed; the user can supply a set of rules specifying which URLs to scan.
    5. Save the necessary information found on the page to a database or file.
    6. Repeat steps 2 to 5 until linksToBeVisited is empty.
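Step 4 can be sketched as below. This is only an illustration (the class `LinkExtractor` and its regex are my own, and a real crawler should use a proper HTML parser such as jsoup rather than a regex); the filter predicate stands in for the user-supplied URL rules:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of step 4: pull href links out of a fetched page and apply a
// user-supplied filter rule. Regex-based extraction is fragile and is
// used here only to keep the example self-contained.
class LinkExtractor {
    private static final Pattern HREF =
            Pattern.compile("href=\"(http[^\"]+)\"");

    static List<String> extract(String html, Predicate<String> filter) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String url = m.group(1);
            if (filter.test(url)) {   // apply the user's URL rules
                links.add(url);
            }
        }
        return links;
    }
}
```

Each extracted link that passes the filter and is not already in linksVisited would then be enqueued on linksToBeVisited.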

      Here is a code snippet showing how to synchronize the threads:

       public void add(String site) {
         synchronized (this) {
           if (!linksVisited.contains(site)) {
             linksToBeVisited.add(site);
           }
         }
       }

       public String next() {
         // Unsynchronized fast path: if the queue looks empty, bail out early.
         if (linksToBeVisited.size() == 0) {
           return null;
         }
         synchronized (this) {
           // Check again: another thread may have emptied the queue meanwhile.
           if (linksToBeVisited.size() > 0) {
             String s = linksToBeVisited.get(0);
             linksToBeVisited.remove(0);
             linksVisited.add(s);
             return s;
           }
           return null;
         }
       }

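Putting the pieces together, here is a hedged end-to-end sketch (the class `CrawlerSim` and its method names are my own): several worker threads share one frontier and "fetch" pages from an in-memory map instead of the real network, so the synchronization above can be exercised without HTTP access. Only the `add()`/`next()` locking mirrors the snippet; everything else is an assumption for the demo:

```java
import java.util.HashSet;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Simulated multi-threaded crawl over an in-memory "web" (url -> links).
class CrawlerSim {
    private final Set<String> linksVisited = new HashSet<>();
    private final LinkedList<String> linksToBeVisited = new LinkedList<>();
    private final Map<String, List<String>> web;

    CrawlerSim(Map<String, List<String>> web) { this.web = web; }

    synchronized void add(String site) {
        if (!linksVisited.contains(site) && !linksToBeVisited.contains(site)) {
            linksToBeVisited.add(site);
        }
    }

    synchronized String next() {
        String s = linksToBeVisited.poll();   // null when the queue is empty
        if (s != null) {
            linksVisited.add(s);
        }
        return s;
    }

    Set<String> crawl(String seed, int threads) {
        add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                String url;
                while ((url = next()) != null) {
                    // "Fetch" and "parse": look up the page's outgoing links.
                    for (String link : web.getOrDefault(url, List.of())) {
                        add(link);
                    }
                }
            });
        }
        pool.shutdown();
        try {
            pool.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return linksVisited;
    }
}
```

A worker exits only when its own `next()` returns null, i.e. after it has already enqueued the links of its previous page, so any thread still mid-loop will drain whatever remains and every reachable page is visited exactly once.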
