Implementing Threads Into Java Web Crawler


Question


Here is the original web crawler which I wrote (just for reference):

https://github.com/domshahbazi/java-webcrawler/tree/master

This is a simple web crawler that visits a given initial web page, scrapes all of the links from the page, and adds them to a Queue (LinkedList), from which they are then popped off one by one and visited, at which point the cycle starts again. To speed the program up, and for learning, I tried to implement it using threads so I could have many threads operating at once, indexing more pages in less time. Below is each class:

Main class

public class controller {

    public static void main(String args[]) throws InterruptedException {

        DataStruc data = new DataStruc("http://www.imdb.com/title/tt1045772/?ref_=nm_flmg_act_12");

        Thread crawl1 = new Crawler(data);
        Thread crawl2 = new Crawler(data);

        crawl1.start();
        crawl2.start();
   }    
}

Crawler Class (Thread)

public class Crawler extends Thread {

    /** Instance of Data Structure **/
    DataStruc data;

    /** Number of page connections allowed before this thread stops crawling **/
    private final int INDEX_LIMIT = 10;

    /** Constructor: takes a reference to the shared data structure **/
    public Crawler(DataStruc d) {
        data = d;
    }

    public void run() {

        // Counter to keep track of the number of indexed URLs
        int counter = 0;

        // While there are URLs left to visit
        while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {

            // Pop the next URL to visit from the queue
            String currentUrl = data.getURL();

            try {
                // Fetch and parse HTML document
                Document doc = Jsoup.connect(currentUrl)                 
                        .userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/37.0.2062.124 Safari/537.36")
                        .referrer("http://www.google.com")
                        .timeout(12000)
                        .followRedirects(true)
                        .get();

                // Increment counter if connection to web page succeeds
                counter++;

                /** .select returns a list of elements (links in this case) **/
                Elements links = doc.select("a[href]"); // Relative URL

                // Add newly found links to the queue
                addLinksToQueue(links);                             

            } catch (IOException e) {
                //e.printStackTrace();
                System.out.println("Error: "+currentUrl);
            }               
        }       
    }

    public void addLinksToQueue(Elements el) {
        // For each element in links
        for(Element e : el) {           

            String theLink = e.attr("abs:href"); // 'abs' prefix ensures an absolute URL is returned rather than a relative URL ('www.reddit.com/hello' rather than '/hello')

            if(theLink.startsWith("http") && !data.oldLink(theLink)) {
                data.addURL(theLink);
                data.addVisitedURL(theLink); // Register each unique URL to ensure it isn't stored in 'url_to_visit' again
                System.out.println(theLink);
            }               
        }   
    }
}

DataStruc Class

public class DataStruc {

    /** Queue to store URLs; can be accessed by multiple threads **/
    private ConcurrentLinkedQueue<String> url_to_visit = new ConcurrentLinkedQueue<String>();

    /** ArrayList of visited URLs **/
    private ArrayList<String> visited_url = new ArrayList<String>();

    public DataStruc(String initial_url) {
        url_to_visit.offer(initial_url);
    }

    // Add a URL to the queue
    public void addURL(String url) {
        url_to_visit.offer(url);
    }

    // Get URL at front of queue
    public String getURL() {
        return url_to_visit.poll();
    }

    // URL to visit size
    public int url_to_visit_size() {
        return url_to_visit.size();
    }

    // Add visited URL
    public void addVisitedURL(String url) {
        visited_url.add(url);
    }

    // Checks if link has already been visited
    public boolean oldLink(String link) {
        for(String s : visited_url) {
            if(s.equals(link)) {
                return true;
            }
        }   
        return false;
    }       
}

DataStruc is the shared data structure class, which will be concurrently accessed by each instance of a Crawler.java thread. DataStruc has a queue to store links to be visited and an ArrayList to store visited URLs, to prevent entering a loop. I used a ConcurrentLinkedQueue to store the URLs to be visited, since as I understand it, it takes care of concurrent access. I didn't think I required concurrent access for my ArrayList of visited URLs, as all I need to be able to do is add to it and iterate over it to check for matches.

My problem is that when I compare the running time of a single thread vs. two threads (on the same URL), my single-threaded version seems to be faster. I feel I have implemented the threading incorrectly, and would appreciate some tips if anybody can pinpoint the issues.

Thanks!


Answer 1:


Added: see my comment; I think this check in Crawler

// While URL's left to visit
        while((data.url_to_visit_size() > 0) && counter<INDEX_LIMIT) {

is wrong. The 2nd Thread will stop immediately since the 1st Thread polled the only URL.
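
As an illustration (this sketch is not from the original code, and the class name BlockingDataStruc is invented), one way around that race is to hand out URLs with a timed poll from a blocking queue, so that a thread only gives up after the queue has stayed empty for a while; the visited set is also made thread-safe here:

import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Illustrative variant of DataStruc: getURL() waits briefly instead of
// returning nothing the instant the queue happens to be empty
public class BlockingDataStruc {

    private final LinkedBlockingQueue<String> url_to_visit = new LinkedBlockingQueue<String>();

    // Thread-safe visited set, since several crawler threads call oldLink()/addVisitedURL()
    private final Set<String> visited_url = ConcurrentHashMap.newKeySet();

    public BlockingDataStruc(String initial_url) {
        url_to_visit.offer(initial_url);
    }

    public void addURL(String url) {
        url_to_visit.offer(url);
    }

    // Wait up to 5 seconds for another thread to enqueue a URL;
    // return null only if nothing arrives in that window
    public String getURL() throws InterruptedException {
        return url_to_visit.poll(5, TimeUnit.SECONDS);
    }

    public void addVisitedURL(String url) {
        visited_url.add(url);
    }

    public boolean oldLink(String link) {
        return visited_url.contains(link);
    }
}

The loop in Crawler would then become while(counter < INDEX_LIMIT), with a null from getURL() treated as "no more work left", instead of checking url_to_visit_size() at the exact moment the other thread is holding the only URL.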

You can ignore the rest of this answer, but I have left it for history ...

My general approach to such types of "big blocks that can run in parallel" is:

  1. Make each crawler a Callable. Probably Callable<List<String>>
  2. Submit them to an ExecutorService
  3. When they complete, take the results one at a time and add them to a List.

Using this strategy there is no need to use any concurrent lists at all. The disadvantage is that you don't get much live feedback as they are running. And, if what they return is huge, you may need to worry about memory.

Would this suit your needs? You would still have to worry about addVisitedURL, so you would still need that to be a concurrent data structure.

Added: Since you are starting with a single URL this strategy doesn't apply. You could apply it after the visit to the first URL.
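
For what it's worth, a rough sketch of that strategy could look like the following (the class name and the fetchAndExtractLinks placeholder are illustrative, not part of the original code; the Jsoup connect/select logic from the question would go inside the placeholder):

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class CallableCrawlerSketch {

    public static void main(String[] args) throws Exception {

        List<String> frontier = new ArrayList<String>();
        frontier.add("http://www.example.com/");   // seed URL(s), illustrative

        ExecutorService pool = Executors.newFixedThreadPool(2);
        List<Future<List<String>>> futures = new ArrayList<Future<List<String>>>();

        // 1 + 2: each crawl of a page is a Callable<List<String>> submitted to the pool
        for (String url : frontier) {
            Callable<List<String>> task = () -> fetchAndExtractLinks(url);
            futures.add(pool.submit(task));
        }

        // 3: when they complete, take the results one at a time and add them to a list
        List<String> discovered = new ArrayList<String>();
        for (Future<List<String>> f : futures) {
            discovered.addAll(f.get());
        }

        pool.shutdown();
        System.out.println("Discovered " + discovered.size() + " links");
    }

    // Placeholder: the Jsoup connect/select code from the question would go here
    private static List<String> fetchAndExtractLinks(String url) {
        return new ArrayList<String>();
    }
}

The newly discovered links could then be fed back in as the next round of Callables, which is roughly what applying the strategy "after the visit to the first URL" amounts to.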




Answer 2:


class controller {

    public static void main(String args[]) throws InterruptedException {

        final int LIMIT = 4;

        List<String> seedList = new ArrayList<>(); //1
        seedList.add("https://www.youtube.com/");
        seedList.add("https://www.digg.com/");
        seedList.add("https://www.reddit.com/");
        seedList.add("https://www.nytimes.com/");

        DataStruc[] data = new DataStruc[LIMIT];

        for(int i = 0; i < LIMIT; i++){
            data[i] = new DataStruc(seedList.get(i)); //2
        }

        ExecutorService es = Executors.newFixedThreadPool(LIMIT);
        Crawler[] crawl = new Crawler[LIMIT];
        for(int i = 0; i < LIMIT; i++){
            crawl[i] = new Crawler(data[i]); //3
        }

        for(int i = 0; i < LIMIT; i++){
            es.submit(crawl[i]); // 4
        }
    }
}

You can try this out:

  1. Create a seed list.

  2. Create DataStruc objects and add one seed URL to each of them.

  3. Create the Crawler array and pass a DataStruc object to each Crawler, one by one.
  4. Pass each Crawler object to the executor (see the note after this list about shutting the pool down).
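
One small addition, not in the original answer: a fixed thread pool uses non-daemon threads, so the JVM will stay alive after the crawlers finish unless the pool is shut down. Something along these lines after the submit loop should take care of that (the ten-minute wait is an arbitrary example, and it needs java.util.concurrent.TimeUnit imported):

        es.shutdown();                              // stop accepting new tasks
        es.awaitTermination(10, TimeUnit.MINUTES);  // wait for the submitted crawls to finish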


Source: https://stackoverflow.com/questions/26363585/implementing-threads-into-java-web-crawler
