Java Web Crawler Libraries

栀梦 2020-12-13 04:58

I wanted to make a Java-based web crawler for an experiment. I heard that writing a web crawler in Java was the way to go if this is your first time. However, I have two impo

12 Answers
  • 2020-12-13 05:29

    For parsing content, I'm using Apache Tika.

  • 2020-12-13 05:29

    I recommend using the HttpClient library. You can find examples here.
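
    The linked examples are not reproduced in this thread, so here is a minimal sketch of the same idea. It swaps in the JDK's built-in java.net.http.HttpClient (Java 11+) rather than the library this answer most likely means, and the class name, timeout, and User-Agent string are placeholders chosen for illustration.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class HttpClientFetch {

    // Builds a GET request with a timeout and a User-Agent header.
    // The agent string and the 10-second timeout are made-up placeholders.
    static HttpRequest buildRequest(String pageUrl) {
        return HttpRequest.newBuilder(URI.create(pageUrl))
                .timeout(Duration.ofSeconds(10))
                .header("User-Agent", "ExampleCrawler/0.1")
                .GET()
                .build();
    }

    // Sends the request and returns the response body as a string.
    static String fetch(HttpClient client, String pageUrl)
            throws IOException, InterruptedException {
        HttpResponse<String> response =
                client.send(buildRequest(pageUrl), HttpResponse.BodyHandlers.ofString());
        return response.body();
    }

    public static void main(String[] args) throws Exception {
        // One client can be reused for every page the crawler visits.
        HttpClient client = HttpClient.newBuilder()
                .followRedirects(HttpClient.Redirect.NORMAL)
                .build();
        String html = fetch(client, "http://stackoverflow.com/");
        System.out.println(html.length() + " characters downloaded");
    }
}
```

    Compared with reading from a raw URL stream, a client object gives you one place to set timeouts, headers, and redirect handling for the whole crawl.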

  • 2020-12-13 05:31

    Though mainly used for unit testing web applications, HttpUnit traverses a website, clicks links, analyzes tables and form elements, and gives you metadata about all the pages. I use it for web crawling, not just for unit testing. - http://httpunit.sourceforge.net/

  • 2020-12-13 05:34

    This is how your program can 'visit' or 'connect' to web pages.

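        import java.io.BufferedReader;
        import java.io.IOException;
        import java.io.InputStreamReader;
        import java.net.MalformedURLException;
        import java.net.URL;
        import java.nio.charset.StandardCharsets;

        public class FetchPage {
            public static void main(String[] args) {
                try {
                    URL url = new URL("http://stackoverflow.com/");
                    // try-with-resources closes the stream even if reading fails;
                    // UTF-8 is assumed here, a real crawler should check the page's charset
                    try (BufferedReader reader = new BufferedReader(
                            new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {  // throws an IOException
                        String line;
                        while ((line = reader.readLine()) != null) {
                            System.out.println(line);
                        }
                    }
                } catch (MalformedURLException mue) {
                    mue.printStackTrace();
                } catch (IOException ioe) {
                    ioe.printStackTrace();
                }
            }
        }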
    

    This will download the source of the HTML page.

    For HTML parsing see this

    Also take a look at jSpider and jsoup
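
    Once a page's source is downloaded, a crawler needs the links in it to decide where to go next. As a rough, standard-library-only sketch of that step (the class and method names are hypothetical, and a regular expression is a deliberate shortcut; a real HTML parser such as jsoup handles malformed markup far better):

```java
import java.net.URI;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkExtractor {

    // Matches href="..." or href='...' inside <a> tags; simplified on purpose.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the distinct absolute http/https links found in html,
    // resolved against baseUrl, in the order they first appear.
    static Set<String> extractLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        Set<String> links = new LinkedHashSet<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String href = m.group(2).trim();
            // Drop the fragment so page#a and page#b count as one URL.
            int hash = href.indexOf('#');
            if (hash >= 0) {
                href = href.substring(0, hash);
            }
            if (href.isEmpty()) {
                continue;
            }
            try {
                URI resolved = base.resolve(href);
                String scheme = resolved.getScheme();
                if ("http".equals(scheme) || "https".equals(scheme)) {
                    links.add(resolved.toString());
                }
            } catch (IllegalArgumentException e) {
                // Skip hrefs that are not valid URIs.
            }
        }
        return links;
    }
}
```

    Feeding each downloaded page through something like extractLinks and queueing the results you have not seen yet is the core loop of a crawler.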

  • 2020-12-13 05:35

    I would prefer crawler4j. Crawler4j is an open source Java crawler which provides a simple interface for crawling the Web. You can set up a multi-threaded web crawler in a few hours.
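
    To give a feel for what "multi-threaded" means here, below is a small sketch using only java.util.concurrent. It is not crawler4j's API; the MiniCrawler class and the linkFetcher hook (given a URL, return the links found on that page) are hypothetical names invented for this illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

class MiniCrawler {

    // Crawls breadth-first from seed; the pages of each level are fetched in parallel.
    // Returns the visited URLs in discovery order, capped at maxPages.
    static Set<String> crawl(String seed, Function<String, List<String>> linkFetcher,
                             int threads, int maxPages) throws InterruptedException {
        Set<String> visited = new LinkedHashSet<>();
        List<String> frontier = new ArrayList<>();
        visited.add(seed);
        frontier.add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            while (!frontier.isEmpty()) {
                // One task per URL in the current level.
                List<Callable<List<String>>> tasks = new ArrayList<>();
                for (String url : frontier) {
                    tasks.add(() -> linkFetcher.apply(url));
                }
                List<String> next = new ArrayList<>();
                // invokeAll blocks until the whole level has finished.
                for (Future<List<String>> result : pool.invokeAll(tasks)) {
                    List<String> links;
                    try {
                        links = result.get();
                    } catch (ExecutionException e) {
                        continue; // a failed page simply contributes no links
                    }
                    for (String link : links) {
                        if (visited.size() < maxPages && visited.add(link)) {
                            next.add(link);
                        }
                    }
                }
                frontier = next;
            }
        } finally {
            pool.shutdown();
        }
        return visited;
    }
}
```

    Only the worker threads fetch pages; the visited set is updated on the calling thread, which keeps the sketch free of shared mutable state. A library like crawler4j adds the parts left out here, such as politeness delays and robots.txt handling.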

  • 2020-12-13 05:36

    There are now many Java-based HTML parsers that support visiting and parsing HTML pages.

    • Jsoup
    • Jaunt API
    • HtmlCleaner
    • JTidy
    • NekoHTML
    • TagSoup

    Here's the complete list of HTML parsers with a basic comparison.
