Extract links from a web page

遇见更好的自我 2020-12-01 08:22

Using Java, how can I extract all the links from a given web page?

6 Answers
  •  萌比男神i
    2020-12-01 08:40

    Download the page as plain text/HTML and pass it through Jsoup or HtmlCleaner. Both are similar and can parse even malformed HTML 4.0 syntax. You can then use the familiar HTML DOM parsing methods like getElementsByName("a"), or, with jsoup, it is even simpler: you can just use

    import java.io.File;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    File input = new File("/tmp/input.html");
    Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

    Elements links = doc.select("a[href]");        // anchors with an href attribute
    Elements pngs = doc.select("img[src$=.png]");  // images whose src ends in .png

    Element masthead = doc.select("div.masthead").first();


    and find all the links, then get the details using

    for (Element link : links) {
        String linkHref = link.attr("href");  // use "abs:href" for the absolute URL
    }
    

    Taken from http://jsoup.org/cookbook/extracting-data/selector-syntax
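
    Putting the pieces together for the original question, extracting every link from a live page, a minimal self-contained sketch might look like the following. The URL and the class name LinkExtractor are illustrative assumptions; any reachable page works:

    import java.io.IOException;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class LinkExtractor {
        public static void main(String[] args) throws IOException {
            // Fetch and parse the page; the URL is just an example.
            Document doc = Jsoup.connect("http://example.com/").get();

            // Select every anchor that carries an href attribute.
            Elements links = doc.select("a[href]");
            for (Element link : links) {
                // "abs:href" resolves relative URLs against the page's base URI.
                System.out.println(link.attr("abs:href") + " -> " + link.text());
            }
        }
    }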

    The selectors use the same syntax as jQuery; if you know jQuery's method chaining, you will certainly love it.
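
    For example, selects can be chained to narrow a result set step by step, much like jQuery. The div.nav class name below is made up for illustration:

    // Scope first to a container, then to the links inside it.
    Elements navLinks = doc.select("div.nav")    // every <div class="nav">
                           .select("a[href]");   // then the anchors within them

    // The same narrowing can be written as a single combined selector:
    Elements same = doc.select("div.nav a[href]");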

    EDIT: If you want more tutorials, you can try this one by mkyong:

    http://www.mkyong.com/java/jsoup-html-parser-hello-world-examples/
