Extract links from a web page

遇见更好的自我 2020-12-01 08:22

Using Java, how can I extract all the links from a given web page?

6 Answers
  •  Happy的楠姐
    2020-12-01 08:42

    Either use regular expressions with the appropriate java.util.regex classes, or use an HTML parser. Which one to choose depends on whether you need to handle the whole web or just a few specific pages whose layout you know and can test against.

    A simple regex which would match 99% of pages could be this:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // The HTML page as a String, obtained elsewhere
    String HTMLPage;
    Pattern linkPattern = Pattern.compile("(<a [^>]+>.+?</a>)",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    Matcher pageMatcher = linkPattern.matcher(HTMLPage);
    List<String> links = new ArrayList<>();
    while (pageMatcher.find()) {
        links.add(pageMatcher.group());
    }
    // links now contains every link in the page as a complete HTML tag,
    // i.e. <a href="...">Text inside the tag</a>
    

    You can edit it to match more and be more standards-compliant, but at that point you would really want a proper parser (see the jsoup sketch at the end of this answer). If you are only interested in the href value and the text in between, you can also use this regex:

    Pattern linkPattern = Pattern.compile(
            "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    

    You can then access the link target with .group(1) and the link text with .group(2), for example:
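
    A minimal usage sketch of that pattern (hypothetical variable names, reusing the HTMLPage string from the first snippet) might look like this:

    Matcher linkMatcher = linkPattern.matcher(HTMLPage);
    while (linkMatcher.find()) {
        String href = linkMatcher.group(1); // the URL from the href attribute
        String text = linkMatcher.group(2); // the anchor text between the tags
        System.out.println(text + " -> " + href);
    }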
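
    If you take the real-parser route mentioned above, one common option is the jsoup library; here is a sketch under that assumption (the question does not name a parser, and the URL is only a placeholder):

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;

    // Fetch and parse the page, then select every <a> element that has an href attribute
    Document doc = Jsoup.connect("https://example.com/").get(); // throws IOException
    for (Element anchor : doc.select("a[href]")) {
        String href = anchor.attr("abs:href"); // absolute URL, resolved against the page's base URI
        String text = anchor.text();           // visible link text
        System.out.println(text + " -> " + href);
    }

    The a[href] selector skips anchors without an href, and attr("abs:href") resolves relative links against the page's base URI, which is the main thing the regex approach cannot do reliably.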
