Using Java, how can I extract all the links from a given web page?
Either use a regular expression with the appropriate classes, or use an HTML parser. Which one to choose depends on whether you need to handle the whole web, or just a few specific pages whose layout you know and can test against.
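If you go the parser route, a minimal sketch using the jsoup library could look like this (jsoup is a third-party dependency you would add yourself, and the URL here is just a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page; jsoup copes with malformed HTML
        Document doc = Jsoup.connect("https://example.com").get();
        // Select every anchor element that has an href attribute
        for (Element link : doc.select("a[href]")) {
            // "abs:href" resolves relative URLs against the page's base URL
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}

A parser like this also handles nesting, comments, and broken markup that will trip up any regex.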
A simple regex that matches the anchor tags on most pages could be this:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The HTML page as a String
String HTMLPage;
// Match a complete anchor tag, including its text content
Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(HTMLPage);
List<String> links = new ArrayList<>();
while (pageMatcher.find()) {
    links.add(pageMatcher.group());
}
// links now contains every link in the page as a full HTML tag,
// i.e. <a href="...">Text inside tag</a>
You can tweak it to match more cases and be more standards-compliant, but at that point you would want a real parser instead. If you are only interested in the href="" value and the text in between, you can also use this regex:
Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
You can then access the URL with .group(1) and the link text with .group(2).
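For example, here is a small self-contained sketch of that (the sample HTML string is made up just to show the two groups):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    public static void main(String[] args) {
        // Hypothetical sample input to demonstrate the capture groups
        String html = "<p>See <a href=\"https://example.com\">Example</a> "
                    + "and <A HREF='/about'>About us</A>.</p>";
        Pattern linkPattern = Pattern.compile(
                "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = linkPattern.matcher(html);
        while (m.find()) {
            // group(1) is the URL, group(2) is the anchor text
            System.out.println("href: " + m.group(1) + ", text: " + m.group(2));
        }
        // Prints:
        // href: https://example.com, text: Example
        // href: /about, text: About us
    }
}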