Using Java, how can I extract all the links from a given web page?
Either use a regular expression with the appropriate classes, or use an HTML parser. Which one to choose depends on whether you need to handle the whole web, or just a few specific pages whose layout you know and can test against.
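If you go the parser route, a minimal sketch using the jsoup library could look like this (jsoup is a third-party dependency you would add yourself, and the URL here is just a placeholder):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class LinkExtractor {
    public static void main(String[] args) throws Exception {
        // Fetch and parse the page; jsoup copes with malformed HTML
        Document doc = Jsoup.connect("https://example.com").get();
        // Select every anchor element that has an href attribute
        for (Element link : doc.select("a[href]")) {
            // "abs:href" resolves relative URLs against the page's base URL
            System.out.println(link.attr("abs:href") + " -> " + link.text());
        }
    }
}

A parser like this also handles nesting, comments, and broken markup that will trip up any regex.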
A simple regex that matches the anchor tags on most pages could be this:
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The HTML page as a String
String HTMLPage;
// Match a complete anchor tag, including its text content
Pattern linkPattern = Pattern.compile("(<a[^>]+>.+?</a>)",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
Matcher pageMatcher = linkPattern.matcher(HTMLPage);
List<String> links = new ArrayList<>();
while (pageMatcher.find()) {
    links.add(pageMatcher.group());
}
// links now contains every link in the page as a full HTML tag,
// i.e. <a href="...">Text inside tag</a>
You can tweak it to match more cases and be more standards-compliant, but at that point you would want a real parser instead. If you are only interested in the href="" value and the text in between, you can also use this regex:
Pattern linkPattern = Pattern.compile("<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
You can then access the URL with .group(1) and the link text with .group(2).
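For example, here is a small self-contained sketch of that (the sample HTML string is made up just to show the two groups):

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HrefExtractor {
    public static void main(String[] args) {
        // Hypothetical sample input to demonstrate the capture groups
        String html = "<p>See <a href=\"https://example.com\">Example</a> "
                    + "and <A HREF='/about'>About us</A>.</p>";
        Pattern linkPattern = Pattern.compile(
                "<a[^>]+href=[\"']?([^\"'>]+)[\"']?[^>]*>(.+?)</a>",
                Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = linkPattern.matcher(html);
        while (m.find()) {
            // group(1) is the URL, group(2) is the anchor text
            System.out.println("href: " + m.group(1) + ", text: " + m.group(2));
        }
        // Prints:
        // href: https://example.com, text: Example
        // href: /about, text: About us
    }
}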