Trying to parse links in an HTML directory listing using Java

Can someone please help me parse these links from an HTML page?

http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
http://nemer

2 Answers

长情又很酷

2020-12-22 03:19

It looks like the problem is in your regex. Instead of

Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");

Try:

Pattern pattern = Pattern.compile("<a\\s+href=\"(.+?)\"");

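The 'a.+' in your first pattern matches any character one or more times. Because '.+' is greedy, it consumes as much of the line as it can and only backtracks far enough to find the last href on that line. If you intended to match whitespace, use '\s+' instead (written as "\\s+" in a Java string literal).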

The following code works as expected:

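// Needs java.util.regex.Pattern and java.util.regex.Matcher.
String s = "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299\"/> "
        + "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154\" /> "
        + "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158\"/>";
// MULTILINE only affects ^ and $, so it makes no difference for this pattern.
Pattern p = Pattern.compile("<a\\s+href=\"(.+?)\"", Pattern.MULTILINE);
Matcher m = p.matcher(s);
while (m.find()) {
    // Print the offset where the match starts and the captured URL.
    System.out.println(m.start() + " : " + m.group(1));
}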

output:

0 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
72 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
145 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158
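
To see why the two patterns behave differently, here is a minimal sketch that runs both against the same one-line input. The class name GreedyVsWhitespace, the helper hrefs, and the sample HTML are made up for illustration and are not from the original question.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GreedyVsWhitespace {
    // Hypothetical helper: collects capture group 1 of every match of regex in html.
    static List<String> hrefs(String regex, String html) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            out.add(m.group(1));
        }
        return out;
    }

    public static void main(String[] args) {
        // Two anchors on a single line (illustrative input only).
        String html = "<a href=\"/x/1\">one</a> <a href=\"/x/2\">two</a>";

        // Greedy ".+" runs to the end of the line, then backtracks to the LAST href.
        System.out.println(hrefs("<a.+href=\"(.+?)\"", html));   // prints [/x/2]

        // "\\s+" only allows whitespace between "<a" and "href", so each tag matches.
        System.out.println(hrefs("<a\\s+href=\"(.+?)\"", html)); // prints [/x/1, /x/2]
    }
}
```

Note that the '\s+' version assumes href is the first attribute of the tag. If the listing puts other attributes before href, an HTML parser is likely a safer choice than a regex.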

一向

2020-12-22 03:23

Your regular expression matches ALL <a href...> tags. The links you want always start with "/dspace/handle", so you can use something like this to extract only the URLs you're looking for:

Pattern pattern = Pattern.compile("<a.+href=\"(/dspace/handle/.+?)\"");
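
As a rough sketch of how that filter could be used end to end, the example below keeps only hrefs that start with /dspace/handle/ and resolves them against a base URL. It makes several assumptions: the page uses relative hrefs, the base URL is the host seen in the question's links, href is the first attribute of each tag, and it uses '\s+' rather than '.+' after '<a' so that several tags on one line are each matched. The class name HandleLinks, the method handleLinks, and the sample HTML are hypothetical.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class HandleLinks {
    // Captures only hrefs that begin with /dspace/handle/ (assumes relative links).
    private static final Pattern HANDLE =
            Pattern.compile("<a\\s+href=\"(/dspace/handle/.+?)\"");

    // Hypothetical helper: returns the matching links as absolute URLs.
    static List<String> handleLinks(String html, String baseUrl) {
        URI base = URI.create(baseUrl);
        List<String> out = new ArrayList<>();
        Matcher m = HANDLE.matcher(html);
        while (m.find()) {
            // Resolve the relative path against the assumed base URL.
            out.add(base.resolve(m.group(1)).toString());
        }
        return out;
    }

    public static void main(String[] args) {
        // Illustrative input: one unrelated link and one handle link.
        String html = "<a href=\"/dspace/browse\">Browse</a> "
                + "<a href=\"/dspace/handle/123456789/2299\">Item</a>";
        for (String url : handleLinks(html, "http://nemertes.lis.upatras.gr/")) {
            System.out.println(url);
        }
        // prints http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
    }
}
```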
