Trying to parse links in an HTML directory listing using Java

闹比i 2020-12-22 02:53

Can someone please help me parse these links from an HTML page?

  • http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
  • http://nemer
2 Answers
  • 2020-12-22 03:19

    It looks like the problem is in your regex. Instead of

    Pattern pattern = Pattern.compile("<a.+href=\"(.+?)\"");
    

    Try:

    Pattern pattern = Pattern.compile("<a\\s+href=\"(.+?)\"");
    

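    The '.+' after '<a' in your first pattern matches one or more of any character, and because it is greedy it can run past the first href on the line. If you intended to match the whitespace after '<a', then use '\s+' instead (written '\\s+' inside a Java string literal).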

    The following code works as expected:

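        import java.util.regex.Matcher;
        import java.util.regex.Pattern;

        // Three sample anchor tags joined into a single line of input.
        String s = "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299\"/> " +
                "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154\" /> " +
                "<a href=\"http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158\"/>";
    
        // MULTILINE only changes how ^ and $ match, so it has no effect on this pattern.
        Pattern p = Pattern.compile("<a\\s+href=\"(.+?)\"", Pattern.MULTILINE);
        Matcher m = p.matcher(s);
        // Print the start offset of each match and the captured href value.
        while (m.find()) {
            System.out.println(m.start() + " : " + m.group(1));
        }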
    

    Output:

    0 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/2299
    72 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3154
    145 : http://nemertes.lis.upatras.gr/dspace/handle/123456789/3158
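
To make the difference between the two patterns concrete, here is a minimal sketch that runs both against a single line containing two links. The class name GreedyHrefDemo, the helper extract, and the example.com URLs are made up for illustration and are not from the question. Since '.' does not match line terminators by default, the greedy '.+' only overshoots when several links share one line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GreedyHrefDemo {
    // Collects capture group 1 from every match of the regex in the input.
    static List<String> extract(String regex, String html) {
        List<String> hrefs = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(html);
        while (m.find()) {
            hrefs.add(m.group(1));
        }
        return hrefs;
    }

    public static void main(String[] args) {
        // Hypothetical one-line input with two links (placeholder URLs).
        String html = "<a href=\"http://example.com/1\">one</a> "
                + "<a href=\"http://example.com/2\">two</a>";

        // Greedy ".+" runs to the last "href=" on the line, so only one link is found.
        System.out.println(extract("<a.+href=\"(.+?)\"", html));

        // "\\s+" matches only the whitespace after "<a", so both links are found.
        System.out.println(extract("<a\\s+href=\"(.+?)\"", html));
    }
}
```

Run as written, the first call should print [http://example.com/2] and the second [http://example.com/1, http://example.com/2].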
    
  • 2020-12-22 03:23

    Your regular expression matches ALL <a href... tags. The links you want always contain "/dspace/handle", so you can use something like this to scrape only the URLs you're looking for:

    Pattern pattern = Pattern.compile("<a.+href=\"(/dspace/handle/.+?)\"");
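
If several links can share a line, the greedy '.+' in the pattern above may skip ahead to the last href on that line. One possible variation, assuming the page uses relative hrefs that start with /dspace/handle/ (as the pattern above implies), keeps each match inside a single tag. HandleLinkFilter, handleLinks, and the sample HTML are hypothetical names and data used only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class HandleLinkFilter {
    // [^>]*? cannot cross the end of the opening tag, so each match stays in one <a ...>.
    // [^"]+ captures the href value up to its closing quote.
    private static final Pattern HANDLE_LINK =
            Pattern.compile("<a\\s[^>]*?href=\"(/dspace/handle/[^\"]+)\"");

    // Returns every href value that starts with /dspace/handle/.
    static List<String> handleLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HANDLE_LINK.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    public static void main(String[] args) {
        // Assumed sample markup: one unrelated link and two item links on one line.
        String html = "<a class=\"nav\" href=\"/dspace/browse\">Browse</a> "
                + "<a href=\"/dspace/handle/123456789/2299\">Item</a> "
                + "<a title=\"x\" href=\"/dspace/handle/123456789/3154\">Item</a>";

        // The /dspace/browse link is skipped; the two handle links are kept.
        System.out.println(handleLinks(html));
    }
}
```

With the sample input this should print [/dspace/handle/123456789/2299, /dspace/handle/123456789/3154]. Regexes remain fragile on real-world HTML (single-quoted attributes, unusual spacing), so treat this as a sketch rather than a general parser.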
    