Extract links from a web page

遇见更好的自我 2020-12-01 08:22

Using Java, how can I extract all the links from a given web page?

6 Answers
  •  没有蜡笔的小新
    2020-12-01 09:01

    This simple example seems to work, using a regex from here

    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;
    
    public List<String> extractUrlsFromString(String content)
    {
        List<String> result = new ArrayList<>();
    
        // Matches http(s), ftp and file URLs; the final character class
        // stops the match before trailing punctuation such as '.' or ','.
        String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
    
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        while (m.find())
        {
            result.add(m.group());
        }
    
        return result;
    }
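
    For example, a quick usage sketch (assuming the method above is in scope; the sample string is made up for illustration):

    String content = "See <a href=\"https://example.com/docs\">the docs</a> "
                   + "or ftp://files.example.com/readme.txt";
    
    // Prints https://example.com/docs and ftp://files.example.com/readme.txt
    for (String url : extractUrlsFromString(content))
    {
        System.out.println(url);
    }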
    

    And if you need it, this also seems to work for getting the HTML of a URL as a string, returning null if the page can't be fetched. It works fine with HTTPS URLs as well.

    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    
    import org.apache.commons.io.IOUtils;
    
    public String getUrlContentsAsString(String urlAsString)
    {
        try
        {
            URL url = new URL(urlAsString);
            // Read the whole response into a String; the no-charset
            // overload of IOUtils.toString is deprecated in Commons IO.
            return IOUtils.toString(url, StandardCharsets.UTF_8);
        }
        catch (Exception e)
        {
            // Swallow the error and signal failure with null.
            return null;
        }
    }
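
    Putting the two together, a minimal sketch (the LinkExtractor class name and the target URL are assumptions for illustration, not part of the answer above):

    import java.util.List;
    
    public class LinkExtractor
    {
        // ... paste extractUrlsFromString and getUrlContentsAsString here ...
    
        public static void main(String[] args)
        {
            LinkExtractor extractor = new LinkExtractor();
    
            // https://example.com is a placeholder target.
            String html = extractor.getUrlContentsAsString("https://example.com");
            if (html != null)
            {
                List<String> links = extractor.extractUrlsFromString(html);
                links.forEach(System.out::println);
            }
        }
    }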
    
