Using Java, how can I extract all the links from a given web page?
This simple example seems to work, using a regex from here:
    import java.util.ArrayList;
    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public List<String> extractUrlsFromString(String content)
    {
        List<String> result = new ArrayList<>();
        // Matches absolute http, https, ftp and file URLs
        String regex = "(https?|ftp|file)://[-a-zA-Z0-9+&@#/%?=~_|!:,.;]*[-a-zA-Z0-9+&@#/%=~_|]";
        Pattern p = Pattern.compile(regex);
        Matcher m = p.matcher(content);
        while (m.find())
        {
            result.add(m.group());
        }
        return result;
    }
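For example, calling it on a small snippet of HTML (a hypothetical usage sketch; the sample markup is made up):

    List<String> links = extractUrlsFromString(
        "<a href=\"https://example.com/page\">one</a> <a href=\"ftp://example.org/file.txt\">two</a>");
    // links now contains [https://example.com/page, ftp://example.org/file.txt]
    System.out.println(links);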
And if you need it, this seems to work for fetching the HTML of a URL as a string, returning null if the page can't be retrieved. It works fine with HTTPS URLs as well.
    import java.net.URL;
    import java.nio.charset.StandardCharsets;
    import org.apache.commons.io.IOUtils;

    public String getUrlContentsAsString(String urlAsString)
    {
        try
        {
            URL url = new URL(urlAsString);
            // Read the whole response body into a String (Apache Commons IO)
            return IOUtils.toString(url, StandardCharsets.UTF_8);
        }
        catch (Exception e)
        {
            // Bad URL, network error, etc.
            return null;
        }
    }
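Putting the two together answers the original question. This is a minimal sketch assuming both methods live in the same class and commons-io is on the classpath; the example URL is just a placeholder:

    public List<String> extractLinksFromPage(String pageUrl)
    {
        String html = getUrlContentsAsString(pageUrl);
        if (html == null)
        {
            // Page could not be fetched
            return new ArrayList<>();
        }
        return extractUrlsFromString(html);
    }

    // e.g. extractLinksFromPage("https://example.com") returns every
    // absolute URL that appears in the page's HTML

Note that the regex only picks up absolute URLs, so relative links (e.g. href="/about") won't be returned.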