Extracting URLs from a text document using Java + Regular Expressions

醉话见心 2020-12-08 23:11

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are li

4 Answers
  •  一生所求
    2020-12-08 23:28

    If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression mentioned by DVK earlier. I modified it slightly and wrote a small code snippet as a starting point:

    import java.util.*;
    import java.util.regex.*;
    
    class FindUrls
    {
        public static List<String> extractUrls(String input) {
            List<String> result = new ArrayList<>();
    
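            Pattern pattern = Pattern.compile(
                // scheme (http, https, ftp, ftps), a relative prefix, or a literal "www."
                "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" +
                // optional user:password@, then host labels and top-level domain
                "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
                "|mil|biz|info|mobi|name|aero|jobs|museum" +
                // remaining TLD alternatives, then an optional port
                "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
                // optional path (percent-escapes are exactly two hex digits)
                "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
                // optional query string: first parameter, then &-separated ones
                "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
                "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
                "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
                "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
                // optional fragment
                "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b");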
    
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                result.add(matcher.group());
            }
    
            return result;
        }
    }
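
    To see the find-and-collect loop in isolation, here is a minimal sketch that uses a much shorter pattern. The class name UrlDemo, the method name extract, and the simplified pattern are assumptions made for illustration, not part of the expression above; the pattern is far less strict (it does not check the top-level domain, for example). It does keep the trailing \b, which lets the regex engine back off sentence punctuation such as a final period.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the pattern is a simplified stand-in, not the full expression.
class UrlDemo {
    // Compiled once and reused; (?i) makes the whole pattern case-insensitive.
    private static final Pattern SIMPLE_URL = Pattern.compile(
        "(?i)\\b(?:(?:https?|ftp)://|www\\.)" +   // scheme or a literal "www."
        "[-\\w.]+" +                              // host name
        "(?::\\d{1,5})?" +                        // optional port
        "(?:/[-\\w~!$+|.,=%/]*)?" +               // optional path
        "(?:\\?[-\\w~!$+|.,*:=&%]*)?" +           // optional query string
        "(?:#[-\\w~!$+|.,*:=%]*)?" +              // optional fragment
        "\\b");                                   // must end on a word character

    static List<String> extract(String input) {
        List<String> result = new ArrayList<>();
        Matcher matcher = SIMPLE_URL.matcher(input);
        // find() advances to the next match; group() returns the matched text.
        while (matcher.find()) {
            result.add(matcher.group());
        }
        return result;
    }
}
```

    Calling UrlDemo.extract("See http://example.com/a?b=1 and www.test.org.") should return the two URLs without the period that ends the sentence.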
    
