Extracting URLs from a text document using Java + Regular Expressions

醉话见心 2020-12-08 23:11

I'm trying to create a regular expression to extract URLs from text documents using Java, but thus far I've been unsuccessful. The two cases I'm looking to capture are li

4 answers
  • 2020-12-08 23:26

    This tests whether a given line is a URL:

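    import java.util.regex.*;

    // Loose check: accepts any line that starts with "http://" or "www."
    Pattern p = Pattern.compile("http://.*|www\\..*");
    Matcher m = p.matcher("http://..."); // put the line you want to check here
    if (m.matches()) { // matches() succeeds only if the whole line matches
        // do something
    }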
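
Note that `matches()` succeeds only when the entire input matches, so the check above suits a line that is nothing but a URL. The following is a minimal sketch of the difference, assuming you also want URLs embedded in a longer line. The class name `LineUrlCheck`, the `EMBEDDED` pattern, and the added `https` are illustrative assumptions, not part of the answer above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LineUrlCheck {
    // Same loose idea as the answer's pattern, with https added (assumption).
    // This only says "looks like a URL"; it does not validate anything.
    private static final Pattern WHOLE_LINE =
        Pattern.compile("https?://.*|www\\..*");

    // Hypothetical variant for scanning inside text: a match ends at whitespace.
    private static final Pattern EMBEDDED =
        Pattern.compile("(?:https?://|www\\.)\\S+");

    // True only if the whole (trimmed) line matches the pattern.
    static boolean isUrlLine(String line) {
        return WHOLE_LINE.matcher(line.trim()).matches();
    }

    // Collects every URL-like run found anywhere in the text.
    static List<String> findInText(String text) {
        List<String> found = new ArrayList<>();
        Matcher m = EMBEDDED.matcher(text);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }
}
```

With `find()`, trailing punctuation such as a closing parenthesis or a full stop stays attached to the match, so some post-processing may still be needed.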
    
  • 2020-12-08 23:27

    This link has very good URL RegExes (they are surprisingly hard to get right, by the way: think http/https, port numbers, valid characters, GET strings, pound signs for anchor links, etc.):

    http://flanders.co.nz/2009/11/08/a-good-url-regular-expression-repost/

    Perl has CPAN libraries that contain canned RegExes, including ones for URLs. Not sure about Java, though :(
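
On the Java side, one option that avoids a hand-written URL RegEx is to let `java.net.URI` from the standard library do the parsing. The following is a rough sketch under stated assumptions: the class name `UriTokenScan` and the scheme list are made up for illustration, tokens are split on whitespace only, and scheme-less links such as `www.example.com` are not picked up because the parser treats them as relative paths.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

class UriTokenScan {
    // Keeps whitespace-separated tokens that parse as http/https/ftp URIs with a host.
    static List<String> extract(String text) {
        List<String> urls = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            try {
                URI uri = new URI(token);
                if (uri.getHost() != null && isWebScheme(uri.getScheme())) {
                    urls.add(token);
                }
            } catch (URISyntaxException e) {
                // Token contains characters that are illegal in a URI; skip it.
            }
        }
        return urls;
    }

    // Assumed set of interesting schemes; adjust as needed.
    private static boolean isWebScheme(String scheme) {
        return scheme != null
            && (scheme.equalsIgnoreCase("http")
                || scheme.equalsIgnoreCase("https")
                || scheme.equalsIgnoreCase("ftp"));
    }
}
```

Because `URI` is strict about illegal characters, a URL containing raw spaces or unescaped non-ASCII characters may be rejected, so this is a complement to a RegEx rather than a drop-in replacement.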

  • 2020-12-08 23:28

    If you want to make sure you are really matching a URL and not just some word starting with 'www.', you can use the expression mentioned by DVK above. I modified it slightly and wrote a small code snippet as a starting point for you:

    import java.util.*;
    import java.util.regex.*;
    
    class FindUrls
    {
        public static List<String> extractUrls(String input) {
            List<String> result = new ArrayList<String>();
    
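            // Parts, in order: scheme or "www." prefix, optional user:password@,
            // host ending in a known TLD or two-letter country code, optional port,
            // path, query string, fragment.
            Pattern pattern = Pattern.compile(
                "\\b(((ht|f)tp(s?)\\:\\/\\/|~\\/|\\/)|www\\.)" +
                "(\\w+:\\w+@)?(([-\\w]+\\.)+(com|org|net|gov" +
                "|mil|biz|info|mobi|name|aero|jobs|museum" +
                "|travel|[a-z]{2}))(:[\\d]{1,5})?" +
                "(((\\/([-\\w~!$+|.,=]|%[a-f\\d]{2})+)+|\\/)+|\\?|#)?" +
                "((\\?([-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
                "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)" +
                "(&(?:[-\\w~!$+|.,*:]|%[a-f\\d]{2})+=?" +
                "([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)*)*" +
                "(#([-\\w~!$+|.,*:=]|%[a-f\\d]{2})*)?\\b",
                // Case-insensitive so that HTTP://, upper-case TLDs and
                // escapes such as %2F are matched too.
                Pattern.CASE_INSENSITIVE);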
    
            Matcher matcher = pattern.matcher(input);
            while (matcher.find()) {
                result.add(matcher.group());
            }
    
            return result;
        }
    }
    
  • 2020-12-08 23:46

    All RegEx-based code is over-engineered, especially the code from the most-voted answer, and here is why: it will find only valid URLs! For example, it will ignore anything that starts with "http://" and has non-ASCII characters inside.

    What's more, I have seen processing times of 1-2 seconds (single-threaded, dedicated) with the Java RegEx package for very small and simple sentences, nothing special; possibly a bug in Java 6 RegEx...

    The simplest and fastest solution would be to use StringTokenizer to split the text into tokens, remove the tokens starting with "http://" etc., and concatenate the remaining tokens into text again.
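
The tokenizer approach could be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `TokenUrlSplitter`, the prefix list, and the choice to collect the URL tokens in a list (rather than just drop them) are not from the answer, and rejoining with single spaces does not preserve the original whitespace.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

class TokenUrlSplitter {
    // Assumed prefixes that mark a token as a URL; extend as needed.
    private static final String[] PREFIXES = {"http://", "https://", "www."};

    final List<String> urls = new ArrayList<>();
    final String remainingText;

    TokenUrlSplitter(String text) {
        StringBuilder rest = new StringBuilder();
        // The default delimiters are whitespace characters.
        StringTokenizer tokens = new StringTokenizer(text);
        while (tokens.hasMoreTokens()) {
            String token = tokens.nextToken();
            if (looksLikeUrl(token)) {
                urls.add(token);
            } else {
                if (rest.length() > 0) {
                    rest.append(' ');
                }
                rest.append(token);
            }
        }
        remainingText = rest.toString();
    }

    // Case-insensitive prefix test; no validation of the rest of the token.
    static boolean looksLikeUrl(String token) {
        for (String prefix : PREFIXES) {
            if (token.regionMatches(true, 0, prefix, 0, prefix.length())) {
                return true;
            }
        }
        return false;
    }
}
```

Since only the prefix is inspected, tokens with non-ASCII characters are kept as URLs, which is the behavior this answer argues for. The trade-off is that anything starting with "www." counts, valid or not.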

    If you really want to use RegEx with Java, try Automaton.
