What's the correct format of a Java String regex to identify a DOI?

非 Y 不嫁゛ submitted on 2019-12-25 16:58:16

Question


I am conducting some research on identifying DOIs in free-format text.

I am using Java 8 and regular expressions.

I have found these regexes, which are supposed to fulfil my requirements:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i

The code I am trying is

private static final Pattern pattern_one = Pattern.compile("/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i", Pattern.CASE_INSENSITIVE);

Matcher matcher = pattern_one.matcher("http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1");
while (matcher.find()) {
    System.out.print("Start index: " + matcher.start());
    System.out.print(" End index: " + matcher.end() + " ");
    System.out.println(matcher.group());
}

However, the matcher doesn't find anything.

Where have I gone wrong?

UPDATE

I have encountered a valid DOI that my set of regexes does not match.

Here's an example DOI: 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2

Why doesn't this pattern work?

/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i

Answer 1:


In Java, a regex is written as a String. In other languages, a regex is delimited by /.../, with options such as i given after the closing /. So what is written as /XXX/i elsewhere is written in Java like this:

// Using flags parameter
Pattern p = Pattern.compile("XXX", Pattern.CASE_INSENSITIVE);

// Using embedded flags
Pattern p = Pattern.compile("(?i)XXX");
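To see what happens when the /.../i delimiters are left inside the Java string, here is a minimal sketch. The class name DelimiterDemo and the stand-in pattern abc are placeholders chosen for illustration and do not come from the question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; "abc" stands in for any real pattern.
class DelimiterDemo {
    // Copied verbatim from another language: the slashes and the trailing i
    // are treated as literal characters that must appear in the input.
    static final Pattern COPIED = Pattern.compile("/abc/i");
    // The two Java ways of asking for case-insensitive matching.
    static final Pattern WITH_FLAG = Pattern.compile("abc", Pattern.CASE_INSENSITIVE);
    static final Pattern EMBEDDED = Pattern.compile("(?i)abc");

    // Returns true if the pattern matches anywhere inside the input.
    static boolean found(Pattern pattern, String input) {
        return pattern.matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(found(COPIED, "ABC"));     // false
        System.out.println(found(COPIED, "/abc/i"));  // true
        System.out.println(found(WITH_FLAG, "ABC"));  // true
        System.out.println(found(EMBEDDED, "ABC"));   // true
    }
}
```

In the question's pattern the leading / also sits in front of ^, so the regex demands a literal slash before the start of the input, which can never match.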

In most languages, regexes are used to find a matching substring. Java can do that too, using the find() method (or any of the replaceXxx() regex methods). However, Java also has the matches() method, which matches against the entire string, eliminating the need for the begin and end boundary matchers ^ and $.
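The difference between find() and matches() can be sketched as follows. The class name FindVersusMatches is hypothetical, and the pattern is a simplified form of the first pattern in the question with the dot escaped, which is my assumption about what was intended.

```java
import java.util.regex.Pattern;

// Hypothetical demo class contrasting substring search with whole-string matching.
class FindVersusMatches {
    // Simplified DOI pattern (assumption): no ^ or $ anchors, dot escaped.
    static final Pattern DOI =
            Pattern.compile("10\\.\\d{4,9}/[-._;()/:A-Z0-9]+", Pattern.CASE_INSENSITIVE);

    // find() succeeds if the pattern occurs anywhere in the text.
    static boolean foundAsSubstring(String text) {
        return DOI.matcher(text).find();
    }

    // matches() succeeds only if the whole text is one match.
    static boolean matchesWholeString(String text) {
        return DOI.matcher(text).matches();
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        String bare = "10.1175/JPO3002.1";
        System.out.println(foundAsSubstring(url));    // true
        System.out.println(matchesWholeString(url));  // false
        System.out.println(matchesWholeString(bare)); // true
    }
}
```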

Anyway, your problem is that the regex has both the ^ and $ boundary matchers, which means it will only work if the string is nothing but the text you want to match. Since you actually want to find a substring, remove those matchers.

To search for one of multiple patterns, use the | logical regex operator.

And finally, since a Java regex is given as a String literal, any special characters, most notably \, need to be escaped.
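A related point is that a dot inside the regex itself matches any character unless it is escaped, so a literal dot is written as \\. in a Java string. A minimal sketch, with a hypothetical class name and shortened patterns chosen only for illustration:

```java
import java.util.regex.Pattern;

// Hypothetical demo class showing the effect of escaping the dot.
class DotEscapingDemo {
    static final String UNESCAPED = "10.\\d{4}";   // regex 10.\d{4}: the dot matches any character
    static final String ESCAPED = "10\\.\\d{4}";   // regex 10\.\d{4}: the dot must be a literal '.'

    // Returns true if the whole input matches the regex.
    static boolean matchesWhole(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(UNESCAPED, "10.1175")); // true
        System.out.println(matchesWhole(UNESCAPED, "10x1175")); // true, probably not intended
        System.out.println(matchesWhole(ESCAPED, "10.1175"));   // true
        System.out.println(matchesWhole(ESCAPED, "10x1175"));   // false
    }
}
```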

So, to build a single regex that can find substrings matching any of the following:

/^10.\d{4,9}/[-._;()/:A-Z0-9]+$/i
/^10.1002/[^\s]+$/i
/^10.\d{4}/\d+-\d+X?(\d+)\d+<[\d\w]+:[\d\w]*>\d+.\d+.\w+;\d$/i
/^10.1021/\w\w\d++$/i
/^10.1207/[\w\d]+\&\d+_\d+$/i

You would write it like this:

String regex = "10.\\d{4,9}/[-._;()/:A-Z0-9]+" +
              "|10.1002/[^\\s]+" +
              "|10.\\d{4}/\\d+-\\d+X?(\\d+)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+.\\d+.\\w+;\\d" +
              "|10.1021/\\w\\w\\d++" +
              "|10.1207/[\\w\\d]+\\&\\d+_\\d+";
Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE);

String input = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Matcher m = p.matcher(input);
while (m.find()) {
    System.out.println("Start index: " + m.start() +
                       " End index: " + m.end() +
                       " " + m.group());
}

Output

Start index: 37 End index: 54 10.1175/JPO3002.1
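On the UPDATE in the question: in a regex, ( and ) create a capturing group rather than matching literal parentheses, so the third pattern as given cannot match a DOI that contains (2002). The sketch below compares that pattern with a variant in which the parentheses and dots are escaped. The class name and the escaped variant are my own illustration, not something given in the question or the answers.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the DOI from the question's UPDATE.
class LiteralParenthesesDemo {
    static final String DOI = "10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2";

    // Third pattern from the question, minus the /^...$/i wrapper.
    static final Pattern AS_GIVEN = Pattern.compile(
            "10.\\d{4}/\\d+-\\d+X?(\\d+)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+.\\d+.\\w+;\\d",
            Pattern.CASE_INSENSITIVE);

    // Assumed fix: literal parentheses and dots are escaped.
    static final Pattern ESCAPED = Pattern.compile(
            "10\\.\\d{4}/\\d+-\\d+X?\\(\\d+\\)\\d+<[\\d\\w]+:[\\d\\w]*>\\d+\\.\\d+\\.\\w+;\\d",
            Pattern.CASE_INSENSITIVE);

    // Returns the first match in the text, or null if there is none.
    static String firstMatch(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        return m.find() ? m.group() : null;
    }

    public static void main(String[] args) {
        System.out.println(firstMatch(AS_GIVEN, DOI)); // null
        System.out.println(firstMatch(ESCAPED, DOI));  // 10.1175/1520-0485(2002)032<0870:CT>2.0.CO;2
    }
}
```

If the escaped variant is placed into a combined regex, the order of the alternatives matters: an earlier, more general alternative can match a shorter prefix of the same DOI first, so the more specific alternative would probably need to come first.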



Answer 2:


Your pattern looks incorrect to me. You are currently using this:

/^10.\\d{4,9}/[-._;()/:A-Z0-9]+$/i

But I think you intend to use this:

^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$

Problems with your pattern include that you are using JavaScript regex syntax, or some other language's syntax. Also, you were not escaping the literal dot in the regex, and the start-of-input marker ^ was out of place.

Code:

String pattern = "^.*/10\\.\\d{4,9}/[-._;()/:A-Z0-9]+$";
String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
Pattern r = Pattern.compile(pattern);
Matcher m = r.matcher(url);
if (m.find()) {
    System.out.println("Found value: " + m.group(0));
} else {
    System.out.println("NO MATCH");
}

Demo here:

Rextester
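Note that with the leading ^.*/ the whole URL becomes part of the match, so group(0) returns the full input rather than only the DOI. If only the DOI is wanted, one option is to wrap the DOI part in a capturing group and read group(1). This is a sketch under that assumption; the class name, the method name and the added CASE_INSENSITIVE flag are my own choices.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class that extracts only the DOI from a URL.
class DoiCaptureDemo {
    // Same pattern as in the answer, with the DOI part wrapped in group 1.
    static final Pattern URL_WITH_DOI = Pattern.compile(
            "^.*/(10\\.\\d{4,9}/[-._;()/:A-Z0-9]+)$", Pattern.CASE_INSENSITIVE);

    // Returns the text captured by group 1, or null if the input does not match.
    static String extractDoi(String input) {
        Matcher m = URL_WITH_DOI.matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "http://journals.ametsoc.org/doi/full/10.1175/JPO3002.1";
        System.out.println(extractDoi(url)); // 10.1175/JPO3002.1
    }
}
```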



Source: https://stackoverflow.com/questions/43683957/whats-the-correct-format-of-java-string-regex-to-identify-doi
