R regex lookbehind/lookahead issue

故事扮演 submitted on 2021-02-08 10:06:42


I am trying to extract passages like 44.11.36.00-1 (precisely, nn.nn.nn.nn-n, where n stands for any digit from 0-9) from text in R.

I want to extract a passage only if it is directly adjacent to non-digit characters:

  • 44.11.36.00-1 extracted from nsfghstighsl44.11.36.00-1vsdfgh is OK
  • 44.11.36.00-1 extracted from fa0044.11.36.00-1000 is NOT

I have read that str_extract_all does not work with lookbehind and lookahead expressions, so I reluctantly went back to grep, but I cannot get it to do what I want:

> pattern1 <- "(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})"
> grep(pattern1, "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 ", perl=TRUE, value = TRUE)

[1] "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "

which is not the result I expected.

I thought that:

  • (?<![0-9]{1}) means "match an expression that is not preceded by a digit"
  • [0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1} stands for the expression I am looking for
  • (?![0-9]{1}) means "match an expression that is not followed by a digit"


As @Roland said in his comment, you need to use regmatches instead of grep:

> s <- "nsfghstighsl44.11.36.00-1vsdfgh"
> m <- gregexpr("(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})", s, perl=TRUE)
> regmatches(s, m)
[[1]]
[1] "44.11.36.00-1"

A shorter version of the pattern, using \\d in place of [0-9]:

> x <- c('nsfghstighsl44.11.36.00-1vsdfgh', 'fa0044.11.36.00-1000')
> m <- gregexpr("(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[[1]]
[1] "44.11.36.00-1"

[[2]]
character(0)
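
To make the difference between grep and regmatches concrete, here is a minimal base R sketch run on the test string from the question. The variable names txt, whole and hits are chosen only for this illustration. It suggests that the pattern itself was fine and that the surprise came from what grep returns.

```r
# Pattern and test string copied from the question (PCRE via perl = TRUE).
pattern1 <- "(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})"
txt <- "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "

# grep(value = TRUE) returns every *element* of the input that contains
# a match, so the whole string comes back unchanged.
whole <- grep(pattern1, txt, perl = TRUE, value = TRUE)

# gregexpr() records the position and length of each match;
# regmatches() uses that information to cut out only the matched text.
hits <- regmatches(txt, gregexpr(pattern1, txt, perl = TRUE))[[1]]

print(identical(whole, txt))  # TRUE
print(hits)                   # "44.11.36.00-1"
```

The other two candidates in the string, 44113600-1 and 441136001, are skipped because they contain no dots.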


You don't actually need lookahead or lookbehind if you use strapply from the gsubfn package. Just parenthesize the portion you want extracted:

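library(gsubfn) # provides strapply

x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000") # test data

# group 1: start of string or a non-digit; group 2: the code; group 3: a non-digit or end of string
pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"
strapply(x, pat, ~ ..2, simplify = c)

## [1] "44.11.36.00-1"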

Note that ~ ..2 is short for function(...) ..2, which returns the match to the second parenthesized portion of the regular expression. We could also have written function(x, y, z) y or x + y + z ~ y.

Note: The question seems to say that a non-digit must come directly before and after the string. Based on comments that have since disappeared, however, it appears that what was really wanted was for the string to be either at the beginning or just after a non-digit, and either at the end or followed by a non-digit. The answer has been modified accordingly.
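
For readers without gsubfn, the same capture-group idea can be sketched in base R with regexec. This is an illustration rather than part of the original answer. The third test string and the names groups and codes are assumptions added here to exercise the start-of-string and end-of-string cases described in the note.

```r
x <- c("nsfghstighsl44.11.36.00-1vsdfgh",  # code surrounded by letters
       "fa0044.11.36.00-1000",             # code surrounded by digits
       "44.11.36.00-1")                    # code is the whole string (added case)

pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"

# regexec() reports the full match followed by one entry per parenthesized
# group, so element 3 of each result is the second group (the code itself).
groups <- regmatches(x, regexec(pat, x, perl = TRUE))

# Elements with no match come back as character(0); map those to NA.
codes <- vapply(groups,
                function(g) if (length(g) >= 3) g[3] else NA_character_,
                character(1))

print(codes)  # the code for the first and third strings, NA for the second
```

One design difference from the lookaround version: the groups (^|\\D) and (\\D|$) consume the neighbouring characters, whereas (?<!\\d) and (?!\\d) only test them. regexec also reports only the first match in each string, so the gregexpr approach is the better fit when a string may contain several codes.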

