r regex Lookbehind Lookahead issue

故事扮演 提交于 2021-02-08 10:06:42

问题


I try to extract passages like 44.11.36.00-1 (precisely, nn.nn.nn.nn-n, where n stands for any number from 0-9) from text in R.

I want to extract passages if they are "sticked" to non-number marks:

  • 44.11.36.00-1 extracted from nsfghstighsl44.11.36.00-1vsdfgh is OK
  • 44.11.36.00-1 extracted from fa0044.11.36.00-1000 is NOT

I have read that str_extract_all is not working with Lookbehind and Lookahead expressions, so I sadly came back to grep, but cannot deal with it:

> pattern1 <- "(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})"
> grep(pattern1, "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 ", perl=TRUE, value = TRUE)

[1] "dyj44.11.36.00-1aregjspotgji 44113600-1 agdtklj441136001 "

which is not the result I expected.

I thought that:

  • (?<![0-9]{1}) means "match expression which is not preceeded by a number"
  • [0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1} stands for the expression I seek for
  • (?![0-9]{1}) means "match expression which is not followed by a number"

回答1:


AS @Roland said in his comment, you need to use regmatches instead of grep

> s <- "nsfghstighsl44.11.36.00-1vsdfgh"
> m <- gregexpr("(?<![0-9]{1})[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}\\.[0-9]{2}-[0-9]{1}(?![0-9]{1})", s, perl=TRUE)
> regmatches(s, m)
[1] "44.11.36.00-1"

A reduced one,

> x <- c('nsfghstighsl44.11.36.00-1vsdfgh', 'fa0044.11.36.00-1000')
> m <- gregexpr("(?<!\\d)\\d{2}\\.\\d{2}\\.\\d{2}\\.\\d{2}-\\d(?!\\d)", x, perl=TRUE)
> regmatches(x, m)
[1] "44.11.36.00-1"



回答2:


You don't actually need lookahead or lookbehind with this approach. Just parenthesize the portion you want extracted:

library(gsubfn)
x <- c("nsfghstighsl44.11.36.00-1vsdfgh", "fa0044.11.36.00-1000") # test data

pat <- "(^|\\D)(\\d{2}[.]\\d{2}[.]\\d{2}[.]\\d{2}-\\d)(\\D|$)"
strapply(x, pat, ~ ..2, simplify = c)

## "44.11.36.00-1"

Note that ~ ..2 is short for the function function(...) ..2 which means grab the match to the second parenthesized portion in the regular expression. We could also have written function(x, y, z) y or x + y + z ~ y .

Note: The question seems to say that a non-numeric must come directly before and after the string but based on comments that have since disappeared it appears that what was really wanted was that the string be either at the beginning or just after a non-number and must either be at the end or folowed by a non-number. The answer has been so modified.



来源:https://stackoverflow.com/questions/26573611/r-regex-lookbehind-lookahead-issue

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!