I have a vector filled with strings of the following format: <year1><year2><id1><id2>.
The first entries of the vector look like this:
199719982001
199719982002
199719982003
199719982003
For the first entry we have: year1 = 1997, year2 = 1998, id1 = 2, id2 = 001.
I want to write a regular expression that pulls out year1, id1, and the digits of id2 that are not zero. So for the first entry the regex should output: 199721.
I have tried doing this with the stringr package, and created the following regex:
"^\\d{4}|\\d{1}(?<=\\d{3}$)"
to pull out year1 and id1. However, when using the lookbehind I get an "invalid regular expression" error. This is a bit puzzling to me: can R not handle lookaheads and lookbehinds?
Since this is a fixed format, why not use substr? year1 is extracted using substr(s,1,4), id1 using substr(s,9,9), and id2 using as.numeric(substr(s,10,12)). In the last case I used as.numeric to get rid of the leading zeros.
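A minimal sketch of this fixed-width approach applied to the whole vector. The vector name `x` and the intermediate variable names are assumptions for illustration and do not come from the question.

```r
# Sample data copied from the question; the name `x` is an assumption.
x <- c("199719982001", "199719982002", "199719982003", "199719982003")

year1 <- substr(x, 1, 4)                # characters 1-4
id1   <- substr(x, 9, 9)                # character 9
id2   <- as.numeric(substr(x, 10, 12))  # characters 10-12; as.numeric drops leading zeros

# paste0 converts the numeric id2 back to text when concatenating
result <- paste0(year1, id1, id2)
print(result)
```

Because substr is vectorised, no loop is needed; each piece is a vector with one element per input string.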
You will need to use gregexpr from the base package. This works:
> s <- "199719982001"
> gregexpr("^\\d{4}|\\d{1}(?<=\\d{3}$)",s,perl=TRUE)
[[1]]
[1] 1 12
attr(,"match.length")
[1] 4 1
attr(,"useBytes")
[1] TRUE
Note the perl=TRUE setting, which switches to Perl-compatible regular expressions and with them lookahead and lookbehind support. For more details, see ?regex.
Judging from the output your regular expression does not catch id1 though.
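One way to make the pattern pick up id1 is to swap the lookbehind for a lookahead, so that the single digit matched is the one followed by exactly three digits and the end of the string. This is only a sketch, not the original poster's code; the names `pattern` and `parts` are assumptions, and regmatches from base R is used to pull out the matched text.

```r
s <- "199719982001"

# year1: four digits at the start of the string
# id1:   the single digit that is followed by exactly three digits and the end
pattern <- "^\\d{4}|\\d(?=\\d{3}$)"

# perl = TRUE is needed for lookaround assertions
parts <- regmatches(s, gregexpr(pattern, s, perl = TRUE))[[1]]
print(parts)

# Glue the two matches together
combined <- paste(parts, collapse = "")
print(combined)
```

This still leaves id2 to be handled separately, for example by stripping its leading zeros in a second step.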
You can use sub with capture groups. Matching the leading zeros of id2 with 0* keeps all of its remaining digits:
sub("^(.{4}).{4}(.)0*([1-9][0-9]{0,2})$","\\1\\2\\3",s)
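Since sub is vectorised, the capture-group approach can be run over the whole vector in one call. The sketch below writes the pattern with \\d and 0*, on the assumption that only the leading zeros of id2 should be dropped (so an id2 of 012 would yield 12). The vector name `x` is also an assumption.

```r
# Sample data copied from the question; the name `x` is an assumption.
x <- c("199719982001", "199719982002", "199719982003", "199719982003")

# \\1 = year1, then skip year2, \\2 = id1, skip leading zeros, \\3 = rest of id2
pattern <- "^(\\d{4})\\d{4}(\\d)0*(\\d{1,3})$"

result <- sub(pattern, "\\1\\2\\3", x, perl = TRUE)
print(result)

# A multi-digit id2 keeps all digits after its leading zeros
multi <- sub(pattern, "\\1\\2\\3", "199719982012", perl = TRUE)
print(multi)
```

Strings that do not fit the twelve-digit layout are returned unchanged by sub, so it may be worth checking the input with grepl first.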
Source: https://stackoverflow.com/questions/8834872/r-regular-expression-lookbehind