R Strsplit keep delimiter in second element

Submitted by 别来无恙 on 2019-12-24 08:43:33

Question


I have been trying to solve this small issue for almost two hours without success. I simply want to split a string at a delimiter: one space followed by an uppercase letter. The second element should keep the letter from the delimiter, whereas the first element should not contain any part of it. Example:

 x <- "123123 123 A123"
 strsplit(x," [A-Z]")

results in:

"123123 123" "123"

However, this does not keep the letter A in the second element. I have tried using

strsplit(x,"(?<=[A-Z])",perl=T)

but this does not really work for my issue. It would also be okay if the second element started with a space; it just needs to contain the letter.


Answer 1:


If you want to follow your approach, match 1+ whitespaces that are followed by a letter. The whitespaces are consumed (and thus removed by the split), while the letter is only checked with a lookahead and therefore stays in the result:

> strsplit(x,"\\s+(?=[A-Z])",perl=T)
[[1]]
[1] "123123 123" "A123"


Details:

  • \s+ - 1 or more whitespaces (put into the match value and thus will be removed during splitting)
  • (?=[A-Z]) - the uppercase ASCII letter must appear immediately to the right of the current location, else fail the match (the letter is not part of the match value, and will be kept in the result)
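
As a minimal sketch of how the lookahead split behaves on several strings at once: the vector x2 and the names parts, first and second below are illustrative and not part of the original answer.

```r
# Hypothetical input: two strings shaped like the one in the question
x2 <- c("123123 123 A123", "34512 321 B521")

# Split on whitespace that is immediately followed by an uppercase ASCII letter;
# the lookahead checks the letter without consuming it
parts <- strsplit(x2, "\\s+(?=[A-Z])", perl = TRUE)

# Pull out the first and second piece of every split
first  <- vapply(parts, `[`, character(1), 1)
second <- vapply(parts, `[`, character(1), 2)

print(first)
print(second)
```

Because strsplit is vectorised over its input, no explicit loop is needed. A string without a matching delimiter would yield NA in second.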

You may also match up to the last non-whitespace char that is followed by 1+ whitespaces, and use the \K match reset operator to discard the text matched before the whitespaces, so that only the whitespaces count as the delimiter:

> strsplit(x,"^.*\\S\\K\\s+",perl=T)
[[1]]
[1] "123123 123" "A123"  

If the string contains line breaks, add a DOTALL flag since a dot in a PCRE regex does not match line breaks by default: "(?s)^.*\\S\\K\\s+".

Details:

  • ^ - start of string
  • .* - any 0+ chars other than line break chars, as many as possible, up to the last occurrence of the subsequent subpatterns (that is, \S\s+)
  • \S - a non-whitespace char
  • \K - match reset operator: drops all the text matched so far from the match value
  • \s+ - 1 or more whitespaces.
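
To illustrate the DOTALL remark, here is a small sketch with an input that contains a line break. The string multi is a made-up example, not taken from the question.

```r
# Hypothetical input with a line break before the last whitespace run
multi <- "123123\n123 A123"

# (?s) lets the dot match line breaks too, so ^.* can run across the "\n"
# and \K resets the match right before the last whitespace run
with_dotall <- strsplit(multi, "(?s)^.*\\S\\K\\s+", perl = TRUE)

print(with_dotall)
```

With (?s), everything before the final whitespace run, including the line break, ends up in the first element, and "A123" is the second.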





Answer 2:


I would go with the stringi package:

library(stringi)
x <- c("123123 123 A123", "34512 321 B521")  # slightly modified input data

l1 <- stri_split(x, fixed = " ")  # list with one character vector per input string
l1[[1]]
[1] "123123" "123"    "A123"

Then:

# Re-join the first two pieces with a space and keep the third piece as is
lapply(l1, function(p) c(paste(p[1], p[2]), p[3]))

[[1]] 
[1] "123123 123" "A123"      

[[2]]
[1] "34512 321" "B521"    
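
The paste step above assumes exactly three space-separated fields per string. As a rough sketch of the same split-and-rejoin idea for any number of fields, the following uses base R's strsplit instead of stringi; the helper split_last is a hypothetical name introduced here, not part of the original answer.

```r
# Hypothetical helper: split each string on single spaces, then glue
# everything except the last field back together
split_last <- function(s) {
  parts <- strsplit(s, " ", fixed = TRUE)
  lapply(parts, function(p) {
    n <- length(p)
    c(paste(p[-n], collapse = " "), p[n])
  })
}

x <- c("123123 123 A123", "34512 321 B521")
res <- split_last(x)
print(res)
```

Unlike the regex solutions, this version does not check that the last field starts with an uppercase letter; it simply splits at the last space.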


Source: https://stackoverflow.com/questions/44674503/r-strsplit-keep-delimiter-in-second-element
