R split string at last whitespace chars using tidyr::separate

Posted by 痴心易碎 on 2019-12-10 03:14:24

Question


Suppose I have a dataframe like this:

df<-data.frame(a=c("AA","BB"),b=c("short string","this is the longer string"))

I would like to split each string at its last space, using a regex. I tried:

library(dplyr)
library(tidyr)
df%>%
  separate(b,c("partA","partB"),sep=" [^ ]*$")

But this omits the second part of the string in the output. My desired output would look like this:

   a              partA  partB
1 AA              short string
2 BB this is the longer string

How do I do this? It would be nice if I could use tidyr and dplyr for this.


Answer 1:


We can use extract from tidyr with capture groups ((...)). The first capture group matches zero or more characters ((.*)). It is followed by one or more whitespace characters (\\s+), and then by a second capture group that matches one or more characters that are not a space ([^ ]+) up to the end ($) of the string. Because .* is greedy, the first group takes as much as it can, so the split lands on the last whitespace.

library(tidyr)
extract(df, b, into = c('partA', 'partB'), '(.*)\\s+([^ ]+)$')
#   a              partA  partB
#1 AA              short string
#2 BB this is the longer string
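
As a quick sanity check on the pattern itself, the same regex can be tried with base R's sub, independently of tidyr. This is only a sketch: the vector x and the names pattern, partA and partB are illustrative and not part of the original answer, and perl = TRUE is passed merely to pin down the regex engine.

```r
# Illustrative input: the two strings from column b of the question
x <- c("short string", "this is the longer string")

# Same pattern as in the extract() call above
pattern <- "(.*)\\s+([^ ]+)$"

# Keep only the first or the second capture group of each string
partA <- sub(pattern, "\\1", x, perl = TRUE)
partB <- sub(pattern, "\\2", x, perl = TRUE)

print(data.frame(partA, partB))
```

Note that sub returns a string unchanged when the pattern does not match, so a value without any whitespace would appear in both partA and partB here.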



Answer 2:


You may turn the [^ ]*$ part of your regex into a (?=[^ ]*$) non-consuming pattern, a positive lookahead (that will not consume the non-whitespace chars at the end of the string, i.e. they won't be put into the match value and thus will stay there in the output):

df%>%
  separate(b,c("partA","partB"),sep=" (?=[^ ]*$)")
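
To see what "non-consuming" means in practice, one can inspect what each separator actually matches. The sketch below uses base R's regexpr and regmatches rather than tidyr; the names x, consuming and lookahead are illustrative, and perl = TRUE is needed here because base R's default regex engine does not support lookaheads.

```r
# Illustrative input: the two strings from column b of the question
x <- c("short string", "this is the longer string")

# Original separator: the match swallows the last word along with the space,
# so a splitter that drops the separator also drops that word
consuming <- regmatches(x, regexpr(" [^ ]*$", x))

# Lookahead version: only the space is part of the match,
# the last word is merely checked, not consumed
lookahead <- regmatches(x, regexpr(" (?=[^ ]*$)", x, perl = TRUE))

print(consuming)  # " string" " string"
print(lookahead)  # " " " "
```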

Or, a bit more universal since it matches any whitespace chars:

df %>%
  separate(b,c("partA","partB"),sep="\\s+(?=\\S*$)")

Output:

   a              partA  partB
1 AA              short string
2 BB this is the longer string
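
The difference between the two separators only shows up on input that contains tabs or repeated spaces. The following sketch uses base R's strsplit with perl = TRUE as a stand-in for separate; the input strings and the names space_only and any_ws are made up for illustration.

```r
# Illustrative input: one string with a tab before the last word,
# one with two spaces before the last word
x <- c("this is the longer\tstring", "short  string")

# Space-only separator: a tab counts as a "non-space" character
space_only <- strsplit(x, " (?=[^ ]*$)", perl = TRUE)

# Whitespace-aware separator: splits at the last run of any whitespace
any_ws <- strsplit(x, "\\s+(?=\\S*$)", perl = TRUE)

print(space_only)
print(any_ws)
```

With the space-only pattern, the first string is split before "longer" (the tab is not treated as a separator) and the second keeps a trailing space in its first part. The \\s+ variant splits both strings cleanly before "string".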


Source: https://stackoverflow.com/questions/32119963/r-split-string-at-last-whitespace-chars-using-tidyrseparate
