Truncating the end of a string in R after a character that can be present zero or more times

≯℡__Kan透↙ posted on 2019-11-30 09:17:50

You can solve this with a simple regex:

sub("(.*?):.*", "\\1", x)
 [1] "AIR BAGS"                  "SERVICE BRAKES HYDRAULIC"  "PARKING BRAKE"             "SEATS"                    
 [5] "POWER TRAIN"               "SUSPENSION"                "ENGINE AND ENGINE COOLING" "SERVICE BRAKES HYDRAULIC" 
 [9] "SUSPENSION"                "ENGINE AND ENGINE COOLING" "VISIBILITY"     

How the regex works:

  • "(.*?):.*" Match any run of characters with .*, made non-greedy (lazy) by the trailing ?, and capture it with the parentheses. This must be followed by a colon and then any remaining characters (.* again, this time greedy)
  • Substitute the entire match with the part captured inside the parentheses - "\\1"

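The bit to understand is that regex quantifiers are greedy by default. Because the captured group is non-greedy, it stops at the earliest point where the rest of the pattern can match, so it cannot run past the first colon: the first character after the parentheses is a colon. The part of the pattern after the colon is back to the default, i.e. greedy. If the string contains no colon at all, the pattern does not match and sub returns the string unchanged.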
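To make the greedy versus non-greedy difference concrete, here is a small sketch that runs both versions of the pattern on a two-colon value from the question data. The variable names s, greedy and lazy are made up for illustration only.

```r
# One value with two colons, and one with none
s <- c("SEATS:FRONT ASSEMBLY:POWER ADJUST", "SUSPENSION")

# Greedy group: (.*) grabs as much as it can, so it stops at the LAST colon
greedy <- sub("(.*):.*", "\\1", s)

# Non-greedy group: (.*?) grabs as little as it can, so it stops at the FIRST colon
lazy <- sub("(.*?):.*", "\\1", s)

print(greedy)  # "SEATS:FRONT ASSEMBLY" "SUSPENSION"
print(lazy)    # "SEATS"                "SUSPENSION"
```

The string without a colon comes back untouched in both cases, because neither pattern can match it.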

Another approach is to look for the first ":" and replace it and anything after it with nothing:

yy <- sub(":.*$", "", yy )

If no ":" is found then nothing is substituted and you get the whole of the original string. If there is a ":" then the first one is matched along with everything after it. That match is then replaced with nothing (""), which deletes it and leaves everything up to that first colon.
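As a quick check of the three cases just described (no colon, one colon, several colons), something like the following should do. The vector is a subset of the question data and the name trimmed is only illustrative.

```r
yy <- c("SUSPENSION",                         # no colon: left as is
        "SUSPENSION:FRONT",                   # one colon
        "SEATS:FRONT ASSEMBLY:POWER ADJUST")  # two colons: cut at the first

# Delete the first colon and everything after it
trimmed <- sub(":.*$", "", yy)

print(trimmed)  # "SUSPENSION" "SUSPENSION" "SEATS"
```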

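Does this work (assuming your data is in a character vector)? Note that str_split comes from the stringr package, so that has to be loaded first:

> library(stringr)
> x <- c('foobar','foo:bar','foo1:bar1 foo:bar','foo bar')
> sapply(str_split(x,":"),'[',1)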
[1] "foobar"  "foo"     "foo1"    "foo bar"
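If you would rather not depend on stringr, the same idea can be sketched in base R with strsplit. This is only one possible variant, not the answerer's code: fixed = TRUE and vapply are my own choices, and the names parts and first are made up.

```r
x <- c("foobar", "foo:bar", "foo1:bar1 foo:bar", "foo bar")

# fixed = TRUE treats ":" as a literal string rather than a regex
parts <- strsplit(x, ":", fixed = TRUE)

# vapply checks that every element yields exactly one string
first <- vapply(parts, function(p) p[1], character(1))
print(first)  # "foobar" "foo" "foo1" "foo bar"

# Caveat: an empty string splits into character(0), so its first piece is NA
empty_case <- vapply(strsplit("", ":", fixed = TRUE), function(p) p[1], character(1))
print(empty_case)  # NA
```

The NA for empty input is a difference from the sub-based answers, which return "" for an empty string.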

Sorry to add this as an answer; it is in response to the question about the time each approach takes:

> yy<-rep("foo1:bar1",times=100000)
> system.time(yy1<-sapply(strsplit(yy,":"),'[',1))
   user  system elapsed 
   0.26    0.00    0.27 
> 
> system.time(yy2<-sub("(.*?):.*", "\\1", yy))
   user  system elapsed 
    0.1     0.0     0.1 
> 
> system.time(yy3 <- sub(":.*$", "", yy ))
   user  system elapsed 
   0.08    0.00    0.07 
> 
> system.time(yy4<-gsub("([^:]*).*","\\1",yy))
   user  system elapsed 
   0.09    0.00    0.09 

The regex approaches are roughly equivalent; the strsplit approach takes a bit longer.
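Before comparing speed it is worth confirming that the four approaches agree on the result. A minimal sketch on a smaller version of the same test vector follows; the sizes and timings on your machine will differ from those shown above.

```r
yy <- rep("foo1:bar1", times = 1000)

yy1 <- sapply(strsplit(yy, ":"), "[", 1)  # split, keep first piece
yy2 <- sub("(.*?):.*", "\\1", yy)         # non-greedy capture
yy3 <- sub(":.*$", "", yy)                # delete from first colon on
yy4 <- gsub("([^:]*).*", "\\1", yy)       # capture leading non-colon run

# All four should produce exactly the same character vector
stopifnot(identical(yy1, yy2), identical(yy2, yy3), identical(yy3, yy4))
print(unique(yy1))  # "foo1"
```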

In this case, the code

yy<-c("AIR BAGS:FRONTAL",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"PARKING BRAKE:CONVENTIONAL",
"SEATS:FRONT ASSEMBLY:POWER ADJUST",
"POWER TRAIN:AUTOMATIC TRANSMISSION",
"SUSPENSION",
"ENGINE AND ENGINE COOLING:ENGINE",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"SUSPENSION:FRONT",
"ENGINE AND ENGINE COOLING:ENGINE",
"VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")
yy<-gsub("([^:]*).*","\\1",yy)
yy

may work for you.
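The pattern in the code above relies on a negated character class: [^:]* matches the longest run of characters that are not a colon, and the trailing .* swallows whatever is left. Here is a sketch of the same idea using sub with explicit anchors, which is my own variation rather than the answerer's exact call (one replacement per string is all that is needed). The name head_part is illustrative.

```r
yy <- c("SEATS:FRONT ASSEMBLY:POWER ADJUST",
        "SUSPENSION",
        "VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")

# ^([^:]*) captures everything before the first colon (or the whole string
# if there is no colon); .*$ matches the rest so it gets dropped
head_part <- sub("^([^:]*).*$", "\\1", yy)

print(head_part)  # "SEATS" "SUSPENSION" "VISIBILITY"
```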
