Question
I have the following data:
temp<-c("AIR BAGS:FRONTAL" ,"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"PARKING BRAKE:CONVENTIONAL",
"SEATS:FRONT ASSEMBLY:POWER ADJUST",
"POWER TRAIN:AUTOMATIC TRANSMISSION",
"SUSPENSION",
"ENGINE AND ENGINE COOLING:ENGINE",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"SUSPENSION:FRONT",
"ENGINE AND ENGINE COOLING:ENGINE",
"VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")
I would like to create a new vector that retains only the text before the first ":" where a ":" is present, and the whole string where no ":" is present.
I have tried to use:
temp=data.frame(matrix(unlist(str_split(temp,pattern=":",n=2)),
ncol=2, byrow=TRUE))
but it does not work in the cases where there is no ":", because those strings yield only one piece and the matrix no longer fills evenly.
I know this question is very similar to: truncate string from a certain character in R, which used:
sub("^[^.]*", "", x)
But I am not very familiar with regular expressions and have struggled to reverse that example to retain only the beginning of the string.
Answer 1:
You can solve this with a simple regex:
sub("(.*?):.*", "\\1", x)
[1] "AIR BAGS" "SERVICE BRAKES HYDRAULIC" "PARKING BRAKE" "SEATS"
[5] "POWER TRAIN" "SUSPENSION" "ENGINE AND ENGINE COOLING" "SERVICE BRAKES HYDRAULIC"
[9] "SUSPENSION" "ENGINE AND ENGINE COOLING" "VISIBILITY"
How the regex works:
"(.*?):.*"
Look for a repeated set of any characters, .*, but modify it with ? so that it is not greedy. This must be followed by a colon and then any characters (repeated).
Substitute the entire string with the bit found inside the parentheses: "\\1"
The bit to understand is that regex quantifiers are greedy by default. Because the first quantifier is made non-greedy, the first group cannot include a colon: the first character after the parentheses is a colon, so the group stops at the first one it meets. The .* after the colon is back to the default, i.e. greedy.
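To make the greedy versus non-greedy difference concrete, here is a minimal sketch that runs both variants on one of the strings from the question that contains two colons. The variable names s, lazy and greedy are made up for illustration.

```r
# A string from the question with two colons
s <- "SEATS:FRONT ASSEMBLY:POWER ADJUST"

# Non-greedy group: stops at the first colon
lazy <- sub("(.*?):.*", "\\1", s)

# Greedy group: runs up to the last colon
greedy <- sub("(.*):.*", "\\1", s)

print(lazy)    # "SEATS"
print(greedy)  # "SEATS:FRONT ASSEMBLY"
```

Only the non-greedy form gives the text before the first colon, which is what the question asks for.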
Answer 2:
Another approach is to look for the first ":" and replace it and anything after it with nothing:
yy <- sub(":.*$", "", yy )
If no ":" is found then nothing is substituted and you get the whole of the original string. If there is a ":" then the first one is matched along with everything after it; this is then replaced with nothing (""), which deletes it and leaves everything up to that first colon.
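As a quick check of the three cases described here, the following sketch applies the same substitution to a short vector with zero, one, and two colons. The three-element sample and the name result are assumptions chosen for illustration; the strings themselves come from the question.

```r
# Sample covering zero, one, and two colons
yy <- c("SUSPENSION", "SUSPENSION:FRONT", "SEATS:FRONT ASSEMBLY:POWER ADJUST")

# Delete the first colon and everything after it;
# strings without a colon are returned unchanged
result <- sub(":.*$", "", yy)

print(result)  # "SUSPENSION" "SUSPENSION" "SEATS"
```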
Answer 3:
Does this work (assuming your data is in a character vector)?
library(stringr)
x <- c('foobar','foo:bar','foo1:bar1 foo:bar','foo bar')
sapply(str_split(x,":"),'[',1)
[1] "foobar" "foo" "foo1" "foo bar"
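The same split-and-take-first idea can also be sketched in base R, for readers who would rather not load stringr. This is an alternative I am swapping in, not the answerer's code: it uses strsplit with fixed = TRUE so the colon is treated literally, and vapply so the result is guaranteed to be a character vector. The name first_piece is hypothetical.

```r
x <- c("foobar", "foo:bar", "foo1:bar1 foo:bar", "foo bar")

# Split each string on a literal colon, then keep the first piece of each split
pieces <- strsplit(x, ":", fixed = TRUE)
first_piece <- vapply(pieces, `[`, character(1), 1)

print(first_piece)  # "foobar" "foo" "foo1" "foo bar"
```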
Answer 4:
Sorry to add this as an answer. In response to the question of how long each approach takes:
> yy<-rep("foo1:bar1",times=100000)
> system.time(yy1<-sapply(strsplit(yy,":"),'[',1))
user system elapsed
0.26 0.00 0.27
>
> system.time(yy2<-sub("(.*?):.*", "\\1", yy))
user system elapsed
0.1 0.0 0.1
>
> system.time(yy3 <- sub(":.*$", "", yy ))
user system elapsed
0.08 0.00 0.07
>
> system.time(yy4<-gsub("([^:]*).*","\\1",yy))
user system elapsed
0.09 0.00 0.09
The regex approaches are roughly equivalent; strsplit takes a bit longer.
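Timings are only comparable if the four expressions return the same thing. A small sketch that checks this on a hypothetical three-element vector (no colon, one colon, two colons), reusing the names yy1 to yy4 from the timing run:

```r
# Hypothetical test vector: one colon, no colon, two colons
yy <- c("foo1:bar1", "foobar", "a:b:c")

yy1 <- sapply(strsplit(yy, ":"), `[`, 1)   # split, keep first piece
yy2 <- sub("(.*?):.*", "\\1", yy)          # non-greedy capture
yy3 <- sub(":.*$", "", yy)                 # delete from first colon onward
yy4 <- gsub("([^:]*).*", "\\1", yy)        # capture leading non-colon run

# TRUE if all four approaches agree element by element
all_agree <- all(yy1 == yy2, yy2 == yy3, yy3 == yy4)
print(yy3)        # "foo1" "foobar" "a"
print(all_agree)  # TRUE
```

Absolute timings from system.time will vary by machine and R version, so the figures reported in this answer are best read as relative.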
Answer 5:
In this case:
yy<-c("AIR BAGS:FRONTAL",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"PARKING BRAKE:CONVENTIONAL",
"SEATS:FRONT ASSEMBLY:POWER ADJUST",
"POWER TRAIN:AUTOMATIC TRANSMISSION",
"SUSPENSION",
"ENGINE AND ENGINE COOLING:ENGINE",
"SERVICE BRAKES HYDRAULIC:ANTILOCK",
"SUSPENSION:FRONT",
"ENGINE AND ENGINE COOLING:ENGINE",
"VISIBILITY:WINDSHIELD WIPER/WASHER:LINKAGES")
yy<-gsub("([^:]*).*","\\1",yy)
yy
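This may work for you: the group ([^:]*) captures the leading run of characters that are not a colon, and the rest of the string is dropped.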
Source: https://stackoverflow.com/questions/10883605/truncating-the-end-of-a-string-in-r-after-a-character-that-can-be-present-zero-o