Question
I have a text file with over 10,000 lines. Each line has a word that starts with CDID_ followed by 10 more characters with no spaces, as below:
a <- c("Test CDID_1254WE_1023 Sky","CDID_1254XE01478 Blue","This File named as CDID_ZXASWE_1111")
I would like to extract only the words that start with CDID_, so that the lines above look like this:
CDID_1254WE_1023
CDID_1254XE01478
CDID_ZXASWE_1111
Answer 1:
Here are three base R options.
Option 1: Use sub(), removing everything except the CDID_* section:
sub(".*(CDID_\\S+).*", "\\1", a)
# [1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"
Option 2: Use regmatches() with regexpr(), extracting the CDID_* section:
regmatches(a, regexpr("CDID_\\S+", a))
# [1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"
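Options 1 and 2 give the same result on the sample data, but they differ when a line has no CDID_ word. The sketch below illustrates this with a made-up vector b (not part of the question): sub() returns a non-matching string unchanged, while regmatches() drops it. The last step is one possible way to keep one result per input line, with NA where nothing matched.

```r
# Hypothetical input: the second element has no CDID_ word.
b <- c("Test CDID_1254WE_1023 Sky", "no identifier here")

# sub() leaves non-matching strings unchanged.
from_sub <- sub(".*(CDID_\\S+).*", "\\1", b)

# regmatches() with regexpr() silently drops non-matching elements.
from_regmatches <- regmatches(b, regexpr("CDID_\\S+", b))

# Keep one result per input line, with NA where nothing matched.
m <- regexpr("CDID_\\S+", b)          # -1 marks "no match"
aligned <- rep(NA_character_, length(b))
aligned[m > 0] <- regmatches(b, m)
```

Which behaviour is preferable depends on whether the result has to stay aligned with the original lines.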
Option 3: For a data frame result, we can use the strcapture() function (added in R 3.4.0) and do all the work in a single call:
strcapture(".*(CDID_\\S+).*", a, data.frame(out = character()))
#                out
# 1 CDID_1254WE_1023
# 2 CDID_1254XE01478
# 3 CDID_ZXASWE_1111
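Since the question starts from a text file rather than a vector, here is a minimal sketch of the full round trip using Option 2. The file paths are hypothetical: a temporary file stands in for the real 10,000-line file, and the output location is likewise an assumption.

```r
# Hypothetical file standing in for the real text file.
path <- tempfile(fileext = ".txt")
writeLines(c("Test CDID_1254WE_1023 Sky",
             "CDID_1254XE01478 Blue",
             "This File named as CDID_ZXASWE_1111"), path)

# readLines() returns a character vector with one element per line.
lines <- readLines(path)

# Extract the first CDID_ word from each line that has one.
ids <- regmatches(lines, regexpr("CDID_\\S+", lines))

# Write the extracted words back out, one per line (hypothetical output path).
out_path <- tempfile(fileext = ".txt")
writeLines(ids, out_path)
```

For the real data, path would be replaced with the location of the actual file.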
Answer 2:
All the other solutions are great. Here is one more, using functions from the stringr package. We first split each string on spaces with str_split, convert the resulting list to a vector with unlist, and then use str_subset to keep the strings that begin with CDID_.
library(stringr)
str_subset(unlist(str_split(a, pattern = " ")), "^CDID_")
[1] "CDID_1254WE_1023" "CDID_1254XE01478" "CDID_ZXASWE_1111"
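For readers without stringr installed, the same split-then-filter idea can be sketched in base R. This is an equivalent under the assumption that words are separated by single spaces, not the answer's own code: strsplit() stands in for str_split() and grep(value = TRUE) for str_subset().

```r
a <- c("Test CDID_1254WE_1023 Sky", "CDID_1254XE01478 Blue",
       "This File named as CDID_ZXASWE_1111")

# Split every line on single spaces and flatten into one vector of words.
tokens <- unlist(strsplit(a, " ", fixed = TRUE))

# Keep only the words that begin with CDID_.
cdid <- grep("^CDID_", tokens, value = TRUE)
```

Because the words are flattened into one vector, the result is no longer aligned with the input lines if a line contains zero or several CDID_ words.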
Answer 3:
I'd use a lookbehind with the stringi package:
a <- c("Test CDID_1254WE_1023 Sky","CDID_1254XE01478 Blue","This File named as CDID_ZXASWE_1111")
library(stringi)
stringi::stri_extract_all_regex(a, '(?<=(^|\\s))(CDID_[^ ]+)')
(?<=(^|\\s)) = preceded by the beginning of the line or a whitespace character; CDID_ = the literal prefix; [^ ]+ = one or more following characters that are not spaces.
[[1]]
[1] "CDID_1254WE_1023"
[[2]]
[1] "CDID_1254XE01478"
[[3]]
[1] "CDID_ZXASWE_1111"
The result is a list with one element per input string, so you may want to wrap the call in unlist() to get a plain character vector.
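A comparable lookbehind can be sketched in base R with gregexpr() and perl = TRUE, followed by unlist() to flatten the list. Note the swapped-in pattern: (?<!\\S) is a negative lookbehind meaning "not preceded by a non-space character", which is assumed here as a stand-in for "preceded by the start of the string or whitespace". It is not the pattern used in the answer.

```r
a <- c("Test CDID_1254WE_1023 Sky", "CDID_1254XE01478 Blue",
       "This File named as CDID_ZXASWE_1111")

# gregexpr() finds every match in each string; perl = TRUE enables lookbehind.
# (?<!\\S) rejects matches that are glued to a preceding non-space character.
hits <- regmatches(a, gregexpr("(?<!\\S)CDID_\\S+", a, perl = TRUE))

# hits is a list with one element per input string; flatten it to a vector.
ids <- unlist(hits)
```

With this pattern a string such as "XCDID_1" yields no match, because CDID_ is not at the start of a word.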
Source: https://stackoverflow.com/questions/45991860/extract-specific-words-from-a-text-file