Prevent grep in R from treating “.” as a letter

Posted by 孤者浪人 on 2019-12-22 09:11:53

Question


I have a character vector that contains text similar to the following:

text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

I would like to remove everything up to and including the last .:

> "xYz" "ge" "qrstu"

However, the grep function seems to be treating . as a letter:

pattern <- "([A-Z]|[a-z])+$"

grep(pattern, text, value = T)

> "ABc.def.xYz" "ge"          "lmo.qrstu" 

The pattern works elsewhere, such as on regexpal.

How can I get grep to behave as expected?


Answer 1:


grep is for finding a pattern. It returns the indices of the vector elements that match the pattern. If value=TRUE is specified, it returns the matching elements themselves. From the description, it seems that you want to remove a substring rather than return a subset of the initial vector.

If you need to remove the substring, you can use sub:

 sub('.*\\.', '', text)
 #[1] "xYz"   "ge"    "qrstu"

As the first argument, we pass the pattern '.*\\.'. It matches zero or more characters (.*) followed by a literal dot (\\.). The \\ is needed to escape the . so that it is treated as a literal dot instead of "any character". Because .* is greedy, the match extends to the last . in the string. We replace the matched text with '' (the replacement argument) and thereby remove the substring.
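To make the greedy behaviour concrete, here is a small sketch that contrasts the greedy pattern from this answer with a lazy variant. The lazy pattern and the variable names greedy and lazy are illustrative additions, not part of the original answer.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# Greedy: .* runs as far as it can, so the match ends at the LAST dot.
greedy <- sub(".*\\.", "", text)

# Lazy: .*? stops as early as it can, so the match ends at the FIRST dot.
# perl = TRUE selects the PCRE engine for the lazy quantifier.
lazy <- sub(".*?\\.", "", text, perl = TRUE)

print(greedy)  # "xYz"     "ge" "qrstu"
print(lazy)    # "def.xYz" "ge" "qrstu"
```

Elements without a dot, such as "ge", are returned unchanged because sub only rewrites elements in which the pattern matches.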




Answer 2:


grep doesn't do any replacements. It searches for matches and returns the indices (or the values if you specify value=T) of the elements that match. The results you're getting just say that those elements meet your criteria somewhere in the string. If you added something that doesn't meet the criteria anywhere to your text vector (for example: "9", "#$%23", ...) then grep wouldn't return those elements.
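A minimal sketch of that point, using the question's pattern and the two non-matching strings mentioned above. The names text2, idx, vals and hits are chosen here for illustration only.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")
pattern <- "([A-Z]|[a-z])+$"

# Append two elements that do not end in a letter.
text2 <- c(text, "9", "#$%23")

idx  <- grep(pattern, text2)                # indices of matching elements
vals <- grep(pattern, text2, value = TRUE)  # whole matching elements, unmodified
hits <- grepl(pattern, text2)               # one TRUE/FALSE per element

print(idx)   # 1 2 3
print(vals)  # "ABc.def.xYz" "ge" "lmo.qrstu"
print(hits)  # TRUE TRUE TRUE FALSE FALSE
```

In every case grep and grepl only select or flag elements; none of them trims the strings.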

If you want just the matched portion returned, you should look at the regmatches function. However, for your purposes it seems like sub or gsub should do what you want.
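For reference, a sketch of the regmatches route in base R. It assumes a simplified form of the question's pattern, [A-Za-z]+$, and the names m and matched are illustrative.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# regexpr() locates the first match in each element;
# regmatches() then extracts the matched text.
m <- regexpr("[A-Za-z]+$", text)
matched <- regmatches(text, m)

print(matched)  # "xYz" "ge" "qrstu"
```

Note that regmatches drops elements that have no match, so the result can be shorter than the input vector.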

gsub(".*\\.", "", text)

I would suggest reading the help page for regular expressions, ?regex. The Wikipedia page is a decent read as well, but note that R's regexes are a little different from some others. https://en.wikipedia.org/wiki/Regular_expression




Answer 3:


You may try the str_extract function from the stringr package.

str_extract(text, "[^.]*$")

This matches the run of non-dot characters at the end of the string.
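If stringr is not available, the same negated-character-class pattern can be tried in base R. This is a sketch of an equivalent, not the answer's own code, and the name tail_part is hypothetical.

```r
text <- c("ABc.def.xYz", "ge", "lmo.qrstu")

# [^.] is "any character except a dot"; anchoring with $ keeps only
# the final dot-free run of each string.
tail_part <- regmatches(text, regexpr("[^.]*$", text))

print(tail_part)  # "xYz" "ge" "qrstu"
```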




Answer 4:


Your pattern does work; the problem is that grep does something different from what you think it does.

Let's first use your pattern with str_extract_all from the package stringr.

library(stringr)
str_extract_all(text, pattern ="([A-Z]|[a-z])+$")
[[1]]
[1] "xYz"

[[2]]
[1] "ge"

[[3]]
[1] "qrstu"

Notice that the results came out as you expected!

The problem you are having is that grep gives you the complete element that matches your regular expression, not only the matching part of the element. For instance, in the example below, grep returns the whole first element because it contains "a":

grep(pattern = "a", x = c("abcdef", "bcdf"), value = TRUE)
[1] "abcdef"


Source: https://stackoverflow.com/questions/31747578/prevent-grep-in-r-from-treating-as-a-letter
