Regex gsub R differentiate between ellipsis and periods

Posted by 扶醉桌前 on 2019-12-25 03:35:34

Question


text="stack overflow... is a popular website."

I want to separate punctuation marks from words. The output should be:

"stack overflow ... is a popular website . "

Of course, the command gsub("\\.", " \\. ", text, fixed = FALSE) splits every period individually and returns something like:

"stack overflow . . . is a popular website . "

That happens because it does not differentiate between a period and an ellipsis (suspension points). In short, when three periods appear together in the text, R should treat them as a single punctuation mark.


Answer 1:


I think a non-lookaround approach will be more efficient and readable:

text="stack overflow... is a popular website."
gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)
## => [1] "stack overflow ... is a popular website . "


I updated the post since the space is required before and after the punctuation.

The [[:space:]]* on either side of (\\.+) matches zero or more whitespace characters, and (\\.+) matches one or more periods. The parentheses form a capturing group whose value is stored in numbered buffer #1, which the replacement string can reference with the \1 backreference (written "\\1" in an R string literal). So \1 is replaced with the run of periods that the pattern captured. Capturing should also be cheaper than lookarounds here, since the engine has no extra checks of the text before or after the current position to perform.
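
To make the capture-and-backreference idea concrete, here is a minimal sketch. The helper name space_periods is hypothetical and only wraps the pattern from this answer; the sample strings are made up for illustration.

```r
# Sketch only: `space_periods` is a hypothetical helper, not part of the answer.
space_periods <- function(x) {
  # [[:space:]]* absorbs any whitespace on either side of the run of periods;
  # (\\.+) captures the run, and \\1 writes it back between single spaces.
  gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", x)
}

space_periods("wait...what")        # "wait ... what"
space_periods("wait   ...  what")   # "wait ... what" (extra spaces collapsed)

# The same backreference works in sub(): wrap the first run of periods.
sub("(\\.+)", "<\\1>", "a..b")      # "a<..>b"
```

Because the surrounding whitespace is consumed by the match and then replaced by exactly one space on each side, the spacing comes out the same whether or not the input already had spaces around the periods.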

Now, if you need to handle all punctuation, use [[:punct:]]:

gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)

See R regex help:

[:punct:]
Punctuation characters:
! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.

Code demo:

text="Hi!stack overflow... is a popular website, I visit it every day."
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
## => [1] "Hi ! stack overflow ... is a popular website , I visit it every day . "
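
A short sketch of what the broader [[:punct:]] class implies in practice. The helper name split_punct and the sample inputs are assumptions added for illustration; note in particular that apostrophes and decimal points are punctuation too, so they get split as well.

```r
# Sketch only: `split_punct` is a hypothetical wrapper around the pattern above.
split_punct <- function(x) {
  gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", x)
}

split_punct("Wait... really?!")   # "Wait ... really ?! "
split_punct("don't")              # "don ' t"  (apostrophe is punctuation)
split_punct("3.14")               # "3 . 14"   (so is the decimal point)

# If the trailing space is not wanted, base R's trimws() removes it.
trimws(split_punct("Wait... really?!"))   # "Wait ... really ?!"
```

Whether splitting contractions and numbers is acceptable depends on the text being processed; a narrower character class may be a better fit when it is not.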

UPDATE FOR HYPHENATED WORDS

To avoid splitting hyphenated words, you can match and then skip any - that sits between two word boundaries, using the PCRE verbs (*SKIP)(*F), which require perl=TRUE:

text="Hi!stack-overflow... is a popular website, I visit it every day."
gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl=TRUE)
## => [1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
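
The skip-and-fail idiom can be checked on a smaller input. This is a sketch under assumptions: split_keep_hyphens is a hypothetical helper name, and the test strings are invented. The first alternative matches a hyphen between two word characters and then discards that match, so only the second alternative ever produces a replacement.

```r
# Sketch only: `split_keep_hyphens` is a hypothetical wrapper around the pattern above.
split_keep_hyphens <- function(x) {
  # \\b-\\b(*SKIP)(*F): match an intra-word hyphen, then force a failure and
  # resume scanning after it, so that hyphen is never replaced.
  # \\s*(\\p{P}+)\\s*: any other run of punctuation gets spaced out.
  gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", x, perl = TRUE)
}

# The hyphen inside "well-known" is kept; the free-standing one is spaced.
split_keep_hyphens("well-known - mostly.")   # "well-known - mostly . "

# \\p{P} is the Unicode punctuation category; "$" is a currency symbol
# (category Sc), so it should be left attached, unlike with [[:punct:]].
split_keep_hyphens("cost: $5")
```

This is one difference worth keeping in mind when switching from [[:punct:]] to \p{P}: characters such as $, +, < and = are classified as symbols rather than punctuation.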





Answer 2:


After this long comment thread, the following regex is the most likely to fit your needs:

(?:\b| )([.,:;!]+)(?: |\b)


To use it in R, the backslashes have to be doubled.

So we end up with:

text<-c('Hi!stack-overflow... is a popular website, I visit it every day.',
    'aaa...',
    'AAA...B"B"B',
    'AA .BBB #unlikely to happen but managed anyway')

> gsub('(?:\\b| )([.,:;!]+)(?: |\\b)',' \\1 ',text)
[1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
[2] "aaa ... "                                                              
[3] "AAA ... B\"B\"B"                                                       
[4] "AA . BBB #unlikely to happen but managed anyway"     



Answer 3:


Try lookarounds (these need perl=TRUE):

gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", text, perl=TRUE)
#[1] "stack overflow ... is a popular website . "

gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", "aaa...", perl=TRUE)
#[1] "aaa ... "

gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", "aaa...bbb", perl=TRUE)
#[1] "aaa ... bbb"
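
The last pattern covers the other two cases as well, so it can be wrapped once and applied to a whole character vector. A minimal sketch, assuming a hypothetical helper name space_periods_la; the lookarounds match empty positions, so gsub() inserts a space without consuming any text.

```r
# Sketch only: `space_periods_la` is a hypothetical wrapper around the last pattern.
space_periods_la <- function(x) {
  # (?<=\\.)(?=$|\\w): the position right after a period that is followed by
  #                    the end of the string or a word character.
  # (?<=\\w)(?=\\.)  : the position between a word character and a period.
  gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", x, perl = TRUE)
}

space_periods_la(c("aaa...", "aaa...bbb",
                   "stack overflow... is a popular website."))
# [1] "aaa ... "
# [2] "aaa ... bbb"
# [3] "stack overflow ... is a popular website . "
```

Since nothing is consumed, existing whitespace is left untouched; unlike the capturing approach, this version does not collapse extra spaces that are already around the periods.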


Source: https://stackoverflow.com/questions/34762973/regex-gsub-r-differentiate-between-ellipsis-and-periods
