Question
text="stack overflow... is a popular website."
I want to separate punctuation marks from words. The output should be:
"stack overflow ... is a popular website . "
Of course, the command gsub("\\.", " \\. ", text, fixed = FALSE)
returns:
"stack overflow . . . is a popular website . "
because it does not differentiate between a single period and an ellipsis (suspension points). In short, when three periods appear together in the text, R should treat them as a single punctuation mark.
Answer 1:
I think a non-lookaround approach will be more efficient and readable:
text="stack overflow... is a popular website."
gsub("[[:space:]]*(\\.+)[[:space:]]*", " \\1 ", text)
## => [1] "stack overflow ... is a popular website . "
I updated the post since the space is required before and after the punctuation.
The [[:space:]]* on either side of (\\.+) matches zero or more whitespace characters, and (\\.+) matches one or more periods. The (...) forms a capturing group whose value is stored in numbered buffer #1, which the replacement pattern can access with the \1 backreference (written \\1 inside an R string). So \1 is replaced with the periods captured by the pattern. Capturing is generally more efficient than using lookarounds, since there is no overhead of checking the text before or after the current position.
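To make the capture and backreference concrete, here is a small sketch. The angle brackets in the replacement are only an illustration (they are not part of the answer) so that the captured text is easy to see:

```r
text <- "stack overflow... is a popular website."

# Every run of one or more periods; these are the substrings
# that the capturing group (\\.+) would hold on each match.
runs <- regmatches(text, gregexpr("\\.+", text))[[1]]
print(runs)

# Reuse whatever group 1 captured via \\1 in the replacement.
# The < > markers are illustrative only.
marked <- gsub("(\\.+)", "<\\1>", text)
print(marked)
```

The ellipsis comes back as one three-character match and the final period as a separate one-character match, which is why a single replacement handles both.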
Now, if you need to handle all punctuation, use [[:punct:]]:
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
See the R regex help (?regex):
[:punct:]  Punctuation characters: ! " # $ % & ' ( ) * + , - . / : ; < = > ? @ [ \ ] ^ _ ` { | } ~.
Code demo:
text="Hi!stack overflow... is a popular website, I visit it every day."
gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)
## => [1] "Hi ! stack overflow ... is a popular website , I visit it every day . "
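If the goal of the spacing is to get word and punctuation tokens, the result can be trimmed and split on whitespace. This is only a sketch of one possible follow-up step, assuming tokenization is what you are after; the variable names are made up for the example:

```r
text <- "Hi!stack overflow... is a popular website, I visit it every day."

# Same replacement as above: surround each run of punctuation with spaces
spaced <- gsub("[[:space:]]*([[:punct:]]+)[[:space:]]*", " \\1 ", text)

# Drop the trailing space left by the last replacement, then split on whitespace
tokens <- strsplit(trimws(spaced), "[[:space:]]+")[[1]]
print(tokens)
```

Because the pattern uses [[:punct:]]+, the ellipsis ends up as a single "..." token rather than three separate periods.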
UPDATE FOR HYPHENATED WORDS
To avoid matching hyphenated words, you can match and skip any - that is surrounded by word boundaries:
text="Hi!stack-overflow... is a popular website, I visit it every day."
gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl=TRUE)
## => [1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
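To see what the \\b-\\b(*SKIP)(*F) branch contributes, it may help to run the same replacement with and without it. This is a comparison sketch; the variable names are illustrative:

```r
text <- "Hi!stack-overflow... is a popular website, I visit it every day."

# Without the skip branch: the hyphen is punctuation like any other
# and gets surrounded by spaces.
without_skip <- gsub("\\s*(\\p{P}+)\\s*", " \\1 ", text, perl = TRUE)
print(without_skip)

# With the skip branch: a hyphen between word characters is matched,
# then discarded by (*SKIP)(*F), so it is never replaced.
with_skip <- gsub("\\b-\\b(*SKIP)(*F)|\\s*(\\p{P}+)\\s*", " \\1 ", text, perl = TRUE)
print(with_skip)
```

The (*SKIP)(*F) verbs are PCRE features, so perl = TRUE is required for the second call.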
Answer 2:
After this long comment thread, this regex is the most likely to fit your needs:
(?:\b| )([.,:;!]+)(?: |\b)
To use it in R, the backslashes have to be doubled.
So we end up with:
text<-c('Hi!stack-overflow... is a popular website, I visit it every day.',
'aaa...',
'AAA...B"B"B',
'AA .BBB #unlikely to happen but managed anyway')
> gsub('(?:\\b| )([.,:;!]+)(?: |\\b)',' \\1 ',text)
[1] "Hi ! stack-overflow ... is a popular website , I visit it every day . "
[2] "aaa ... "
[3] "AAA ... B\"B\"B"
[4] "AA . BBB #unlikely to happen but managed anyway"
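As a side note on the backslash doubling: assuming R 4.0.0 or later, a raw string literal lets you write the regex exactly as shown above, without doubling. A minimal sketch (the variable names are illustrative):

```r
# Classic string literal: every regex backslash is doubled
escaped_pat <- '(?:\\b| )([.,:;!]+)(?: |\\b)'

# Raw string literal (assumes R >= 4.0.0): the regex is written as-is
# between r"( and )"
raw_pat <- r"((?:\b| )([.,:;!]+)(?: |\b))"

# Both literals produce the same pattern string
print(identical(escaped_pat, raw_pat))
```

Either variable can then be passed to gsub() in place of the quoted pattern.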
Answer 3:
Try
gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", text, perl=TRUE)
#[1] "stack overflow ... is a popular website . "
gsub("(?<=\\.)$|(?<=\\w)(?=\\.)", " ", "aaa...", perl=TRUE)
#[1] "aaa ... "
gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", "aaa...bbb", perl=TRUE)
#[1] "aaa ... bbb"
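Since gsub() is vectorized, the last (most general) pattern can be wrapped in a small helper and applied to all three inputs at once. The function name space_periods is made up for this sketch:

```r
# Hypothetical helper: inserts a space at the zero-width positions
# matched by the third pattern above.
space_periods <- function(x) {
  gsub("(?<=\\.)(?=$|\\w)|(?<=\\w)(?=\\.)", " ", x, perl = TRUE)
}

inputs <- c("stack overflow... is a popular website.", "aaa...", "aaa...bbb")
result <- space_periods(inputs)
print(result)
```

Note that this variant only handles periods; other punctuation marks are left untouched.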
Source: https://stackoverflow.com/questions/34762973/regex-gsub-r-differentiate-between-ellipsis-and-periods