Extract text in parentheses in R

Posted by 时光毁灭记忆、已成空白 on 2020-01-02 03:41:09

Question


Two related questions. I have vectors of text data such as

"a(b)jk(p)"  "ipq"  "e(ijkl)"

and want to easily separate it into a vector containing the text OUTSIDE the parentheses:

"ajk"  "ipq"  "e"

and a vector containing the text INSIDE the parentheses:

"bp"   ""  "ijkl"

Is there any easy way to do this? An added difficulty is that these can get quite large and have a large (unlimited) number of parentheses. Thus, I can't simply grab text "pre/post" the parentheses and need a smarter solution.


Answer 1:


Text outside the parentheses

> x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
> gsub("\\([^()]*\\)", "", x)
[1] "ajk" "ipq" "e"  
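Because [^()]* excludes parentheses, one pass removes any number of separate groups but only the innermost level of nested ones. The question does not say whether nesting occurs; if it can, one possible sketch is to repeat the substitution until the result stops changing. The name strip_parens is a hypothetical helper, not a base R function:

```r
# Hypothetical helper: repeat the substitution until no "(...)" group is left.
strip_parens <- function(x) {
  repeat {
    y <- gsub("\\([^()]*\\)", "", x)  # remove innermost groups only
    if (identical(y, x)) return(y)    # nothing changed, so we are done
    x <- y
  }
}

strip_parens(c("a(b)jk(p)", "ipq", "e(ijkl)"))
# [1] "ajk" "ipq" "e"
strip_parens("a(b(c)d)e")
# [1] "ae"
```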

Text inside the parentheses

> x <- c("a(b)jk(p)", "ipq", "e(ijkl)")
> gsub("(?<=\\()[^()]*(?=\\))(*SKIP)(*F)|.", "", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

The (?<=\\()[^()]*(?=\\)) part matches the characters inside a pair of parentheses, and the (*SKIP)(*F) that follows forces that match to fail while preventing the regex engine from retrying at any position inside the text it just matched. The engine then tries the alternative after the | symbol, the dot ., against the rest of the string, so . matches every character that was not skipped. Replacing all of those matched characters with an empty string leaves only the text inside the parentheses.
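To see what the lookaround part matches on its own, a minimal sketch with base R's gregexpr and regmatches can list the matches per string and then paste them together. This is an alternative route to the same result, not part of the answer above; the variable name inside_parts is only illustrative:

```r
x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

# Text preceded by "(" and followed by ")", without consuming either parenthesis.
inside_parts <- regmatches(x, gregexpr("(?<=\\()[^()]*(?=\\))", x, perl = TRUE))
# inside_parts is a list: c("b", "p"), character(0), "ijkl"

# Collapse each element; an element with no matches collapses to "".
vapply(inside_parts, paste, character(1), collapse = "")
# [1] "bp"   ""     "ijkl"
```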

> gsub("\\(([^()]*)\\)|.", "\\1", x, perl=TRUE)
[1] "bp"   ""     "ijkl"

This regex captures the characters inside each pair of parentheses in group 1, and the alternative |. matches every other character one at a time. Replacing each match with \\1 keeps the captured text when the first alternative matched and substitutes an empty string otherwise (group 1 is empty when . matched), which gives the desired output.
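The two substitutions from this answer can be bundled into a single function that returns both vectors at once. The following is a minimal sketch; the name split_parens and the list element names are hypothetical, and it assumes the parentheses are not nested:

```r
# Hypothetical helper combining the two gsub() calls shown in this answer.
split_parens <- function(x) {
  list(
    outside = gsub("\\([^()]*\\)", "", x),                      # drop every "(...)" group
    inside  = gsub("\\(([^()]*)\\)|.", "\\1", x, perl = TRUE)   # keep only group 1
  )
}

res <- split_parens(c("a(b)jk(p)", "ipq", "e(ijkl)"))
res$outside
# [1] "ajk" "ipq" "e"
res$inside
# [1] "bp"   ""     "ijkl"
```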




Answer 2:


The rm_round function in the qdapRegex package, which I maintain, was born to do this.

First we'll get and load the package via pacman:

if (!require("pacman")) install.packages("pacman")
pacman::p_load(qdapRegex)

## Then we can use it to remove and extract the parts you want:

x <- c("a(b)jk(p)", "ipq", "e(ijkl)")

rm_round(x)

## [1] "ajk" "ipq" "e" 

rm_round(x, extract=TRUE)

## [[1]]
## [1] "b" "p"
## 
## [[2]]
## [1] NA
## 
## [[3]]
## [1] "ijkl"

To condense b and p use:

sapply(rm_round(x, extract=TRUE), paste, collapse="")

## [1] "bp"   "NA"   "ijkl"
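The second element comes out as the string "NA" rather than the empty string the question asks for, because paste converts the missing value to text. One way around this, sketched here on a hand-written list that mirrors the extract=TRUE output shown above (so the snippet does not need qdapRegex installed), is to drop missing values before pasting. The variable names extracted and collapsed are illustrative only:

```r
# Assumed shape of rm_round(x, extract = TRUE), copied from the output above.
extracted <- list(c("b", "p"), NA_character_, "ijkl")

# Drop NA entries first, so an element with no matches collapses to "".
collapsed <- vapply(
  extracted,
  function(parts) paste(parts[!is.na(parts)], collapse = ""),
  character(1)
)
collapsed
# [1] "bp"   ""     "ijkl"
```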


Source: https://stackoverflow.com/questions/28955367/extract-text-in-parentheses-in-r
