R: split column

Posted by ぃ、小莉子 on 2021-02-10 15:51:52

Question


I have a column (geneDesc) in a data frame (bacteria) that I want to split into two columns. The column contains the gene ID and the species name of the organism the gene comes from in brackets.

For example:

geneDesc
hypothetical protein, partial [Vibrio shilonii]
ankyrin repeat protein [Leptospira kirschneri]
helicase [Alteromonas macleodii]

I'm using the following command:

bacteria2 <- separate(bacteria, geneDesc, c("gene", "species"), sep = "\\[")

But I get this error:

Error: Values not split into 2 pieces at 341, 342, 448, 450, etc...

Is there a way to run the command anyway and just create another column where there is another "["? Everything after the first bracket is of no interest.


Answer 1:


You almost have it, but your sep regular expression needs to be adjusted to match either a [ or a ], and extra = "drop" discards any pieces beyond the first two:

library(tidyr)
bacteria %>% separate(geneDesc, c("gene", "species"), sep = "[\\[\\]]", extra = "drop")

Output:

                            gene               species
1 hypothetical protein, partial        Vibrio shilonii
2        ankyrin repeat protein  Leptospira kirschneri
3                      helicase  Alteromonas macleodii
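The rows that triggered the error are not shown in the question, so the following is only a sketch of how the call above might behave on them. The second and third rows are invented for illustration: one has a second bracketed group and one has no bracket at all. The `fill = "right"` argument and the `trimws()` step are additions that the answer does not use.

```r
library(tidyr)

# Hypothetical sample data: rows 2 and 3 are invented to show the edge cases
bacteria <- data.frame(
  geneDesc = c(
    "hypothetical protein, partial [Vibrio shilonii]",
    "transposase [Vibrio sp.] [plasmid]",  # second bracketed group
    "unknown protein"                      # no bracket at all
  ),
  stringsAsFactors = FALSE
)

bacteria2 <- separate(
  bacteria, geneDesc, c("gene", "species"),
  sep = "[\\[\\]]",  # split on either "[" or "]"
  extra = "drop",    # discard pieces beyond the second
  fill = "right"     # rows with too few pieces get NA on the right
)

# Splitting on "[" leaves the preceding space at the end of gene
bacteria2$gene <- trimws(bacteria2$gene)

bacteria2
```

With this data the species column should hold only the first bracketed name, and the row without a bracket should get NA instead of raising a warning.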



Answer 2:


separate(..., extra = "drop")

or

separate(..., extra = "merge")

Another option is to strip everything from the first bracket onward with stringr:

library(stringr)
library(dplyr)
bacteria %>%
  mutate(gene = geneDesc %>% str_replace_all(" *\\[.*$", "") )
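To make the difference between the two `extra` settings mentioned above concrete, here is a small sketch that uses the question's original `sep = "\\["`. The input row is hypothetical, invented to contain a second "[".

```r
library(tidyr)

# Hypothetical row with two bracketed groups
two_brackets <- data.frame(
  geneDesc = "transposase [Vibrio sp.] [plasmid]",
  stringsAsFactors = FALSE
)

# "drop": split at every "[" and keep only the first two pieces
dropped <- separate(two_brackets, geneDesc, c("gene", "species"),
                    sep = "\\[", extra = "drop")

# "merge": split at most once, so the remainder stays in the last column
merged <- separate(two_brackets, geneDesc, c("gene", "species"),
                   sep = "\\[", extra = "merge")

dropped$species
merged$species
```

Because the separator here is only "[", the closing "]" stays in the species column in both cases. Since everything after the first bracket is of no interest in the question, either setting gives the same gene column.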



Answer 3:


If you only want to remove everything from the first bracket onward, I suggest gsub:

> df <- read.table(text='hypothetical protein, partial [Vibrio shilonii]
+ ankyrin repeat protein [Leptospira kirschneri]
+ helicase [Alteromonas macleodii]', sep='\n')

> df
                                               V1
1 hypothetical protein, partial [Vibrio shilonii]
2  ankyrin repeat protein [Leptospira kirschneri]
3                helicase [Alteromonas macleodii]

> gsub('\\s+\\[.*$', '', df$V1)
[1] "hypothetical protein, partial" "ankyrin repeat protein"        "helicase"                     

> data.frame(data=gsub('\\s+\\[.*$', '', df$V1))
                           data
1 hypothetical protein, partial
2        ankyrin repeat protein
3                      helicase
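If the species name is wanted as well, the same base R approach can be extended. This is only a sketch: the capture-group pattern, the `perl = TRUE` flag, and the NA handling for rows without a bracket are assumptions, not part of the answer above, and the last input row is invented.

```r
# The three rows from the question plus one hypothetical row with no bracket
geneDesc <- c(
  "hypothetical protein, partial [Vibrio shilonii]",
  "ankyrin repeat protein [Leptospira kirschneri]",
  "helicase [Alteromonas macleodii]",
  "unknown protein"
)

# Gene: drop the first "[" and everything after it, plus preceding whitespace
gene <- sub("\\s*\\[.*$", "", geneDesc, perl = TRUE)

# Species: capture the text inside the first pair of brackets;
# rows without any "[" get NA instead of the unchanged input
has_bracket <- grepl("[", geneDesc, fixed = TRUE)
species <- ifelse(
  has_bracket,
  sub("^[^\\[]*\\[([^\\]]*)\\].*$", "\\1", geneDesc, perl = TRUE),
  NA_character_
)

data.frame(gene = gene, species = species)
```

This needs no extra packages, at the cost of a denser regular expression than the tidyr version.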


Source: https://stackoverflow.com/questions/32834538/r-split-column
