R: gsub and str_split_fixed in data.tables

Submitted by 落爺英雄遲暮 on 2019-12-12 04:36:31

Question


I am "converting" my code from data.frame to data.table.

I now have a data.table:

library(data.table)


DT = data.table(ID = c("ab_cd.de","ab_ci.de","fb_cd.de","xy_cd.de"))
DT

         ID
1: ab_cd.de
2: ab_ci.de
3: fb_cd.de
4: xy_cd.de  

new_DT<- data.table(matrix(ncol = 2))
colnames(new_DT)<- c("test1", "test2")

I would like to first delete ".de" from the end of every entry, then split every entry at the underscore and save the two parts in two new columns. The final output should look like this:

   test1 test2
1    ab    cd
2    ab    ci
3    fb    cd
4    xy    cd

In data.frame I did:

df = data.frame(ID = c("ab_cd.de","ab_ci.de","fb_cd.de","xy_cd.de"))
df

        ID
1 ab_cd.de
2 ab_ci.de
3 fb_cd.de
4 xy_cd.de


# fixed=TRUE treats ".de" literally; as a regex the "." would match any character
df[,1] <- gsub(".de", "", df[,1], fixed=TRUE)
df

     ID
1 ab_cd
2 ab_ci
3 fb_cd
4 xy_cd



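library(stringr)

# Pre-allocate the result: one row per entry, two columns
new_df <- data.frame(matrix(ncol = 2, nrow = nrow(df)))
colnames(new_df) <- c("test1", "test2")

# Split each entry at the underscore into exactly two pieces
for (i in seq_len(nrow(df))) {
    new_df[i,] <- str_split_fixed(df[i,1], "_", 2)
}
new_df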

  test1 test2
1    ab    cd
2    ab    ci
3    fb    cd
4    xy    cd

Any help is appreciated!


Answer 1:


You can use tstrsplit to split the column into two after removing the suffix (.de) with sub:

DT[, c("test1", "test2") := tstrsplit(sub("\\.de", "", ID), "_")][, ID := NULL][]

#   test1 test2
#1:    ab    cd
#2:    ab    ci
#3:    fb    cd
#4:    xy    cd
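
The one-liner above chains three steps. A minimal sketch that spells them out separately, assuming the same DT as in the question; the intermediate names stripped and parts are only for illustration, and the "$" anchor and fixed = TRUE are optional tightenings that are not part of the original answer:

```r
library(data.table)

DT <- data.table(ID = c("ab_cd.de", "ab_ci.de", "fb_cd.de", "xy_cd.de"))

# Step 1: drop the ".de" suffix; "\\." matches a literal dot, "$" anchors at the end
stripped <- sub("\\.de$", "", DT$ID)

# Step 2: split on "_" and transpose, giving a list with one vector per new column
parts <- tstrsplit(stripped, "_", fixed = TRUE)

# Step 3: add both vectors as columns by reference, then drop the ID column
DT[, c("test1", "test2") := parts]
DT[, ID := NULL]
DT[]
```

Because := modifies DT by reference, no copy of the table is made; the trailing [] only forces the result to print.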



Answer 2:


We can use extract from tidyr:

library(tidyr)
df %>% 
   extract(ID, into = c('test1', 'test2'), '([^_]+)_([^.]+).*')
#  test1 test2
#1    ab    cd
#2    ab    ci
#3    fb    cd
#4    xy    cd

Or using data.table:

library(data.table)
DT[, .(test1 = sub('_.*', '', ID), test2 = sub('[^_]+_([^.]+)\\..*', '\\1', ID))]
#   test1 test2
#1:    ab    cd
#2:    ab    ci
#3:    fb    cd
#4:    xy    cd
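
To see what the two patterns in the data.table version do, here is a small base R sketch that applies them to a plain character vector. The vector x and the names test1 and test2 are only illustrative:

```r
x <- c("ab_cd.de", "xy_ci.de")

# "_.*" matches the first underscore and everything after it,
# so removing the match leaves only the prefix
test1 <- sub("_.*", "", x)

# "[^_]+_" consumes the prefix and the underscore, "([^.]+)" captures
# everything up to the dot, and "\\..*" consumes the dot and the rest;
# "\\1" replaces the whole match with the captured group
test2 <- sub("[^_]+_([^.]+)\\..*", "\\1", x)

test1  # "ab" "xy"
test2  # "cd" "ci"
```

Note that the .() form in the data.table call returns a new data.table and leaves DT itself unchanged, unlike the := approach in Answer 1.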


Source: https://stackoverflow.com/questions/44217340/r-gsub-and-str-split-fixed-in-data-tables
