R: gsub and str_split_fixed in data.tables

问题

I am "converting" from data.frame to data.table

I now have a data.table:

library(data.table)


DT = data.table(ID = c("ab_cd.de","ab_ci.de","fb_cd.de","xy_cd.de"))
DT

         ID
1: ab_cd.de
2: ab_ci.de
3: fb_cd.de
4: xy_cd.de  

new_DT<- data.table(matrix(ncol = 2))
colnames(new_DT)<- c("test1", "test2")

I would like to to first: delete ".de" after every entry and in the next step separate every entry by the underscore and save the output in two new columns. The final output should look like this:

   test1 test2
1    ab    cd
2    ab    ci
3    fb    cd
4    xy    cd

In data.frame I did:

df = data.frame(ID = c("ab_cd.de","ab_ci.de","fb_cd.de","xy_cd.de"))
df

         ID
1: ab_cd.de
2: ab_ci.de
3: fb_cd.de
4: xy_cd.de


df[,1] <- gsub(".de", "", df[,1], fixed=FALSE)
df

      ID
1: ab_cd
2: ab_ci
3: fb_cd
4: xy_cd



 n <- 1
for (i in (1:length(df[,1]))){
    new_df[n,] <-str_split_fixed(df[i,1], "_", 2)
    n <- n+1
}
new_df

  test1 test2
1    ab    cd
2    ab    ci
3    fb    cd
4    xy    cd

Any help is appreciated!

回答1:

You can use tstrsplit to split the column into two after removing the suffix (.de) with sub:

DT[, c("test1", "test2") := tstrsplit(sub("\\.de", "", ID), "_")][, ID := NULL][]

#   test1 test2
#1:    ab    cd
#2:    ab    ci
#3:    fb    cd
#4:    xy    cd

回答2:

We can use extract from tidyr

library(tidyr)
df %>% 
   extract(ID, into = c('test1', 'test2'), '([^_]+)_([^.]+).*')
#  test1 test2
#1    ab    cd
#2    ab    ci
#3    fb    cd
#4    xy    cd

Or using data.table

library(data.table)
DT[, .(test1 = sub('_.*', '', ID), test2 = sub('[^_]+_([^.]+)\\..*', '\\1', ID))]
#   test1 test2
#1:    ab    cd
#2:    ab    ci
#3:    fb    cd
#4:    xy    cd

来源：https://stackoverflow.com/questions/44217340/r-gsub-and-str-split-fixed-in-data-tables

标签

data.table

gsub

strsplit