Splitting text column into ragged multiple new columns in a data table in R

爱一瞬间的悲伤 2020-12-06 07:58

I have a data table containing 20000+ rows and one column. The string in each row has a different number of words. I want to split the words and put each of them in a new column.

5 Answers
  • 2020-12-06 08:24

    This works for both data.table and data.frame:

    # toy data
    df <- structure(list(x = structure(c(2L, 1L), .Label = c("This actually is not", 
    "This is interesting"), class = "factor")), .Names = "x", row.names = c(NA, 
    -2L), class = "data.frame")
    
    #                      x
    # 1  This is interesting
    # 2 This actually is not
    
    # the code
    split_result <- strsplit(as.character(df$x), " ")
    length_n <- sapply(split_result, length)
    length_max <- seq_len(max(length_n))
    as.data.frame(t(sapply(split_result, "[", i = length_max))) # Or as.data.table(...)
    
    #     V1       V2          V3   V4
    # 1 This       is interesting <NA>
    # 2 This actually          is  not
    
  • 2020-12-06 08:26

    Here is a solution based on rbind.fill.matrix(...) in the plyr package. On a dataset with 20,000 rows it runs in about 3.6 sec.

    # create a sample dataset - you have this already
    library(data.table)
    words <- LETTERS[1:10]     # "words" are just letters in this example
    set.seed(1)                # for reproducible example
    w  <- sapply(1:2e4,function(i)paste(words[sample(1:10,sample(1:10,1))],collapse=" "))
    dt <- data.table(words=w)
    head(dt)
    #              words
    # 1:           D F H
    # 2:           I J F
    # 3:   A B I E C D H
    # 4: J D G H B I A E
    # 5:         A D G C
    # 6:       F E B J I
    
    # you start here...
    library(plyr)
    result <- rbind.fill.matrix(lapply(strsplit(dt$words, split=" "),matrix,nrow=1))
    result <- as.data.table(result)
    head(result)
    #    1 2 3  4  5  6  7  8  9 10
    # 1: D F H NA NA NA NA NA NA NA
    # 2: I J F NA NA NA NA NA NA NA
    # 3: A B I  E  C  D  H NA NA NA
    # 4: J D G  H  B  I  A  E NA NA
    # 5: A D G  C NA NA NA NA NA NA
    # 6: F E B  J  I NA NA NA NA NA
    

    EDIT: Added some benchmarking based on @Ananda's comment below.

    # dt$words is the column created in the sample dataset above
    f.rfm    <- function() as.data.table(rbind.fill.matrix(lapply(strsplit(dt$words, split=" "),matrix,nrow=1)))
    library(splitstackshape)
    f.csplit <- function() cSplit(dt, "words", " ",type.convert=FALSE)
    library(stringi)
    f.sl2m   <- function() as.data.table(stri_list2matrix(strsplit(dt$words, split=" "), byrow = TRUE))
    f.ssf    <- function() as.data.table(stri_split_fixed(dt$words, " ", simplify = TRUE))
    
    all.equal(f.rfm(),f.csplit(),check.names=FALSE)
    # [1] TRUE
    all.equal(f.rfm(),f.sl2m(),check.names=FALSE)
    # [1] TRUE
    all.equal(f.rfm(),f.ssf(),check.names=FALSE)
    # [1] TRUE
    library(microbenchmark)
    microbenchmark(f.rfm(),f.csplit(),f.sl2m(),f.ssf(),times=10)
    # Unit: milliseconds
    #        expr        min         lq     median        uq        max neval
    #     f.rfm() 3566.17724 3589.31203 3606.93303 3665.4087 3719.32299    10
    #  f.csplit()   98.05709  102.46456  104.51046  107.9588  117.26945    10
    #    f.sl2m()   55.45527   55.58852   56.75406   58.9347   67.44523    10
    #     f.ssf()   17.77499   17.98879   18.30831   18.4537   21.62161    10
    

    So it looks like stri_split_fixed(...) is the winner.

  • 2020-12-06 08:43

    Two functions, transpose() and tstrsplit(), have been available in data.table since version 1.9.6 on CRAN.

    With this we can do:

    require(data.table)
    setDT(tstrsplit(as.character(df$x), " ", fixed=TRUE))[]
    #      V1       V2          V3  V4
    # 1: This       is interesting  NA
    # 2: This actually          is not
    

    tstrsplit is a wrapper for transpose(strsplit(...)).
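
    To make that equivalence concrete, here is a minimal sketch using the two toy sentences from the first answer (the vector name `sentences` is made up for this illustration). Both routes should yield the same list of columns, with the shorter row padded with NA:

    ```r
    library(data.table)

    # hypothetical toy input, mirroring the sample data used in the first answer
    sentences <- c("This is interesting", "This actually is not")

    # route 1: split into a ragged list, then transpose it
    # (transpose() pads the shorter elements with NA by default)
    via_transpose <- transpose(strsplit(sentences, " ", fixed = TRUE))

    # route 2: let the wrapper do both steps in one call
    via_tstrsplit <- tstrsplit(sentences, " ", fixed = TRUE)

    # compare the two results
    identical(via_transpose, via_tstrsplit)
    ```

    Either result is a plain list with one element per new column, which is why it can be passed straight to setDT() as above.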

  • 2020-12-06 08:44

    Some example data would be nice, but if I understand what you want, it is not possible to do properly in a data frame. Given that there are different numbers of words in each row, you will need a list. Even so, it is very simple to split the words across the whole object.

    If you run strsplit(as.character(Data[,1]), " ") you will get a list with each element corresponding to a row in your data frame. From that, there are several different alternatives to rearrange this object, but the best approach will depend on your objective.
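
    As a rough base-R sketch of that idea (the contents of `Data` below are a hypothetical stand-in for the asker's table, and NA padding is only one of the possible rearrangements), the list can be inspected and then bound into a character matrix:

    ```r
    # hypothetical stand-in for the asker's one-column data frame
    Data <- data.frame(x = c("This is interesting", "This actually is not"),
                       stringsAsFactors = FALSE)

    # one list element per row, each a character vector of words
    word_list <- strsplit(as.character(Data[, 1]), " ")

    # one possible rearrangement: extend every element to the longest length
    # (extending with `length<-` fills the new positions with NA), then bind
    # the padded vectors row-wise into a character matrix
    max_words <- max(lengths(word_list))
    word_matrix <- do.call(rbind, lapply(word_list, `length<-`, max_words))
    word_matrix
    ```

    If the ragged shape itself matters, keeping `word_list` as it is avoids the padding altogether.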

  • 2020-12-06 08:48

    Check out cSplit from my "splitstackshape" package. It works on either data.frames or data.tables (but always returns a data.table).

    Assuming KFB's sample data is at least slightly representative of your actual data, you can try:

    library(splitstackshape)
    cSplit(df, "x", " ")
    #     x_1      x_2         x_3 x_4
    # 1: This       is interesting  NA
    # 2: This actually          is not
    

    Another (blazing fast) option is to use stri_split_fixed with simplify = TRUE (from "stringi"), which is likely to make its way into the "splitstackshape" code soon:

    library(stringi)
    # Note: recent stringi releases pad short rows with "" when simplify = TRUE;
    # pass simplify = NA to get the NA padding shown in this output.
    stri_split_fixed(df$x, " ", simplify = TRUE)
    #      [,1]   [,2]       [,3]          [,4]
    # [1,] "This" "is"       "interesting" NA
    # [2,] "This" "actually" "is"          "not"
    