How to programmatically create binary columns based on a categorical variable in data.table?

后端 未结 3 833
無奈伤痛
無奈伤痛 2020-12-18 11:00

I have a big (12 million rows) data.table which looks like this:

library(data.table)
set.seed(123)
dt <         


        
相关标签:
3条回答
  • 2020-12-18 11:30

    data.table has its own dcast implementation using data.table's internals and should be fast. Give this a try:

    dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L)
    #    id a b c d e
    # 1:  1 0 1 1 1 1
    # 2:  2 1 0 1 0 1
    # 3:  3 1 0 1 1 1
    

    Just thought of another way to handle this by preallocating and updating by reference (perhaps dcast's logic should be done like this to avoid intermediates).

    ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
    

    All that's left is to fill existing combinations with 1L.

    dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
    ans
    #    id b d c e a
    # 1:  1 1 1 1 1 0
    # 2:  2 0 0 1 1 1
    # 3:  3 0 1 1 1 1
    

    Okay, I've gone ahead on benchmarked on OP's data dimensions with ~10 million rows and 10 columns.

    require(data.table)
    set.seed(45L)
    y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
    dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))
    
    system.time(ans1 <- AnsFunction())   # 2.3s
    system.time(ans2 <- dcastFunction()) # 2.2s
    system.time(ans3 <- TableFunction()) # 6.2s
    
    setcolorder(ans1, names(ans2))
    setcolorder(ans3, names(ans2))
    setorder(ans1, id)
    setkey(ans2, NULL)
    setorder(ans3, id)
    
    identical(ans1, ans2) # TRUE
    identical(ans1, ans3) # TRUE
    

    where,

    AnsFunction <- function() {
        ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
        dt[, {set(ans, i=.GRP, j=unique(y), value=1L); NULL}, by=id]
        ans
        # reorder columns outside
    }
    
    dcastFunction <- function() {
        # no need to load reshape2. data.table has its own dcast as well
        # no need for setDT
        df <- dcast(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L,value.var = "y")
    }
    
    TableFunction <- function() {
        # need to return integer results for identical results
        # fixed 1 -> 1L; as.numeric -> as.integer
        df <- as.data.frame.matrix(table(dt$id, dt$y))
        df[df > 1L] <- 1L
        df <- cbind(id = as.integer(row.names(df)), df)
        setDT(df)
    }
    
    0 讨论(0)
  • 2020-12-18 11:33

    For small data sets the table function seems to be more efficient, but on large datasets dcast seems to be the most efficient and convenient option.

    TableFunction <- function(){
        df <- as.data.frame.matrix(table(dt$id, dt$y))
        df[df > 1] <- 1
        df <- cbind(id = as.numeric(row.names(df)), df)
        setDT(df)
    }
    
    
    AnsFunction <- function(){
        ans = data.table(id = unique(dt$id))[, unique(dt$y) := 0L][]
        dt[, {set(ans, i=id, j=unique(y), value=1L); NULL}, by=id]
    }
    
    dcastFunction <- function(){
        df <-dcast.data.table(dt, id ~ y, fun.aggregate = function(x) 1L, fill=0L,value.var = "y")
    
    }
    
    library(data.table)
    library(microbenchmark)
    set.seed(123)
    N = 10000
    dt <- data.table(id=rep(1:N, each=5),y=sample(letters[1 : 5], N*5, replace = T)) 
    
    
    microbenchmark(
        "dcast" = dcastFunction(),
        "Table" = TableFunction(),
        "Ans"   = AnsFunction()
        )
    
    
     Unit: milliseconds
      expr       min        lq      mean    median        uq       max neval cld
     dcast  42.48367  45.39793  47.56898  46.83755  49.33388  60.72327   100  b 
     Table  28.32704  28.74579  29.14043  29.00010  29.23320  35.16723   100 a  
       Ans 120.80609 123.95895 127.35880 126.85018 130.12491 156.53289   100   c
    
    > all(test1 == test2)
    [1] TRUE
    > all(test1 == test3)
    [1] TRUE
    
    y = apply(matrix(sample(letters, 10L*20L, TRUE), ncol=20L), 1L, paste, collapse="")
    dt = data.table(id=sample(1e5,1e7,TRUE), y=sample(y,1e7,TRUE))
    
    microbenchmark(
        "dcast" = dcastFunction(),
        "Table" = TableFunction(),
        "Ans"   = AnsFunction()
    )
    Unit: seconds
      expr      min       lq     mean   median       uq      max neval cld
     dcast 1.985969 2.064964 2.189764 2.216138 2.266959 2.643231   100 a  
     Table 5.022388 5.403263 5.605012 5.580228 5.830414 6.318729   100   c
       Ans 2.234636 2.414224 2.586727 2.599156 2.645717 2.982311   100  b 
    
    0 讨论(0)
  • 2020-12-18 11:44

    If you already know the range of the rows (as in you know that there are no more than 3 rows in your example) and you know the columns you can start with an array of zeros and use the apply function to update values in that secondary table.

    My R is a little rust but i think that should work. Additionally the function you pass to the apply method could contain conditions to add necessary rows and columns as is needed.

    My R is a little rust so I'm a bit tentative to write it up right now, but I think that's the way to do it.

    If you are looking for something a little more plug and play I found this little blerb:

    There are two sets of methods that are explained below:
    
    gather() and spread() from the tidyr package. This is a newer interface to the reshape2 package.
    
    melt() and dcast() from the reshape2 package.
    
    There are a number of other methods which aren’t covered here, since they are not as easy to use:
    
    The reshape() function, which is confusingly not part of the reshape2 package; it is part of the base install of R.
    
    stack() and unstack()
    

    from here :: http://www.cookbook-r.com/Manipulating_data/Converting_data_between_wide_and_long_format/

    If I was better versed in R I would tell you how those various methods handle collisions going from long lists to wide on. I was googling up "Make a table from flat data in R" to come up with this...

    Also Check out this It's that same website as above with my personal comment wrapper : p

    0 讨论(0)
提交回复
热议问题