complex data.table subset and vectorised maniulation

浪子不回头ぞ 提交于 2020-05-12 08:52:07

问题


Ok I have a complex function built using data.frames and in trying to speed it up I've turned to data.table. I'm totally new to this so I'm quite befuddled. Anyhow I've made a much much simpler toy example of what I want to do, but I cannot work out how to translate it into data.table format. Here is the example in data.frame form:

    rows <- 10
    data1 <- data.frame(   id =1:rows,
                    a = seq(0.2, 0.55, length.out = rows),
                  b = seq(0.35, 0.7, length.out = rows),
                  c = seq(0.4, 0.83, length.out = rows),
                  d = seq(0.6, 0.87, length.out = rows),
                  e = seq(0.7, 0.99, length.out = rows),
                  f = seq(0.52, 0.90, length.out = rows)             
    )
    DT1 <- data.table(data1) #for later

    data2 <- data.frame(   id =3:1,
                   a = rep(3, 3),
                   d = rep(2, 3),
                   f = rep(1, 3)
    )
    m.names <- c("a", "d", "f")

    data1[match(data2$id, data1$id),m.names] <- data1[match(data2$id, data1$id),m.names] + data2[match(data2$id, data1$id),m.names]

So note in the last step that I want perform addition between the pre-existing figures and the new data and its vectorised across several columns.

In a data.table format I've only gotten this far:

    DT1[id %in% data2$id, m.names, with=FALSE]

This selects the values I want to add to but after that I am lost. I would appreciate any help !

EDIT:

Ok I've figure out part of it - I can use the last line of code above to achieve the vectorised addition part using using data2 to store the added values as follows:

    data2[,m.names] <- data2[,m.names] + data.frame(DT1[id %in% data2$id, m.names, with=FALSE])

Even with 2.5million rows (in DT1) and 10,000 rows in data2 and 6 matching columns this only takes 0.004sec, but I still need to assign the new data2 to the appropriate dynamically assigned columns in data 1


回答1:


Here's another way, using devel version data.table v1.9.5:

require(data.table) ## v1.9.5+
setDT(data1)        ## data1 is now a data.table

cols1 = c("a", "d", "f")
cols2 = paste0("i.", cols1)
setkey(data1, id)   ## setkey and prepare for join
data1[data2, (cols1) := mapply(`+`, mget(cols1), mget(cols2), SIMPLIFY=FALSE)]
#     id         a         b         c    d         e         f
#  1:  1 3.2000000 0.3500000 0.4000000 2.60 0.7000000 1.5200000
#  2:  2 3.2388889 0.3888889 0.4477778 2.63 0.7322222 1.5622222
#  3:  3 3.2777778 0.4277778 0.4955556 2.66 0.7644444 1.6044444
#  4:  4 0.3166667 0.4666667 0.5433333 0.69 0.7966667 0.6466667
#  5:  5 0.3555556 0.5055556 0.5911111 0.72 0.8288889 0.6888889
#  6:  6 0.3944444 0.5444444 0.6388889 0.75 0.8611111 0.7311111
#  7:  7 0.4333333 0.5833333 0.6866667 0.78 0.8933333 0.7733333
#  8:  8 0.4722222 0.6222222 0.7344444 0.81 0.9255556 0.8155556
#  9:  9 0.5111111 0.6611111 0.7822222 0.84 0.9577778 0.8577778
# 10: 10 0.5500000 0.7000000 0.8300000 0.87 0.9900000 0.9000000

The join of the form x[i] is performed on the key column id. For each row of data2's id column, the corresponding matching rows in data1 is found. For example, for id = 2 from data2, the matching row is the 2nd row in data1.

Once we've all the matching rows, we evaluate the expression in j, which updates the data1 columns provided in col1 by adding the values from mget(cols1) and mget(cols2).

cols2 was generated with a i. prefix, which fetches the values from the data.table i -- here data2.

HTH




回答2:


One way would be to use set in a for loop as this involves multiple columns. Convert the second dataset to 'data.table' (DT2), set the key with the 'id' column, and join with 'data1'. Create column index vectors for the 'm.names' column in the dataset ('indx1') and i. columns created after the join ('indx2'). Using a for loop, set the NA elements in 'm.names' columns to '0', and then sum the corresponding columns based on the 'indx1' and 'indx2'.

DT2 <- as.data.table(data2)
DTNew <- setkey(DT2, id)[data1]
indx1 <- match(m.names, names(DTNew))
indx2 <- grep('i\\.', names(DTNew))

for(k in seq_along(indx1)){
  set(DTNew, i=which(is.na(DTNew[[indx1[k]]])), j= indx1[k], value=0)
  set(DTNew, i=NULL, j= indx2[k], value = DTNew[[indx1[k]]]+
                                      DTNew[[indx2[k]]])
  }

 res <- DTNew[,2:4 := NULL]
 setnames(res, names(data1))

Checking with the modified 'data1'

  data1[match(data2$id, data1$id),m.names] <- data1[match(data2$id, 
       data1$id),m.names] + data2[match(data2$id, data1$id),m.names]
  all.equal(setDF(res), data1)
  #[1] TRUE

Benchmarks

On a 1e6 dataset, 

set.seed(24)
data1 <- cbind(id=1:1e6,as.data.frame(matrix(rnorm(1e6*10), ncol=10, 
         dimnames=list(NULL, letters[1:10])) ))
 set.seed(46)
 data2 <- data.frame(id= sample(1:1000, 100, replace=FALSE), 
             a= rnorm(100), d=rnorm(100), f= rnorm(100))
 m.names <- c("a", "d", "f")

 DT2 <- as.data.table(data2)
 system.time({
   DTNew <- setkey(DT2, id)[data1]
   indx1 <- match(m.names, names(DTNew))
   indx2 <- grep('i\\.', names(DTNew))

   for(k in seq_along(indx1)){
   set(DTNew, i=which(is.na(DTNew[[indx1[k]]])), j= indx1[k], value=0)
   set(DTNew, i=NULL, j= indx2[k], value = DTNew[[indx1[k]]]+
                                  DTNew[[indx2[k]]])
   }

  res <- DTNew[,2:4 := NULL]
  setnames(res, names(data1))
  })

 # user  system elapsed 
 # 0.082   0.005   0.086 



回答3:


Ok thanks to @David Arenburg for his suggestion. I've modified it slightly to arrive at the following for my preferred solution

    text <- NULL
    for(i in 1:length(m.names)){
        text <- paste0(text, m.names[i], " = ", m.names[i], " + i.", m.names[i], ", ")
    }
    expr <- parse(text = paste0("\":=\"(", substr(text, 1, nchar(text)-2), ")" ))

    res2 <- DT1[data2, eval(expr)]


来源:https://stackoverflow.com/questions/29321218/complex-data-table-subset-and-vectorised-maniulation

标签
易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!