how can I eliminate a loop over a datatable? [closed]

Question

I've two data.tables as shown below:

library(data.table)

N <- 10
A.DT <- data.table(a1 = rnorm(N, 0, 1), a2 = NA)
B.DT <- data.table(b1 = rnorm(N, 0, 1), b2 = 1:N)
setkey(A.DT, a1)
setkey(B.DT, b1)

I tried to change my previous data.frame implementation to a data.table implementation by changing the for-loop as shown below:

for (i in 1:nrow(B.DT)) {
  for (j in nrow(A.DT):1) {
    if (B.DT[i,b2] <= N/2 
        && B.DT[i,b1] < A.DT[j,a1]) {
      A.DT[j,]$a2 <- B.DT[i,]$b1
      break
    }
  }
}

I get the following error message:

Error in `[<-.data.table`(`*tmp*`, j, a2, value = -0.391987468746123) : 
  object "a2" not found

I suspect I am not accessing the data.table correctly, as I am new to it. I also guess there is a quicker way of doing this than cycling up and down the two data.tables.

I'd like to know if the loop shown above could be simplified/vectorised.

Edit: the data.table data for copy/paste:

# A.DT
    a1  a2
1   -1.4917779  NA
2   -1.0731161  NA
3   -0.7533091  NA
4   -0.3673273  NA
5   -0.159569   NA
6   -0.1551948  NA
7   -0.0430574  NA
8   0.1783496   NA
9   0.4276034   NA
10  1.0697412   NA

# B.DT
    b1  b2
1   0.64229018  1
2   1.00527902  2
3   0.24746294  3
4   -0.50288835 4
5   0.34447791  5
6   -0.22205129 6
7   0.60099079  7
8   -0.70242284 8
9   0.6298599   9
10  0.08917988  10

The output I expect:

# OUTPUT
    a1  a2
1   -1.4917779  NA
2   -1.0731161  NA
3   -0.7533091  NA
4   -0.3673273  NA
5   -0.159569   NA
6   -0.1551948  NA
7   -0.0430574  NA
8   0.1783496   -0.50288835
9   0.4276034   0.24746294
10  1.0697412   0.64229018

The algorithm goes down one table and, for each row, goes up the other table, checks some conditions and modifies values accordingly. More specifically, it goes down B.DT, and for each row in B.DT it goes up A.DT and assigns to a2 the first value of b1 such that b1 is smaller than a1. An additional condition is checked before assignment (b2 being less than or equal to 5 in this example).

0.64229018 is the first value in B.DT, and it is assigned to the last row of A.DT. 1.00527902 is the second value in B.DT, but it is left unassigned because it is bigger than all values in A.DT. 0.24746294 is the third value in B.DT, and it is assigned to the second-to-last row of A.DT. -0.50288835 is the fourth value in B.DT, and it is assigned to row #8 of A.DT. 0.34447791 is the fifth value in B.DT, and it is left unassigned because it is too big.

This is of course a simplified problem (and therefore may not make much sense). Thanks for your time and input.

Answer 1:

Your code will run if you change:

A.DT[j,]$a2 <- B.DT[i,]$b1

to:

A.DT$a2[j] <- B.DT[i,]$b1

As for more efficient use of data.table, I'll leave that to those more expert than I...

Answer 2:

Once you have created your data.table, there is little need for the regular assignment operator <-. Instead, use :=, which goes inside the brackets in the j position. (The reason for avoiding <- is that it creates a copy of the object, whereas := modifies the object by reference, hence the efficiency.)

So first modification to your code would be:

 # FROM: A.DT[j,]$a2 <- B.DT[i,]$b1
 # TO: 
 A.DT[j, a2 := B.DT[i, b1] ]
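
As a minimal sketch of that in-place update: the three-row table DT and the value 9.5 below are made up for illustration, and a2 is created as NA_real_ so that its type matches the numeric value being assigned.

```r
library(data.table)

# Hypothetical toy table; a2 is a numeric NA column
DT <- data.table(a1 = c(0.1, 0.2, 0.3), a2 = NA_real_)

# Update row 2 by reference: no `DT <- ...` reassignment is needed
DT[2, a2 := 9.5]

DT$a2   # NA 9.5 NA
```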

Now, one of data.table's (many) best features is its by argument, which helps do away with a lot of for loops and *ply calls. In this specific case, you can clean up your dual loops as follows:

set.seed(201)
A.DT <- data.table(a1 = rnorm(N,0,1), key="a1")  # no need to create a2 if it will be NA. If you do, make sure it is as.numeric(NA)
B.DT <- data.table(b1 = rnorm(N,0,1), b2 = 1:N, key="b2")

# Assign to a2 in A.DT
A.DT[            
      , a2 := B.DT[ b2 <= N/2 & b1 < a1] [1, b1]
      , by=a1
     ]


> A.DT
             a1         a2
 1: -2.30403431         NA
 2: -1.69658097         NA
 3: -1.28548252         NA
 4: -0.34454603 -0.6478531
 5: -0.07503189 -0.6478531
 6:  0.05593404 -0.6478531
 7:  0.18900414 -0.6478531
 8:  0.26693735  0.2238094
 9:  0.28606069  0.2238094
10:  0.32576373  0.2238094
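
For comparison, here is a sketch that reproduces the expected output from the question on the pasted data. It rests on an assumption inferred from that expected output rather than stated in the question's loop: a row of A.DT that already holds a value is skipped. It still loops over the qualifying rows of B.DT, but replaces the inner loop with a vectorised which() and uses data.table's set() for the by-reference update.

```r
library(data.table)

# Data pasted in the question (A.DT sorted by a1, B.DT in b2 order)
A.DT <- data.table(
  a1 = c(-1.4917779, -1.0731161, -0.7533091, -0.3673273, -0.159569,
         -0.1551948, -0.0430574, 0.1783496, 0.4276034, 1.0697412),
  a2 = NA_real_
)
B.DT <- data.table(
  b1 = c(0.64229018, 1.00527902, 0.24746294, -0.50288835, 0.34447791,
         -0.22205129, 0.60099079, -0.70242284, 0.6298599, 0.08917988),
  b2 = 1:10
)
N <- nrow(B.DT)

for (i in which(B.DT$b2 <= N / 2)) {
  b <- B.DT$b1[i]
  # Assumption: only rows of A.DT that are still unassigned are candidates
  free <- which(is.na(A.DT$a2) & A.DT$a1 > b)
  if (length(free) > 0L) {
    # "Going up" A.DT means taking the candidate with the largest row index
    set(A.DT, i = max(free), j = "a2", value = b)
  }
}

A.DT$a2   # NA for rows 1-7, then -0.50288835, 0.24746294, 0.64229018
```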

Two side notes on `key`s:

1. You can set the key at the same time as you create the data.table, saving you two lines of code.
2. A data.table is sorted by its key. Judging by the fact that you are using row position to determine assignment, I am guessing that you will not want to set the keys as you have. In the code above, I changed B.DT's key to `b2`.
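
A small sketch of both points, using a made-up three-row table (DT and its values are illustrative only):

```r
library(data.table)

# Create the table and set its key in a single call
DT <- data.table(b1 = c(0.3, -1.2, 0.7), b2 = 1:3, key = "b1")

key(DT)   # "b1"
DT$b2     # 2 1 3: the rows were reordered by b1, so positions changed
```

Because keying reorders the rows, any logic that depends on the original row position (as the loop in the question does) sees a different order afterwards.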

Source: https://stackoverflow.com/questions/15475984/how-can-i-eliminate-a-loop-over-a-datatable

Tags

data.table
