Merging data.tables uses more than 10 GB RAM

Asked by 后悔当初, 2020-12-06 19:23

I have two data.tables: DT and meta. When I merge them using DT[meta], memory usage increases by more than 10 GB (and the merge is very slow).

2 Answers
  • 2020-12-06 20:08

    Maybe other functions would work better, such as merge(). cbind() only applies if the rows of the two tables already line up one-to-one, since it binds columns without matching keys.
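
    For what it's worth, merge() runs into the same blow-up when the join key repeats on both sides. A hedged sketch with toy tables (a and b are illustrative, not the asker's DT and meta; merge.data.table also accepts allow.cartesian in recent versions):

        library(data.table)
        a <- data.table(x = c(1, 1), y = c(1, 2))
        b <- data.table(x = c(1, 1), y = c(3, 4))
        # Every a-row with x == 1 pairs with every b-row with x == 1: 2 x 2 = 4 rows.
        merge(a, b, by = "x", allow.cartesian = TRUE)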

  • 2020-12-06 20:10

    My bad. The problem was that keys were not unique:

    a <- data.table(x = c(1, 1), y = c(1, 2))
    b <- data.table(x = c(1, 1), y = c(3, 4))
    setkey(a, x)
    setkey(b, x)
    a[b]
    #      x y y.1
    # [1,] 1 1   3
    # [2,] 1 2   3
    # [3,] 1 1   4
    # [4,] 1 2   4
    

    It would be nice if data.table could give a warning for that.
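
    Until such a warning exists (or on versions that predate it), a cheap pre-check catches the condition. This is a sketch using data.table's uniqueN() and key() helpers on the toy tables above:

        # If the key column has fewer unique values than rows,
        # the join can multiply rows instead of matching one-to-one.
        if (uniqueN(b, by = key(b)) < nrow(b)) {
          warning("duplicate key values in b: a[b] may return far more rows than nrow(b)")
        }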


    Update from Matthew

    This warning has now been implemented in v1.8.7:

    New argument allow.cartesian (default FALSE) added to X[Y] and merge(X,Y), #2464. Prevents large allocations due to misspecified joins; e.g., duplicate key values in Y joining to the same group in X over and over again. The word cartesian is used loosely for when more than max(nrow(X),nrow(Y)) rows would be returned. The error message is verbose and includes advice.
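
    In other words, on data.table >= 1.8.7 the toy join above fails fast instead of silently quadrupling. A sketch (the exact error wording varies by version):

        a[b]
        # Error: join results in more rows than max(nrow(x), nrow(i));
        # the message suggests checking for duplicate keys or passing allow.cartesian=TRUE
        a[b, allow.cartesian = TRUE]   # opt in explicitly to get the 4-row result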
