How to select rows from one data.table to apply in another data.table?

旧时难觅i 2021-01-24 00:13

I have two data.tables: df (21 million rows) and tmp (500k rows).

df has three columns linking an original patent (origpat)

2 Answers
  •  天命终不由人
    2021-01-24 00:52

    The best idea I came up with is:

    df[,idx := .I] # Add an index to the data.table to group by row of df
    df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
         length(tmp[pnum == ref.pat,prim]),by = idx]
    

    Or reusing your overlap function (still using the idx column):

    df[,compare := overlap(
                    mainprim,
                    tmp[pnum == ref.pat, prim]),
        by=idx]
    

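    This groups df by row, so inside each group mainprim and ref.pat hold the values of that single row (data.table exposes each group's subset of the data to the expression), and tmp is subset to the rows whose pnum matches that row's ref.pat.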
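
To see the row-wise grouping on a small case, here is a minimal sketch. The contents of tmp and df are made up for illustration; only the column names (pnum, prim, ref.pat, mainprim) come from the question.

```r
library(data.table)

# Made-up toy data: tmp maps a patent number (pnum) to its classes (prim)
tmp <- data.table(pnum = c(1L, 1L, 1L, 2L, 2L),
                  prim = c("A", "A", "B", "C", "D"))
# Made-up toy data: each row of df pairs a referenced patent with a class
df <- data.table(ref.pat  = c(1L, 2L, 1L),
                 mainprim = c("A", "C", "B"))

df[, idx := .I]  # one group per row of df
df[, compare := {
  x <- tmp[pnum == ref.pat, prim]  # classes of the patent referenced by this row
  sum(x == mainprim) / length(x)   # share of those classes equal to mainprim
}, by = idx]

print(df)
```

On this toy data the three rows should get 2/3, 1/2 and 1/3, because patent 1 has classes A, A, B and patent 2 has classes C, D.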

    If you want to avoid creating the idx column, you can use by = 1:nrow(df) instead, but this could slow the process down (grouping by an actual column is quicker in data.table).


    Great improvements by @Docendo:

    You can further speed up the process by creating an intermediate variable to store the subset instead of doing the subset twice per row:

    df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]
    

    If df contains duplicated combinations of ref.pat and mainprim, you can optimize further by using by = list(ref.pat, mainprim) instead of by = idx, so that each distinct combination is computed only once:

    df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
       by = list(ref.pat, mainprim)]
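
The effect of grouping by the pair can be checked on a small case. The sketch below uses made-up data in which one (ref.pat, mainprim) pair is repeated; the column names by_row and by_pair are hypothetical and exist only to compare the two variants side by side.

```r
library(data.table)

# Made-up toy data; rows 1 and 3 of df repeat the same (ref.pat, mainprim) pair
tmp <- data.table(pnum = c(1L, 1L, 1L, 2L, 2L),
                  prim = c("A", "A", "B", "C", "D"))
df <- data.table(ref.pat  = c(1L, 2L, 1L, 1L),
                 mainprim = c("A", "C", "A", "B"))

# Row-wise version: one group, and so one subset of tmp, per row of df
df[, idx := .I]
df[, by_row := {x <- tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = idx]

# Pair-wise version: one group per distinct (ref.pat, mainprim) pair;
# the single value computed for a group is assigned to all of its rows
df[, by_pair := {x <- tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = list(ref.pat, mainprim)]

n_groups <- uniqueN(df, by = c("ref.pat", "mainprim"))  # groups in the pair-wise version
same     <- isTRUE(all.equal(df$by_row, df$by_pair))    # do both columns agree?
print(list(rows = nrow(df), groups = n_groups, same = same))
```

The saving grows with the share of repeated pairs; if every pair in df is unique, both versions do the same amount of work.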
    

    Another, probably minor, improvement is to use mean() instead of sum()/length():

    df[,compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]
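
As a quick sanity check that the mean() form gives the same numbers as sum()/length(), here is a sketch on made-up data. It includes a ref.pat with no rows in tmp as an assumed edge case; the column names via_sum and via_mean are hypothetical.

```r
library(data.table)

# Made-up toy data; ref.pat = 3 has no matching pnum in tmp
tmp <- data.table(pnum = c(1L, 1L, 1L, 2L, 2L),
                  prim = c("A", "A", "B", "C", "D"))
df <- data.table(ref.pat  = c(1L, 2L, 3L),
                 mainprim = c("A", "C", "A"))

# Explicit ratio: number of matching classes divided by number of classes
df[, via_sum := {x <- tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
   by = list(ref.pat, mainprim)]

# mean() of a logical vector is the share of TRUE values, i.e. the same ratio
df[, via_mean := mean(tmp[pnum == ref.pat, prim] == mainprim),
   by = list(ref.pat, mainprim)]

print(df)
```

Both forms return NaN when the referenced patent has no rows in tmp (0/0 in one case, the mean of an empty vector in the other), so such rows may need separate handling afterwards.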
    
