I have two data.tables: df (21 million rows) and tmp (500k rows).
df has three columns linking an original patent (origpat)
The best idea I came up with is:
df[,idx := .I] # Add an index to the data.table to group by row of df
df[,compare := sum(tmp[pnum == ref.pat, prim] == mainprim) /
length(tmp[pnum == ref.pat,prim]),by = idx]
Or reusing your overlap function (still using the idx column):
df[,compare := overlap(
mainprim,
tmp[pnum == ref.pat, prim]),
by=idx]
This groups df by row, so within each group the columns refer to that single row's values (its mainprim and ref.pat), which are then used to pull the matching subset of tmp.
If you want to avoid creating the idx column, you can use by = 1:nrow(df) instead, but this could slow things down (grouping by an actual column is quicker in data.table).
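To make the row-wise grouping concrete, here is a minimal sketch on toy data. The column names follow the question, but the values are invented, and the result column names compare_idx and compare_seq are only illustrative.

```r
library(data.table)

# Hypothetical toy data: column names follow the question, values are invented.
df <- data.table(
  origpat  = c(1L, 1L, 2L, 3L),
  ref.pat  = c(10L, 20L, 10L, 20L),
  mainprim = c("A", "A", "B", "C")
)
tmp <- data.table(
  pnum = c(10L, 10L, 10L, 20L, 20L),
  prim = c("A", "B", "A", "A", "C")
)

# Variant 1: group by an explicit row-index column.
df[, idx := .I]
df[, compare_idx := sum(tmp[pnum == ref.pat, prim] == mainprim) /
                    length(tmp[pnum == ref.pat, prim]),
   by = idx]

# Variant 2: group by a row counter built on the fly (no extra column).
df[, compare_seq := sum(tmp[pnum == ref.pat, prim] == mainprim) /
                    length(tmp[pnum == ref.pat, prim]),
   by = 1:nrow(df)]

print(df)
```

On this toy input both variants should give the same share per row, e.g. 2/3 for the first row, because two of the three prim values recorded for pnum 10 equal "A".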
Great improvements by @Docendo:
You can further speed up the process by creating an intermediate variable to store the subset instead of doing the subset twice per row:
df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)}, by = idx]
And in case there are duplicated combinations of ref.pat and mainprim in df you could further optimize the performance by using by = list(ref.pat, mainprim) instead of by = idx:
df[,compare := {x = tmp[pnum == ref.pat, prim]; sum(x == mainprim) / length(x)},
by = list(ref.pat, mainprim)]
Another, probably minimal, improvement is to use mean() instead of sum()/length():
df[,compare := mean(tmp[pnum == ref.pat, prim] == mainprim), by = list(ref.pat, mainprim)]
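As a quick sanity check, the sketch below compares the row-wise version with the grouped mean() version on toy data. The values are invented, and the column names by_row and by_pair are hypothetical. It includes a duplicated (ref.pat, mainprim) pair and a ref.pat that has no match in tmp.

```r
library(data.table)

# Hypothetical toy data: rows 1 and 2 share the same (ref.pat, mainprim) pair,
# and ref.pat 30 does not occur in tmp at all.
df <- data.table(
  origpat  = c(1L, 2L, 3L, 4L, 5L),
  ref.pat  = c(10L, 10L, 20L, 20L, 30L),
  mainprim = c("A", "A", "C", "A", "A")
)
tmp <- data.table(
  pnum = c(10L, 10L, 10L, 20L, 20L),
  prim = c("A", "B", "A", "A", "C")
)

# Row-wise version: one subset of tmp per row of df.
df[, idx := .I]
df[, by_row := {
  x <- tmp[pnum == ref.pat, prim]
  sum(x == mainprim) / length(x)
}, by = idx]

# Grouped version: one subset of tmp per distinct (ref.pat, mainprim) pair.
df[, by_pair := mean(tmp[pnum == ref.pat, prim] == mainprim),
   by = list(ref.pat, mainprim)]

print(df)
```

Both columns should agree. Note that a ref.pat with no rows in tmp yields NaN in either version (0/0, or the mean of an empty vector), so you may want to handle that case explicitly.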