data.table

Multiple indexing with multiple idxmin() and idxmax() in one aggregate in pandas

回眸只為那壹抹淺笑 submitted on 2020-08-27 21:57:25
Question: In R's data.table, it is easy to aggregate on multiple columns using argmin or argmax functions in a single aggregate. For example, for DT:

    > DT = data.table(id=c(1,1,1,2,2,2,2,3,3,3), col1=c(1,3,5,2,5,3,6,3,67,7), col2=c(4,6,8,3,65,3,5,4,4,7), col3=c(34,64,53,5,6,2,4,6,4,67))
    > DT
        id col1 col2 col3
     1:  1    1    4   34
     2:  1    3    6   64
     3:  1    5    8   53
     4:  2    2    3    5
     5:  2    5   65    6
     6:  2    3    3    2
     7:  2    6    5    4
     8:  3    3    4    6
     9:  3   67    4    4
    10:  3    7    7   67
    > DT_agg = DT[, .(agg1 = col1[which.max(col2)] , agg2 = col2[which.min(col3
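The excerpt above is cut off mid-expression. As a hedged sketch of the pattern it describes, the aggregation might look like the following. Two things here are assumptions, not taken from the original: that the grouping is by id, and that the second aggregate closes as col2[which.min(col3)].

```r
library(data.table)

DT <- data.table(
  id   = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3),
  col1 = c(1, 3, 5, 2, 5, 3, 6, 3, 67, 7),
  col2 = c(4, 6, 8, 3, 65, 3, 5, 4, 4, 7),
  col3 = c(34, 64, 53, 5, 6, 2, 4, 6, 4, 67)
)

# ASSUMPTION: grouping by `id` and the closing of agg2 are guesses,
# because the original expression is truncated.
DT_agg <- DT[, .(
  agg1 = col1[which.max(col2)],  # col1 at the row where col2 is largest
  agg2 = col2[which.min(col3)]   # col2 at the row where col3 is smallest
), by = id]

print(DT_agg)
# agg1: 5 5 7; agg2: 4 3 4
```

which.max() and which.min() return the first position of the extreme value, so ties (such as col2 in group 3, rows 8 and 9) resolve to the earliest row.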

Cartesian join in data.table

放肆的年华 submitted on 2020-08-21 05:23:37
Question: I am trying to do a full Cartesian join using data.table, but with little luck.

Code:

    a = data.table(dt=c(20131017,20131018))
    setkey(a,dt)
    b = data.table(ticker=c("ABC","DEF","XYZ"),ind=c("MISC1","MISC2","MISC3"))
    setkey(b,ticker)

Expected output:

    merge(data.frame(a),data.frame(b),all.x=TRUE,all.y=TRUE)

I have tried merge(a,b,allow.cartesian=TRUE), but it gives me the following error:

    Error in merge.data.table(a, b, allow.cartesian = TRUE) : A non-empty vector of column names for by is required.
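The error arises because a and b share no column to join on. One common workaround, shown here as a minimal sketch rather than as the accepted answer, is to give both tables a constant helper column and merge on it. The column name k is hypothetical and exists only for this illustration.

```r
library(data.table)

a <- data.table(dt = c(20131017, 20131018))
b <- data.table(ticker = c("ABC", "DEF", "XYZ"),
                ind    = c("MISC1", "MISC2", "MISC3"))

# ASSUMPTION: `k` is a hypothetical constant join column; every row of `a`
# matches every row of `b` on it, which yields the full cross product.
res <- merge(
  a[, .(k = 1L, dt)],
  b[, .(k = 1L, ticker, ind)],
  by = "k",
  allow.cartesian = TRUE
)[, k := NULL]  # drop the helper column again

print(res)
```

The j expressions build new tables, so a and b themselves are left unmodified. The result has nrow(a) * nrow(b) rows.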

Why is R's data.table so much faster than pandas?

让人想犯罪 __ submitted on 2020-08-21 04:35:07
Question: I have a dataset of 12 million rows, with 3 columns as unique identifiers and another 2 columns with values. I'm trying to do a rather simple task:

- Group by the three identifiers. This yields about 2.6 million unique combinations.
- Task 1: calculate the median for column Val1.
- Task 2: calculate the mean for column Val1 given some condition on Val2.

Here are my results, using pandas and data.table (both latest versions at the moment, on the same machine):

+-----------------+-----------------+--
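For orientation, the two tasks described above could be written in data.table roughly as follows. This is only a sketch on toy data: the identifier column names id1, id2, id3 and the condition Val2 > 0 are assumptions, since the excerpt does not name the identifiers or state the condition, and it says nothing about how the original benchmark was coded.

```r
library(data.table)

# Toy data; the real dataset in the question has 12 million rows.
dt <- data.table(
  id1  = c("a", "a", "a", "b", "b"),
  id2  = c(1, 1, 1, 2, 2),
  id3  = c("x", "x", "x", "y", "y"),
  Val1 = c(1, 2, 6, 10, 20),
  Val2 = c(-1, 1, 1, 1, -1)
)

# ASSUMPTION: `Val2 > 0` stands in for the unspecified "some condition".
res <- dt[, .(
  median_val1    = median(Val1),         # Task 1
  cond_mean_val1 = mean(Val1[Val2 > 0])  # Task 2
), by = .(id1, id2, id3)]

print(res)
```

Both aggregates are computed in a single grouped pass over the table.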

R data.table remove rows where one column is duplicated if another column is NA

不想你离开。 submitted on 2020-08-20 07:22:06
Question: Here is an example data.table:

    dt <- data.table(col1 = c('A', 'A', 'B', 'C', 'C', 'D'), col2 = c(NA, 'dog', 'cat', 'jeep', 'porsch', NA))

       col1   col2
    1:    A     NA
    2:    A    dog
    3:    B    cat
    4:    C   jeep
    5:    C porsch
    6:    D     NA

I want to remove rows where col1 is duplicated if col2 is NA and the group has a non-NA value in another row. In other words: group by col1, then, if a group has more than one row and one of them is NA, remove that row. This would be the result for dt:

       col1   col2
    2:    A    dog
    3:    B    cat
    4:    C   jeep
    5:    C porsch
    6:    D     NA

I tried this:
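The excerpt ends before the asker's own attempt, so that code is not reproduced here. As a separate, hedged sketch of one way to get the stated result: keep a row if its col2 is non-NA, or if every col2 in its group is NA. Note the assumption that a group consisting only of several NA rows keeps all of them, a case the question does not cover.

```r
library(data.table)

dt <- data.table(
  col1 = c("A", "A", "B", "C", "C", "D"),
  col2 = c(NA, "dog", "cat", "jeep", "porsch", NA)
)

# .I holds the row numbers of the current group in the original table.
# ASSUMPTION: groups whose col2 values are all NA are kept unchanged.
keep_rows <- dt[, .I[!is.na(col2) | all(is.na(col2))], by = col1]$V1
res <- dt[keep_rows]

print(res)
```

Selecting row indices with .I and subsetting once avoids building a .SD subset for every group.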