data.table

Aggregating while merging two dataframes in R

Deadly, submitted on 2020-01-01 06:26:27

Question: The ultimate goal is to sum the total quantity (transact_data$qty) for each record in product_info where transact_data$productId exists in product_info, and where transact_data$date is between product_info$beg_date and product_info$end_date. The dataframes are below: product_info <- data.frame(productId = c("A", "B", "A", "C","C","B"), old_price = c(0.5,0.10,0.11,0.12,0.3,0.4), new_price = c(0.7,0.11,0.12,0.11,0.2,0.3), beg_date = c("2014-05-01", "2014-06-01", "2014-05-01", "2014-06-01
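The excerpt cuts off before the windows are fully specified, but this aggregate-while-joining pattern maps directly onto a data.table non-equi join with by = .EACHI. A minimal sketch on made-up sample data (the end_date values, dates, and quantities below are assumptions, not the asker's):

```r
library(data.table)

# Hypothetical sample data; end_date values are invented for the sketch.
product_info <- data.table(productId = c("A", "B"),
                           beg_date  = as.Date(c("2014-05-01", "2014-06-01")),
                           end_date  = as.Date(c("2014-05-31", "2014-06-30")))
transact_data <- data.table(productId = c("A", "A", "B"),
                            date = as.Date(c("2014-05-10", "2014-06-02", "2014-06-05")),
                            qty  = c(2, 3, 5))

# Non-equi join: for each product_info row, find transactions with a
# matching productId whose date falls in [beg_date, end_date], then sum
# qty per matched product_info row (by = .EACHI).
result <- transact_data[product_info,
                        on = .(productId, date >= beg_date, date <= end_date),
                        .(total_qty = sum(qty, na.rm = TRUE)),
                        by = .EACHI]
```

Rows of product_info with no matching transaction come out as 0, since the unmatched NA qty is dropped by na.rm = TRUE.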

Why does lapply() not retain my data.table keys?

痴心易碎, submitted on 2020-01-01 04:24:09

Question: I have a bunch of data.tables in a list. I want to apply unique() to each data.table in my list, but doing so destroys all my data.table keys. Here's an example: A <- data.table(a = rep(c("a","b"), each = 3), b = runif(6), key = "a") B <- data.table(x = runif(6), b = runif(6), key = "x") blah <- unique(A) Here, blah still has a key, and everything is right in the world: key(blah) # [1] "a" But if I add the data.tables to a list and use lapply(), the keys get destroyed: dt.list <- list(A, B)
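Whatever the underlying cause (older data.table/R combinations could drop the key in this situation), a defensive workaround is to capture each table's key and restore it inside the lapply() call. A sketch along those lines:

```r
library(data.table)

A <- data.table(a = rep(c("a", "b"), each = 3), b = runif(6), key = "a")
B <- data.table(x = runif(6), b = runif(6), key = "x")
dt.list <- list(A = A, B = B)

# Remember each table's key before unique(), then restore it afterwards.
# Recent data.table versions retain the key here anyway; the explicit
# setkeyv() makes the result robust across versions.
dt.list <- lapply(dt.list, function(dt) {
  k <- key(dt)
  out <- unique(dt)
  if (!is.null(k)) setkeyv(out, k)
  out
})
```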

Computing the Levenshtein ratio of each element of a data.table with each value of a reference table and merge with maximum ratio

╄→尐↘猪︶ㄣ, submitted on 2020-01-01 03:59:11

Question: I have a data.table dt with 3 columns: id, name as string, threshold as num. A sample is:

dt <- data.table(nid = c("n1","n2", "n3", "n4"), rname = c("apple", "pear", "banana", "kiwi"), maxr = c(0.5, 0.8, 0.7, 0.6))

nid | rname  | maxr
n1  | apple  | 0.5
n2  | pear   | 0.8
n3  | banana | 0.7
n4  | kiwi   | 0.6

I have a second table dt.ref with 2 columns: id, name as string. A sample is:

dt.ref <- data.table(cid = c("c1", "c2", "c3", "c4", "c5", "c6"), cname = c("apple", "maple", "peer", "dear", "bonobo
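The excerpt ends before the expected output, but one way to approach this is to build all nid/cid pairs, score each with a normalized Levenshtein ratio via base R's adist(), and keep the best reference match per nid. A sketch with a shortened dt.ref; the ratio formula below (1 minus edit distance over the longer string's length) is an assumption, since the post's exact definition is cut off:

```r
library(data.table)

dt <- data.table(nid = c("n1", "n2", "n3", "n4"),
                 rname = c("apple", "pear", "banana", "kiwi"),
                 maxr = c(0.5, 0.8, 0.7, 0.6))
dt.ref <- data.table(cid = c("c1", "c2", "c3"),
                     cname = c("apple", "maple", "peer"))

# All nid/cid combinations, then a Levenshtein ratio per pair:
# 1 - edit_distance / length of the longer string.
pairs <- CJ(nid = dt$nid, cid = dt.ref$cid)
pairs <- dt[pairs, on = "nid"][dt.ref, on = "cid"]
pairs[, ratio := 1 - mapply(adist, rname, cname) /
                     pmax(nchar(rname), nchar(cname))]

# Keep the best-scoring reference row for each nid.
best <- pairs[pairs[, .I[which.max(ratio)], by = nid]$V1]
```

A final non-equi filter such as best[ratio >= maxr] would then apply each row's own threshold.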

Cartesian Product using data.table package

北慕城南, submitted on 2020-01-01 01:58:05

Question: Using the data.table package in R, I am trying to create a cartesian product of two data.tables using the merge method, as one would do in base R. In base R the following works:

#assume this order data
orders <- data.frame(date = as.POSIXct(c('2012-08-28','2012-08-29','2012-09-01')), first.name = as.character(c('John','George','Henry')), last.name = as.character(c('Doe','Smith','Smith')), qty = c(10,50,6))
#and these dates
dates <- data.frame(date = seq(from = as.POSIXct('2012-08-28'), to = as
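data.table's merge() does not take by = NULL for a full cross join the way base merge() does, but the same result can be had by joining on a constant dummy key with allow.cartesian = TRUE. A sketch (the dates sequence is truncated in the excerpt, so its length here is an assumption):

```r
library(data.table)

orders <- data.table(date = as.POSIXct(c('2012-08-28', '2012-08-29', '2012-09-01')),
                     first.name = c('John', 'George', 'Henry'),
                     last.name = c('Doe', 'Smith', 'Smith'),
                     qty = c(10, 50, 6))
dates <- data.table(date = seq(from = as.POSIXct('2012-08-28'),
                               by = "1 day", length.out = 5))  # length assumed

# A constant key shared by every row of both tables turns an equi-join
# into a cartesian product; allow.cartesian = TRUE permits the row blow-up.
orders[, k := 1L]
dates[, k := 1L]
cart <- orders[dates, on = "k", allow.cartesian = TRUE][, k := NULL]
# orders' date survives as `date`; dates' clashing date column becomes `i.date`.
```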

Get columns by string from data.table [duplicate]

余生颓废, submitted on 2020-01-01 01:12:36

Question: This question already has answers here: Select / assign to data.table when variable names are stored in a character vector (3 answers). Closed 10 months ago. raw is a data.table and the following code works:

raw[,r_responseTime] #Returns the whole column
raw[,c_filesetSize] #Same as above, returns column
plot(raw[,r_responseTime]~raw[,c_filesetSize]) #draws something

Now I want to specify these columns from a string, so for example: col1="r_responseTime" col2="c_filesetSize" How can I now
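The duplicate target covers this; in short, a quoted column name goes through [[ for a single column as a vector, or through with = FALSE (or the `..` prefix) for a data.table subset. A sketch with stand-in data:

```r
library(data.table)

raw <- data.table(r_responseTime = runif(10), c_filesetSize = runif(10))
col1 <- "r_responseTime"
col2 <- "c_filesetSize"

y  <- raw[[col1]]                        # one column, as a plain vector
xy <- raw[, c(col1, col2), with = FALSE] # both columns, as a data.table
# Equivalent modern spelling for one name: raw[, ..col1]
# plot(raw[[col1]] ~ raw[[col2]])        # works inside a formula, too
```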

Efficient calculation of var-covar matrix in R

时光毁灭记忆、已成空白, submitted on 2019-12-31 21:58:31

Question: I'm looking for efficiency gains in calculating the (auto)covariance matrix from individual measurements over time t with t, t-1, etc. In the data matrix, each row represents an individual and each column represents monthly measurements (the columns are in time order), similar to the following data (although with some more covariance):

# simulate data
set.seed(1)
periods <- 70L
ind <- 90000L
mat <- sapply(rep(ind, periods), rnorm)

Below is the (ugly) code I came up with to get the
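The excerpt stops before the author's code, but the standard efficiency gain for this shape of problem is a single BLAS-backed crossprod() on the centered matrix rather than looping over column pairs. A sketch with a smaller ind so it runs quickly (the post uses 90000):

```r
# Simulated data as in the question, shrunk for a quick demonstration.
set.seed(1)
periods <- 70L
ind <- 1000L            # the post uses 90000L
mat <- sapply(rep(ind, periods), rnorm)

# cov(mat) equals crossprod() of the column-centered matrix divided by n - 1;
# crossprod() calls BLAS and is typically much faster for tall matrices.
ctr <- sweep(mat, 2L, colMeans(mat))
V <- crossprod(ctr) / (nrow(mat) - 1L)
```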

r data.table functional programming / metaprogramming / computing on the language

*爱你&永不变心*, submitted on 2019-12-31 16:47:47

Question: I am exploring different ways to wrap an aggregation function (but really it could be any type of function) using data.table (one dplyr example is also provided), and was wondering about best practices for functional programming / metaprogramming with respect to: performance (does the implementation matter for the optimizations data.table may apply?); readability (is there a commonly agreed standard, e.g. in most packages utilizing data.table?); ease of generalization (are there
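The excerpt is cut off, but one concrete anchor for the trade-off: data.table >= 1.14.2 provides the env argument for substituting column names programmatically, which keeps j readable and still presents plain symbols to data.table's internal optimizations (e.g. GForce for sum/mean). A hedged sketch; agg() and its arguments are invented for illustration:

```r
library(data.table)

# A wrapper that aggregates an arbitrary value column by an arbitrary
# grouping column via env= substitution (requires data.table >= 1.14.2).
agg <- function(dt, val_col, by_col) {
  dt[, .(value = sum(x)), by = g,
     env = list(x = val_col, g = by_col)]
}

DT <- data.table(grp = c("a", "a", "b"), v = c(1, 2, 5))
res <- agg(DT, "v", "grp")
```

Older alternatives (get(), quoted expressions with eval(), or building calls with substitute()) work too, but can block those internal optimizations.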

Merging two sets of data by data.table roll='nearest' function

ⅰ亾dé卋堺, submitted on 2019-12-31 06:46:36

Question: I have two sets of data. Sample of set_A (total number of rows: 45467):

ID_a a1    a2     a3       time_a
2    35694 5245.2 301.6053 00.00944
3    85694 9278.9 301.6051 23.00972
4    65694 9375.2 301.6049 22.00972
5    85653 4375.5 301.6047 19.00972
6    12694 5236.3 301.6045 22.00972
7    85697 5345.2 301.6043 21.00972
8    85640 5274.1 301.6041 20.01000
9    30694 5279.0 301.6039 20.01000

Sample of set_B (total number of rows: 4798):

ID_b b1    b2     source     time_b
2    34.20 15.114 set1.csv.1 20.35750
7    67.20 16.114 set1.csv.2 21.35778
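The excerpt ends before the desired output, but a rolling join on the time columns is the usual shape of the answer: give both tables a join column with a common name and let roll = "nearest" pick the closest match. A sketch on a cut-down version of the samples above:

```r
library(data.table)

set_A <- data.table(ID_a = c(2L, 3L, 4L, 5L),
                    time_a = c(0.00944, 23.00972, 22.00972, 19.00972))
set_B <- data.table(ID_b = c(2L, 7L),
                    time_b = c(20.35750, 21.35778))

# Both tables need the join column under one name; roll = "nearest"
# then matches each set_A row to the set_B row with the closest time.
set_A[, time := time_a]
set_B[, time := time_b]
merged <- set_B[set_A, on = "time", roll = "nearest"]
```

Each set_A row appears once in merged, carrying the nearest set_B row's columns alongside it.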

R data.table using max in i statement

浪尽此生, submitted on 2019-12-31 05:30:09

Question: This should be so simple, but for some reason data.table is not doing what I expect. I want to take the max of the values in a row to determine whether the row should be filtered out or not. What appears to be happening is that the max() function looks at the entire column, which is not what I want. Here's the code:

> test_dt <- data.table(value1 = 1:10, value2 = 2:11, value3 = 3:12)
> test_dt[max(value1, value2, value3) < 7]
Empty data.table (0 rows) of 3 cols: value1,value2,value3

Here's what I
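max() reduces all of its arguments to a single number, so the i expression compares that one global maximum against 7 for every row. The row-wise counterpart is pmax():

```r
library(data.table)

test_dt <- data.table(value1 = 1:10, value2 = 2:11, value3 = 3:12)

# max() collapses the three whole columns to one scalar (12 here), so the
# filter becomes 12 < 7 for every row. pmax() compares element-wise per row.
res <- test_dt[pmax(value1, value2, value3) < 7]
```

With these columns, the per-row maximum is always value3, so the filter keeps the four rows where value3 is 3 through 6.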