data.table

Fuzzy merging in R - seeking help to improve my code

那年仲夏 submitted on 2021-01-20 19:50:53
Question: Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself that combines exact and fuzzy (string-distance) matching. The merging job I have to do is quite big (resulting in multiple string distance matrices with slightly fewer than one billion cells), and I had the impression that the fuzzy_join function is not written very efficiently (with regard to memory usage) and that the parallelization is implemented in a weird manner (the computation of the
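The combination of exact and fuzzy matching described in this excerpt can be sketched roughly as follows. This is not the poster's function and not statar's fuzzy_join; it is a minimal illustration that assumes base R's `adist` (Levenshtein distance) as the string distance, with hypothetical tables `left` and `right` and an arbitrary distance cutoff of 2.

```r
library(data.table)

# Hypothetical toy tables (not from the original question)
left  <- data.table(name = c("apple", "banana", "cherry"))
right <- data.table(name = c("apple", "bananna", "cherrie"), value = 1:3)

# Step 1: exact matches are resolved with an ordinary keyed merge
exact <- merge(left, right, by = "name")

# Step 2: only the rows left unmatched go into the distance matrix,
# which keeps the matrix as small as possible
left_rest  <- left[!exact, on = "name"]
right_rest <- right[!exact, on = "name"]

d    <- adist(left_rest$name, right_rest$name)  # Levenshtein distances
best <- apply(d, 1, which.min)                  # closest candidate per row

fuzzy <- data.table(
  name    = left_rest$name,
  matched = right_rest$name[best],
  dist    = d[cbind(seq_len(nrow(d)), best)],
  value   = right_rest$value[best]
)[dist <= 2]  # cutoff of 2 is an assumption for this sketch
```

Resolving exact matches first shrinks the distance matrix, which is the memory-heavy part of a job of the size mentioned above.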

data.table sum and subset

隐身守侯 submitted on 2021-01-18 05:29:25
Question: I have a data.table that I want to aggregate:
library(data.table)
dt1 <- data.table(year=c("2001","2001","2001","2002","2002","2002","2002"), group=c("a","a","b","a","a","b","b"), amt=c(20,40,20,35,30,28,19))
I want to sum amt by year and group, and then filter where the summed amt for any given group is greater than 100. I've got the data.table sum nailed:
dt1[, sum(amt),by=list(year,group)]
   year group V1
1: 2001     a 60
2: 2001     b 20
3: 2002     a 65
4: 2002     b 47
I am having trouble
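One way to aggregate and then filter is to chain a second `[ ]` onto the aggregated result. The sketch below reuses `dt1` from the excerpt; the names `sums`, `total`, and `big_groups` are introduced here for illustration. Because the question is truncated, the second step is only one possible reading of "summed amt for any given group".

```r
library(data.table)

dt1 <- data.table(
  year  = c("2001", "2001", "2001", "2002", "2002", "2002", "2002"),
  group = c("a", "a", "b", "a", "a", "b", "b"),
  amt   = c(20, 40, 20, 35, 30, 28, 19)
)

# Sum by year and group, giving the result column a name instead of V1
sums <- dt1[, .(total = sum(amt)), by = .(year, group)]

# Assumed reading: keep the year/group rows of every group whose
# overall total exceeds 100 (only group "a", with 60 + 65 = 125)
big_groups <- sums[, if (sum(total) > 100) .SD, by = group]
```

If the filter is meant per year and group instead, `sums[total > 100]` does the same thing in one chained step; with this sample data it returns no rows.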

Renaming a column entry when it is the maximum value by group

妖精的绣舞 submitted on 2021-01-07 01:38:14
Question: I have a dataset as follows:
library(data.table)
DT <- structure(list(State_Ab = c("MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD"), County = c("Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore"), year = c(1994, 1994, 1998, 1998, 2000, 2000, 2004, 2004, 2006, 2006, 2010, 2010, 2016, 2016), Population = c(140942, 219673, 235413,
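The question text is cut off, so the exact renaming rule is unknown. As a rough sketch of the general technique in the title, the code below finds the row with the maximum Population in each group via `.I` and updates the County entry for those rows by reference. The four-row table is a hypothetical stand-in (the last Population value and the "(max)" suffix are assumptions, not from the source).

```r
library(data.table)

# Hypothetical reduced data; the first three Population values appear in
# the excerpt, the fourth is made up for illustration
DT <- data.table(
  State_Ab   = "MD",
  County     = "Baltimore",
  year       = c(1994, 1994, 1998, 1998),
  Population = c(140942, 219673, 235413, 150000)
)

# Global row numbers of the maximum Population within each group
idx <- DT[, .I[which.max(Population)], by = .(State_Ab, County, year)]$V1

# Rename the County entry only on those rows (suffix is an assumption)
DT[idx, County := paste(County, "(max)")]
```

Note that `which.max` picks only the first row when several rows tie for the maximum.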

Renaming a column entry when it is the maximum value by group

蓝咒 submitted on 2021-01-07 01:32:51
Question: I have a dataset as follows:
library(data.table)
DT <- structure(list(State_Ab = c("MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD"), County = c("Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore"), year = c(1994, 1994, 1998, 1998, 2000, 2000, 2004, 2004, 2006, 2006, 2010, 2010, 2016, 2016), Population = c(140942, 219673, 235413,

Renaming a column entry when it is the maximum value by group

為{幸葍}努か submitted on 2021-01-07 01:32:30
Question: I have a dataset as follows:
library(data.table)
DT <- structure(list(State_Ab = c("MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD"), County = c("Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore"), year = c(1994, 1994, 1998, 1998, 2000, 2000, 2004, 2004, 2006, 2006, 2010, 2010, 2016, 2016), Population = c(140942, 219673, 235413,

Renaming a column entry, when it is the maximum value by group, gives inconsistent results

岁酱吖の 提交于 2021-01-07 01:26:49
问题 I have data as follows: library(data.table) DT <- structure(list(State_Ab = c("VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA"), year = c(1995, 1995, 1995, 1995, 1999, 1999, 1999, 1999, 2001, 2001, 2001, 2001, 2005, 2005, 2005, 2005, 2007, 2007, 2007, 2007, 2011, 2011,

Column name labelling in data.table joins

元气小坏坏 submitted on 2021-01-05 07:15:06
Question: I am trying to join data.table x to z using a non-equi join. Table x contains two columns, X1 and X2, that define the range used for joining with column Z1 in z. The current code successfully performs the non-equi join; however, certain columns are removed or renamed. I would like to return the 'ideal' data.table supplied, instead of the one I currently get, where I would have to rename columns or join further data to match the 'ideal' one.
> library(data.table)
>
> x <- data.table(Id
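In a data.table non-equi join, the columns of x that appear in `on` come back holding the values from the other table unless the output columns are selected explicitly. The sketch below shows one common way to keep the original names and values with the `x.` and `i.` prefixes in `j`. The contents of `x` and `z` are hypothetical, since the original definitions are cut off.

```r
library(data.table)

# Hypothetical data; only the column names Id, X1, X2, Z1 come from the text
x <- data.table(Id = c("a", "b"), X1 = c(1, 10), X2 = c(5, 20))
z <- data.table(Z1 = c(3, 15))

# Join each Z1 to the range [X1, X2] and name every output column
# explicitly: x.* refers to columns of x, i.* to columns of z
res <- x[z,
         on = .(X1 <= Z1, X2 >= Z1),
         .(Id, X1 = x.X1, X2 = x.X2, Z1 = i.Z1)]
```

Without the explicit `j`, the result would contain columns named X1 and X2 that both hold the Z1 values, which matches the renaming complaint in the question.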

Matching based on different independent tables using data.table in R

|▌冷眼眸甩不掉的悲伤 submitted on 2021-01-01 09:09:06
Question: I would like to match multiple conditions from independent data tables onto my main data table. How can I do this using the data.table package? What would be the most efficient/fastest way? I have a mock example, with some mock conditions, here to illustrate my question:
main_data <- data.frame( pnum = c(1,2,3,4,5,6,7,8,9,10), age = c(24,35,43,34,55,24,36,43,34,54), gender = c("f","m","f","f","m","f","m","f","f","m"))
data_1 <- data.frame( pnum = c(1,4,5,8,9), value_data_1 = c(1, 2, 1, 1, 1),
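A common data.table pattern for pulling values from a separate table onto a main table is an update join, which adds the column by reference without copying the main table. The sketch below uses `main_data` from the excerpt and only the two columns of `data_1` that are visible before the text is cut off; the actual matching conditions in the question are unknown, so this assumes a plain match on `pnum`.

```r
library(data.table)

main_data <- data.table(
  pnum   = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10),
  age    = c(24, 35, 43, 34, 55, 24, 36, 43, 34, 54),
  gender = c("f", "m", "f", "f", "m", "f", "m", "f", "f", "m")
)

# Only the columns visible in the excerpt; the original has more
data_1 <- data.table(
  pnum         = c(1, 4, 5, 8, 9),
  value_data_1 = c(1, 2, 1, 1, 1)
)

# Update join: rows of main_data matching data_1 on pnum receive the
# value, all other rows get NA in the new column
main_data[data_1, on = "pnum", value_data_1 := i.value_data_1]
```

The same line can be repeated for each additional lookup table, so every table is matched in a single pass over its join column.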

Matching based on different independent tables using data.table in R

落花浮王杯 submitted on 2021-01-01 09:07:52
Question: I would like to match multiple conditions from independent data tables onto my main data table. How can I do this using the data.table package? What would be the most efficient/fastest way? I have a mock example, with some mock conditions, here to illustrate my question:
main_data <- data.frame( pnum = c(1,2,3,4,5,6,7,8,9,10), age = c(24,35,43,34,55,24,36,43,34,54), gender = c("f","m","f","f","m","f","m","f","f","m"))
data_1 <- data.frame( pnum = c(1,4,5,8,9), value_data_1 = c(1, 2, 1, 1, 1),

Cumulative sum from a month ago until the current day for all the rows

送分小仙女□ submitted on 2020-12-30 03:57:35
Question: I have a data.table with ID, dates and values like the following one:
DT <- setDT(data.frame(ContractID= c(1,1,1,2,2), Date = c("2018-02-01", "2018-02-20", "2018-03-12", "2018-02-01", "2018-02-12"), Value = c(10,20,30,10,20)))
   ContractID       Date Value
1:          1 2018-02-01    10
2:          1 2018-02-20    20
3:          1 2018-03-12    30
4:          2 2018-02-01    10
5:          2 2018-02-12    20
I'd like to get a new column with the total cumulative sum per ID from a month ago until the current day for each row, like in the table below. NB: the
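A rolling-window sum of this kind can be sketched with a non-equi self-join and `by = .EACHI`. The expected output and the note after "NB:" are cut off, so the window definition is an assumption: "a month" is taken to be the 30 days up to and including the current date. The names `start` and `RollSum` are introduced here for illustration.

```r
library(data.table)

DT <- data.table(
  ContractID = c(1, 1, 1, 2, 2),
  Date  = as.Date(c("2018-02-01", "2018-02-20", "2018-03-12",
                    "2018-02-01", "2018-02-12")),
  Value = c(10, 20, 30, 10, 20)
)

# Assumed window: the 30 days up to and including each row's date
DT[, start := Date - 30]

# Self-join: for each row, collect rows of the same contract whose Date
# lies in [start, Date] and sum their Value (by = .EACHI keeps row order)
res <- DT[DT,
          on = .(ContractID, Date >= start, Date <= Date),
          .(RollSum = sum(Value)),
          by = .EACHI]

DT[, RollSum := res$RollSum]
```

With this sample data and the 30-day assumption, the third row of contract 1 sums the values dated 2018-02-20 and 2018-03-12 but not 2018-02-01.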