data.table | 易学教程

Fuzzy merging in R - seeking help to improve my code

Question: Inspired by the experimental fuzzy_join function from the statar package, I wrote a function myself that combines exact and fuzzy (string-distance) matching. The merging job I have to do is quite big (resulting in multiple string distance matrices with slightly fewer than one billion cells), and I had the impression that fuzzy_join is not written very efficiently with regard to memory usage and that its parallelization is implemented in a weird manner (the computation of the

data.table sum and subset

Question: I have a data.table that I want to aggregate:

    library(data.table)
    dt1 <- data.table(year=c("2001","2001","2001","2002","2002","2002","2002"),
                      group=c("a","a","b","a","a","b","b"),
                      amt=c(20,40,20,35,30,28,19))

I want to sum amt by year and group, and then filter where the summed amt for any given group is greater than 100. I've got the data.table sum nailed:

    dt1[, sum(amt),by=list(year,group)]
       year group V1
    1: 2001     a 60
    2: 2001     b 20
    3: 2002     a 65
    4: 2002     b 47

I am having trouble
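The excerpt is cut off before the filtering step is spelled out. One possible reading is that a group should be kept when its total across all years exceeds 100. A minimal sketch under that assumption, reusing dt1 from the question (the names agg and res are introduced here for illustration only):

```r
library(data.table)

dt1 <- data.table(year  = c("2001","2001","2001","2002","2002","2002","2002"),
                  group = c("a","a","b","a","a","b","b"),
                  amt   = c(20,40,20,35,30,28,19))

# Step 1: sum amt by year and group, naming the result instead of the default V1
agg <- dt1[, .(amt = sum(amt)), by = .(year, group)]

# Step 2 (assumed reading): keep a group's rows only if its total
# over all years exceeds 100; groups that fail the test return NULL
res <- agg[, if (sum(amt) > 100) .SD, by = group]
print(res)
```

With this data only group a survives (60 + 65 = 125), while group b totals 67 and is dropped. If the intended filter is instead on each year/group sum, no row qualifies, since the largest such sum is 65.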

Renaming a column entry when it is the maximum value by group

Question: I have a dataset as follows: library(data.table) DT <- structure(list(State_Ab = c("MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD", "MD"), County = c("Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore", "Baltimore"), year = c(1994, 1994, 1998, 1998, 2000, 2000, 2004, 2004, 2006, 2006, 2010, 2010, 2016, 2016), Population = c(140942, 219673, 235413,

Renaming a column entry, when it is the maximum value by group, gives inconsistent results

问题 I have data as follows: library(data.table) DT <- structure(list(State_Ab = c("VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA", "VA"), year = c(1995, 1995, 1995, 1995, 1999, 1999, 1999, 1999, 2001, 2001, 2001, 2001, 2005, 2005, 2005, 2005, 2007, 2007, 2007, 2007, 2011, 2011,

Column name labelling in data.table joins

Question: I am trying to join data.table x to z using a non-equi join. Table x contains two columns, X1 and X2, that define the range used for joining with column Z1 in z. The current code successfully performs the non-equi join; however, certain columns are removed or renamed. I would like to return the 'ideal' data.table supplied, instead of the one I currently get, for which I would have to rename columns or join the data further.

    > library(data.table)
    >
    > x <- data.table(Id
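The question's tables are cut off in this excerpt, so the sketch below uses small made-up tables (the values of Id, X1, X2 and Z1 are hypothetical, as is the name res). It illustrates one common way to keep the original range columns in a non-equi join: by default the join columns X1 and X2 in the result carry the values of Z1, and the x. and i. prefixes in j can be used to pick the original columns explicitly.

```r
library(data.table)

# Hypothetical data: each row of x defines a closed range [X1, X2]
x <- data.table(Id = c(1, 2), X1 = c(1, 5), X2 = c(4, 9))
z <- data.table(Z1 = c(2, 6))

# Non-equi join: match each Z1 to the range that contains it.
# x.X1 / x.X2 refer to the original columns of x, i.Z1 to the column of z,
# so the output keeps the range bounds instead of copies of Z1.
res <- x[z, on = .(X1 <= Z1, X2 >= Z1),
         .(Id, X1 = x.X1, X2 = x.X2, Z1 = i.Z1)]
print(res)
```

Naming every output column in j avoids any renaming after the join, at the cost of listing the columns by hand.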

Matching based on different independent tables using data.table in R

Question: I would like to match multiple conditions from independent data tables onto my main data table. How can I do this using the data.table package? What would be the most efficient/fastest way? I have a mock example, with some mock conditions, to illustrate my question:

    main_data <- data.frame(
      pnum = c(1,2,3,4,5,6,7,8,9,10),
      age = c(24,35,43,34,55,24,36,43,34,54),
      gender = c("f","m","f","f","m","f","m","f","f","m"))
    data_1 <- data.frame(
      pnum = c(1,4,5,8,9),
      value_data_1 = c(1, 2, 1, 1, 1),

Cumulative sum from a month ago until the current day for all the rows

Question: I have a data.table with ID, dates and values like the following one:

    DT <- setDT(data.frame(ContractID= c(1,1,1,2,2),
                           Date = c("2018-02-01", "2018-02-20", "2018-03-12", "2018-02-01", "2018-02-12"),
                           Value = c(10,20,30,10,20)))
       ContractID       Date Value
    1:          1 2018-02-01    10
    2:          1 2018-02-20    20
    3:          1 2018-03-12    30
    4:          2 2018-02-01    10
    5:          2 2018-02-12    20

I'd like to get a new column with the total cumulative sum per ID from a month ago until the current day for each row, like in the table below. NB: the