data.table

R plyr, data.table: apply over certain columns of a data.frame

Submitted by 柔情痞子 on 2020-01-22 20:58:05

Question: I am looking for ways to speed up my code. I am looking into the apply/ply methods as well as data.table. Unfortunately, I am running into problems. Here is a small data sample:

    ids1 <- c(1, 1, 1, 1, 2, 2, 2, 2)
    ids2 <- c(1, 2, 3, 4, 1, 2, 3, 4)
    chars1 <- c("aa", " bb ", "__cc__", "dd ", "__ee", NA, NA, "n/a")
    chars2 <- c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a")
    data <- data.frame(col1 = ids1, col2 = ids2, col3 = chars1, col4 = chars2,
                       stringsAsFactors = FALSE)

Here is a …
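The excerpt is truncated, so the exact goal is unknown; assuming it is the usual task of trimming blanks/underscores and normalizing "n/a" to NA in the character columns, a data.table sketch that applies one cleaning function over a chosen set of columns looks like this (the `clean` helper and the choice of columns are assumptions):

```r
library(data.table)

ids1 <- c(1, 1, 1, 1, 2, 2, 2, 2)
ids2 <- c(1, 2, 3, 4, 1, 2, 3, 4)
chars1 <- c("aa", " bb ", "__cc__", "dd ", "__ee", NA, NA, "n/a")
chars2 <- c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a")
dt <- data.table(col1 = ids1, col2 = ids2, col3 = chars1, col4 = chars2)

clean <- function(x) {
  x <- gsub("^[ _]+|[ _]+$", "", x)  # strip leading/trailing blanks and underscores
  x[x == "n/a"] <- NA                # treat "n/a" as missing
  x
}

cols <- c("col3", "col4")
dt[, (cols) := lapply(.SD, clean), .SDcols = cols]  # update the columns by reference
```

The `lapply(.SD, ...)` with `.SDcols` idiom avoids an explicit loop over columns and updates the table in place.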

Improve performance of data.table date+time pasting?

Submitted by 删除回忆录丶 on 2020-01-22 19:31:19

Question: I am not sure that I can ask this question here; let me know if I should do it somewhere else. I have a data.table with 1e6 rows having this structure:

               V1       V2     V3
    1: 03/09/2011 08:05:40 1145.0
    2: 03/09/2011 08:06:01 1207.3
    3: 03/09/2011 08:06:17 1198.8
    4: 03/09/2011 08:06:20 1158.4
    5: 03/09/2011 08:06:40 1112.2
    6: 03/09/2011 08:06:59 1199.3

I am combining the V1 and V2 variables into a single datetime variable, using this code:

    system.time(DT[, `:=`(index = as.POSIXct(paste(V1, V2), format='%d/%m/ …
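One common speed-up for this pattern, sketched below on the six rows shown: timestamps in large logs usually repeat, so parse each distinct date-time string once and map the parsed values back with `match()`. The gain depends entirely on how much duplication the real data has, and the `tz = "UTC"` choice is an assumption.

```r
library(data.table)

DT <- data.table(
  V1 = rep("03/09/2011", 6),
  V2 = c("08:05:40", "08:06:01", "08:06:17", "08:06:20", "08:06:40", "08:06:59"),
  V3 = c(1145.0, 1207.3, 1198.8, 1158.4, 1112.2, 1199.3)
)

s <- paste(DT$V1, DT$V2)   # full timestamp strings, one per row
u <- unique(s)             # parse each distinct string only once
DT[, index := as.POSIXct(u, format = "%d/%m/%Y %H:%M:%S", tz = "UTC")[match(s, u)]]
```

`as.POSIXct` is the expensive call here; reducing the number of strings it sees is usually worth far more than micro-optimizing the `paste`.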

What's the higher-performance alternative to for-loops for subsetting data by group-id?

Submitted by ╄→尐↘猪︶ㄣ on 2020-01-22 14:38:12

Question: A recurring analysis paradigm I encounter in my research is the need to subset based on all the different group-id values, performing a statistical analysis on each group in turn and putting the results in an output matrix for further processing/summarizing. How I typically do this in R is something like the following:

    data.mat <- read.csv("...")
    groupids <- unique(data.mat$ID)  # Assume there are then 100 unique groups
    results <- matrix(rep("NA", 300), ncol = 3, nrow = 100)
    for (i in 1:100) {
      tempmat <- …
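The excerpt cuts off before the per-group statistics, so the ones below (count, mean, standard deviation) are placeholders. The general replacement for the subset-in-a-loop pattern is data.table's grouped aggregation, which computes all groups in one call and returns the results already collected in a table:

```r
library(data.table)

set.seed(1)
dt <- data.table(ID = rep(1:100, each = 10), value = rnorm(1000))

# One row per group; no explicit subsetting loop and no pre-allocated
# result matrix. Substitute the real per-group statistics for these.
results <- dt[, .(n = .N, mean = mean(value), sd = sd(value)), by = ID]
```

Keeping the results numeric in a data.table also avoids the character coercion that a `matrix(rep("NA", ...))` accumulator causes.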

Row maximum in data.table

Submitted by 感情迁移 on 2020-01-22 14:37:23

Question: I have a dataset of 8,000,000 rows with 100 columns in a data.table, where each column is a count. I need to find the maximum count in each row and which column this maximum is in. I can quickly get which column has the maximum value for each row using

    dt <- dt[, maxCol := which.max(.SD), by = pmxid]

but trying to get the actual maximum value using

    dt <- dt[, nmax := max(.SD), by = pmxid]

is incredibly slow. I ran it for nearly 20 minutes and only 200,000 row maximums had been calculated. Finding the …
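The by-row grouping is slow because it calls `max()` once per row. A sketch of the usual vectorised alternative, on a small stand-in table: `pmax()` takes whole columns at once, and `max.col()` gives the index of each row maximum the same way (the three count columns here are placeholders for the real 100).

```r
library(data.table)

dt <- data.table(pmxid = 1:5,
                 a = c(3, 9, 1, 4, 7),
                 b = c(8, 2, 6, 4, 0),
                 c = c(5, 5, 5, 9, 1))
cols <- c("a", "b", "c")

# pmax() is vectorised across entire columns, so one call covers every row.
dt[, nmax := do.call(pmax, .SD), .SDcols = cols]

# max.col() returns the column index of each row maximum in one pass.
dt[, maxCol := max.col(as.matrix(.SD), ties.method = "first"), .SDcols = cols]
```

Both calls scale with the number of columns rather than the number of rows, which is the right trade for 8,000,000 rows by 100 columns.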

Get the last row of a previous group in data.table

Submitted by 别来无恙 on 2020-01-22 09:50:29

Question: This is what my data table looks like:

    library(data.table)
    dt <- fread('
    Product Group LastProductOfPriorGroup
    A       1     NA
    B       1     NA
    C       2     B
    D       2     B
    E       2     B
    F       3     E
    G       3     E
    ')

The LastProductOfPriorGroup column is my desired column. I am trying to fetch the product from the last row of the prior group. So in the first two rows there are no prior groups, and therefore it is NA. In the third row, the product in the last row of the prior group (group 1) is B. I am trying to accomplish this with dt[, LastGroupProduct := shift …
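One way to sketch this (rebuilding the example table directly rather than via `fread`): take the last product of each group, `shift()` that one-row-per-group result down by one, then join it back onto the original table by Group.

```r
library(data.table)

dt <- data.table(Product = LETTERS[1:7],
                 Group   = c(1, 1, 2, 2, 2, 3, 3))

# Last product per group, then shift down one group so each group
# sees the last product of the *prior* group (NA for the first group).
last_by_group <- dt[, .(last = Product[.N]), by = Group]
last_by_group[, prior := shift(last)]

# Update-join: attach the shifted value to every row of its group.
dt[last_by_group, LastProductOfPriorGroup := i.prior, on = "Group"]
```

Shifting the grouped summary, rather than shifting inside `by =`, is what makes the value cross group boundaries correctly.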

Backward replacement of NAs in time series only to a limited number of observations

Submitted by 时光总嘲笑我的痴心妄想 on 2020-01-22 02:16:13

Question: In a data.table I want to perform a forward and backward gap-filling procedure over a period of 3 days in both directions.

    # Example data:
    library(data.table)
    library(zoo)
    dt <- data.table(Value = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, 0.1359223,
                               NA, NA, NA, NA, 0.0000000, 0.0000000, 0.0000000,
                               0.0000000, 0.0000000, NA))
    > dt
            Value
     1:        NA
     2:        NA
     3:        NA
     4:        NA
     5:        NA
     6:        NA
     7:        NA
     8:        NA
     9:        NA
    10: 0.1359223
    11:        NA
    12:        NA
    13:        NA
    14:        NA
    15: 0.0000000
    16: 0.0000000
    17: 0.0000000
    18: 0.0000000
    19: 0 …
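A base-R sketch of one possible reading of the task: fill NA runs from the nearest observation, but only up to n positions in each direction (here n = 3, assuming one row per day). Note that `zoo::na.locf(x, maxgap = n)` is close but not identical: it fills only gaps that are *entirely* no longer than n, rather than the first n positions of a longer gap.

```r
locf <- function(x) {                      # last observation carried forward
  i <- cumsum(!is.na(x))
  c(NA, x[!is.na(x)])[i + 1]
}

fill_forward_n <- function(x, n) {
  run   <- cumsum(!is.na(x))               # run id following each observation
  steps <- stats::ave(seq_along(x), run, FUN = seq_along) - 1  # distance to it
  out <- locf(x)
  out[steps > n] <- NA                     # undo fills more than n steps away
  out
}

fill_both_n <- function(x, n) {
  fwd <- fill_forward_n(x, n)
  bwd <- rev(fill_forward_n(rev(x), n))    # backward fill = forward on reversed
  ifelse(is.na(fwd), bwd, fwd)             # forward fill takes precedence on overlap
}

x   <- c(rep(NA, 9), 0.1359223, rep(NA, 4), rep(0, 5), NA)
res <- fill_both_n(x, 3)
```

On the example vector this fills rows 7–9 and 11–13 from the 0.1359223 observation, row 14 and row 20 from the zeros, and leaves rows 1–6 NA because they are more than three steps from any observation.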

Is There a Neater/Simpler Way to Write This data.table R Code?

Submitted by 前提是你 on 2020-01-21 18:53:06

Question: The STRATUM values in the OECD data are very long; for simplicity I would like to shorten them to shorter, more precise names, as in the code below.

    pisaMas[, `:=`(SchoolType = c(ifelse(STRATUM == "National Secondary School", "Public",
                           ifelse(STRATUM == "Religious School", "Religious",
                           ifelse(STRATUM == "MOE Technical School", "Technical", 0)))))]
    pisaMas[, table(SchoolType)]

I would like to know if there is a simpler way to do this using the data.table package.

Answer 1: Current …
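One tidy replacement for the nested `ifelse` chain is data.table's `fcase()` (available in recent data.table versions). A sketch on a stand-in table; note the original's fall-through value 0 would be coerced to the string "0" anyway, so an explicit string default is used here:

```r
library(data.table)

pisaMas <- data.table(STRATUM = c("National Secondary School",
                                  "Religious School",
                                  "MOE Technical School",
                                  "Something Else"))

# fcase() evaluates condition/value pairs in order; `default` replaces
# the innermost ifelse() fall-through.
pisaMas[, SchoolType := fcase(
  STRATUM == "National Secondary School", "Public",
  STRATUM == "Religious School",          "Religious",
  STRATUM == "MOE Technical School",      "Technical",
  default = "Other"
)]
```

A lookup-table join (`on = "STRATUM"`) would also work and scales better if the mapping grows large.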
