data.table

How can I apply different aggregate functions to different columns in R?

匆匆过客 submitted on 2020-01-03 02:27:31
Question: How can I apply different aggregate functions to different columns in R? The aggregate() function only accepts a single function argument to be applied to every column:

    V1       V2          V3
     1 18.45022 62.24411694
     2 90.34637 20.86505214
     1 50.77358 27.30074987
     2 52.95872 30.26189013
     1 61.36935 26.90993530
     2 49.31730 70.60387016
     1 43.64142 87.64433517
     2 36.19730 83.47232907
     1 91.51753  0.03056485
   ...      ...         ...

    > aggregate(sample, by = sample["V1"], FUN = sum)
      V1 V1       V2       V3
    1  1 10 578.5299 489.5307
    2  2 20 575.2294 527.2222

How can I apply a …
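One common approach (a minimal sketch of my own, not taken from the truncated question) is to do the aggregation in data.table, where j can apply a different function to each column:

    library(data.table)
    dt <- data.table(V1 = rep(1:2, 5),
                     V2 = runif(10, 0, 100),
                     V3 = runif(10, 0, 100))
    # Different aggregate per column: sum V2, mean V3, grouped by V1.
    dt[, .(V2_sum = sum(V2), V3_mean = mean(V3)), by = V1]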

Count of values within specified range of value in each row using data.table

匆匆过客 submitted on 2020-01-02 21:19:10
Question: Producing a column of counts for each level (or combination of levels) of a categorical variable in data.table syntax can be handled with something like:

    # setting up the data so it's pasteable
    df <- data.table(var1 = c('dog','cat','dog','cat','dog','dog','dog'),
                     var2 = c(1,5,90,95,91,110,8),
                     var3 = c('lamp','lamp','lamp','table','table','table','table'))
    # adding a count column for var1
    df[, var1count := .N, by = .(var1)]
    # adding a count of each combo of var1 and var3
    df[, var1and3comb := .N, by = .(var1, var3)] …
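For the row-wise range count the title asks about, one sketch (the window of plus or minus 5 around var2 is an assumption, since the question text is cut off) compares each row's value against the whole column:

    # For each row, count how many var2 values fall within +/-5 of that row's var2.
    df[, near5 := sapply(var2, function(x) sum(abs(var2 - x) <= 5))]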

How to change few column names in a data table

梦想与她 submitted on 2020-01-02 13:28:34
Question: I have a data table with 10 columns:

    town tc one two three four five six seven total

I need to generate the mean for columns "one" through "total", for which I am using:

    DTmean <- DT[, (lapply(.SD, mean)), by = .(town, tc), .SDcols = 3:10]

This generates the means, but I then want the column names to be suffixed with "_mean". How can we do this? The first two columns should remain "town" and "tc". I tried the below, but it renames every column from "one" to "total" to just "_mean":

    for (i in 3:10) { setnames …
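A minimal sketch (assuming the summarized columns sit in positions 3 to 10 of DTmean, as .SDcols = 3:10 implies): rename them all in one setnames() call instead of a loop, so each new name is built from the corresponding old name:

    old <- names(DTmean)[3:10]
    setnames(DTmean, old, paste0(old, "_mean"))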

data.table efficient recycling

泄露秘密 submitted on 2020-01-02 12:39:50
Question: I frequently use recycling in data.table, for example when I need to make projections for future years: I repeat my original data for each future year. This can lead to something like this:

    library(data.table)
    dt <- data.table(cbind(1:500000, 500000:1))
    dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt)]

But I often have to deal with millions of lines, and far more columns than in this toy example. The time increases. Try this:

    library(data.table)
    dt <- data.table(cbind(1:50000000, 50000000 …
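One faster sketch (my own, not from the truncated question) avoids the per-row grouping entirely by indexing with rep(), which recycles the rows in a single vectorized step:

    library(data.table)
    dt <- data.table(cbind(1:500000, 500000:1))
    years <- 1:10
    # Repeat every row once per year, then attach the year column.
    dt2 <- dt[rep(seq_len(.N), each = length(years))]
    dt2[, year := rep(years, times = nrow(dt))]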

How to replicate observations based on weight

旧街凉风 submitted on 2020-01-02 11:26:13
Question: Suppose we have:

    library(data.table)
    dt <- data.table(id = 1:4, x1 = 10:13, x2 = 21:24, wt = c(1, 0, 0.5, 0.7))

which returns:

       id x1 x2  wt
    1:  1 10 21 1.0
    2:  2 11 22 0.0
    3:  3 12 23 0.5
    4:  4 13 24 0.7

I would like to replicate observations under the following conditions:

- If wt is 0 or 1, we assign flag equal to 1 and 0, respectively.
- If 0 < wt < 1, we assign flag equal to 0, and additionally replicate the observation with wt = 1 - wt and flag equal to 1.

The output I expect is:

    id x1 x2 wt flag
    1: …
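A sketch of one way to do this (my own construction, since the thread's answer is cut off): flag the original rows, then bind on the complementary copies of the fractional-weight rows:

    library(data.table)
    dt <- data.table(id = 1:4, x1 = 10:13, x2 = 21:24, wt = c(1, 0, 0.5, 0.7))
    # Original rows: wt == 0 gets flag 1, everything else flag 0.
    res <- copy(dt)[, flag := fifelse(wt == 0, 1L, 0L)]
    # Fractional weights get a second, complementary row with flag 1.
    extra <- dt[wt > 0 & wt < 1][, `:=`(wt = 1 - wt, flag = 1L)]
    res <- rbind(res, extra)[order(id)]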

data.table “sumproduct” style vector multiplication

瘦欲@ submitted on 2020-01-02 10:29:08
Question: In this toy example, I want to "sumproduct" a list of coefficients with each row's respective values and assign the result to a new column. The code below works for a given record, but when I remove the i parameter it behaves unexpectedly. I could do this in a loop or with apply, but it seems like there's a data.table way that I'm missing.

    DT <- data.table(mtcars)
    vars <- c("mpg", "cyl", "wt")
    coeffs <- c(2, 3, 4)
    DT[1, Calc := sum(coeffs * DT[1, vars, with = FALSE])]  # row 1 is assigned 70.480
    DT[, Calc …
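One idiomatic sketch (mine, since the question body is truncated): treat .SD as a matrix and use matrix multiplication, which computes the sumproduct for every row at once:

    library(data.table)
    DT <- data.table(mtcars)
    vars <- c("mpg", "cyl", "wt")
    coeffs <- c(2, 3, 4)
    # Row-wise sumproduct: (n x 3) matrix times a length-3 coefficient vector.
    DT[, Calc := as.vector(as.matrix(.SD) %*% coeffs), .SDcols = vars]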

Calculating average Based on Condition in R

一笑奈何 submitted on 2020-01-02 10:23:09
Question: Referring to the question "Calculating average of based on condition", I need to calculate the average of column E based on column F. Below is part of my data frame df, but my actual data has 65K values:

              E  F
    3.130658445 -1
    4.175605237 -1
    4.949554963  0
    4.653496112  0
    4.382672845  0
    3.870951272  0
    3.905365677  0
    3.795199341  0
    3.374740696  0
    3.104690415  0
    2.801178871  0
    2.487881321  0
    2.449349554  0
    2.405409636  0
    2.090901539  0
    1.632416356  0
    1.700583696  0
    1.846504012  0
    1.949797831  0
    1.963114449  0
    2 …
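A minimal grouped-mean sketch (the truncated text does not show the exact condition, so a plain group-by on F is an assumption; the real question may involve a more specific rule):

    library(data.table)
    setDT(df)  # assumes df holds the E and F columns shown above
    # Mean of E within each level of F.
    df[, .(meanE = mean(E)), by = "F"]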

read.csv faster than data.table::fread [duplicate]

有些话、适合烂在心里 submitted on 2020-01-02 09:56:59
Question: This question already has an answer here: Comparing speed of fread vs. read.table for reading the first 1M rows out of 100M (1 answer). Closed last year. Across the web I read that I should use data.table and fread to load my data, but when I run a benchmark I get the following results:

    Unit: milliseconds
      expr       min        lq     mean    median       uq      max neval
     test1  1.229782  1.280000 1.382249 1.366277 1.460483 1.580176    10
     test3  1.294726  1.355139 1.765871 1.391576 1.542041 4.770357    10
     test2 23.115503 …
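The usual explanation is that fread's fixed startup cost dominates on tiny files. A sketch of a fairer benchmark (the file name and size are hypothetical) uses a file large enough for the per-row cost to matter:

    library(data.table)
    library(microbenchmark)
    # Write a reasonably large test file first.
    fwrite(data.table(x = runif(1e6), y = runif(1e6)), "big.csv")
    microbenchmark(
      base  = read.csv("big.csv"),
      fread = fread("big.csv"),
      times = 10
    )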

How to do a basic left outer join with data.table in R?

独自空忆成欢 submitted on 2020-01-02 08:21:44
Question: I have a data.table of a and b that I've partitioned into "below" with b < .5 and "above" with b > .5:

    DT = data.table(a = as.integer(c(1,1,2,2,3,3)), b = c(0,0,0,1,1,1))
    above = DT[DT$b > .5]
    below = DT[DT$b < .5, list(a = a)]

I'd like to do a left outer join between above and below: for each a in above, count the number of rows in below. This is equivalent to the following in SQL:

    with dt as (select 1 as a, 0 as b
                union select 1, 0
                union select 2, 0
                union select 2, 1
                union select 3, 1
                union select …
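A sketch of one data.table approach (my own, since the accepted answer is not shown): aggregate below, then join the counts onto the distinct keys of above, treating missing matches as zero:

    # Count below-rows per value of a.
    counts <- below[, .N, by = a]
    # Left outer join: every distinct a in 'above' keeps a row.
    result <- counts[unique(above[, .(a)]), on = "a"]
    result[is.na(N), N := 0L]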

data.table sum by group and return row with max value

时间秒杀一切 submitted on 2020-01-02 07:41:12
Question: I have a data.table of this form:

    dd <- data.table(f = c("a", "a", "a", "b", "b"), g = c(1,2,3,4,5))
    dd

I need to sum the values of g by factor f, and finally return a single-row data.table object that holds the maximum of those sums but also retains the factor information, i.e.:

       f g
    1: b 9

My closest attempt so far is:

    tmp3 <- dd[, sum(g), by = f][, max(V1)]
    tmp3

which results in:

    > tmp3
    [1] 9

EDIT: I'm ideally looking for a purely data.table piece of code/workflow. I'm surprised that …
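A pure data.table sketch (mine, since the thread's answers are not included): keep the grouping column in the aggregation by naming it in j, then subset with which.max():

    library(data.table)
    dd <- data.table(f = c("a", "a", "a", "b", "b"), g = c(1, 2, 3, 4, 5))
    # Sum g per group, then keep the row whose sum is largest.
    dd[, .(g = sum(g)), by = f][which.max(g)]
    #    f g
    # 1: b 9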