data.table

How can I apply different aggregate functions to different columns in R?

匆匆过客 submitted on 2020-01-03 02:27:31
Question: How can I apply different aggregate functions to different columns in R? The aggregate() function only accepts a single function argument to be applied to every column:

    V1       V2          V3
     1 18.45022 62.24411694
     2 90.34637 20.86505214
     1 50.77358 27.30074987
     2 52.95872 30.26189013
     1 61.36935 26.90993530
     2 49.31730 70.60387016
     1 43.64142 87.64433517
     2 36.19730 83.47232907
     1 91.51753  0.03056485
   ...      ...         ...

    > aggregate(sample, by = sample["V1"], FUN = sum)
      V1 V1       V2       V3
    1  1 10 578.5299 489.5307
    2  2 20 575.2294 527.2222

How can I apply a …
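One common approach (a minimal sketch of my own, not taken from the truncated question) is to do the aggregation in data.table, where j can apply a different function to each column:

    library(data.table)
    dt <- data.table(V1 = rep(1:2, 5),
                     V2 = runif(10, 0, 100),
                     V3 = runif(10, 0, 100))
    # Different aggregate per column: sum V2, mean V3, grouped by V1.
    dt[, .(V2_sum = sum(V2), V3_mean = mean(V3)), by = V1]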

Count of values within specified range of value in each row using data.table

匆匆过客 submitted on 2020-01-02 21:19:10
Question: Producing a column of counts for each level (or combination of levels) of a categorical variable in data.table syntax can be handled with something like:

    # setting up the data so it's pasteable
    df <- data.table(var1 = c('dog','cat','dog','cat','dog','dog','dog'),
                     var2 = c(1,5,90,95,91,110,8),
                     var3 = c('lamp','lamp','lamp','table','table','table','table'))
    # adding a count column for var1
    df[, var1count := .N, by = .(var1)]
    # adding a count of each combo of var1 and var3
    df[, var1and3comb := .N, by = .(var1, var3)] …
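For the row-wise range count the title asks about, one sketch (the window of plus or minus 5 around var2 is an assumption, since the question text is cut off) compares each row's value against the whole column:

    # For each row, count how many var2 values fall within +/-5 of that row's var2.
    df[, near5 := sapply(var2, function(x) sum(abs(var2 - x) <= 5))]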

How to change few column names in a data table

梦想与她 submitted on 2020-01-02 13:28:34
Question: I have a data table with 10 columns:

    town tc one two three four five six seven total

I need to generate the mean for columns "one" through "total", for which I am using:

    DTmean <- DT[, (lapply(.SD, mean)), by = .(town, tc), .SDcols = 3:10]

This generates the means, but I then want the column names to be suffixed with "_mean". How can we do this? The first two columns should remain "town" and "tc". I tried the below, but it renames every column from "one" to "total" to just "_mean":

    for (i in 3:10) { setnames …
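A minimal sketch (assuming the summarized columns sit in positions 3 to 10 of DTmean, as .SDcols = 3:10 implies): rename them all in one setnames() call instead of a loop, so each new name is built from the corresponding old name:

    old <- names(DTmean)[3:10]
    setnames(DTmean, old, paste0(old, "_mean"))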

data.table efficient recycling

泄露秘密 submitted on 2020-01-02 12:39:50
Question: I frequently use recycling in data.table, for example when I need to make projections for future years: I repeat my original data for each future year. This can lead to something like this:

    library(data.table)
    dt <- data.table(cbind(1:500000, 500000:1))
    dt2 <- dt[, c(.SD, .(year = 1:10)), by = 1:nrow(dt)]

But I often have to deal with millions of lines, and far more columns than in this toy example. The time increases. Try this:

    library(data.table)
    dt <- data.table(cbind(1:50000000, 50000000 …
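One faster sketch (my own, not from the truncated question) avoids the per-row grouping entirely by indexing with rep(), which recycles the rows in a single vectorized step:

    library(data.table)
    dt <- data.table(cbind(1:500000, 500000:1))
    years <- 1:10
    # Repeat every row once per year, then attach the year column.
    dt2 <- dt[rep(seq_len(.N), each = length(years))]
    dt2[, year := rep(years, times = nrow(dt))]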

How to replicate observations based on weight

旧街凉风 submitted on 2020-01-02 11:26:13
Question: Suppose we have:

    library(data.table)
    dt <- data.table(id = 1:4, x1 = 10:13, x2 = 21:24, wt = c(1, 0, 0.5, 0.7))

which returns:

       id x1 x2  wt
    1:  1 10 21 1.0
    2:  2 11 22 0.0
    3:  3 12 23 0.5
    4:  4 13 24 0.7

I would like to replicate observations under the following conditions:

- If wt is 0 or 1, we assign flag equal to 1 and 0, respectively.
- If 0 < wt < 1, we assign flag equal to 0, and additionally replicate the observation with wt = 1 - wt and flag equal to 1.

The output I expect is:

    id x1 x2 wt flag
    1: …
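A sketch of one way to do this (my own construction, since the thread's answer is cut off): flag the original rows, then bind on the complementary copies of the fractional-weight rows:

    library(data.table)
    dt <- data.table(id = 1:4, x1 = 10:13, x2 = 21:24, wt = c(1, 0, 0.5, 0.7))
    # Original rows: wt == 0 gets flag 1, everything else flag 0.
    res <- copy(dt)[, flag := fifelse(wt == 0, 1L, 0L)]
    # Fractional weights get a second, complementary row with flag 1.
    extra <- dt[wt > 0 & wt < 1][, `:=`(wt = 1 - wt, flag = 1L)]
    res <- rbind(res, extra)[order(id)]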

data.table “sumproduct” style vector multiplication

瘦欲@ submitted on 2020-01-02 10:29:08
Question: In this toy example, I want to "sumproduct" a list of coefficients with each row's respective values and assign the result to a new column. The code below works for a given record, but when I remove the i parameter it behaves unexpectedly. I could do this in a loop or with apply, but it seems like there's a data.table way that I'm missing.

    DT <- data.table(mtcars)
    vars <- c("mpg", "cyl", "wt")
    coeffs <- c(2, 3, 4)
    DT[1, Calc := sum(coeffs * DT[1, vars, with = FALSE])]  # row 1 is assigned 70.480
    DT[, Calc …
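One idiomatic sketch (mine, since the question body is truncated): treat .SD as a matrix and use matrix multiplication, which computes the sumproduct for every row at once:

    library(data.table)
    DT <- data.table(mtcars)
    vars <- c("mpg", "cyl", "wt")
    coeffs <- c(2, 3, 4)
    # Row-wise sumproduct: (n x 3) matrix times a length-3 coefficient vector.
    DT[, Calc := as.vector(as.matrix(.SD) %*% coeffs), .SDcols = vars]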

Calculating average Based on Condition in R

一笑奈何 submitted on 2020-01-02 10:23:09
Question: Referring to the question "Calculating average of based on condition", I need to calculate the average of column E based on column F. Below is part of my data frame df, but my actual data has 65K values:

              E  F
    3.130658445 -1
    4.175605237 -1
    4.949554963  0
    4.653496112  0
    4.382672845  0
    3.870951272  0
    3.905365677  0
    3.795199341  0
    3.374740696  0
    3.104690415  0
    2.801178871  0
    2.487881321  0
    2.449349554  0
    2.405409636  0
    2.090901539  0
    1.632416356  0
    1.700583696  0
    1.846504012  0
    1.949797831  0
    1.963114449  0
    2 …
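A minimal grouped-mean sketch (the truncated text does not show the exact condition, so a plain group-by on F is an assumption; the real question may involve a more specific rule):

    library(data.table)
    setDT(df)  # assumes df holds the E and F columns shown above
    # Mean of E within each level of F.
    df[, .(meanE = mean(E)), by = "F"]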

read.csv faster than data.table::fread [duplicate]

有些话、适合烂在心里 submitted on 2020-01-02 09:56:59
Question: This question already has an answer here: Comparing speed of fread vs. read.table for reading the first 1M rows out of 100M (1 answer). Closed last year. Across the web I read that I should use data.table and fread to load my data, but when I run a benchmark I get the following results:

    Unit: milliseconds
      expr       min        lq     mean    median       uq      max neval
     test1  1.229782  1.280000 1.382249 1.366277 1.460483 1.580176    10
     test3  1.294726  1.355139 1.765871 1.391576 1.542041 4.770357    10
     test2 23.115503 …
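The usual explanation is that fread's fixed startup cost dominates on tiny files. A sketch of a fairer benchmark (the file name and size are hypothetical) uses a file large enough for the per-row cost to matter:

    library(data.table)
    library(microbenchmark)
    # Write a reasonably large test file first.
    fwrite(data.table(x = runif(1e6), y = runif(1e6)), "big.csv")
    microbenchmark(
      base  = read.csv("big.csv"),
      fread = fread("big.csv"),
      times = 10
    )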

How to do a basic left outer join with data.table in R?

独自空忆成欢 submitted on 2020-01-02 08:21:44
Question: I have a data.table of a and b that I've partitioned into "below" with b < .5 and "above" with b > .5:

    DT = data.table(a = as.integer(c(1,1,2,2,3,3)), b = c(0,0,0,1,1,1))
    above = DT[DT$b > .5]
    below = DT[DT$b < .5, list(a = a)]

I'd like to do a left outer join between above and below: for each a in above, count the number of rows in below. This is equivalent to the following in SQL:

    with dt as (select 1 as a, 0 as b
                union select 1, 0
                union select 2, 0
                union select 2, 1
                union select 3, 1
                union select …
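A sketch of one data.table approach (my own, since the accepted answer is not shown): aggregate below, then join the counts onto the distinct keys of above, treating missing matches as zero:

    # Count below-rows per value of a.
    counts <- below[, .N, by = a]
    # Left outer join: every distinct a in 'above' keeps a row.
    result <- counts[unique(above[, .(a)]), on = "a"]
    result[is.na(N), N := 0L]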

data.table sum by group and return row with max value

时间秒杀一切 submitted on 2020-01-02 07:41:12
Question: I have a data.table of this form:

    dd <- data.table(f = c("a", "a", "a", "b", "b"), g = c(1,2,3,4,5))
    dd

I need to sum the values of g by factor f, and finally return a single-row data.table object that holds the maximum of those sums but also retains the factor information, i.e.:

       f g
    1: b 9

My closest attempt so far is:

    tmp3 <- dd[, sum(g), by = f][, max(V1)]
    tmp3

which results in:

    > tmp3
    [1] 9

EDIT: I'm ideally looking for a purely data.table piece of code/workflow. I'm surprised that …
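A pure data.table sketch (mine, since the thread's answers are not included): keep the grouping column in the aggregation by naming it in j, then subset with which.max():

    library(data.table)
    dd <- data.table(f = c("a", "a", "a", "b", "b"), g = c(1, 2, 3, 4, 5))
    # Sum g per group, then keep the row whose sum is largest.
    dd[, .(g = sum(g)), by = f][which.max(g)]
    #    f g
    # 1: b 9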