plyr

R use ddply or aggregate

被刻印的时光 ゝ Submitted on 2019-11-27 09:15:57
I have a data frame with 3 columns: custId, saleDate, DelivDateTime.

> head(events22)
     custId            saleDate      DelivDate
1 280356593 2012-11-14 14:04:59 11/14/12 17:29
2 280367076 2012-11-14 17:04:44 11/14/12 20:48
3 280380097 2012-11-14 17:38:34 11/14/12 20:45
4 280380095 2012-11-14 20:45:44 11/14/12 23:59
5 280380095 2012-11-14 20:31:39 11/14/12 23:49
6 280380095 2012-11-14 19:58:32 11/15/12 00:10

Here's the dput:

> dput(events22)
structure(list(custId = c(280356593L, 280367076L, 280380097L,
280380095L, 280380095L, 280380095L, 280364279L, 280364279L,
280398506L, 280336395L, 280364376L, 280368458L,
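The question text is cut off above, but going by the title it compares a per-customer aggregation written with ddply against one written with aggregate. A minimal sketch of both, assuming the goal is something like the most recent sale per custId (the exact summary is an assumption, since the body is truncated):

library(plyr)

# convert the sale timestamp to POSIXct so max() is chronological
events22$saleDate <- as.POSIXct(events22$saleDate)

# plyr version: one row per custId
ddply(events22, .(custId), summarise, lastSale = max(saleDate))

# base-R equivalent with aggregate()
aggregate(saleDate ~ custId, data = events22, FUN = max)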

faster way to create variable that aggregates a column by id [duplicate]

与世无争的帅哥 Submitted on 2019-11-27 08:58:33
This question already has an answer here:
Calculate group mean (or other summary stats) and assign to original data (4 answers)

Is there a faster way to do this? I guess this is unnecessarily slow and that a task like this can be accomplished with base functions.

df <- ddply(df, "id", function(x) cbind(x, perc.total = sum(x$cand.perc)))

I'm quite new to R. I have looked at by(), aggregate() and tapply(), but didn't get them to work at all or in the way I wanted. Rather than returning a shorter vector, I want to attach the sum to the original data frame. What is the best way to do this?

Edit: Here
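A sketch of a base-R alternative that attaches the group sum back onto the original data frame, using ave() (column names taken from the snippet above):

# ave() returns a vector as long as the input, so it lines up with df's rows
df$perc.total <- ave(df$cand.perc, df$id, FUN = sum)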

Summarizing by subgroup percentage in R

牧云@^-^@ Submitted on 2019-11-27 08:54:54
I have a dataset like this:

df = data.frame(group = c(rep('A', 4), rep('B', 3)),
                subgroup = c('a', 'b', 'c', 'd', 'a', 'b', 'c'),
                value = c(1, 4, 2, 1, 1, 2, 3))

group | subgroup | value
------------------------
  A   |    a     |   1
  A   |    b     |   4
  A   |    c     |   2
  A   |    d     |   1
  B   |    a     |   1
  B   |    b     |   2
  B   |    c     |   3

What I want is to get the percentage of the values of each subgroup within each group, i.e. the output should be:

group | subgroup | percent
--------------------------
  A   |    a     |  0.125
  A   |    b     |  0.500
  A   |    c     |  0.250
  A   |    d     |  0.125
  B   |    a     |  0.167
  B   |    b     |  0.333
  B   |    c     |  0.500

Example for group A, subgroup a: the value was 1, the sum of all values in group A is 1 + 4 + 2 + 1 = 8, so the percentage is 1/8 = 0.125.
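A minimal plyr sketch for this; transform keeps every row and adds the within-group percentage:

library(plyr)
res <- ddply(df, .(group), transform, percent = value / sum(value))
res$value <- NULL   # drop the original value column if only the percentage is wanted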

Efficient method to filter and add based on certain conditions (3 conditions in this case)

…衆ロ難τιáo~ Submitted on 2019-11-27 08:06:22
Question

I have a data frame which looks like this:

a b c   d
1 1 1   0
1 1 1 200
1 1 1 300
1 1 2   0
1 1 2 600
1 2 3   0
1 2 3 100
1 2 3 200
1 3 1   0

and I would like to end up with a data frame which looks like this:

a b c   d
1 1 1 250
1 1 2 600
1 2 3 150
1 3 1   0

I am currently doing it like this:

{
  n = nrow(subset(Wallmart, a == i & b == j & c == k))
  sum = subset(Wallmart, a == i & b == j & c == k)
  #sum
  sum1 = append(sum1, sum(sum$d) / (n - 1))
}

I would like to add up the 'd' column and take the average, counting the number of rows but not counting the 0s. For example
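A sketch of the same calculation done group-wise with ddply; the zero-handling rule is an assumption read off the example (average the non-zero d values, or 0 if every d in the group is zero):

library(plyr)
res <- ddply(Wallmart, .(a, b, c), summarise,
             d = if (all(d == 0)) 0 else mean(d[d != 0]))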

Standard error bars using stat_summary

感情迁移 Submitted on 2019-11-27 06:50:34
The following code produces bar plots with standard error bars using Hmisc, ddply and ggplot:

means_se <- ddply(mtcars, .(cyl), function(df)
  smean.sdl(df$qsec, mult = sqrt(length(df$qsec))^-1))
colnames(means_se) <- c("cyl", "mean", "lower", "upper")

ggplot(means_se, aes(cyl, mean, ymax = upper, ymin = lower, group = 1)) +
  geom_bar(stat = "identity") +
  geom_errorbar()

However, implementing the above using helper functions such as mean_sdl seems much better. For example, the following code produces a plot with 95% CI error bars:

ggplot(mtcars, aes(cyl, qsec)) +
  stat_summary(fun.y = mean, geom = "bar") +
  stat
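For comparison, a sketch of the stat_summary route using ggplot2's built-in mean_se helper to get standard-error (rather than 95% CI) bars; note that recent ggplot2 versions use fun/fun.data in place of fun.y:

library(ggplot2)
ggplot(mtcars, aes(factor(cyl), qsec)) +
  stat_summary(fun = mean, geom = "bar") +
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.2)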

Sum of rows based on column value

风流意气都作罢 Submitted on 2019-11-27 06:41:21
I want to sum rows that have the same value in one column:

> df <- data.frame("1" = c("a", "b", "a", "c", "c"), "2" = c(1, 5, 3, 6, 2), "3" = c(3, 3, 4, 5, 2))
> df
  X1 X2 X3
1  a  1  3
2  b  5  3
3  a  3  4
4  c  6  5
5  c  2  2

For one column (X2), the data can be aggregated to get the sums of all rows that have the same X1 value:

> ddply(df, .(X1), summarise, X2 = sum(X2))
  X1 X2
1  a  4
2  b  5
3  c  8

How do I do the same for X3 and an arbitrary number of other columns except X1? This is the result I want:

  X1 X2 X3
1  a  4  7
2  b  5  3
3  c  8  7

ddply(df, "X1", numcolwise(sum))

See ?numcolwise for details and examples. aggregate can
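A sketch of the base-R route that the truncated answer appears to be heading toward, summing columns by X1 with aggregate():

aggregate(cbind(X2, X3) ~ X1, data = df, FUN = sum)

# or, for an arbitrary number of columns other than X1:
aggregate(. ~ X1, data = df, FUN = sum)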

R: Split unbalanced list in data.frame column

岁酱吖の Submitted on 2019-11-27 05:29:59
Suppose you have a data frame with the following structure:

df <- data.frame(a = c(1, 2, 3, 4),
                 b = c("job1;job2", "job1a", "job4;job5;job6", "job9;job10;job11"))

where the column b is a semicolon-delimited list (unbalanced by row). The ideal data.frame would be:

id,job,jobNum
1,job1,1
1,job2,2
...
3,job6,3
4,job9,1
4,job10,2
4,job11,3

I have a partial solution that takes almost 2 hours (170K rows):

# Split the column by the semicolon. Results in a list.
df$allJobs <- strsplit(df$b, ";", fixed = TRUE)

# Function to reshape a column that is a list into a data.frame
simpleStack <- function(data){
  start <- as
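A sketch of a vectorized base-R approach that avoids the slow per-row reshaping (as.character guards against b being a factor):

pieces <- strsplit(as.character(df$b), ";", fixed = TRUE)

out <- data.frame(
  id     = rep(df$a, lengths(pieces)),    # repeat each id once per job
  job    = unlist(pieces),                # flatten the split jobs
  jobNum = sequence(lengths(pieces))      # 1, 2, ... within each id
)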

ddply + summarize for repeating same statistical function across large number of columns

霸气de小男生 Submitted on 2019-11-27 05:04:07
Question

OK, second R question in quick succession. My data:

            Timestamp  St_01  St_02 ...
1 2008-02-08 00:00:00 26.020 25.840 ...
2 2008-02-08 00:10:00 25.985 25.790 ...
3 2008-02-08 00:20:00 25.930 25.765 ...
4 2008-02-08 00:30:00 25.925 25.730 ...
5 2008-02-08 00:40:00 25.975 25.695 ...
...

Basically, I would normally use a combination of ddply and summarize to calculate ensembles (e.g. the mean for every hour across the whole year). In the case above, I would create a category, e.g. hour (e.g. strptime
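A sketch of how the same statistic can be repeated across every numeric column with numcolwise instead of naming each station in summarize (the data-frame name dat and the hour grouping are assumptions based on the excerpt):

library(plyr)

# derive the grouping category from the timestamp
dat$hour <- format(as.POSIXct(dat$Timestamp), "%H")

# apply mean to every numeric column (St_01, St_02, ...) within each hour
hourly_means <- ddply(dat, .(hour), numcolwise(mean, na.rm = TRUE))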

Returning first row of group

戏子无情 Submitted on 2019-11-27 04:41:28
I have a data frame consisting of an ID that is the same for each element in a group, two datetimes, and the time interval between these two. One of the datetime objects is my relevant time marker. Now I would like to get a subset of the data frame that consists of the earliest entry for each group. The entries (especially the time interval) need to stay untouched. My first approach was to sort the frame according to 1. ID and 2. the relevant datetime. However, I wasn't able to return the first entry for each new group. I then looked at the aggregate() as well as the ddply() function, but I could
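A sketch of one way to do this with ddply, keeping every column of the selected row untouched (the column names ID and marker are assumptions, since the frame isn't shown):

library(plyr)

# pick, per ID, the row with the smallest (earliest) relevant datetime
earliest <- ddply(df, .(ID), function(x) x[which.min(x$marker), ])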

doing a plyr operation on every row of a data frame in R

佐手、 Submitted on 2019-11-27 04:13:10
Question

I like the plyr syntax. Any time I have to use one of the *apply() commands I end up kicking the dog and going on a 3-day bender. So for the sake of my dog and my liver, what's a concise syntax for doing a ddply operation on every row of a data frame?

Here's an example that works well for a simple case:

x <- rnorm(10)
y <- rnorm(10)
df <- data.frame(x, y)
ddply(df, names(df), function(df) max(df$x, df$y))

That works fine and gives me what I want. But if things get more complex this causes plyr to
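A sketch of the usual plyr idiom for genuinely row-wise work: adply() with .margins = 1 applies the function to each row and avoids grouping on every column:

library(plyr)

x <- rnorm(10)
y <- rnorm(10)
df <- data.frame(x, y)

# apply the function to each row (margin 1) and bind the results back on
adply(df, 1, transform, max_xy = max(x, y))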