plyr

Find number of rows using dplyr/group_by

Submitted by 放肆的年华 on 2019-11-27 00:36:06
I am using the mtcars dataset. I want to find the number of records for a particular combination of data, something very similar to SQL's count(*) with a GROUP BY clause. ddply() from plyr works for me:

    library(plyr)
    ddply(mtcars, .(cyl, gear), nrow)

has output:

      cyl gear V1
    1   4    3  1
    2   4    4  8
    3   4    5  2
    4   6    3  2
    5   6    4  4
    6   6    5  1
    7   8    3 12
    8   8    5  2

Using this code:

    library(dplyr)
    g <- group_by(mtcars, cyl, gear)
    summarise(g, length(gear))

has output:

      length(cyl)
    1          32

I found various functions to pass in to summarise(), but none seem to work for me. One function I found is sum(G), which returned Error in
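A minimal sketch of the idiomatic dplyr answer: use n() inside summarise() to count rows per group, or the count() shorthand. The `.groups = "drop"` argument assumes a recent dplyr version (1.0 or later).

```r
library(dplyr)

# Count rows per (cyl, gear) combination; n() returns the group size
counts <- mtcars %>%
  group_by(cyl, gear) %>%
  summarise(n = n(), .groups = "drop")

# count() is shorthand for the same group_by() + summarise(n = n())
counts2 <- count(mtcars, cyl, gear)
```

Both produce the same eight rows as the ddply() output above, one per cyl/gear combination.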

using predict with a list of lm() objects

Submitted by 有些话、适合烂在心里 on 2019-11-27 00:23:40
Question: I have data which I regularly run regressions on. Each "chunk" of data gets fit with a different regression; each state, for example, might have a different function that explains the dependent value. This seems like a typical "split-apply-combine" type of problem, so I'm using the plyr package. I can easily create a list of lm() objects, which works well. However, I can't quite wrap my head around how I use those objects later to predict values in a separate data.frame. Here's a totally contrived
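One way this is commonly done, sketched with made-up data (the state/x/y column names are illustrative, not from the question): fit one model per group with dlply(), which returns a list named by the split values, then let ddply() split the new data the same way and look up each chunk's model by name.

```r
library(plyr)

set.seed(1)
d <- data.frame(state = rep(c("NY", "CA"), each = 10),
                x = rnorm(20))
d$y <- 2 * d$x + rnorm(20)

# One lm() per state; the list elements are named after the state values
models <- dlply(d, .(state), function(df) lm(y ~ x, data = df))

newdata <- data.frame(state = rep(c("NY", "CA"), each = 3),
                      x = c(-1, 0, 1))

# Split newdata by state and predict each chunk with its own model
preds <- ddply(newdata, .(state), function(df) {
  transform(df, pred = predict(models[[as.character(df$state[1])]], newdata = df))
})
```

The key detail is that dlply() and ddply() split on the same variable, so the list lookup pairs each chunk with the right model.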

Efficient alternatives to merge for larger data.frames R

Submitted by ⅰ亾dé卋堺 on 2019-11-27 00:11:18
Question: I am looking for an efficient (both computer-resource-wise and learning/implementation-wise) method to merge two larger (size > 1 million / 300 KB RData file) data frames. merge in base R and join in plyr appear to use up all my memory, effectively crashing my system. Example: load the test data frame and try

    test.merged <- merge(test, test)

or

    test.merged <- join(test, test, type = "all")

The following post provides a list of merge alternatives: How to join (merge) data frames (inner, outer,
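One frequently suggested alternative (a sketch, with illustrative column names): keyed data.table joins, which avoid some of the intermediate copies base merge() makes on data frames.

```r
library(data.table)

a <- data.table(id = 1:1e5, x = rnorm(1e5))
b <- data.table(id = 1:1e5, y = rnorm(1e5))

# Key both tables on the join column, then join with the [ operator
setkey(a, id)
setkey(b, id)
merged <- a[b]                       # join on the shared key

# data.table also provides its own merge() method
merged2 <- merge(a, b, by = "id")
```

Whether this fits under a given memory budget still depends on the data, but data.table is the usual first stop for large in-memory joins in R.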

R: speeding up “group by” operations

Submitted by ∥☆過路亽.° on 2019-11-26 23:41:41
I have a simulation that has a huge aggregate-and-combine step right in the middle. I prototyped this process using plyr's ddply() function, which works great for a huge percentage of my needs. But I need this aggregation step to be faster since I have to run 10K simulations. I'm already scaling the simulations in parallel, but if this one step were faster I could greatly decrease the number of nodes I need. Here's a reasonable simplification of what I am trying to do:

    library(Hmisc)
    # Set up some example data
    year   <- sample(1970:2008, 1e6, rep = TRUE)
    state  <- sample(1:50, 1e6, rep = TRUE)
    group1 <-
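A common speed-up for exactly this shape of problem (not the only one) is data.table's grouped aggregation. The sketch below mirrors the example data above; the plain mean stands in for whatever Hmisc-based summary the real simulation computes.

```r
library(data.table)

set.seed(42)
n  <- 1e6
dt <- data.table(year  = sample(1970:2008, n, replace = TRUE),
                 state = sample(1:50, n, replace = TRUE),
                 value = rnorm(n))

# Grouped aggregation in C-level code; .N is the per-group row count
agg <- dt[, .(mean_value = mean(value), n = .N), by = .(year, state)]
```

For a million rows this typically runs in a fraction of the time an equivalent ddply() call takes, since the grouping is done internally rather than by materializing one sub-data-frame per group.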

Sending in Column Name to ddply from Function

Submitted by 人走茶凉 on 2019-11-26 23:13:03
Question: I'd like to be able to send in a column name to a call that I am making to ddply. An example ddply call:

    ddply(myData, .(MyGrouping), summarise, count = sum(myColumnName))

If I have ddply wrapped within another function, is it possible to wrap this so that I can pass in an arbitrary value as myColumnName to the calling function?

Answer 1: There has got to be a better way. And I couldn't figure out how to make it work with summarise.

    my.fun <- function(df, count.column) {
      ddply(df, .(x), function(d)
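A sketch of the workaround that answer is heading toward: sidestep summarise()'s non-standard evaluation by indexing the column by name inside an anonymous function. The data frame, grouping column x, and value column v below are illustrative, not from the question.

```r
library(plyr)

# count.column is a plain string, so it survives being passed through a wrapper
count_by_x <- function(df, count.column) {
  ddply(df, .(x), function(d) data.frame(count = sum(d[[count.column]])))
}

d <- data.frame(x = c("a", "a", "b"), v = c(1, 2, 10))
count_by_x(d, "v")
```

Because `d[[count.column]]` is ordinary list indexing, no deparse/substitute tricks are needed.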

Blend of na.omit and na.pass using aggregate?

Submitted by 一笑奈何 on 2019-11-26 23:10:38
Question: I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:

    > test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
    +                    var1 = rep(c(1:3, NA), 3),
    +                    var2 = 1:12,
    +                    var3 = c(rep(NA, 4), 1:8))
    > test
       name var1 var2 var3
    1     A    1    1   NA
    2     A    2    2   NA
    3     A    3    3   NA
    4     A   NA    4   NA
    5     B    1    5    1
    6     B    2    6    2
    7     B    3    7    3
    8     B   NA    8    4
    9     C    1    9    5
    10    C    2   10    6
    11    C    3   11    7
    12    C   NA   12    8

In the past, I've
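One hedged way to get the blend the title asks for: pass na.action = na.pass so aggregate() keeps rows with NAs, and let the summary function drop NAs itself. Each column then uses its own effective sample size, and an all-NA group (var3 for lot A) comes back as NaN rather than silently dropping the lot.

```r
test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
                   var1 = rep(c(1:3, NA), 3),
                   var2 = 1:12,
                   var3 = c(rep(NA, 4), 1:8))

# na.pass keeps incomplete rows; mean(..., na.rm = TRUE) handles NAs per column
means <- aggregate(cbind(var1, var2, var3) ~ name, data = test,
                   FUN = function(v) mean(v, na.rm = TRUE),
                   na.action = na.pass)
```

With the default na.action (na.omit), any row containing an NA in any variable would be discarded before aggregation, which is usually not what's wanted here.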

Apply t-test on many columns in a dataframe split by factor

Submitted by 狂风中的少年 on 2019-11-26 23:04:42
Question: I have a dataframe with one factor column with two levels and many numeric columns. I want to split the dataframe by the factor column and run a t-test on the column pairs. Using the example dataset Puromycin, I want the result to look something like this:

    Variable  Treated  Untreated  p-value  Test-statistic  CI of difference
    Conc      0.3450   0.2763     XXX      T               XX - XX
    Rate      141.58   110.7272   xxx      T               XX - XX

I think I am looking for a solution using plyr that can output the above results in a nice dataframe.
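A base-R sketch of the same idea (without plyr, so treat it as one possible shape rather than the canonical answer): loop over the numeric columns, run t.test() against the two-level factor, and bind one result row per variable.

```r
# Puromycin has numeric columns conc and rate and a two-level factor, state
res <- do.call(rbind, lapply(c("conc", "rate"), function(col) {
  tt <- t.test(Puromycin[[col]] ~ Puromycin$state)
  data.frame(variable  = col,
             treated   = tt$estimate[1],   # group mean, first factor level
             untreated = tt$estimate[2],
             p.value   = tt$p.value,
             statistic = tt$statistic,
             row.names = NULL)
}))
```

The confidence interval of the difference is available as tt$conf.int and could be added as two more columns the same way.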

Remove group from data.frame if at least one group member meets condition

Submitted by 我的梦境 on 2019-11-26 22:13:23
Question: I have a data.frame where I'd like to remove entire groups if any of their members meets a condition. In this first example, where the values are numbers and the condition is NA, the code below works.

    df <- structure(list(world = c(1, 2, 3, 3, 2, NA, 1, 2, 3, 2),
                         place = c(1, 1, 2, 2, 3, 3, 1, 2, 3, 1),
                         group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3)),
                    .Names = c("world", "place", "group"),
                    row.names = c(NA, -10L), class = "data.frame")

    ans <- ddply(df, .(group), summarize, code = mean(world))
    ans$code[is
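The filtering step itself can be sketched without plyr using base ave(): flag every row whose group contains at least one NA in world, then drop those rows. The data below recreates the question's df with data.frame() for readability.

```r
df <- data.frame(world = c(1, 2, 3, 3, 2, NA, 1, 2, 3, 2),
                 place = c(1, 1, 2, 2, 3, 3, 1, 2, 3, 1),
                 group = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 3))

# ave() broadcasts the per-group test back to every row (1 = group has an NA)
has_na <- ave(df$world, df$group, FUN = function(v) any(is.na(v)))
clean  <- df[has_na == 0, ]
```

Group 2 contains the NA, so all four of its rows are removed; groups 1 and 3 survive intact. For an arbitrary condition, replace `is.na(v)` with the predicate of interest.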

Joining aggregated values back to the original data frame [duplicate]

Submitted by 本小妞迷上赌 on 2019-11-26 22:11:57
This question already has an answer here: Calculate group mean (or other summary stats) and assign to original data (4 answers). One of the design patterns I use over and over is performing a "group by" or "split, apply, combine" (SAC) on a data frame and then joining the aggregated data back to the original data. This is useful, for example, when calculating each county's deviation from the state mean in a data frame with many states and counties. Rarely is my aggregate calculation only a simple mean, but it makes a good example. I often solve this problem the following way:

    require(plyr)
    set
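For the simple-mean case, the aggregate-and-join collapses to one step with base ave(), which computes a per-group summary and broadcasts it back onto every row. The state/value names below loosely mirror the state/county example and are illustrative.

```r
set.seed(1)
d <- data.frame(state = rep(c("NY", "CA"), each = 5),
                value = rnorm(10))

# ave() returns a vector the same length as the input; default FUN is mean
d$state_mean <- ave(d$value, d$state)
d$deviation  <- d$value - d$state_mean
```

For summaries more complex than a mean, the usual generalizations are plyr's ddply with transform/mutate, or merging a summarised table back on the grouping key.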

Why is plyr so slow?

Submitted by 我与影子孤独终老i on 2019-11-26 21:58:24
I think I am using plyr incorrectly. Could someone please tell me if this is 'efficient' plyr code?

    require(plyr)
    plyr <- function(dd) ddply(dd, .(price), summarise, ss = sum(volume))

A little context: I have a few large aggregation problems, and I have noted that they were each taking some time. In trying to solve the issues, I became interested in the performance of various aggregation procedures in R. I tested a few aggregation methods and found myself waiting around all day. When I finally got results back, I discovered a huge gap between the plyr method and the others, which makes me
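A small benchmark sketch contrasting that ddply() call with the equivalent data.table aggregation on generated data (the price/volume layout follows the question; exact timings vary by machine and package versions). The gap usually comes from ddply materializing one sub-data-frame per group, not from any misuse.

```r
library(plyr)
library(data.table)

set.seed(1)
n  <- 1e5
dd <- data.frame(price  = sample(1:1000, n, replace = TRUE),
                 volume = runif(n))

# The question's approach: one data.frame split per price level
t_plyr <- system.time(r1 <- ddply(dd, .(price), summarise, ss = sum(volume)))

# Same aggregation via data.table's internal grouping
t_dt <- system.time(r2 <- as.data.table(dd)[, .(ss = sum(volume)), by = price])
```

Both results agree row for row; on data this size the data.table call is typically one to two orders of magnitude faster.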