plyr

How does one aggregate and summarize data quickly?

那年仲夏 submitted on 2019-11-30 05:03:06
I have a dataset whose headers look like this: PID Time Site Rep Count. I want to sum Count by Rep for each PID x Time x Site combination; on the resulting data.frame, I want the mean of Count for each PID x Time x Site combination. My current function is as follows:

```r
dummy <- function(data) {
  A <- aggregate(Count ~ PID + Time + Site + Rep, data = data,
                 function(x) sum(na.omit(x)))
  B <- aggregate(Count ~ PID + Time + Site, data = A, mean)
  return(B)
}
```

This is painfully slow (the original data.frame is 510000 x 20). Is there a way to speed this up with plyr?

Answer (Ramnath): You should look at the package data.table for faster aggregation
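A minimal sketch of the data.table approach the answer points to, on invented toy data (only the column names PID, Time, Site, Rep, Count come from the question; everything else is illustrative):

```r
library(data.table)

# Toy stand-in for the 510000 x 20 original.
dt <- data.table(PID   = rep(1:2, each = 4),
                 Time  = 1L, Site = 1L,
                 Rep   = rep(1:2, 4),
                 Count = c(1, 2, NA, 4, 5, 6, 7, 8))

# Step 1: sum Count by Rep within each PID x Time x Site combination.
sums <- dt[, .(Count = sum(Count, na.rm = TRUE)),
           by = .(PID, Time, Site, Rep)]

# Step 2: mean of those sums per PID x Time x Site.
res <- sums[, .(Count = mean(Count)), by = .(PID, Time, Site)]
```

Both passes are keyed grouping operations, which is where data.table gains most of its speed over nested `aggregate()` calls.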

How to get the name of a data.frame within a list?

走远了吗. submitted on 2019-11-30 04:20:03
Question: How can I get a data frame's name from a list? Sure, get() gets the object itself, but I want its name for use within another function. Here's the use case, in case you would rather suggest a workaround:

```r
lapply(somelistOfDataframes, function(X) {
  ddply(X, .(idx, bynameofX), summarise, checkSum = sum(value))
})
```

There is a column in each data frame that goes by the same name as the data frame within the list. How can I get this name bynameofX? names(X) would return the whole vector.
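One common workaround is to iterate over the *names* of the list rather than its elements, so each data frame's name is in scope inside the function. A sketch with invented list and column names:

```r
library(plyr)

somelist <- list(
  a = data.frame(idx = c(1, 1, 2), a = "x", value = c(10, 20, 30)),
  b = data.frame(idx = c(1, 2, 2), b = "y", value = c(1, 2, 3))
)

# Loop over names; nm doubles as the matching column name.
res <- lapply(names(somelist), function(nm) {
  X <- somelist[[nm]]
  ddply(X, c("idx", nm), summarise, checkSum = sum(value))
})
```

Passing the grouping variables as a character vector (`c("idx", nm)`) sidesteps the non-standard evaluation of `.()`.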

Using ddply to apply a function to a group of rows

北城以北 submitted on 2019-11-30 03:15:22
Question: I use ddply quite a bit but I do not consider myself an expert. I have a data frame (df) with a grouping variable "Group", which has values of "A", "B" and "C", and a numeric variable to summarize, "Var". If I use

```r
ddply(df, .(Group), summarize, mysum = sum(Var))
```

then I get the sum for each of A, B and C, which is correct. But what I want is to sum over each grouping of the Group variable as it is arranged in the data frame. For instance, if the data frame has Group Var A 1.3 A 1
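Summing over consecutive *runs* of Group (rather than over all rows sharing a value) can be sketched by labelling each run with rle() and grouping on that label; the data here are invented:

```r
library(plyr)

df <- data.frame(Group = c("A", "A", "B", "B", "A"),
                 Var   = c(1.3, 1.0, 2.0, 3.0, 0.5))

# Number each run of consecutive identical Group values.
runs   <- rle(as.character(df$Group))
df$run <- rep(seq_along(runs$lengths), runs$lengths)

res <- ddply(df, .(run, Group), summarize, mysum = sum(Var))
```

With this input the trailing "A" run is summed separately from the leading one, which is the behaviour the asker describes.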

subset parameter in layers is no longer working with ggplot2 >= 2.0.0

a 夏天 submitted on 2019-11-30 03:00:25
Question: I updated to the newest version of ggplot2 and ran into problems when plotting subsets in a layer.

```r
library(ggplot2)
library(plyr)
df <- data.frame(x = runif(100), y = runif(100))
ggplot(df, aes(x, y)) + geom_point(subset = .(x >= .5))
```

These lines of code worked in version 1.0.1 but not in 2.0.0; they now throw the error Error: Unknown parameters: subset. I couldn't find an official changelog or a way to subset specific layers. Especially because this plyr solution was not very well documented, I think
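In ggplot2 >= 2.0.0 the `subset` layer argument was removed; the usual replacement is to hand the layer pre-filtered data via its `data` argument. A sketch of the same plot rewritten that way:

```r
library(ggplot2)

df <- data.frame(x = runif(100), y = runif(100))

# Filter the data before it reaches the layer instead of
# passing subset = .(x >= .5) to geom_point().
p <- ggplot(df, aes(x, y)) +
  geom_point(data = subset(df, x >= 0.5))
```

This keeps the per-layer subsetting behaviour without any plyr dependency.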

R: Generic flattening of JSON to data.frame

南笙酒味 submitted on 2019-11-30 02:16:27
This question is about a generic mechanism for converting any collection of non-cyclical, homogeneous or heterogeneous data structures into a data.frame. This can be particularly useful when ingesting many JSON documents, or a large JSON document that is an array of dictionaries. There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into data.frames using functionality such as plyr, lapply, etc. All the questions and answers I have found are about specific cases as opposed to offering a general approach for dealing with
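One widely used generic approach (not necessarily the one the asker settled on) is jsonlite's `fromJSON` with `flatten = TRUE`, which simplifies an array of dictionaries into a data.frame and unnests nested objects into dotted column names. The JSON string here is invented:

```r
library(jsonlite)

json <- '[{"id": 1, "user": {"name": "a", "age": 30}},
          {"id": 2, "user": {"name": "b", "age": 40}}]'

# Array of objects -> one row per object; nested "user"
# dictionaries become user.name and user.age columns.
df <- fromJSON(json, flatten = TRUE)
```

This handles heterogeneous documents too: keys missing from some objects come back as NA in the corresponding rows.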

Mean of elements in a list of data.frames

自作多情 submitted on 2019-11-29 23:32:12
Suppose I had a list of data.frames (of equal rows and columns):

```r
dat1 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat2 <- as.data.frame(matrix(rnorm(25), ncol = 5))
dat3 <- as.data.frame(matrix(rnorm(25), ncol = 5))
all.dat <- list(dat1 = dat1, dat2 = dat2, dat3 = dat3)
```

How can I return a single data.frame that is the element-wise mean (or sum, etc.) across the list (e.g., the mean of the first-row, first-column values from lists 1, 2 and 3, and so on)? I have tried lapply and ldply in plyr, but these return the statistic for each data.frame within the list. Edit: For some reason, this was retagged
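Because the data.frames are all the same shape, the element-wise mean can be sketched with base R's `Reduce`: add the frames position by position, then divide by the list length (deterministic toy values here instead of rnorm, so the result is checkable):

```r
dat1 <- as.data.frame(matrix(1:25, ncol = 5))
dat2 <- as.data.frame(matrix(26:50, ncol = 5))
all.dat <- list(dat1 = dat1, dat2 = dat2)

# `+` on equally shaped data.frames is element-wise,
# so this is the cell-by-cell mean across the list.
mean.dat <- Reduce(`+`, all.dat) / length(all.dat)
```

For sums, drop the division; for other statistics, `Reduce` can be swapped for apply over a 3-d array built with `simplify2array`.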

Calculating hourly averages from a multi-year timeseries

只谈情不闲聊 submitted on 2019-11-29 22:09:47
Question: I have a dataset filled with the average wind speed per hour for multiple years. I would like to create an 'average year', in which for each hour the average wind speed for that hour over the multiple years is calculated. How can I do this without looping endlessly through the dataset? Ideally, I would like to loop through the data just once, extracting for each row the right month, day, and hour, and adding the wind speed from that row to the right row in a data frame where the aggregates for each
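A loop-free sketch: build a month-day-hour key and let `aggregate()` do the grouping, so the same calendar hour is pooled across all years. Column names (`time`, `windspeed`) and the two-year span are invented for illustration:

```r
# Two years of hourly timestamps (both non-leap years).
ts  <- seq(as.POSIXct("2018-01-01 00:00", tz = "UTC"),
           as.POSIXct("2019-12-31 23:00", tz = "UTC"), by = "hour")
dat <- data.frame(time = ts, windspeed = runif(length(ts), 0, 20))

# One key per calendar hour of the year ("MM-DD HH");
# averaging over it pools that hour across all years.
dat$hourkey <- format(dat$time, "%m-%d %H")
avg_year   <- aggregate(windspeed ~ hourkey, data = dat, FUN = mean)
```

The result has one row per hour of the "average year" (8760 here, since neither sample year is a leap year).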

Using ddply inside a function

回眸只為那壹抹淺笑 submitted on 2019-11-29 21:28:46
Question: I'm trying to make a function using ddply inside of it. However, I can't get it to work. This is a dummy example reproducing what I get. Does this have anything to do with this bug?

```r
library(ggplot2)
data(diamonds)
foo <- function(data, fac1, fac2, bar) {
  res <- ddply(data, .(fac1, fac2), mean(bar))
  res
}
foo(diamonds, "color", "cut", "price")
```

Answer 1: I don't believe this is a bug. ddply expects the name of a function, which you haven't really supplied with mean(bar). You need to write a complete
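A working sketch of the answer's point: pass the grouping variables as a character vector and give ddply an actual function, indexing the summary column by name:

```r
library(plyr)
library(ggplot2)
data(diamonds)

foo <- function(data, fac1, fac2, bar) {
  # c(fac1, fac2): character grouping avoids .()'s
  # non-standard evaluation; the anonymous function
  # replaces the bare expression mean(bar).
  ddply(data, c(fac1, fac2), function(d) c(mean = mean(d[[bar]])))
}

res <- foo(diamonds, "color", "cut", "price")
```

Here `.(fac1, fac2)` would have looked for columns literally named fac1 and fac2, which is the other half of the original failure.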

Updated: Plyr rename() not recognizing identical 'x'; Error: The following `from` values were not present in `x`:

扶醉桌前 submitted on 2019-11-29 17:48:37
R 3.2.4, plyr updated 2016-03-10. I am trying to rename columns in a large data set and am running into the error "The following `from` values were not present in `x`:". The column names from the original export are atrocious, which is why I'm using plyr's rename, but it seems that even rename is having trouble. An example trouble column is [,3] in the linked data set, titled: "Experimental.or.quasi.experimental..evaluation..compares.mentored.youth.to.a.comparison.or.â.œcontrolâ...group.of.non.mentored.youth..NS8" Download link to csv here : Code below:

```r
test <- read.csv(file = "test.csv", header = TRUE)
library(plyr)
```
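That error means the `from` string doesn't byte-for-byte match the column name (easy with mangled-encoding names like the one above). One sketch of a workaround: copy the name straight out of `names()` so the match is exact, or skip string matching and rename by position. The data frame and new name here are placeholders:

```r
library(plyr)

test <- data.frame(a = 1, b = 2, bad.name.â.œ = 3)  # stand-in for test.csv

# Exact-match rename: take `from` verbatim from names(test),
# so encoding quirks can't cause a mismatch.
from <- names(test)[3]
test <- rename(test, setNames("experimental_eval", from))

# Or simply assign by position:
# names(test)[3] <- "experimental_eval"
```

plyr's `rename(x, replace)` takes a named character vector with names = old and values = new, which is what `setNames()` builds here.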

Subset data based on Minimum Value

旧巷老猫 submitted on 2019-11-29 17:19:55
This might be an easy one. Here's the data:

```r
dat <- read.table(header = TRUE, text = "
Seg    ID   Distance
Seg46  V21  160.37672
Seg72  V85  191.24400
Seg373 V85  167.38930
Seg159 V147 14.74852
Seg233 V171 193.01636
Seg234 V171 200.21458
")
```

I am intending to get a table like the following, giving the Seg with the minimum Distance (since duplication is seen in ID):

Seg    Crash_ID  Distance
Seg46  V21       160.37672
Seg373 V85       167.38930
Seg159 V147      14.74852
Seg233 V171      193
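A sketch with ddply (consistent with the rest of this page; dplyr or base `tapply` would work just as well): keep, per ID, only the row whose Distance is the minimum.

```r
library(plyr)

dat <- read.table(header = TRUE, text = "
Seg    ID   Distance
Seg46  V21  160.37672
Seg72  V85  191.24400
Seg373 V85  167.38930
Seg159 V147 14.74852
Seg233 V171 193.01636
Seg234 V171 200.21458
")

# which.min() picks the row index of the smallest Distance
# within each ID group; the whole row is kept.
res <- ddply(dat, .(ID), function(d) d[which.min(d$Distance), ])
```

For duplicated IDs (V85, V171) only the closer Seg survives, matching the desired table.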