plyr | 易学教程

Aggregate rows by shared values in a variable

I have a somewhat dumb R question. If I have a matrix (or dataframe, whichever is easier to work with) like:

Year  Match
2008  1808
2008  137088
2008  1
2008  56846
2007  2704
2007  169876
2007  75750
2006  2639
2006  193990
2006  2

And I wanted to sum each of the match counts for the years (so, e.g., the 2008 row would be 2008 195743), how would I go about doing this? I've got a few solutions in my head, but they are all needlessly complicated, and R tends to have some much easier solution tucked away somewhere. You can generate the same matrix above with: structure(c(2008L, 2008L, 2008L, 2008L, 2007L,
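One possible way to get the per-year totals is sketched below in base R. The excerpt's own structure() call is cut off, so the data frame here is rebuilt from the table shown in the question. The names df and totals are illustrative, not from the original post.

```r
# Rebuild the example data from the table in the question
# (the original structure() call is truncated, so this is a reconstruction).
df <- data.frame(
  Year  = c(2008L, 2008L, 2008L, 2008L, 2007L, 2007L, 2007L, 2006L, 2006L, 2006L),
  Match = c(1808L, 137088L, 1L, 56846L, 2704L, 169876L, 75750L, 2639L, 193990L, 2L)
)

# Sum Match within each Year; the formula reads "Match grouped by Year".
totals <- aggregate(Match ~ Year, data = df, FUN = sum)
print(totals)
```

The 2008 row of totals should match the 195743 given in the question. With plyr loaded, ddply(df, .(Year), summarise, Match = sum(Match)) would be an equivalent formulation.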

Apply a list of n functions to each row of a dataframe?

I have a list of functions funs <- list(fn1 = function(x) x^2, fn2 = function(x) x^3, fn3 = function(x) sin(x), fn4 = function(x) x+1) #in reality these are all f = splinefun() And I have a dataframe: mydata <- data.frame(x1 = c(1, 2, 3, 2), x2 = c(3, 2, 1, 0), x3 = c(1, 2, 2, 3), x4 = c(1, 2, 1, 2)) #actually a 500x15 dataframe of 500 samples from 15 parameters For each of i rows, I would like to evaluate function j on each of the j columns and sum the results: unlist(funs) attach(mydata) a

Summarize dataframe by day from timestamp

I have a dataset, data, that contains a timestamp and a suite of other variables with values at each timestamp. I am trying to use ddply within plyr to create a new dataframe that is the summary (e.g. mean) of a variable by the group day. How can I get ddply to group by day? Or how can I create a grouping variable from the day (%d) within the timestamp? The result dataframe would consist of the average values per day for each day present in data. library(plyr) data <- read.csv(
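The general idea of deriving a day column and grouping on it can be sketched as follows. The original read.csv() call is cut off, so the data below are made up, and the column names timestamp, value and day are assumptions. The sketch uses base R's aggregate() rather than ddply so that it runs without extra packages.

```r
# Hypothetical stand-in for the question's data (the real file is not shown).
data <- data.frame(
  timestamp = as.POSIXct(c("2013-01-01 08:00:00", "2013-01-01 20:00:00",
                           "2013-01-02 09:30:00", "2013-01-02 21:30:00"),
                         tz = "UTC"),
  value = c(1, 3, 10, 20)
)

# Derive a grouping variable from the timestamp. The full date is used rather
# than "%d" alone, so that the same day number in different months is not merged.
data$day <- format(data$timestamp, "%Y-%m-%d")

# Mean of value for each day present in the data.
daily <- aggregate(value ~ day, data = data, FUN = mean)
print(daily)
```

Once the day column exists, it can also serve as the grouping variable in plyr, for example ddply(data, .(day), summarise, value = mean(value)).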

How can I use variable names to refer to data frame columns with ddply?

I am trying to write a function that takes as arguments the name of a data frame holding time series data and the name of a column in that data frame. The function performs various manipulations on that data, one of which is adding a running total for each year in a column. I am using plyr. When I use the name of the column directly with ddply and cumsum I have no problems: require(plyr) df <- data.frame(date = seq(as.Date("2007/1/1"), by = "month", length.out = 60), sales = runif(60, min = 700, max = 1200)) df$year <- as.numeric(format(as.Date(df$date), format="%Y")) df <- ddply(df, .(year),

Subset a list - a plyr way?

I often have data that is grouped by one or more variables, with several registrations within each group. From the data frame, I wish to select groups according to various criteria. I commonly use a split-sapply-rbind approach, where I extract elements from a list using a logical vector. Here is a small example. I start with a data frame with one grouping variable ('group'), and I wish to select groups that have a maximum mass of less than 45: dd <- data.frame(group = rep(letters[1:3], each = 5), mass = c(rnorm(5, 30), rnorm(5, 50), rnorm(5, 40))) dd2 <- split(x = dd, f = dd$group) dd3 <- dd2

difftime between rows using dplyr

I'm trying to calculate the time difference between two timestamps in two adjacent rows using the dplyr package. Here's the code: tidy_ex <- function () { library(dplyr) #construct example data data <- data.frame(code = c(10888, 10888, 10888, 10888, 10888, 10888, 10889, 10889, 10889, 10889, 10889, 10889, 10890, 10890, 10890), station = c("F1", "F3", "F4", "F5", "L5", "L7", "F1", "F3", "F4", "L5", "L6", "L7", "F1", "F3", "F5"), timestamp = c(1365895151, 1365969188, 1366105495, 1367433149, 1368005216, 1368011698, 1366244224, 1366414926, 1367513240, 1367790556, 1367946420, 1367923973, 1365896546,

Summary statistics using ddply

I would like to write a function using ddply that outputs summary statistics based on the names of two columns of the data.frame mat. mat is a big data.frame with the columns "metric", "length", "species", "tree", ..., "index". index is a factor with 2 levels, "Short" and "Long"; "metric", "length", "species", "tree" and the others are all continuous variables. Function: summary1 <- function(arg1,arg2) { ... ss <- ddply(mat, .(index), function(X) data.frame( arg1 = as.list(summary(X$arg1)), arg2 = as.list(summary(X$arg2)), .parallel = FALSE) ss } I expect the output to look like this after calling

How to speed up summarise and ddply?

I have a data frame with 2 million rows and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors), and get the weighted mean of 3 columns (with weights defined by my data set). The following is reasonably quick: system.time(a2 <- aggregate(cbind(col1,col2,col3) ~ fac1 + fac2 + fac3, data=aggdf, FUN=mean)) user system elapsed 91.358 4.747 115.727 The problem is that I want to use weighted.mean instead of

Errors installing plyr / rcpp

I have two computers and in one of them I can't manage to install the plyr package for R. This is the error I get: * installing *source* package ‘plyr’ ... ** package ‘plyr’ successfully unpacked and MD5 sums checked ** libs g++ -I/usr/share/R/include -DNDEBUG -I"/usr/lib/R/site-library/Rcpp/include" -fpic -O2 -pipe -g -c RcppExports.cpp -o RcppExports.o RcppExports.cpp: En la función ‘SEXPREC* plyr_loop_apply(SEXP, SEXP)’: RcppExports.cpp:15:9: error: ‘input_parameter’ no es un miembro de ‘Rcpp::traits’ RcppExports.cpp:15:40: error: expected primary-expression before ‘int’ RcppExports.cpp:15

Calculate proportions within subsets of a data frame

I am trying to obtain proportions within subsets of a data frame. For example, in this made-up data frame: DF<-data.frame(category1=rep(c("A","B"),each=9), category2=rep(rep(LETTERS[24:26],each=3),2), animal=rep(c("dog","cat","mouse"),6),number=sample(18)) I would like to calculate the proportion of each of the three animals for each category1 by category2 combination (e.g., out of all animals that are both "A" and "X", what proportion are dogs?). With prop.table on column 4 of the data frame I can get the proportion that each row makes up of the total "number" column, but I have not
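One way to compute such within-group proportions is sketched below using base R's ave(), which returns a group-level summary aligned with the original rows. The set.seed() call and the prop column name are additions for illustration. They are not part of the question.

```r
# The question's made-up data frame; set.seed() is added only so the
# random "number" column is reproducible.
set.seed(1)
DF <- data.frame(category1 = rep(c("A", "B"), each = 9),
                 category2 = rep(rep(LETTERS[24:26], each = 3), 2),
                 animal    = rep(c("dog", "cat", "mouse"), 6),
                 number    = sample(18))

# ave() computes the sum of "number" within each category1 x category2 group
# and repeats it on every row of that group, so the division is row-wise.
DF$prop <- DF$number / ave(DF$number, DF$category1, DF$category2, FUN = sum)

# Sanity check: proportions within each group add up to 1.
print(tapply(DF$prop, list(DF$category1, DF$category2), sum))
```

With plyr, a comparable result could be obtained via ddply(DF, .(category1, category2), transform, prop = number / sum(number)).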