plyr

Aggregate rows by shared values in a variable

末鹿安然 submitted on 2019-12-03 20:49:06
I have a somewhat dumb R question. If I have a matrix (or data frame, whichever is easier to work with) like:

    Year   Match
    2008    1808
    2008  137088
    2008       1
    2008   56846
    2007    2704
    2007  169876
    2007   75750
    2006    2639
    2006  193990
    2006       2

And I wanted to sum the match counts for each year (so, e.g., the 2008 row would be 2008 195743), how would I go about doing this? I've got a few solutions in my head, but they are all needlessly complicated, and R tends to have some much easier solution tucked away somewhere. You can generate the same matrix above with:

    structure(c(2008L, 2008L, 2008L, 2008L, 2007L,
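A minimal sketch of one possible approach, using base R's `aggregate`. The data frame is typed in by hand from the values quoted in the question because the `structure()` call is cut off, and the names `df` and `totals` are illustrative only.

```r
# Values copied from the table in the question; df and totals are hypothetical names.
df <- data.frame(
  Year  = c(2008, 2008, 2008, 2008, 2007, 2007, 2007, 2006, 2006, 2006),
  Match = c(1808, 137088, 1, 56846, 2704, 169876, 75750, 2639, 193990, 2)
)

# Sum Match within each Year; the result has one row per distinct Year.
totals <- aggregate(Match ~ Year, data = df, FUN = sum)
print(totals)

# With plyr loaded, an equivalent call would be:
# ddply(df, .(Year), summarise, Match = sum(Match))
```

The formula interface keeps the grouping variable as a column in the result, which is convenient when the totals are merged back into other data.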

Apply a list of n functions to each row of a dataframe?

我的未来我决定 submitted on 2019-12-03 19:13:36
Question: I have a list of functions:

    funs <- list(fn1 = function(x) x^2, fn2 = function(x) x^3, fn3 = function(x) sin(x), fn4 = function(x) x+1) #in reality these are all f = splinefun()

And I have a dataframe:

    mydata <- data.frame(x1 = c(1, 2, 3, 2), x2 = c(3, 2, 1, 0), x3 = c(1, 2, 2, 3), x4 = c(1, 2, 1, 2)) #actually a 500x15 dataframe of 500 samples from 15 parameters

For each of the i rows, I would like to evaluate function j on column j and sum the results:

    unlist(funs)
    attach(mydata)
    a
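One way to pair function j with column j is `mapply`, sketched here in base R on the small example from the question. The names `per_column` and `row_totals` are illustrative, and the sketch assumes every function is vectorised over its input, as the four example functions are.

```r
funs <- list(fn1 = function(x) x^2,
             fn2 = function(x) x^3,
             fn3 = function(x) sin(x),
             fn4 = function(x) x + 1)

mydata <- data.frame(x1 = c(1, 2, 3, 2),
                     x2 = c(3, 2, 1, 0),
                     x3 = c(1, 2, 2, 3),
                     x4 = c(1, 2, 1, 2))

# mapply walks the list of functions and the columns of mydata in parallel,
# so function j is applied to the whole of column j in a single call.
per_column <- mapply(function(f, col) f(col), funs, mydata)

# per_column has one row per row of mydata and one column per function.
row_totals <- rowSums(per_column)
print(row_totals)
```

Because each function is called once per column rather than once per cell, this avoids an explicit loop over rows.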

Summarize dataframe by day from timestamp

走远了吗. submitted on 2019-12-03 18:16:15
Question: I have a dataset data that contains a timestamp and a suite of other variables with values at each timestamp. I am trying to use ddply within plyr to create a new dataframe that is the summary (e.g. mean) of a variable by day. How can I get ddply to group by day? Or how can I create a grouping variable from the day (%d) within the timestamp? The resulting dataframe would consist of the average values per day for each day present in data.

    library(plyr)
    data <- read.csv(
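A minimal sketch of deriving a day column and averaging by it. The question's CSV is not shown, so the data below is made up, and the column names `timestamp`, `value` and `day` are assumptions. Base R's `aggregate` stands in for `ddply` so that the sketch runs without plyr.

```r
# Hypothetical stand-in for the CSV in the question.
data <- data.frame(
  timestamp = as.POSIXct(c("2012-01-01 00:10:00", "2012-01-01 12:00:00",
                           "2012-01-02 06:30:00", "2012-01-02 18:45:00"),
                         tz = "UTC"),
  value = c(1, 3, 10, 20)
)

# Truncate each timestamp to its calendar date; this becomes the grouping variable.
data$day <- as.Date(data$timestamp, tz = "UTC")

# Mean of value for each day present in the data.
daily <- aggregate(value ~ day, data = data, FUN = mean)
print(daily)

# With plyr loaded, the same grouping column could be used as:
# ddply(data, .(day), summarise, value = mean(value))
```

Note that `format(data$timestamp, "%d")` would give only the day of the month, so days from different months would be pooled; `as.Date` keeps them apart.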

How can I use variable names to refer to data frame columns with ddply?

假如想象 submitted on 2019-12-03 16:58:56
I am trying to write a function that takes as arguments the name of a data frame holding time series data and the name of a column in that data frame. The function performs various manipulations on that data, one of which is adding a running total for each year in a column. I am using plyr. When I use the name of the column directly with ddply and cumsum I have no problems:

    require(plyr)
    df <- data.frame(date = seq(as.Date("2007/1/1"), by = "month", length.out = 60), sales = runif(60, min = 700, max = 1200))
    df$year <- as.numeric(format(as.Date(df$date), format="%Y"))
    df <- ddply(df, .(year),
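One way to refer to a column by a name held in a variable is `[[`, sketched below. This swaps in base R's `ave` for the `ddply` call, takes the data frame itself rather than its name, and uses a hypothetical helper `running_total` with an illustrative output column `cum`.

```r
# Hypothetical helper: adds a per-year running total of the column named in col.
running_total <- function(df, col) {
  # df[[col]] looks the column up by the string stored in col.
  df$cum <- ave(df[[col]], df$year, FUN = cumsum)
  df
}

set.seed(1)
df <- data.frame(date = seq(as.Date("2007/1/1"), by = "month", length.out = 60),
                 sales = runif(60, min = 700, max = 1200))
df$year <- as.numeric(format(as.Date(df$date), format = "%Y"))

out <- running_total(df, "sales")
head(out, 13)
```

The running total restarts in row 13, the first month of 2008, because `ave` applies `cumsum` separately within each year.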

Subset a list - a plyr way?

蹲街弑〆低调 submitted on 2019-12-03 16:57:54
I often have data that is grouped by one or more variables, with several registrations within each group. From the data frame, I wish to select groups according to various criteria. I commonly use a split-sapply-rbind approach, where I extract elements from a list using a logical vector. Here is a small example. I start with a data frame with one grouping variable ('group'), and I wish to select groups that have a maximum mass of less than 45:

    dd <- data.frame(group = rep(letters[1:3], each = 5), mass = c(rnorm(5, 30), rnorm(5, 50), rnorm(5, 40)))
    dd2 <- split(x = dd, f = dd$group)
    dd3 <- dd2
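A possible alternative to split-sapply-rbind, sketched in base R: compute the group maximum for every row with `ave`, then subset with an ordinary logical index. The names `group_max` and `selected` are illustrative, and the seed is added only to make the example repeatable.

```r
set.seed(42)
dd <- data.frame(group = rep(letters[1:3], each = 5),
                 mass = c(rnorm(5, 30), rnorm(5, 50), rnorm(5, 40)),
                 stringsAsFactors = FALSE)

# ave returns a vector as long as dd, holding each row's group maximum.
group_max <- ave(dd$mass, dd$group, FUN = max)

# Keep every row that belongs to a group whose maximum mass is below 45.
selected <- dd[group_max < 45, ]
print(unique(selected$group))
```

Since the result is built by row indexing, the original row order and row names are kept, and no list needs to be reassembled.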

difftime between rows using dplyr

痞子三分冷 submitted on 2019-12-03 16:44:28
I'm trying to calculate the time difference between two timestamps in two adjacent rows using the dplyr package. Here's the code:

    tidy_ex <- function () {
      library(dplyr)
      #construct example data
      data <- data.frame(code = c(10888, 10888, 10888, 10888, 10888, 10888, 10889, 10889, 10889, 10889, 10889, 10889, 10890, 10890, 10890),
                         station = c("F1", "F3", "F4", "F5", "L5", "L7", "F1", "F3", "F4", "L5", "L6", "L7", "F1", "F3", "F5"),
                         timestamp = c(1365895151, 1365969188, 1366105495, 1367433149, 1368005216, 1368011698, 1366244224, 1366414926, 1367513240, 1367790556, 1367946420, 1367923973, 1365896546,
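A small sketch of the adjacent-row difference within each `code`, written in base R so that it runs without dplyr. Only the first few `code` and `timestamp` values from the question are used because the full vector is cut off, and the column name `diff_secs` is illustrative.

```r
# Subset of the question's example data (first three rows of code 10888,
# first two rows of code 10889).
data <- data.frame(
  code = c(10888, 10888, 10888, 10889, 10889),
  timestamp = c(1365895151, 1365969188, 1366105495, 1366244224, 1366414926)
)

# Within each code, subtract the previous row's timestamp from the current one.
# The first row of each group has no predecessor, so it gets NA.
data$diff_secs <- ave(data$timestamp, data$code,
                      FUN = function(t) c(NA, diff(t)))
print(data)

# With dplyr loaded, a comparable pipeline would be:
# data %>% group_by(code) %>% mutate(diff_secs = timestamp - lag(timestamp))
```

This assumes the rows are already ordered by time within each `code`; otherwise the data would need sorting first.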

Summary statistics using ddply

瘦欲@ submitted on 2019-12-03 15:44:46
I would like to write a function using ddply that outputs summary statistics based on the names of two columns of the data.frame mat. mat is a big data.frame with columns named "metric", "length", "species", "tree", ..., "index". index is a factor with 2 levels, "Short" and "Long"; "metric", "length", "species", "tree" and the others are all continuous variables. Function:

    summary1 <- function(arg1,arg2) {
      ...
      ss <- ddply(mat, .(index), function(X) data.frame(
        arg1 = as.list(summary(X$arg1)),
        arg2 = as.list(summary(X$arg2))),
        .parallel = FALSE)
      ss
    }

I expect the output to look like this after calling

How to speed up summarise and ddply?

梦想的初衷 submitted on 2019-12-03 14:00:35
Question: I have a data frame with 2 million rows and 15 columns. I want to group by 3 of these columns with ddply (all 3 are factors, and there are 780,000 unique combinations of these factors), and get the weighted mean of 3 columns (with weights defined by my data set). The following is reasonably quick:

    system.time(a2 <- aggregate(cbind(col1,col2,col3) ~ fac1 + fac2 + fac3, data=aggdf, FUN=mean))
       user  system elapsed
     91.358   4.747 115.727

The problem is that I want to use weighted.mean instead of
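One idea that might help, sketched on made-up data: a weighted mean is a ratio of two group sums, sum(w * x) / sum(w), and base R's `rowsum` computes group sums quickly. The weight column `w`, the small data size and the names `key`, `num`, `den` and `wmean` are all assumptions for illustration; no timing is claimed for the 2-million-row case.

```r
set.seed(1)
n <- 1000
# Hypothetical stand-in for aggdf, with one value column and one weight column.
aggdf <- data.frame(fac1 = sample(letters[1:3], n, replace = TRUE),
                    fac2 = sample(LETTERS[1:2], n, replace = TRUE),
                    fac3 = sample(c("x", "y"), n, replace = TRUE),
                    col1 = runif(n),
                    w    = runif(n))

# Collapse the three factors into a single key, dropping unused combinations.
key <- interaction(aggdf$fac1, aggdf$fac2, aggdf$fac3, drop = TRUE)

# Group sums of w * x and of w; their ratio is the weighted mean per group.
num <- rowsum(aggdf$col1 * aggdf$w, key)
den <- rowsum(aggdf$w, key)
wmean <- num / den
head(wmean)
```

The same two `rowsum` calls can be repeated (or given a matrix of several weighted columns) for `col2` and `col3`.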

Errors installing plyr / rcpp

半城伤御伤魂 submitted on 2019-12-03 13:56:01
I have two computers and in one of them I can't manage to install the plyr package for R. This is the error I get:

    * installing *source* package ‘plyr’ ...
    ** package ‘plyr’ successfully unpacked and MD5 sums checked
    ** libs
    g++ -I/usr/share/R/include -DNDEBUG -I"/usr/lib/R/site-library/Rcpp/include" -fpic -O2 -pipe -g -c RcppExports.cpp -o RcppExports.o
    RcppExports.cpp: En la función ‘SEXPREC* plyr_loop_apply(SEXP, SEXP)’:
    RcppExports.cpp:15:9: error: ‘input_parameter’ no es un miembro de ‘Rcpp::traits’
    RcppExports.cpp:15:40: error: expected primary-expression before ‘int’
    RcppExports.cpp:15

Calculate proportions within subsets of a data frame

陌路散爱 submitted on 2019-12-03 13:39:18
I am trying to obtain proportions within subsets of a data frame. For example, in this made-up data frame:

    DF<-data.frame(category1=rep(c("A","B"),each=9), category2=rep(rep(LETTERS[24:26],each=3),2), animal=rep(c("dog","cat","mouse"),6),number=sample(18))

I would like to calculate the proportion of each of the three animals for each category1 by category2 combination (e.g., out of all animals that are both "A" and "X", what proportion are dogs?). With prop.table on column 4 of the data frame I can get the proportion that each row makes up of the total "number" column, but I have not
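A minimal sketch of one way to get within-subset proportions in base R: divide each row's `number` by the total of its category1 by category2 group, obtained with `ave`. The column name `prop` is illustrative, and the seed is added only so that the example is repeatable.

```r
set.seed(1)
DF <- data.frame(category1 = rep(c("A", "B"), each = 9),
                 category2 = rep(rep(LETTERS[24:26], each = 3), 2),
                 animal = rep(c("dog", "cat", "mouse"), 6),
                 number = sample(18))

# ave returns, for every row, the sum of number over that row's
# category1 x category2 combination.
group_total <- ave(DF$number, DF$category1, DF$category2, FUN = sum)

# Proportion of each animal within its combination.
DF$prop <- DF$number / group_total
print(DF)

# With plyr loaded, a comparable call would be:
# ddply(DF, .(category1, category2), transform, prop = number / sum(number))
```

Within each of the six combinations the three proportions add up to 1, which is a quick sanity check on the grouping.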