plyr | 易学教程

Grouping on multiple variables in R

Question: I'm a power Excel pivot table user who is forcing himself to learn R. I know exactly how to do this analysis in Excel, but can't figure out the right way to code it in R. I'm trying to group user data by two different variables, while grouping the variables into ranges (or bins), then summarizing other variables. Here is what the data looks like:

    userid  visits  posts  revenue
    1       25      0      25
    2       2       2      0
    3       86      7      8
    4       128     24     94
    5       30      5      18
    …       …       …      …
    280000  80      10     100
    280001  42      4      25
    280002  31      8      17

Here is what I
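A minimal base R sketch of the bin-then-group idea described in the question, using only the eight rows visible above. The bin boundaries and the names `users`, `visit_bin`, `post_bin` and `by_bins` are assumptions made for illustration; the bins the asker actually wanted are not shown in the excerpt.

```r
# The eight rows visible in the question's sample data
users <- data.frame(
  userid  = c(1, 2, 3, 4, 5, 280000, 280001, 280002),
  visits  = c(25, 2, 86, 128, 30, 80, 42, 31),
  posts   = c(0, 2, 7, 24, 5, 10, 4, 8),
  revenue = c(25, 0, 8, 94, 18, 100, 25, 17)
)

# Hypothetical bins: cut() turns a numeric column into a factor of ranges.
# Intervals are right-closed by default, so 25 visits falls in "0-25".
users$visit_bin <- cut(users$visits, breaks = c(-Inf, 25, 50, Inf),
                       labels = c("0-25", "26-50", "51+"))
users$post_bin  <- cut(users$posts, breaks = c(-Inf, 5, 10, Inf),
                       labels = c("0-5", "6-10", "11+"))

# Summarize revenue for every (visit_bin, post_bin) combination that occurs
by_bins <- aggregate(revenue ~ visit_bin + post_bin, data = users, FUN = sum)
print(by_bins)
```

This roughly mirrors a pivot table with binned row and column fields: `cut()` plays the role of Excel's grouping of a numeric field, and `aggregate()` does the summarizing.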

Producing a rolling average of ALL the previous observations per ID in an unbalanced panel data set

Question: I am trying to compute rolling means of an unbalanced data set. To illustrate my point I have produced this toy example of my data:

    ID  year  Var  RollingAvg(Var)
    1   2000  2    NA
    1   2001  3    2
    1   2002  4    2.5
    1   2003  2    3
    2   2001  2    NA
    2   2002  5    2
    2   2003  4    3.5

The column RollingAvg(Var) is what I want, but can't get. In words, I am looking for the rolling average of ALL the previous observations of Var for each ID. I have tried using rollapply and ddply in the zoo and plyr packages, but I can't see how to
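One way to read "the average of all previous observations" is as a lagged cumulative mean: the cumulative sum up to the previous row, divided by the number of previous rows. The sketch below uses base R only and assumes the rows are already sorted by year within each ID; the names `panel` and `prev_mean` are made up for illustration and this is not necessarily the approach the original answers took.

```r
# Toy data from the question (without the target column)
panel <- data.frame(
  ID   = c(1, 1, 1, 1, 2, 2, 2),
  year = c(2000, 2001, 2002, 2003, 2001, 2002, 2003),
  Var  = c(2, 3, 4, 2, 2, 5, 4)
)

# Mean of all earlier values in x; the first element has no history, so NA.
prev_mean <- function(x) {
  n <- length(x)
  c(NA, cumsum(x)[-n] / seq_len(n - 1))
}

# ave() applies prev_mean within each ID and keeps the original row order
panel$RollingAvg <- ave(panel$Var, panel$ID, FUN = prev_mean)
print(panel)
```

On the toy data this reproduces the RollingAvg(Var) column shown in the question. Because it works on row position rather than on the year values, gaps in the panel (such as ID 2 starting in 2001) do not matter as long as the rows are in time order.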

Am I using plyr right? I seem to be using way too much memory

Question: I have the following, somewhat large dataset:

    > dim(dset)
    [1] 422105 25
    > class(dset)
    [1] "data.frame"
    >

Without doing anything, the R process seems to take about 1 GB of RAM. I am trying to run the following code:

    dset <- ddply(dset, .(tic), transform,
                  date.min <- min(date),
                  date.max <- max(date),
                  daterange <- max(date) - min(date),
                  .parallel = TRUE)

Running that code, RAM usage skyrockets. It completely saturated 60 GB of RAM, running on a 32-core machine. What am I doing wrong?

Answer 1: If

Summary statistics using ddply

Question: I would like to write a function using ddply that outputs summary statistics based on the names of two columns of the data.frame mat. mat is a big data.frame with columns named "metric", "length", "species", "tree", ..., "index". index is a factor with two levels, "Short" and "Long"; "metric", "length", "species", "tree" and the others are all continuous variables. Function:

    summary1 <- function(arg1,arg2) {
      ...
      ss <- ddply(mat, .(index), function(X) data.frame(
        arg1 = as.list(summary(X$arg1)),
        arg2 = as

ddply to multiple columns equivalent in data.table

Question: I am a big fan of the data.table package and I am having trouble converting some ddply code from the plyr package into its data.table equivalent. The code for ddply is:

    dfx <- data.frame(
      group = c(rep('A', 8), rep('B', 15), rep('C', 6)),
      sex = sample(c("M", "F"), size = 29, replace = TRUE),
      age = runif(n = 29, min = 18, max = 54),
      age2 = runif(n = 29, min = 18, max = 54)
    )
    ddply(dfx, .(group, sex), numcolwise(sum))

What I want to do is sum across multiple columns without having to

cumsum using ddply

Question: I need to group by levels with ddply, or aggregate if that's easier. I am not really sure how to do this, as I need to use cumsum as my aggregate function. This is what my data looks like:

    level1  level2  hour  product
    A       tea     0     7
    A       tea     1     2
    A       tea     2     9
    A       coffee  17    7
    A       coffee  18    2
    A       coffee  20    4
    B       coffee  0     2
    B       coffee  1     3
    B       coffee  2     4
    B       tea     21    3
    B       tea     22    1

Expected output:

    A  tea     0   7
    A  tea     1   9
    A  tea     2   18
    A  coffee  17  7
    A  coffee  18  9
    A  coffee  20  13
    B  coffee  0   2
    B  coffee  1   5
    B  coffee  2   9
    B  tea     21  3
    B
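A running total within each level1/level2 group can be sketched in base R with `ave()`, which applies a function per group and returns a vector in the original row order. The data frame name `orders` and the column name `cum_product` are assumptions for illustration; the rows are the eleven shown in the question.

```r
# Sample data from the question
orders <- data.frame(
  level1  = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B"),
  level2  = c("tea", "tea", "tea", "coffee", "coffee", "coffee",
              "coffee", "coffee", "coffee", "tea", "tea"),
  hour    = c(0, 1, 2, 17, 18, 20, 0, 1, 2, 21, 22),
  product = c(7, 2, 9, 7, 2, 4, 2, 3, 4, 3, 1)
)

# Cumulative sum of product within each (level1, level2) group;
# assumes rows are already ordered by hour inside each group.
orders$cum_product <- ave(orders$product, orders$level1, orders$level2,
                          FUN = cumsum)
print(orders)
```

With plyr loaded, `ddply(orders, .(level1, level2), transform, cum_product = cumsum(product))` should give the same per-group totals, though ddply returns the groups sorted by the grouping variables rather than in the input order.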

Using dplyr for exploratory plots

Question: I regularly used d_ply to produce exploratory plots. A trivial example:

    require(plyr)
    require(ggplot2)  # qplot() comes from ggplot2
    plot_species <- function(species_data){
      p <- qplot(data=species_data, x=Sepal.Length, y=Sepal.Width)
      print(p)
    }
    d_ply(.data=iris, .variables="Species", function(x)plot_species(x))

This produces three separate plots, one for each species. I would like to reproduce this behaviour using functions in dplyr. This seems to require the reassembly of the data.frame within the function called by summarise, which is

Combine frequency tables into a single data frame

Question: I have a list in which each item is a word frequency table derived by calling table() on a different sample text. Each table is, therefore, a different length. I now want to convert the list into a single data frame in which each column is a word and each row is a sample text. Here is a dummy example of my data:

    t1<-table(strsplit(tolower("this is a test in the event of a real word file you would see many more words here"), "\\W"))
    t2<-table(strsplit(tolower("Four score and seven years ago

Error when calculating values greater than 95% quantile using plyr

Question: My data is structured as follows:

    Individ <- data.frame(
      Participant = c("Bill", "Bill", "Bill", "Bill", "Bill", "Bill",
                      "Bill", "Bill", "Bill", "Bill", "Bill", "Bill",
                      "Harry", "Harry", "Harry", "Harry", "Harry", "Harry",
                      "Harry", "Harry", "Paul", "Paul", "Paul", "Paul"),
      Time = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4,
               1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
      Condition = c("Placebo", "Placebo", "Placebo", "Placebo",
                    "Expr", "Expr", "Expr", "Expr",
                    "Expr", "Expr", "Expr", "Expr",
                    "Placebo", "Placebo",

Interpolate variables on subsets of dataframe

Question: I have a large dataframe which has observations from surveys from multiple states for several years. Here's the data structure:

    state | survey.year | time1 | obs1 | time2 | obs2
    CA    | 2000        | 1     | 23   | 1.2   | 43
    CA    | 2001        | 2     | 43   | 1.4   | 52
    CA    | 2002        | 5     | 53   | 3.2   | 61
    ...
    CA    | 1998        | 3     | 12   | 2.3   | 20
    CA    | 1999        | 4     | 14   | 2.8   | 25
    CA    | 2003        | 5     | 19   | 4.3   | 29
    ...
    ND    | 2000        | 2     | 223  | 3.2   | 239
    ND    | 2001        | 4     | 233  | 4.2   | 321
    ND    | 2003        | 7     | 256  | 7.9   | 387

For each state/survey.year