plyr

ddply run in a function looks in the environment outside the function?

☆樱花仙子☆ Submitted on 2019-12-11 01:37:39
Question: I'm trying to write a function to do some often-repeated analysis, and one part of this is to count the number of groups and the number of members within each group, so ddply to the rescue! However, my code has a problem... Here is some example data: > dput(BGBottles) structure(list(Machine = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 3L, 3L, 3L, 4L, 4L, 4L), .Label = c("1", "2", "3", "4"), class = "factor"), weight = c(14.23, 14.96, 14.85, 16.46, 16.74, 15.94, 14.98, 14.88, 14.87, 15.94, 16.07, 14.91
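A sketch of the usual fix for this scoping problem, using the data reconstructed from the truncated dput() above: pass the grouping column to ddply as a character vector, so it is evaluated against the data-frame argument rather than against the function's calling environment.

```r
library(plyr)

# Data reconstructed from the question's dput() output.
BGBottles <- data.frame(
  Machine = factor(rep(1:4, each = 3)),
  weight  = c(14.23, 14.96, 14.85, 16.46, 16.74, 15.94,
              14.98, 14.88, 14.87, 15.94, 16.07, 14.91)
)

# Passing the grouping column as a string sidesteps the
# non-standard-evaluation scoping issue inside a function body.
group_sizes <- function(df, group_col) {
  ddply(df, group_col, nrow)
}

group_sizes(BGBottles, "Machine")
```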

using summarize in ddply to get entire row based on max() of one column

若如初见. Submitted on 2019-12-10 23:31:31
Question: df1 primer timepoints mean sde Acan 0 1.0000000 0.000000e+00 Acan 20 0.8758265 7.856192e-02 Acan 40 1.0575400 4.680159e-02 Acan 60 1.2399106 2.238616e-01 Acan 120 1.1710685 2.085558e-02 Acan 240 1.6430670 NA Acan 360 1.7747940 NA All I want is the max value of mean (for any of these timepoints) with its corresponding sde. ## this will only get me the mean, obviously x <- ddply(x, .(primer), summarize, max = max(mean)) primer max Acan 1.774794 ## if I were to do this I would obviously not have
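A sketch, assuming the goal is the whole row at the maximal mean: instead of summarize (which drops the other columns), have the per-group function return a subset of the piece, so sde and timepoints come along.

```r
library(plyr)

# Data transcribed from the question's printed df1.
df1 <- data.frame(
  primer     = "Acan",
  timepoints = c(0, 20, 40, 60, 120, 240, 360),
  mean       = c(1.0000000, 0.8758265, 1.0575400, 1.2399106,
                 1.1710685, 1.6430670, 1.7747940),
  sde        = c(0, 7.856192e-02, 4.680159e-02, 2.238616e-01,
                 2.085558e-02, NA, NA)
)

# which.max() gives the row index of the largest mean within each
# piece; returning that row keeps every column of it.
x <- ddply(df1, .(primer), function(p) p[which.max(p$mean), ])
```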

Fast crosstabs and stats on all pairs of variables

别说谁变了你拦得住时间么 Submitted on 2019-12-10 23:00:03
Question: I am trying to calculate a measure of association between all variables in a data.table. (This is not a stats question, but as an aside: the variables are all factors, and the measure is Cramér's V.) Example dataset: p = 50; n = 1e5; # actual dataset has p > 1e3, n > 1e5, much wider but barely longer set.seed(1234) obs <- as.data.table( data.frame( cbind( matrix(sample(c(LETTERS[1:4],NA), n*(p/2), replace=TRUE), nrow=n, ncol=p/2), matrix(sample(c(letters[1:6],NA), n*(p/2), replace=TRUE),
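Not a fast solution for p > 1e3, just a minimal sketch of the statistic itself on a small invented dataset: Cramér's V from an uncorrected chi-squared test on each pair's two-way table (NA entries are dropped by table() here).

```r
# Cramér's V for two categorical vectors: sqrt(chi^2 / (n * (k - 1))),
# where k is the smaller of the table's two dimensions.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  chi <- suppressWarnings(chisq.test(tab, correct = FALSE))$statistic
  unname(sqrt(chi / (sum(tab) * (min(dim(tab)) - 1))))
}

set.seed(1234)
obs <- data.frame(a = sample(c(LETTERS[1:4], NA), 1000, replace = TRUE),
                  b = sample(c(letters[1:6], NA), 1000, replace = TRUE),
                  c = sample(c(LETTERS[1:4], NA), 1000, replace = TRUE))

# All unordered pairs of columns, and V for each pair.
pairs <- combn(names(obs), 2)
v <- apply(pairs, 2, function(p) cramers_v(obs[[p[1]]], obs[[p[2]]]))
```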

Error thrown within ddply crashes R

假装没事ソ Submitted on 2019-12-10 20:33:01
Question: I'm running into an issue where plyr consistently crashes when an error is thrown from the supplied function: > require(plyr) Loading required package: plyr Warning message: package ‘plyr’ was built under R version 3.0.2 > df <- data.frame(group=c("A","A","B","B"), num=c(11,22,33,44)) > ddply(df, .(group), function(x) {x}) group num 1 A 11 2 A 22 3 B 33 4 B 44 > ddply(df, .(group), function(x) {stop("badness")}) called from: (function () { .rs.breakOnError(TRUE) })() Error in .fun(piece, ...)
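The traceback in the excerpt points at an RStudio error hook (.rs.breakOnError) rather than plyr itself. As a workaround sketch, the supplied function can trap its own errors with tryCatch and return them as data, so no error ever escapes the ddply call:

```r
library(plyr)

df <- data.frame(group = c("A", "A", "B", "B"), num = c(11, 22, 33, 44))

# A worker that always fails, as in the question; tryCatch converts
# each error into a one-row data frame instead of aborting ddply.
safe_fun <- function(x) {
  tryCatch(stop("badness"),
           error = function(e) data.frame(error = conditionMessage(e)))
}

res <- ddply(df, .(group), safe_fun)
```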

How to calculate average values for large datasets

不想你离开。 Submitted on 2019-12-10 18:38:41
Question: I am working with a dataset that has temperature readings once an hour, 24 hours a day, for 100+ years. I want to compute an average temperature for each day to reduce the size of my dataset. The headings look like this: YR MO DA HR MN TEMP 1943 6 19 10 0 73 1943 6 19 11 0 72 1943 6 19 12 0 76 1943 6 19 13 0 78 1943 6 19 14 0 81 1943 6 19 15 0 85 1943 6 19 16 0 85 1943 6 19 17 0 86 1943 6 19 18 0 86 1943 6 19 19 0 87 etc. for 600,000+ data points. How can I run a nested function to calculate daily
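A base-R sketch on a few rows in the question's column layout (values beyond the excerpt are invented): aggregate() collapses the hourly readings to one mean per (YR, MO, DA); ddply(temps, .(YR, MO, DA), summarise, TEMP = mean(TEMP)) is the plyr equivalent.

```r
# A few hourly rows in the question's layout; the first three TEMP
# values come from the excerpt, the rest are made up.
temps <- data.frame(
  YR   = 1943, MO = 6,
  DA   = c(19, 19, 19, 20, 20),
  HR   = c(10, 11, 12, 10, 11),
  MN   = 0,
  TEMP = c(73, 72, 76, 80, 82)
)

# One mean temperature per calendar day.
daily <- aggregate(TEMP ~ YR + MO + DA, data = temps, FUN = mean)
```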

Aggregating duplicate rows by taking sum

偶尔善良 Submitted on 2019-12-10 16:57:12
Question: Following on from my questions: 1. Identifying whether a set of variables uniquely identifies each row of the data or not; 2. Tagging all rows that are duplicates in terms of a given set of variables — I would now like to aggregate/consolidate all the duplicate rows in terms of a given set of variables by taking their sum. Solution 1: There is some guidance on how to do this here, but when there are a large number of levels of the variables that form the index, the ddply method recommended
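When ddply is slow over an index with many levels, base aggregate() (or data.table's grouped sum) is the usual replacement. A minimal base-R sketch with hypothetical columns id1, id2, val:

```r
# Hypothetical data: the first two rows are duplicates of the
# (id1, id2) index and should be collapsed into one.
df <- data.frame(id1 = c("a", "a", "b"),
                 id2 = c(1, 1, 2),
                 val = c(10, 5, 3))

# Sum val within each (id1, id2) combination, consolidating duplicates.
agg <- aggregate(val ~ id1 + id2, data = df, FUN = sum)
```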

Add an index (or counter) to a dataframe by group in R [duplicate]

给你一囗甜甜゛ Submitted on 2019-12-10 16:39:06
Question: This question already has answers here: Numbering rows within groups in a data frame (6 answers). Closed 3 years ago. I have a df like ProjectID Dist 1 x 1 y 2 z 2 x 2 h 3 k .... .... I want to add a third column such that we have an incrementing counter for each ProjectID: ProjectID Dist counter 1 x 1 1 y 2 2 z 1 2 x 2 2 h 3 3 k 1 .... .... I've had a look at seq, rank, and a couple of other bits, particularly looking to see if I could use ddply to help: df$counter <- ddply(df,.(projectID),
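A base-R sketch of the within-group counter (note that ddply returns a whole new data frame, so assigning its result to a single column as in the excerpt will not work): ave() with seq_along numbers the rows inside each ProjectID.

```r
df <- data.frame(ProjectID = c(1, 1, 2, 2, 2, 3),
                 Dist = c("x", "y", "z", "x", "h", "k"))

# seq_along restarts at 1 within each ProjectID group.
df$counter <- ave(seq_along(df$ProjectID), df$ProjectID, FUN = seq_along)
```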

Conditional Cross tabulation in R

99封情书 Submitted on 2019-12-10 15:56:13
Question: I'm looking for the quickest way to achieve the task below using the "expss" package. With "expss" we can easily do cross-tabulation (the package has other advantages and useful cross-tabulation functions too), and we can cross-tabulate multiple variables like this: #install.packages("expss") library("expss") data(mtcars) var1 <- "vs, am, gear, carb" var_names = trimws(unlist(strsplit(var1, split = ","))) mtcars %>% tab_prepend_values %>% tab_cols(total(), ..[(var_names)]) %>% tab
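The expss pipeline in the excerpt is cut off. As a package-agnostic sketch of the same variable-name handling, the parsed names can drive plain table() calls (expss's tab_cols() pipeline would consume the same var_names vector):

```r
data(mtcars)

# Parse the comma-separated variable list, as in the question.
var1 <- "vs, am, gear, carb"
var_names <- trimws(unlist(strsplit(var1, split = ",")))

# One frequency table per named variable, base R only.
tabs <- lapply(var_names, function(v) table(mtcars[[v]]))
names(tabs) <- var_names
```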

Add simulated poisson distributions to a ggplot

限于喜欢 Submitted on 2019-12-10 15:53:50
Question: I have fitted a Poisson regression and then visualised the model: library(ggplot2) year <- 1990:2010 count <- c(29, 8, 13, 3, 20, 14, 18, 15, 10, 19, 17, 18, 24, 47, 52, 24, 25, 24, 31, 56, 48) df <- data.frame(year, count) my_glm <- glm(count ~ year, family = "poisson", data = df) my_glm$model$fitted <- predict(my_glm, type = "response") ggplot(my_glm$model) + geom_point(aes(year, count)) + geom_line(aes(year, fitted)) Now I want to add these simulated Poisson distributions to the plot:
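One reading of the question (an assumption, since the excerpt stops before showing the simulations): "simulated Poisson distributions" means draws from the fitted model, which can be overlaid by sampling rpois() at each year's fitted mean and plotting the draws as jittered points.

```r
library(ggplot2)

# Model and data as in the question.
year  <- 1990:2010
count <- c(29, 8, 13, 3, 20, 14, 18, 15, 10, 19, 17, 18, 24, 47, 52,
           24, 25, 24, 31, 56, 48)
df <- data.frame(year, count)
my_glm <- glm(count ~ year, family = "poisson", data = df)
df$fitted <- predict(my_glm, type = "response")

# Ten Poisson draws per year, each centred on that year's fitted mean.
set.seed(1)
sims <- data.frame(year = rep(df$year, 10),
                   sim  = rpois(10 * nrow(df), rep(df$fitted, 10)))

p <- ggplot(df) +
  geom_point(aes(year, count)) +
  geom_line(aes(year, fitted)) +
  geom_jitter(data = sims, aes(year, sim), alpha = 0.2, width = 0.2)
```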

Dummy for first new element in a series

送分小仙女□ Submitted on 2019-12-10 14:59:00
Question: Suppose I have a variable that lasts for several periods, like the number of years I have owned an iPod. Say I had the Ipod1 (first generation) from 2001 until 2004, then got the Ipod2 in 2005, and so on. So my dataframe would look like: 2001 Ipod1 2002 Ipod1 2003 Ipod1 2004 Ipod1 2005 Ipod2 2006 Ipod2 2007 Ipod2 2008 Ipod2 2009 Ipod3 2010 Ipod3 What I want is to create a dummy for the period when a new variable arrives, so I would get: Year Var Dummy 2001 Ipod1 1 2002 Ipod1 0 2003 Ipod1 0 2004
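A base-R sketch: comparing each Var with the previous row flags the first year of every new spell (the first row always counts as new).

```r
df <- data.frame(Year = 2001:2010,
                 Var  = c(rep("Ipod1", 4), rep("Ipod2", 4), rep("Ipod3", 2)),
                 stringsAsFactors = FALSE)

# 1 whenever Var differs from the row above; the leading "" makes the
# first row compare unequal, so it is always marked 1.
df$Dummy <- as.integer(df$Var != c("", head(df$Var, -1)))
```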