plyr | 易学教程

Function to count NA values at each level of a factor

I have this dataframe:

    set.seed(50)
    data <- data.frame(age=c(rep("juv", 10), rep("ad", 10)),
                       sex=c(rep("m", 10), rep("f", 10)),
                       size=c(rep("large", 10), rep("small", 10)),
                       length=rnorm(20), width=rnorm(20), height=rnorm(20))
    data$length[sample(1:20, size=8, replace=F)] <- NA
    data$width[sample(1:20, size=8, replace=F)] <- NA
    data$height[sample(1:20, size=8, replace=F)] <- NA

      age sex  size      length       width      height
    1 juv   m large          NA -0.34992735  0.10955641
    2 juv   m large -0.84160374          NA -0.41341885
    3 juv   m large  0.03299794 -1.58987765          NA
    4 juv   m large          NA          NA          NA
    5 juv   m large -1.72760411          NA  0.09534935
    6 juv
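
One possible way to get such counts, sketched in base R rather than plyr and on a small made-up data frame: `toy` and `count_na` are illustrative names that do not appear in the question, and only one grouping factor (age) is used.

```r
# Hypothetical toy data, far smaller than the data frame in the question
toy <- data.frame(
  age    = c("juv", "juv", "juv", "ad", "ad", "ad"),
  length = c(NA, 1.2, NA, 0.5, NA, 0.7),
  width  = c(0.3, NA, 0.1, 0.2, 0.4, 0.9)
)

# Number of NA values in a vector
count_na <- function(x) sum(is.na(x))

# Apply count_na to each measurement column within each level of age
na_counts <- aggregate(toy[c("length", "width")],
                       by = list(age = toy$age),
                       FUN = count_na)
print(na_counts)
```

The data-frame method of aggregate() is used here instead of the formula interface because the formula interface drops rows containing NA by default, which would hide exactly the values being counted.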

R ddply with multiple variables

Here is a simplified data frame that stands in for my real data set:

    df <- data.frame(ID=rep(101:102,each=9),phase=rep(1:3,6),variable=rep(LETTERS[1:3],each=3,times=2),mm1=c(1:18),mm2=c(19:36),mm3=c(37:54))

I would like to group by ID and variable first and then, for the value columns (mm1, mm2, mm3), subtract the phase 3 value from every phase (phase 1 to phase 3). That would make mm1 to mm3 all -2 in phase 1, all -1 in phase 2, and all 0 in phase 3. R throws the error "Error in Ops.data.frame(x, x[3, ]) : - only defined for equally-sized data frames" when I try:

    df1 <- ddply(df, .(ID, variable), function(x) (x - x[3,]))

Any advice would
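
The error arises because x has three rows while x[3, ] has one, and subtraction of data frames requires equal sizes. A minimal sketch of one workaround follows, in base R with ave() rather than ddply. It assumes that phase 3 is always the last row within each ID/variable group, which holds for the example data but may not hold for the real data set; `value_cols` is an illustrative name.

```r
df <- data.frame(ID = rep(101:102, each = 9),
                 phase = rep(1:3, 6),
                 variable = rep(LETTERS[1:3], each = 3, times = 2),
                 mm1 = 1:18, mm2 = 19:36, mm3 = 37:54)

value_cols <- c("mm1", "mm2", "mm3")  # columns to adjust (illustrative name)
df1 <- df

for (col in value_cols) {
  # Within each ID/variable group, subtract the last value,
  # assumed here to be the phase 3 row
  df1[[col]] <- ave(df[[col]], df$ID, df$variable,
                    FUN = function(v) v - v[length(v)])
}

print(head(df1))
```

Working column by column avoids subtracting the non-numeric variable column, which the data-frame-wide subtraction in the question would also attempt.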

How to create a rank variable under certain conditions?

My data contain a time variable and a chosen-brand variable, as below. time indicates the shopping time and chosenbrand indicates the brand purchased at that time. With this data, I would like to create rank variables as shown in the third column, fourth column, and so on. The rank of the brands (e.g., brand1 - brand3) should be based on the past 36 hours. So, to calculate the rank for the second row, which has shoptime "2013-09-01 08:54:00 UTC", the rank should be based on all chosenbrand values within the 36 hours before that time. (brand1 in the second row should not be in the 36 hours.) Therefore, rank_brand1, rank

Cumulative sums over run lengths. Can this loop be vectorized?

I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.

    dir.rle <- rle(df$dir)

I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.

    ndx <- 1
    for(i in 1:length(dir.rle$lengths)) {
      l <- dir.rle$lengths[i] - 1
      s <- ndx
      e <- ndx+l
      tmp[s:e,]$cumval <- cumsum(df[s:e,]$val)
      ndx <- e + 1
    }

The run lengths of dir define the start, s, and end, e, for each run. The above code
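
A loop-free version can be sketched by turning the run lengths into a run identifier and letting ave() apply cumsum() within each run. The example values below are made up, `run_id` is an illustrative name, and the result is written back into df rather than into the separate tmp object used in the question.

```r
# Hypothetical example data with the columns named in the question
df <- data.frame(dir = c(1, 1, 0, 0, 0, -1, 1, 1),
                 val = 1:8)

dir.rle <- rle(df$dir)

# One integer label per row: 1 for the first run, 2 for the second, ...
run_id <- rep(seq_along(dir.rle$lengths), dir.rle$lengths)

# Cumulative sum of val restarted at the beginning of every run
df$cumval <- ave(df$val, run_id, FUN = cumsum)

print(df)
```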

Replace missing values (NA) in one data set with values from another where columns match

I have a data frame (datadf) with 3 columns, 'x', 'y', and 'z'. Several 'x' values are missing (NA). 'y' and 'z' are non-measured variables.

      x y z
    153 a 1
    163 b 1
     NA d 1
    123 a 2
    145 e 2
     NA c 2
     NA b 1
    199 a 2

I have another data frame (imputeddf) with the same three columns:

      x y z
    123 a 1
    145 a 2
    124 b 1
    168 b 2
    123 c 1
    176 c 2
    184 d 1
    101 d 2

I wish to replace NA in 'x' in 'datadf' with values from 'imputeddf' where 'y' and 'z' match between the two data sets (each combination of 'y' and 'z' has its own value of 'x' to fill in). The desired result:

      x y z
    153 a 1
    163 b 1
    184 d 1
    123 a 2
    145 e 2
    176
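
One way to do this lookup, sketched in base R with the two data frames from the question typed in by hand: a combined key is built from 'y' and 'z', and match() finds the corresponding row of imputeddf. The names `key_data`, `key_imp`, `idx` and `missing_x` are illustrative, and the sketch assumes each y/z combination occurs at most once in imputeddf.

```r
datadf <- data.frame(x = c(153, 163, NA, 123, 145, NA, NA, 199),
                     y = c("a", "b", "d", "a", "e", "c", "b", "a"),
                     z = c(1, 1, 1, 2, 2, 2, 1, 2))

imputeddf <- data.frame(x = c(123, 145, 124, 168, 123, 176, 184, 101),
                        y = c("a", "a", "b", "b", "c", "c", "d", "d"),
                        z = c(1, 2, 1, 2, 1, 2, 1, 2))

# Build one lookup key per row from the two matching columns
key_data <- paste(datadf$y, datadf$z)
key_imp  <- paste(imputeddf$y, imputeddf$z)

# Row of imputeddf that corresponds to each row of datadf (NA if none)
idx <- match(key_data, key_imp)

# Fill only the missing x values; observed values are left untouched
missing_x <- is.na(datadf$x)
datadf$x[missing_x] <- imputeddf$x[idx[missing_x]]

print(datadf)
```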

Is the plyr package for R not available for R version 3.0.2? [duplicate]

This question already has an answer here: How should I deal with “package 'xxx' is not available (for R version x.y.z)” warning? 14 answers

I tried installing the plyr package and I got the warning message saying it isn't available for R version 3.0.2. Is this true or not? If not, why would I be getting this message? I tried using two different CRAN mirrors and both gave the same message.

The answer is that the package is available in R (just checked this on my machine). The particular error message that you are getting is very misleading. It is R's 'catch-all' condition for anything that it

join matching columns in a data.frame or data.table

I have the following data.frames:

    a <- data.frame(id = 1:3, v1 = c('a', NA, NA), v2 = c(NA, 'b', 'c'))
    b <- data.frame(id = 1:3, v1 = c(NA, 'B', 'C'), v2 = c("A", NA, NA))

    > a
      id   v1   v2
    1  1    a <NA>
    2  2 <NA>    b
    3  3 <NA>    c
    > b
      id   v1   v2
    1  1 <NA>    A
    2  2    B <NA>
    3  3    C <NA>

Note: there are no ids for which v1 or v2 are defined in both tables; there is only a single unique non-NA value in each column for each id value.

I would like to merge these data frames on matching values of "id":

    ab <- merge(a, b, by = "id")

but I would also like to combine the two columns v1 and v2, so that the data.frame ab will
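
The excerpt stops before the desired output is shown, so the following is only a sketch under the assumption that the goal is a single v1 and a single v2 column holding whichever table's value is not NA. It uses base R, and the suffixes ".a" and ".b" are illustrative choices.

```r
# stringsAsFactors = FALSE keeps the values as character strings
a <- data.frame(id = 1:3, v1 = c("a", NA, NA), v2 = c(NA, "b", "c"),
                stringsAsFactors = FALSE)
b <- data.frame(id = 1:3, v1 = c(NA, "B", "C"), v2 = c("A", NA, NA),
                stringsAsFactors = FALSE)

# Merge on id; clashing column names receive the given suffixes
ab <- merge(a, b, by = "id", suffixes = c(".a", ".b"))

# Take the value from a when present, otherwise the value from b
ab$v1 <- ifelse(is.na(ab$v1.a), ab$v1.b, ab$v1.a)
ab$v2 <- ifelse(is.na(ab$v2.a), ab$v2.b, ab$v2.a)

# Keep only the combined columns
ab <- ab[c("id", "v1", "v2")]
print(ab)
```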

How to better create stacked bar graphs with multiple variables from ggplot2?

I often have to make stacked barplots to compare variables, and because I do all my stats in R, I prefer to do all my graphics in R with ggplot2. I would like to learn how to do two things: First, I would like to be able to add proper percentage tick marks for each variable rather than tick marks by count. Counts would be confusing, which is why I take out the axis labels completely. Second, there must be a simpler way to reorganize my data to make this happen. It seems like the sort of thing I should be able to do natively in ggplot2 with plyr, but the documentation for plyr is not very clear

R: Generic flattening of JSON to data.frame

Question: This question is about a generic mechanism for converting any collection of non-cyclical homogeneous or heterogeneous data structures into a dataframe. This can be particularly useful when dealing with the ingestion of many JSON documents or with a large JSON document that is an array of dictionaries. There are several SO questions that deal with manipulating deeply nested JSON structures and turning them into dataframes using functionality such as plyr, lapply, etc. All the questions and

Beginner tips on using plyr to calculate year-over-year change across groups

I am new to plyr (and R) and looking for a little help to get started. Using the baseball dataset as an example, how could I calculate the year-over-year (yoy) change in "at bats" by league and team (lg and team)?

    library(plyr)
    df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)

After doing a little aggregating to simplify the data frame, the data looks like this:

    head(df1)
    year lg team   ab
    1884 UA  ALT  108
    1997 AL  ANA 1703
    1998 AL  ANA 1502
    1999 AL  ANA  660
    2000 AL  ANA   85
    2001 AL  ANA  219

I would like to end up with something like this:

    year lg team   ab yoy
    1997 AL  ANA 1703  NA
    1998 AL  ANA 1502
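
A minimal sketch of one way to add such a column, in base R on the six rows shown by head(df1) above rather than on the full baseball dataset. It assumes that yoy means the absolute difference from the previous year in the data for the same league and team; the excerpt ends before that is confirmed, and a percentage change would need a different formula. `grp` is an illustrative name.

```r
# The six aggregated rows shown in the question, typed in by hand
df1 <- data.frame(
  year = c(1884, 1997, 1998, 1999, 2000, 2001),
  lg   = c("UA", "AL", "AL", "AL", "AL", "AL"),
  team = c("ALT", "ANA", "ANA", "ANA", "ANA", "ANA"),
  ab   = c(108, 1703, 1502, 660, 85, 219)
)

# Sort so that consecutive rows within a group are consecutive seasons
df1 <- df1[order(df1$lg, df1$team, df1$year), ]

# One grouping key per league/team combination
grp <- paste(df1$lg, df1$team)

# Difference from the previous row in the group; NA for the first season
df1$yoy <- ave(df1$ab, grp, FUN = function(v) c(NA, diff(v)))

print(df1)
```

Note that diff() compares adjacent rows, so a team with a gap between seasons would get the change across the gap rather than an NA.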