plyr

Replace missing values (NA) in one data set with values from another where columns match

泪湿孤枕 submitted on 2019-11-28 00:11:56
Question: I have a data frame (datadf) with three columns, 'x', 'y', and 'z'. Several 'x' values are missing (NA). 'y' and 'z' are non-measured variables.

    x   y z
    153 a 1
    163 b 1
    NA  d 1
    123 a 2
    145 e 2
    NA  c 2
    NA  b 1
    199 a 2

I have another data frame (imputeddf) with the same three columns:

    x   y z
    123 a 1
    145 a 2
    124 b 1
    168 b 2
    123 c 1
    176 c 2
    184 d 1
    101 d 2

I wish to replace NA in 'x' in 'datadf' with values from 'imputeddf' where 'y' and 'z' match between the two data sets (each combo of 'y' and 'z' has
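One possible way to do the replacement in base R is sketched below. It is not necessarily the answer given in the original thread. It assumes that every ('y', 'z') combination occurs at most once in imputeddf; the names key_data, key_imputed and idx are made up for this illustration.

```r
# Data copied from the question.
datadf <- data.frame(
  x = c(153, 163, NA, 123, 145, NA, NA, 199),
  y = c("a", "b", "d", "a", "e", "c", "b", "a"),
  z = c(1, 1, 1, 2, 2, 2, 1, 2),
  stringsAsFactors = FALSE
)
imputeddf <- data.frame(
  x = c(123, 145, 124, 168, 123, 176, 184, 101),
  y = c("a", "a", "b", "b", "c", "c", "d", "d"),
  z = c(1, 2, 1, 2, 1, 2, 1, 2),
  stringsAsFactors = FALSE
)

# Build one lookup key per row from the two matching columns.
# Assumption: each key is unique within imputeddf.
key_data    <- paste(datadf$y, datadf$z)
key_imputed <- paste(imputeddf$y, imputeddf$z)

# Position of each datadf row in imputeddf (NA when there is no match).
idx <- match(key_data, key_imputed)

# Keep observed values; fill only the missing ones from imputeddf.
datadf$x <- ifelse(is.na(datadf$x), imputeddf$x[idx], datadf$x)
```

Rows whose ('y', 'z') pair has no counterpart in imputeddf, such as ('e', 2), are left as they were, so an NA in such a row would stay NA.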

Blend of na.omit and na.pass using aggregate?

我只是一个虾纸丫 submitted on 2019-11-27 22:31:08
I have a data set containing product prototype test data. Not all tests were run on all lots, and not all tests were executed with the same sample sizes. To illustrate, consider this case:

    > test <- data.frame(name = rep(c("A", "B", "C"), each = 4),
                         var1 = rep(c(1:3, NA), 3),
                         var2 = 1:12,
                         var3 = c(rep(NA, 4), 1:8))
    > test
       name var1 var2 var3
    1     A    1    1   NA
    2     A    2    2   NA
    3     A    3    3   NA
    4     A   NA    4   NA
    5     B    1    5    1
    6     B    2    6    2
    7     B    3    7    3
    8     B   NA    8    4
    9     C    1    9    5
    10    C    2   10    6
    11    C    3   11    7
    12    C   NA   12    8

In the past, I've only had to deal with cases of mismatched repetitions, which has been easy with aggregate(cbind(var1,
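A sketch of one commonly used combination is shown below: na.action = na.pass keeps rows that contain NA, and na.rm = TRUE is forwarded to the summary function so that NA values are dropped column by column. Choosing mean as the summary is an assumption made for this example, since the excerpt is cut off before it says which statistic is wanted.

```r
# Data copied from the question.
test <- data.frame(
  name = rep(c("A", "B", "C"), each = 4),
  var1 = rep(c(1:3, NA), 3),
  var2 = 1:12,
  var3 = c(rep(NA, 4), 1:8)
)

# na.action = na.pass stops aggregate() from discarding incomplete rows;
# na.rm = TRUE is passed through to mean(), which then ignores NA
# separately within each column and group.
means <- aggregate(cbind(var1, var2, var3) ~ name, data = test,
                   FUN = mean, na.rm = TRUE, na.action = na.pass)
```

A group in which a variable is entirely missing (var3 for lot A here) yields NaN rather than an error, so such cells remain visible in the result.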

Reshape multiple categorical variables to binary response variables

生来就可爱ヽ(ⅴ<●) submitted on 2019-11-27 22:21:10
I am trying to convert the following format:

    mydata <- data.frame(movie = c("Titanic", "Departed"),
                         actor1 = c("Leo", "Jack"),
                         actor2 = c("Kate", "Leo"))

         movie actor1 actor2
    1  Titanic    Leo   Kate
    2 Departed   Jack    Leo

to binary response variables:

         movie Leo Kate Jack
    1  Titanic   1    1    0
    2 Departed   1    0    1

I tried the solution described in Convert row data to binary columns but I could get it to work for two variables, not three. I would really appreciate it if there is a clean way to do this.

Here is a solution via tidyr:

    library(dplyr)
    library(tidyr)
    mydata %>% gather(actor
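Because the tidyr answer above is cut off, here is a base R sketch of the same reshaping idea (stack the actor columns, then cross-tabulate). It is an alternative written for illustration, not the truncated answer itself; actor_cols, long and binary are hypothetical names.

```r
mydata <- data.frame(
  movie  = c("Titanic", "Departed"),
  actor1 = c("Leo", "Jack"),
  actor2 = c("Kate", "Leo"),
  stringsAsFactors = FALSE
)

actor_cols <- c("actor1", "actor2")

# Long format: one row per (movie, actor) pair. unlist() stacks the
# actor columns one after another, so the movie column is repeated
# once per actor column to stay aligned.
long <- data.frame(
  movie = rep(mydata$movie, times = length(actor_cols)),
  actor = unlist(mydata[actor_cols], use.names = FALSE),
  stringsAsFactors = FALSE
)

# Cross-tabulate: rows are movies, columns are actors, cells are counts.
binary <- as.data.frame.matrix(table(long$movie, long$actor))
```

The result carries the movie titles as row names, and both rows and columns come out in alphabetical order. If an actor could be listed twice for one movie, the cells would hold counts above 1 and would need capping to get strict 0/1 indicators.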

Apply t-test on many columns in a dataframe split by factor

怎甘沉沦 submitted on 2019-11-27 21:37:12
I have a data frame with one factor column with two levels, and many numeric columns. I want to split the data frame by the factor column and run a t-test on the column pairs. Using the example dataset Puromycin, I want the result to look something like this:

    Variable  Treated  Untreated  p-value  Test-statistic  CI of difference****
    Conc      0.3450   0.2763     XXX      T               XX - XX
    Rate      141.58   110.7272   xxx      T               XX - XX

I think I am looking for a solution using plyr that can output the above results in a nice data frame. (The Puromycin data set only contains two numeric variables, but the solution I am looking for would work on a
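As a rough sketch of what such a summary could look like, the following base R code loops over the numeric columns of Puromycin and collects the pieces of each t.test() result into one data frame. It uses lapply() rather than plyr, relies on the default Welch test, and the output column names are invented for this example.

```r
# Puromycin ships with R (datasets package): conc, rate, state.
num_cols <- names(Puromycin)[sapply(Puromycin, is.numeric)]

res <- do.call(rbind, lapply(num_cols, function(v) {
  # Split one numeric column by the two-level factor.
  groups <- split(Puromycin[[v]], Puromycin$state)
  tt <- t.test(groups$treated, groups$untreated)  # Welch test by default
  data.frame(
    variable  = v,
    treated   = mean(groups$treated),
    untreated = mean(groups$untreated),
    p.value   = tt$p.value,
    statistic = unname(tt$statistic),
    ci.lower  = tt$conf.int[1],   # CI of the difference in means
    ci.upper  = tt$conf.int[2]
  )
}))
```

Because the columns are discovered with is.numeric, the same code should extend to data frames with more numeric variables, provided the grouping factor is named and levelled as in Puromycin.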

ddply with lm() function

烂漫一生 submitted on 2019-11-27 19:29:15
Hi guys, how can I use the ddply function to fit a linear model?

    x1 <- c(1:10, 1:10)
    x2 <- c(1:5, 1:5, 1:5, 1:5)
    x3 <- c(rep(1,5), rep(2,5), rep(1,5), rep(2,5))
    set.seed(123)
    y <- rnorm(20, 10, 3)
    mydf <- data.frame(x1, x2, x3, y)
    require(plyr)
    ddply(mydf, mydf$x3, .fun = lm(mydf$y ~ mydf$X1 + mydf$x2))

Generates this error:

    Error in model.frame.default(formula = mydf$y ~ mydf$X1 + mydf$x2, drop.unused.levels = TRUE) :
      invalid type (NULL) for variable 'mydf$X1'

Appreciate your help.

Here is what you need to do.

    mods = dlply(mydf, .(x3), lm, formula = y ~ x1 + x2)

mods is a list of two objects containing
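For comparison, the same split-and-fit pattern can be sketched in base R without plyr. This is an illustrative equivalent of the dlply call above, not part of the original answer; coefs is a hypothetical name.

```r
x1 <- c(1:10, 1:10)
x2 <- c(1:5, 1:5, 1:5, 1:5)
x3 <- c(rep(1, 5), rep(2, 5), rep(1, 5), rep(2, 5))
set.seed(123)
y <- rnorm(20, 10, 3)
mydf <- data.frame(x1, x2, x3, y)

# Split the data frame by x3 and fit one model per piece. The formula
# uses bare column names, and each piece is passed in as `data`.
mods <- lapply(split(mydf, mydf$x3), function(d) lm(y ~ x1 + x2, data = d))

# One row of coefficients per group.
coefs <- do.call(rbind, lapply(mods, coef))
```

In this toy data x1 and x2 are perfectly collinear within each x3 group (x1 equals x2 in group 1 and x2 + 5 in group 2), so lm() reports NA for the x2 coefficient in both fits.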

Is the plyr package for R not available for R version 3.0.2? [duplicate]

不羁的心 submitted on 2019-11-27 17:38:48
Question: This question already has an answer here: How should I deal with “package 'xxx' is not available (for R version x.y.z)” warning? (15 answers)

I tried installing the plyr package and I got a warning message saying it isn't available for R version 3.0.2. Is this true or not? If not, why would I be getting this message? I tried using two different CRAN mirrors and both gave the same message.

Answer 1: The answer is that the package is available in R (just checked this on my machine). The particular

doing a plyr operation on every row of a data frame in R

半腔热情 submitted on 2019-11-27 17:18:22
I like the plyr syntax. Any time I have to use one of the *apply() commands I end up kicking the dog and going on a 3 day bender. So for the sake of my dog and my liver, what's a concise syntax for doing a ddply operation on every row of a data frame? Here's an example that works well for a simple case:

    x <- rnorm(10)
    y <- rnorm(10)
    df <- data.frame(x, y)
    ddply(df, names(df), function(df) max(df$x, df$y))

That works fine and gives me what I want. But if things get more complex this causes plyr to get funky (and not like Bootsy Collins) because plyr is chewing on making "levels" out of all those
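As a point of comparison, a row-wise maximum like the one in the example does not need any grouping at all. The sketch below shows two base R options; it is not the plyr answer from the original thread, and row_max and row_max_general are made-up column names.

```r
set.seed(1)  # seed added here only to make the sketch reproducible
df <- data.frame(x = rnorm(10), y = rnorm(10))

# Vectorised element-wise maximum: one value per row, no grouping step.
df$row_max <- pmax(df$x, df$y)

# More general pattern for an arbitrary function of each row's values:
# mapply() calls the function once per row with that row's x and y.
df$row_max_general <- mapply(function(x, y) max(x, y), df$x, df$y)
```

Neither variant builds factor levels from the numeric columns, which is the step the question describes as causing trouble.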

join matching columns in a data.frame or data.table

|▌冷眼眸甩不掉的悲伤 submitted on 2019-11-27 16:35:02
Question: I have the following data.frames:

    a <- data.frame(id = 1:3, v1 = c('a', NA, NA), v2 = c(NA, 'b', 'c'))
    b <- data.frame(id = 1:3, v1 = c(NA, 'B', 'C'), v2 = c("A", NA, NA))

    > a
      id   v1   v2
    1  1    a <NA>
    2  2 <NA>    b
    3  3 <NA>    c
    > b
      id   v1   v2
    1  1 <NA>    A
    2  2    B <NA>
    3  3    C <NA>

Note: there are no ids for which v1 or v2 are defined in both tables; there is only a single unique non-NA value in each column for each id value.

I would like to merge these data frames on matching values of "id": ab <- merge(a, b,
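One way the merge could be completed is sketched below: join on id with explicit suffixes, then take the non-NA value from each suffixed pair of columns. This is an illustration under the question's stated guarantee that a value is never defined in both tables; the suffixes .a and .b and the helper names from_a and from_b are choices made for this example.

```r
a <- data.frame(id = 1:3, v1 = c("a", NA, NA), v2 = c(NA, "b", "c"),
                stringsAsFactors = FALSE)
b <- data.frame(id = 1:3, v1 = c(NA, "B", "C"), v2 = c("A", NA, NA),
                stringsAsFactors = FALSE)

# Join on id; the shared columns become v1.a / v1.b and v2.a / v2.b.
ab <- merge(a, b, by = "id", suffixes = c(".a", ".b"))

# For each shared column, take the value from `a` when present and
# fall back to the value from `b` otherwise.
for (v in c("v1", "v2")) {
  from_a <- ab[[paste0(v, ".a")]]
  from_b <- ab[[paste0(v, ".b")]]
  ab[[v]] <- ifelse(is.na(from_a), from_b, from_a)
}

# Drop the suffixed helper columns.
ab <- ab[c("id", "v1", "v2")]
```

If both tables did hold a value for the same id and column, this sketch would silently prefer the one from a.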

How to better create stacked bar graphs with multiple variables from ggplot2?

萝らか妹 submitted on 2019-11-27 15:02:56
Question: I often have to make stacked barplots to compare variables, and because I do all my stats in R, I prefer to do all my graphics in R with ggplot2. I would like to learn how to do two things: First, I would like to be able to add proper percentage tick marks for each variable rather than tick marks by count. Counts would be confusing, which is why I take out the axis labels completely. Second, there must be a simpler way to reorganize my data to make this happen. It seems like the sort of thing

Beginner tips on using plyr to calculate year-over-year change across groups

荒凉一梦 submitted on 2019-11-27 14:12:44
Question: I am new to plyr (and R) and looking for a little help to get started. Using the baseball dataset as an example, how could I calculate the year-over-year (yoy) change in "at bats" by league and team (lg and team)?

    library(plyr)
    df1 <- aggregate(ab~year+lg+team, FUN=sum, data=baseball)

After doing a little aggregating to simplify the data frame, the data looks like this:

    head(df1)
    year lg team   ab
    1884 UA  ALT  108
    1997 AL  ANA 1703
    1998 AL  ANA 1502
    1999 AL  ANA  660
    2000 AL  ANA   85
    2001 AL  ANA  219

I
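A minimal sketch of the year-over-year calculation follows. To stay self-contained it rebuilds only the six rows shown in head(df1) instead of loading the baseball data from plyr, and it uses base R's ave() rather than plyr; the column name yoy and the helper grp are assumptions made for this example.

```r
# The six rows printed by head(df1) in the question.
df1 <- data.frame(
  year = c(1884, 1997, 1998, 1999, 2000, 2001),
  lg   = c("UA", "AL", "AL", "AL", "AL", "AL"),
  team = c("ALT", "ANA", "ANA", "ANA", "ANA", "ANA"),
  ab   = c(108, 1703, 1502, 660, 85, 219),
  stringsAsFactors = FALSE
)

# Sort so that years are consecutive within each league/team group.
df1 <- df1[order(df1$lg, df1$team, df1$year), ]

# One grouping key per league/team combination.
grp <- paste(df1$lg, df1$team)

# Within each group, subtract the previous row's at-bats; the first
# year of every group has no predecessor and gets NA.
df1$yoy <- ave(df1$ab, grp, FUN = function(v) c(NA, diff(v)))
```

Note that diff() compares adjacent rows, so a team with a gap between seasons would get a change measured across that gap rather than across a single year.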