plyr

Block bootstrap from subject list

Submitted by 这一生的挚爱 on 2019-11-28 07:13:47
Question: I'm trying to efficiently implement a block bootstrap technique to get the distribution of regression coefficients. The main outline is as follows. I have a panel data set, and say firm and year are the indices. For each iteration of the bootstrap, I wish to sample n subjects (firms) with replacement. From this sample, I need to construct a new data frame that is an rbind() stack of all the observations for each sampled subject, run the regression, and pull out the coefficients. Repeat for a bunch of
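A minimal sketch of such a subject-level block bootstrap, assuming a hypothetical panel with columns firm, year, x, and y and an OLS model y ~ x (all of these names and the toy data are placeholders, not taken from the question):

set.seed(1)
panel <- data.frame(firm = rep(1:10, each = 5),
                    year = rep(2001:2005, times = 10),
                    x    = rnorm(50))
panel$y <- 2 * panel$x + rnorm(50)

firms <- unique(panel$firm)
boot_coefs <- replicate(200, {
  # sample n subjects (firms) with replacement
  sampled <- sample(firms, length(firms), replace = TRUE)
  # stack every observation of each sampled firm; duplicates are kept on purpose
  boot_df <- do.call(rbind, lapply(sampled, function(f) panel[panel$firm == f, ]))
  coef(lm(y ~ x, data = boot_df))
})
# each column of boot_coefs is one bootstrap replicate of (intercept, slope)
apply(boot_coefs, 1, quantile, probs = c(0.025, 0.975))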

Easiest way to subtract values associated with one factor level from values associated with all other factor levels

Submitted by 匆匆过客 on 2019-11-28 06:25:26
Question: I've got a dataframe containing rates for 'live' treatments and rates for 'killed' treatments. I'd like to subtract the killed treatments from the live ones:

df <- data.frame(id1=gl(2, 3, labels=c("a", "b")),
                 id2=rep(gl(3, 1, labels=c("live1", "live2", "killed")), 2),
                 y=c(10, 10, 1, 12, 12, 2),
                 otherFactor = gl(3, 2))

I'd like to subtract the values of y for which id2=="killed" from all the other values of y, separated by the levels of id1, while preserving otherFactor. I would end up with
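One possible approach with plyr (a sketch, not necessarily the accepted answer): split on id1, subtract the "killed" rate from the live rows in each piece, and leave otherFactor untouched.

library(plyr)
df <- data.frame(id1 = gl(2, 3, labels = c("a", "b")),
                 id2 = rep(gl(3, 1, labels = c("live1", "live2", "killed")), 2),
                 y   = c(10, 10, 1, 12, 12, 2),
                 otherFactor = gl(3, 2))

adjusted <- ddply(df, "id1", function(d) {
  killed <- d$y[d$id2 == "killed"]
  live   <- d[d$id2 != "killed", ]
  live$y <- live$y - killed   # subtract the killed rate within this id1 level
  live
})
adjusted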

loading dplyr after plyr is causing issues

Submitted by 旧街凉风 on 2019-11-28 06:06:39
Question: Test case:

library(dplyr)
library(plyr)
library(dplyr)
mtcars %>% rename(x=gear)

This gives an error. Any help would be greatly appreciated.

Answer 1: Based on @hadley's tweet, the best answer is to always load plyr before dplyr, and not to load plyr again afterwards. Pasting his tweet for reference: Hadley Wickham @hadleywickham Jul 27 "@gunapemmaraju just load plyr before dplyr?"

Answer 2: I have this problem when plyr is required again while sourcing files. You can do if("dplyr" %in% (.packages())){ detach("package:dplyr", unload
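Two hedged ways around the masking, following the advice above: control the load order, or sidestep the load-order question entirely by namespace-qualifying the call.

library(plyr)    # load plyr first...
library(dplyr)   # ...then dplyr, so dplyr's verbs win
mtcars %>% rename(x = gear)

# or, regardless of load order, be explicit about which rename() you mean:
mtcars %>% dplyr::rename(x = gear)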

How do I use plyr to number rows?

Submitted by 牧云@^-^@ on 2019-11-28 05:35:00
Basically I want an autoincremented id column based on my cohorts - in this case .(kmer, cvCut)

> myDataFrame
         size kmer cvCut   cumsum
1        8132   23    10     8132
10000     778   23    10 13789274
30000     324   23    10 23658740
50000     182   23    10 28534840
100000     65   23    10 33943283
200000     25   23    10 37954383
250000    584   23    12 16546507
300000    110   23    12 29435303
400000     28   23    12 34697860
600000    127   23     2 47124443
600001    127   23     2 47124570

I want a column added that has new row names based on the kmer/cvCut group

> myDataFrame
         size kmer cvCut   cumsum newID
1        8132   23    10     8132     1
10000     778   23    10 13789274     2
30000     324   23    10 23658740     3
50000
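A sketch of one way to get that counter with plyr: group on kmer and cvCut and add a within-group sequence via transform(). The small data frame below only reproduces part of the output shown above, for illustration.

library(plyr)
myDataFrame <- data.frame(size   = c(8132, 778, 324, 584, 110, 127),
                          kmer   = 23,
                          cvCut  = c(10, 10, 10, 12, 12, 2),
                          cumsum = c(8132, 13789274, 23658740, 16546507, 29435303, 47124443))
# newID restarts at 1 within each (kmer, cvCut) group
ddply(myDataFrame, .(kmer, cvCut), transform, newID = seq_along(size))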

using predict with a list of lm() objects

Submitted by 北慕城南 on 2019-11-28 04:24:47
I have data which I regularly run regressions on. Each "chunk" of data gets fit with a different regression. Each state, for example, might have a different function that explains the dependent value. This seems like a typical "split-apply-combine" type of problem, so I'm using the plyr package. I can easily create a list of lm() objects, which works well. However, I can't quite wrap my head around how I use those objects later to predict values in a separate data.frame. Here's a totally contrived example illustrating what I'm trying to do:

# setting up some fake data
set.seed(1)
funct <- function
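A sketch of the split-apply-predict pattern with made-up data (the state/x/y columns below are placeholders, not the question's data): fit one lm() per state with dlply, then look up the matching model by name when predicting on new rows.

library(plyr)
set.seed(1)
d <- data.frame(state = rep(c("NY", "CA"), each = 20), x = rnorm(40))
d$y <- ifelse(d$state == "NY", 2, -1) * d$x + rnorm(40)

# named list of lm() objects, one per state
models <- dlply(d, "state", function(df) lm(y ~ x, data = df))

newdata <- data.frame(state = c("NY", "CA"), x = c(0.5, 0.5))
newdata$pred <- sapply(seq_len(nrow(newdata)), function(i)
  predict(models[[as.character(newdata$state[i])]], newdata[i, , drop = FALSE]))
newdata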

ddply + summarize for repeating the same statistical function across a large number of columns

Submitted by 北城以北 on 2019-11-28 03:07:04
Ok, second R question in quick succession. My data:

            Timestamp  St_01  St_02 ...
1 2008-02-08 00:00:00 26.020 25.840 ...
2 2008-02-08 00:10:00 25.985 25.790 ...
3 2008-02-08 00:20:00 25.930 25.765 ...
4 2008-02-08 00:30:00 25.925 25.730 ...
5 2008-02-08 00:40:00 25.975 25.695 ...
...

Normally I would use a combination of ddply and summarize to calculate ensembles (e.g. the mean for every hour across the whole year). In the case above, I would create a category, e.g. hour (strptime(data$Timestamp,"%H") -> data$hour), and then use that category in ddply, like ddply(data,"hour", summarize,
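A sketch of how numcolwise() avoids typing out every station column: it turns a scalar summary into one applied to every numeric column of each group. The toy data below just mimics the layout above.

library(plyr)
data <- data.frame(Timestamp = seq(as.POSIXct("2008-02-08 00:00:00", tz = "UTC"),
                                   by = "10 min", length.out = 12),
                   St_01 = rnorm(12, mean = 26),
                   St_02 = rnorm(12, mean = 25.8))
data$hour <- format(data$Timestamp, "%H")

# mean of every numeric column (St_01, St_02, ...) within each hour
hourly <- ddply(data, "hour", numcolwise(mean, na.rm = TRUE))
hourly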

Function to count NA values at each level of a factor

Submitted by 非 Y 不嫁゛ on 2019-11-28 02:05:32
Question: I have this dataframe:

set.seed(50)
data <- data.frame(age=c(rep("juv", 10), rep("ad", 10)),
                   sex=c(rep("m", 10), rep("f", 10)),
                   size=c(rep("large", 10), rep("small", 10)),
                   length=rnorm(20), width=rnorm(20), height=rnorm(20))
data$length[sample(1:20, size=8, replace=F)] <- NA
data$width[sample(1:20, size=8, replace=F)] <- NA
data$height[sample(1:20, size=8, replace=F)] <- NA

  age sex  size      length       width      height
1 juv   m large          NA -0.34992735  0.10955641
2 juv   m large -0.84160374          NA -0.41341885
3 juv
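A sketch that reuses the data frame from the question: numcolwise() wraps the NA counter so it runs on every numeric column, once per level of the chosen factor.

library(plyr)
set.seed(50)
data <- data.frame(age = c(rep("juv", 10), rep("ad", 10)),
                   sex = c(rep("m", 10), rep("f", 10)),
                   size = c(rep("large", 10), rep("small", 10)),
                   length = rnorm(20), width = rnorm(20), height = rnorm(20))
data$length[sample(1:20, size = 8)] <- NA
data$width[sample(1:20, size = 8)] <- NA
data$height[sample(1:20, size = 8)] <- NA

# NA count per numeric column for each level of age; swap .(age) for any other factor(s)
ddply(data, .(age), numcolwise(function(x) sum(is.na(x))))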

How can I use functions returning vectors (like fivenum) with ddply or aggregate?

Submitted by 血红的双手。 on 2019-11-28 01:53:27
I would like to split my data frame using a couple of columns and call, let's say, fivenum on each group.

aggregate(Petal.Width ~ Species, iris, function(x) summary(fivenum(x)))

The returned value is a data.frame with only 2 columns, the second being a matrix. How can I turn it into normal columns of a data.frame?

Update: I want something like the following with less code, using fivenum:

ddply(iris, .(Species), summarise,
      Min = min(Petal.Width),
      Q1  = quantile(Petal.Width, .25),
      Med = median(Petal.Width),
      Q3  = quantile(Petal.Width, .75),
      Max = max(Petal.Width))

You can use do.call to call data
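Two short sketches in the direction the truncated answer above points. With aggregate(), the matrix column can be split into ordinary columns by rebuilding the result with do.call(data.frame, ...); with ddply(), returning a one-row data frame per group gives flat columns directly. The Min/Q1/Med/Q3/Max labels are my own choice.

res <- aggregate(Petal.Width ~ Species, iris,
                 function(x) setNames(fivenum(x), c("Min", "Q1", "Med", "Q3", "Max")))
do.call(data.frame, res)   # matrix column becomes Petal.Width.Min, Petal.Width.Q1, ...

library(plyr)
ddply(iris, .(Species), function(d)
  data.frame(as.list(setNames(fivenum(d$Petal.Width), c("Min", "Q1", "Med", "Q3", "Max")))))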

How to create a rank variable under certain conditions?

Submitted by 廉价感情. on 2019-11-28 01:46:46
Question: My data contain a time variable and a chosen-brand variable, as below. time indicates the shopping time and chosenbrand indicates the brand purchased at that time. With this data, I would like to create rank variables, as shown in the third column, fourth column, and so on. The rank of the brands (e.g., brand1 - brand3) should be based on the past 36 hours. So, to calculate the rank for the second row, which has shoptime "2013-09-01 08:54:00 UTC", the rank should be based on all chosenbrand values within 36 hours
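A rough sketch of one way to build such ranks, assuming columns named shoptime and chosenbrand as in the question, and assuming rank 1 goes to the brand purchased most often in the trailing 36-hour window (both the toy data and that ranking rule are assumptions):

set.seed(1)
purchases <- data.frame(
  shoptime    = as.POSIXct("2013-09-01 00:00:00", tz = "UTC") + sort(runif(10, 0, 72 * 3600)),
  chosenbrand = sample(paste0("brand", 1:3), 10, replace = TRUE))

brands <- paste0("brand", 1:3)
window <- 36 * 3600   # 36 hours in seconds

ranks <- t(sapply(seq_len(nrow(purchases)), function(i) {
  in_window <- purchases$shoptime >  purchases$shoptime[i] - window &
               purchases$shoptime <= purchases$shoptime[i]
  counts <- table(factor(purchases$chosenbrand[in_window], levels = brands))
  rank(-counts, ties.method = "min")   # rank 1 = most frequently chosen brand in the window
}))
colnames(ranks) <- paste0("rank_", brands)
cbind(purchases, ranks)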

Cumulative sums over run lengths. Can this loop be vectorized?

Submitted by 三世轮回 on 2019-11-28 01:28:05
Question: I have a data frame on which I calculate a run length encoding for a specific column. The values of the column, dir, are either -1, 0, or 1.

dir.rle <- rle(df$dir)

I then take the run lengths and compute segmented cumulative sums across another column in the data frame. I'm using a for loop, but I feel like there should be a way to do this more intelligently.

ndx <- 1
for(i in 1:length(dir.rle$lengths)) {
    l <- dir.rle$lengths[i] - 1
    s <- ndx
    e <- ndx + l
    tmp[s:e,]$cumval <- cumsum(df[s:e,]$val
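For reference, a vectorised sketch of the same idea: expand the run lengths into a run id with rep(), then let ave() take the cumulative sum within each run (the toy df below stands in for the real one).

set.seed(1)
df <- data.frame(dir = c(1, 1, 0, 0, 0, -1, -1, 1), val = rnorm(8))

dir.rle <- rle(df$dir)
run.id  <- rep(seq_along(dir.rle$lengths), dir.rle$lengths)   # which run each row belongs to
df$cumval <- ave(df$val, run.id, FUN = cumsum)                # cumsum restarts at each new run
df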