plyr

How to calculate time difference between datetimes, for each group (student-contract)?

偶尔善良 submitted on 2019-12-06 08:28:53
I have a specific problem. I have data in the following format:

    #  USER_ID SUBMISSION_DATE CONTRACT_REF
    1  1       20/6 1:00       W001
    2  1       20/6 2:00       W002
    3  1       20/6 3:30       W003
    4  4       20/6 4:00       W004
    5  5       20/6 5:00       W005
    6  5       20/6 6:00       W006
    7  7       20/6 7:00       W007
    8  7       20/6 8:00       W008
    9  7       20/6 9:00       W009
    10 7       20/6 10:00      W0010

Now I need to somehow calculate the time difference between the different submissions (uniquely identifiable). In other words: I have a table of submissions, and this table contains all submissions for all users. I need to find a way to calculate the time difference for each unique STUDENT
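One way to approach the per-user gaps is sketched below in base R (not plyr). It is only an illustration under stated assumptions: the sample data carries no year, so the year 2019 and the UTC time zone are made up here, and the column names `when` and `gap_mins` are hypothetical.

```r
# Sample rows copied from the question.
subs <- data.frame(
  USER_ID = c(1, 1, 1, 4, 5, 5, 7, 7, 7, 7),
  SUBMISSION_DATE = c("20/6 1:00", "20/6 2:00", "20/6 3:30", "20/6 4:00",
                      "20/6 5:00", "20/6 6:00", "20/6 7:00", "20/6 8:00",
                      "20/6 9:00", "20/6 10:00"),
  CONTRACT_REF = c("W001", "W002", "W003", "W004", "W005",
                   "W006", "W007", "W008", "W009", "W0010"),
  stringsAsFactors = FALSE
)

# Assumption: the year (2019) and time zone (UTC) are invented for parsing.
subs$when <- as.POSIXct(paste("2019", subs$SUBMISSION_DATE),
                        format = "%Y %d/%m %H:%M", tz = "UTC")

# Sort so that consecutive rows within a user are consecutive in time.
subs <- subs[order(subs$USER_ID, subs$when), ]

# Minutes since the same user's previous submission; NA for the first one.
subs$gap_mins <- ave(as.numeric(subs$when), subs$USER_ID,
                     FUN = function(x) c(NA, diff(x))) / 60

print(subs[, c("USER_ID", "CONTRACT_REF", "gap_mins")])
```

Each user's first submission gets `NA` because there is nothing earlier to compare it with. The same per-group function could presumably be passed to `ddply` with `transform` if a plyr solution is preferred.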

Need faster rolling apply function with start to stop indices

こ雲淡風輕ζ submitted on 2019-12-06 06:47:22
Question: Below is the piece of code. It gives the percentile of the trade price level for a rolling 15-minute (historical) window. It runs quickly if the length is 500 or 1000, but as you can see there are 45K observations, and for the entire data set it's very slow. Can I apply any of the plyr functions? Any other suggestions are welcome. This is what the trade data looks like:

    > str(trade)
    'data.frame': 45571 obs. of 5 variables:
     $ time : chr "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00

Compute one sample t-test for each column of a data frame and summarize results in a table

只谈情不闲聊 submitted on 2019-12-06 05:53:26
Here is some sample data on my problem:

    mydf <- data.frame(A = rnorm(20, 1, 5),
                       B = rnorm(20, 2, 5),
                       C = rnorm(20, 3, 5),
                       D = rnorm(20, 4, 5),
                       E = rnorm(20, 5, 5))

Now I'd like to run a one-sample t-test on each column of the data.frame, to test whether it differs significantly from zero, like t.test(mydf$A), and then store the mean of each column, the t-value and the p-value in a new data.frame. So the result should look something like this:

         A B C D E
    mean x x x x x
    t    x x x x x
    p    x x x x x

I could definitely think of some tedious ways to do this, like looping through mydf, calculating the
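A minimal base R sketch of the requested table follows, assuming `sapply` over the columns is acceptable in place of a plyr call. The helper name `one_sample` and the seed are hypothetical additions for illustration.

```r
set.seed(42)  # assumption: seed added only to make the sketch reproducible
mydf <- data.frame(A = rnorm(20, 1, 5), B = rnorm(20, 2, 5),
                   C = rnorm(20, 3, 5), D = rnorm(20, 4, 5),
                   E = rnorm(20, 5, 5))

# Hypothetical helper: one-sample t-test against mu = 0 (the t.test default),
# returning the column mean, the t statistic and the p-value.
one_sample <- function(x) {
  tt <- t.test(x)
  c(mean = unname(tt$estimate),
    t    = unname(tt$statistic),
    p    = tt$p.value)
}

# sapply() binds the three-element results into a 3 x 5 matrix:
# rows mean/t/p, columns A..E.
res <- as.data.frame(sapply(mydf, one_sample))
print(res)
```

Because every call returns a vector of the same length and names, `sapply` simplifies the results to a matrix whose row names come from the helper and whose column names come from `mydf`.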

Using svyglm within plyr call

非 Y 不嫁゛ submitted on 2019-12-06 05:51:04
Question: This is clearly something idiosyncratic to R's survey package. I'm trying to use llply from the plyr package to make a list of svyglm models. Here's an example:

    library(survey)
    library(plyr)
    foo <- data.frame(y1 = rbinom(50, size = 1, prob=.25),
                      y2 = rbinom(50, size = 1, prob=.5),
                      y3 = rbinom(50, size = 1, prob=.75),
                      x1 = rnorm(50, 0, 2),
                      x2 = rnorm(50, 0, 2),
                      x3 = rnorm(50, 0, 2),
                      weights = runif(50, .5, 1.5))

My list of dependent variables' column numbers:

    dvnum <- 1:3

Indicating no clusters

Split overlapping intervals into non-overlapping intervals, within values of an identifier

五迷三道 submitted on 2019-12-06 05:50:50
I would like to take a set of intervals, possibly overlapping, within categories of an identifier and create new intervals that are either exactly overlapping (i.e. same start/end values) or completely non-overlapping. These new intervals should collectively span the range of the original intervals and not include any ranges not in the original intervals. This needs to be a relatively fast operation because I'm working with lots of data. Here is some example data:

    library(data.table)
    set.seed(1113)
    start1 <- c(1,7,9, 17, 18,1,3,20)
    end1 <- c(10,12,15, 20, 23,3,5,25)
    id1 <- c(1,1,1,1,1,2,2,2)
    obs

Use of ddply + mutate with a custom function?

三世轮回 submitted on 2019-12-06 05:16:48
Question: I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset in which I'm trying to apply a custom, more involved function and started trying to dig into how to do this with ddply. I've got a successful solution, but I don't understand why it works like this vs. for more "normal" functions.

Related:
    Custom Function not recognized by ddply {plyr}...
    How do I pass variables to a custom function in

How can I rename the output rows/cols of **ply functions from plyr?

一世执手 submitted on 2019-12-06 03:43:03
Question: I would like to state the row/col output names in a **ply function, ldply, from the plyr package. For example, I have a list, foo, that I want to convert to a data.frame and truncate significant digits with signif():

    foo <- list(var.a = runif(3), var.b = runif(3), var.c=runif(3))

What I have now is

    dq <- ldply(foo, signif, 2)
    colnames(dq) <- c('id', 'q1', 'q2','q3')
    rownames(dq) <- dq$id

Is there an easier way? Two previous questions have asked how to use plyr to rename rows and cols using
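For comparison, here is a hedged base R sketch that produces named rows and columns directly. It swaps `ldply` for `sapply` plus a transpose, so it is an alternative rather than the plyr approach the question asks about, and unlike the snippet in the question it does not keep a separate `id` column. The seed is an assumption added for reproducibility.

```r
set.seed(1)  # assumption: seed added only for reproducibility
foo <- list(var.a = runif(3), var.b = runif(3), var.c = runif(3))

# sapply() yields one column per list element; t() turns those into rows,
# so the list names become the row names.
dq <- as.data.frame(t(sapply(foo, signif, 2)))
colnames(dq) <- c("q1", "q2", "q3")

print(dq)
```

The row names come from `names(foo)` without any extra step, which is the part the original three-line version had to patch up by hand.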

converting uneven hierarchical list to a data frame

五迷三道 submitted on 2019-12-06 02:42:24
I don't think this has been asked yet, but is there a way to combine information of a list with multiple levels and uneven structure into a data frame of "long" format? Specifically:

    library(XML)
    library(plyr)
    xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5.xml"
    xml.parse <- xmlInternalTreeParse(xml.inning)
    xml.list <- xmlToList(xml.parse)
    ## $top$atbat
    ## $top$atbat$pitch
    ##    des     id   type       x        y
    ## "Ball"  "310"    "B" "70.39" "125.20"

Where the following is the structure:

    > llply(xml.list, function(x) llply(x, function(x)

Slower ddply when .parallel=TRUE on Mac OS X Version 10.6.7

左心房为你撑大大i submitted on 2019-12-06 02:33:53
Question: I am trying to get ddply to run in parallel on my Mac. The code I've used is as follows:

    library(doMC)
    library(ggplot2) # for the purposes of getting the baseball data.frame
    registerDoMC(2)
    > system.time(ddply(baseball, .(year), numcolwise(mean)))
       user  system elapsed
      0.959   0.106   1.522
    > system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
       user  system elapsed
      2.221   2.790   2.552

Why is ddply slower when I run .parallel=TRUE? I have searched online to no avail. I've also tried

plyr package writing the same function over multiple columns

一个人想着一个人 submitted on 2019-12-06 02:07:41
Question: I want to apply the same function to multiple columns using the ddply function, but I'm tired of writing them all out in one line and want to see whether there is a better way of doing this. Here's a simple version of the data:

    data <- data.frame(TYPE=as.integer(runif(20,1,3)),
                       A_MEAN_WEIGHT=runif(20,1,100),
                       B_MEAN_WEIGHT=runif(20,1,10))

and I want to find the sum of columns A_MEAN_WEIGHT and B_MEAN_WEIGHT by doing this:

    ddply(data,.(TYPE),summarise,MEAN_A=sum(A_MEAN_WEIGHT),MEAN_B=sum(B_MEAN_WEIGHT))

but in my
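One way to avoid naming every column is sketched below with base R's `aggregate` and a formula, which is a substitute for the `ddply` call rather than a plyr solution. The seed and the result name `sums` are assumptions added for illustration.

```r
set.seed(7)  # assumption: seed added only for reproducibility
data <- data.frame(TYPE = as.integer(runif(20, 1, 3)),
                   A_MEAN_WEIGHT = runif(20, 1, 100),
                   B_MEAN_WEIGHT = runif(20, 1, 10))

# ". ~ TYPE" means: apply FUN to every column other than TYPE, grouped by TYPE.
sums <- aggregate(. ~ TYPE, data = data, FUN = sum)

print(sums)
```

The result keeps the original column names, so the output columns are `A_MEAN_WEIGHT` and `B_MEAN_WEIGHT` rather than `MEAN_A` and `MEAN_B`. Adding more weight columns to `data` needs no change to the call.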