plyr | 易学教程

How to calculate time difference between datetimes, for each group (student-contract)?

I have a specific problem. I have data in the following format:

    #  USER_ID SUBMISSION_DATE CONTRACT_REF
    1  1       20/6 1:00       W001
    2  1       20/6 2:00       W002
    3  1       20/6 3:30       W003
    4  4       20/6 4:00       W004
    5  5       20/6 5:00       W005
    6  5       20/6 6:00       W006
    7  7       20/6 7:00       W007
    8  7       20/6 8:00       W008
    9  7       20/6 9:00       W009
    10 7       20/6 10:00      W0010

Now I need to calculate the time difference between the different submissions (uniquely identifiable). In other words: I have a table of submissions, and this table contains all submissions for all users. I need to find a way to calculate the time difference for each unique STUDENT
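One possible way to get per-user time differences, sketched in base R rather than plyr. The data frame below is a hypothetical reconstruction of the first six rows of the table above; the year 2024 and the UTC time zone are assumptions, since the excerpt only shows day, month and time. The column name `HOURS_SINCE_PREV` is also made up for illustration.

```r
# Hypothetical reconstruction of the first rows of the excerpt's table.
# The year (2024) and time zone (UTC) are assumptions, not from the source.
submissions <- data.frame(
  USER_ID = c(1, 1, 1, 4, 5, 5),
  SUBMISSION_DATE = as.POSIXct(
    c("2024-06-20 01:00", "2024-06-20 02:00", "2024-06-20 03:30",
      "2024-06-20 04:00", "2024-06-20 05:00", "2024-06-20 06:00"),
    format = "%Y-%m-%d %H:%M", tz = "UTC"
  ),
  CONTRACT_REF = c("W001", "W002", "W003", "W004", "W005", "W006")
)

# Sort so that consecutive rows within a user are consecutive in time.
submissions <- submissions[order(submissions$USER_ID,
                                 submissions$SUBMISSION_DATE), ]

# ave() applies FUN within each USER_ID group and returns a vector aligned
# with the rows. The first submission of each user has no predecessor: NA.
submissions$HOURS_SINCE_PREV <- ave(
  as.numeric(submissions$SUBMISSION_DATE),  # seconds since the epoch
  submissions$USER_ID,
  FUN = function(secs) c(NA, diff(secs)) / 3600
)

print(submissions)
```

Users with a single submission (user 4 here) simply get `NA`, so no special-casing is needed.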

Need faster rolling apply function with start to stop indices

Question: Below is the piece of code. It gives the percentile of the trade price level for a rolling 15-minute (historical) window. It runs quickly if the length is 500 or 1000, but as you can see there are 45K observations, and for the entire data it's very slow. Can I apply any of the plyr functions? Any other suggestions are welcome. This is what the trade data looks like:

    > str(trade)
    'data.frame': 45571 obs. of 5 variables:
     $ time : chr "2013-10-20 22:00:00.489" "2013-10-20 22:00:00.807" "2013-10-20 22:00:00

Compute one sample t-test for each column of a data frame and summarize results in a table

Here is some sample data for my problem:

    mydf <- data.frame(A = rnorm(20, 1, 5), B = rnorm(20, 2, 5), C = rnorm(20, 3, 5), D = rnorm(20, 4, 5), E = rnorm(20, 5, 5))

Now I'd like to run a one-sample t-test on each column of the data.frame, to test whether it differs significantly from zero, like t.test(mydf$A), and then store the mean of each column, the t-value and the p-value in a new data.frame. So the result should look something like this:

         A B C D E
    mean x x x x x
    t    x x x x x
    p    x x x x x

I could definitely think of some tedious ways to do this, like looping through mydf, calculating the
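A minimal sketch of one way to build such a table with base R's `sapply`, assuming the `mydf` from the excerpt. The seed and the name `results` are arbitrary choices for illustration, not part of the original question.

```r
set.seed(1)  # arbitrary seed, only so the sketch is reproducible
mydf <- data.frame(A = rnorm(20, 1, 5), B = rnorm(20, 2, 5),
                   C = rnorm(20, 3, 5), D = rnorm(20, 4, 5),
                   E = rnorm(20, 5, 5))

# Run a one-sample t-test (H0: mean = 0) on each column and keep three
# numbers. sapply() binds the named vectors into a matrix with one column
# per variable and one row per statistic.
results <- sapply(mydf, function(column) {
  test <- t.test(column)
  c(mean = unname(test$estimate),
    t    = unname(test$statistic),
    p    = test$p.value)
})

results <- as.data.frame(results)
print(results)
```

The rows come out as `mean`, `t` and `p` and the columns as `A` to `E`, matching the layout asked for. With many columns, the p-values would usually need a multiple-comparison correction such as `p.adjust`.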

Using svyglm within plyr call

Question: This is clearly something idiosyncratic to R's survey package. I'm trying to use llply from the plyr package to make a list of svyglm models. Here's an example:

    library(survey)
    library(plyr)
    foo <- data.frame(y1 = rbinom(50, size = 1, prob=.25), y2 = rbinom(50, size = 1, prob=.5), y3 = rbinom(50, size = 1, prob=.75), x1 = rnorm(50, 0, 2), x2 = rnorm(50, 0, 2), x3 = rnorm(50, 0, 2), weights = runif(50, .5, 1.5))

My list of dependent variables' column numbers:

    dvnum <- 1:3

Indicating no clusters

Split overlapping intervals into non-overlapping intervals, within values of an identifier

I would like to take a set of intervals, possibly overlapping, within categories of an identifier and create new intervals that are either exactly overlapping (i.e. same start/end values) or completely non-overlapping. These new intervals should collectively span the range of the original intervals and not include any ranges that are not in the original intervals. This needs to be a relatively fast operation because I'm working with lots of data. Here is some example data:

    library(data.table)
    set.seed(1113)
    start1 <- c(1,7,9, 17, 18,1,3,20)
    end1 <- c(10,12,15, 20, 23,3,5,25)
    id1 <- c(1,1,1,1,1,2,2,2)
    obs

Use of ddply + mutate with a custom function?

Question: I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset to which I'm trying to apply a custom, more involved function, and I started digging into how to do this with ddply. I've got a working solution, but I don't understand why it works like this vs. for more "normal" functions. Related: Custom Function not recognized by ddply {plyr}... How do I pass variables to a custom function in

How can I rename the output rows/cols of **ply functions from plyr?

Question: I would like to set the row/col output names in a **ply function, ldply, from the plyr package. For example, I have a list, foo, that I want to convert to a data.frame and truncate significant digits with signif():

    foo <- list(var.a = runif(3), var.b = runif(3), var.c=runif(3))

What I have now is:

    dq <- ldply(foo, signif, 2)
    colnames(dq) <- c('id', 'q1', 'q2','q3')
    rownames(dq) <- dq$id

Is there an easier way? Two previous questions have asked how to use plyr to rename rows and cols using
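For comparison, a hedged base R sketch that sidesteps the renaming step by not using ldply at all: `rbind` keeps the list names as row names, so no separate id column appears. The seed is arbitrary, and this is a swapped-in alternative rather than a plyr answer to the question.

```r
set.seed(7)  # arbitrary seed, only so the sketch is reproducible
foo <- list(var.a = runif(3), var.b = runif(3), var.c = runif(3))

# lapply() rounds each vector to 2 significant digits; do.call(rbind, ...)
# stacks the vectors as rows and uses the list names as row names.
dq <- as.data.frame(do.call(rbind, lapply(foo, signif, 2)))

# Only the value columns need names; the ids are already the row names.
colnames(dq) <- c("q1", "q2", "q3")
print(dq)
```

This relies on every list element having the same length, as in the example `foo`.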

converting uneven hierarchical list to a data frame

I don't think this has been asked yet, but is there a way to combine information of a list with multiple levels and uneven structure into a data frame of "long" format? Specifically:

    library(XML)
    library(plyr)
    xml.inning <- "http://gd2.mlb.com/components/game/mlb/year_2009/month_05/day_02/gid_2009_05_02_chamlb_texmlb_1/inning/inning_5.xml"
    xml.parse <- xmlInternalTreeParse(xml.inning)
    xml.list <- xmlToList(xml.parse)
    ## $top$atbat
    ## $top$atbat$pitch
    ##      des      id    type       x        y
    ##   "Ball"   "310"     "B" "70.39" "125.20"

Where the following is the structure:

    > llply(xml.list, function(x) llply(x, function(x)

Slower ddply when .parallel=TRUE on Mac OS X Version 10.6.7

Question: I am trying to get ddply to run in parallel on my Mac. The code I've used is as follows:

    library(doMC)
    library(ggplot2) # for the purposes of getting the baseball data.frame
    registerDoMC(2)
    > system.time(ddply(baseball, .(year), numcolwise(mean)))
       user  system elapsed
      0.959   0.106   1.522
    > system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
       user  system elapsed
      2.221   2.790   2.552

Why is ddply slower when I run .parallel=TRUE? I have searched online to no avail. I've also tried

plyr package writing the same function over multiple columns

Question: I want to apply the same function to multiple columns using the ddply function, but I'm tired of writing them all out in one line and want to see whether there is a better way of doing this. Here's a simple version of the data:

    data<-data.frame(TYPE=as.integer(runif(20,1,3)),A_MEAN_WEIGHT=runif(20,1,100),B_MEAN_WEIGHT=runif(20,1,10))

and I want to find the sum of columns A_MEAN_WEIGHT and B_MEAN_WEIGHT by doing this:

    ddply(data,.(TYPE),summarise,MEAN_A=sum(A_MEAN_WEIGHT),MEAN_B=sum(B_MEAN_WEIGHT))

but in my
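One way to avoid listing every column by hand, sketched with base R's `aggregate` instead of ddply. The seed and the result name `sums` are arbitrary; the data frame follows the excerpt. Note that the output columns keep their original names rather than becoming `MEAN_A` and `MEAN_B`.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible
data <- data.frame(TYPE = as.integer(runif(20, 1, 3)),
                   A_MEAN_WEIGHT = runif(20, 1, 100),
                   B_MEAN_WEIGHT = runif(20, 1, 10))

# In the formula, "." on the left-hand side stands for every column that is
# not used on the right, so new *_MEAN_WEIGHT columns are picked up without
# editing the call.
sums <- aggregate(. ~ TYPE, data = data, FUN = sum)
print(sums)
```

The formula interface drops rows with missing values by default, which may or may not be what a real dataset needs.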