plyr

Exclude duplicate values in certain columns using ddply

烈酒焚心 submitted on 2019-12-04 11:19:07
I have a data frame with the following structure:

> dftest
       element seqnames     start       end  width strand tx_id   tx_name
   1         1    chr19  58858172  58864865   6694      - 36769 NM_130786
   2        10     chr8  18248755  18258723   9969      + 16614 NM_000015
   3       100    chr20  43248163  43280376  32214      - 37719 NM_000022
   4      1000    chr18  25530930  25757445 226516      - 33839 NM_001792
   5     10000     chr1 243651535 244006584 355050      -  4182 NM_181690
   6     10000     chr1 243663021 244006584 343564      -  4183 NM_005465
1316 100302285    chr12  12264886  12264967     82      + 24050 NR_036052
1317 100302285    chr12   9392066   9392147     82      - 25034 NR_036052
1318 100302285     chr2 232578024 232578105     82

Use of ddply + mutate with a custom function?

半腔热情 submitted on 2019-12-04 10:17:01
I use ddply quite frequently, but historically with summarize (occasionally mutate) and only basic functions like mean(), var1 - var2, etc. I have a dataset in which I'm trying to apply a custom, more involved function and started trying to dig into how to do this with ddply. I've got a successful solution, but I don't understand why it works like this vs. for more "normal" functions.

Related:
Custom Function not recognized by ddply {plyr}...
How do I pass variables to a custom function in ddply?
r-help: [R] Correct use of ddply with own function (I ended up basing my solution on this)

Here

Using svyglm within plyr call

烂漫一生 submitted on 2019-12-04 10:07:21
This is clearly something idiosyncratic to R's survey package. I'm trying to use llply from the plyr package to make a list of svyglm models. Here's an example:

library(survey)
library(plyr)
foo <- data.frame(y1 = rbinom(50, size = 1, prob=.25),
                  y2 = rbinom(50, size = 1, prob=.5),
                  y3 = rbinom(50, size = 1, prob=.75),
                  x1 = rnorm(50, 0, 2),
                  x2 = rnorm(50, 0, 2),
                  x3 = rnorm(50, 0, 2),
                  weights = runif(50, .5, 1.5))

My list of dependent variables' column numbers:

dvnum <- 1:3

Indicating no clusters or strata in this sample:

wd <- svydesign(ids= ~0, strata= NULL, weights= ~weights, data = foo)

A

R plyr, data.table, apply certain columns of data.frame

怎甘沉沦 submitted on 2019-12-04 08:30:20
I am looking for ways to speed up my code. I am looking into the apply / ply methods as well as data.table. Unfortunately, I am running into problems. Here is a small data sample:

ids1 <- c(1, 1, 1, 1, 2, 2, 2, 2)
ids2 <- c(1, 2, 3, 4, 1, 2, 3, 4)
chars1 <- c("aa", " bb ", "__cc__", "dd ", "__ee", NA,NA, "n/a")
chars2 <- c("vv", "_ ww_", " xx ", "yy__", " zz", NA, "n/a", "n/a")
data <- data.frame(col1 = ids1, col2 = ids2, col3 = chars1, col4 = chars2, stringsAsFactors = FALSE)

Here is a solution using loops:

library("plyr")
cols_to_fix <- c("col3","col4")
for (i in 1:length(cols_to_fix)) {

ddply summarise proportional count

邮差的信 submitted on 2019-12-04 08:15:47
I am having some trouble using the ddply function from the plyr package. I am trying to summarise the following data with counts and proportions within each group. Here's my data: structure(list(X5employf = structure(c(1L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 3L, 1L, 1L, 1L, 2L, 2L, 3L, 3L, 3L, 1L, 2L, 2L, 2L, 2L, 2L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 3L, 2L, 1L, 1L, 3L, 1L, 3L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 3L, 3L, 3L, 3L, 1L), .Label = c("increase", "decrease", "same"), class = "factor"), X5employff = structure(c
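Counts and within-group proportions of the kind described above can be sketched in two ddply passes. This is only an illustration under stated assumptions: the data frame `survey` and its columns `group` and `response` are hypothetical stand-ins, since the question's own data is cut off.

```r
library(plyr)

# Hypothetical toy data; the question's real structure(...) is truncated.
survey <- data.frame(
  group    = c("a", "a", "a", "b", "b"),
  response = c("increase", "increase", "same", "decrease", "same"),
  stringsAsFactors = FALSE
)

# Pass 1: count the rows in each group/response combination.
counts <- ddply(survey, .(group, response), summarise, n = length(response))

# Pass 2: within each group, divide each count by that group's total.
props <- ddply(counts, .(group), transform, prop = n / sum(n))

print(props)
```

The proportions in `props` sum to 1 within each value of `group`, because the second pass splits on `group` alone.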

How can I rename the output rows/cols of **ply functions from plyr?

笑着哭i submitted on 2019-12-04 07:06:19
I would like to state the row/col output names in a **ply function, ldply, from the plyr package. For example, I have a list, foo, that I want to convert to a data.frame and truncate significant digits with signif():

foo <- list(var.a = runif(3), var.b = runif(3), var.c=runif(3))

What I have now is:

dq <- ldply(foo, signif, 2)
colnames(dq) <- c('id', 'q1', 'q2', 'q3')
rownames(dq) <- dq$id

Is there an easier way? Two previous questions have asked how to rename rows and cols using plyr, but I think my question is different. Can the names be stated at the same time as another

ddply: how do I pass column names as parameters?

ⅰ亾dé卋堺 submitted on 2019-12-04 07:00:59
Question: I have a data frame where the column names are generated based on parameters, so I don't know their exact values. I want to pass these fields to ddply also as parameters. I guess the answer is obvious, but can someone please turn the light on for me. The example below uses the iris data set to give the idea of what I want to do, and the unintended result of my effort. The result of the first example, iris1, is what I want to achieve, but by passing the column names in as parameters, as in my
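One way to pass column names held in variables can be sketched as follows. This is not the question's own code: `group_col` and `value_col` are hypothetical variable names. It relies on ddply accepting a character vector as its grouping argument, and on looking the value column up by name inside an anonymous function.

```r
library(plyr)

# Hypothetical parameters: the column names are only known as strings.
group_col <- "Species"
value_col <- "Sepal.Length"

# A character vector is accepted in place of .(Species); the anonymous
# function indexes each piece by name instead of hard-coding the column.
means <- ddply(iris, group_col, function(piece) {
  data.frame(mean_value = mean(piece[[value_col]]))
})

print(means)
```

Using an anonymous function avoids the non-standard evaluation in summarise, which expects literal column names rather than strings.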

plyr package writing the same function over multiple columns

拈花ヽ惹草 submitted on 2019-12-04 06:50:28
I want to apply the same function to multiple columns using the ddply function, but I'm tired of writing them all out in one line and want to see whether there is a better way of doing this. Here's a simple version of the data: data<-data.frame(TYPE=as.integer(runif(20,1,3)),A_MEAN_WEIGHT=runif(20,1,100),B_MEAN_WEIGHT=runif(20,1,10)) and I want to find the sum of columns A_MEAN_WEIGHT and B_MEAN_WEIGHT by doing this: ddply(data,.(TYPE),summarise,MEAN_A=sum(A_MEAN_WEIGHT),MEAN_B=sum(B_MEAN_WEIGHT)) but in my current data I have more than 8 "*_MEAN_WEIGHT" columns, and I'm tired of writing them 8 times like ddply(data,.
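A possible way to avoid repeating each column is plyr's colwise(), sketched here on small fixed data rather than the random data from the question. The helper variable `weight_cols` is a hypothetical name, and matching columns with the pattern `_MEAN_WEIGHT$` is an assumption about how the real columns are named.

```r
library(plyr)

# Small deterministic stand-in for the question's random data.
data <- data.frame(
  TYPE          = rep(1:2, each = 3),
  A_MEAN_WEIGHT = c(1, 2, 3, 4, 5, 6),
  B_MEAN_WEIGHT = c(10, 20, 30, 40, 50, 60)
)

# Collect every column whose name ends in _MEAN_WEIGHT (assumed pattern).
weight_cols <- grep("_MEAN_WEIGHT$", names(data), value = TRUE)

# colwise() turns sum() into a function that is applied to each selected
# column of a piece; ddply() then runs it once per TYPE.
sums <- ddply(data, .(TYPE), colwise(sum, weight_cols))

print(sums)
```

The output keeps the original column names; they can be renamed afterwards with names(sums) if shorter labels are wanted.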

when is plyr better than data.table? [closed]

核能气质少年 submitted on 2019-12-04 06:37:41
Question: It's difficult to tell what is being asked here. This question is ambiguous, vague, incomplete, overly broad, or rhetorical and cannot be reasonably answered in its current form. For help clarifying this question so that it can be reopened, visit the help center. Closed 6 years ago. "Better" here can mean faster, or easier to read/shorter syntax, or it could also mean that the command is not even doable in data.table. I don't use plyr a lot and would like to know if there are cases when I

Slower ddply when .parallel=TRUE on Mac OS X Version 10.6.7

久未见 submitted on 2019-12-04 05:37:59
I am trying to get ddply to run in parallel on my Mac. The code I've used is as follows:

library(doMC)
library(ggplot2) # for the purposes of getting the baseball data.frame
registerDoMC(2)
> system.time(ddply(baseball, .(year), numcolwise(mean)))
   user  system elapsed
  0.959   0.106   1.522
> system.time(ddply(baseball, .(year), numcolwise(mean), .parallel=TRUE))
   user  system elapsed
  2.221   2.790   2.552

Why is ddply slower when I run .parallel=TRUE? I have searched online to no avail. I've also tried registerDoMC() and the results were the same. The baseball data may be too small to see improvement by