plyr

Using plyr, doMC, and summarise() with very big dataset?

你离开我真会死。 posted on 2019-12-07 01:47:11
Question: I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on frequent running, so my thought was to use doMC and the .parallel=TRUE flag with plyr like so (simplified a bit):
    library(plyr)
    require(doMC)
    registerDoMC()
    df <- ddply(df, c("cat1", "cat2"), summarize, count=length(cat2), .parallel = TRUE)
If I set the number of cores explicitly to two (using registerDoMC(cores=2)) my 8 GB of

Problem loading the plyr package

不想你离开。 posted on 2019-12-06 21:35:55
Question: I use R 2.13.1 and have unsuccessfully tried to load the package "plyr 1.6" in R. I have manually installed it into a directory "~/R/library". My code is:
    .libPaths("~/R/library")
    library(plyr)
I get the message:
    Error in library(plyr) : 'plyr' is not a valid installed package
It works fine with other packages ("chron", "zoo", "ismev", "Lmoments"), but not for the "plyr" package, and I have no idea what is going on. I have tried installing and loading earlier versions of "plyr", but with the

R: rollapplyr and lm factor error: Does rollapplyr change variable class?

。_饼干妹妹 posted on 2019-12-06 19:25:29
This question builds upon a previous one, which was nicely answered for me here: "R: Grouped rolling window linear regression with rollapply and ddply". Wouldn't you know that the code doesn't quite work when extended to the real data rather than the example data? I have a somewhat large dataset with the following characteristics.
    str(T0_satData_reduced)
    'data.frame': 45537 obs. of 5 variables:
     $ date : POSIXct, format: "2014-11-17 08:47:35" "2014-11-17 08:47:36" "2014-11-17 08:47:37" ...
     $ trial : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ vial : Factor w/ 4 levels "1","2",

R: Is there a good replacement for plyr::rbind.fill in dplyr?

断了今生、忘了曾经 posted on 2019-12-06 17:09:31
Question: For tidyverse users, dplyr is the new way to work with data. For users trying to avoid the older package plyr, what is the equivalent function to rbind.fill in dplyr?
Answer 1: Yes. dplyr::bind_rows. Credit goes to the commenter.
Source: https://stackoverflow.com/questions/44464441/r-is-there-a-good-replacement-for-plyrrbind-fill-in-dplyr
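As a minimal sketch of the answer above, the following shows dplyr::bind_rows stacking two data frames whose columns only partly overlap, which is the situation rbind.fill is normally used for. The data frames a and b are made up for illustration and do not come from the original question.

```r
library(dplyr)

# Two hypothetical data frames with only the column "x" in common
a <- data.frame(x = 1:2, y = c("a", "b"))
b <- data.frame(x = 3L, z = TRUE)

# bind_rows() matches columns by name and fills missing cells with NA
combined <- bind_rows(a, b)
print(combined)
```

Columns that exist in only one input are kept, and the rows coming from the other input get NA in those columns.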

Merging files (and file names) in R

我是研究僧i posted on 2019-12-06 14:45:56
Question: I'm trying to merge a directory full of comma-delimited text files using R, while also incorporating the file name of each file as a new variable in the data set. I've been using the following:
    library(plyr)
    file_list <- list.files()
    dataset <- ldply(file_list, read.table, header=FALSE, sep=",")
Can anyone shed any light on how I'd add the file name for each file read as a new variable within dataset? Many thanks, -Jon
Answer 1: You can just make a wrapper around the read.table() function that
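The answer is cut off here, so the following is only a sketch of one way such a wrapper could look, not necessarily what the answerer wrote. The function name read_with_name, the column name file_name, and the two temporary demo files are all assumptions made for illustration.

```r
library(plyr)

# Create two small comma-delimited files in a temporary directory (demo data only)
tmp_dir <- tempfile("merge_demo_")
dir.create(tmp_dir)
write.table(data.frame(1:2, c("a", "b")), file.path(tmp_dir, "first.txt"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:4, c("c", "d")), file.path(tmp_dir, "second.txt"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# Hypothetical wrapper: read one file, then record where the rows came from
read_with_name <- function(path) {
  dat <- read.table(path, header = FALSE, sep = ",", stringsAsFactors = FALSE)
  dat$file_name <- basename(path)
  dat
}

file_list <- list.files(tmp_dir, full.names = TRUE)
dataset <- ldply(file_list, read_with_name)
print(dataset)
```

Because the wrapper returns a data frame with the extra column already attached, ldply() can row-bind the pieces without any further bookkeeping.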

extracting p values from multiple linear regression (lm) inside of a ddply function using spatial data

核能气质少年 posted on 2019-12-06 14:26:55
I have a set of spatial coordinate (x,y) data that has a response variable for each coordinate over the course of several years. The following code generates a similar data frame:
    df <- data.frame(
      id = rep(1:2, 2),
      x = rep(c(25, 30), 10),
      y = rep(c(100, 200), 10),
      year = rep(1980:1989, 2),
      response = rnorm(20)
    )
The resulting data frame:
    head(df)
      id  x   y year   response
    1  1 25 100 1980  0.1707431
    2  2 30 200 1981  1.3562263
    3  1 25 100 1982 -0.4590506
    4  2 30 200 1983  1.3238410
    5  1 25 100 1984  1.7765772
    6  2 30 200 1985 -0.6258069
I want to run a linear regression on each cell through time to get the
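The question is truncated, but the title asks for p-values from lm inside ddply. A minimal sketch under that reading: fit response ~ year separately for each id and pull the slope's p-value out of the coefficient table returned by summary(). The model formula, the seed, and the output column name p_value are assumptions, not taken from the original post.

```r
library(plyr)

set.seed(42)  # arbitrary seed so the demo data is reproducible
df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30), 10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

# One regression per cell (id); keep only the p-value of the year slope
p_values <- ddply(df, "id", function(cell) {
  fit <- lm(response ~ year, data = cell)
  data.frame(p_value = summary(fit)$coefficients["year", "Pr(>|t|)"])
})
print(p_values)
```

Other rows and columns of summary(fit)$coefficients (estimate, standard error, intercept) can be returned the same way by adding columns to the data frame built inside the function.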

Function “diff” over various groups in R

两盒软妹~` posted on 2019-12-06 14:21:25
Question: I have a data frame with two grouping variables, one time variable and a dependent variable, e.g.:
    name <- c("a", "a", "a", "a", "a", "a","a", "a", "a", "b", "b", "b","b", "b", "b","b", "b", "b")
    class <- c("c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3","c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3")
    year <- c("2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008")
    value <- c(100, 33, 80, 90, 80, 100, 100, 90,
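The example data is cut off, so the sketch below uses a smaller made-up data frame with a single grouping column to show one base-R way of applying diff within groups. The column name change and the choice to pad each group with a leading NA are assumptions. With two grouping columns, as in the question, both would be passed to ave().

```r
# Hypothetical data: one grouping variable, years in descending order
df <- data.frame(
  name  = c("a", "a", "a", "b", "b", "b"),
  year  = c(2010, 2009, 2008, 2010, 2009, 2008),
  value = c(100, 33, 80, 90, 80, 100)
)

# diff() depends on row order, so sort by group and year first
df <- df[order(df$name, df$year), ]

# ave() applies the function within each group; the leading NA keeps the
# result the same length as the group (the first year has no predecessor)
df$change <- ave(df$value, df$name, FUN = function(v) c(NA, diff(v)))
print(df)
```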

Column in the j-expression of a data.table (with/without a by statement)

橙三吉。 posted on 2019-12-06 12:28:43
Here are two artificial but, I hope, pedagogical examples of my problem.
1) When running this code:
    > dat0 <- data.frame(A=c("a","a","b"), B="")
    > data.table(dat0)[, lapply(.SD, function(x) length(A)) , by = "A"]
       A B
    1: a 1
    2: b 1
I expected the output
       A B
    1: a 2
    2: b 1
(similarly to plyr::ddply(dat0, .(A), nrow)).
Update to question 1) Let me give a less artificial example. Consider the following dataframe:
    dat0 <- data.frame(A=c("a","a","b"), x=c(1,2,3), y=c(9,8,7))
    > dat0
      A x y
    1 a 1 9
    2 a 2 8
    3 b 3 7
Using the plyr package, I get the means of x and y by each value of A as follows:
    > ddply(dat0,
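For context, a sketch of how the expected results can be obtained in data.table. Within j, a column named in by is a single value per group, which is consistent with length(A) returning 1 above. The special symbol .N gives the group size instead, and lapply(.SD, mean) gives per-group means. This uses the second dat0 from the question. The variable names dt, counts and means are illustrative.

```r
library(data.table)

dat0 <- data.frame(A = c("a", "a", "b"), x = c(1, 2, 3), y = c(9, 8, 7))
dt <- as.data.table(dat0)

# Rows per group: .N is the number of rows in the current group
counts <- dt[, .N, by = A]
print(counts)

# Mean of every non-grouping column, per group
means <- dt[, lapply(.SD, mean), by = A]
print(means)
```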

Seasonal aggregate of monthly data

痞子三分冷 posted on 2019-12-06 11:50:47
I have a data frame df with x, y, and monthly.year data for each x,y point. I am trying to get the seasonal aggregate. I need to calculate seasonal means, i.e. for winter the mean of (December, January, February); for spring the mean of (March, April, May); for summer the mean of (June, July, August); and for autumn the mean of (September, October, November). The data looks similar to:
    set.seed(1)
    df <- data.frame(x=1:3,y=1:3, matrix(rnorm(72),nrow=3) )
    names(df)[3:26] <- paste(month.abb,rep(2009:2010,each=12),sep=".")
      x y   Jan.2009  Feb.2009 ...  Dec.2010
    1 1 1 -0.6264538 1.5952808 ... 2.1726117
    2 2 2  0.1836433 0.3295078 ...
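A rough base-R sketch of the seasonal grouping described above. It rests on one simplifying assumption that the question may not share: months are pooled across both years, so each December is averaged with the January and February columns in the table rather than being carried into the following year's winter. The lookup vector season_of and the result name seasonal are illustrative.

```r
set.seed(1)
df <- data.frame(x = 1:3, y = 1:3, matrix(rnorm(72), nrow = 3))
names(df)[3:26] <- paste(month.abb, rep(2009:2010, each = 12), sep = ".")

# Map each month abbreviation to its season
season_of <- c(Dec = "winter", Jan = "winter", Feb = "winter",
               Mar = "spring", Apr = "spring", May = "spring",
               Jun = "summer", Jul = "summer", Aug = "summer",
               Sep = "autumn", Oct = "autumn", Nov = "autumn")

month_cols <- names(df)[3:26]
month_part <- sub("\\..*$", "", month_cols)  # "Jan.2009" -> "Jan"
season <- season_of[month_part]

# One row mean per season, computed over that season's columns (both years pooled)
seasonal <- sapply(split(month_cols, season), function(cols) rowMeans(df[, cols]))
result <- cbind(df[, c("x", "y")], seasonal)
print(result)
```

To get one value per season and year instead, the split could be done on a combined season and year label, with December assigned to the following year.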

Get row with highest value from one column after chunking with plyr - R

五迷三道 posted on 2019-12-06 08:41:11
Suppose I have a dataframe that looks like this:
       v1 v2 v3 v4 v5 v6
    r1  1  2  2  4  5  9
    r2  1  2  2  4  5 10
    r3  1  2  2  4  5  7
    r4  1  2  2  4  5 12
    r5  2  2  2  4  5  9
    r6  2  2  2  4  5 10
I would like to get the row with the highest value in v6 that has the value 1 in v1. I know how to get all rows where v1 = 1 and select the first row of that, thanks to this answer to a previous question:
    ddply( df , .variables = "v1" , .fun = function(x) x[1,] )
How can I change the function so that I get the row with the highest value in v6?
From the previous results, I'd use [ to subset on your first condition using logical
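The answer is truncated, so this is a sketch of one way to finish the idea rather than the answerer's exact code: subset with [ on v1 == 1, then use which.max on v6. A ddply variant that returns the top row for every value of v1 is shown as well. The names v1_rows, top_row and top_per_group are illustrative.

```r
library(plyr)

df <- data.frame(
  v1 = c(1, 1, 1, 1, 2, 2), v2 = 2, v3 = 2, v4 = 4, v5 = 5,
  v6 = c(9, 10, 7, 12, 9, 10),
  row.names = paste0("r", 1:6)
)

# Logical subsetting on the first condition, then the row with the largest v6
v1_rows <- df[df$v1 == 1, ]
top_row <- v1_rows[which.max(v1_rows$v6), ]
print(top_row)

# Same idea applied to every group of v1 via ddply
top_per_group <- ddply(df, "v1", function(g) g[which.max(g$v6), ])
print(top_per_group)
```

Note that which.max returns only the first maximum, so ties in v6 yield a single row per group.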