plyr | 易学教程

Using plyr, doMC, and summarise() with very big dataset?

Question: I have a fairly large dataset (~1.4m rows) that I'm doing some splitting and summarizing on. The whole thing takes a while to run, and my final application depends on frequent running, so my thought was to use doMC and the .parallel=TRUE flag with plyr like so (simplified a bit):

    library(plyr)
    require(doMC)
    registerDoMC()
    df <- ddply(df, c("cat1", "cat2"), summarize, count=length(cat2), .parallel = TRUE)

If I set the number of cores explicitly to two (using registerDoMC(cores=2) ) my 8 GB of

Problem loading the plyr package

Question: I use R 2.13.1 and have unsuccessfully tried to load the package "plyr 1.6" in R. I have manually installed it into a directory "~/R/library". My code is:

    .libPaths("~/R/library")
    library(plyr)

I get the message:

    Error in library(plyr) : 'plyr' is not a valid installed package

It works fine with other packages ("chron", "zoo", "ismev", "Lmoments"), but not for the "plyr" package, and I have no idea what is going on. I have tried installing and loading earlier versions of "plyr", but with the

R: rollapplyr and lm factor error: Does rollapplyr change variable class?

This question builds upon a previous one, which was nicely answered for me here: "R: Grouped rolling window linear regression with rollapply and ddply". Wouldn't you know that the code doesn't quite work when extended to the real data rather than the example data? I have a somewhat large dataset with the following characteristics:

    str(T0_satData_reduced)
    'data.frame': 45537 obs. of 5 variables:
     $ date : POSIXct, format: "2014-11-17 08:47:35" "2014-11-17 08:47:36" "2014-11-17 08:47:37" ...
     $ trial : Factor w/ 5 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
     $ vial : Factor w/ 4 levels "1","2",

R: Is there a good replacement for plyr::rbind.fill in dplyr?

Question: For tidyverse users, dplyr is the new way to work with data. For users trying to avoid the older package plyr, what is the equivalent function to rbind.fill in dplyr?

Answer 1: Yes. dplyr::bind_rows. Credit goes to a commenter.

Source: https://stackoverflow.com/questions/44464441/r-is-there-a-good-replacement-for-plyrrbind-fill-in-dplyr
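
As a rough illustration of that answer, the sketch below stacks two data frames whose columns only partly overlap. The data frames `a` and `b` are invented for this example. Columns missing from one input are filled with NA, which is the behaviour rbind.fill users typically rely on.

```r
library(dplyr)

# Two hypothetical data frames with only partly overlapping columns
a <- data.frame(id = 1:2, x = c(10, 20))
b <- data.frame(id = 3L, y = "z", stringsAsFactors = FALSE)

# bind_rows() stacks the inputs and fills absent columns with NA
combined <- bind_rows(a, b)
print(combined)
```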

Merging files (and file names) in R

Question: I'm trying to merge a directory full of comma-delimited text files using R, while also incorporating the file name of each file as a new variable in the data set. I've been using the following:

    library(plyr)
    file_list <- list.files()
    dataset <- ldply(file_list, read.table, header=FALSE, sep=",")

Can anyone shed any light on how I'd add the file name for each file read as a new variable within dataset? Many thanks, -Jon

Answer 1: You can just make a wrapper around the read.table() function that
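
The answer is cut off here, so the following is only one plausible shape for such a wrapper, not the answerer's actual code. The helper name `read_with_name`, the `file_name` column, and the two temporary CSV files are all assumptions made for the sketch, and base R's lapply() and rbind() stand in for ldply so the example runs without plyr.

```r
# Create two small CSV files in a temporary directory (made-up data)
dir <- tempfile("merge_demo")
dir.create(dir)
write.table(data.frame(1:2, c("a", "b")), file.path(dir, "first.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)
write.table(data.frame(3:4, c("c", "d")), file.path(dir, "second.csv"),
            sep = ",", row.names = FALSE, col.names = FALSE)

# Hypothetical wrapper: read one file and record its name in a new column
read_with_name <- function(path, ...) {
  out <- read.table(path, ...)
  out$file_name <- basename(path)
  out
}

# Read every file through the wrapper, then stack the pieces
file_list <- list.files(dir, full.names = TRUE)
pieces <- lapply(file_list, read_with_name, header = FALSE, sep = ",")
dataset <- do.call(rbind, pieces)
print(dataset)
```

Because ldply forwards extra arguments to its function, the same wrapper could presumably replace read.table in the original call, as in ldply(file_list, read_with_name, header=FALSE, sep=",").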

extracting p values from multiple linear regression (lm) inside of a ddply function using spatial data

I have a set of spatial coordinate (x,y) data that has a response variable for each coordinate over the course of several years. The following code generates a similar data frame:

    df <- data.frame(
      id = rep(1:2, 2),
      x = rep(c(25, 30),10),
      y = rep(c(100, 200), 10),
      year = rep(1980:1989, 2),
      response = rnorm(20)
    )

The resulting data frame:

    head(df)
      id  x   y year   response
    1  1 25 100 1980  0.1707431
    2  2 30 200 1981  1.3562263
    3  1 25 100 1982 -0.4590506
    4  2 30 200 1983  1.3238410
    5  1 25 100 1984  1.7765772
    6  2 30 200 1985 -0.6258069

I want to run a linear regression on each cell through time to get the
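
The question is truncated, so the sketch below only guesses at the goal: fitting response against year separately for each cell and pulling out the slope's p-value. The helper `fit_cell`, the seed, and the choice of response ~ year as the model are assumptions, and base R's split() and lapply() are used here in place of ddply.

```r
set.seed(42)  # arbitrary seed so the made-up responses are reproducible
df <- data.frame(
  id = rep(1:2, 2),
  x = rep(c(25, 30), 10),
  y = rep(c(100, 200), 10),
  year = rep(1980:1989, 2),
  response = rnorm(20)
)

# Hypothetical helper: fit response ~ year for one cell and return
# the slope together with its p-value
fit_cell <- function(cell) {
  coefs <- coef(summary(lm(response ~ year, data = cell)))
  data.frame(id = cell$id[1], x = cell$x[1], y = cell$y[1],
             slope = coefs["year", "Estimate"],
             p_value = coefs["year", "Pr(>|t|)"])
}

# Split by cell id, fit each piece, and stack the one-row results
p_values <- do.call(rbind, lapply(split(df, df$id), fit_cell))
print(p_values)
```

A function of this shape, taking one group's rows and returning a data frame, is also what ddply expects as its .fun argument.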

Function “diff” over various groups in R

Question: I have a data frame with 2 groups, 1 time variable, and a dependent variable, e.g.:

    name <- c("a", "a", "a", "a", "a", "a","a", "a", "a", "b", "b", "b","b", "b", "b","b", "b", "b")
    class <- c("c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3","c1", "c1", "c1", "c2", "c2", "c2", "c3", "c3", "c3")
    year <- c("2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008", "2010", "2009", "2008")
    value <- c(100, 33, 80, 90, 80, 100, 100, 90,
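
Since the example data is cut off, the sketch below uses a smaller invented data frame with the same column names to show one way of applying diff() within each name/class group. It assumes the goal is the year-over-year change in value, which the truncated question does not confirm, and it uses base R's ave() instead of plyr.

```r
# Small made-up data set in the same shape as the question's columns
dat <- data.frame(
  name  = rep(c("a", "b"), each = 6),
  class = rep(rep(c("c1", "c2"), each = 3), times = 2),
  year  = rep(c(2010, 2009, 2008), times = 4),
  value = c(100, 33, 80, 90, 80, 100, 50, 60, 70, 10, 30, 20)
)

# Sort so that years ascend within each name/class group
dat <- dat[order(dat$name, dat$class, dat$year), ]

# ave() applies the function per group; the first year of a group has
# no predecessor, so its difference is NA
dat$change <- ave(dat$value, dat$name, dat$class,
                  FUN = function(v) c(NA, diff(v)))
print(dat)
```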

Column in the j-expression of a data.table (with/without a by statement)

Here are two artificial but, I hope, pedagogical examples of my problem.

1) When running this code:

    > dat0 <- data.frame(A=c("a","a","b"), B="")
    > data.table(dat0)[, lapply(.SD, function(x) length(A)) , by = "A"]
       A B
    1: a 1
    2: b 1

I expected the output

       A B
    1: a 2
    2: b 1

(similarly to plyr::ddply(dat0, .(A), nrow) ).

Update to question 1) Let me give a less artificial example. Consider the following dataframe:

    dat0 <- data.frame(A=c("a","a","b"), x=c(1,2,3), y=c(9,8,7))
    > dat0
      A x y
    1 a 1 9
    2 a 2 8
    3 b 3 7

Using the plyr package, I get the means of x and y by each value of A as follows:

    > ddply(dat0,
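
For context, inside a grouped j-expression data.table exposes each grouping column as a length-one value, which is why length(A) comes out as 1 for every group. The sketch below is not taken from the original answers. It reuses the question's dat0 and shows two standard data.table idioms: .N for group sizes and lapply(.SD, mean) for per-group means. The names `dt`, `counts`, and `means` are introduced only for this example.

```r
library(data.table)

dat0 <- data.frame(A = c("a", "a", "b"), x = c(1, 2, 3), y = c(9, 8, 7))
dt <- as.data.table(dat0)

# .N holds the number of rows in the current group
counts <- dt[, .N, by = A]
print(counts)

# .SD holds the non-grouping columns (x and y) for the current group
means <- dt[, lapply(.SD, mean), by = A]
print(means)
```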

Seasonal aggregate of monthly data

I have a dataframe df with x, y, and monthly.year data for each x,y point. I am trying to get the seasonal aggregate. I need to calculate seasonal means, i.e. for winter the mean of (December, January, February); for spring the mean of (March, April, May); for summer the mean of (June, July, August); and for autumn the mean of (September, October, November). The data looks similar to:

    set.seed(1)
    df <- data.frame(x=1:3,y=1:3, matrix(rnorm(72),nrow=3) )
    names(df)[3:26] <- paste(month.abb,rep(2009:2010,each=12),sep=".")

      x y   Jan.2009  Feb.2009 ...  Dec.2010
    1 1 1 -0.6264538 1.5952808 ... 2.1726117
    2 2 2  0.1836433 0.3295078 ...
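
One possible base R sketch of the seasonal means described above follows. It makes a simplifying assumption the question does not state: December is grouped with January and February of the same calendar year rather than with the following year's winter. The lookup vector `season_of` and the other intermediate names are invented for the example.

```r
set.seed(1)
df <- data.frame(x = 1:3, y = 1:3, matrix(rnorm(72), nrow = 3))
names(df)[3:26] <- paste(month.abb, rep(2009:2010, each = 12), sep = ".")

# Map each month abbreviation to a season label
season_of <- c(Dec = "winter", Jan = "winter", Feb = "winter",
               Mar = "spring", Apr = "spring", May = "spring",
               Jun = "summer", Jul = "summer", Aug = "summer",
               Sep = "autumn", Oct = "autumn", Nov = "autumn")

month_cols <- names(df)[3:26]
month <- sub("\\..*$", "", month_cols)   # "Jan.2009" -> "Jan"
year  <- sub("^.*\\.", "", month_cols)   # "Jan.2009" -> "2009"

# Assumption: seasons are formed within a calendar year
group <- paste(season_of[month], year, sep = ".")

# Average the three monthly columns that share a season-year label
seasonal <- sapply(split(month_cols, group),
                   function(cols) rowMeans(df[, cols, drop = FALSE]))
seasonal <- cbind(df[, c("x", "y")], seasonal)
print(round(seasonal, 3))
```

Grouping December with the next year's January and February would only require shifting the year for the "Dec" columns before building `group`.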

Get row with highest value from one column after chunking with plyr - R

Suppose I have a dataframe that looks like this:

       v1 v2 v3 v4 v5 v6
    r1  1  2  2  4  5  9
    r2  1  2  2  4  5 10
    r3  1  2  2  4  5  7
    r4  1  2  2  4  5 12
    r5  2  2  2  4  5  9
    r6  2  2  2  4  5 10

I would like to get the row with the highest value in v6 that has the value 1 in v1. I know how to get all rows where v1 = 1 and select the first row of that, thanks to this answer to a previous question:

    ddply( df , .variables = "v1" , .fun = function(x) x[1,] )

How can I change the function so that I get the row with the highest value in v6?

From the previous results, I'd use [ to subset on your first condition using logical
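
The answer is cut off, so what follows is a sketch of the approach it appears to begin describing, not the answerer's code: subset with [ on the first condition, then use which.max() to pick the row with the largest v6. The data frame is rebuilt from the table in the question, and `subset_v1` and `top_row` are names chosen for the example.

```r
df <- data.frame(
  v1 = c(1, 1, 1, 1, 2, 2),
  v2 = 2, v3 = 2, v4 = 4, v5 = 5,
  v6 = c(9, 10, 7, 12, 9, 10),
  row.names = paste0("r", 1:6)
)

# Keep rows where v1 is 1, then pick the row with the largest v6
subset_v1 <- df[df$v1 == 1, ]
top_row <- subset_v1[which.max(subset_v1$v6), ]
print(top_row)
```

To get the top row for every value of v1 instead, the function passed to ddply could plausibly become function(x) x[which.max(x$v6), ].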