plyr | 易学教程

Sampling small data frame from a big dataframe

I am trying to sample a data frame from a given data frame such that there are enough samples from each level of a variable. This can be achieved by splitting the data frame by level and sampling from each piece. I thought ddply (data frame to data frame) would do it for me. Taking a minimal example:

    set.seed(1)
    data1 <- data.frame(a = sample(c('B0','B1','B2'), 100, replace = TRUE), b = rnorm(100), c = runif(100))
    > summary(data1$a)
    B0 B1 B2
    30 32 38

The following commands perform the sampling... When I enter...

    data2 <- ddply(data1, c('a'), function(x) sample(x, 20, replace = FALSE))

I get the
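The call in the excerpt is likely to fail because `sample()` applied to a data frame treats it as a list of columns, so it tries to draw 20 of 3 columns without replacement. A minimal base R sketch of per-level row sampling follows. The fixed group sizes (30, 32, 38) are an assumption taken from the summary above so that the example is deterministic, and `sample_rows` is a hypothetical helper name.

```r
set.seed(1)

# Assumption: group sizes fixed to match the summary shown in the question.
data1 <- data.frame(
  a = rep(c("B0", "B1", "B2"), times = c(30, 32, 38)),
  b = rnorm(100),
  c = runif(100)
)

# Hypothetical helper: draw n rows (not columns) from one piece.
sample_rows <- function(x, n) {
  x[sample(nrow(x), n, replace = FALSE), , drop = FALSE]
}

# Split by level, sample rows within each piece, then recombine.
pieces <- lapply(split(data1, data1$a), sample_rows, n = 20)
data2 <- do.call(rbind, pieces)

# With plyr installed, the same helper should also work as:
# plyr::ddply(data1, "a", sample_rows, n = 20)
```

Indexing rows with `sample(nrow(x), n)` is the key change; the split and recombine steps are only one of several ways to apply it per group.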

Tag all duplicate rows in R as in Stata

Following up from my question here, I am trying to replicate in R the functionality of the Stata command duplicates tag, which allows me to tag all the rows of a dataset that are duplicates in terms of a given set of variables:

    clear *
    set obs 16
    g f1 = _n
    expand 104
    bys f1: g f2 = _n
    expand 2
    bys f1 f2: g f3 = _n
    expand 41
    bys f1 f2 f3: g f4 = _n
    des // describe the dataset in memory
    preserve
    sample 10 // draw a 10% random sample
    tempfile sampledata
    save `sampledata', replace
    restore
    // append the duplicate rows to the data
    append using `sampledata'
    sort f1-f4
    duplicates tag f1-f4, generate
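One possible base R approach, shown on a small made-up data frame rather than the Stata dataset above: `duplicated()` alone flags only the second and later copies, so it is combined with a `fromLast = TRUE` pass to flag every copy. The column names `is_dup` and `dup` are illustrative assumptions, and the `dup` count is meant to mimic the number of surplus copies that `duplicates tag` reports.

```r
# Toy data (assumption): rows 1 and 2 share the same key values.
df <- data.frame(
  f1 = c(1, 1, 2, 2, 3),
  f2 = c("a", "a", "b", "c", "d")
)
keys <- c("f1", "f2")

# TRUE for every row whose key combination occurs more than once.
df$is_dup <- duplicated(df[keys]) | duplicated(df[keys], fromLast = TRUE)

# Number of other rows sharing the same key (0 means unique).
df$dup <- ave(seq_len(nrow(df)), df$f1, df$f2, FUN = length) - 1L
```

For many key columns, the grouping arguments to `ave()` would need to be listed explicitly or built another way; this sketch keeps two keys for readability.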

Calculating most frequent level by category with plyr

I would like to calculate the most frequent factor level by category with plyr using the code below. The data frame b shows the requested result. Why does c$mlevels only have the value "numeric"?

    require(plyr)
    set.seed(0)
    a <- data.frame(cat=round(runif(100, 1, 3)), levels=factor(round(runif(100, 1, 10))))
    mode <- function(x) names(table(x))[which.max(table(x))]
    b <- data.frame(cat=1:3, mlevels=c(mode(a$levels[a$cat==1]), mode(a$levels[a$cat==2]), mode(a$levels[a$cat==3])))
    c <- ddply(a, .(cat), summarise, mlevels=mode(levels))

When you use summarise, plyr seems to "not see" the function
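The value "numeric" is what `base::mode()` returns for a factor, which is consistent with the base function being called instead of the user-defined one. A minimal sketch of one way around the name clash is to give the helper a name that does not exist in base R. `most_frequent` is a hypothetical name, and the grouping here uses base `tapply()` so the example runs without plyr.

```r
set.seed(0)
a <- data.frame(
  cat = round(runif(100, 1, 3)),
  levels = factor(round(runif(100, 1, 10)))
)

# Renamed helper (hypothetical name) so it cannot be confused with base::mode().
most_frequent <- function(x) names(table(x))[which.max(table(x))]

# base::mode() describes the storage mode; for a factor that is "numeric".
base_result <- base::mode(a$levels)

# One most frequent level per category, using base R only.
by_cat <- tapply(a$levels, a$cat, most_frequent)
result <- data.frame(
  cat = as.integer(names(by_cat)),
  mlevels = as.vector(by_cat)
)

# With plyr installed, the renamed helper should also work as:
# plyr::ddply(a, plyr::.(cat), plyr::summarise, mlevels = most_frequent(levels))
```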

Using ifelse with transform in ddply

I am trying to use ddply with transform to populate a new variable (summary_Date) in a data frame with variables ID and Date. The value of the variable is chosen with ifelse, based on the length of the piece being evaluated. If there are fewer than five observations for an ID in a given month, I want summary_Date to be calculated by rounding the date to the nearest month (using round_date from package lubridate); if there are more than five observations for an ID in a given month, I want summary_Date to simply be Date.

    require(plyr)
    require(lubridate)
    test.df <- structure(
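A point worth keeping in mind with this kind of rule: `ifelse()` returns a result shaped like its test, so a length-one test such as `length(Date) < 5` yields a single value, and `ifelse()` also drops the Date class. A plain `if ... else` per piece avoids both. The sketch below is base R only and rests on several assumptions: the toy data are made up, `round_to_month` is a hypothetical stand-in for lubridate's `round_date()` that simply cuts at day 15, and pieces with exactly five rows are treated like the larger ones because the excerpt does not say.

```r
# Hypothetical stand-in for lubridate::round_date(d, "month"): cut at day 15.
round_to_month <- function(d) {
  first <- as.Date(format(d, "%Y-%m-01"))
  next_first <- as.Date(format(first + 31, "%Y-%m-01"))
  late <- as.integer(format(d, "%d")) > 15
  out <- first
  out[late] <- next_first[late]
  out
}

# Toy data (assumption): ID "A" has six rows in March 2012, ID "B" has two.
test_df <- data.frame(
  ID = c(rep("A", 6), rep("B", 2)),
  Date = as.Date(c("2012-03-01", "2012-03-05", "2012-03-10", "2012-03-15",
                   "2012-03-20", "2012-03-25", "2012-03-03", "2012-03-28"))
)

# One piece per ID and month; if/else is evaluated once per piece.
grp <- interaction(test_df$ID, format(test_df$Date, "%Y-%m"), drop = TRUE)
pieces <- lapply(split(test_df, grp), function(piece) {
  piece$summary_Date <- if (nrow(piece) < 5) round_to_month(piece$Date) else piece$Date
  piece
})
out <- do.call(rbind, unname(pieces))
```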

Combine a list of data frames into one preserving row names

I do know about the basics of combining a list of data frames into one, as has been answered before. However, I am interested in smart ways to maintain row names. Suppose I have a list of data frames that are fairly equal and I keep them in a named list.

    library(plyr)
    library(dplyr)
    library(data.table)
    a = data.frame(x=1:3, row.names = letters[1:3])
    b = data.frame(x=4:6, row.names = letters[4:6])
    c = data.frame(x=7:9, row.names = letters[7:9])
    l = list(A=a, B=b, C=c)

When I use do.call, the list names are combined with the row names:

    > rownames(do.call("rbind", l))
    [1] "A.a" "A.b" "A.c" "B.d"
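One simple option in base R, offered as a sketch rather than the answer the original thread settled on: `rbind()` only prefixes row names when its arguments are named, so dropping the list names with `unname()` keeps the original row names.

```r
a <- data.frame(x = 1:3, row.names = letters[1:3])
b <- data.frame(x = 4:6, row.names = letters[4:6])
c <- data.frame(x = 7:9, row.names = letters[7:9])
l <- list(A = a, B = b, C = c)

# Named list: row names get the list name as a prefix ("A.a", "A.b", ...).
prefixed <- rownames(do.call(rbind, l))

# Unnamed list: the original row names are kept as they are.
combined <- do.call(rbind, unname(l))
rownames(combined)
```

This relies on the row names being unique across the data frames, as they are here.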

How to use string variables to create variables list for ddply?

Using R's built-in ToothGrowth example dataset, this works:

    ddply(ToothGrowth, .(supp,dose), function(df) mean(df$len))

But I would like to have the subsetting factors be variables, something like

    factor1 = 'supp'
    factor2 = 'dose'
    ddply(ToothGrowth, .(factor1,factor2), function(df) mean(df$len))

That doesn't work. How should this be done? I thought perhaps something like this:

    factorCombo = paste('.(',factor1,',',factor2,')', sep='')
    ddply(ToothGrowth, factorCombo, function(df) mean(df$len))

But it doesn't work either. I think I am close, but not sure the proper way to do it. I suppose the
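A sketch of the usual way to group by column names held in strings: collect them in a character vector and pass that as the grouping specification. ddply accepts a character vector of column names for its grouping argument, so no `.()` call has to be pasted together. The runnable part below uses base `aggregate()` so it does not depend on plyr being installed; `group_vars` is an illustrative name.

```r
factor1 <- "supp"
factor2 <- "dose"
group_vars <- c(factor1, factor2)

# Base R: mean of len for every supp/dose combination.
means <- aggregate(ToothGrowth["len"], by = ToothGrowth[group_vars], FUN = mean)

# With plyr installed, the character vector can be passed directly:
# plyr::ddply(ToothGrowth, group_vars, function(df) mean(df$len))
```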

ddply with fixed number of rows

I want to break up my data by 'number of rows'. That is to say, I want to send a fixed number of rows to my function, and when I get to the end of the data frame (the last chunk) I need to send the chunk whether it has the fixed number of rows or fewer. Something like this:

    ddply(df, .(8 rows), .fun=somefunction)

If you want to use plyr you can add a category column:

    df <- data.frame(x=rnorm(100), y=rnorm(100))
    somefunction <- function(df) { data.frame(mean(df$x), mean(df$y)) }
    df$category <- rep(letters[1:10], each=10)
    ddply(df, .(category), somefunction)

But, the apply family might be a better
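The category-column idea can be generalised to any chunk size by deriving the chunk index from the row number with integer division. Below is a minimal base R sketch with a chunk size of 8; the extra `n` column and the names `chunk_size` and `chunk_id` are additions for illustration only.

```r
set.seed(42)
df <- data.frame(x = rnorm(100), y = rnorm(100))

# Returns one summary row per chunk; n records how many rows the chunk had.
somefunction <- function(d) {
  data.frame(mean_x = mean(d$x), mean_y = mean(d$y), n = nrow(d))
}

chunk_size <- 8
# Rows 1-8 get id 0, rows 9-16 get id 1, and so on; the last chunk may be short.
chunk_id <- (seq_len(nrow(df)) - 1) %/% chunk_size

result <- do.call(rbind, lapply(split(df, chunk_id), somefunction))
```

With 100 rows this gives twelve full chunks and a final chunk of four rows. The same `chunk_id` vector could be added as a column and used as the ddply grouping variable.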

Co-occurrence matrix using SAC?

I have the following data frame 'x':

    id,item,volume
    a,c1,2
    a,c2,3
    a,c3,2
    a,c4,1
    a,c5,4
    b,c6,6
    b,c1,2
    b,c3,1
    b,c2,6
    b,c4,4
    c,c2,5
    c,c8,6
    c,c9,2
    d,c1,1
    e,c3,7
    e,c2,3
    e,c1,2
    e,c9,5
    e,c4,1
    f,c1,7
    f,c3,1

The first column is the id of a customer, the second column is the id of an item that customer bought, and the third column is the number of those items bought. I'm trying to create a co-occurrence matrix, which is a square matrix with 8 rows and columns, 8 being the number of distinct items.

    n = length(unique(x$item))

Could this be done through a split-apply-combine (SAC) paradigm? For every id, I need to update the above
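As a point of comparison with a per-customer split-apply-combine loop, here is a base R sketch that builds a customer-by-item incidence matrix and takes its cross product. It counts how many customers bought both items and ignores `volume`, which is an assumption, since the excerpt does not say how volumes should enter the matrix.

```r
x <- read.csv(text = "id,item,volume
a,c1,2
a,c2,3
a,c3,2
a,c4,1
a,c5,4
b,c6,6
b,c1,2
b,c3,1
b,c2,6
b,c4,4
c,c2,5
c,c8,6
c,c9,2
d,c1,1
e,c3,7
e,c2,3
e,c1,2
e,c9,5
e,c4,1
f,c1,7
f,c3,1")

# Rows are customers, columns are items; 1 if the customer bought the item.
incidence <- (unclass(table(x$id, x$item)) > 0) + 0

# Item-by-item matrix: entry [i, j] is the number of customers who bought both.
cooc <- crossprod(incidence)
```

The diagonal holds the number of customers per item; it can be zeroed with `diag(cooc) <- 0` if only cross-item counts are wanted.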

Apply common function to all data frames and return data frames with same name

I'm trying to apply a function to all similarly named data frames in my global environment in R. I want to apply this function to all of these data frames, but I can't figure out how to do it without specifying them one by one. I want to return each data frame to the global environment under the same name it had before.

    mtcars_test = mtcars
    iris_test = iris
    #....etc......could be 2 of them or 88 of them...but they will all end in "_test"
    # figure out what data frames I am working with
    list_of_my_dfs = lapply(ls(pattern = "_test$"), get)
    # my function just multiplies everything by 2
    mytest_function
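A minimal sketch of one way to keep the names attached: fetch the objects with `mget()`, which returns a named list, transform them with `lapply()`, and write them back with `list2env()`. Two assumptions are made here. The example works in a separate environment `env` rather than the global environment so that it has no side effects, and `double_numeric` is a hypothetical version of the doubling function that leaves non-numeric columns (such as `Species` in iris) untouched.

```r
# Assumption: a scratch environment stands in for the global environment.
env <- new.env()
assign("mtcars_test", mtcars, envir = env)
assign("iris_test", iris, envir = env)

# Hypothetical function: doubles numeric columns, leaves the rest alone.
double_numeric <- function(d) {
  num <- vapply(d, is.numeric, logical(1))
  d[num] <- d[num] * 2
  d
}

# Named list of every object whose name ends in "_test".
df_names <- ls(env, pattern = "_test$")
dfs <- mget(df_names, envir = env)

# Apply the function and write each result back under its original name.
list2env(lapply(dfs, double_numeric), envir = env)
```

To operate on the global environment instead, `env` would be replaced by `.GlobalEnv` in the `ls()`, `mget()`, and `list2env()` calls.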