plyr

What does the dot mean in R – personal preference, naming convention or more?

混江龙づ霸主 posted on 2019-11-26 19:35:39
I am (probably) NOT referring to the "all other variables" meaning, as in var1~. here. I was pointed to plyr once again, looked into mlply, and wondered why its parameters are defined with a leading dot, like this:
    function (.data, .fun = NULL, ..., .expand = TRUE, .progress = "none",
        .parallel = FALSE)
    {
        if (is.matrix(.data) & !is.list(.data))
            .data <- .matrix_to_df(.data)
        f <- splat(.fun)
        alply(.data = .data, .margins = 1, .fun = f, ..., .expand = .expand,
            .progress = .progress, .parallel = .parallel)
    }
    <environment: namespace:plyr>
What's the use of that? Is it just personal preference, naming

R - Faster Way to Calculate Rolling Statistics Over a Variable Interval

跟風遠走 posted on 2019-11-26 18:14:15
Question: I'm curious if anyone out there can come up with a (faster) way to calculate rolling statistics (rolling mean, median, percentiles, etc.) over a variable interval of time (windowing). That is, suppose one is given randomly timed observations (i.e. not daily or weekly data; the observations just have a time stamp, as in tick data), and suppose you'd like to look at center and dispersion statistics that you are able to widen and tighten the interval of time over which these statistics are

meaning of ddply error: 'names' attribute [9] must be the same length as the vector [1]

坚强是说给别人听的谎言 posted on 2019-11-26 17:49:20
I'm going through Machine Learning for Hackers, and I am stuck at this line:
    from.weight <- ddply(priority.train, .(From.EMail), summarise, Freq = length(Subject))
which generates the following error:
    Error in attributes(out) <- attributes(col) :
      'names' attribute [9] must be the same length as the vector [1]
This is the traceback():
    > traceback()
    11: FUN(1:5[[1L]], ...)
    10: lapply(seq_len(n), extract_col_rows, df = x, i = i)
    9: extract_rows(x$data, x$index[[i]])
    8: `[[.indexed_df`(pieces, i)
    7: pieces[[i]]
    6: function (i) { piece <- pieces[[i]] if (.inform) { res <- try(.fun(piece, ...)) if

How to replace NA with mean by subset in R (impute with plyr?)

南笙酒味 posted on 2019-11-26 17:21:14
I have a dataframe with the lengths and widths of various arthropods from the guts of salamanders. Because some guts had thousands of certain prey items, I only measured a subset of each prey type. I now want to replace each unmeasured individual with the mean length and width for that prey. I want to keep the dataframe and just add imputed columns (length2, width2). The main reason is that each row also has columns with data on the date and location the salamander was collected. I could fill in the NA with a random selection of the measured individuals but for the sake of argument let's
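One way to approach the imputation described above, sketched in base R rather than plyr. The data frame `prey` and its columns `type`, `length`, and `width` are hypothetical stand-ins, since the question's actual data is not shown; `impute_mean` is likewise an illustrative helper, not something from the original post.

```r
# Hypothetical stand-in data: column names are assumptions, not the asker's.
prey <- data.frame(
  type   = c("ant", "ant", "ant", "mite", "mite"),
  length = c(2.0, NA, 4.0, 0.5, NA),
  width  = c(1.0, 0.8, NA, NA, 0.2)
)

# Replace each NA with the mean of the non-missing values in the same vector.
impute_mean <- function(x) replace(x, is.na(x), mean(x, na.rm = TRUE))

# ave() applies the helper within each prey type and returns a vector
# aligned with the original rows, so the other columns are left untouched.
prey$length2 <- ave(prey$length, prey$type, FUN = impute_mean)
prey$width2  <- ave(prey$width,  prey$type, FUN = impute_mean)
```

Because the imputed values go into new columns (`length2`, `width2`), the original measurements and any date or location columns stay as they were.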

Summarizing by subgroup percentage in R

旧街凉风 posted on 2019-11-26 16:43:28
Question: I have a dataset like this:
    df = data.frame(group = c(rep('A',4), rep('B',3)), subgroup = c('a', 'b', 'c', 'd', 'a', 'b', 'c'), value = c(1,4,2,1,1,2,3))
    group | subgroup | value
    ------------------------
    A     | a        | 1
    A     | b        | 4
    A     | c        | 2
    A     | d        | 1
    B     | a        | 1
    B     | b        | 2
    B     | c        | 3
What I want is to get the percentage of the values of each subgroup within each group, i.e. the output should be:
    group | subgroup | percent
    ------------------------
    A     | a        | 0.125
    A     | b        | 0.500
    A     | c        | 0.250
    A     | d        | 0.125
    B
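A minimal sketch of the within-group share, using the question's own `df` and base R's `ave()`. This is one possible approach, not necessarily the answer given in the original thread.

```r
df <- data.frame(group    = c(rep("A", 4), rep("B", 3)),
                 subgroup = c("a", "b", "c", "d", "a", "b", "c"),
                 value    = c(1, 4, 2, 1, 1, 2, 3))

# Divide each value by the total of its own group; ave() keeps row order.
df$percent <- ave(df$value, df$group, FUN = function(v) v / sum(v))
```

The shares within each group sum to 1, so group A's rows come out as 1/8, 4/8, 2/8, and 1/8, matching the desired output in the question.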

How to strsplit different number of strings in certain column by do function

血红的双手。 posted on 2019-11-26 16:39:29
I have a problem splitting a column's values when its elements contain different numbers of strings. I can do it in plyr, e.g.:
    library(plyr)
    column <- c("jake", "jane jane","john john john")
    df <- data.frame(1:3, name = column)
    df$name <- as.character(df$name)
    df2 <- ldply(strsplit(df$name, " "), rbind)
    View(df2)
As a result, we get a data frame whose number of columns equals the maximum number of strings in any element. When I tried to do it in dplyr, I used the do function:
    library(dplyr)
    df2 <- df %>% do(data.frame(strsplit(.$name, " ")))
but I get an error:
    Error in data.frame("jake", c("jane", "jane"
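For comparison, a base R sketch that produces the same ragged-to-rectangular result without plyr or dplyr. It pads every split to the longest length with NA before binding; the names `parts`, `width`, and `padded` are illustrative only.

```r
column <- c("jake", "jane jane", "john john john")

# Split each element on spaces; the pieces have lengths 1, 2, and 3.
parts <- strsplit(column, " ")

# Pad each piece with NA up to the longest one so that rbind() gets
# equal-length rows instead of recycling values.
width  <- max(lengths(parts))
padded <- lapply(parts, function(p) c(p, rep(NA_character_, width - length(p))))

df2 <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
```

The error in the dplyr attempt is consistent with `data.frame()` being handed three vectors of different lengths as columns, which is why padding (or binding row-wise, as `ldply(..., rbind)` does) is needed.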

How to get top n companies from a data frame in decreasing order

删除回忆录丶 posted on 2019-11-26 16:36:54
Question: I am trying to get the top 'n' companies from a data frame. Here is my code:
    data("Forbes2000", package = "HSAUR")
    sort(Forbes2000$profits,decreasing=TRUE)
Now I would like to get the top 50 observations from this sorted vector.
Answer 1: head and tail are really useful functions!
    head(sort(Forbes2000$profits,decreasing=TRUE), n = 50)
If you want the first 50 rows of the data.frame, then you can use the arrange function from plyr to sort the data.frame and then use head:
    library(plyr)
    head
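The row-wise variant can also be sketched in base R with `order()`, shown here on a small made-up data frame so that it runs without the HSAUR package. The data frame `companies` and the helper `top_rows` are hypothetical, and this is an alternative to the plyr `arrange` approach the answer mentions, not a reconstruction of it.

```r
# Made-up example data; Forbes2000 itself requires the HSAUR package.
companies <- data.frame(name    = c("A", "B", "C", "D"),
                        profits = c(1.2, NA, 5.0, 3.1),
                        stringsAsFactors = FALSE)

# Sort rows by one column in decreasing order and keep the first n.
# order() places NA values last by default, so they never reach the top.
top_rows <- function(d, col, n) {
  head(d[order(d[[col]], decreasing = TRUE), ], n)
}

top2 <- top_rows(companies, "profits", 2)
```

With the real data, the equivalent call would be `top_rows(Forbes2000, "profits", 50)`.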

Is there a R function that applies a function to each pair of columns?

雨燕双飞 posted on 2019-11-26 16:22:39
I often need to apply a function to each pair of columns in a dataframe/matrix and return the results in a matrix. At the moment I always write a loop to do this. For instance, to make a matrix containing the p-values of correlations I write:
    df <- data.frame(x=rnorm(100),y=rnorm(100),z=rnorm(100))
    n <- ncol(df)
    foo <- matrix(0,n,n)
    for ( i in 1:n) {
      for (j in i:n) {
        foo[i,j] <- cor.test(df[,i],df[,j])$p.value
      }
    }
    foo[lower.tri(foo)] <- t(foo)[lower.tri(foo)]
    foo
              [,1]      [,2]      [,3]
    [1,] 0.0000000 0.7215071 0.5651266
    [2,] 0.7215071 0.0000000 0.9019746
    [3,] 0.5651266 0.9019746 0.0000000
which works, but is
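One loop-free way to express the same computation is `outer()` over column indices with a vectorised wrapper. This is a sketch of one option, not necessarily what the original answers proposed; `pair_p` is a hypothetical helper name. Note that it evaluates every ordered pair, so it does roughly twice the work of the triangular loop above.

```r
set.seed(1)  # fixed seed only so the sketch is reproducible
df <- data.frame(x = rnorm(100), y = rnorm(100), z = rnorm(100))

# p-value of the correlation test between columns i and j.
pair_p <- function(i, j) cor.test(df[[i]], df[[j]])$p.value

# outer() passes whole index vectors, so the scalar helper is wrapped
# in Vectorize() to be called once per (i, j) pair.
idx <- seq_along(df)
p <- outer(idx, idx, Vectorize(pair_p))
dimnames(p) <- list(names(df), names(df))
```

Swapping `cor.test(...)$p.value` for any other function of two columns gives the general "apply to each pair of columns" pattern.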

ddply for sum by group in R

帅比萌擦擦* posted on 2019-11-26 15:59:53
Question: I have a sample dataframe "data" as follows:
    X        Y       Month  Year  income
    2281205  228120  3      2011  1000
    2281212  228121  9      2010  1100
    2281213  228121  12     2010  900
    2281214  228121  3      2011  9000
    2281222  228122  6      2010  1111
    2281223  228122  9      2010  3000
    2281224  228122  12     2010  1889
    2281225  228122  3      2011  778
    2281243  228124  12     2010  1111
    2281244  228124  3      2011  200
    2281282  228128  9      2010  7889
    2281283  228128  12     2010  2900
    2281284  228128  3      2011  3400
    2281302  228130  9      2010  1200
    2281303  228130  12     2010  2000
    2281304  228130  3      2011

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

青春壹個敷衍的年華 posted on 2019-11-26 15:48:24
Question: I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million df's of the same structure, same field names, etc. into a single data.frame. Each data.frame is of one row and 21 columns. The data started out as a JSON file, which I converted to