split-apply-combine

Group by columns, then compute mean and sd of every other column in R

Submitted by 我的梦境 on 2021-02-05 06:10:23
Question: How do I group by columns, then compute the mean and standard deviation of every other column in R? As an example, consider the famous iris data set. I want to group by species, then compute the mean and sd of the petal/sepal length/width measurements. I know this has something to do with split-apply-combine, but I am not sure how to proceed from there. What I can come up with: require(plyr) x <- ddply(iris, .(Species), summarise, Sepal.Length.Mean = mean(Sepal
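One possible answer, sketched with dplyr's `across()` rather than the asker's plyr approach: it applies a named list of functions to every numeric column in one `summarise()` call, so each column gets a `_Mean` and `_sd` variant without writing them out by hand.

```r
library(dplyr)

# Mean and sd of every numeric column, grouped by Species
out <- iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), list(Mean = mean, sd = sd)))

# Produces one row per species, with columns such as
# Sepal.Length_Mean, Sepal.Length_sd, Petal.Width_Mean, ...
```

The default `.names = "{.col}_{.fn}"` spec builds the output column names from the input column and the function-list name.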

Adding rows in `dplyr` output

Submitted by 旧时模样 on 2020-01-01 09:46:09
Question: In traditional plyr, returned rows are added automagically to the output even if they exceed the number of input rows for that grouping: set.seed(1) dat <- data.frame(x=runif(10), g=rep(letters[1:5], each=2)) > ddply( dat, .(g), function(df) df[c(1,1,1,2),] ) x g 1 0.26550866 a 2 0.26550866 a 3 0.26550866 a 4 0.37212390 a 5 0.57285336 b 6 0.57285336 b 7 0.57285336 b 8 0.90820779 b 9 0.20168193 c 10 0.20168193 c 11 0.20168193 c 12 0.89838968 c 13 0.94467527 d 14 0.94467527 d 15 0.94467527 d 16
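A sketch of the dplyr equivalent: `group_modify()` (and, in dplyr >= 1.1, `reframe()`) applies a function per group and, like `ddply()`, keeps however many rows that function returns, even when that exceeds the group's input size.

```r
library(dplyr)

set.seed(1)
dat <- data.frame(x = runif(10), g = rep(letters[1:5], each = 2))

# Each 2-row group returns 4 rows (row 1 three times, then row 2);
# group_modify() keeps all of them, just as ddply() did
out <- dat %>%
  group_by(g) %>%
  group_modify(~ .x[c(1, 1, 1, 2), , drop = FALSE]) %>%
  ungroup()

nrow(out)  # 20: five groups of four rows each
```

Note that `mutate()` cannot do this, since it requires output the same length as the input; row-count-changing operations need `group_modify()`, `reframe()`, or `do()`.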

Use dplyr's group_by to perform split-apply-combine

Submitted by 自闭症网瘾萝莉.ら on 2019-12-29 07:54:08
Question: I am trying to use dplyr to do the following: tapply(iris$Petal.Length, iris$Species, shapiro.test) I want to split the Petal.Lengths by Species and apply a function, in this case shapiro.test. I read this SO question and quite a number of other pages. I am sort of able to split the variable into groups, using do: iris %>% group_by(Species) %>% select(Petal.Length) %>% do(print(.$Petal.Length)) [1] 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 [16] 1.5 1.3 1.4 1.7 1.5 1.7 1.5
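A sketch of two ways to finish this: `do()` can store the full `htest` objects in a list-column, while a plain `summarise()` is enough if only the p-values are needed.

```r
library(dplyr)

# Full shapiro.test result per species, kept in a list-column
res <- iris %>%
  group_by(Species) %>%
  do(test = shapiro.test(.$Petal.Length))

res$test[[1]]  # the complete htest object for the first species

# If only the p-value matters, summarise() is simpler
pvals <- iris %>%
  group_by(Species) %>%
  summarise(p.value = shapiro.test(Petal.Length)$p.value)
```

The list-column route mirrors what `tapply()` returns (a list of `htest` objects); the `summarise()` route gives a tidy one-row-per-species data frame.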

Applying multiple functions to each column in a data frame using aggregate

Submitted by 筅森魡賤 on 2019-12-22 10:34:00
Question: When I need to apply multiple functions to multiple columns, aggregate by multiple grouping columns, and have the results bound into a data frame, I usually use aggregate() in the following manner: # bogus functions foo1 <- function(x){mean(x)*var(x)} foo2 <- function(x){mean(x)/var(x)} # for illustration purposes only npk$block <- as.numeric(npk$block) subdf <- aggregate(npk[,c("yield", "block")], by = list(N = npk$N, P = npk$P), FUN = function(x){c(col1 = foo1(x), col2 = foo2(x))
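One wrinkle worth showing: when `FUN` returns a named vector, `aggregate()` packs the results into matrix columns, which print fine but are awkward to index. A common fix is flattening with `do.call(data.frame, ...)`, sketched here on the asker's own setup (the `npk` dataset ships with base R):

```r
# Bogus summary functions, as in the question
foo1 <- function(x) mean(x) * var(x)
foo2 <- function(x) mean(x) / var(x)

npk2 <- npk
npk2$block <- as.numeric(npk2$block)

agg <- aggregate(npk2[, c("yield", "block")],
                 by = list(N = npk2$N, P = npk2$P),
                 FUN = function(x) c(col1 = foo1(x), col2 = foo2(x)))

# agg$yield and agg$block are 2-column matrices here;
# do.call(data.frame, ...) flattens them into ordinary columns
flat <- do.call(data.frame, agg)
names(flat)  # N, P, yield.col1, yield.col2, block.col1, block.col2
```

After flattening, each statistic lives in its own atomic column and the result behaves like any other data frame.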

Normalizing data by duplication

Submitted by 寵の児 on 2019-12-17 18:50:43
Question: Note: this question is indeed a duplicate of "Split pandas dataframe string entry to separate rows", but the answer provided here is more generic and informative, so with all due respect I chose not to delete the thread. I have a 'dataset' with the following format:

id      | value | ...
--------|-------|------
a       | 156   | ...
b,c     | 457   | ...
e,g,f,h | 346   | ...
...     | ...   | ...

and I would like to normalize it by duplicating all values for each id:

id | value | ...
--------|-------|------
a | 156 | .
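In R, this unnesting is a one-liner with `tidyr::separate_rows()`, sketched here on a small frame mirroring the question's table: each comma-separated `id` becomes its own row, with the other columns duplicated.

```r
library(tidyr)

df <- data.frame(id    = c("a", "b,c", "e,g,f,h"),
                 value = c(156, 457, 346),
                 stringsAsFactors = FALSE)

# One row per comma-separated id; value is copied down for each
long <- separate_rows(df, id, sep = ",")

# "b,c" becomes two rows (both value 457),
# "e,g,f,h" becomes four rows (all value 346)
```

The same pattern generalizes to several delimited columns at once by listing them all in the `separate_rows()` call, provided they split into the same number of pieces per row.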

Combining rows by index in R [duplicate]

Submitted by 时光毁灭记忆、已成空白 on 2019-12-13 07:48:59
Question: This question already has answers here: Combining pivoted rows in R by common value (4 answers). Closed last year. EDIT: I am aware there is a similar question that has been answered, but it does not work for me on the dataset I have provided below. The above dataframe is the result of using the spread function; I am still not sure how to consolidate it. EDIT 2: I realized that the group_by function, which I had previously used on the data, is what was preventing the spread function from
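The asker's dataset is not shown, but the usual symptom after `spread()` is a frame where each key's values are scattered across several rows with NAs filling the gaps. A hedged, generic sketch of the consolidation step (the `wide` frame below is hypothetical, standing in for the missing data):

```r
library(dplyr)

# Hypothetical post-spread() result: one value per row, NAs elsewhere
wide <- data.frame(key = c(1, 1, 2, 2),
                   a   = c("x", NA, "y", NA),
                   b   = c(NA, "p", NA, "q"),
                   stringsAsFactors = FALSE)

# Collapse each key to one row by keeping the non-NA entry per column
collapsed <- wide %>%
  group_by(key) %>%
  summarise(across(everything(), ~ first(na.omit(.x)))) %>%
  ungroup()

# collapsed: key 1 -> a = "x", b = "p"; key 2 -> a = "y", b = "q"
```

This assumes at most one non-NA value per key/column pair; removing the stray grouping (or an id column that differs row to row) before `spread()` avoids the split rows in the first place, as the asker's second edit suggests.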

Generate All ID Pairs, by group with data.table in R

Submitted by China☆狼群 on 2019-12-13 07:26:21
Question: I have a data.table with many individuals (with ids) in many groups. Within each group, I would like to find every combination of ids (every pair of individuals). I know how to do this with a split-apply-combine approach, but I am hoping that a data.table approach would be faster. Sample data: dat <- data.table(ids=1:20, groups=sample(x=c("A","B","C"), 20, replace=TRUE)) Split-Apply-Combine Method: datS <- split(dat, f=dat$groups) datSc <- lapply(datS, function(x){ as.data.table(t(combn(x$ids, 2)))})
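A sketch of the pure data.table version: `combn()` runs inside `j` per group, so no explicit split or lapply is needed. One caveat worth guarding against: `combn(x, 2)` misbehaves when a group has a single id (a lone numeric is treated as `1:n`), so groups with fewer than two members are skipped via `.N`.

```r
library(data.table)

set.seed(1)
dat <- data.table(ids    = 1:20,
                  groups = sample(c("A", "B", "C"), 20, replace = TRUE))

# All within-group pairs; .N >= 2 guards single-member groups,
# where combn() would silently expand a lone id into 1:id
pairs <- dat[, if (.N >= 2) {
  m <- combn(ids, 2)
  .(id1 = m[1, ], id2 = m[2, ])
}, by = groups]
```

Each group of size n contributes choose(n, 2) rows, and grouping, pair generation, and recombination all happen in a single `[` call.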

Group androgynous names and sum amount for each year in a data frame in R

Submitted by 青春壹個敷衍的年華 on 2019-12-13 05:25:03
Question: I have a data frame with 4 columns titled 'year', 'name', 'sex', 'amount'. Here is a sample data set: set.seed(1) data = data.frame(year=sample(1950:2000, 50, replace=TRUE), name=sample(LETTERS, 50, replace=TRUE), sex=sample(c("M", "F"), 50, replace=TRUE), amount=sample(40:100, 50, replace=TRUE)) I want to find only names that occur as both an 'M' and an 'F' and sum the amount for each year. Any help would be greatly appreciated. Answer 1: I changed the data a bit, so that there would be common names in
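A sketch of one reading of the question (sum per year, across all androgynous names): a grouped `filter()` with `n_distinct()` keeps only the names that occur with both sexes, and the survivors are then regrouped by year.

```r
library(dplyr)

set.seed(1)
data <- data.frame(year   = sample(1950:2000, 50, replace = TRUE),
                   name   = sample(LETTERS, 50, replace = TRUE),
                   sex    = sample(c("M", "F"), 50, replace = TRUE),
                   amount = sample(40:100, 50, replace = TRUE))

# Keep rows whose name appears with both sexes, then total per year
both <- data %>%
  group_by(name) %>%
  filter(n_distinct(sex) == 2) %>%
  group_by(year) %>%
  summarise(amount = sum(amount))
```

If the intent is instead a per-name-per-year total, swap the final grouping for `group_by(name, year)`; the filtering step is the same either way.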

How to add totals as well as group_by statistics in R

Submitted by 懵懂的女人 on 2019-12-11 15:16:56
Question: When computing any statistic using summarise and group_by, we only get the summary statistic per category, not the value for the whole population (Total). How can I get both? I am looking for something clean and short. So far I can only think of: bind_rows( iris %>% group_by(Species) %>% summarise( "Mean" = mean(Sepal.Width), "Median" = median(Sepal.Width), "sd" = sd(Sepal.Width), "p10" = quantile(Sepal.Width, probs = 0.1)) , iris %>% summarise( "Mean" = mean(Sepal.Width), "Median" = median
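One way to avoid writing the summarise block twice, sketched below: factor out the statistics into a small helper, then feed it the grouped data and a copy relabelled "Total". (The helper name `stats` is mine, not from the question.)

```r
library(dplyr)

stats <- function(d) {
  summarise(d,
            Mean   = mean(Sepal.Width),
            Median = median(Sepal.Width),
            sd     = sd(Sepal.Width),
            p10    = quantile(Sepal.Width, probs = 0.1))
}

# Character Species so the "Total" label binds cleanly with the levels
ir <- iris %>% mutate(Species = as.character(Species))

out <- bind_rows(
  ir %>% group_by(Species) %>% stats(),
  ir %>% mutate(Species = "Total") %>% group_by(Species) %>% stats()
)
```

The second branch is the whole dataset grouped by a constant, so it yields exactly one "Total" row with the same columns, and the statistic definitions live in a single place.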

Simple moving average on an unbalanced panel in R

Submitted by 女生的网名这么多〃 on 2019-12-07 14:16:30
Question: I am working with an unbalanced, irregularly spaced cross-sectional time series. My goal is to obtain a lagged moving-average vector for the "Quantity" vector, segmented by "Subject". In other words, say the following Quantities have been observed for Subject_1: [1,2,3,4,5]. I first need to lag it by 1, yielding [NA,1,2,3,4]. Then I need to take a moving average of order 3, yielding [NA,NA,NA,(3+2+1)/3,(4+3+2)/3]. The above needs to be done for all Subjects. # Construct example
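A sketch of the lag-then-roll pipeline on the question's own toy series, using `dplyr::lag()` and `zoo::rollapply()` with a right-aligned window so each group is handled independently:

```r
library(dplyr)
library(zoo)

# Toy panel matching the question: one subject, quantities 1..5
panel <- data.frame(Subject = rep("S1", 5), Quantity = 1:5)

out <- panel %>%
  group_by(Subject) %>%
  mutate(lagQ   = lag(Quantity),                       # NA 1 2 3 4
         lagMA3 = rollapply(lagQ, 3, mean,
                            align = "right", fill = NA)) %>%
  ungroup()

# lagMA3 is NA NA NA 2 3, i.e. (1+2+3)/3 and (2+3+4)/3
```

The leading NA from the lag propagates through any window that contains it, which produces exactly the three leading NAs the question expects; with irregular spacing, a time-aware window (e.g. joining on dates rather than row positions) would be needed instead of this row-based one.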