问题
I would like to create a function in R, similar to dplyr
's group_by
function, that when combined with summarise
can give summary statistics for a dataset where group membership is not mutually exclusive. I.e., observations can belong to multiple groups. One way to think about it might be to consider tags; observations may belong to one or more tags which might overlap.
For example, take R's esoph
dataset (https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/esoph.html) documenting a case-control study of esophageal cancer. Suppose I'm interested in the number and proportion of cancer cases overall and per 'tag', where the tags are: 65+ years old; 80+ gm/day alcohol; 20+ gm/day tobacco; and a 'high risk' group where the previous 3 criteria are met.
Let's transform the dataset to long format (one participant per row) and then add these tags (logical columns) to the dataset:
library('dplyr')
data(esoph)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
) %>%
mutate(highage=(agegp %in% c('65-74','75+')),
highalc=(alcgp %in% c('80-119','120+')),
hightob=(tobgp %in% c('20-29','30+')),
highrisk=(highage & highalc & hightob)
)
My usual approach is to create a dataset where each observation is duplicated for every tag it belongs to, and then summarise
this dataset:
esophdup = bind_rows(esophlong %>% filter(highage) %>% mutate(tag='age>=65'),
esophlong %>% filter(highalc) %>% mutate(tag='alc>=80'),
esophlong %>% filter(hightob) %>% mutate(tag='tob>=20'),
esophlong %>% filter(highrisk) %>% mutate(tag='high risk'),
esophlong %>% filter() %>% mutate(tag='all')
) %>%
mutate(tag=factor(tag, levels = unique(.$tag)))
summary = esophdup %>%
group_by(tag) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case))
This approach is inefficient for large datasets or for a large number of tags and I will often run out of memory to store it.
An alternative is to summarise
each tag separately and then bind these summary datasets afterwards, as follows:
summary.age = esophlong %>%
filter(highage) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='age>=65')
summary.alc = esophlong %>%
filter(highalc) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='alc>=80')
summary.tob = esophlong %>%
filter(hightob) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='tob>=20')
summary.highrisk = esophlong %>%
filter(highrisk) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='high risk')
summary.all = esophlong %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case)) %>%
mutate(tag='all')
summary=bind_rows(summary.age,summary.alc,summary.tob,summary.highrisk,summary.all)
This approach is time-consuming and tedious when I have a large number of tags or I want to reuse the tags often for different summary measures throughout a project.
The function I have in mind, say group_by_tags(data, key, ...)
, which includes an argument to specify the name of the grouping column, should work something like this:
summary = esophlong %>%
group_by_tags(key='tags',
'age>=65'=highage,
'alc>=80'=highalc,
'tob>=20'=hightob,
'high risk'=highrisk,
'all ages'=1
) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case))
with the summary dataset looking like this:
> summary
tags n ncases case.rate
1 age>=65 273 68 0.2490842
2 alc>=80 301 96 0.3189369
3 tob>=20 278 64 0.2302158
4 high risk 11 5 0.4545455
5 all 1175 200 0.1702128
Even better, it could take variables of type "factor" as well as "logical" so that it could summarise, say, each age group individually, the 65+ year olds, and everybody:
summaryage = esophlong %>%
group_by_tags(key='Age.group',
agegp,
'65+'=(agegp %in% c('65-74','75+')),
'all'=1
) %>%
summarise(n=n(), ncases=sum(case), case.rate=mean(case))
>summaryage
Age.group n ncases case.rate
1 25-34 117 1 0.0085470
2 35-44 208 9 0.0432692
3 45-54 259 46 0.1776062
4 55-64 318 76 0.2389937
5 65-74 216 55 0.2546296
6 75+ 57 13 0.2280702
7 65+ 273 68 0.2490842
8 all 1175 200 0.1702128
Perhaps it's not possible with ...
and instead you might need to pass a vector/list of column names for the tags.
Any ideas?
EDIT: to be clear, the solution should take tag/group definitions and the required summary statistics as arguments, rather than being built into the function itself. Either as a two-step data %>% group_by_tags(tags) %>% summarise_tags(stats)
or a one-step data %>% summary_tags(tags,stats)
process.
回答1:
This is a variation on @eddi's answer. I am taking the definitions of highage
et al as part of the function's job:
library(data.table)
custom_summary = function(DT, tags, stats){
setDT(DT)
rows = stack(lapply(tags[-1], function(x) DT[eval(x), which=TRUE]))
DT[rows$values, eval(stats), by=.(tag = rows$ind)]
}
And some example usage:
data(esoph)
library(dplyr)
esophlong = bind_rows(esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
)
custom_summary(
DT = esophlong,
tags = quote(list(
'age>=65' = agegp %in% c('65-74','75+'),
'alc>=80' = alcgp %in% c('80-119','120+'),
'tob>=20' = tobgp %in% c('20-29','30+'),
'high risk' = eval(substitute(`age>=65` & `alc>=80` & `tob>=20`, as.list(tags))),
'all ages' = TRUE
)),
stats = quote(list(
n = .N,
n_cases = sum(case),
case.rate = mean(case)
))
)
tag n n_cases case.rate
1: age>=65 273 68 0.2490842
2: alc>=80 301 96 0.3189369
3: tob>=20 278 64 0.2302158
4: high risk 11 5 0.4545455
5: all ages 1175 200 0.1702128
The technique of using eval
inside DT[...]
is explained in the data.table FAQ.
回答2:
Not a completely functional answer, more "WIP" or start for discourse. This should ultimately go in a repo and either an additional package or a PR for dplyr.
One way is to mimic the attributes' structure from a "normally" grouped variable:
library(dplyr)
esoph %>% group_by(agegp, alcgp) %>% attributes %>% str
# List of 9
# $ names : chr [1:5] "agegp" "alcgp" "tobgp" "ncases" ...
# $ row.names : int [1:88] 1 2 3 4 5 6 7 8 9 10 ...
# $ class : chr [1:4] "grouped_df" "tbl_df" "tbl" "data.frame"
# $ vars :List of 2
# ..$ : symbol agegp
# ..$ : symbol alcgp
# $ drop : logi TRUE
# $ indices :List of 24
# ..$ : int [1:4] 0 1 2 3
# ..$ : int [1:4] 4 5 6 7
# ..$ : int [1:3] 8 9 10
# ...........
# $ group_sizes : int [1:24] 4 4 3 4 4 4 4 3 4 4 ...
# $ biggest_group_size: int 4
# $ labels :'data.frame': 24 obs. of 2 variables:
# ..$ agegp: Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 2 2 2 2 3 3 ...
# ..$ alcgp: Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 2 3 4 1 2 3 4 1 2 ...
# ..- attr(*, "vars")=List of 2
# .. ..$ : symbol agegp
# .. ..$ : symbol alcgp
# ..- attr(*, "drop")= logi TRUE
We can reproduce this artificially to see if/how it might work:
esoph2 <- esoph
syms <- list(as.symbol("agegp65"), as.symbol("alcgp80"))
attr(esoph2, "vars") <- syms
attr(esoph2, "drop") <- TRUE
# 'agegp' and 'aclgp' are ordered factors, for simplicity here just using ints
# `group_by` indices are 0-based
indices <- list(
which(as.integer(esoph2$agegp) >= 5) - 1,
which(as.integer(esoph2$alcgp) >= 3) - 1
)
attr(esoph2, "indices") <- indices
attr(esoph2, "group_sizes") <- lengths(indices)
attr(esoph2, "biggest_group_size") <- max(lengths(indices))
df <- data.frame(agegp65 = "agegp >= 65", alcgp80 = "alcgp >= 80", stringsAsFactors = FALSE)
attr(df, "vars") <- syms
attr(esoph2, "labels") <- df
class(esoph2) <- c("grouped_df", "tbl_df", "tbl", "data.frame")
Which "looks" like a normal grouped data.frame:
str(esoph2)
# Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 88 obs. of 5 variables:
# $ agegp : Ord.factor w/ 6 levels "25-34"<"35-44"<..: 1 1 1 1 1 1 1 1 1 1 ...
# $ alcgp : Ord.factor w/ 4 levels "0-39g/day"<"40-79"<..: 1 1 1 1 2 2 2 2 3 3 ...
# $ tobgp : Ord.factor w/ 4 levels "0-9g/day"<"10-19"<..: 1 2 3 4 1 2 3 4 1 2 ...
# $ ncases : num 0 0 0 0 0 0 0 0 0 0 ...
# $ ncontrols: num 40 10 6 5 27 7 4 7 2 1 ...
# - attr(*, "vars")=List of 2
# ..$ : symbol agegp65
# ..$ : symbol alcgp80
# - attr(*, "drop")= logi TRUE
# - attr(*, "indices")=List of 2
# ..$ : num 62 63 64 65 66 67 68 69 70 71 ...
# ..$ : num 8 9 10 11 12 13 14 23 24 25 ...
# - attr(*, "group_sizes")= int 26 42
# - attr(*, "biggest_group_size")= int 42
# - attr(*, "labels")='data.frame': 1 obs. of 2 variables:
# ..$ agegp65: chr "agegp >= 65"
# ..$ alcgp80: chr "alcgp >= 80"
# ..- attr(*, "vars")=List of 2
# .. ..$ : symbol agegp65
# .. ..$ : symbol alcgp80
esoph2
# Source: local data frame [88 x 5]
# Groups: agegp65, alcgp80 [2]
# agegp alcgp tobgp ncases ncontrols
# <ord> <ord> <ord> <dbl> <dbl>
# 1 25-34 0-39g/day 0-9g/day 0 40
# 2 25-34 0-39g/day 10-19 0 10
# 3 25-34 0-39g/day 20-29 0 6
# 4 25-34 0-39g/day 30+ 0 5
# 5 25-34 40-79 0-9g/day 0 27
# 6 25-34 40-79 10-19 0 7
# 7 25-34 40-79 20-29 0 4
# 8 25-34 40-79 30+ 0 7
# 9 25-34 80-119 0-9g/day 0 2
# 10 25-34 80-119 10-19 0 1
# # ... with 78 more rows
Unfortunately:
esoph2 %>% summarize(n = n())
# Error: corrupt 'grouped_df', contains 88 rows, and 68 rows in groups
Ergo my comment that summarize
assumes full coverage; you'd have to modify dplyr_summarise_impl (in C++), perhaps making a third option to summarise_grouped and summarise_not_grouped.
回答3:
library(data.table)
setDT(esophlong)
special.summary = function(dt, vars) {
rbindlist(lapply(seq_along(vars), function(i) {
var = vars[[i]]
if (is.logical(dt[, eval(var)])) {
dt[eval(var) == TRUE, .(.N, sum(case), mean(case))][, tag := names(vars)[i]][
, .SD, by = tag] # last step is a lazy version of setcolorder
} else {
dt[, .(.N, sum(case), mean(case)), by = .(tag = eval(var))]
}
}))
}
special.summary(esophlong, list('age>=65'=quote(highage),
'alc>=80'=quote(highalc),
'tob>=20'=quote(hightob),
'high risk'=quote(highrisk),
'all'=quote(TRUE)))
# tag N V2 V3
#1: age>=65 273 68 0.2490842
#2: alc>=80 301 96 0.3189369
#3: tob>=20 278 64 0.2302158
#4: high risk 11 5 0.4545455
#5: all 1175 200 0.1702128
special.summary(esophlong, list(quote(agegp),
'65+'=quote(agegp %in% c('65-74','75+')),
'all'=quote(TRUE)))
# tag N V2 V3
#1: 25-34 117 1 0.008547009
#2: 35-44 208 9 0.043269231
#3: 45-54 259 46 0.177606178
#4: 55-64 318 76 0.238993711
#5: 65-74 216 55 0.254629630
#6: 75+ 57 13 0.228070175
#7: 65+ 273 68 0.249084249
#8: all 1175 200 0.170212766
This can be made more customizable of course, and that's left as an exercise to the reader.
回答4:
In the absence of any knowledge of tidyverse internals, I avoided trying to create group_by()
-type function whose output should be passed to summarise()
and instead made a single function that combines both (similar to other answers but, I hope, more user-friendly and generalisable).
Since group_by() %>% summarise()
returns the joint summary information for each nested combination of grouping variables, I chose the name summarise_marginal()
since it will return the marginal summary information for each grouping variable independently.
Solution that doesn't work with grouped_df
objects
First, a solution that doesn't work with grouped_df
classes, but is extended below:
summarise_marginal0 <- function(.tbl, .vars, ..., .removeF=FALSE){
dots <- quos(...)
.tbl %>%
transmute(!!! .vars) %>%
map_dfr(
~ summarise(group_by(.tbl, 'value'=., add = TRUE), !!! dots) %>% # piping .tbl %>% group_by() %>% summarise() evaluates in the wrong order for some reason
filter_at(vars('value'), all_vars(!(.==FALSE & .removeF))) %>% # to remove rows where a logical group is FALSE.
mutate_at(vars('value'), as.character) # standardises 'value' column in case map_dfr tries to convert logical to factor
, .id='group'
)
}
mtcars %>%
summarise_marginal0(
vars(cyl, am),
meanmpg = mean(mpg),
meanwt = mean(wt)
)
#> # A tibble: 5 x 4
#> group value meanmpg meanwt
#> <chr> <chr> <dbl> <dbl>
#> 1 cyl 4 26.66364 2.285727
#> 2 cyl 6 19.74286 3.117143
#> 3 cyl 8 15.10000 3.999214
#> 4 am 0 17.14737 3.768895
#> 5 am 1 24.39231 2.411000
Capturing groups with vars()
(like with summarise_at()
or mutate_at()
) neatly separates the groups from the summary functions, and allows new groups to be created on-the-fly:
mtcars %>%
summarise_marginal0(
vars(cyl, hp_lt100 = hp<100),
meanmpg = mean(mpg),
meanwt = mean(wt)
)
#> # A tibble: 5 x 4
#> group value meanmpg meanwt
#> <chr> <chr> <dbl> <dbl>
#> 1 cyl 4 26.66364 2.285727
#> 2 cyl 6 19.74286 3.117143
#> 3 cyl 8 15.10000 3.999214
#> 4 hp_lt100 FALSE 17.45217 3.569652
#> 5 hp_lt100 TRUE 26.83333 2.316667
We can use the .removeF
argument to remove FALSE
logical values. Useful if you want to summarise certain rows but not their compliment:
mtcars %>%
summarise_marginal0(
vars(cyl==6, hp_lt100 = hp<100, hp_lt200 = hp<200),
meanmpg = mean(mpg),
meanwt = mean(wt),
.removeF = TRUE
)
#> # A tibble: 3 x 4
#> group value meanmpg meanwt
#> <chr> <chr> <dbl> <dbl>
#> 1 cyl == 6 TRUE 19.74286 3.117143
#> 2 hp_lt100 TRUE 26.83333 2.316667
#> 3 hp_lt200 TRUE 21.96000 2.911320
Notice the how even without explicitly naming the cyl == 6
group we still get a useful name for it.
Solution that does work with grouped_df
objects
summarise_marginal0()
can be extended to work with grouped_df
objects returned by group_by()
:
summarise_marginal <- function(.tbl, .vars, ...){
dots <- quos(...)
.tbl %>%
nest() %>%
mutate(
summarised = map(data, ~summarise_marginal0(., .vars, !!! dots))
) %>%
unnest(summarised) %>%
purrrlyr::slice_rows(group_vars(.tbl))
}
mtcars %>%
group_by(am) %>%
summarise_marginal(
vars(cyl, hp_lt100 = hp<100),
meanmpg = mean(mpg),
meanwt = mean(wt)
)
#> # A tibble: 10 x 5
#> # Groups: am [2]
#> am group value meanmpg meanwt
#> <dbl> <chr> <chr> <dbl> <dbl>
#> 1 1 cyl 4 28.07500 2.042250
#> 2 1 cyl 6 20.56667 2.755000
#> 3 1 cyl 8 15.40000 3.370000
#> 4 1 hp_lt100 FALSE 20.61429 2.756857
#> 5 1 hp_lt100 TRUE 28.80000 2.007500
#> 6 0 cyl 4 22.90000 2.935000
#> 7 0 cyl 6 19.12500 3.388750
#> 8 0 cyl 8 15.05000 4.104083
#> 9 0 hp_lt100 FALSE 16.06875 3.925250
#> 10 0 hp_lt100 TRUE 22.90000 2.935000
In fact, summarise_marginal()
will work for both grouped and ungrouped data.frame
s so this function alone is suitable.
This is a useful solution, but given that group_by()
has uses beyond summarise()
, for example with nest()
or do()
, I think the idea of a group_by_marginal()
(or group_by_tag()
or whatever name is best) is worth pursuing.
Some remaining issues:
The function needs to convert integer, factor and logical columns to character so their values all sit nicely in the same
values
column. This is a slight violation of tidy data principles, though is no different to howgather()
behaves.Assuming a
group_by_marginal()
function is possible, it's output cannot be passed tomutate()
without resolving the ambiguity of where to place the values from each group. From the example above, which value ofmeanmpg
should be given to a row withcyl==4
andam==0
? Both26.66364
(fromcyl==4
) and17.14737
(fromam==0
) are relevant. (Note there is no ambiguity forgroup_by() %>% mutate()
since it will return the joint summary function forcyl==4 & am==0
). Three possible options forgroup_by_marginal() %>% mutate()
:- It should be forbidden.
- It should create multiple columns, eg,
meanmpg_cyl
andmeanmpg_am
. - It should replicate rows for each group.
- Speed. I'm certain my implementation of this concept is inefficient and could be improved.
Finally, to demonstrate on the original example problem:
bind_rows(
esoph %>% .[rep(seq_len(nrow(.)), .$ncases), 1:3] %>% mutate(case=1),
esoph %>% .[rep(seq_len(nrow(.)), .$ncontrols), 1:3] %>% mutate(case=0)
) %>%
summarise_marginal(
vars(highage = agegp %in% c('65-74','75+'),
highalc = alcgp %in% c('80-119','120+'),
hightob = tobgp %in% c('20-29','30+'),
highrisk = highage & highalc & hightob,
all = 1),
n=length(agegp),
ncases=sum(case),
case.rate=mean(case),
.removeF=TRUE
)
#> # A tibble: 5 x 5
#> group value n ncases case.rate
#> <chr> <chr> <int> <dbl> <dbl>
#> 1 highage TRUE 273 68 0.2490842
#> 2 highalc TRUE 301 96 0.3189369
#> 3 hightob TRUE 278 64 0.2302158
#> 4 highrisk TRUE 11 5 0.4545455
#> 5 all 1 1175 200 0.1702128
回答5:
Here's a (mostly) dplyr
version:
Given the columns the OP created, the tags could be:
tags = list('age>=65'="highage",
'alc>=80'="highalc",
'tob>=20'="hightob",
'high risk'="highrisk",
'all'=TRUE)
But better to create the filtering expressions from the original data as @Frank did:
tags1 = list(
'age>=65' = ~agegp %in% c('65-74','75+'),
'alc>=80' = ~alcgp %in% c('80-119','120+'),
'tob>=20' = ~tobgp %in% c('20-29','30+'),
'high risk' = ~agegp %in% c('65-74','75+') & alcgp %in% c('80-119','120+') & tobgp %in% c('20-29','30+'),
'all ages' = TRUE
)
Then create a function that uses lapply
to run the dplyr
summary on each row of tags1
:
my_summary = function(dat, groups) {
bind_rows(lapply(1:length(groups), function(i) {
dat %>% filter_(groups[[i]]) %>%
summarise(tag=names(groups)[i],
n=n(),
ncases=sum(case),
case.rate=mean(case))
}))
}
my_summary(esophlong, tags1)
tag n ncases case.rate 1 age>=65 273 68 0.2490842 2 alc>=80 301 96 0.3189369 3 tob>=20 278 64 0.2302158 4 high risk 11 5 0.4545455 5 all 1175 200 0.1702128
I was hoping to create a simpler way to generate the filtering expressions, but I'm still a bit mystified about how to create complex expressions to use in standard-evaluation versions of dplyr
functions.
For example, I'd be interested in how to use an approach something like the one below. The filt
function is intended to create the filtering expression, but the returned expression needs to be unquoted and have a ~
in front of it for filter_
to interpret it properly. Or maybe there's some gyration with interp
that's required. Anyway, I'd be interested in how to make this work (or suggestions for a better way) and also how to create a filter with multiple conditions (as in the 'high risk' filter) by combining the individual filters:
# Create a filtering expression
filt = function(var, cutoff) {
paste("as.numeric(gsub('([0-9]{1,3})[-+].*','\\1',", var, ")) >= ", cutoff)
}
# Run the summary function with three different filters plus "all"
my_summary(esophlong, c(mapply(filt, c("agegp","alcgp","tobgp"), c(65,80,20)), 'all'=TRUE))
来源:https://stackoverflow.com/questions/39106850/function-similar-to-group-by-when-groups-are-not-mutually-exlcusive