Question
As an example of my data, I have GROUP 1 with three rows of data, and GROUP 2 with two rows of data, in a data frame:
GROUP  VARIABLE 1  VARIABLE 2  VARIABLE 3
1      2           6           5
1      4           NA          1
1      NA          3           8
2      1           NA          2
2      9           NA          NA
I would like to sample a single value per column from GROUP 1, to make a new row representing GROUP 1. I do not want to sample one single, complete row from GROUP 1; rather, the sampling needs to occur independently for each column. I would like to do the same for GROUP 2. Also, the sampling should not include NAs, unless all rows for that group's variable are NA (such as GROUP 2, VARIABLE 2, above).
For example, after sampling, I could have as a result:
GROUP  VARIABLE 1  VARIABLE 2  VARIABLE 3
1      4           6           1
2      9           NA          2
Only GROUP 2, VARIABLE 2, can result in NA here. I actually have 39 groups, 50,000+ variables, and a substantial number of NAs. I would sincerely appreciate the code to make a new data frame of rows, each row having the sampling results per group.
Answer 1:
We can use data.table. Convert the 'data.frame' to a 'data.table' (setDT(df1)); grouped by 'GROUP', we loop through the columns (lapply(.SD, ...)). If all of the elements are NA we return NA; otherwise we sample one of the non-NA elements.
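library(data.table)
# One draw per column within each group; NA only when the whole column is NA.
setDT(df1)[, lapply(.SD, function(x) {
  v <- x[!is.na(x)]  # sample(v, 1) would draw from 1:v when v is a single number
  if (length(v) == 0L) x[1L] else v[sample.int(length(v), 1L)]
}), by = GROUP]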
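One caveat with sample() is worth illustrating here: when its first argument is a single number n of at least 1, sample(n, 1) draws from 1:n rather than returning n. That case arises whenever a group has exactly one non-NA value in a column (GROUP 2, VARIABLE 3 in the example). The following is a minimal base R sketch of a safer draw; the helper name sample_one is hypothetical and not part of the original answer.

```r
# Hypothetical helper (not from the original answer): one draw from the non-NA values of x.
sample_one <- function(x) {
  v <- x[!is.na(x)]
  if (length(v) == 0L) return(x[1L])  # all NA: x[1L] is an NA of x's own type
  v[sample.int(length(v), 1L)]        # draw an index, so a lone value is returned as-is
}

set.seed(1)  # only to make the illustration reproducible
safe  <- replicate(200, sample_one(c(NA, 9)))
naive <- replicate(200, sample(na.omit(c(NA, 9)), 1))

range(safe)   # always 9, the only non-NA value
range(naive)  # may contain any integer from 1 to 9
```

Assuming the helper is defined as above, it could be passed directly in the grouped call, for example lapply(.SD, sample_one).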
Answer 2:
To ignore NAs, pass one more argument, na.rm = TRUE, to the summary function used inside summarise (here max and min). It will ignore all the NAs.
I used dplyr to perform the requested grouping, but you can use base functions also. dplyr is easy to use and read.
Below is the code. If the summary function is the same for all columns, you can use summarise_each (superseded by across() in current dplyr) and do the grouping in one go.
library(dplyr)
df <- df %>%
  group_by(Group) %>%
  summarise(Var_1 = max(Var_1, na.rm = TRUE),
            Var_2 = max(Var_2, na.rm = TRUE),
            Var_3 = min(Var_3, na.rm = TRUE))
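Since base functions are mentioned as an alternative, here is a minimal base R sketch of the per-column sampling the question asks for, using aggregate(). It assumes the example data rebuilt with the hypothetical column names V1, V2 and V3 (the original headers contain spaces) and numeric columns; it is an illustration, not the answerer's code.

```r
# Example data from the question, with hypothetical column names V1..V3.
df <- data.frame(
  GROUP = c(1, 1, 1, 2, 2),
  V1    = c(2, 4, NA, 1, 9),
  V2    = c(6, NA, 3, NA, NA),
  V3    = c(5, 1, 8, 2, NA)
)

set.seed(42)  # only for reproducibility of the illustration

# aggregate() applies FUN to every non-grouping column within each GROUP.
result <- aggregate(
  df[-1],
  by  = list(GROUP = df$GROUP),
  FUN = function(x) {
    v <- x[!is.na(x)]
    # NA only when the group has no non-NA value in this column;
    # index-based draw avoids sample() treating a lone number n as 1:n.
    if (length(v) == 0L) NA_real_ else v[sample.int(length(v), 1L)]
  }
)

result  # one row per group; V2 for GROUP 2 is NA
```

Each cell is drawn independently, so a result row generally does not correspond to any single input row. With tens of thousands of columns this approach may be slow, and a grouped data.table call is likely the better fit.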
Source: https://stackoverflow.com/questions/34389797/sample-a-single-row-per-column-within-a-subset-of-a-data-frame-in-r-while-fol