Sample a single row, per column, within a subset of a data frame in R, while following conditions

Submitted by 社会主义新天地 on 2020-01-07 03:03:21

Question


As an example of my data, I have GROUP 1 with three rows of data, and GROUP 2 with two rows of data, in a data frame:

GROUP   VARIABLE 1   VARIABLE 2   VARIABLE 3 
    1            2            6            5 
    1            4           NA            1 
    1           NA            3            8
    2            1           NA            2      
    2            9           NA           NA 

I would like to sample a single value per column from GROUP 1 to make a new row representing GROUP 1. I do not want to sample one complete row from GROUP 1; the sampling needs to occur independently for each column. I would like to do the same for GROUP 2. Also, the sampling should not include NAs, unless all rows for that group's variable are NA (such as GROUP 2, VARIABLE 2, above).

For example, after sampling, I could have as a result:

GROUP   VARIABLE 1   VARIABLE 2   VARIABLE 3 
    1            4            6            1 
    2            9           NA            2 

Only GROUP 2, VARIABLE 2 can result in NA here. I actually have 39 groups, 50,000+ variables, and a substantial number of NAs. I would sincerely appreciate code that makes a new data frame in which each row holds the sampling results for one group.


Answer 1:


We can use data.table. Convert the 'data.frame' to a 'data.table' with setDT(df1). Grouped by 'GROUP', we loop through the columns with lapply(.SD, ...): if all of the elements are NA we return NA; otherwise we sample one of the non-NA elements.

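library(data.table)
# sample.int() indexing: sample(v, 1) would treat a single number v as 1:v
setDT(df1)[, lapply(.SD, function(x) {
    v <- x[!is.na(x)]
    if (length(v)) v[sample.int(length(v), 1L)] else x[1L]  # x[1L] is NA when all are NA
}), by = GROUP]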
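
As a quick check of this approach, here is a minimal sketch that rebuilds the example data from the question and applies per-group, per-column sampling. The column names V1 to V3 and the helper name sample_non_na are assumptions made for illustration. The helper indexes with sample.int() because sample(v, 1) on a single number v would draw from 1:v.

```r
library(data.table)

# Example data from the question (column names V1..V3 are assumed here)
df1 <- data.frame(
  GROUP = c(1, 1, 1, 2, 2),
  V1 = c(2, 4, NA, 1, 9),
  V2 = c(6, NA, 3, NA, NA),
  V3 = c(5, 1, 8, 2, NA)
)

# Hypothetical helper: one random non-NA value, or NA if every value is NA
sample_non_na <- function(x) {
  v <- x[!is.na(x)]
  if (length(v) == 0L) return(x[1L])   # all NA: keep an NA of the column's type
  v[sample.int(length(v), 1L)]         # safe even when length(v) == 1
}

set.seed(1)  # only to make the draw repeatable
res <- as.data.table(df1)[, lapply(.SD, sample_non_na), by = GROUP]
print(res)
```

Whatever the seed, GROUP 2 should come back with NA for V2 and 2 for V3, since those are the only values available to draw from.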



Answer 2:


To ignore NAs, pass na.rm = TRUE to the function called inside summarise() (here max() and min()); that function then skips the NAs.

I used dplyr to perform the requested grouping, but base R functions work too. dplyr is easy to use and read.

If the summary function is the same for all columns, you can use summarise_each (superseded by across() in current dplyr) and do the grouping in one go. Otherwise, name each column as in the code below:

library(dplyr)

# na.rm = TRUE makes max()/min() skip NAs within each group
df <- df %>%
  group_by(Group) %>%
  summarise(Var_1 = max(Var_1, na.rm = TRUE),
            Var_2 = max(Var_2, na.rm = TRUE),
            Var_3 = min(Var_3, na.rm = TRUE))
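
Note that max() and min() summarise each group rather than sample from it, and with na.rm = TRUE they return -Inf or Inf (with a warning) when a group's column is entirely NA. If the random sampling described in the question is wanted within dplyr, one possible sketch applies a sampling helper to every column with across(). This assumes dplyr 1.0.0 or later. The data frame below and the helper sample_non_na are hypothetical, built to mirror the question's example.

```r
library(dplyr)

# Hypothetical data mirroring the question's example
df <- data.frame(
  Group = c(1, 1, 1, 2, 2),
  Var_1 = c(2, 4, NA, 1, 9),
  Var_2 = c(6, NA, 3, NA, NA),
  Var_3 = c(5, 1, 8, 2, NA)
)

# Hypothetical helper: one random non-NA value, or NA if every value is NA
sample_non_na <- function(x) {
  v <- x[!is.na(x)]
  if (length(v) == 0L) return(x[1L])   # all NA: keep an NA of the column's type
  v[sample.int(length(v), 1L)]         # index-based, so a lone value is returned as-is
}

out <- df %>%
  group_by(Group) %>%
  summarise(across(everything(), sample_non_na))  # grouping column is excluded automatically
print(out)
```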


Source: https://stackoverflow.com/questions/34389797/sample-a-single-row-per-column-within-a-subset-of-a-data-frame-in-r-while-fol
