Variable results with dplyr summarise, depending on output variable naming

囚心锁ツ 2020-12-11 20:20

I'm using the dplyr package (dplyr 0.4.3; R 3.2.3) for basic summary of grouped data (summarise), but get inconsistent results (NaN f

1 Answer
  • 2020-12-11 20:48

    The expressions you pass to summarise() are evaluated in the order they appear, so once an expression overwrites a variable, every later expression sees the new value rather than the original (this differs from the base function transform()). When you do

    df %>% group_by(time) %>%
      summarise(glucose=mean(glucose, na.rm=TRUE),
            glucose.sd=sd(glucose, na.rm=TRUE),
            n=sum(!is.na(glucose)))
    

    The glucose=mean(glucose, na.rm=TRUE) part replaces the glucose variable with the group mean, so when glucose.sd=sd(glucose, na.rm=TRUE) is evaluated, sd() does not see the original glucose values; it sees the single new value that is their mean, and the standard deviation of a single value is undefined. If you re-order the columns, it will work.

    df %>% group_by(time) %>%
      summarise(glucose.sd=sd(glucose, na.rm=TRUE),
            n=sum(!is.na(glucose)), 
            glucose=mean(glucose, na.rm=TRUE))
    
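
    To see the effect on a small scale, here is a minimal sketch on made-up data (the df below and its values are hypothetical, not the asker's data). It also shows an alternative to re-ordering: giving the mean a different output name, here the assumed name glucose.mean, so the original column is never overwritten.

    ```r
    library(dplyr)

    # Hypothetical toy data: two time points, three readings each, one missing
    df <- data.frame(time = rep(c(0, 30), each = 3),
                     glucose = c(5.1, 4.9, NA, 7.8, 8.4, 8.1))

    # glucose is overwritten first, so sd() and the count only see the group mean
    bad <- df %>% group_by(time) %>%
      summarise(glucose = mean(glucose, na.rm = TRUE),
                glucose.sd = sd(glucose, na.rm = TRUE),
                n = sum(!is.na(glucose)))

    # A distinct output name keeps the original glucose values available
    good <- df %>% group_by(time) %>%
      summarise(glucose.mean = mean(glucose, na.rm = TRUE),
                glucose.sd = sd(glucose, na.rm = TRUE),
                n = sum(!is.na(glucose)))
    ```

    In bad, every later expression works on a single value per group, so the standard deviation is missing and n is 1; in good, both are computed from the raw readings.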

    If you are wondering why this is the default behavior, it is because it is often convenient to create a column and then use its value in later expressions of the same call. For example, with mutate()

    df %>% group_by(time) %>%
      mutate(glucose_sq = glucose^2,
            glucose_sq_plus2 = glucose_sq+2)
    