Variable results with dplyr summarise, depending on output variable naming


I'm using the dplyr package (dplyr 0.4.3; R 3.2.3) for basic summary of grouped data (summarise), but get inconsistent results (NaN f


1 Answer

伪装坚强ぢ

2020-12-11 20:48
The expressions you pass to summarise() are evaluated in the order they appear, so if one expression overwrites a variable, every expression after it sees the new value rather than the original column (this differs from the base function transform()). When you do
```
df %>% group_by(time) %>%
  summarise(glucose=mean(glucose, na.rm=TRUE),
        glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)))
```
The glucose=mean(glucose, na.rm=TRUE) part replaces the glucose variable with the per-group mean, so when the glucose.sd=sd(glucose, na.rm=TRUE) part is evaluated, sd() does not see the original glucose values; it sees the new value, which is the mean of the original values. If you reorder the columns so that the expression overwriting glucose comes last, it will work.
```
df %>% group_by(time) %>%
  summarise(glucose.sd=sd(glucose, na.rm=TRUE),
        n=sum(!is.na(glucose)), 
        glucose=mean(glucose, na.rm=TRUE))
```
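Another way to sidestep the ordering issue is to give the mean its own column name, so the original glucose column is never overwritten inside summarise(). The sketch below assumes a small made-up data frame (the question's actual df is not shown) and a hypothetical output name glucose.mean:

```r
library(dplyr)

# Hypothetical toy data standing in for the question's df.
df <- data.frame(
  time    = c(1, 1, 1, 2, 2, 2),
  glucose = c(5, 6, NA, 7, 9, 8)
)

# The mean goes into a new column (glucose.mean, an illustrative name),
# so sd() and the count still operate on the original glucose values,
# whatever order the expressions are written in.
res <- df %>%
  group_by(time) %>%
  summarise(glucose.mean = mean(glucose, na.rm = TRUE),
            glucose.sd   = sd(glucose, na.rm = TRUE),
            n            = sum(!is.na(glucose)))
```

With this naming, the three expressions can be listed in any order without changing the result.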
If you are wondering why this is the default behavior: it is often convenient to create a column and then use that column's value later in the same call. For example, with mutate()
```
df %>% group_by(time) %>%
  mutate(glucose_sq = glucose^2,
        glucose_sq_plus2 = glucose_sq+2)
```
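For contrast, here is a minimal sketch of how base transform() behaves, assuming a made-up three-row data frame and a fresh session with no glucose_sq object in the calling environment. transform() evaluates all of its expressions against the original columns, so a column created earlier in the same call is not visible to later expressions:

```r
# Hypothetical toy data for illustration only.
df <- data.frame(glucose = c(5, 6, 7))

# In a single transform() call, glucose_sq does not exist yet when
# glucose_sq + 2 is evaluated, so the call fails; the error is caught here.
single_call <- tryCatch(
  transform(df, glucose_sq = glucose^2, glucose_sq_plus2 = glucose_sq + 2),
  error = function(e) "error"
)

# Nesting two calls works, because the inner result already has glucose_sq.
nested <- transform(transform(df, glucose_sq = glucose^2),
                    glucose_sq_plus2 = glucose_sq + 2)
```

This is the difference the sequential evaluation in mutate() and summarise() is meant to remove.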