Use dplyr's filter and mutate to generate a new variable

◇◆丶佛笑我妖孽 submitted on 2021-01-29 01:51:58

Question


I chose the hflights dataset as an example.

I am trying to create a variable/column that contains the TailNum of each plane, but only for the planes that are among the 10% with the longest airtime.

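install.packages("hflights")
library("hflights")
library("dplyr")
flights <- tbl_df(hflights)  # tbl_df() is deprecated as of dplyr 1.0.0; as_tibble(hflights) is the replacement
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new_var = TailNum)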

EDIT: The resulting data frame has only 22208 observations instead of 227496. Is there a way to keep the original data frame, but add a new variable holding the TailNum for the planes in the top 10% by airtime?


Answer 1:


You don't need to pass flights to mutate() after the pipe; the pipe already supplies the data frame as the first argument.

flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>% mutate(new = TailNum)
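
To see what the filter condition selects, it can help to try cume_dist() on a tiny example. The sketch below uses a made-up data frame, toy, which is not part of hflights. cume_dist(desc(AirTime)) gives each row the share of non-missing rows whose AirTime is at least as large, so the longest flights get the smallest scores. A cutoff of 0.25 is used here only because the toy data has ten rows.

```r
library(dplyr)

# Hypothetical toy data: ten flights with distinct airtimes (not from hflights)
toy <- data.frame(
  id = letters[1:10],
  AirTime = c(50, 120, 300, 80, 95, 210, 60, 150, 40, 175),
  stringsAsFactors = FALSE
)

# cd is the proportion of rows whose AirTime is >= this row's AirTime
ranked <- toy %>% mutate(cd = cume_dist(desc(AirTime)))

# Keep the rows whose score falls strictly below the cutoff
top <- ranked %>% filter(cd < 0.25)
print(top)
```

Because the comparison is strict, a row whose score equals the cutoff exactly is dropped, which is worth keeping in mind on small data sets.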

Also, new is a function, so it is best to avoid using it as a variable name. See ?new. As an illustration:

flights <- tbl_df(hflights)
flights %>% filter(cume_dist(desc(AirTime)) < 0.1) %>%
  mutate(new_var = TailNum, new = TailNum) %>%
  select(AirTime, TailNum, new_var)
Source: local data frame [22,208 x 3]

   AirTime TailNum new_var
1      255  N614AS  N614AS
2      257  N627AS  N627AS
3      260  N627AS  N627AS
4      268  N618AS  N618AS
5      273  N607AS  N607AS
6      278  N624AS  N624AS
7      274  N611AS  N611AS
8      269  N607AS  N607AS
9      253  N609AS  N609AS
10     315  N626AS  N626AS
..     ...     ...     ...

To retain all observations, drop the filter(). My usual approach is to use ifelse() inside mutate() instead, so that rows outside the top 10% get NA. Others may be able to suggest a better solution.

f2 <- flights %>% mutate(cumdist = cume_dist(desc(AirTime)), 
                   new_var = ifelse(cumdist < 0.1, TailNum, NA)) %>%
  select(AirTime, TailNum, cumdist, new_var)

table(is.na(f2$new_var))

 FALSE   TRUE 
 22208 205288 
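
One possible variation, offered as a sketch rather than a definitive answer, is dplyr's if_else(), which is stricter about types than base ifelse(). The example below runs on a made-up data frame, toy, with a cutoff of 0.5 chosen only to suit five rows; neither comes from the question. It shows that every row is kept and that a missing AirTime simply yields NA in the new column.

```r
library(dplyr)

# Hypothetical toy data, including one missing airtime (not from hflights)
toy <- data.frame(
  TailNum = c("N1", "N2", "N3", "N4", "N5"),
  AirTime = c(100, 300, NA, 250, 50),
  stringsAsFactors = FALSE
)

flagged <- toy %>%
  mutate(
    cumdist = cume_dist(desc(AirTime)),
    # if_else() needs both branches to share a type, hence NA_character_
    new_var = if_else(cumdist < 0.5, TailNum, NA_character_)
  )
print(flagged)
```

All five rows remain in flagged, and only the row with the longest airtime (N2) receives a non-missing new_var.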


Source: https://stackoverflow.com/questions/27780019/use-dplyr%c2%b4s-filter-and-mutate-to-generate-a-new-variable
