Question
Data
I'm working with a data set resembling the data.frame generated below:
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA
Several columns contain NA values; in the last column, valueD, 14 of the 20 observations (70%) are NA.
> sapply(dta, function(x) {table(is.na(x))})
$observation
FALSE
   20
$valueA
FALSE
   20
$valueB
FALSE  TRUE
   16     4
$valueC
FALSE  TRUE
   11     9
$valueD
FALSE  TRUE
    6    14
Problem
I would like to drop this column within a dplyr pipeline, ideally by passing a condition to select().
Attempts
This is easy in base R. For example, to select the columns with less than 50% NAs, I can do:
dta[, colSums(is.na(dta)) < nrow(dta) / 2]
which produces:
> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
Task
I'm interested in achieving the same flexibility in a dplyr pipeline:
Vectorize(require)(package = c("dplyr", # Data manipulation
"magrittr"), # Reverse pipe
char = TRUE)
dta %<>%
# Some transformations I'm doing on the data
mutate_each(funs(as.numeric)) %>%
# I want my select to take place here
Answer 1:
Like this perhaps?
dta %>% select(which(colMeans(is.na(.)) < 0.5)) %>% head
#  observation    valueA    valueB    valueC
#1           1 0.2655087 0.9347052 0.8209463
#2           2 0.3721239        NA        NA
#3           3 0.5728534        NA        NA
#4           4 0.9082078        NA        NA
#5           5 0.2016819        NA        NA
#6           6 0.8983897 0.3861141        NA
Updated to use colMeans instead of colSums, so there is no longer any need to divide by the number of rows.
And, just for the record, in base R you could also use colMeans:
dta[,colMeans(is.na(dta)) < 0.5]
Answer 2:
I think this does the job:
dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head()
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
3           3 0.5728534        NA        NA
4           4 0.9082078        NA        NA
5           5 0.2016819        NA        NA
6           6 0.8983897 0.3861141        NA
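In newer dplyr releases (1.0.0 and later, as far as I know), select_if is superseded in favour of select() combined with the where() helper. A minimal sketch of the same idea under that assumption, rebuilding the question's data so that it runs on its own (the name kept is just for illustration):

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3]  <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# Keep only the columns in which fewer than half of the values are NA
kept <- dta %>%
  select(where(~ mean(is.na(.x)) < 0.5))

names(kept)
```

Here valueD (70% NA) should be the only column dropped; valueC, at 45% NA, stays below the threshold.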
Answer 3:
We can use extract from magrittr after building a logical vector with summarise_each and unlist:
library(magrittr)
library(dplyr)
dta %>%
summarise_each(funs(sum(is.na(.)) < n()/2)) %>%
unlist() %>%
extract(dta,.)
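summarise_each and funs are deprecated in newer dplyr releases, so the snippet above may warn or fail there. A rough equivalent, assuming dplyr 1.0.0 or later where across() is available, might look like this (keep and result are illustrative names, and base indexing stands in for extract):

```r
library(dplyr)

# Rebuild the example data from the question
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5, 3]  <- NA
dta[2:10, 4] <- NA
dta[7:20, 5] <- NA

# One-row summary of TRUE/FALSE flags, flattened to a named logical vector
keep <- dta %>%
  summarise(across(everything(), ~ mean(is.na(.x)) < 0.5)) %>%
  unlist()

# Use the logical vector to pick columns
result <- dta[keep]
names(result)
```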
Or use Filter from base R:
Filter(function(x) sum(is.na(x)) < length(x)/2, dta)
Or, slightly more compactly:
Filter(function(x) mean(is.na(x)) < 0.5, dta)
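If the threshold needs to vary, the same test can be wrapped in a small function that also slots into a pipe, since the data frame is its first argument. The helper name drop_sparse_cols, its max_na parameter, and the toy data frame below are all made up for illustration:

```r
# Hypothetical helper: keep columns whose share of NA values is below max_na
drop_sparse_cols <- function(df, max_na = 0.5) {
  keep <- vapply(df, function(x) mean(is.na(x)) < max_na, logical(1))
  df[, keep, drop = FALSE]
}

# Small illustrative data frame: a is 25% NA, b is 75% NA
toy <- data.frame(id = 1:4,
                  a  = c(1, NA, 3, 4),
                  b  = c(NA, NA, NA, 4))

names(drop_sparse_cols(toy))        # "id" "a"
names(drop_sparse_cols(toy, 0.2))   # "id"
```

Using drop = FALSE keeps the result a data frame even when only one column survives.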
Source: https://stackoverflow.com/questions/34852112/conditionally-selecting-columns-in-dplyr-where-certain-proportion-of-values-is-n