I have a large data.frame that was generated by a process outside my control, which may or may not contain variables with zero variance (i.e. all the observations are the sa
Check this custom function. I did not try it on data frames with 100+ variables.
remove_low_variance_cols <- function(df, threshold = 0) {
n <- Sys.time() #See how long this takes to run
remove_cols <- df %>%
select_if(is.numeric) %>%
map_dfr(var) %>%
gather() %>%
filter(value <= threshold) %>%
spread(key, value) %>%
names()
if(length(remove_cols)) {
print("Removing the following columns: ")
print(remove_cols)
}else {
print("There are no low variance columns with this threshold")
}
#How long did this script take?
print(paste("Time Consumed: ", Sys.time() - n, "Secs."))
return(df[, setdiff(names(df), remove_cols)])
}