Question
I'm having trouble with a data frame and haven't been able to resolve the issue myself.
The data frame has arbitrary properties as columns, and each row represents one data set.
The question is:
How do I get rid of columns where the value is NA for ALL rows?
Answer 1:
Try this:
df <- df[,colSums(is.na(df))<nrow(df)]
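One thing to watch with this one-liner: if only a single column survives, [ simplifies the result to a plain vector. A minimal sketch, assuming a made-up data frame (the names x, y, z and df_clean are only for illustration), that passes drop = FALSE so the result stays a data frame:

```r
# Hypothetical example data: only x has any non-NA values
df <- data.frame(x = 1:3, y = NA, z = NA)

# TRUE for every column that has at least one non-NA value
keep <- colSums(is.na(df)) < nrow(df)

# drop = FALSE keeps a data frame even when a single column remains
df_clean <- df[, keep, drop = FALSE]
names(df_clean)
```

Without drop = FALSE the same subset would come back as an integer vector rather than a one-column data frame.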
Answer 2:
The two approaches offered thus far fail with large data sets because (amongst other memory issues) they create is.na(df), which will be an object the same size as df.
Here are two approaches that are more memory- and time-efficient:
An approach using Filter
Filter(function(x)!all(is.na(x)), df)
and an approach using data.table (for general time and memory efficiency)
library(data.table)
DT <- as.data.table(df)
DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=FALSE]
Examples using large data (30 columns, 1e6 rows):
big_data <- replicate(10, data.frame(rep(NA, 1e6), sample(c(1:8,NA),1e6,TRUE), sample(250,1e6,TRUE)),simplify=FALSE)
bd <- do.call(data.frame,big_data)
names(bd) <- paste0('X',seq_len(30))
DT <- as.data.table(bd)
system.time({df1 <- bd[,colSums(is.na(bd)) < nrow(bd)]})
# error -- can't allocate vector of size ...
system.time({df2 <- bd[, !apply(is.na(bd), 2, all)]})
# error -- can't allocate vector of size ...
system.time({df3 <- Filter(function(x)!all(is.na(x)), bd)})
## user system elapsed
## 0.26 0.03 0.29
system.time({DT1 <- DT[,which(unlist(lapply(DT, function(x)!all(is.na(x))))),with=FALSE]})
## user system elapsed
## 0.14 0.03 0.18
Answer 3:
dplyr now has a select_if verb that may be helpful here:
library(dplyr)
temp <- data.frame(x = 1:5, y = c(1,2,NA,4, 5), z = rep(NA, 5))
not_all_na <- function(x) any(!is.na(x))
not_any_na <- function(x) all(!is.na(x))
> temp
  x  y  z
1 1  1 NA
2 2  2 NA
3 3 NA NA
4 4  4 NA
5 5  5 NA
> temp %>% select_if(not_all_na)
  x  y
1 1  1
2 2  2
3 3 NA
4 4  4
5 5  5
> temp %>% select_if(not_any_na)
  x
1 1
2 2
3 3
4 4
5 5
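In dplyr 1.0.0 and later, select_if is superseded by select() combined with where(). A small sketch of the same all-NA filter in that style, assuming a recent dplyr is installed (the name kept is just for illustration):

```r
library(dplyr)

# Same example data as above: z is entirely NA
temp <- data.frame(x = 1:5, y = c(1, 2, NA, 4, 5), z = rep(NA, 5))

# Predicate: TRUE when the column has at least one non-NA value
not_all_na <- function(x) any(!is.na(x))

# where() applies the predicate to each column and keeps the matches
kept <- temp %>% select(where(not_all_na))
names(kept)
```

The predicate is unchanged; only the selection verb differs.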
Answer 4:
Another way would be to use the apply() function.
If you have the data.frame
df <- data.frame (var1 = c(1:7,NA),
var2 = c(1,2,1,3,4,NA,NA,9),
var3 = c(NA)
)
then you can use apply() to see which columns fulfill your condition, and so you can simply do the same subsetting as in the answer by Musa, only with an apply approach.
> !apply (is.na(df), 2, all)
 var1  var2  var3
 TRUE  TRUE FALSE
> df[, !apply(is.na(df), 2, all)]
  var1 var2
1    1    1
2    2    2
3    3    1
4    4    3
5    5    4
6    6   NA
7    7   NA
8   NA    9
Answer 5:
Late to the game, but you can also use the janitor package. This function removes columns that are all NA, and it can also be set to remove rows that are all NA.
df <- janitor::remove_empty(df, which = "cols")
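To illustrate the row option mentioned above, here is a rough sketch that drops both empty rows and empty columns in one call. It assumes the janitor package is installed; the data frame and the name cleaned are made up for the example:

```r
# Hypothetical data: column b is all NA, and row 2 is all NA
df <- data.frame(a = c(1, NA, 3),
                 b = NA,
                 c = c("x", NA, "z"))

# which accepts "rows", "cols", or both
cleaned <- janitor::remove_empty(df, which = c("rows", "cols"))
dim(cleaned)
```

Here the all-NA row and the all-NA column are both removed, leaving columns a and c.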
Answer 6:
df[sapply(df, function(x) all(is.na(x)))] <- NULL
Answer 7:
The accepted answer does not work with non-numeric columns. From this answer, the following works with columns containing different data types:
Filter(function(x) !all(is.na(x)), df)
Answer 8:
I hope this helps too. It could be written as a single command, but I found it easier to read when split into two. I wrapped the following in a function, and it ran lightning fast.
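naColsRemoval <- function(DataTable) {
  # positions of the columns in which every value is NA
  na.cols <- DataTable[, which(apply(is.na(.SD), 2, all))]
  # drop them by reference; parenthesise the LHS instead of using with = FALSE
  if (length(na.cols) > 0) DataTable[, (na.cols) := NULL]
  invisible(DataTable)
}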
.SD will allow to limit the verification to part of the table, if you wish, but it will take the whole table as
Source: https://stackoverflow.com/questions/2643939/remove-columns-from-dataframe-where-all-values-are-na