Question
Is there an elegant way to clean NA values out of a dataframe without removing the whole row or column that contains the NA?
Example:
Input dataframe
C1 C2 C3
R1 A <NA> <NA>
R2 <NA> A <NA>
R3 <NA> <NA> A
R4 B <NA> <NA>
R5 <NA> B <NA>
R6 <NA> <NA> <NA>
R7 C <NA> B
R8 C <NA>
R9 <NA>
R10 <NA>
R11 C
Output dataframe
C1 C2 C3
R1 A A A
R2 B B B
R3 C C C
For example, here is a messy dataframe (df1) full of NA values
A B C D E F G H I J K
1 Healthy <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
2 <NA> Healthy <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
3 <NA> <NA> Healthy <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
4 <NA> <NA> <NA> Healthy <NA> <NA> <NA> <NA> <NA> <NA> <NA>
5 <NA> <NA> <NA> <NA> Healthy <NA> <NA> <NA> <NA> <NA> <NA>
6 <NA> <NA> <NA> <NA> <NA> Healthy <NA> <NA> <NA> <NA> <NA>
Here is how the dataframe should look:
X1 X2 X3 X4 X5 X6 X7 X8 X9 X10 X11
1 Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
2 Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
3 Healthy ICDAS_1_2 ICDAS_1_2 Healthy ICDAS_1_2 ICDAS_1_2 ICDAS_1_2 Healthy Healthy ICDAS_1_2 ICDAS_1_2
4 Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
5 Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
6 Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy Healthy
Note that cell B-2 of the original dataframe is now in X2-1. So the main issue here is finding the equivalent of the "delete the cell and shift the cells up" function from Calc or Excel.
All the answers I found delete the entire row or column that contained the <NA> value. The way I managed to do it (sorry if this is primitive) was to extract only the valid values into a new dataframe:
First, I create an empty dataframe:
library("data.table") # required package
new_dataframe <- data.frame(matrix("", ncol = 11, nrow = 1400) )
Then I copy every valid value from the old dataframe to the new one:
new_dataframe$X1 <- df1$A[!is.na(df1$A)]
new_dataframe$X2 <- df1$B[!is.na(df1$B)]
new_dataframe$X3 <- df1$C[!is.na(df1$C)]
etc.
So, my question is: is there a more elegant way to "clean" NA values out of a dataframe?
Any help is greatly appreciated.
Answer 1:
If this works for you manually:
new_dataframe$X1 <- df1$A[!is.na(df1$A)]
new_dataframe$X2 <- df1$B[!is.na(df1$B)]
new_dataframe$X3 <- df1$C[!is.na(df1$C)]
then this should work automatically, on an arbitrary number of columns:
new_dataframe = as.data.frame(lapply(df1, na.omit))
(A more direct translation of your code is what Pierre suggested in the comments: as.data.frame(lapply(mydf, function(x) x[!is.na(x)])).)
Beware that data frames must be rectangular (each column must have the same number of rows), so this will work as you might hope and expect only if each column has the same number of non-missing values. If some columns have fewer non-missing values, those values will be recycled to fill out the length of the data frame, provided the longest column's length is an exact multiple of theirs; otherwise the conversion fails with an error:
x = data.frame(a = c(1, NA, 2), b = c(2, NA, 3), c = c(NA, "A", NA))
x
# a b c
# 1 1 2 <NA>
# 2 NA NA A
# 3 2 3 <NA>
as.data.frame(lapply(x, na.omit))
# a b c
# 1 1 2 A
# 2 2 3 A
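The recycling above succeeds only because the longest column (2 values) is an exact multiple of the shortest (1 value). As a minimal sketch of the other case, assuming a hypothetical data frame z whose columns keep 3 and 2 values respectively, the conversion stops with an error, which tryCatch captures here as a message string:

```r
# Hypothetical data (not from the question): after dropping NAs,
# column a keeps 3 values and column b keeps 2.
z <- data.frame(a = c(1, 2, 3, NA), b = c(1, NA, 2, NA))

# Same per-column filtering as in the comment quoted above
kept <- lapply(z, function(col) col[!is.na(col)])
lengths(kept)
# a b
# 3 2

# 3 is not a multiple of 2, so the columns cannot be recycled;
# as.data.frame() raises an error, caught here as a string.
result <- tryCatch(
  as.data.frame(kept),
  error = function(e) conditionMessage(e)
)
result  # an error message about differing numbers of rows
```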
A better approach might be to just convert to a list first:
y = lapply(x, na.omit)
You can then check what you've got with sapply(y, length) before deciding whether to coerce it to a data frame.
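If the column lengths turn out to differ and you still want a data frame, one possible option (not part of the original answer) is to pad the shorter columns with NA at the bottom, which mimics the "delete the cell and shift up" behaviour from the question. This is only a sketch: the data frame df and the names non_missing, padded, and shifted are made up for illustration.

```r
# Toy data loosely modeled on the question's first example
df <- data.frame(
  C1 = c("A", NA, NA, "B", NA, "C"),
  C2 = c(NA, "A", NA, NA, "B", NA),
  C3 = c(NA, NA, "A", NA, NA, NA),
  stringsAsFactors = FALSE
)

# Drop the NAs column by column; the result is a list of vectors
non_missing <- lapply(df, function(col) col[!is.na(col)])

# Pad every vector to the length of the longest one.
# Indexing past the end of a vector yields NA, so nothing is recycled.
n <- max(lengths(non_missing))
padded <- lapply(non_missing, function(v) v[seq_len(n)])

shifted <- as.data.frame(padded, stringsAsFactors = FALSE)
shifted
#   C1   C2   C3
# 1  A    A    A
# 2  B    B <NA>
# 3  C <NA> <NA>
```

The remaining NA values sit at the bottom of the shorter columns, so the result is rectangular without any value being silently duplicated.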
Source: https://stackoverflow.com/questions/34619124/how-to-clean-or-remove-na-values-from-a-dataset-without-remove-the-column-or-row