How to clean or remove NA values from a dataset without remove the column or row

问题

Is any elegant solution to clean a dataframe from NA values without remove the row or column where the NA is?

Example:

Input dataframe

    C1    C2     C3
 R1  A   <NA>  <NA>
 R2 <NA>  A    <NA>
 R3 <NA> <NA>   A
 R4  B   <NA>  <NA>
 R5 <NA>  B    <NA>
 R6 <NA> <NA>  <NA>
 R7  C   <NA>   B
 R8       C    <NA>
 R9            <NA>
 R10           <NA>
 R11            C

Output dataframe

    C1  C2  C3
R1  A   A   A
R2  B   B   B
R3  C   C   C

For example, here is a messy dataframe (df1) full of NA values

    A       B       C       D       E       F    G    H    I    J    K
1 Healthy    <NA>    <NA>    <NA>    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
2    <NA> Healthy    <NA>    <NA>    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
3    <NA>    <NA> Healthy    <NA>    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
4    <NA>    <NA>    <NA> Healthy    <NA>    <NA> <NA> <NA> <NA> <NA> <NA>
5    <NA>    <NA>    <NA>    <NA> Healthy    <NA> <NA> <NA> <NA> <NA> <NA>
6    <NA>    <NA>    <NA>    <NA>    <NA> Healthy <NA> <NA> <NA> <NA> <NA>

Here is how it should be the dataframe.

   X1        X2        X3      X4        X5        X6        X7      X8      X9       X10       X11
1 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
2 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
3 Healthy ICDAS_1_2 ICDAS_1_2 Healthy ICDAS_1_2 ICDAS_1_2 ICDAS_1_2 Healthy Healthy ICDAS_1_2 ICDAS_1_2
4 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
5 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy
6 Healthy   Healthy   Healthy Healthy   Healthy   Healthy   Healthy Healthy Healthy   Healthy   Healthy

Note that the cell B-2 from the original dataframe now is in the X2-1. So the main issue here is to find the equivalent to "delete the cell and move all the cells up" function from Calc or Excel

All the answers that I found delete all the row or column where the <NA> value was. The way I managed to do it is (and sorry if this is primitive) was to extract only the valid values to a new dataframe:

First. I create an empty dataframe

library("data.table") # required package
new_dataframe <-  data.frame(matrix("", ncol = 11, nrow = 1400) )

Then, I copy every valid value from the old to the new dataframe

new_dataframe$X1 <- df1$A[!is.na(df2$A)]
new_dataframe$X2 <- df1$B[!is.na(df2$B)]
new_dataframe$X3 <- df1$C[!is.na(df2$C)]

etc

So, my question is: is any more elegant solution to "clean" a dataframe from NA values?

Any help is greatly appreciated.

回答1:

If this works for you manually:

new_dataframe$X1 <- df1$A[!is.na(df2$A)]
new_dataframe$X2 <- df1$B[!is.na(df2$B)]
new_dataframe$X3 <- df1$C[!is.na(df2$C)]

then this should work automatically:

new_dataframe = as.data.frame(lapply(df1, na.omit))

should also work (on an arbitrary number of columns). (A more direct translation of your code is what Pierre suggested in the comments: as.data.frame(lapply(mydf, function(x) x[!is.na(x)])).)

Beware that data frames must be rectangular (each column must have the same number of rows), so this will work as you might hope and expect only if each column has the same number of non-missing values. If some rows have fewer non-missing values, they will be recycled to fill out the length of the data frame:

x = data.frame(a = c(1, NA, 2), b = c(2, NA, 3), c = c(NA, "A", NA))
x
#    a  b    c
# 1  1  2 <NA>
# 2 NA NA    A
# 3  2  3 <NA>

as.data.frame(lapply(x, na.omit))
#   a b c
# 1 1 2 A
# 2 2 3 A

A better approach might be to just convert to a list first:

y = lapply(x, na.omit)

You can then see what you've got sapply(y, length) before deciding if you want to coerce to data frame or not.

来源：https://stackoverflow.com/questions/34619124/how-to-clean-or-remove-na-values-from-a-dataset-without-remove-the-column-or-row

标签