Efficient method to subset/drop rows with NA values in R

一整个雨季 2021-02-03 12:51

Background: Before running a stepwise model selection, I need to remove missing values for any of my model terms. With quite a few terms in my model, there are t

3 answers
  •  南旧 (original poster)
    2021-02-03 13:53

    This is one way:

    #  create some random data
    df <- data.frame(y=rnorm(100),x1=rnorm(100), x2=rnorm(100),x3=rnorm(100))
    # introduce random NA's
    df[round(runif(10,1,100)),]$x1 <- NA
    df[round(runif(10,1,100)),]$x2 <- NA
    df[round(runif(10,1,100)),]$x3 <- NA
    
    # this does the actual work...
    # assumes the model terms are in columns 2:4; adjust the indices if they are elsewhere
    for (i in 2:4) {
      df <- df[!is.na(df[, i]), ]
    }
    

    And here's another, using sapply(...) and Reduce(...):

    xx <- data.frame(!sapply(df[2:4],is.na))
    yy <- Reduce("&",xx)
    zz <- df[yy,]
    

    The first statement applies the function is.na(...) to columns 2:4 of df and negates the result, so TRUE marks a value that is present (we want !NA). The second statement combines the columns of xx in succession with the logical & operator, giving one TRUE/FALSE per row. The third statement keeps only the rows where yy is TRUE. Clearly, this can be combined into one horrifically complicated statement:

    zz <- df[Reduce("&", data.frame(!sapply(df[2:4], is.na))), ]
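
To make the three steps easier to follow, here is a small worked sketch on a toy data frame. The data frame `d` and its values are made up for illustration; only the sapply(...)/Reduce(...) pattern comes from the answer.

```r
# Toy data (hypothetical values), small enough to check by eye
d <- data.frame(y  = c(1, 2, 3, 4),
                x1 = c(NA, 1, 1, 1),
                x2 = c(1, NA, 1, 1),
                x3 = c(1, 1, 1, NA))

# Step 1: TRUE where a value is present, one column per model term
xx <- data.frame(!sapply(d[2:4], is.na))

# Step 2: AND the columns together, giving one logical value per row
yy <- Reduce("&", xx)

# Step 3: keep only the rows where every term is present
zz <- d[yy, ]
```

Only the third row has no NA in x1, x2, or x3, so it is the only row left in zz.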
    

    Using sapply(...) and Reduce(...) can be faster if you have very many columns.
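
For comparison, base R also has complete.cases(...), which builds the same kind of row mask in a single call. This is not part of the answer above, just a minimal sketch of an alternative; the data frame `d` and the vector `model_terms` are illustrative names.

```r
# Toy data (hypothetical values)
d <- data.frame(y  = c(1, 2, 3, 4, 5),
                x1 = c(NA, 1, 1, 1, 1),
                x2 = c(1, NA, 1, 1, 1),
                x3 = c(1, 1, 1, NA, 1))

# Columns that must be non-missing (assumed names for the model terms)
model_terms <- c("x1", "x2", "x3")

# TRUE for rows with no NA in any of the chosen columns
keep <- complete.cases(d[, model_terms])
d_clean <- d[keep, ]

# Same mask as the sapply(...)/Reduce(...) approach
reduce_mask <- Reduce("&", data.frame(!sapply(d[model_terms], is.na)))
masks_agree <- all(keep == reduce_mask)
```

Selecting columns by name rather than by position keeps the subset correct if the column order changes.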

    Finally, most modeling functions have parameters that can be set to deal with NAs directly (without resorting to all this). See, for example, the na.action parameter of lm(...).
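
As a rough illustration of the na.action route, the sketch below lets lm(...) drop the incomplete rows itself. The data and the positions of the NAs are made up; na.omit is passed explicitly here, although it is usually already the default taken from options("na.action").

```r
set.seed(1)  # reproducible toy data
df <- data.frame(y = rnorm(100), x1 = rnorm(100),
                 x2 = rnorm(100), x3 = rnorm(100))

# Introduce NAs at known positions (rows 1 to 10 end up incomplete)
df$x1[1:5]  <- NA
df$x2[4:10] <- NA

# lm() drops rows with an NA in any model term before fitting
fit <- lm(y ~ x1 + x2 + x3, data = df, na.action = na.omit)

n_used    <- nobs(fit)               # rows actually used in the fit
n_dropped <- length(fit$na.action)   # rows removed because of NAs
```

Comparing n_used against nrow(df) is a quick check on how many observations the model silently discarded.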
