Remove columns of dataframe based on conditions in R

死守一世寂寞 2020-12-10 00:00

I have to remove columns from my dataframe, which has over 4000 columns and 180 rows. The conditions I want to set for removing a column from the dataframe are: (i) Remove the c

2 answers
  •  Happy的楠姐
    2020-12-10 00:26

    I feel like this is all over-complicated. Condition 2 already covers all the other conditions: if there are at least two non-NA values in a column, then obviously the whole column isn't NA. And if there are at least two consecutive values in a column, then obviously the column contains more than one value. So instead of 3 conditions, this all boils down to a single condition (I prefer not to run many functions per column; rather, after running diff per column, I vectorize the whole thing):

    cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1
    

    This works because if a column has no two consecutive non-NA values, every element of diff for that column is NA, so the whole differenced column becomes NA.
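
    To make the reasoning concrete, here is a minimal sketch on a small made-up data frame. The column names and values are hypothetical and are not the data from the question; the sketch only shows how many NAs diff produces per column and how that count turns into the keep/drop condition.

```r
# Hypothetical toy data; names and values are illustrative only.
df <- data.frame(
  A = c(0.018, 0.017, NA, 0.019),  # two adjacent non-NA values
  B = c(NA, 0.5, NA, 0.7),         # non-NA values are never adjacent
  C = c(NA, NA, NA, NA_real_),     # entirely NA
  D = c(NA, NA, 0.3, NA)           # a single non-NA value
)

# diff() yields NA wherever either neighbour is NA, so a column without
# two consecutive non-NA values gives nrow(df) - 1 NAs.
d <- sapply(df, diff)              # matrix with nrow(df) - 1 rows
na_per_col <- colSums(is.na(d))

# Keep a column only if at least one difference is not NA.
cond <- na_per_col < nrow(df) - 1
kept <- df[, cond, drop = FALSE]
print(names(kept))
```

    In this toy case only the column with two adjacent values survives; the others are dropped by the single condition.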

    Then, just

    df[, cond, drop = FALSE]
    #        A     E
    # 1  0.018    NA
    # 2  0.017    NA
    # 3  0.019    NA
    # 4  0.018    NA
    # 5  0.018    NA
    # 6  0.015 0.037
    # 7  0.016 0.031
    # 8  0.019 0.025
    # 9  0.016 0.035
    # 10 0.018 0.035
    # 11 0.017 0.043
    # 12 0.023 0.040
    # 13 0.022 0.042
    

    Per your edit, it seems like you have a data.table object and you also have a Date column so the code would need some modifications.

    cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
    df[, c(TRUE, cond), with = FALSE]
    

    Some explanations:

    • We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (short for "Subset of Data" in data.table terms)
    • .N is just the row count (similar to nrow(df))
    • The next step is to subset by the condition. We must not forget to grab the first column too, so we start with c(TRUE, ...
    • Finally, data.table uses non-standard evaluation by default; hence, if you want to select columns as you would in a data.frame, you need to specify with = FALSE
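
    The points above can be tried on a small made-up table. This is only a sketch under stated assumptions: the table dt and its columns are hypothetical, the data.table package is assumed to be installed, and sapply is swapped in for lapply so that the NA counts come back as a plain named vector.

```r
library(data.table)

# Hypothetical toy table: a Date column first, then measurement columns.
dt <- data.table(
  Date = as.Date("2020-01-01") + 0:3,
  A = c(0.018, 0.017, NA, 0.019),  # has two adjacent non-NA values
  B = c(NA, 0.5, NA, 0.7)          # non-NA values are never adjacent
)

# Count NA differences per column, skipping the first (Date) column.
n_na <- dt[, sapply(.SD, function(x) sum(is.na(diff(x)))), .SDcols = -1]

# TRUE for columns with at least one non-NA difference.
cond <- n_na < nrow(dt) - 1

# Prepend TRUE so the Date column is always kept, then select
# data.frame-style with with = FALSE.
keep <- c(TRUE, unname(cond))
kept <- dt[, keep, with = FALSE]
print(names(kept))
```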

    A better way, though, would be to remove the columns by reference using := NULL:

    cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
    df[, which(cond) := NULL]
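
    Because := modifies the table in place, it can help to see the effect next to an untouched copy. The sketch below assumes the data.table package is installed and uses a hypothetical table dt; the copy() call, the sapply variant, and the length guard are illustrative additions, not part of the code above.

```r
library(data.table)

# Hypothetical toy table, same shape as before: Date first, then values.
dt <- data.table(
  Date = as.Date("2020-01-01") + 0:3,
  A = c(0.018, 0.017, NA, 0.019),
  B = c(NA, 0.5, NA, 0.7)
)
backup <- copy(dt)  # := changes dt itself, so keep a copy for comparison

# TRUE for columns whose differences are all NA (columns to drop).
all_na_diff <- dt[, sapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1,
                  .SDcols = -1]

# The leading FALSE protects the Date column from removal.
cond <- c(FALSE, all_na_diff)
drop_idx <- unname(which(cond))

# Remove by reference; skip the call when nothing qualifies.
if (length(drop_idx) > 0) dt[, (drop_idx) := NULL]

print(names(dt))
print(names(backup))
```

    No new table is allocated here, which matters with thousands of columns; the trade-off is that the original dt is changed, so take a copy() first if the full table is still needed.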
    
