Subset dataframe based on number of observations in each column

Submitted by 社会主义新天地 on 2019-11-28 14:12:40

Try

df1[, colSums(!is.na(df1)) >= 7]
#   A1 A3
#1  87 NA
#2  67 38
#3  80 10
#4  36 41
#5  71 NA
#6   6 66
#7  26 NA
#8  15  7
#9  14 29
#10 46 NA
#11 19 70
#12 93 23
#13  5 46
#14 94 55

Step by step

First, find out which values in your data are not missing.

!is.na(df1)

This returns a logical matrix

#        A1    A2    A3
# [1,] TRUE  TRUE FALSE
# [2,] TRUE FALSE  TRUE
# [3,] TRUE  TRUE  TRUE
# [4,] TRUE  TRUE  TRUE
# [5,] TRUE  TRUE FALSE
# [6,] TRUE  TRUE  TRUE
# [7,] TRUE  TRUE FALSE
# [8,] TRUE FALSE  TRUE
# [9,] TRUE FALSE  TRUE
#[10,] TRUE FALSE FALSE
#[11,] TRUE FALSE  TRUE
#[12,] TRUE FALSE  TRUE
#[13,] TRUE FALSE  TRUE
#[14,] TRUE FALSE  TRUE

Use colSums to find out how many observations per column are not missing

colSums(!is.na(df1))
#A1 A2 A3 
#14  6 10

Apply your condition "at least 7 observations per column"

colSums(!is.na(df1)) >= 7
#   A1    A2    A3 
# TRUE FALSE  TRUE

Finally, use this logical vector to subset the columns of your data

df1[, colSums(!is.na(df1)) >= 7]
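
The same steps can be traced on a small, self-contained example. This is only a sketch: the data frame `toy` and the threshold of 3 are made up for illustration and are not the `df1` from the question.

```r
# Hypothetical toy data (an assumption, not the original df1)
toy <- data.frame(
  x = c(1, 2, 3, 4),
  y = c(NA, NA, 5, NA),
  z = c(7, NA, 9, 10)
)

non_missing <- !is.na(toy)       # logical matrix, TRUE where a value is present
counts <- colSums(non_missing)   # TRUE counts as 1, so this is the number of observations per column
keep <- counts >= 3              # logical vector, one entry per column
result <- toy[, keep]            # keeps only the columns flagged TRUE

names(result)
```

Here `x` has 4 observations, `y` has 1 and `z` has 3, so only `x` and `z` pass the threshold.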

Turn this into a function if you need it regularly

almost_complete_cols <- function(data, min_obs) {
  data[, colSums(!is.na(data)) >= min_obs, drop = FALSE]
}

almost_complete_cols(df1, 7)
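
The function adds drop = FALSE, which matters when only one column meets the threshold. A minimal sketch of the difference, using a made-up data frame `toy2` (an assumption for illustration, not data from the question):

```r
almost_complete_cols <- function(data, min_obs) {
  data[, colSums(!is.na(data)) >= min_obs, drop = FALSE]
}

# Hypothetical data in which only column x has 4 observations
toy2 <- data.frame(x = c(1, 2, 3, 4), y = c(NA, NA, 5, NA))

as_vector <- toy2[, colSums(!is.na(toy2)) >= 4]   # single column is simplified to a plain vector
as_df <- almost_complete_cols(toy2, 4)            # stays a one-column data frame

is.data.frame(as_vector)
is.data.frame(as_df)
```

Without drop = FALSE the single remaining column is returned as a numeric vector, which can break later code that expects a data frame.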