RStudio: filter rows using only specific variables without losing the other variables' information

Question

Hi everyone, I need a little help with a problem that I'm sure is quite simple, but I can't seem to solve it by myself. This is my dataset:

Age Gender Group V1 V2 V3 V4 V5
 20      1     1     2  1     4
 21      2     1  2     2     1
 35      2     2  2        1
 22            2  1        2

I see that many people suggest the subset/select functions for analysing specific variables, but what I need is to look only at V1 to V5 to decide which rows to delete because of missing data, without losing the Age, Gender and Group information. So I need to tell R to delete every row that has more than 3 missing values across V1 to V5 (which I know how to do) and give me back a data frame with all the columns for the remaining rows (that's the part I'm missing). Something like this:

Age Gender Group V1 V2 V3 V4 V5
 20      1     1     2  1     4
 21      2     1  2     2     1

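I don't know if I managed to explain myself well enough, but thank you in advance.

Answer 1:

We can use rowSums on the selected columns only (the columns whose names start with "V" followed by a number), then use the result to index the full data frame so that every column is kept.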

cols <- grep('^V\\d+', names(df))

If missing data are stored as NA, keep the rows with fewer than 3 NA values in those columns:

df[rowSums(is.na(df[cols])) < 3, ]

#  Age Gender Group V1 V2 V3 V4 V5
#1  20      1     1 NA  2  1 NA  4
#2  21      2     1  2 NA  2 NA  1
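To make the filtering step easier to follow, here is a minimal sketch that splits it into its intermediate pieces. The data frame is rebuilt with data.frame() from the values in the data section of this answer; the names na_count, keep and result are only illustrative.

```r
# Same values as the example data in this answer, rebuilt with data.frame()
df <- data.frame(
  Age    = c(20L, 21L, 35L, 22L),
  Gender = c(1L, 2L, 2L, NA),
  Group  = c(1L, 1L, 2L, 2L),
  V1 = c(NA, 2L, 2L, 1L),
  V2 = c(2L, NA, NA, NA),
  V3 = c(1L, 2L, NA, NA),
  V4 = c(NA, NA, 1L, 2L),
  V5 = c(4L, 1L, NA, NA)
)

# Positions of the columns V1..V5 (here 4:8)
cols <- grep("^V\\d+", names(df))

# Number of NA values per row, counted over V1..V5 only
na_count <- rowSums(is.na(df[cols]))
na_count
# 2 2 3 3

# Logical row index: TRUE for rows with fewer than 3 NAs in V1..V5
keep <- na_count < 3

# Index the full data frame, so Age, Gender and Group are kept
result <- df[keep, ]
```

Note that rows 3 and 4 each have exactly 3 missing values, so `< 3` drops them, which matches the expected output in the question. If rows with exactly 3 missing values should stay, the condition would be `<= 3`. The NA in Gender for the last row is not counted, because only the V columns are inspected.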

If missing data are stored as empty strings instead:

df[rowSums(df[cols] == '') < 3, ]
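One caveat: if the columns contain a mix of empty strings and NA, comparing with '' returns NA for the NA cells, and the row sums then become NA as well. A possible sketch, assuming character columns (the data frame df_chr below is made up for illustration), is to treat both as missing:

```r
# Hypothetical character version of the data: "" marks a missing value,
# with one real NA mixed in (V2, third row)
df_chr <- data.frame(
  Age    = c(20L, 21L, 35L, 22L),
  Gender = c("1", "2", "2", ""),
  Group  = c(1L, 1L, 2L, 2L),
  V1 = c("", "2", "2", "1"),
  V2 = c("2", "", NA, ""),
  V3 = c("1", "2", "", ""),
  V4 = c("", "", "1", "2"),
  V5 = c("4", "1", "", ""),
  stringsAsFactors = FALSE
)

cols <- grep("^V\\d+", names(df_chr))

# TRUE where a cell is NA or an empty string; TRUE | NA evaluates to TRUE,
# so NA cells are counted as missing instead of propagating NA
is_missing <- is.na(df_chr[cols]) | df_chr[cols] == ""

# Keep rows with fewer than 3 missing cells in V1..V5
kept <- df_chr[rowSums(is_missing) < 3, ]
```

Another option would be to convert the empty strings to NA once, right after reading the data, and then use the is.na() version throughout.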

Another option is a row-wise apply, which gives the same result:

df[apply(is.na(df[cols]), 1, sum) < 3, ]

data

df <- structure(list(Age = c(20L, 21L, 35L, 22L), Gender = c(1L, 2L, 
2L, NA), Group = c(1L, 1L, 2L, 2L), V1 = c(NA, 2L, 2L, 1L), V2 = c(2L, 
NA, NA, NA), V3 = c(1L, 2L, NA, NA), V4 = c(NA, NA, 1L, 2L), 
V5 = c(4L, 1L, NA, NA)), class = "data.frame", row.names = c(NA, -4L))
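If the threshold or the column pattern needs to change, the same idea can be wrapped in a small helper. This is only a sketch: drop_sparse_rows and its arguments pattern and max_missing are hypothetical names, not part of base R or any package.

```r
# Hypothetical helper: keep rows with at most `max_missing` NAs in the
# columns whose names match `pattern`; all columns are returned
drop_sparse_rows <- function(data, pattern = "^V\\d+", max_missing = 2) {
  cols <- grep(pattern, names(data))
  n_missing <- rowSums(is.na(data[cols]))
  data[n_missing <= max_missing, , drop = FALSE]
}

# Example data with the same values as above
df <- data.frame(
  Age    = c(20L, 21L, 35L, 22L),
  Gender = c(1L, 2L, 2L, NA),
  Group  = c(1L, 1L, 2L, 2L),
  V1 = c(NA, 2L, 2L, 1L),
  V2 = c(2L, NA, NA, NA),
  V3 = c(1L, 2L, NA, NA),
  V4 = c(NA, NA, 1L, 2L),
  V5 = c(4L, 1L, NA, NA)
)

strict  <- drop_sparse_rows(df)                   # at most 2 NAs: rows 1 and 2
lenient <- drop_sparse_rows(df, max_missing = 3)  # at most 3 NAs: all 4 rows
```

The default max_missing = 2 is equivalent to the `< 3` condition used in this answer.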

Source: https://stackoverflow.com/questions/60808165/r-studio-use-only-specific-variables-but-being-able-to-work-on-and-not-lose-oth

Tags

dataframe