Question
Hi everyone, I need a little help with a problem that I'm sure is quite simple, but I can't seem to solve it by myself. Basically, this is my dataset:
Age Gender Group V1 V2 V3 V4 V5
20  1      1        2  1     4
21  2      1     2     2     1
35  2      2     2        1
22         2     1        2
I see that many people suggest the subset/select functions to run an analysis on specific variables, but what I need is to work on V1 to V5 to see how many rows to delete because of missing data, without losing the Age, Gender and Group information. So I basically need to tell R to delete every row that has 3 or more missing values across V1 to V5 (which I know how to do) and give me back a data frame with all the information for the remaining rows (that's the part I'm missing). Something like this:
Age Gender Group V1 V2 V3 V4 V5
20  1      1        2  1     4
21  2      1     2     2     1
I don't know if I managed to explain myself well enough, but thank you in advance.
Answer 1:
We can use rowSums on selected columns (those whose names start with "V" followed by a digit).
cols <- grep('^V\\d+', names(df))
If missing data are stored as NA values:
df[rowSums(is.na(df[cols])) < 3, ]
# Age Gender Group V1 V2 V3 V4 V5
#1 20 1 1 NA 2 1 NA 4
#2 21 2 1 2 NA 2 NA 1
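Since the question also asks how many rows would be deleted, the per-row NA count can be inspected before filtering. The following is a minimal sketch in base R; it rebuilds the same example values with data.frame() so that it runs on its own, and the names n_missing and kept are only illustrative.

```r
# Example data, same values as the `df` used in this answer.
df <- data.frame(
  Age    = c(20L, 21L, 35L, 22L),
  Gender = c(1L, 2L, 2L, NA),
  Group  = c(1L, 1L, 2L, 2L),
  V1 = c(NA, 2L, 2L, 1L),
  V2 = c(2L, NA, NA, NA),
  V3 = c(1L, 2L, NA, NA),
  V4 = c(NA, NA, 1L, 2L),
  V5 = c(4L, 1L, NA, NA)
)

cols <- grep('^V\\d+', names(df))       # positions of V1..V5
n_missing <- rowSums(is.na(df[cols]))   # NA count per row, V columns only

table(n_missing)                        # how many rows have each NA count
sum(n_missing >= 3)                     # rows the `< 3` filter drops (2 here)

# Logical row index + empty column index: all columns are kept,
# so Age, Gender and Group come along with the surviving rows.
kept <- df[n_missing < 3, ]
```

Only the V columns are used to build the row index; because the column index is left empty, the result still carries every column of df.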
If missing data are stored as empty cells (empty strings):
df[rowSums(df[cols] == '') < 3, ]
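If the columns might contain a mix of empty strings and NA, the `== ''` comparison alone returns NA for the NA cells, and the row sums become NA as well. A possible way around this, sketched below, is to combine both tests before counting. The data frame df_chr is hypothetical and made up purely for illustration (character columns with '' and one NA); it is not the asker's actual data.

```r
# Hypothetical character data: '' marks an empty cell, plus one real NA.
df_chr <- data.frame(
  Age = c(20L, 21L, 35L, 22L),
  V1  = c("",  "2", "2", "1"),
  V2  = c("2", "",  "",  NA),
  V3  = c("1", "2", "",  ""),
  V4  = c("",  "",  "1", "2"),
  V5  = c("4", "1", "",  ""),
  stringsAsFactors = FALSE
)

cols <- grep('^V\\d+', names(df_chr))

# TRUE where a cell is NA or an empty string. `TRUE | NA` is TRUE,
# so the NA produced by `NA == ''` does not leak into the result.
is_missing <- is.na(df_chr[cols]) | df_chr[cols] == ''

kept <- df_chr[rowSums(is_missing) < 3, ]
```

An alternative would be to convert empty strings to NA once after reading the data and then use the is.na() version throughout.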
Another option is a row-wise apply:
df[apply(is.na(df[cols]), 1, sum) < 3, ]
data
df <- structure(list(Age = c(20L, 21L, 35L, 22L), Gender = c(1L, 2L,
2L, NA), Group = c(1L, 1L, 2L, 2L), V1 = c(NA, 2L, 2L, 1L), V2 = c(2L,
NA, NA, NA), V3 = c(1L, 2L, NA, NA), V4 = c(NA, NA, 1L, 2L),
V5 = c(4L, 1L, NA, NA)), class = "data.frame", row.names = c(NA, -4L))
Source: https://stackoverflow.com/questions/60808165/r-studio-use-only-specific-variables-but-being-able-to-work-on-and-not-lose-oth