问题
I have a dataset df
of which the structure looks similar to the example below:
nr countrycode questionA questionB questionC WeightquestionA WeightquestionB WeightquestionC
1 NLD 2 1 4 0.6 0.2 0.2
2 NLD NA 4 NA 0.4 0.4 0.2
3 NLD 4 4 1 0.2 0.2 0.6
4 BLG 1 NA 1 0.1 0.5 0.4
5 BLG 5 3 5 0.2 0.2 0.6
The questions A, B and C relate to a similar topic and as a result I would like to create an average score for all questions, taking into account the importance of each question (WeightquestionA WeightquestionB WeightquestionC
).
Currently I have manually calculated the average score.
(questionA*WeightquestionA) + (questionB*WeightquestionB) + (questionC*WeightquestionC)
This would not be an insurmountable problem were it not for the NA's (for which: no they cannot be removed). As a result I would like to automate the process.
I am currently thinking of using sum(!is.na())
for counting the non-NA's in each question (A,B,C) for each row (1 through 5) and putting that value into a new column.
With data.table I however always have trouble getting the syntax right. I believe it should be something like:
df[, NonNA:=sum(!is.na(questionA + questionB + questionC))]
But this sums all NA's in the column, instead of for each row. How should I write the syntax to calculate per row?
I would like to refer to the columns separately by name, because they are not next to each other in the actual df.
Desired output:
nr countrycode qA qB qC WeightquestionA WeightquestionB WeightquestionC NonNA
1 NLD 2 1 4 0.6 0.2 0.2 3
2 NLD NA 4 NA 0.4 0.4 0.2 1
3 NLD 4 4 1 0.2 0.2 0.6 3
4 BLG 1 NA 1 0.1 0.5 0.4 2
5 BLG 5 3 5 0.2 0.2 0.6 3
回答1:
Using data.table
, this you could do this:
df[, NonNA := sum(!is.na(questionA), !is.na(questionB), !is.na(questionC)), by = .(nr)]
A base solution:
df$nonNA <- rowSums(!is.na(df[,c("questionA", "questionB", "questionC")]))
回答2:
Another alternative with recommendation from snoram:
df[, NonNA := rowSums(!is.na(.SD)),
.SDcols=paste0("question", LETTERS[1:3])]
And also:
df[, NonNA := Reduce(function(x, y) x + !is.na(y), .SD, init=rep(0L, .N)),
.SDcols=paste0("question", LETTERS[1:3])]
回答3:
We can count non NA
(for column questionA
,questionB
and questionC
i.e. column number 3 to 5) using apply
as below:
df$nonNA=apply(df[,3:5], 1, function(x) length(which(!is.na(x))))
or (suggestion from snoarm)
df$nonNA=apply(df[,3:5], 1, function(x) sum(!is.na(x)))
Sample output:
questionA questionB questionC nonNA
1 2 1 4 3
2 NA 4 NA 1
3 4 4 1 3
4 1 NA 1 2
5 5 3 5 3
来源:https://stackoverflow.com/questions/52513666/counting-the-nas-in-a-part-of-a-row-in-data-table