Question
I am using R to summarize a large amount of data for a report. I want to use lapply() to generate a list of tables from the table() function, from which I can extract the statistics I need. There are a lot of these, so I've written a function to do it. My problem is returning the number of missing (NA) values: each table contains that count, but I can't figure out how to tell R that I want the element of the table() result that holds the number of NA values. As far as I can tell, R is "naming" that element NA, and I can't refer to it by that name.
I'm trying to avoid writing some complex statement like which(is.na(names(element[1]))) | names(element[1])=="var_I_want" because that's really wordy. I was hoping there was some way either to tell R to label the NA element in each table with a character name, or to tell it to pick the one labeled NA, but I haven't had much luck yet.
Minimal example:
example <- data.frame(ID=c(10,20,30,40,50),
                      V1=c("A","B","A",NA,"C"),
                      V2=c("Dog","Cat",NA,"Cat","Bunny"),
                      V3=c("Yes","No","No","Yes","No"),
                      V4=c("No",NA,"No","No","Yes"),
                      V5=c("No","Yes","Yes",NA,"No"))
varlist <- c("V1","V2","V3","V4","V5")
# One frequency table per variable, always including an NA count
list_o_tables <- lapply(X=example[varlist],FUN=table,useNA="always")
list(V1=list_o_tables[["V1"]]["A"],
     V2=list_o_tables[["V2"]]["Cat"],
     V3=list_o_tables[["V3"]]["Yes"],
     V4=list_o_tables[["V4"]]["Yes"],
     V5=list_o_tables[["V5"]]["Yes"])
What I get:
$V1
A
2
$V2
Cat
2
$V3
Yes
2
$V4
Yes
1
$V5
Yes
2
What I'd like:
$V1
  A <NA>
  2    1
$V2
Cat <NA>
  2    1
$V3
Yes <NA>
  2    0
$V4
Yes <NA>
  1    1
$V5
Yes <NA>
  2    1
Answer 1:
This is ugly (IMHO) but it works:
my_table <- function(x){
  # Relabel the table so the missing-value count is named with the string 'NA'
  setNames(table(x,useNA = "always"),c(sort(unique(x[!is.na(x)])),'NA'))
}
So you'd lapply this instead of table, and then you could access the NA count by the character name "NA".
Looking more closely, this is rooted in the behavior of factor:
levels(factor(c(1,NA,2),exclude = NULL))
[1] "1" "2" NA
My recollection is that the distinction between a factor level of NA versus "NA" has been, at the very least, a source of confusion in R in the past. I feel like I've seen some debates about the merits of this on r-devel, but I can't recall for sure at the moment.
So the issue is: if you have a factor with NA values, what do you call the levels? Technically, the current behavior is correct: one of the levels is "missing", not literally "NA". It would be nice (IMHO) if table didn't adhere to this quite so strictly, though.
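To make the NA versus "NA" distinction concrete, here is a minimal sketch that inspects the names of a table built with useNA = "always". The vector x and the variable names are illustrative assumptions, not taken from the question.

```r
# Illustrative data (not from the question): one missing value
x <- c("A", "B", NA, "A")
tab <- table(x, useNA = "always")

# For a one-dimensional table, names() gives the category labels
nms <- names(tab)

# The last label is a true missing value ...
is.na(nms)

# ... not the two-character string "NA"
identical(nms[3], "NA")

# print() shows that label as <NA>, but no element is actually called "<NA>"
"<NA>" %in% nms
```

This is why looking the element up by a string such as "NA" or "<NA>" does not find it unless the names are changed first.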
Answer 2:
tab[match(NA, names(tab))] seems to work where tab[NA], tab[NA_character_], tab["NA_character_"], tab["<NA>"], etc. fail...
f <- function(nms, obj) {
  # Look up each requested name, plus NA, by position in names(obj)
  obj[sapply(c(nms, NA), function(X) match(X, names(obj)))]
}
f("Cat", list_o_tables[["V2"]])
# Cat <NA>
# 2 1
mapply(f, list("A", "Cat", "Yes", "Yes", "Yes"), list_o_tables, SIMPLIFY=FALSE)
# [[1]]
#
# A <NA>
# 2 1
#
# [[2]]
#
# Cat <NA>
# 2 1
#
# [[3]]
#
# Yes <NA>
# 2 0
#
# [[4]]
#
# Yes <NA>
# 1 1
#
# [[5]]
#
# Yes <NA>
# 2 1
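The difference between the two kinds of lookup can be seen in a small sketch. It relies on the behavior documented in ?Extract, that NA character subscripts never match any name, while match() does treat NA as matching NA. The data reuses the V2 column from the question; the variable names tab and pos are just for illustration.

```r
# V2 values from the question's example
tab <- table(c("Dog", "Cat", NA, "Cat", "Bunny"), useNA = "always")

# A character subscript of NA does not pick out the element whose name is NA
tab[NA_character_]

# match() compares NA with NA, so it returns the position of the NA name
pos <- match(NA, names(tab))
pos

# Indexing by that position retrieves the missing-value count
tab[pos]
```

Because the lookup is turned into a position, the same approach works regardless of how many non-missing categories the table has.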
Answer 3:
Why not just fix the names up after the fact?
tables <- lapply(example[-1], table, useNA = "ifany")
fix_names <- function(x) {
  # Replace the missing name with the literal string "<NA>"
  names(x)[is.na(names(x))] <- "<NA>"
  x
}
lapply(tables, fix_names)
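Once the names are fixed, the elements can be selected by ordinary character subscripts. The following is a minimal sketch of that last step for a single column, applying the same renaming as fix_names inline; the variable names v2, tab and res are illustrative.

```r
# V2 values from the question's example
v2 <- c("Dog", "Cat", NA, "Cat", "Bunny")
tab <- table(v2, useNA = "always")

# Same renaming as in fix_names: give the missing name a character label
names(tab)[is.na(names(tab))] <- "<NA>"

# Both the category of interest and the NA count are now addressable by name
res <- tab[c("Cat", "<NA>")]
res
```

This sketch uses useNA = "always" rather than "ifany". With "ifany", a column without missing values (such as V3) has no NA element to rename, so looking up "<NA>" there would presumably give NA instead of the 0 the question asks for.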
Answer 4:
When you set useNA="always", table() always adds NA as the last element of the result, so one way to do this is to use tail to your advantage. Assuming we have your list from above (which I'll call l1)...
l1 <- list(V1=list_o_tables[["V1"]]["A"],
V2=list_o_tables[["V2"]]["Cat"],
V3=list_o_tables[["V3"]]["Yes"],
V4=list_o_tables[["V4"]]["Yes"],
V5=list_o_tables[["V5"]]["Yes"])
We can get the NA counts and then join them like this:
l2 <- lapply( list_o_tables , tail , 1 )
mapply( c , l1, l2 , SIMPLIFY = FALSE )
#$V1
# A <NA>
# 2 1
#$V2
# Cat <NA>
# 2 1
#$V3
# Yes <NA>
# 2 0
#$V4
# Yes <NA>
# 1 1
#$V5
# Yes <NA>
# 2 1
Source: https://stackoverflow.com/questions/20434764/in-r-can-i-make-the-table-function-return-the-number-of-na-values-in-a-named