Question
I need your help here. I need to calculate variance manually in R. I have achieved it with the code below, but it is not robust enough for missing values and non-numeric data types.
a <- c(1, 2, 3, 4, 5)
k <- mean(a, na.rm = TRUE)
storage <- a
for (i in seq_along(a)) {
  # squared deviation of the i-th element (a[i], not the loop index i)
  storage[i] <- (a[i] - k)^2
}
storage <- sum(storage / (length(a) - 1))
storage
I run into trouble when I have a = c(1,2,3,4,5,c,NA). How should I edit the code?
Answer 1:
You are using a for loop, but that is unnecessary. You can write a vectorised function whose first step removes the NAs from the data, via conversion to character and then to numeric (needed because the unquoted c is a function):
# Create data
set.seed(1)
x1 <- sample(1:10, 5)
x2 <- c(x1, c, NA)
# Make the function
varFunc <- function(x){
  # Convert to character then numeric (non-numeric become NA), then remove NAs
  x <- as.numeric(as.character(x))
  x <- x[!is.na(x)]
  # Return variance
  sum((x - mean(x))^2) / (length(x) - 1)
}
# Use the function
varFunc(x1)
varFunc(x2)
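# Sanity check
var(x1)
# var() requires an atomic vector, so convert the list x2 before checking
var(as.numeric(as.character(x2)), na.rm = TRUE)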
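To see why the double conversion is needed, here is a small sketch that traces each step on a vector built the same way as x2. The names mixed, as_text, as_num, clean and manual, and the input values, are made up for illustration; the steps mirror what varFunc does internally.

```r
# Hypothetical input: because the unquoted c is a function, c() returns a list
mixed <- c(4, 8, 15, c, NA)
is.list(mixed)

# as.character() deparses each list element, so the function becomes text
as_text <- as.character(mixed)

# Text that is not a number becomes NA (the coercion warning is silenced here)
as_num <- suppressWarnings(as.numeric(as_text))

# Drop the NAs and apply the sample variance formula
clean <- as_num[!is.na(as_num)]
manual <- sum((clean - mean(clean))^2) / (length(clean) - 1)
manual
```

Calling as.numeric() directly on such a list would fail, which is why the detour through character is used.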
Answer 2:
First, a few observations:
- In R, you can do an operation on the whole vector. E.g. (c(1, 2, 3))^2 yields 1 4 9. There's no need to use a for loop.
- mean isn't the only function that needs na.rm = TRUE; sum does too.
- In R, atomic vectors (which are pretty much all vectors that aren't a list) can only have elements of one single data type. There are four primary types: logical, integer, double and character. If there's more than one type in the vector, all the elements are coerced to be the same, in the following order: character → double → integer → logical. For example, c(1, 'c') will return the character vector "1", "c". That's why you were having trouble. (Note: If there's an NA in the vector, its type will be the same type of the vector.)
Unfortunately for that specific vector, c(1,2,3,4,5,c,NA), I don't think there's a simple way to coerce it to an integer. That's because it's a list that has a function as an element: the function c().
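The observations above can be checked directly in the console. This is a minimal sketch; the variable name v is only for illustration.

```r
# Arithmetic applies to every element at once, no loop required
c(1, 2, 3)^2

# Mixing numbers and strings gives a character vector
typeof(c(1, "c"))

# NA takes on the type of the vector it sits in
typeof(c(1, NA))
typeof(c("a", NA))

# An unquoted c is the function itself, so the result is a list
v <- c(1, 2, 3, 4, 5, c, NA)
typeof(v)
is.function(v[[6]])

# sum() returns NA unless na.rm = TRUE is set
sum(c(1, 2, NA))
sum(c(1, 2, NA), na.rm = TRUE)
```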
However, this function works whenever x is an atomic vector:
variance <- function(x){
  x = as.numeric(x)
  x = na.omit(x)
  m = mean(x)
  return(
    sum((x-m)^2, na.rm = TRUE)/(length(x) - 1)
  )
}
First we coerce the vector to numeric, so we can deal with a vector like c(1, 2, 'a'). Then we remove the NAs, so we don't have to write na.rm = TRUE in mean and sum. Then we just write down the formula.
A minor inconvenience is that when converting a character vector to numeric, we get a warning saying that NAs were introduced by coercion. This can be solved if we write x = suppressWarnings(as.numeric(x)) instead.
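Put together, the quiet version might look like the sketch below. The name variance_quiet is hypothetical, and NAs are dropped here with is.na() rather than na.omit(), which is an equivalent substitution for this purpose.

```r
variance_quiet <- function(x) {
  # Coerce to numeric; entries that cannot be parsed become NA without a warning
  x <- suppressWarnings(as.numeric(x))
  # Keep only the values that survived the coercion
  x <- x[!is.na(x)]
  # Sample variance: squared deviations divided by n - 1
  sum((x - mean(x))^2) / (length(x) - 1)
}

variance_quiet(c(1, 2, "a"))
variance_quiet(c(1, 2, 3, 4, 5, "c", NA))
```

Note that suppressWarnings() hides every warning raised by the wrapped call, not only the coercion one, so it is best kept around the single as.numeric() call as shown.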
If you want your function to be able to handle lists with functions, let me know.
Answer 3:
One possible approach: first, clean up a. If you start with something like a = c(1, 2, 3, 4, 5, "c", NA), then a will not be stored as a numeric variable (because of the non-numeric entry). You might first coerce it to a numeric vector, which will give an extra NA entry:
a = c(1, 2, 3, 4, 5, "c", NA)
a <- as.numeric(a)
a
## 1 2 3 4 5 NA NA
Then, you could subset the vector, retaining only the entries that are not NA (by using ! together with is.na()):
a <- a[!is.na(as.numeric(a))]
a
## 1 2 3 4 5
You could do these right after your initial declaration of a, for instance. Gregor Thomas also suggested na.omit(), which could work if combined properly with as.numeric().
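One way the na.omit() and as.numeric() combination might look is sketched below. The name a_clean is hypothetical, and suppressWarnings() is added only to silence the coercion warning.

```r
a <- c(1, 2, 3, 4, 5, "c", NA)

# Coerce first so that "c" turns into NA, then let na.omit() drop every NA
a_clean <- na.omit(suppressWarnings(as.numeric(a)))

# na.omit() attaches bookkeeping attributes; as.numeric() strips them again
a_clean <- as.numeric(a_clean)
a_clean
```

The order matters: calling na.omit() before as.numeric() would remove the original NA but leave "c" in place, and the later coercion would then produce a new NA.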
I notice that you computed the mean by using the built-in mean() function with na.rm = T. If you're able to use that same approach here, note that var() also has an optional na.rm argument. I suspect you're not allowed to use it since you were instructed to compute the variance by hand, but perhaps you could use it to check your answers.
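Such a check could be as short as the sketch below, assuming the vector is already numeric. The names by_hand and builtin are made up for illustration; all.equal() is used instead of == so that tiny floating-point differences do not cause a false mismatch.

```r
a <- c(1, 2, 3, 4, 5, NA)

# Manual computation on the non-missing values
a_clean <- a[!is.na(a)]
by_hand <- sum((a_clean - mean(a_clean))^2) / (length(a_clean) - 1)

# Built-in result, used for checking only
builtin <- var(a, na.rm = TRUE)

all.equal(by_hand, builtin)
```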
Source: https://stackoverflow.com/questions/61132506/calculate-variance-manually-in-r