Calculate Variance Manually in R

Submitted by 让人想犯罪 on 2020-04-30 06:24:48

Question


I need your help here. I need to calculate variance manually in R. I have achieved it with this code, but it is not robust enough for missing values and non-numeric data types.

a = c(1, 2, 3, 4, 5)
k = mean(a, na.rm = T)
storage = a
for (i in 1:length(a)) {
  # Squared deviation of the i-th value (a[i], not the index i) from the mean
  storage[i] = (a[i] - k)^2
}
storage = sum(storage / (length(a) - 1))
storage

I run into trouble when I have a = c(1,2,3,4,5,c,NA). How should I edit the code?


Answer 1:


The for loop is unnecessary: you can write a vectorised function instead. As its first step, the function converts the input to character and then to numeric, which turns every non-numeric entry into NA, and then removes the NAs. (The detour through character is needed because the unquoted c is the function c(), so the vector is actually a list.)

# Create data
set.seed(1)
x1 <- sample(1:10, 5)
x2 <- c(x1, c, NA)

# Make the function
varFunc <- function(x){
  # Convert to character, then numeric (non-numeric entries become NA), then drop the NAs
  x <- as.numeric(as.character(x))
  x <- x[!is.na(x)]
  # Return Variance 
  sum((x-mean(x))^2) / (length(x)-1)
}

# Use the function 
varFunc(x1)
varFunc(x2)

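# Sanity check against the built-in var()
var(x1)
# var() only accepts atomic vectors, so convert the list x2 the same way first
var(as.numeric(as.character(x2)), na.rm = TRUE)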



Answer 2:


First, a few observations:

  1. In R, you can do an operation on the whole vector. E.g. (c(1, 2, 3))^2 yields 1 4 9. There's no need to use a for loop.
  2. mean isn't the only function that needs na.rm = TRUE; sum does too.
  3. In R, atomic vectors (which are pretty much all vectors that aren't a list) can only have elements of one single data type. There are four primary types: logical, integer, double and character. If there's more than one type in the vector, all the elements are coerced to the most flexible type present, in the order logical → integer → double → character. For example, c(1, 'c') will return the character vector "1", "c". That's why you were having trouble. (Note: If there's an NA in the vector, the NA takes on the same type as the rest of the vector.)
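
The three observations above can be checked directly at the console. This is only an illustrative sketch; the variable names are made up for the example.

```r
# 1. Arithmetic is vectorised: the operation applies to every element.
squares <- c(1, 2, 3)^2                     # 1 4 9

# 2. sum() propagates NA unless na.rm = TRUE is given.
with_na <- c(1, 2, NA)
sum_default <- sum(with_na)                 # NA
sum_na_rm   <- sum(with_na, na.rm = TRUE)   # 3

# 3. Mixing types in c() coerces everything to the most flexible type.
mixed <- c(1, "c", NA)
class(mixed)                                # "character"
is.na(mixed)                                # FALSE FALSE TRUE (the NA is now a character NA)
```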

Unfortunately for that specific vector, c(1,2,3,4,5,c,NA), I don't think there's a simple way to coerce it to an integer. That's because it's a list that has a function as an element: the function c().
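
To see why that vector is awkward, and one possible way around it, here is a minimal sketch. The helper numeric_elements is hypothetical (it is not from the question or the answers), and it assumes you simply want to drop every element that is not a single non-missing number.

```r
# Unquoted c is the function c(), so this builds a list, not an atomic vector
a <- c(1, 2, 3, 4, 5, c, NA)
is.list(a)                       # TRUE
is.function(a[[6]])              # TRUE

# Hypothetical helper: keep only length-one, numeric, non-NA elements
numeric_elements <- function(x) {
  keep <- vapply(
    x,
    function(el) is.numeric(el) && length(el) == 1 && !is.na(el),
    logical(1)
  )
  unlist(x[keep])
}

cleaned <- numeric_elements(a)   # 1 2 3 4 5
```

The result is an ordinary numeric vector, so it can be passed to any variance function that expects atomic input.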

However, this function works whenever x is an atomic vector:

variance <- function(x){
  x = as.numeric(x)
  x = na.omit(x)
  m = mean(x)
  return(
    sum((x-m)^2, na.rm = TRUE)/(length(x) - 1)
  )
}

First we coerce the vector to numeric, so we can deal with a vector like c(1, 2, 'a'). Then we remove the NA's, so we don't have to write na.rm = TRUE in mean and sum. Then we just write down the formula.
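
For reference, the formula in question is the usual sample variance, where n is the number of values left after the NAs are removed and x̄ is their mean:

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(x_i - \bar{x}\right)^2,
\qquad
\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
```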

A minor inconvenience is that when converting a character vector to numeric, we get a warning saying that NAs were generated. This can be solved if we write x = suppressWarnings(as.numeric(x)) instead.
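
A sketch of that variant, under the name variance_quiet (a hypothetical name chosen only for this example), might look like this. Note that suppressWarnings() hides every warning raised by the wrapped call, not just the coercion one, so it is worth keeping the wrapped expression as small as possible.

```r
# Same idea as the function above, with the coercion warning silenced
variance_quiet <- function(x) {
  x <- suppressWarnings(as.numeric(x))  # non-numeric strings become NA, silently
  x <- na.omit(x)                       # drop the NAs
  m <- mean(x)
  sum((x - m)^2) / (length(x) - 1)
}

result <- variance_quiet(c(1, 2, 3, 4, 5, "c", NA))   # 2.5
```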

If you want your function to be able to handle lists with functions, let me know.




Answer 3:


One possible approach: first, clean up a. If you start with something like a = c(1, 2, 3, 4, 5, "c", NA), then a will not be stored as a numeric variable (because of the non-numeric entry). You might first coerce it to a numeric vector, which will give an extra NA entry:

a = c(1, 2, 3, 4, 5, "c", NA)
a <- as.numeric(a)

a

## 1  2  3  4  5 NA NA

Then you could subset the vector, keeping only the entries that are not NA (is.na() flags the missing entries, and ! negates it):

a <- a[!is.na(as.numeric(a))]

a

## 1  2  3  4  5

You could do these right after your initial declaration of a, for instance. Gregor Thomas also suggested na.omit(), which could work if combined properly with as.numeric().

I notice that you computed the mean by using the built-in mean() function and using na.rm = T... if you're able to use that same approach here, note that var() also has an optional na.rm = T parameter. I suspect you're not allowed to use it since you were instructed to compute the variance by hand, but perhaps you could use this to check your answers.
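
Such a check could look like the following sketch, assuming a has already been coerced to numeric; the names a_clean, manual and builtin are only illustrative.

```r
a <- c(1, 2, 3, 4, 5, NA)

# Hand computation on the non-missing values
a_clean <- a[!is.na(a)]
manual <- sum((a_clean - mean(a_clean))^2) / (length(a_clean) - 1)

# Built-in reference; na.rm = TRUE drops the NA before computing
builtin <- var(a, na.rm = TRUE)

all.equal(manual, builtin)       # TRUE
```

Using all.equal() rather than == avoids spurious mismatches from floating-point rounding.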



Source: https://stackoverflow.com/questions/61132506/calculate-variance-manually-in-r
