Correlation between NA columns

筅森魡賤 submitted on 2019-12-04 06:56:24

Question


I have to write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate (two columns) from each file where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no files meet the threshold requirement, then the function should return a numeric vector of length 0. A prototype of this function follows.

My code looks like this:

corr <- function(directory,threshold=0){
    a<-list.files("specdata")
    for (i in a) {
        data <- read.csv(paste(directory, "/", i, sep =""))
        x<-complete.cases(data)
        j<-sum(as.numeric(x))
        sulfate<-data[,2]
        nitrate<-data[,3]
        b<-cor(sulfate,nitrate)
    }  
    if (j>threshold) 
        return(b) 
    else
        numeric()
}

There's no error message.

If I type

z<-corr("specdata")

head(z)
[1] NA

I don't know what the problem is. I don't know if the NA values in the columns have anything to do with it. I think something is missing in my code. I think read.csv creates a single data frame when I need one data frame per file, but I don't see why the return value is NA in this case (when there's no threshold).

However, if I introduce a bigger threshold (1000):

z<-corr("specdata",1000)
head(z)
numeric(0)

The expected output I need is

cr <- corr("specdata", 150) 
head(cr) 
[1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814

Answer 1:


This is a correct, working solution that you can refer to:

corr <- function(directory, threshold = 0) {
  ## 'directory' is a character vector of length 1 indicating the location of
  ## the CSV files

  ## 'threshold' is a numeric vector of length 1 indicating the number of
  ## completely observed observations (on all variables) required to compute
  ## the correlation between nitrate and sulfate; the default is 0

  ## Return a numeric vector of correlations
  df = complete(directory)
  ids = df[df["nobs"] > threshold, ]$id
  corrr = numeric()
  for (i in ids) {

    newRead = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"), 
                             ".csv", sep = ""))
    dff = newRead[complete.cases(newRead), ]
    corrr = c(corrr, cor(dff$sulfate, dff$nitrate))
  }
  return(corrr)
}
complete <- function(directory, id = 1:332) {
  f <- function(i) {
    data = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"), 
                          ".csv", sep = ""))
    sum(complete.cases(data))
  }
  nobs = sapply(id, f)
  return(data.frame(id, nobs))
}
cr <- corr("specdata", 150)
head(cr)
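The step that matters most in the answer above is filtering with complete.cases before calling cor. Here is a minimal sketch of why that removes the NA result, using a made-up data frame (toy, and the values in it, are hypothetical and not taken from specdata):

```r
# Hypothetical toy data standing in for one monitor file
toy <- data.frame(
  sulfate = c(1.0, 2.0, NA, 4.0, 5.0),
  nitrate = c(2.1, NA, 3.0, 8.2, 9.9)
)

# With the default use = "everything", any NA in either column gives NA
r_default <- cor(toy$sulfate, toy$nitrate)

# Keep only rows where every column is observed, then correlate
ok <- complete.cases(toy)
r_filtered <- cor(toy$sulfate[ok], toy$nitrate[ok])

# Built-in alternative: let cor() drop the incomplete pairs itself
r_builtin <- cor(toy$sulfate, toy$nitrate, use = "complete.obs")
```

The two approaches agree here because toy has only these two columns. In a file with extra columns, complete.cases on the whole data frame also drops rows where one of the other columns is missing, which is what the assignment asks for.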



Answer 2:


This problem would probably best be broken up into two steps -- computing the value for each file and collecting the results for all your files.

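corr.file <- function(filename, directory = "specdata", threshold = 0) {
  data <- read.csv(file.path(directory, filename))
  x <- complete.cases(data)
  # Too few complete cases: contribute nothing to the result
  if (sum(x) <= threshold) return(numeric())
  # Correlate only the fully observed rows so that NAs do not propagate
  cor(data[x, 2], data[x, 3])
}

a <- list.files("specdata")
# unlist() drops the empty results and flattens the list into one vector
correlations <- unlist(lapply(a, corr.file))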
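The loop in the question overwrites b on every iteration, so only the last file's value survives, and the threshold is checked once after the loop instead of per file. The two-step idea can be tried without the real data set, as in the sketch below. The names cor_one_file, cor_all_files and demo_dir are hypothetical, and the CSV contents are made up only to exercise the code:

```r
# Hypothetical helper: correlation for one file, or numeric(0) if the file
# does not have more complete rows than the threshold
cor_one_file <- function(path, threshold = 0) {
  data <- read.csv(path)
  ok <- complete.cases(data)
  if (sum(ok) <= threshold) return(numeric(0))
  cor(data$sulfate[ok], data$nitrate[ok])
}

# Hypothetical wrapper: apply the helper to every CSV file in a directory
cor_all_files <- function(directory, threshold = 0) {
  paths <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  result <- unlist(lapply(paths, cor_one_file, threshold = threshold))
  # unlist() returns NULL for an empty list, so normalise to numeric(0)
  if (is.null(result)) numeric(0) else result
}

# Made-up data in a temporary directory
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(sulfate = c(1, 2, 3, NA), nitrate = c(2, 4, 7, 1)),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(sulfate = c(NA, 2), nitrate = c(1, NA)),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

res_default <- cor_all_files(demo_dir)        # only 001.csv has complete rows
res_strict  <- cor_all_files(demo_dir, 1000)  # no file exceeds the threshold
```

Collecting per-file results with lapply and flattening with unlist means that files below the threshold simply vanish from the output, which gives a numeric vector of length 0 when nothing qualifies.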


Source: https://stackoverflow.com/questions/21240049/correlation-between-na-columns
