Question
I have to write a function that takes a directory of data files and a threshold for complete cases and calculates the correlation between sulfate and nitrate (two columns) from each file where the number of completely observed cases (on all variables) is greater than the threshold. The function should return a vector of correlations for the monitors that meet the threshold requirement. If no files meet the threshold requirement, then the function should return a numeric vector of length 0.
My code looks like this:
corr <- function(directory,threshold=0){
  a<-list.files("specdata")
  for (i in a) {
    data <- read.csv(paste(directory, "/", i, sep =""))
    x<-complete.cases(data)
    j<-sum(as.numeric(x))
    sulfate<-data[,2]
    nitrate<-data[,3]
    b<-cor(sulfate,nitrate)
  }
  if (j>threshold)
    return(b)
  else
    numeric()
}
There is no error message.
If I type
z<-corr("specdata")
head(z)
[1] NA
I don't know what the problem is. I don't know if NA values in the columns have to do with it. I think something is missing in my code. I think the read.csv creates a unique data frame when I need one data frame per file but I don't see why the return is NA in this case (when there's no threshold).
However, if I introduce a bigger threshold (1000):
z<-corr("specdata",1000)
head(z)
numeric(0)
The expected output I need is:
cr <- corr("specdata", 150)
head(cr)
[1] -0.01895754 -0.14051254 -0.04389737 -0.06815956 -0.12350667 -0.07588814
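Answer 1:
Here is a working solution you can refer to. It uses a helper, complete(), to count the complete cases in each file, then computes the correlation only for the files above the threshold, using complete rows only.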
corr <- function(directory, threshold = 0) {
  ## 'directory' is a character vector of length 1 indicating the location of
  ## the CSV files
  ## 'threshold' is a numeric vector of length 1 indicating the number of
  ## completely observed observations (on all variables) required to compute
  ## the correlation between nitrate and sulfate; the default is 0
  ## Return a numeric vector of correlations
  df = complete(directory)
  ids = df[df["nobs"] > threshold, ]$id
  corrr = numeric()
  for (i in ids) {
    newRead = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"),
                             ".csv", sep = ""))
    dff = newRead[complete.cases(newRead), ]
    corrr = c(corrr, cor(dff$sulfate, dff$nitrate))
  }
  return(corrr)
}
complete <- function(directory, id = 1:332) {
  f <- function(i) {
    data = read.csv(paste(directory, "/", formatC(i, width = 3, flag = "0"),
                          ".csv", sep = ""))
    sum(complete.cases(data))
  }
  nobs = sapply(id, f)
  return(data.frame(id, nobs))
}
cr <- corr("specdata", 150)
head(cr)
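For context on why the code in the question returned NA: by default cor() returns NA whenever either input contains an NA, and the question's code passes the raw columns to it. (The loop there also overwrites b and j on every iteration, so only the last file is reflected in the result, and list.files("specdata") ignores the directory argument.) The following is a minimal sketch on made-up toy vectors, not the real data, showing the NA behaviour and two ways around it:

```r
# Toy vectors, made up for illustration; NA marks a missing measurement.
d <- data.frame(sulfate = c(1, 2, NA, 4, 5),
                nitrate = c(2, 1, 4, NA, 6))

# Default use = "everything": any NA in the inputs makes the result NA.
r_raw <- cor(d$sulfate, d$nitrate)

# Option 1: keep only the fully observed rows before calling cor().
ok <- complete.cases(d)
r_cc <- cor(d$sulfate[ok], d$nitrate[ok])

# Option 2: let cor() drop incomplete pairs itself.
r_use <- cor(d$sulfate, d$nitrate, use = "complete.obs")

print(c(raw = r_raw, complete_cases = r_cc, complete_obs = r_use))
```

The two options agree here because the toy data frame has only these two columns. On a file with more columns, complete.cases() on the whole data frame also drops rows where any other column is missing, which is what "completely observed cases (on all variables)" asks for.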
Answer 2:
This problem is probably best broken into two steps: computing the value for each file, then collecting the results across all your files.
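corr.file <- function(filename, threshold = 0) {
  data <- read.csv(filename)
  ok <- complete.cases(data)
  # Skip files that do not have more than 'threshold' complete cases
  if (sum(ok) <= threshold) return(numeric())
  # Correlate sulfate (column 2) and nitrate (column 3) on complete rows only
  cor(data[ok, 2], data[ok, 3])
}
# full.names = TRUE returns paths that read.csv can open directly
a <- list.files("specdata", full.names = TRUE)
# lapply + unlist drops the empty results; sapply would return a list here
correlations <- unlist(lapply(a, corr.file, threshold = 150))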
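To try the two-step idea without the real data files, here is a self-contained sketch that writes two tiny CSV files to a temporary directory and runs the per-file and collect steps on them. The names file_corr, corr_sketch and demo_dir are hypothetical, and the toy files (including the Date and ID column names) are assumptions made up for illustration; only the sulfate and nitrate column names come from the code above.

```r
# Hypothetical helper: correlation for one file, or numeric(0) if the file
# does not have more than 'threshold' complete rows.
file_corr <- function(path, threshold = 0) {
  d <- read.csv(path)
  ok <- complete.cases(d)
  if (sum(ok) <= threshold) return(numeric(0))
  cor(d$sulfate[ok], d$nitrate[ok])
}

# Hypothetical wrapper: apply file_corr to every CSV in 'directory'.
corr_sketch <- function(directory, threshold = 0) {
  files <- list.files(directory, pattern = "\\.csv$", full.names = TRUE)
  res <- unlist(lapply(files, file_corr, threshold = threshold))
  # unlist() of an empty list is NULL, so fall back to numeric(0)
  if (is.null(res)) numeric(0) else res
}

# Toy data, made up for illustration only (not the real specdata files).
demo_dir <- file.path(tempdir(), "specdata_demo")
dir.create(demo_dir, showWarnings = FALSE)
write.csv(data.frame(Date = 1:4, sulfate = c(1, 2, NA, 4),
                     nitrate = c(2, 4, 5, 8), ID = 1),
          file.path(demo_dir, "001.csv"), row.names = FALSE)
write.csv(data.frame(Date = 1:3, sulfate = c(NA, 1, 2),
                     nitrate = c(3, NA, 1), ID = 2),
          file.path(demo_dir, "002.csv"), row.names = FALSE)

r_low <- corr_sketch(demo_dir, threshold = 2)      # only 001.csv qualifies
r_high <- corr_sketch(demo_dir, threshold = 1000)  # no file qualifies
print(r_low)
print(r_high)
```

Returning numeric(0) from the per-file step and flattening with unlist() means files below the threshold simply vanish from the result, so the "no files meet the threshold" case yields a numeric vector of length 0 without special handling.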
Source: https://stackoverflow.com/questions/21240049/correlation-between-na-columns