Question
I have been trying to use a custom function that I found on here to recalculate median household income from census tracts aggregated to neighborhoods. My data looks like this:
> inc_df[, 1:5]
              San Francisco Bayview Hunters Point Bernal Heights Castro/Upper Market Chinatown
2500-9999             22457                  1057            287                 329      1059
10000-14999           20708                   920            288                 463      1327
1500-19999            12701                   626            145                 148       867
20000-24999           12106                   491            285                 160       689
25000-29999           10129                   554            238                 328       167
30000-34999           10310                   338            257                 179       289
35000-39999            9028                   383            184                 163       326
40000-44999            9532                   472            334                 173       264
45000-49999            8406                   394            345                 241       193
50000-59999           17317                   727            367                 353       251
60000-74999           25947                  1037            674                 794       236
75000-99999           36378                  1185            980                 954       289
100000-124999         33890                   990            640                1208       199
125000-149999         24935                   522            666                 957       234
150000-199999         37190                   814           1310                1535       150
200000-250001         65763                   796           2122                3175       302
The function is as follows:
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  # If "sep" is specified, the function will try to create the
  # required "intervals" matrix. "trim" removes any unwanted
  # characters before attempting to convert the ranges to numeric.
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }
  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]       # lower class boundary of median class
  h <- diff(intervals[, Midrow])  # size of median class
  f <- frequencies[Midrow]        # frequency of median class
  cf2 <- cf[Midrow - 1]           # cumulative frequency class before median class
  n_2 <- max(cf)/2                # total observations divided by 2
  unname(L + (n_2 - cf2)/f * h)
}
And the code to apply the function looks like this:
GroupedMedian(inc_df[, "Bernal Heights"], rownames(inc_df), sep="-", trim="cut")
This all works fine but I can't figure out how to apply this to each column of the matrix instead of typing out each column name and running it again and again. I have tried this:
> minc_hood <- data.frame(apply(inc_df, 2, function(x) GroupedMedian(inc_df[, x],
rownames(inc_df), sep="-", trim="cut")))
But I get this error message
Error in inc_df[, x] : subscript out of bounds
Answer 1:
There are a couple of things at play here:
Advice: never use apply with a data.frame (unless you are absolutely certain you don't mind the overhead of converting to a matrix^1 and can accept the potential data loss^2).
Even if you're going to use apply, you're doing it a little "off": when you say apply(df, 2, func), it takes the first column of df and passes its values to func as the first argument, then does the same for each remaining column. So, for instance, apply(mtcars, 2, mean) will make calls like
mean(c(21, 21, 22.8, 21.4, 18.7, ...)) # mpg
mean(c(6, 6, 4, 6, 8, ...)) # cyl
mean(c(160, 160, 108, 258, 360, ...)) # disp
# ... etc
In that context, your use of apply(inc_df, 2, function(x) GroupedMedian(inc_df[, x], ...)) is wrong, since x is replaced by all values of the first column of inc_df (and then all values of the 2nd column, etc.), not by a column name or index.
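To make that concrete, here is a small sketch on a toy matrix. The matrix m, its column names, and its values are made up for illustration and are not the question's data. It shows that the function receives each column's values, and that reusing those values as a column index fails in the same way as in the question.

```r
# Toy matrix; names and values are hypothetical.
m <- matrix(c(10, 20, 30, 40, 50, 60), nrow = 3,
            dimnames = list(NULL, c("a", "b")))

# apply(m, 2, f) calls f once per column, passing that column's values as x.
seen <- apply(m, 2, function(x) paste(x, collapse = " "))
print(seen)

# Reusing those values as a column index asks for columns 10, 20 and 30,
# which do not exist, so the indexing fails. The error is caught here so
# the script keeps running.
res <- tryCatch(m[, c(10, 20, 30)], error = function(e) conditionMessage(e))
print(res)

# Using x directly gives the intended per-column result.
col_sums <- apply(m, 2, function(x) sum(x))
print(col_sums)
```

The fix for the original call is therefore to pass x itself to GroupedMedian rather than inc_df[, x].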
Since your function looks like it accepts a vector of values (plus some other arguments), I suggest you try something like
inc_df[] <- lapply(inc_df, GroupedMedian, rownames(inc_df), sep="-", trim="cut")
If you want to apply this function to a subset of those columns, then something like this works well:
ind <- c(1,3,7)
inc_df[ind] <- lapply(inc_df[ind], GroupedMedian, rownames(inc_df), sep="-", trim="cut")
The use of inc_df[] <- ... (when not doing a column-subset) ensures that we replace the values of the columns without losing the attribute that it is a data.frame. It is effectively the same as inc_df <- as.data.frame(...) with some other minor nuances.
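One thing to be aware of: GroupedMedian returns a single number per column, so assigning the lapply result back into inc_df[] recycles that number down every row of the column. If one median per neighborhood is what you are after, a named vector may be more convenient. The sketch below is only an illustration under stated assumptions: toy, hood_a and hood_b are made-up names and values, and GroupedMedian is copied from the question. The error text and title in the question also suggest inc_df may be a matrix rather than a data.frame (an inference, the post does not confirm it), so both cases are shown.

```r
# GroupedMedian as given in the question.
GroupedMedian <- function(frequencies, intervals, sep = NULL, trim = NULL) {
  if (!is.null(sep)) {
    if (is.null(trim)) pattern <- ""
    else if (trim == "cut") pattern <- "\\[|\\]|\\(|\\)"
    else pattern <- trim
    intervals <- sapply(strsplit(gsub(pattern, "", intervals), sep), as.numeric)
  }
  Midpoints <- rowMeans(intervals)
  cf <- cumsum(frequencies)
  Midrow <- findInterval(max(cf)/2, cf) + 1
  L <- intervals[1, Midrow]
  h <- diff(intervals[, Midrow])
  f <- frequencies[Midrow]
  cf2 <- cf[Midrow - 1]
  n_2 <- max(cf)/2
  unname(L + (n_2 - cf2)/f * h)
}

# Hypothetical toy data: 3 income brackets, 2 neighborhoods.
toy <- matrix(c(2, 6, 2,
                1, 1, 8),
              nrow = 3,
              dimnames = list(c("0-10", "10-20", "20-30"),
                              c("hood_a", "hood_b")))

# Matrix input: apply over columns, passing each column directly.
med_mat <- apply(toy, 2, GroupedMedian,
                 intervals = rownames(toy), sep = "-", trim = "cut")

# data.frame input: sapply visits the columns without a matrix conversion.
toy_df <- as.data.frame(toy)
med_df <- sapply(toy_df, GroupedMedian,
                 intervals = rownames(toy_df), sep = "-", trim = "cut")

# The inc_df[] <- lapply(...) form keeps the data.frame shape, but each
# column's single median is repeated in every row.
filled <- toy_df
filled[] <- lapply(toy_df, GroupedMedian, rownames(toy_df), sep = "-", trim = "cut")

print(med_mat)
print(med_df)
print(filled)
```

Either med_mat or med_df can be wrapped in data.frame() afterwards if a tabular result is needed.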
Notes:
^1: apply will always convert a data.frame to a matrix. This might be alright, but with larger data will take a non-zero amount of time. It also may have consequences, see next ...
^2: a matrix can have only one class, unlike a data.frame. That means that all columns will be up-converted to the highest common type, in the order of logical < integer < numeric < complex < character (columns of other types, such as factor or POSIXct, force the whole matrix to character). This means that if you have all numeric columns and one character column, then the function you are applying will see all character data. This can be mitigated by only selecting those columns with the types you expect, perhaps with:
isnum <- sapply(inc_df, is.numeric)
inc_df[isnum] <- apply(inc_df[isnum], 2, GroupedMedian, ...)
and in that case, the worst conversion you will get will be integer-to-numeric, likely an acceptable (and reversible) conversion.
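As a quick illustration of note ^2, the following sketch uses a made-up two-column data.frame (the names count and label are hypothetical) to compare what apply and sapply see for each column.

```r
# Hypothetical mixed-type data frame.
df <- data.frame(count = c(1L, 2L, 3L),
                 label = c("a", "b", "c"),
                 stringsAsFactors = FALSE)

# apply() converts the data.frame to a matrix first, so every column
# reaches the function as character.
via_apply <- apply(df, 2, class)
print(via_apply)

# sapply() visits the columns one by one, so each keeps its own type.
via_sapply <- sapply(df, class)
print(via_sapply)
```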
Source: https://stackoverflow.com/questions/50124473/how-to-apply-a-custom-function-over-each-column-of-a-matrix