Question
I'm trying to understand what's going on with my calculation of the Canberra distance. I wrote my own simple canberra.distance function, but its results are not consistent with those of the dist function. I added the option na.rm = T to my function so that the sum can still be computed when a denominator is zero. From ?dist I understand that it uses a similar approach: Terms with zero numerator and denominator are omitted from the sum and treated as if the values were missing.
canberra.distance <- function(a, b){
  sum( (abs(a - b)) / (abs(a) + abs(b)), na.rm = T )
}
a <- c(0, 1, 0, 0, 1)
b <- c(1, 0, 1, 0, 1)
canberra.distance(a, b)
> 3
# the result that I expected
dist(rbind(a, b), method = "canberra")
> 3.75
a <- c(0, 1, 0, 0)
b <- c(1, 0, 1, 0)
canberra.distance(a, b)
> 3
# the result that I expected
dist(rbind(a, b), method = "canberra")
> 4
a <- c(0, 1, 0)
b <- c(1, 0, 1)
canberra.distance(a, b)
> 3
dist(rbind(a, b), method = "canberra")
> 3
# now the results are the same
Pairs 0-0 and 1-1 seem to be problematic. In the first case (0-0) both the numerator and the denominator are equal to zero, so this pair should be omitted. In the second case (1-1) the numerator is 0 but the denominator is not, so the term is 0 and the sum should not change.
What am I missing here?
EDIT:
To be in line with R's definition, the function canberra.distance can be modified as follows:
canberra.distance <- function(a, b){
  sum( abs(a - b) / abs(a + b), na.rm = T )
}
However, the results are the same as before.
Answer 1:
This might shed some light on the difference. As far as I can see, this is the actual code being run to compute the distance:
static double R_canberra(double *x, int nr, int nc, int i1, int i2)
{
    double dev, dist, sum, diff;
    int count, j;
    count = 0;
    dist = 0;
    for(j = 0 ; j < nc ; j++) {
        if(both_non_NA(x[i1], x[i2])) {
            sum = fabs(x[i1] + x[i2]);
            diff = fabs(x[i1] - x[i2]);
            if (sum > DBL_MIN || diff > DBL_MIN) {
                dev = diff/sum;
                if(!ISNAN(dev) ||
                   (!R_FINITE(diff) && diff == sum &&
                    /* use Inf = lim x -> oo */ (int) (dev = 1.))) {
                    dist += dev;
                    count++;
                }
            }
        }
        i1 += nr;
        i2 += nr;
    }
    if(count == 0) return NA_REAL;
    if(count != nc) dist /= ((double)count/nc);
    return dist;
}
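One candidate is this condition
    if(!ISNAN(dev) ||
       (!R_FINITE(diff) && diff == sum &&
        /* use Inf = lim x -> oo */ (int) (dev = 1.)))
which handles a special case (infinite values) and may not be documented. For the finite inputs in the question, though, the difference comes from the rescaling near the end of the function, if(count != nc) dist /= ((double)count/nc);. When the 0-0 term is skipped, count is smaller than nc and the sum is scaled up by nc/count, which gives 3 / (4/5) = 3.75 for the first example and 3 / (3/4) = 4 for the second.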
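To see how the skip-and-rescale logic in R_canberra plays out on the vectors from the question, here is a minimal R sketch. The helper name canberra.rescaled is hypothetical (it is not part of R), and it assumes finite numeric vectors of equal length; it only mirrors the count and nc bookkeeping from the C code above, not the special handling of infinite values.

```r
# Hypothetical helper, not part of R: mirrors the skip-and-rescale
# bookkeeping of R_canberra for two finite numeric vectors.
canberra.rescaled <- function(a, b) {
  stopifnot(length(a) == length(b))
  terms <- abs(a - b) / abs(a + b)   # a 0/0 pair yields NaN
  keep  <- !is.na(terms)             # drop NaN (and NA) terms, like the C loop
  count <- sum(keep)                 # number of terms actually used
  nc    <- length(a)                 # total number of columns
  if (count == 0) return(NA_real_)   # nothing usable: same as NA_REAL in C
  # Equivalent to dist /= (count / nc) in the C code
  sum(terms[keep]) / (count / nc)
}

canberra.rescaled(c(0, 1, 0, 0, 1), c(1, 0, 1, 0, 1))  # 3.75
canberra.rescaled(c(0, 1, 0, 0),    c(1, 0, 1, 0))     # 4
canberra.rescaled(c(0, 1, 0),       c(1, 0, 1))        # 3
```

Under these assumptions the sketch gives the same three values that dist(rbind(a, b), method = "canberra") returns in the question. Only the first two examples contain a 0-0 pair, so only there does count differ from nc and the sum get scaled up.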
Source: https://stackoverflow.com/questions/38894675/canberra-distance-inconsistent-results