I was surprised to find out that `clara` from `library(cluster)` allows `NA`s, but the function documentation says nothing about how it handles these values.
Although it is not stated explicitly, I believe that `NA`s are handled in the manner described in the `?daisy` help page. The Details section has:

> In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.
Given that `clara()` presumably uses the same code internally, this is how I understand `NA`s in the data to be handled: they simply don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is used, for example, in the definition of Gower's generalised similarity coefficient.
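For reference, Gower's coefficient makes the "don't take part" idea explicit through an indicator weight. The notation here is my own shorthand for the usual textbook form, not something taken from the `cluster` documentation: $s_{ijk}$ is the similarity of observations $i$ and $j$ on variable $k$, and $\delta_{ijk}$ is 0 when variable $k$ is missing for either observation and 1 otherwise.

```latex
S_{ij} = \frac{\sum_{k=1}^{p} \delta_{ijk}\, s_{ijk}}{\sum_{k=1}^{p} \delta_{ijk}}
```

A variable with a missing value contributes to neither the numerator nor the denominator, so the coefficient is an average over only the variables observed in both rows.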
**Update:** The C source `clara.c` clearly indicates that this (the above) is how `NA`s are handled by `clara()` (lines 350-356 in `./src/clara.c`):
```c
if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
    /* in the following line (Fortran!), x[-2] ==> seg.fault
       {BDR to R-core, Sat, 3 Aug 2002} */
    if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
        continue /* next j */;
    }
}
```
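To see what that `continue` amounts to in isolation, here is a minimal, self-contained sketch of a Manhattan distance that skips any variable where either row holds the column's missing-value code. It is not the package's code: `dist_skip_na` and `na_code` are hypothetical names of mine, and the rescaling by `p / n_present` is an assumption about how one might compensate for skipped variables, not something the excerpt above shows.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the cluster package.
 * Manhattan distance between rows a and b (each of length p), skipping
 * every variable j where either value equals that column's missing-value
 * code na_code[j]. Returns -1.0 if no variable is observed in both rows. */
double dist_skip_na(const double *a, const double *b,
                    const double *na_code, size_t p)
{
    double sum = 0.0;
    size_t n_present = 0;

    for (size_t j = 0; j < p; j++) {
        if (a[j] == na_code[j] || b[j] == na_code[j])
            continue; /* missing in at least one row: skip variable j */
        double d = a[j] - b[j];
        sum += (d < 0.0) ? -d : d;
        n_present++;
    }

    if (n_present == 0)
        return -1.0; /* nothing to compare, distance is undefined */

    /* Assumption: scale up to compensate for the skipped variables. */
    return sum * ((double)p / (double)n_present);
}
```

With rows `{0, 0, NA}` and `{3, 4, 1}`, only the first two variables are compared: the raw sum is 7, and the assumed rescaling by 3/2 gives 10.5.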