clustering with NA values in R

一向 2021-02-13 20:07

I was surprised to find that clara() from library(cluster) accepts NAs, but the function documentation says nothing about how it handles these values.

3 Answers
  •  生来不讨喜
    2021-02-13 20:39

    Although it is not stated explicitly, I believe NAs are handled in the manner described on the ?daisy help page. The Details section says:

    In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.

    Assuming clara() uses the same code internally, this is how I understand NAs in the data to be handled: they simply take no part in the computation. This is a reasonably standard approach in such cases; it is used, for example, in the definition of Gower's generalised similarity coefficient.

    Update: The C source for clara() clearly indicates that this is how NAs are handled (lines 350-356 in ./src/clara.c):

        if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
            /* in the following line (Fortran!), x[-2] ==> seg.fault
               {BDR to R-core, Sat, 3 Aug 2002} */
            if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
                continue /* next j */;
            }
        }
    
