clustering with NA values in R

人盡茶涼 submitted on 2019-12-04 01:52:46

Although it is not stated explicitly, I believe that NAs are handled in the manner described in the ?daisy help page. The Details section says:

In the daisy algorithm, missing values in a row of x are not included in the dissimilarities involving that row.

Given that clara() uses the same code internally, this is how I understand NAs in the data to be handled: they simply don't take part in the computation. This is a reasonably standard way of proceeding in such cases and is used, for example, in the definition of Gower's generalised similarity coefficient.
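To make "they just don't take part in the computation" concrete, here is a minimal base R sketch. The helper name na_euclid is made up for illustration and is not part of the cluster package. It skips columns where either value is NA and rescales by the fraction of columns actually used, which is the rule that ?daisy and ?dist describe for the Euclidean case.

```r
# Pairwise-complete Euclidean distance between two numeric vectors.
# Columns where either value is NA are skipped, and the sum of squares is
# rescaled by p / n_used so that pairs with different amounts of missing
# data stay comparable. na_euclid is a hypothetical helper name.
na_euclid <- function(a, b) {
  ok <- !is.na(a) & !is.na(b)          # columns observed in both rows
  if (!any(ok)) return(NA_real_)       # nothing left to compare
  sqrt(sum((a[ok] - b[ok])^2) * length(a) / sum(ok))
}

x <- rbind(c(1, 2, 3),
           c(4, NA, 7))

d_manual <- na_euclid(x[1, ], x[2, ])
d_base   <- as.numeric(dist(x))        # base R's dist() documents the same rule
```

Only columns 1 and 3 contribute here, so the sum of squares 9 + 16 = 25 is scaled by 3/2 before the square root is taken. According to its help page, cluster::daisy(x) should apply the same rule for the Euclidean metric.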

Update: The C source in clara.c clearly indicates that this (the above) is how NAs are handled by clara() (lines 350-356 in ./src/clara.c):

    if (has_NA && jtmd[j] < 0) { /* x[,j] has some Missing (NA) */
        /* in the following line (Fortran!), x[-2] ==> seg.fault
           {BDR to R-core, Sat, 3 Aug 2002} */
        if (x[lj] == valmd[j] || x[kj] == valmd[j]) {
            continue /* next j */;
        }
    }
data-frame-gg

I am not sure that kmeans can handle missing data by ignoring the missing values in a row.

There are two steps in kmeans:

  1. Calculate the distance between each observation and the current cluster means.
  2. Update the cluster means based on the newly calculated distances.
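The two steps can be sketched in base R for complete data. This is a Lloyd-style iteration written for illustration only; the function name kmeans_step is hypothetical, and stats::kmeans() itself defaults to the Hartigan-Wong algorithm.

```r
# One Lloyd-style k-means iteration on complete data (illustrative only;
# kmeans_step is a hypothetical helper, not part of stats::kmeans()).
kmeans_step <- function(x, centers) {
  # Step 1: squared Euclidean distance from every row to every center
  d2 <- sapply(seq_len(nrow(centers)), function(k) {
    rowSums(sweep(x, 2, centers[k, ])^2)
  })
  cluster <- max.col(-d2, ties.method = "first")   # nearest center per row

  # Step 2: recompute each center as the mean of its assigned rows
  new_centers <- centers
  for (k in seq_len(nrow(centers))) {
    if (any(cluster == k)) {
      new_centers[k, ] <- colMeans(x[cluster == k, , drop = FALSE])
    }
  }
  list(cluster = cluster, centers = new_centers)
}

x <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
step <- kmeans_step(x, centers = rbind(c(1, 1), c(9, 9)))
```

With an NA in x, both rowSums() in step 1 and colMeans() in step 2 would return NA for the affected rows and columns, which is where the missing-data problem shows up.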

When we have missing data in our observations, Step 1 can be handled by adjusting the distance metric appropriately, as clara/pam/daisy in the cluster package do. But Step 2 can only be performed if we have some value for each column of an observation. Imputation might therefore be the next best option for kmeans to deal with missing data.
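As a rough sketch of the imputation route, the example below fills each NA with its column mean before calling kmeans(). The data and the helper impute_mean are invented for illustration, and column-mean imputation is only one of many possible choices, assumed here for simplicity.

```r
set.seed(1)
x <- rbind(c(1.0, 2.0), c(1.2, NA), c(0.8, 2.1),
           c(8.0, 9.0), c(NA, 9.2), c(8.1, 8.9))

# stats::kmeans() stops with an error when x contains NA
res_na <- try(kmeans(x, centers = 2), silent = TRUE)

# Column-mean imputation: replace each NA by the mean of its column.
# impute_mean is a hypothetical helper written for this example.
impute_mean <- function(m) {
  for (j in seq_len(ncol(m))) {
    m[is.na(m[, j]), j] <- mean(m[, j], na.rm = TRUE)
  }
  m
}

x_imp <- impute_mean(x)
fit   <- kmeans(x_imp, centers = 2, nstart = 5)
```

Mean imputation pulls the filled-in values towards the overall column mean, so it can blur cluster boundaries when many values are missing.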

Looking at the clara C code, I noticed that when there are missing values in the observations, the clara algorithm "reduces" the sum of squares in proportion to the number of missing values, which I think is wrong. Line 646 of clara.c is like " dsum *= (nobs / pp) ": it counts the number of non-missing values in each pair of observations (nobs), divides it by the number of variables (pp), and multiplies the sum of squares by this ratio. I think it should be done the other way round, i.e. " dsum *= (pp / nobs) ".
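A small numeric illustration of the two scalings, written in R rather than C. The names dsum, nobs and pp follow the line quoted above; everything else is made up for the example, and this is not the code from clara.c.

```r
# Names (dsum, nobs, pp) follow the line quoted above; this only
# illustrates the arithmetic and is not the code from clara.c.
a <- c(1, 2, NA, 4)
b <- c(3, NA, 5, 8)

ok   <- !is.na(a) & !is.na(b)
pp   <- length(a)                 # number of variables
nobs <- sum(ok)                   # variables observed in both rows
dsum <- sum((a[ok] - b[ok])^2)    # sum of squares over the observed pairs

shrunk   <- dsum * (nobs / pp)    # scaling as quoted from clara.c
inflated <- dsum * (pp / nobs)    # scaling proposed in this answer

# For comparison: base R's dist() scales the sum up, not down
dist_sq <- as.numeric(dist(rbind(a, b)))^2
```

Only two of the four variables are observed in both rows, so the quoted scaling halves the sum of squares while the proposed one doubles it. The squared distance from dist() matches the pp / nobs direction.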
