R unique columns or rows incomparables with NA

Question

Anyone know if the incomparables argument of unique() or duplicated() has ever been implemented beyond incomparables=FALSE?

Maybe I don't understand how it is supposed to work...

Anyway, I'm looking for a slick way to drop columns (or rows) that are identical to another column (or row) apart from extra NAs. I can brute-force it using cor(), for example, but for tens of thousands of columns this is intractable.

Here's an example; sorry if it's a little messy, but I think it illustrates the point. Make some matrix z:

z <- matrix(sample(c(1:3, NA), 100, replace=TRUE), 10, 10)
colnames(z) <- paste("c", 1:10, sep="")
rownames(z) <- paste("r",1:10, sep="")

Let's add a couple of duplicate columns with extra NAs and shuffle the columns (so that the duplicates aren't always at the end).

c3.1 <- z[, 3]
c3.1[sample(1:10, 3)] <- NA
c8.1 <- z[, 8]
c8.1[sample(1:10, 5)] <- NA

z <- cbind(z, c3.1, c8.1)
z <- z[, sample(1:ncol(z))]

I could sort the columns by the number of missing values, and then it would seem as though duplicated() or unique() should work, but neither of them ignores missing values:

missing <- apply(z, 2, function(x) {length(which(is.na(x)))})
z.sorted <- z[, order(missing)]

z.sorted[,!duplicated(z.sorted,MARGIN=2)]
unique(z.sorted,MARGIN=2)

I figured this is exactly what the incomparables argument was for, but it doesn't appear to be implemented yet:

z.sorted[,!duplicated(z.sorted,MARGIN=2,incomparables=NA)]
unique(z.sorted,MARGIN=2,incomparables=NA)

I know I will likely find a less elegant solution soon enough; I guess I'm really asking why this hasn't been implemented yet, or whether I'm just using it wrong. I seem to run into this quite often, yet I searched around for quite a while without finding an answer. Any thoughts?

Answer 1:

As you suspect, for the data.frame and matrix methods of unique, incomparables != FALSE is not yet implemented. It is implemented in the default method, which is used for vectors without dims. E.g.:

unique(c(1, 2, 2, 3, 3, 3, NA, NA, NA), incomparables=2)
## [1]  1  2  2  3 NA

unique(c(1, 2, 2, 3, 3, 3, NA, NA, NA), incomparables=NA)
## [1]  1  2  3 NA NA NA

Take a look at the source of unique.matrix versus unique.default (just type the function names into the console and hit Enter, or press F2 in RStudio to open the source in a new pane).
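To see the difference between the two methods without reading the source, here is a small sketch. The toy matrix m is made up for illustration, and the matrix call is wrapped in tryCatch because whether it signals an error or only a warning for incomparables=NA may depend on the R version; that part is an assumption on my side.

```r
# Toy matrix (hypothetical data): column b equals column a apart from one NA.
m <- cbind(a = c(1, 2, 3), b = c(1, NA, 3))

# Default method (plain vector): incomparables is honoured, so every NA is kept.
v <- unique(c(1, NA, NA), incomparables = NA)

# Matrix method: a non-FALSE incomparables is rejected rather than applied.
# Catch either condition type, since the exact behaviour is assumed here.
res <- tryCatch(unique(m, MARGIN = 2, incomparables = NA),
                error = function(e) e,
                warning = function(w) w)

print(v)
if (inherits(res, "condition")) print(conditionMessage(res))
```

If res turns out to be a condition object, the message it carries should tell you that the argument is not used yet.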

In your case, you could use outer to create a matrix indicating whether particular pairs of rows/columns are the same or not, disregarding NAs.

same <- outer(seq_len(ncol(z)), seq_len(ncol(z)), 
              Vectorize(function(x, y) all(z[, x]==z[, y], na.rm=TRUE)))

same

##        [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
##  [1,]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [2,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [3,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [4,] FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [5,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [6,] FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
##  [7,] FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
##  [8,] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
##  [9,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
## [10,] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [11,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [12,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
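Since the question mentions tens of thousands of columns, note that outer() with Vectorize() calls the comparison function once per pair of columns, which can be slow. One possible alternative, sketched below under the assumption that the matrix holds only a handful of distinct values, counts agreements with crossprod() instead. The function name same_ignoring_na and the example data are hypothetical, not part of the original post, and the result is still an ncol-by-ncol matrix, so memory may become the limit for very wide data.

```r
set.seed(1)  # hypothetical example data, not the z from the post
z <- matrix(sample(c(1:3, NA), 60, replace = TRUE), 6, 10)
z <- cbind(z, dup = replace(z[, 3], c(2, 5), NA))  # NA-padded copy of column 3

same_ignoring_na <- function(m) {
  present <- (!is.na(m)) * 1            # 1 where a value is observed
  both <- crossprod(present)            # rows where both columns are observed
  agree <- matrix(0, ncol(m), ncol(m))  # rows where both are observed and equal
  for (v in unique(m[!is.na(m)])) {
    hit <- (!is.na(m) & m == v) * 1     # indicator of value v (NA counts as 0)
    agree <- agree + crossprod(hit)
  }
  # Two columns are "the same" if they agree on every row they share
  unname(agree == both)
}

same_fast <- same_ignoring_na(z)

# Cross-check against the pairwise outer() approach
same_slow <- outer(seq_len(ncol(z)), seq_len(ncol(z)),
                   Vectorize(function(x, y) all(z[, x] == z[, y], na.rm = TRUE)))
print(all(same_fast == same_slow))
```

The loop runs once per distinct value rather than once per pair of columns, so it only pays off when the number of distinct values is small, as with the 1:3 values in the example.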

Then, if you want to keep only those columns that are the same as, e.g., the second column (which is column c8.1 for me - see bottom of this post for the full z matrix I used), you can do:

z[, same[2, ]] # or, equivalently, z[, same[, 2]]

##     c8.1 c8
## r1     2  2
## r2     1  1
## r3    NA  3
## r4    NA  1
## r5     3  3
## r6    NA  1
## r7     2  2
## r8    NA  1
## r9     3  3
## r10   NA  1

To reduce the matrix to the set of columns that are unique (ignoring NAs), keeping from each group of matching columns the one with the fewest NAs, you can then do:

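# Index same[, j] directly: apply(same, 2, which) may simplify to a matrix
z[, unique(sapply(seq_len(ncol(z)), function(j) {
  x <- which(same[, j])  # columns that match column j
  x[which.min(colSums(is.na(z))[x])] }))]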

##      c7 c8 c3 c1 c6 c10 c2 c9 c4
##  r1   2  2  1  2  1   1  1  2 NA
##  r2   3  1  3  1  3  NA  1  2  2
##  r3   2  3  2  3  1  NA  2  1 NA
##  r4   2  1  1  2  2   1  3 NA  2
##  r5  NA  3  2  1  3   2 NA NA  3
##  r6   2  1  2  2  1   1  2  1 NA
##  r7   2  2  2  2 NA   3  1  2  2
##  r8  NA  1  1  3  2  NA  1 NA  1
##  r9   1  3  3  2 NA   2  1 NA  2
## r10  NA  1  1 NA  1   1  1  2  3
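Since the question asks about rows as well as columns, the steps above could be wrapped up roughly as follows. This is only a sketch: the helper name unique_ignoring_na, its MARGIN argument, and the toy data are made up for illustration, and rows are handled simply by transposing.

```r
# Hypothetical helper: keep one representative per group of columns (MARGIN = 2)
# or rows (MARGIN = 1) that agree wherever both are non-NA.
unique_ignoring_na <- function(m, MARGIN = 2) {
  if (MARGIN == 1) return(t(unique_ignoring_na(t(m), MARGIN = 2)))
  idx <- seq_len(ncol(m))
  same <- outer(idx, idx,
                Vectorize(function(x, y) all(m[, x] == m[, y], na.rm = TRUE)))
  n_missing <- colSums(is.na(m))
  keep <- vapply(idx, function(j) {
    group <- which(same[, j])           # columns matching column j
    group[which.min(n_missing[group])]  # the one with the fewest NAs
  }, integer(1))
  m[, unique(keep), drop = FALSE]
}

# Toy data (made up): b is a with an extra NA, c is distinct.
toy <- cbind(a = c(1, 2, 3), b = c(1, NA, 3), c = c(3, 2, 1))
by_col <- unique_ignoring_na(toy)                 # columns a and c remain
by_row <- unique_ignoring_na(t(toy), MARGIN = 1)  # rows a and c remain
print(by_col)
print(by_row)
```

One caveat: "equal ignoring NAs" is not transitive. A column with many NAs can match two columns that differ from each other, so which representatives survive can depend on where the NAs fall.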

For reference, here is the z I was working with:

    c7 c8.1 c3 c1 c5 c10 c8 c6 c2 c3.1 c9 c4
r1   2    2  1  2  1   1  2  1  1    1  2 NA
r2   3    1  3  1  3  NA  1  3  1    3  2  2
r3   2   NA  2  3  1  NA  3  1  2    2  1 NA
r4   2   NA  1  2 NA   1  1  2  3   NA NA  2
r5  NA    3  2  1  3   2  3  3 NA    2 NA  3
r6   2   NA  2  2  1   1  1  1  2    2  1 NA
r7   2    2  2  2  1   3  2 NA  1    2  2  2
r8  NA   NA  1  3 NA  NA  1  2  1   NA NA  1
r9   1    3  3  2  1   2  3 NA  1   NA NA  2
r10 NA   NA  1 NA NA   1  1  1  1    1  2  3

Source: https://stackoverflow.com/questions/34625319/r-unique-columns-or-rows-incomparables-with-na

Tags

unique