R merged loop performance

Submitted by 北慕城南 on 2019-12-10 11:57:46

Question


I have 2000 rows of data with 4000 columns. What I'm trying to do is compare each row to the rest of the rows and see how similar they are, in terms of differing columns out of the total columns.

What I did so far is as follows:

for (i in 1:nrow(data))
{
    for (j in i+1:nrow(data))
    {
        mycount[[i]] = length(which(data[i,] != data[j,]))
    }
}

There are two problems with it: j doesn't start from i+1 (which is probably a basic mistake). The main problem, however, is the time it consumes: it takes ages...

Could someone please suggest a better way to achieve the same result, the result being the percentage of each row's similarity to the other rows?

Here's an example of data and what I want to achieve:

The output should be something like:

mycount[1,2] = 2 (S# and var3 columns are different)
mycount[1,3] = 2 (S# and var1 columns are different)
mycount[1,4] = 2 (S# and var4 columns are different)
mycount[2,3] = ...
mycount[2,4] = ...
mycount[3,4] = 3 (S#, var1 and var4 columns are different)

Answer 1:


One problem in your code is that the value of mycount[[i]] is overwritten in each iteration of the j loop, so you end up with mycount[[i]] equal to length(which(data[i,] != data[nrow(data),])). Another issue is that i+1:nrow(data) does not produce the numbers i+1, i+2, ..., nrow(data) but rather i + (1:nrow(data)), because the : operator binds more tightly than +. What you want is either (i + 1):nrow(data) or seq(i + 1, nrow(data)).
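To see the precedence issue concretely, here is a minimal sketch using small stand-in values for i and nrow(data):

```r
# `:` binds more tightly than `+`, so the unparenthesized form
# runs past the intended upper bound.
i <- 2
n <- 5
i + 1:n        # i + (1:n)  ->  3 4 5 6 7  (goes past n!)
(i + 1):n      # ->  3 4 5
seq(i + 1, n)  # ->  3 4 5
```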

You can try the following code, which will be faster than the double loop (though probably still too slow):

# Split the data into a list of rows, then count the differing
# columns for every pair of rows at once.
rows <- lapply(seq(nrow(data)), function(i) data[i, ])
outer(X = rows, Y = rows, FUN = Vectorize(function(x, y) sum(x != y)))
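As a usage sketch (the toy data frame below is made up, since the question's example table did not survive; divide the counts by ncol(data) to get the fraction of differing columns):

```r
# Hypothetical stand-in for the question's data
data <- data.frame(S  = c(1, 2, 3, 4),
                   v1 = c("a", "a", "b", "a"),
                   v2 = c("x", "x", "x", "x"),
                   stringsAsFactors = FALSE)

rows  <- lapply(seq(nrow(data)), function(i) data[i, ])
diffs <- outer(X = rows, Y = rows, FUN = Vectorize(function(x, y) sum(x != y)))

# Fraction of columns that differ for each pair of rows
diffs / ncol(data)
```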


Source: https://stackoverflow.com/questions/40738758/r-merged-loop-performance
