R merged loop performance

Submitted by 北慕城南 on 2019-12-10 11:57:46

Question


I have 2000 rows of data with 4000 columns. What I'm trying to do is compare each row to the rest of the rows and see how similar they are, in terms of differing columns out of the total columns.

What I did so far is as follows:

for (i in 1:nrow(data))
{
    for (j in i+1:nrow(data))
    {
        mycount[[i]] = length(which(data[i,] != data[j,]))
    }
}

There are two problems with it: j doesn't start from i+1 (which is probably a basic mistake). The main problem, however, is the time it consumes: it takes ages...

Could someone please suggest a better way to achieve the same result, the result being the percentage of each row's similarity to the other rows?

Here's an example of data and what I want to achieve:

The output should be something like:

mycount[1,2] = 2 (S# and var3 columns are different)
mycount[1,3] = 2 (S# and var1 columns are different)
mycount[1,4] = 2 (S# and var4 columns are different)
mycount[2,3] = ...
mycount[2,4] = ...
mycount[3,4] = 3 (S#, var1 and var4 columns are different)

Answer 1:


One problem in your code is that the value of mycount[[i]] is overwritten in each iteration of the j loop, so you end up with mycount[[i]] equal to length(which(data[i,] != data[nrow(data),])). Another issue is that i+1:nrow(data) does not produce the numbers i+1, i+2, ..., nrow(data) but rather i + (1:nrow(data)), because the : operator binds more tightly than +. What you want is either (i + 1):nrow(data) or seq(i + 1, nrow(data)).
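To see the precedence issue concretely, here is a minimal sketch using small stand-in values for i and nrow(data):

```r
# `:` binds more tightly than `+`, so the unparenthesized form
# runs past the intended upper bound.
i <- 2
n <- 5
i + 1:n        # i + (1:n)  ->  3 4 5 6 7  (goes past n!)
(i + 1):n      # ->  3 4 5
seq(i + 1, n)  # ->  3 4 5
```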

You can try the following code, which will be faster than the double loop (though probably still too slow):

# Split the data into a list of rows, then count the differing
# columns for every pair of rows at once.
rows <- lapply(seq(nrow(data)), function(i) data[i, ])
outer(X = rows, Y = rows, FUN = Vectorize(function(x, y) sum(x != y)))
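As a usage sketch (the toy data frame below is made up, since the question's example table did not survive; divide the counts by ncol(data) to get the fraction of differing columns):

```r
# Hypothetical stand-in for the question's data
data <- data.frame(S  = c(1, 2, 3, 4),
                   v1 = c("a", "a", "b", "a"),
                   v2 = c("x", "x", "x", "x"),
                   stringsAsFactors = FALSE)

rows  <- lapply(seq(nrow(data)), function(i) data[i, ])
diffs <- outer(X = rows, Y = rows, FUN = Vectorize(function(x, y) sum(x != y)))

# Fraction of columns that differ for each pair of rows
diffs / ncol(data)
```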


Source: https://stackoverflow.com/questions/40738758/r-merged-loop-performance
