Getting rows of a matrix which coincide with a series of vectors, without using apply

Submitted by 让人想犯罪 __ on 2019-12-10 21:07:41

Question


My question is sort of related to my earlier question.

Suppose I have one matrix and 4 vectors (these can be treated as another matrix, since the order of the vectors matters), and I want to get the row numbers of the matrix that coincide with each vector, in order. I would like the solution to avoid repeating vectors and to be as efficient as possible, since the problem is large scale.

Example.

set.seed(1)

M = matrix(rpois(50,5),5,10)
v1 = c(3, 2, 7, 7, 4, 4, 7, 4, 5, 6)
v2 = c(8, 6, 4, 4, 3, 8, 3, 6, 5, 6)
v3 = c(4, 8, 3, 5, 9, 4, 5, 6, 7, 7)
v4 = c(4, 9, 3, 6, 3, 1, 5, 7, 6, 1)

Vmat = cbind(v1,v2,v3,v4)

M
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]    4    8    3    5    9    4    5    6    7     7
[2,]    4    9    3    6    3    1    5    7    6     1
[3,]    5    6    6   11    6    4    5    2    7     5
[4,]    8    6    4    4    3    8    3    6    5     6
[5,]    3    2    7    7    4    4    7    4    5     6

Vmat
      v1 v2 v3 v4
 [1,]  3  8  4  4
 [2,]  2  6  8  9
 [3,]  7  4  3  3
 [4,]  7  4  5  6
 [5,]  4  3  9  3
 [6,]  4  8  4  1
 [7,]  7  3  5  5
 [8,]  4  6  6  7
 [9,]  5  5  7  6
[10,]  6  6  7  1

The output should be...

5 4 1 2

Answer 1:


Similar to @user295691's answer, we join the two tables, but now with data.table and its which=TRUE argument, which returns the matching row numbers of M instead of the joined rows:

set.seed(1)
matdata  <- create_data(1e6,20,1e5) # using @user295691's example data

library(data.table)
M = as.data.table(matdata$M)
V = as.data.table(matdata$V)

r <- M[V, on=names(V), which=TRUE]

To verify that it is correct...

V[1,]
#    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1:  7  5  3  2  5  6  3  3  5   5   3   2   4   9   4   4   3   6   4   3
M[r[1],]
#    V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20
# 1:  7  5  3  2  5  6  3  3  5   5   3   2   4   9   4   4   3   6   4   3
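For the small example from the question, the same join might look like the sketch below. It assumes the data.table package is installed; M is typed in from the matrix printed in the question rather than regenerated, and Mdt and Vdt are illustrative names that do not appear in the original answer.

```r
library(data.table)

# M as printed in the question, entered by hand row by row
M <- matrix(c(4, 8, 3,  5, 9, 4, 5, 6, 7, 7,
              4, 9, 3,  6, 3, 1, 5, 7, 6, 1,
              5, 6, 6, 11, 6, 4, 5, 2, 7, 5,
              8, 6, 4,  4, 3, 8, 3, 6, 5, 6,
              3, 2, 7,  7, 4, 4, 7, 4, 5, 6),
            nrow = 5, byrow = TRUE)
Vmat <- cbind(v1 = c(3, 2, 7, 7, 4, 4, 7, 4, 5, 6),
              v2 = c(8, 6, 4, 4, 3, 8, 3, 6, 5, 6),
              v3 = c(4, 8, 3, 5, 9, 4, 5, 6, 7, 7),
              v4 = c(4, 9, 3, 6, 3, 1, 5, 7, 6, 1))

Mdt <- as.data.table(M)        # haystack, columns V1..V10
Vdt <- as.data.table(t(Vmat))  # one vector per row, columns V1..V10

# For each row of Vdt, return the number of the matching row of Mdt
r <- Mdt[Vdt, on = names(Vdt), which = TRUE]
print(r)
# [1] 5 4 1 2
```

With the default nomatch=NA, a vector that matches no row of M yields NA. If M can contain duplicate rows, a vector may match several of them; adding mult="first" keeps one row number per vector.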

Benchmarks

OP's example data (in a deleted answer):

set.seed(1)

NM    = 1e6
NV    = 1e5
Ncols = 20
MM = matrix(rpois(NM*Ncols,Ncols),NM,Ncols)

rows=sample(NM,NV,replace = FALSE)

Vmat=t(MM[rows,])

# converted to data.frames, because why not?
M = as.data.frame(MM)
V = as.data.frame(t(Vmat))

# converted to data.tables
M2 = setDT(copy(M))
V2 = setDT(copy(V))

Functions to test:

match_strings <- function(){
  m = do.call(function(...) paste(...,sep="_"), M)
  v = do.call(function(...) paste(...,sep="_"), V)
  match(v,m)
}

merge_df <- function(){ # from @user295691's answer
  M$mid = seq(nrow(M))
  V$vid = seq(nrow(V))
  with(merge(M,V), mid[order(vid)])
}

merge_dt <- function(){
  M2[V2, on=names(V2), which=TRUE]
}

Results:

system.time({r_strings = match_strings()})
#    user  system elapsed 
#   10.40    0.06   10.49     
system.time({r_merge_df = merge_df()})
#    user  system elapsed 
#   14.71    0.10   14.84
system.time({r_merge_dt = merge_dt()})
#    user  system elapsed 
#    0.39    0.00    0.40 

identical(r_strings,r_merge_df) # TRUE
identical(r_strings,r_merge_dt) # TRUE



Answer 2:


I think collapsing each vector to a single value is the way to go, following @bunk:

m = do.call(function(...) paste(...,sep="_"), split(M, col(M)))
v = sapply(list(v1,v2,v3,v4), paste0, collapse="_")
match(v,m)
# [1] 5 4 1 2
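The same idea can be wrapped into a small helper so that the matrix and the vectors go through identical key construction. This is only a sketch: row_keys and match_rows are hypothetical names, and M is typed in from the printed matrix in the question.

```r
# M as printed in the question, entered by hand row by row
M <- matrix(c(4, 8, 3,  5, 9, 4, 5, 6, 7, 7,
              4, 9, 3,  6, 3, 1, 5, 7, 6, 1,
              5, 6, 6, 11, 6, 4, 5, 2, 7, 5,
              8, 6, 4,  4, 3, 8, 3, 6, 5, 6,
              3, 2, 7,  7, 4, 4, 7, 4, 5, 6),
            nrow = 5, byrow = TRUE)
Vmat <- cbind(v1 = c(3, 2, 7, 7, 4, 4, 7, 4, 5, 6),
              v2 = c(8, 6, 4, 4, 3, 8, 3, 6, 5, 6),
              v3 = c(4, 8, 3, 5, 9, 4, 5, 6, 7, 7),
              v4 = c(4, 9, 3, 6, 3, 1, 5, 7, 6, 1))

# Collapse every row of X into one string; paste() is vectorised over the
# columns, so no apply() over rows is needed
row_keys <- function(X) do.call(paste, c(as.data.frame(X), sep = "_"))

# Vmat stores one vector per column, so transpose it before building keys
match_rows <- function(M, Vmat) match(row_keys(t(Vmat)), row_keys(M))

print(match_rows(M, Vmat))
# [1] 5 4 1 2
```

match returns the first matching row when M has duplicate rows, and NA for a vector that matches nothing.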

The more natural way of building m would use apply, but that's verboten. If you convert M to a data.frame, another option is:

m = do.call(function(...) paste(...,sep="_"), as.data.frame(M))



Answer 3:


If we switch these to data.frames, then we can use merge to do the trick. Also, we rotate Vmat for easy matching.

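haystack <- as.data.frame(M)
haystack$haystack_id <- seq_len(nrow(haystack))  # integer row numbers of M
needle <- as.data.frame(t(Vmat))                 # one needle per row
needle$needle_id <- seq_len(nrow(needle))        # integer ids, so order() sorts numerically

lookups <- merge(needle, haystack)               # joins on the shared columns V1..V10
lookups <- lookups[order(lookups$needle_id), ]
lookups$haystack_id                              # matching row numbers, in needle order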
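As a sketch of how the merge approach could be packaged for reuse, the function below uses integer ids (character ids such as row names sort lexicographically, which stops matching the input order once there are ten or more needles) and all.x = TRUE so that unmatched vectors come back as NA instead of being dropped. The name match_rows_merge is hypothetical, and M is typed in from the printed matrix in the question.

```r
# M as printed in the question, entered by hand row by row
M <- matrix(c(4, 8, 3,  5, 9, 4, 5, 6, 7, 7,
              4, 9, 3,  6, 3, 1, 5, 7, 6, 1,
              5, 6, 6, 11, 6, 4, 5, 2, 7, 5,
              8, 6, 4,  4, 3, 8, 3, 6, 5, 6,
              3, 2, 7,  7, 4, 4, 7, 4, 5, 6),
            nrow = 5, byrow = TRUE)
Vmat <- cbind(v1 = c(3, 2, 7, 7, 4, 4, 7, 4, 5, 6),
              v2 = c(8, 6, 4, 4, 3, 8, 3, 6, 5, 6),
              v3 = c(4, 8, 3, 5, 9, 4, 5, 6, 7, 7),
              v4 = c(4, 9, 3, 6, 3, 1, 5, 7, 6, 1))

match_rows_merge <- function(M, Vmat) {
  haystack <- as.data.frame(M)        # columns V1, V2, ...
  needle   <- as.data.frame(t(Vmat))  # one vector per row, same column names
  key_cols <- names(needle)           # join on every value column

  haystack$haystack_id <- seq_len(nrow(haystack))
  needle$needle_id     <- seq_len(nrow(needle))

  # all.x = TRUE keeps needles with no matching row (haystack_id is NA)
  lookups <- merge(needle, haystack, by = key_cols, all.x = TRUE)
  lookups$haystack_id[order(lookups$needle_id)]
}

print(match_rows_merge(M, Vmat))
# [1] 5 4 1 2
```

If M contains duplicate rows, merge returns one result row per duplicate, so the output can be longer than the number of vectors.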

Compared with the string/match solution above, this appears to be faster by a reasonable margin:

create_data <- function(haystack.rows, cols, needle.rows) {
   M <- matrix(rpois(haystack.rows * cols, 5), haystack.rows, cols)
   V <- M[sample(1:haystack.rows, needle.rows, replace=TRUE),]
   list(M=M, V=V)
}

> set.seed(1); data <- create_data(1000000, 20, 10000);
> system.time({haystack <- as.data.frame(data$M); haystack$hid <- seq_along(haystack$V1); needle <- as.data.frame(data$V); needle$nid <- seq_along(needle$V1); ret <- merge(needle, haystack); ret <- ret[order(ret$nid),]})
   user  system elapsed
  5.900   0.000   5.906

> system.time({mstr <- apply(data$M, 1, paste0, collapse="_"); vstr <- apply(data$V, 1, paste0, collapse="_"); matchstr <- match(vstr, mstr)})
   user  system elapsed
  8.372   0.000   8.377

The match step itself is much faster on strings than merge is, but you first have to pay the cost of converting every row to a string, whereas converting to a data frame is very cheap, since it uses the same underlying data.
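One way to see where the time goes is to time the two stages of the string approach separately. The sketch below does this on a much smaller data set than the benchmarks in this thread; row_keys is a hypothetical helper, the sizes are arbitrary, and the timings will differ from machine to machine, so none are shown.

```r
set.seed(1)

# Smaller, made-up sizes so the sketch runs quickly
M <- matrix(rpois(2e4 * 20, 5), 2e4, 20)
V <- M[sample(nrow(M), 2e3), ]

# Collapse every row into one string key without apply()
row_keys <- function(X) do.call(paste, c(as.data.frame(X), sep = "_"))

# Stage 1: convert rows to strings (the transformation cost)
t_keys <- system.time({
  mkeys <- row_keys(M)
  vkeys <- row_keys(V)
})

# Stage 2: the lookup itself
t_match <- system.time(idx <- match(vkeys, mkeys))

# user, system and elapsed time for each stage
print(rbind(build_keys = t_keys, match_only = t_match)[, 1:3])
```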

EDIT: Added a sort step to the merge version to get the rows in order. Also fixed a typo in the timed version of the merge solution. Times remained in the same order of magnitude.

EDIT2: Thanks to @Frank, I found a bug in the timing of the match version; fixing it sped things up substantially (I had been using a local example called asdf, which was even larger). It is still not as fast as the merge solution, though.



Source: https://stackoverflow.com/questions/32654938/getting-rows-of-a-matrix-which-coincide-with-a-series-of-vectors-without-using
