How to get the largest possible column sequence with the least possible row NAs from a huge matrix?

Posted by 自闭症网瘾萝莉.ら on 2019-12-06 04:12:56

This takes less than one second on the huge data set:

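# Every contiguous range of columns from column x[1] to column x[2] (column 1 is skipped)
l1 = combn(2:length(d), 2, function(x) d[x[1]:x[2]], simplify = FALSE)
# If you also need "combinations" of only single columns, then uncomment the next line
# l1 = c(d[-1], l1)
# Number of rows without any NA in each range
l2 = sapply(l1, function(x) sum(complete.cases(x)))

# Score each range as (number of columns) * (number of complete rows)
score = sapply(seq_along(l1), function(i) NCOL(l1[[i]]) * l2[i])
best_score = which.max(score)
best = l1[[best_score]]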

The question does not say how to rank the various combinations, so different scoring formulae can be used to express different preferences. For example, to weight the number of columns and the number of complete rows separately:

col_weight = 2
row_weight = 1
score = sapply(seq_along(l1), function(i) col_weight * NCOL(l1[[i]]) + row_weight * l2[i])
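To make the effect of the scoring formula concrete, here is a minimal sketch on a small made-up data frame. The data frame `d`, its column names, and the weights 10 and 1 are assumptions chosen for illustration; only the `combn`/`complete.cases` pattern comes from the answer above.

```r
# Hypothetical toy data: an id column plus four value columns with NAs
d <- data.frame(
  id = 1:6,
  a  = c(1, 2, 3, 4, 5, 6),
  b  = c(1, NA, 3, 4, 5, 6),
  c  = c(1, 2, 3, NA, NA, NA),
  e  = c(NA, 2, 3, NA, 5, 6)
)

# All contiguous column ranges over columns 2..ncol(d), at least two columns wide
l1 <- combn(2:length(d), 2, function(x) d[x[1]:x[2]], simplify = FALSE)
# Rows without any NA in each range, and the width of each range
l2 <- sapply(l1, function(x) sum(complete.cases(x)))
n_cols <- sapply(l1, NCOL)

# Product score: size of the complete block (columns * complete rows)
score_area <- n_cols * l2
# Weighted-sum score that strongly prefers wide ranges (weights are arbitrary)
score_cols <- 10 * n_cols + 1 * l2

best_area <- names(l1[[which.max(score_area)]])
best_cols <- names(l1[[which.max(score_cols)]])
print(best_area)
print(best_cols)
```

On this toy input the product score picks the two columns `a` and `b` (five complete rows), while the column-heavy weighted sum picks all four columns even though only one row is complete. Which behaviour is preferable depends on what the ranges are used for afterwards.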

Convert to matrix and calculate NA counts for each column:

dm <- is.na(d[, -1])
na_counts <- colSums(dm)
x <- data.frame(na_counts = na_counts, non_na_count = nrow(dm) - na_counts)
x <- as.matrix(x)

# create all combinations of column indexes:
nx <- 1:nrow(x)
psr <- do.call(c, lapply(seq_along(nx), combn, x = nx, simplify = FALSE))
# test if contiguous:
good <- sapply(psr, function(y) !any(diff(sort.int(y)) != 1L))
psr <- psr[good] # remove non-contiguous combinations
# for each combination, count NA and non-NA values:
s <- sapply(psr, function(y) colSums(x[y, , drop = FALSE]))

# put all together in table:
res <- data.frame(var_count = lengths(psr), t(s))
res$var_indexes <- sapply(psr, paste, collapse = ',')
res
#    var_count na_counts non_na_count var_indexes
# 1          1         1           10           1
# 2          1         3            8           2
# 3          1         5            6           3
# 4          1         7            4           4
# 5          1         9            2           5
# 6          2         4           18         1,2
# 7          2         8           14         2,3
# 8          2        12           10         3,4
# 9          2        16            6         4,5
# 10         3         9           24       1,2,3
# 11         3        15           18       2,3,4
# 12         3        21           12       3,4,5
# 13         4        16           28     1,2,3,4
# 14         4        24           20     2,3,4,5
# 15         5        25           30   1,2,3,4,5

# choose
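The answer leaves the final choice open. As one possible sketch, the table can be filtered by an NA share and then sorted by width. The threshold `max_na_share` and the tie-breaking rule are assumptions, not part of the original answer; the `res` values are copied from the printed table above so the block runs on its own.

```r
# Summary rows copied from the printed table above
res <- data.frame(
  var_count    = c(1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 5),
  na_counts    = c(1, 3, 5, 7, 9, 4, 8, 12, 16, 9, 15, 21, 16, 24, 25),
  non_na_count = c(10, 8, 6, 4, 2, 18, 14, 10, 6, 24, 18, 12, 28, 20, 30),
  var_indexes  = c("1", "2", "3", "4", "5", "1,2", "2,3", "3,4", "4,5",
                   "1,2,3", "2,3,4", "3,4,5", "1,2,3,4", "2,3,4,5", "1,2,3,4,5"),
  stringsAsFactors = FALSE
)

# Hypothetical rule: accept at most 30% NA cells, then take the widest range
max_na_share <- 0.3
res$na_share <- res$na_counts / (res$na_counts + res$non_na_count)
ok <- res[res$na_share <= max_na_share, ]
# Widest range first; ties broken by the lower NA share
ok <- ok[order(-ok$var_count, ok$na_share), ]
best <- ok[1, ]
print(best)
```

With this rule the selected range is columns 1 to 3. Note that `na_counts` here sums the NA cells per column, which is not the same as the number of rows containing at least one NA.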

Because combn returns each set of indexes already sorted, the sort.int call can be dropped for speed:

good <- sapply(psr, function(y) !any(diff(y) != 1L))
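Building every subset of n columns creates 2^n - 1 combinations before the filter runs, which becomes impractical for wide data. A possible alternative, sketched below under the assumption that only contiguous ranges are wanted, enumerates the n(n+1)/2 ranges directly and uses prefix sums for the NA totals. The names `ranges`, `cs`, and `n_rows` are made up for this sketch, and the counts are the per-column values from the table above.

```r
# Per-column NA counts taken from the table above; n_rows is the number of data rows
na_counts <- c(1, 3, 5, 7, 9)
n_rows <- 11
n <- length(na_counts)

# Enumerate every contiguous range [start, end] directly
ranges <- expand.grid(start = seq_len(n), end = seq_len(n))
ranges <- ranges[ranges$start <= ranges$end, ]

# Prefix sums: the NA total of columns start..end is cs[end + 1] - cs[start]
cs <- c(0, cumsum(na_counts))
ranges$var_count    <- ranges$end - ranges$start + 1
ranges$na_counts    <- cs[ranges$end + 1] - cs[ranges$start]
ranges$non_na_count <- ranges$var_count * n_rows - ranges$na_counts

# Same ordering as the table above: by width, then by starting column
ranges <- ranges[order(ranges$var_count, ranges$start), ]
print(ranges)
```

This reproduces the 15 rows of the table without ever materialising the non-contiguous combinations.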

Just to clarify, the seqsubm function from TraMineR has no problem at all with NAs, nor with sequences of different length. However, the function expects a state sequence object (to be created with seqdef) as input.

The function seqsubm computes substitution costs (i.e. dissimilarities) between states by means of different methods. You probably refer to the method ('TRATE') that derives the costs from the observed transition probabilities, namely as 2 - p(i|j) - p(j|i), where p(i|j) is the probability of being in state i at time t given that we were in state j at time t-1. So all we need are the transition probabilities, which can easily be estimated from a set of sequences of different lengths or with gaps within them.
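To illustrate why gaps are not an obstacle, the formula can be worked through by hand in base R. This is only a sketch of the arithmetic on made-up sequences, not TraMineR's implementation; the matrix `seqs` and the choice to skip any transition that touches an NA are assumptions.

```r
# Hypothetical toy sequences (rows = cases, columns = time points), with gaps
seqs <- rbind(
  c("A", "A", "B", "B", NA),
  c("A", "B", "B", "A", "A"),
  c(NA,  "B", "B", "B", NA)
)
states <- c("A", "B")

# Pair each position t-1 with position t; keep pairs where both are observed
from <- as.vector(seqs[, -ncol(seqs)])
to   <- as.vector(seqs[, -1])
keep <- !is.na(from) & !is.na(to)

# Transition counts, then row-wise rates: tr[j, i] estimates p(i | j)
counts <- unclass(table(factor(from[keep], levels = states),
                        factor(to[keep],   levels = states)))
tr <- prop.table(counts, margin = 1)

# Substitution cost between i and j: 2 - p(i|j) - p(j|i), with a zero diagonal
cost <- 2 - tr - t(tr)
diag(cost) <- 0
print(round(cost, 3))
```

Here p(B|A) = 0.5 and p(A|B) = 0.2, so the cost of substituting A and B is 2 - 0.5 - 0.2 = 1.3. A state with no observed outgoing transition would give NaN rates in this sketch and would need separate handling.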

I illustrate below using the ex1 data that ships with TraMineR. (Due to the high number of different states in your toy example, the resulting matrix of substitution costs would be too large (28 x 28) for this illustration.)

library(TraMineR)
data(ex1)
sum(is.na(ex1))

# [1] 38

sq <- seqdef(ex1[1:13])
sq

#    Sequence                 
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B      
# s3 *-D-D-D-D-D-D-D-D-D-D    
# s4 A-A-*-*-B-B-B-B-D-D      
# s5 A-*-A-A-A-A-*-A-A-A      
# s6 *-*-*-C-C-C-C-C-C-C      
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#     A->   B-> C->   D->
# A->   0 2.000   2 2.000
# B->   2 0.000   2 1.823
# C->   2 2.000   0 2.000
# D->   2 1.823   2 0.000

Now, it is not clear to me what you want to do with the state dissimilarities. If you input them into a clustering algorithm, you would cluster the states. If you want to cluster the sequences, then you should first compute dissimilarities between sequences (using seqdist and possibly passing the matrix of substitution costs returned by seqsubm as the sm argument) and then input the resulting distance matrix into the clustering algorithm.
