How to compute dissimilarities between sequences when sequences contain gaps?

拈花ヽ惹草 提交于 2019-12-24 11:24:03

问题


I want to cluster sequences with optimal matching with TraMineR::seqdist() from data that contains missings, i.e. sequences containing gaps.

library(TraMineR)
data(ex1)
sum(is.na(ex1))

# [1] 38

sq <- seqdef(ex1[1:13])
sq

#    Sequence                 
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B      
# s3 *-D-D-D-D-D-D-D-D-D-D    
# s4 A-A-*-*-B-B-B-B-D-D      
# s5 A-*-A-A-A-A-*-A-A-A      
# s6 *-*-*-C-C-C-C-C-C-C      
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#      A-> B->   C-> D->
# A->   0 2.000   2 2.000
# B->   2 0.000   2 1.823
# C->   2 2.000   0 2.000
# D->   2 1.823   2 0.000

When I run seqdist()

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)

I'm receiving

Error: 'with.missing' must be TRUE when 'seqdata' or 'refseq' contains missing values

but when I set option with.missing=TRUE I'm receiving

 [>] including missing values as an additional state
 [>] 7 sequences with 5 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
Error:  [!] size of substitution cost matrix must be 5x5

So, how can we compute the dissimilarities between sequences using seqdist() with the output of seqsubm() the right way when the data contain missings i.e. sequences contain gaps?

Note: I'm not very sure if this makes sense at all. So far I just exclude observations with missings but due to my data I loose lots of observations by that. Therefore it would be worthwhile to know if there's such an option.


回答1:


There are different strategies to computing distances when you have gaps.

1) A first solution is to consider the missing state as an additional state. This is what seqdist does when you set with.missing=TRUE. In that case the sm matrix should contain costs of substituting any state with the missing state. Using seqsubm you just need to specify with.missing=TRUE for that function too. By default, the substitution costs to substitute 'missing' are set as the fixed value miss.cost (2 by default).

sm <- seqsubm(sq, method='TRATE', with.missing=TRUE)
round(sm,digits=3)

#     A->   B-> C->   D-> *->
# A->   0 2.000   2 2.000   2
# B->   2 0.000   2 1.823   2
# C->   2 2.000   0 2.000   2
# D->   2 1.823   2 0.000   2
# *->   2 2.000   2 2.000   0

To get substitution costs for 'missing' based on transition probabilities

sm <- seqsubm(sq, method='TRATE', with.missing=TRUE, miss.cost.fixed=FALSE)
round(sm,digits=3)

#       A->   B->   C->   D->   *->
# A-> 0.000 2.000 2.000 2.000 1.703
# B-> 2.000 0.000 2.000 1.823 1.957
# C-> 2.000 2.000 0.000 2.000 1.957
# D-> 2.000 1.823 2.000 0.000 1.957
# *-> 1.703 1.957 1.957 1.957 0.000

Using the latter sm, we get the distances between sequences

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm, with.missing=TRUE)
round(dist.om, digits=2)

#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
# [1,]  0.00 22.87 21.91 18.41  6.41 17.00 17.03
# [2,] 22.87  0.00 13.76 11.56 19.91 19.87 22.57
# [3,] 21.91 13.76  0.00 14.25 18.96 18.91 21.57
# [4,] 18.41 11.56 14.25  0.00 13.70 15.70 18.14
# [5,]  6.41 19.91 18.96 13.70  0.00 15.70 16.62
# [6,] 17.00 19.87 18.91 15.70 15.70  0.00 16.70
# [7,] 17.03 22.57 21.57 18.14 16.62 16.70  0.00

Of course, sequences would then be close from each other just because they share many missing states (*). Therefore, you may want to retain only sequences with say less than 10 % of their elements missing.

2) A second solution is to delete the gaps, which you do in seqdef. (However, note that this changes alignment.)

## Here, we drop seq 7 that contains only missing values

sq <- seqdef(ex1[-7,1:13], left='DEL', gaps='DEL')
sq

#    Sequence           
# s1 A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B
# s3 D-D-D-D-D-D-D-D-D-D
# s4 A-A-B-B-B-B-D-D    
# s5 A-A-A-A-A-A-A-A    
# s6 C-C-C-C-C-C-C  

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#       A->   B-> C->   D->
# A-> 0.000 1.944   2 2.000
# B-> 1.944 0.000   2 1.823
# C-> 2.000 2.000   0 2.000
# D-> 2.000 1.823   2 0.000

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)
round(dist.om, digits=2)

#       [,1]  [,2]  [,3]  [,4]  [,5] [,6]
# [1,]  0.00 19.61 20.00 13.78  2.00   17
# [2,] 19.61  0.00 12.76  9.59 17.61   17
# [3,] 20.00 12.76  0.00 13.29 18.00   17
# [4,] 13.78  9.59 13.29  0.00 11.78   15
# [5,]  2.00 17.61 18.00 11.78  0.00   15
# [6,] 17.00 17.00 17.00 15.00 15.00    0


来源:https://stackoverflow.com/questions/53701467/how-to-compute-dissimilarities-between-sequences-when-sequences-contain-gaps

易学教程内所有资源均来自网络或用户发布的内容,如有违反法律规定的内容欢迎反馈
该文章没有解决你所遇到的问题?点击提问,说说你的问题,让更多的人一起探讨吧!