How to compute dissimilarities between sequences when sequences contain gaps?

问题

I want to cluster sequences with optimal matching with TraMineR::seqdist() from data that contains missings, i.e. sequences containing gaps.

library(TraMineR)
data(ex1)
sum(is.na(ex1))

# [1] 38

sq <- seqdef(ex1[1:13])
sq

#    Sequence                 
# s1 *-*-*-A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B      
# s3 *-D-D-D-D-D-D-D-D-D-D    
# s4 A-A-*-*-B-B-B-B-D-D      
# s5 A-*-A-A-A-A-*-A-A-A      
# s6 *-*-*-C-C-C-C-C-C-C      
# s7 *-*-*-*-*-*-*-*-*-*-*-*-*

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#      A-> B->   C-> D->
# A->   0 2.000   2 2.000
# B->   2 0.000   2 1.823
# C->   2 2.000   0 2.000
# D->   2 1.823   2 0.000

When I run seqdist()

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)

I'm receiving

Error: 'with.missing' must be TRUE when 'seqdata' or 'refseq' contains missing values

but when I set option with.missing=TRUE I'm receiving

 [>] including missing values as an additional state
 [>] 7 sequences with 5 distinct states
 [>] checking 'sm' (one value for each state, triangle inequality)
Error:  [!] size of substitution cost matrix must be 5x5

So, how can we compute the dissimilarities between sequences using seqdist() with the output of seqsubm() the right way when the data contain missings i.e. sequences contain gaps?

Note: I'm not very sure if this makes sense at all. So far I just exclude observations with missings but due to my data I loose lots of observations by that. Therefore it would be worthwhile to know if there's such an option.

回答1:

There are different strategies to computing distances when you have gaps.

1) A first solution is to consider the missing state as an additional state. This is what seqdist does when you set with.missing=TRUE. In that case the sm matrix should contain costs of substituting any state with the missing state. Using seqsubm you just need to specify with.missing=TRUE for that function too. By default, the substitution costs to substitute 'missing' are set as the fixed value miss.cost (2 by default).

sm <- seqsubm(sq, method='TRATE', with.missing=TRUE)
round(sm,digits=3)

#     A->   B-> C->   D-> *->
# A->   0 2.000   2 2.000   2
# B->   2 0.000   2 1.823   2
# C->   2 2.000   0 2.000   2
# D->   2 1.823   2 0.000   2
# *->   2 2.000   2 2.000   0

To get substitution costs for 'missing' based on transition probabilities

sm <- seqsubm(sq, method='TRATE', with.missing=TRUE, miss.cost.fixed=FALSE)
round(sm,digits=3)

#       A->   B->   C->   D->   *->
# A-> 0.000 2.000 2.000 2.000 1.703
# B-> 2.000 0.000 2.000 1.823 1.957
# C-> 2.000 2.000 0.000 2.000 1.957
# D-> 2.000 1.823 2.000 0.000 1.957
# *-> 1.703 1.957 1.957 1.957 0.000

Using the latter sm, we get the distances between sequences

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm, with.missing=TRUE)
round(dist.om, digits=2)

#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]
# [1,]  0.00 22.87 21.91 18.41  6.41 17.00 17.03
# [2,] 22.87  0.00 13.76 11.56 19.91 19.87 22.57
# [3,] 21.91 13.76  0.00 14.25 18.96 18.91 21.57
# [4,] 18.41 11.56 14.25  0.00 13.70 15.70 18.14
# [5,]  6.41 19.91 18.96 13.70  0.00 15.70 16.62
# [6,] 17.00 19.87 18.91 15.70 15.70  0.00 16.70
# [7,] 17.03 22.57 21.57 18.14 16.62 16.70  0.00

Of course, sequences would then be close from each other just because they share many missing states (*). Therefore, you may want to retain only sequences with say less than 10 % of their elements missing.

2) A second solution is to delete the gaps, which you do in seqdef. (However, note that this changes alignment.)

## Here, we drop seq 7 that contains only missing values

sq <- seqdef(ex1[-7,1:13], left='DEL', gaps='DEL')
sq

#    Sequence           
# s1 A-A-A-A-A-A-A-A-A-A
# s2 D-D-D-B-B-B-B-B-B-B
# s3 D-D-D-D-D-D-D-D-D-D
# s4 A-A-B-B-B-B-D-D    
# s5 A-A-A-A-A-A-A-A    
# s6 C-C-C-C-C-C-C  

sm <- seqsubm(sq, method='TRATE')
round(sm,digits=3)

#       A->   B-> C->   D->
# A-> 0.000 1.944   2 2.000
# B-> 1.944 0.000   2 1.823
# C-> 2.000 2.000   0 2.000
# D-> 2.000 1.823   2 0.000

dist.om <- seqdist(sq, method="OM", indel=1, sm=sm)
round(dist.om, digits=2)

#       [,1]  [,2]  [,3]  [,4]  [,5] [,6]
# [1,]  0.00 19.61 20.00 13.78  2.00   17
# [2,] 19.61  0.00 12.76  9.59 17.61   17
# [3,] 20.00 12.76  0.00 13.29 18.00   17
# [4,] 13.78  9.59 13.29  0.00 11.78   15
# [5,]  2.00 17.61 18.00 11.78  0.00   15
# [6,] 17.00 17.00 17.00 15.00 15.00    0

来源：https://stackoverflow.com/questions/53701467/how-to-compute-dissimilarities-between-sequences-when-sequences-contain-gaps

标签

traminer

sequence-analysis